Restricting the character set that Tesseract matches

Source: Internet. Editor: 程序博客网. Date: 2024/05/17 06:31

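While working on OCR, I realized that restricting the set of characters Tesseract is allowed to match should improve recognition considerably. But is that supported? I searched through a lot of material without finding a clear answer, and finally found one on Stack Overflow.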

The question and answer are reproduced below.

Q:

Is it possible to limit the set of characters that tesseract is looking for (e.g. search only for letters a-z)? That would improve my results greatly.

A:

Create a config file (e.g. "letters") in the tessdata/configs directory, which is usually one of:

/usr/share/tesseract/tessdata/configs

/usr/share/tesseract-ocr/tessdata/configs

And add this line to the config file, listing the characters you want to allow:

tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz

...or maybe [a-z] works.. dunno :-)
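The config file itself can also be generated from a script. Below is a minimal Python sketch of that step. The helper name write_whitelist_config is hypothetical (it is not part of Tesseract), and the demo writes to a temporary directory rather than a real tessdata/configs directory. It spells out every character explicitly because, as far as I know, the value is read as a literal list of characters rather than a pattern, so relying on [a-z] seems risky.

```python
import string
import tempfile
from pathlib import Path


def write_whitelist_config(configs_dir, name, allowed_chars):
    """Write a one-line Tesseract config file that whitelists allowed_chars.

    Hypothetical helper for illustration; not part of Tesseract itself.
    """
    config_path = Path(configs_dir) / name
    # Same "key value" line format as the config line shown in the answer.
    config_path.write_text(
        f"tessedit_char_whitelist {allowed_chars}\n", encoding="utf-8"
    )
    return config_path


# Demo against a throwaway directory. For real use, point configs_dir at
# your actual tessdata/configs directory (writing there may need root).
with tempfile.TemporaryDirectory() as tmp:
    path = write_whitelist_config(tmp, "letters", string.ascii_lowercase)
    print(path.read_text(encoding="utf-8"), end="")
```

Building the character list from string.ascii_lowercase avoids typing the alphabet by hand; string.digits could be used the same way for a digits-only whitelist.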

Then call tesseract similar to this:

tesseract input.tif output nobatch letters

That will limit tesseract to recognizing only the wanted characters.
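If the OCR step is driven from a script rather than typed by hand, the same invocation can be wrapped as follows. This is only a sketch: build_tesseract_command and run_tesseract are hypothetical names, the file names are the ones from the answer, and it assumes a config file called "letters" already exists in the tessdata/configs directory.

```python
import shlex
import shutil
import subprocess


def build_tesseract_command(image, output_base, config_name):
    """Return the argument list for the invocation shown in the answer."""
    return ["tesseract", image, output_base, "nobatch", config_name]


def run_tesseract(image, output_base, config_name):
    """Run tesseract if it is installed; raises if the call fails."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    command = build_tesseract_command(image, output_base, config_name)
    subprocess.run(command, check=True)


# Only build and print the command here; nothing is executed.
command = build_tesseract_command("input.tif", "output", "letters")
print(shlex.join(command))  # prints: tesseract input.tif output nobatch letters
```

Passing the arguments as a list instead of one shell string means file names containing spaces need no extra quoting.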

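This was my first attempt at translating something, and only after finishing did I realize how hard it is, and that it was entirely unnecessary here, haha. Still, as my first translation, it means something to me.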