Training a Chinese Language Data File for Tesseract 3.02
Download the chi_sim.traineddata language data file
Download tesseract-ocr-setup-3.02.02.exe
Download jTessBoxEditor, which is used to edit box files
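0. Preparation
For convenience, name the tif file using the format [lang].[fontname].exp[num].tif
lang is the language and fontname is the font name.
For example, to train a custom language named mjorcen with the font name normal,
rename the image file to mjorcen.normal.exp0.jpg (a .jpg image is used in every command below).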
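The naming convention above can be checked mechanically before training starts. The following is a minimal sketch, not part of the original workflow: the helper name parse_training_name and the regular expression are assumptions made for illustration, and the extension is left open because the article itself uses a .jpg file.

```python
import re
from typing import NamedTuple

# Pattern for the [lang].[fontname].exp[num].<ext> convention described above.
# The extension is not restricted to tif because the article uses a .jpg image.
_NAME_RE = re.compile(
    r"^(?P<lang>[^.]+)\.(?P<font>[^.]+)\.exp(?P<num>\d+)\.(?P<ext>[^.]+)$"
)


class TrainingImageName(NamedTuple):
    lang: str
    font: str
    num: int
    ext: str


def parse_training_name(filename: str) -> TrainingImageName:
    """Split a training image name into its parts, or raise ValueError."""
    match = _NAME_RE.match(filename)
    if match is None:
        raise ValueError(
            f"{filename!r} does not follow [lang].[fontname].exp[num].<ext>"
        )
    return TrainingImageName(
        match["lang"], match["font"], int(match["num"]), match["ext"]
    )


print(parse_training_name("mjorcen.normal.exp0.jpg"))
```

The font part recovered here (normal) is the name that later has to appear in the font_properties file.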
Image:
Now start training the language data:
1. Generate the .box file
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
Keep the image file and the box file in the same directory.
2. Open the tif file with jTessBoxEditor.jar, then correct the box file as needed
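For readers who want to inspect or patch a box file without the GUI, here is a small sketch. It assumes the Tesseract 3.x box layout of one character per line in the form "symbol left bottom right top page", with coordinates measured from the bottom-left corner of the image. The helper names and the sample coordinates are made up for illustration and do not come from the original image.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    symbol: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_lines(text: str) -> List[Box]:
    """Parse box-file text into Box records, skipping blank lines."""
    boxes = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split from the right so the five numeric fields are always found.
        symbol, left, bottom, right, top, page = line.rsplit(" ", 5)
        boxes.append(
            Box(symbol, int(left), int(bottom), int(right), int(top), int(page))
        )
    return boxes


def format_box(box: Box) -> str:
    """Render a Box back into the one-line box-file form."""
    return f"{box.symbol} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


# Hypothetical sample: the second character was recognized as "I" instead of "1".
sample = "应 10 20 40 50 0\nI 80 20 90 50 0\n"
boxes = parse_box_lines(sample)
boxes[1] = boxes[1]._replace(symbol="1")  # the kind of fix made in jTessBoxEditor
print("\n".join(format_box(b) for b in boxes))
```

Correcting the symbol while leaving the coordinates alone is the most common edit. Boxes that were split or merged wrongly are easier to repair in jTessBoxEditor itself.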
3. Generate the .tr file
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 nobatch box.train
4. Generate a unicharset file
unicharset_extractor mjorcen.normal.exp0.box
5. Create a new font_properties file
Its content is the single line normal 0 0 0 0 0, which marks normal as a default regular font
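Each font_properties line has the form "fontname italic bold fixed serif fraktur", where every flag is 0 or 1. A minimal sketch for producing such a line follows. The function name font_properties_line is an assumption made for illustration, and writing the result to a file is left to the caller.

```python
def font_properties_line(
    font: str,
    italic: bool = False,
    bold: bool = False,
    fixed: bool = False,
    serif: bool = False,
    fraktur: bool = False,
) -> str:
    """Build one font_properties entry: <font> <italic> <bold> <fixed> <serif> <fraktur>."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([font] + [str(int(flag)) for flag in flags])


print(font_properties_line("normal"))            # normal 0 0 0 0 0
print(font_properties_line("ocra", fixed=True))  # ocra 0 0 1 0 0
```

The font name must match the fontname part of the training file name, otherwise the training tools cannot associate the .tr data with a font entry.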
6. Run the following commands
shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
cntraining mjorcen.normal.exp0.tr
The output is as follows:
E:\data\Users\Administrator\Desktop\ocrBuider3>shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0
E:\data\Users\Administrator\Desktop\ocrBuider3>mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!
E:\data\Users\Administrator\Desktop\ocrBuider3>cntraining mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Clustering ...
Writing normproto ...
7. Add the prefix normal. to the names of these five files in the directory: unicharset, inttemp, pffmtable, shapetable and normproto
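The renaming in step 7 is easy to get wrong by hand, so it can be scripted. The sketch below is one possible approach rather than the author's method. The function name add_lang_prefix is hypothetical, and the demo runs on empty stand-in files in a temporary directory instead of real training output.

```python
import tempfile
from pathlib import Path
from typing import List

# The five files produced by the training tools in the steps above.
TRAINING_OUTPUTS = ("unicharset", "inttemp", "pffmtable", "shapetable", "normproto")


def add_lang_prefix(directory: Path, lang: str) -> List[Path]:
    """Rename each existing training output to <lang>.<name>; return the new paths."""
    renamed = []
    for name in TRAINING_OUTPUTS:
        source = directory / name
        if source.exists():
            target = directory / f"{lang}.{name}"
            source.replace(target)  # overwrites a stale file from an earlier run
            renamed.append(target)
    return renamed


# Demo in a throwaway directory with empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    for name in TRAINING_OUTPUTS:
        (workdir / name).touch()
    for path in add_lang_prefix(workdir, "normal"):
        print(path.name)
```

Using replace instead of rename means a leftover normal.unicharset from a previous run does not cause an error on Windows.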
8. Run combine_tessdata normal. (the trailing dot is part of the argument)
9. Copy normal.traineddata into the tessdata folder under the Tesseract-OCR installation directory
10. Test
tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal
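The command-line steps above can be collected in one place to make their order and arguments easier to see. This is a dry-run sketch only: it builds the command lists without executing anything, and the function name training_commands and its parameters are assumptions made for illustration. The manual steps (editing the box file, creating font_properties, renaming the five output files) still have to happen between the commands.

```python
from pathlib import Path
from typing import List


def training_commands(
    image: str, lang: str, bootstrap_lang: str = "chi_sim"
) -> List[List[str]]:
    """Return the Tesseract 3.02 training commands from the steps above, in order."""
    base = Path(image).stem  # e.g. mjorcen.normal.exp0
    tr_file = f"{base}.tr"
    return [
        # Step 1: generate the .box file with an existing language as a starting point.
        ["tesseract", image, base, "-l", bootstrap_lang, "batch.nochop", "makebox"],
        # Step 3: generate the .tr file (after the box file has been corrected).
        ["tesseract", image, base, "nobatch", "box.train"],
        # Step 4: extract the character set.
        ["unicharset_extractor", f"{base}.box"],
        # Step 6: clustering and training (font_properties must exist by now).
        ["shapeclustering", "-F", "font_properties", "-U", "unicharset", tr_file],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", "unicharset", tr_file],
        ["cntraining", tr_file],
        # Step 8: combine the renamed files; the trailing dot is part of the argument.
        ["combine_tessdata", f"{lang}."],
    ]


for command in training_commands("mjorcen.normal.exp0.jpg", "normal"):
    print(" ".join(command))
```

Each inner list can be handed to subprocess.run(command, check=True) once the tools are on the PATH, which stops the sequence at the first failing step.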
Debug output:
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l chi_sim batch.nochop makebox
Too many unichars in ambiguity on line 22358424
Too many unichars in ambiguity on line 22358424
Too many unichars in ambiguity on line 14941344
Tesseract Open Source OCR Engine v3.02 with Leptonica
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 nobatch box.train
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
   Boxes read from boxfile: 6
   Found 6 good blobs.
TRAINING ... Font name = normal
Generated training data for 2 words
E:\data\Users\Administrator\Desktop\ocrBuider3>unicharset_extractor mjorcen.normal.exp0.box
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.
E:\data\Users\Administrator\Desktop\ocrBuider3>shapeclustering -F font_properties -U unicharset mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0
E:\data\Users\Administrator\Desktop\ocrBuider3>mftraining -F font_properties -U unicharset -O unicharset mjorcen.normal.exp0.tr
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!
E:\data\Users\Administrator\Desktop\ocrBuider3>cntraining mjorcen.normal.exp0.tr
Reading mjorcen.normal.exp0.tr ...
Clustering ...
Writing normproto ...
E:\data\Users\Administrator\Desktop\ocrBuider3>combine_tessdata normal.
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 4 is 123081
Offset for type 5 is 123134
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 123920
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp0 -l normal
Tesseract Open Source OCR Engine v3.02 with Leptonica
E:\data\Users\Administrator\Desktop\ocrBuider3>tesseract mjorcen.normal.exp0.jpg mjorcen.normal.exp1 -l chi_sim
Too many unichars in ambiguity on line 15280712
Too many unichars in ambiguity on line 15280712
Too many unichars in ambiguity on line 4324296
Tesseract Open Source OCR Engine v3.02 with Leptonica
Result with the newly trained normal data:
应收: 119
Result with the stock Chinese data (chi_sim):
应收= II苜
Script (requires a Java environment):
The directory layout is as follows:
The script is as follows:
Windows
@echo off
set "src=%1"
set "font_name=%2"
set "desc=%3"
if not defined src set /p src=" please pass your filename : "
if not defined font_name set /p font_name=" please pass your font_name : "
rem Validate the arguments
if not defined src echo IllegalArgumentException arg1 must not be null & pause>nul & exit
if not defined font_name echo IllegalArgumentException arg2 must not be null & pause>nul & exit
rem Default output name: the file name without its last four characters (the extension)
if not defined desc set "desc=%src:~0,-4%"
echo desc %desc%
rem Create font_properties and write the font entry if the file does not exist yet
if exist font_properties (
    echo font_properties exist
) else (
    ECHO %font_name% 0 0 0 0 0 >"font_properties"
)
rem Delete files left over from a previous run
if exist %font_name%.unicharset ECHO DEL %font_name%.unicharset & DEL /Q %font_name%.unicharset
if exist %font_name%.inttemp ECHO DEL %font_name%.inttemp & DEL /Q %font_name%.inttemp
if exist %font_name%.pffmtable ECHO DEL %font_name%.pffmtable & DEL /Q %font_name%.pffmtable
if exist %font_name%.shapetable ECHO DEL %font_name%.shapetable & DEL /Q %font_name%.shapetable
if exist %font_name%.normproto ECHO DEL %font_name%.normproto & DEL /Q %font_name%.normproto
if exist %font_name%.font_properties ECHO DEL %font_name%.font_properties & DEL /Q %font_name%.font_properties
rem makebox
tesseract %src% %desc% -l chi_sim batch.nochop makebox
java -Xms128m -Xmx512m -jar jTessBoxEditor/jTessBoxEditor.jar
ECHO Please change your results , and press any key to continue
pause>nul
tesseract %src% %desc% nobatch box.train
unicharset_extractor %desc%.box
shapeclustering -F font_properties -U unicharset %desc%.tr
mftraining -F font_properties -U unicharset -O unicharset %desc%.tr
cntraining %desc%.tr
rem Rename the newly generated files
if exist unicharset ECHO rename unicharset %font_name%.unicharset & rename unicharset %font_name%.unicharset
if exist inttemp ECHO rename inttemp %font_name%.inttemp & rename inttemp %font_name%.inttemp
if exist pffmtable ECHO rename pffmtable %font_name%.pffmtable & rename pffmtable %font_name%.pffmtable
if exist shapetable ECHO rename shapetable %font_name%.shapetable & rename shapetable %font_name%.shapetable
if exist normproto ECHO rename normproto %font_name%.normproto & rename normproto %font_name%.normproto
combine_tessdata %font_name%.
if exist font_properties ECHO rename font_properties %font_name%.font_properties & rename font_properties %font_name%.font_properties
ECHO press any key to continue
pause>nul
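Usage:
Note: argument 1 is the full file name, argument 2 is the font name, and argument 3 is the output file name, which defaults to the input file name without its extension when omitted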
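The argument handling described above (two required arguments and an optional third with a derived default) can be sketched as follows. This is an illustrative equivalent, not the script itself: the function name parse_args is an assumption. One difference is deliberate: the batch script derives the default by cutting the last four characters off the file name, which only works for three-letter extensions, while this sketch strips whatever the final extension is.

```python
import argparse
from pathlib import Path
from typing import List


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse <src> <font_name> [desc]; desc defaults to src without its extension."""
    parser = argparse.ArgumentParser(prog="run")
    parser.add_argument("src", help="full file name of the training image")
    parser.add_argument("font_name", help="font name, used as the output prefix")
    parser.add_argument("desc", nargs="?", help="output base name (optional)")
    args = parser.parse_args(argv)
    if args.desc is None:
        # Strip only the final extension, whatever its length.
        args.desc = Path(args.src).stem
    return args


args = parse_args(["mjorcen.normal.exp0.jpg", "normal"])
print(args.src, args.font_name, args.desc)
```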
E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg normal
Example run:
E:\data\Users\Administrator\Desktop\ocrBuider3>run.bat mjorcen.normal.exp0.jpg normal
desc mjorcen.normal.exp0
font_properties exist
Too many unichars in ambiguity on line 2188584
Too many unichars in ambiguity on line 2188584
Too many unichars in ambiguity on line 2686128
Tesseract Open Source OCR Engine v3.02 with Leptonica
Please change your results , and press any key to continue
Tesseract Open Source OCR Engine v3.02 with Leptonica
APPLY_BOXES:
   Boxes read from boxfile: 6
   Found 6 good blobs.
TRAINING ... Font name = normal
Generated training data for 2 words
Extracting unicharset from mjorcen.normal.exp0.box
Wrote unicharset file ./unicharset.
Reading mjorcen.normal.exp0.tr ...
Building master shape table
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances...
Stopped with 0 merged, min dist 999.000000
Computing shape distances... 0 1 2 3 4
Stopped with 0 merged, min dist 0.365385
Master shape_table:Number of shapes = 5 max unichars = 1 number with multiple unichars = 0
Read shape table shapetable of 5 shapes
Reading mjorcen.normal.exp0.tr ...
Done!
Reading mjorcen.normal.exp0.tr ...
Clustering ...
Writing normproto ...
rename unicharset normal.unicharset
rename inttemp normal.inttemp
rename pffmtable normal.pffmtable
rename shapetable normal.shapetable
rename normproto normal.normproto
Combining tessdata files
TessdataManager combined tesseract data files.
Offset for type 0 is -1
Offset for type 1 is 140
Offset for type 2 is -1
Offset for type 3 is 489
Offset for type 4 is 123081
Offset for type 5 is 123134
Offset for type 6 is -1
Offset for type 7 is -1
Offset for type 8 is -1
Offset for type 9 is -1
Offset for type 10 is -1
Offset for type 11 is -1
Offset for type 12 is -1
Offset for type 13 is 123920
Offset for type 14 is -1
Offset for type 15 is -1
Offset for type 16 is -1
rename font_properties normal.font_properties
E:\data\Users\Administrator\Desktop\ocrBuider3>
Linux (from the documentation: http://tesseract-ocr.googlecode.com/svn/trunk/doc/combine_tessdata.1.asc):
#!/bin/bash
tesseract zzz.ocra.exp0.tif zzz.ocra.exp0 nobatch box.train
unicharset_extractor zzz.ocra.exp0.box
echo "ocra 0 0 1 0 0" >font_properties
shapeclustering -F font_properties -U unicharset zzz.ocra.exp0.tr
mftraining -F font_properties -U unicharset -O zzz.unicharset zzz.ocra.exp0.tr
cntraining zzz.ocra.exp0.tr
cp normproto zzz.normproto
cp inttemp zzz.inttemp
cp pffmtable zzz.pffmtable
cp shapetable zzz.shapetable
combine_tessdata zzz.
cp zzz.traineddata /home/youruserid/tessdata/.
sudo cp zzz.traineddata /usr/share/tesseract-ocr/tessdata/.
tesseract zzz.ocra.exp0.tif output -l zzz
Reposted from http://www.cnblogs.com/mjorcen/p/3800739.html?utm_source=tuicool&utm_medium=referral