Tesseract Source Code Analysis
Source: Internet. Editor: 程序博客网. Time: 2024/05/03 13:46
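The previous phase of OCRus development has wrapped up. Its back end uses the open-source OCR engine Tesseract for recognition. For paper-style documents, where fonts are standard and sizes are uniform, the recognition rate is high: according to the UNLV test results, Tesseract's accuracy is above 90%. For the phone photos that OCRus targets, however, accuracy is low, and for some images the output is essentially unusable. OCRus does some image preprocessing in the hope of making images clearer and easier to recognize before they are passed to Tesseract, but for images that are recognized poorly, the preprocessing step does little to improve accuracy. I have recently been reading the Tesseract source code, hoping to start from the source, use ceiling analysis to identify the main bottlenecks in Tesseract's recognition, and later improve Tesseract in a targeted way.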
Contents
- Setting up the source analysis environment
- Data structures
- Source code analysis
- Page layout analysis steps
- Binarization
- Preprocessing
- Remove vertical lines
- Remove images
- Filter connected component
- Finding candidate tab-stop components
- Finding the column layout
- Finding the regions
- Next steps
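Setting up the source analysis environment
Tesseract 3.02 ships with Visual Studio 2008 project files. Setup follows these guides:
- Setting up Tesseract-OCR
- Building Tesseract-OCR
Tesseract also provides a Java UI for displaying intermediate results. Setup follows this guide:
- ViewerDebugging
Note: the piccolo version given in the Introduction is wrong. The ScrollView source uses 1.2 and does not run with newer versions of piccolo. The Tesseract project on GitHub has already fixed this bug.
The program entry point is in api/tesseractmain.cpp: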
int main(int argc, char **argv) {...}
Tesseract defines many variables that control whether intermediate results are output. Before the following code:
if (!api.ProcessPages(image, NULL, 0, &text_out)) {
  fprintf(stderr, _("Error during processing.\n"));
}
add these calls:
api.SetVariable("tessedit_dump_pageseg_images", "true");  // show the no-lines and no-images pictures
api.SetVariable("textord_show_blobs", "true");  // show blobs result
api.SetVariable("textord_show_boxes", "true");  // show blobs' bounding boxes
api.SetVariable("textord_tabfind_show_blocks", "true");  // show candidate tab-stops and tab vectors
api.SetVariable("textord_tabfind_show_reject_blobs", "true");  // show rejected blobs
api.SetVariable("textord_tabfind_show_initial_partitions", "true");  // show initial partitions
api.SetVariable("textord_tabfind_show_partitions", "1");  // show final partitions
api.SetVariable("textord_tabfind_show_initialtabs", "true");  // show initial tab-stops
api.SetVariable("textord_tabfind_show_finaltabs", "true");  // show final tab vectors
api.SetVariable("textord_tabfind_show_images", "true");  // show image blobs
so that Tesseract outputs all intermediate results.
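TessBaseAPI::SetVariable returns false when it does not know the variable name, so a typo in one of these names is otherwise silent. The following is a minimal sketch, not code from Tesseract, of a table-driven way to apply the same settings and report the names that were rejected. The names `DebugVar`, `kLayoutDebugVars`, `EnableLayoutDebug` and `FakeApi` are made up for this illustration; `FakeApi` only stands in for the real API object so the sketch compiles without the Tesseract headers.

```cpp
#include <string>
#include <vector>

// One debug variable and the value to set it to (names copied from the list above).
struct DebugVar {
  const char* name;
  const char* value;
};

static const DebugVar kLayoutDebugVars[] = {
  {"tessedit_dump_pageseg_images", "true"},
  {"textord_show_blobs", "true"},
  {"textord_show_boxes", "true"},
  {"textord_tabfind_show_blocks", "true"},
  {"textord_tabfind_show_reject_blobs", "true"},
  {"textord_tabfind_show_initial_partitions", "true"},
  {"textord_tabfind_show_partitions", "1"},
  {"textord_tabfind_show_initialtabs", "true"},
  {"textord_tabfind_show_finaltabs", "true"},
  {"textord_tabfind_show_images", "true"},
};

// Calls api.SetVariable(name, value) for every table entry and returns the
// names for which the call returned false.
template <typename Api>
std::vector<std::string> EnableLayoutDebug(Api& api) {
  std::vector<std::string> rejected;
  for (const DebugVar& v : kLayoutDebugVars) {
    if (!api.SetVariable(v.name, v.value)) rejected.push_back(v.name);
  }
  return rejected;
}

// Hypothetical stand-in for the real API object. It pretends that one name is
// unknown so the failure path can be exercised.
struct FakeApi {
  int accepted = 0;
  bool SetVariable(const char* name, const char* value) {
    (void)value;
    if (std::string(name) == "textord_show_boxes") return false;
    ++accepted;
    return true;
  }
};
```

In tesseractmain.cpp the call would be `EnableLayoutDebug(api)` placed before `api.ProcessPages(...)`, with any returned names printed to stderr.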
After the solution is imported into VS2008 there are 11 projects. Set tesseract as the startup project and set the command-line arguments to {image_path} {text_base} segdemo inter
Data structures
- Page analysis result: `PAGE_RES` (ccstruct/pageres.h). It contains a list of block analysis results, `BLOCK_RES_LIST`.
- Block analysis result: `BLOCK_RES` (ccstruct/pageres.h). It contains a list of row analysis results, `ROW_RES_LIST`.
- Row analysis result: `ROW_RES` (ccstruct/pageres.h). It contains a list of word analysis results, `WERD_RES_LIST`.
- Word analysis result: `WERD_RES` (ccstruct/pageres.h). It is a collection of publicly accessible members that gathers information about a word result.
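The nesting is easier to see in code. The sketch below is a simplified model of the page, block, row, word hierarchy, not the real declarations: the struct names and the `text` field are invented for illustration, and `std::vector` stands in for Tesseract's own list classes. The real classes in ccstruct/pageres.h carry many more members.

```cpp
#include <string>
#include <vector>

// Simplified stand-ins for WERD_RES, ROW_RES, BLOCK_RES and PAGE_RES.
struct WordResult {
  std::string text;  // hypothetical field: the recognized word
};
struct RowResult {
  std::vector<WordResult> words;  // plays the role of WERD_RES_LIST
};
struct BlockResult {
  std::vector<RowResult> rows;  // plays the role of ROW_RES_LIST
};
struct PageResult {
  std::vector<BlockResult> blocks;  // plays the role of BLOCK_RES_LIST
};

// Walks the hierarchy top-down and returns the words in reading order.
std::vector<std::string> CollectWords(const PageResult& page) {
  std::vector<std::string> out;
  for (const BlockResult& block : page.blocks) {
    for (const RowResult& row : block.rows) {
      for (const WordResult& word : row.words) {
        out.push_back(word.text);
      }
    }
  }
  return out;
}
```

The same three-level walk is what any code that consumes a page result has to do, whichever container type holds the lists.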
Source code analysis
Page layout analysis steps
Binarization
- Algorithm: OTSU
- Call stack:
main[api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages[api/baseapi.cpp] -> TessBaseAPI::ProcessPage[api/baseapi.cpp] -> TessBaseAPI::Recognize[api/baseapi.cpp] -> TessBaseAPI::FindLines[api/baseapi.cpp] -> TessBaseAPI::Threshold[api/baseapi.cpp] -> ImageThresholder::ThresholdToPix[ccmain/thresholder.cpp] -> ImageThresholder::OtsuThresholdRectToPix [ccmain/thresholder.cpp]
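OTSU is a global binarization algorithm: it picks one threshold for the whole image. If the image contains shadows and the shadows are uneven, the result is poor. OCRus uses a local binarization algorithm, Wolf-Jolion, which gives good results even on images with shadows. Some comparison images follow (left: original image; middle: OTSU result; right: Wolf-Jolion result):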
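To make the "global" part concrete, here is a minimal sketch of the textbook Otsu method on an 8-bit grayscale histogram. It is not the code in ccmain/thresholder.cpp; the function names are invented for this example, and the choice of marking dark pixels with 1 is an assumption made here. The method tries every gray level as a split point and keeps the one that maximizes the between-class variance, so a single value is applied to every pixel regardless of local lighting.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Builds a 256-bin histogram of an 8-bit grayscale image.
std::vector<long long> Histogram(const std::vector<unsigned char>& gray) {
  std::vector<long long> hist(256, 0);
  for (std::size_t i = 0; i < gray.size(); ++i) ++hist[gray[i]];
  return hist;
}

// Returns the gray level t that maximizes the between-class variance when
// pixels <= t form one class and pixels > t form the other.
int OtsuThreshold(const std::vector<long long>& hist) {
  assert(hist.size() == 256);
  long long total = 0;
  double sum_all = 0.0;
  for (int i = 0; i < 256; ++i) {
    total += hist[i];
    sum_all += static_cast<double>(i) * hist[i];
  }
  long long w0 = 0;       // number of pixels in the dark class
  double sum0 = 0.0;      // sum of gray levels in the dark class
  double best_var = -1.0;
  int best_t = 0;
  for (int t = 0; t < 256; ++t) {
    w0 += hist[t];
    sum0 += static_cast<double>(t) * hist[t];
    const long long w1 = total - w0;
    if (w0 == 0) continue;  // dark class still empty
    if (w1 == 0) break;     // light class empty from here on
    const double m0 = sum0 / w0;
    const double m1 = (sum_all - sum0) / w1;
    const double var = static_cast<double>(w0) * static_cast<double>(w1) *
                       (m0 - m1) * (m0 - m1);
    if (var > best_var) {
      best_var = var;
      best_t = t;
    }
  }
  return best_t;
}

// Applies one threshold to the whole image: 1 for dark pixels (<= t), else 0.
std::vector<unsigned char> Binarize(const std::vector<unsigned char>& gray, int t) {
  std::vector<unsigned char> out(gray.size());
  for (std::size_t i = 0; i < gray.size(); ++i) out[i] = gray[i] <= t ? 1 : 0;
  return out;
}
```

A local method such as Wolf-Jolion instead computes a separate threshold per pixel from statistics of a window around it, which is why it copes better with a shadow that covers only part of the page.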
Preprocessing
Remove vertical lines
This step removes vertical and horizontal lines in the image.
- Call stack
main [api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages [api/baseapi.cpp] -> TessBaseAPI::ProcessPage [api/baseapi.cpp] -> TessBaseAPI::Recognize [api/baseapi.cpp] -> TessBaseAPI::FindLines [api/baseapi.cpp] -> Tesseract::SegmentPage [ccmain/pagesegmain.cpp] -> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] -> Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] -> LineFinder::FindAndRemoveLines [textord/linefind.cpp]
Remove images
This step removes images from the picture.
- Call stack
main [api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages [api/baseapi.cpp] -> TessBaseAPI::ProcessPage [api/baseapi.cpp] -> TessBaseAPI::Recognize [api/baseapi.cpp] -> TessBaseAPI::FindLines [api/baseapi.cpp] -> Tesseract::SegmentPage [ccmain/pagesegmain.cpp] -> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] -> Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] -> ImageFind::FindImages [textord/imagefind.cpp]
I have never gotten this function to work. Perhaps the image needs to satisfy some conditions.
Filter connected component
This step generates all the connected components and filters out the noise blobs.
- Call stack
main [api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages [api/baseapi.cpp] -> TessBaseAPI::ProcessPage [api/baseapi.cpp] -> TessBaseAPI::Recognize [api/baseapi.cpp] -> TessBaseAPI::FindLines [api/baseapi.cpp] -> Tesseract::SegmentPage [ccmain/pagesegmain.cpp] -> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] -> Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] ->
(i) Textord::find_components [textord/tordmain.cpp] -> {
  extract_edges [textord/edgblob.cpp] // extract outlines and assign outlines to blobs
  assign_blobs_to_blocks2 [textord/edgblob.cpp] // assign normal, noise, and rejected blobs to TO_BLOCK_LIST for further blob filtering
  Textord::filter_blobs [textord/tordmain.cpp] -> Textord::filter_noise_blobs [textord/tordmain.cpp] // move small blobs to a separate list
}
(ii) ColumnFinder::SetupAndFilterNoise [textord/colfind.cpp]
This step generates an intermediate result like this:
The inner and outer outlines of each connected component are detected, and a bounding box is drawn over each connected component. Potential small noise blobs, such as punctuation and the dot of the character "i", are marked with pink outlines.
Large blobs are marked in dark green:
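The idea of this step can be sketched as follows. This is an illustration only and not Tesseract's implementation: Tesseract extracts outlines with extract_edges and applies its own criteria in filter_noise_blobs, whereas the sketch uses a plain breadth-first flood fill and a single hypothetical size rule (`min_size`). The names `Blob`, `FindBlobs` and `SplitNoise` are made up for the example. The image is given as rows of '#' (foreground) and '.' (background).

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Bounding box (inclusive) and pixel count of one connected component.
struct Blob {
  int left, top, right, bottom;
  int pixels;
  int width() const { return right - left + 1; }
  int height() const { return bottom - top + 1; }
};

// Finds 8-connected components of '#' pixels, in row-major discovery order.
std::vector<Blob> FindBlobs(const std::vector<std::string>& img) {
  const int h = static_cast<int>(img.size());
  const int w = h > 0 ? static_cast<int>(img[0].size()) : 0;
  std::vector<std::vector<bool> > seen(h, std::vector<bool>(w, false));
  std::vector<Blob> blobs;
  for (int y = 0; y < h; ++y) {
    assert(static_cast<int>(img[y].size()) == w);  // rows must have equal length
    for (int x = 0; x < w; ++x) {
      if (img[y][x] != '#' || seen[y][x]) continue;
      Blob b = {x, y, x, y, 0};
      std::queue<std::pair<int, int> > q;
      q.push(std::make_pair(x, y));
      seen[y][x] = true;
      while (!q.empty()) {
        const int cx = q.front().first;
        const int cy = q.front().second;
        q.pop();
        ++b.pixels;
        b.left = std::min(b.left, cx);
        b.right = std::max(b.right, cx);
        b.top = std::min(b.top, cy);
        b.bottom = std::max(b.bottom, cy);
        for (int dy = -1; dy <= 1; ++dy) {
          for (int dx = -1; dx <= 1; ++dx) {
            const int nx = cx + dx;
            const int ny = cy + dy;
            if (nx < 0 || ny < 0 || nx >= w || ny >= h) continue;
            if (img[ny][nx] != '#' || seen[ny][nx]) continue;
            seen[ny][nx] = true;
            q.push(std::make_pair(nx, ny));
          }
        }
      }
      blobs.push_back(b);
    }
  }
  return blobs;
}

// Moves blobs whose larger bounding-box side is below min_size to the noise
// list and keeps the rest as normal blobs. min_size is a hypothetical rule.
void SplitNoise(const std::vector<Blob>& blobs, int min_size,
                std::vector<Blob>* normal, std::vector<Blob>* noise) {
  for (const Blob& b : blobs) {
    if (std::max(b.width(), b.height()) < min_size) {
      noise->push_back(b);
    } else {
      normal->push_back(b);
    }
  }
}
```

Keeping the small blobs in a separate list rather than discarding them matters for OCR, because punctuation and the dot of an "i" are small but still part of the text.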
Finding candidate tab-stop components
- Call stack
main [api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages [api/baseapi.cpp] -> TessBaseAPI::ProcessPage [api/baseapi.cpp] -> TessBaseAPI::Recognize [api/baseapi.cpp] -> TessBaseAPI::FindLines [api/baseapi.cpp] -> Tesseract::SegmentPage [ccmain/pagesegmain.cpp] -> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] -> ColumnFinder::FindBlocks [textord/colfind.cpp] -> TabFind::FindInitialTabVectors [textord/tabfind.cpp] -> TabFind::FindTabBoxes [textord/tabfind.cpp]
This step finds the initial candidate tab-stop CCs by a radial search starting at every filtered CC from preprocessing. The result will be like this:
Finding the column layout
- Call stack
main [api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages [api/baseapi.cpp] -> TessBaseAPI::ProcessPage [api/baseapi.cpp] -> TessBaseAPI::Recognize [api/baseapi.cpp] -> TessBaseAPI::FindLines [api/baseapi.cpp] -> Tesseract::SegmentPage [ccmain/pagesegmain.cpp] -> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] -> ColumnFinder::FindBlocks [textord/colfind.cpp] -> ColumnFinder::FindBlocks (begin at line 369) [textord/colfind.cpp]
This step finds the column layout of the page:
Finding the regions
- Call stack
main [api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages [api/baseapi.cpp] -> TessBaseAPI::ProcessPage [api/baseapi.cpp] -> TessBaseAPI::Recognize [api/baseapi.cpp] -> TessBaseAPI::FindLines [api/baseapi.cpp] -> Tesseract::SegmentPage [ccmain/pagesegmain.cpp] -> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] -> ColumnFinder::FindBlocks [textord/colfind.cpp]
This step recognizes the different types of blocks:
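Next steps
- The algorithms for finding tab-stops and for the steps after that are still unclear to me and need further study.
- I have not started reading the character recognition part yet. It should involve several machine learning algorithms, and I need to keep studying it when time allows.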