Tesseract Source Code Analysis


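The previous round of OCRus development has wrapped up. Its back-end recognition relies on the open-source OCR engine Tesseract. For paper-style documents, with standard fonts and uniform sizes, recognition is very accurate: according to the UNLV test results, Tesseract's accuracy is above 90%. For the phone photos that OCRus targets, however, accuracy is not high, and for some images the output is basically unusable. OCRus does some image preprocessing in the hope of making the picture cleaner and easier to recognize before it is passed to Tesseract, but for images with poor recognition results the preprocessing step does not improve accuracy much. Recently I have been reading the Tesseract source code, hoping to start from the source and use Ceiling Analysis to identify the main bottlenecks in Tesseract's recognition, so that Tesseract can be improved in a targeted way later.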

Contents

  • Contents
  • Setting Up the Source Analysis Environment
  • Data Structures
  • Source Code Analysis
    • Page Layout Analysis Steps
      • Binarization
      • Preprocessing
        • Remove vertical lines
        • Remove images
        • Filter connected component
        • Finding candidate tab-stop components
        • Finding the column layout
        • Finding the regions
  • Next Steps

Setting Up the Source Analysis Environment

Tesseract 3.02 ships with Visual Studio 2008 project files. The setup steps are:

  • Setting up Tesseract-OCR
  • Building Tesseract-OCR

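Tesseract also provides a Java UI for displaying intermediate results. The setup steps are:

  • ViewerDebugging
    Note: the piccolo version given in the Introduction is wrong. The ScrollView source uses 1.2 and does not run with newer piccolo versions. The Tesseract project on GitHub has already fixed this bug.

The program entry point is in api/tesseractmain.cpp: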

int main(int argc, char **argv) {...}

Tesseract defines many variables that control whether intermediate results are output. Locate the following code:

if (!api.ProcessPages(image, NULL, 0, &text_out)) {
  fprintf(stderr, _("Error during processing.\n"));
}

and insert the following code before it:

api.SetVariable("tessedit_dump_pageseg_images", "true");  // show the picture with lines and images removed
api.SetVariable("textord_show_blobs", "true");  // show blobs result
api.SetVariable("textord_show_boxes", "true");  // show blobs' bounding boxes
api.SetVariable("textord_tabfind_show_blocks", "true");  // show candidate tab-stops and tab vectors
api.SetVariable("textord_tabfind_show_reject_blobs", "true");  // show rejected blobs
api.SetVariable("textord_tabfind_show_initial_partitions", "true");  // show initial partitions
api.SetVariable("textord_tabfind_show_partitions", "1");  // show final partitions
api.SetVariable("textord_tabfind_show_initialtabs", "true");  // show initial tab-stops
api.SetVariable("textord_tabfind_show_finaltabs", "true");  // show final tab vectors
api.SetVariable("textord_tabfind_show_images", "true");  // show image blobs

so that Tesseract outputs all of the intermediate results.

After the solution is imported into VS2008 there are 11 projects. Set tesseract as the startup project and set the command-line arguments to
{image_path} {text_base} segdemo inter

Data Structures

  • Page analysis result: `PAGE_RES` (ccstruct/pageres.h). It contains a list of block analysis results, `BLOCK_RES_LIST`.
  • Block analysis result: `BLOCK_RES` (ccstruct/pageres.h). It contains a list of row analysis results, `ROW_RES_LIST`.
  • Row analysis result: `ROW_RES` (ccstruct/pageres.h). It contains a list of word analysis results, `WERD_RES_LIST`.
  • Word analysis result: `WERD_RES` (ccstruct/pageres.h). It is a collection of publicly accessible members that gathers information about a word result.
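The page, block, row, word nesting can be pictured with a small stand-alone sketch. The structs below (`PageRes`, `BlockRes`, `RowRes`, `WordRes`) and the `PageText` helper are hypothetical simplifications that use `std::vector`. They are not the real Tesseract classes, which use Tesseract's own list types and are normally walked with the `PAGE_RES_IT` iterator declared in the same header.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for WERD_RES, ROW_RES, BLOCK_RES and
// PAGE_RES. They only mirror the nesting, not the real members.
struct WordRes { std::string text; };
struct RowRes { std::vector<WordRes> words; };
struct BlockRes { std::vector<RowRes> rows; };
struct PageRes { std::vector<BlockRes> blocks; };

// Walks page -> block -> row -> word. Words of one row are joined with a
// single space and every row ends with a newline.
std::string PageText(const PageRes& page) {
  std::string out;
  for (const BlockRes& block : page.blocks) {
    for (const RowRes& row : block.rows) {
      for (std::size_t i = 0; i < row.words.size(); ++i) {
        if (i > 0) out += ' ';
        out += row.words[i].text;
      }
      out += '\n';
    }
  }
  return out;
}
```

The point of the sketch is only the traversal order: each level owns a list of the level below it, so collecting text means three nested loops.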

Source Code Analysis

Page Layout Analysis Steps

Binarization

  • Algorithm: OTSU
  • Call stack:
main[api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages[api/baseapi.cpp] -> TessBaseAPI::ProcessPage[api/baseapi.cpp] -> TessBaseAPI::Recognize[api/baseapi.cpp] -> TessBaseAPI::FindLines[api/baseapi.cpp] -> TessBaseAPI::Threshold[api/baseapi.cpp] -> ImageThresholder::ThresholdToPix[ccmain/thresholder.cpp] -> ImageThresholder::OtsuThresholdRectToPix [ccmain/thresholder.cpp]

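OTSU is a global binarization algorithm. If the image contains shadows and the shadows are uneven, the result is poor. OCRus uses a local binarization algorithm, Wolf-Jolion, which gives fairly good results even on images with shadows. Some comparison images follow (left: original; middle: OTSU result; right: Wolf-Jolion result):
binarization algorithm comparison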
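To make "global" concrete, here is a minimal sketch of Otsu's method: build a histogram of the whole image and pick the single threshold that maximizes the between-class variance. The function names `OtsuThreshold` and `Binarize` are illustrative assumptions. This is not the code of `ImageThresholder::OtsuThresholdRectToPix`, which works on Leptonica `Pix` data.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the Otsu threshold t for 8-bit gray values: pixels <= t form one
// class, pixels > t the other. Illustrative only, not Tesseract's code.
int OtsuThreshold(const std::vector<std::uint8_t>& gray) {
  assert(!gray.empty());
  std::array<long long, 256> hist{};
  for (std::uint8_t v : gray) ++hist[v];

  const double total = static_cast<double>(gray.size());
  double sum_all = 0.0;
  for (int i = 0; i < 256; ++i) sum_all += i * static_cast<double>(hist[i]);

  double w0 = 0.0;        // number of pixels with value <= t
  double sum0 = 0.0;      // sum of gray values with value <= t
  double best_var = -1.0;
  int best_t = 0;
  for (int t = 0; t < 256; ++t) {
    w0 += static_cast<double>(hist[t]);
    sum0 += t * static_cast<double>(hist[t]);
    if (w0 == 0.0) continue;   // lower class still empty
    const double w1 = total - w0;
    if (w1 == 0.0) break;      // upper class empty from here on
    const double mu0 = sum0 / w0;
    const double mu1 = (sum_all - sum0) / w1;
    // Between-class variance, up to a constant factor.
    const double between = w0 * w1 * (mu0 - mu1) * (mu0 - mu1);
    if (between > best_var) {
      best_var = between;
      best_t = t;
    }
  }
  return best_t;
}

// Applies one threshold to every pixel: 0 for dark, 255 for light.
std::vector<std::uint8_t> Binarize(const std::vector<std::uint8_t>& gray,
                                   int t) {
  std::vector<std::uint8_t> out;
  out.reserve(gray.size());
  for (std::uint8_t v : gray) out.push_back(v > t ? 255 : 0);
  return out;
}
```

Because the same `t` is applied to every pixel, a page whose shadowed background is darker than the ink in its brightly lit part cannot be separated correctly by any single threshold. A local method computes a threshold per neighborhood instead.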

Preprocessing

Remove vertical lines

This step removes vertical and horizontal lines in the image.

  • Call stack
main [api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages [api/baseapi.cpp] -> TessBaseAPI::ProcessPage [api/baseapi.cpp] -> TessBaseAPI::Recognize [api/baseapi.cpp] -> TessBaseAPI::FindLines [api/baseapi.cpp] -> Tesseract::SegmentPage [ccmain/pagesegmain.cpp] -> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] -> Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] -> LineFinder::FindAndRemoveLines [textord/linefind.cpp]

Remove images

This step removes images from the picture.

  • Call stack
main [api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages [api/baseapi.cpp] -> TessBaseAPI::ProcessPage [api/baseapi.cpp] -> TessBaseAPI::Recognize [api/baseapi.cpp] -> TessBaseAPI::FindLines [api/baseapi.cpp] -> Tesseract::SegmentPage [ccmain/pagesegmain.cpp] -> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] -> Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] -> ImageFind::FindImages [textord/imagefind.cpp]

I have never gotten this function to work. Perhaps the image needs to satisfy some conditions.

Filter connected component

This step generates all the connected components and filters out the noise blobs.

  • Call stack
main [api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages [api/baseapi.cpp] -> TessBaseAPI::ProcessPage [api/baseapi.cpp] -> TessBaseAPI::Recognize [api/baseapi.cpp] -> TessBaseAPI::FindLines [api/baseapi.cpp] -> Tesseract::SegmentPage [ccmain/pagesegmain.cpp] -> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] -> Tesseract::SetupPageSegAndDetectOrientation [ccmain/pagesegmain.cpp] ->
(i) Textord::find_components [textord/tordmain.cpp] -> {
    extract_edges [textord/edgblob.cpp] // extract outlines and assign outlines to blobs
    assign_blobs_to_blocks2 [textord/edgblob.cpp] // assign normal, noise, and rejected blobs to TO_BLOCK_LIST for further blob filtering
    Textord::filter_blobs [textord/tordmain.cpp] ->
    Textord::filter_noise_blobs [textord/tordmain.cpp] // move small blobs to a separate list
}
(ii) ColumnFinder::SetupAndFilterNoise [textord/colfind.cpp]

This step produces an intermediate result like this:
connected component analysis

The inner and outer outlines of each connected component are detected, and a bounding box is drawn over each component. Potential small noise blobs, such as punctuation and the dot of the character “i”, are drawn with pink outlines.
Large blobs are drawn in dark green:
connected component analysis
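The idea of "find every connected component, then set the small ones aside" can be sketched without any Tesseract code. The example below labels 4-connected foreground regions of a tiny binary grid with a breadth-first search and flags a blob as noise when its bounding box is smaller than a chosen size. `Blob`, `FindBlobs`, `IsNoise`, and the `min_size` rule are all assumptions made for illustration. Tesseract itself works on outlines (`extract_edges`) and applies its own criteria in `Textord::filter_noise_blobs`, which are not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <queue>
#include <utility>
#include <vector>

// Bounding box (inclusive pixel coordinates) and pixel count of one component.
struct Blob { int left, top, right, bottom, area; };

// Finds 4-connected components of nonzero pixels in row-major scan order.
std::vector<Blob> FindBlobs(const std::vector<std::vector<int> >& img) {
  assert(!img.empty() && !img[0].empty());
  const int h = static_cast<int>(img.size());
  const int w = static_cast<int>(img[0].size());
  std::vector<std::vector<bool> > seen(h, std::vector<bool>(w, false));
  const int dy[4] = {-1, 1, 0, 0};
  const int dx[4] = {0, 0, -1, 1};
  std::vector<Blob> blobs;
  for (int y = 0; y < h; ++y) {
    for (int x = 0; x < w; ++x) {
      if (img[y][x] == 0 || seen[y][x]) continue;
      Blob b = {x, y, x, y, 0};
      std::queue<std::pair<int, int> > q;
      q.push(std::make_pair(y, x));
      seen[y][x] = true;
      while (!q.empty()) {
        const int cy = q.front().first;
        const int cx = q.front().second;
        q.pop();
        ++b.area;
        b.left = std::min(b.left, cx);
        b.right = std::max(b.right, cx);
        b.top = std::min(b.top, cy);
        b.bottom = std::max(b.bottom, cy);
        for (int k = 0; k < 4; ++k) {
          const int ny = cy + dy[k];
          const int nx = cx + dx[k];
          if (ny < 0 || ny >= h || nx < 0 || nx >= w) continue;
          if (img[ny][nx] == 0 || seen[ny][nx]) continue;
          seen[ny][nx] = true;
          q.push(std::make_pair(ny, nx));
        }
      }
      blobs.push_back(b);
    }
  }
  return blobs;
}

// Hypothetical noise rule: a blob is noise if both sides of its bounding box
// are shorter than min_size. Tesseract's real criteria differ.
bool IsNoise(const Blob& b, int min_size) {
  const int width = b.right - b.left + 1;
  const int height = b.bottom - b.top + 1;
  return std::max(width, height) < min_size;
}
```

With a rule like this, an isolated single pixel is separated from a character-sized blob, which is roughly the distinction the pink outlines in the debug view convey.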

Finding candidate tab-stop components

  • Call stack
main [api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages [api/baseapi.cpp] -> TessBaseAPI::ProcessPage [api/baseapi.cpp] -> TessBaseAPI::Recognize [api/baseapi.cpp] -> TessBaseAPI::FindLines [api/baseapi.cpp] -> Tesseract::SegmentPage [ccmain/pagesegmain.cpp] -> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] -> ColumnFinder::FindBlocks [textord/colfind.cpp] -> TabFind::FindInitialTabVectors [textord/tabfind.cpp] -> TabFind::FindTabBoxes [textord/tabfind.cpp]

This step finds the initial candidate tab-stop CCs by a radial search starting at every filtered CC from preprocessing. The result will be like this:
candidate tab-stops

Finding the column layout

  • Call stack
main [api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages [api/baseapi.cpp] -> TessBaseAPI::ProcessPage [api/baseapi.cpp] -> TessBaseAPI::Recognize [api/baseapi.cpp] -> TessBaseAPI::FindLines [api/baseapi.cpp] -> Tesseract::SegmentPage [ccmain/pagesegmain.cpp] -> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] -> ColumnFinder::FindBlocks [textord/colfind.cpp] -> ColumnFinder::FindBlocks (beginning at line 369) [textord/colfind.cpp]

This step finds the column layout of the page:
column layout analysis

Finding the regions

  • Call stack
main [api/tesseractmain.cpp] -> TessBaseAPI::ProcessPages [api/baseapi.cpp] -> TessBaseAPI::ProcessPage [api/baseapi.cpp] -> TessBaseAPI::Recognize [api/baseapi.cpp] -> TessBaseAPI::FindLines [api/baseapi.cpp] -> Tesseract::SegmentPage [ccmain/pagesegmain.cpp] -> Tesseract::AutoPageSeg [ccmain/pagesegmain.cpp] -> ColumnFinder::FindBlocks [textord/colfind.cpp]

This step recognizes the different types of blocks:
find blocks

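Next Steps

  • The algorithms for finding tab-stops and for the processing steps after that are still not very clear to me and need further study.
  • I have not yet started on the character recognition part, which probably involves several machine learning algorithms. I need to keep studying it when time allows.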