Using Tesseract from Java for OCR on CentOS
Source: Internet  Editor: 程序博客网  Time: 2024/05/21 10:05
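1. Installation:
Install the OCR engine (the rpm query in section 3 shows the installed package is named tesseract; if yum cannot find tesseract-ocr, try yum install tesseract instead):
yum install tesseract-ocr
Search for the Simplified Chinese language pack:
yum search tesseract-ocr | grep sim
Install the Simplified Chinese language pack:
yum install tesseract-langpack-chi_sim
Check the installed version:
$ tesseract -v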
tesseract 3.04.00
leptonica-1.72
libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0
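Since the Tess4J release has to match the installed Tesseract version, it can help to read the version out of the `tesseract -v` output programmatically. The following is a minimal sketch, not part of the original setup: the class `TesseractVersion` and its methods are hypothetical names, and it only assumes that the first line of the output has the form `tesseract <version>`, as shown above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the version from the text printed by `tesseract -v`. */
final class TesseractVersion {
    // Matches "tesseract 3.04.00" at the very start of the text and captures "3.04.00".
    private static final Pattern VERSION =
            Pattern.compile("^tesseract\\s+(\\d+(?:\\.\\d+)+)");

    /** Returns the version string, or null if the text does not start with "tesseract <version>". */
    static String parse(String versionOutput) {
        Matcher m = VERSION.matcher(versionOutput.trim());
        return m.find() ? m.group(1) : null;
    }

    /** Returns the major version number, or -1 if the version could not be parsed. */
    static int major(String versionOutput) {
        String v = parse(versionOutput);
        return v == null ? -1 : Integer.parseInt(v.substring(0, v.indexOf('.')));
    }

    public static void main(String[] args) {
        String sample = "tesseract 3.04.00\n leptonica-1.72";
        System.out.println(parse(sample) + " (major " + major(sample) + ")");
    }
}
```

In a real check, the sample string would be replaced by the captured output of the `tesseract -v` command.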
2. Java development
Mind the version match: for Tesseract 3.04.00, use tess4j 3.0.0:
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>3.0.0</version>
</dependency>
Simple test code:
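// Imports required by the enclosing class
import java.awt.image.BufferedImage;
import java.net.URL;
import java.nio.ByteBuffer;
import javax.imageio.ImageIO;
import com.sun.jna.Pointer;
import net.sourceforge.tess4j.ITessAPI;
import net.sourceforge.tess4j.TessAPI1;
import net.sourceforge.tess4j.util.ImageIOHelper;

// Place this method inside a class
public String ocr(String url) {
    String datapath = "/usr/share/tesseract/"; // parent directory of tessdata
    String language = "chi_sim";
    try {
        url = url.trim();
        System.out.println("url is:" + url);
        BufferedImage image = ImageIO.read(new URL(url));
        if (image == null) { // ImageIO.read returns null when no reader supports the format
            return "no detected words!!";
        }
        ByteBuffer buf = ImageIOHelper.convertImageData(image);
        int bpp = image.getColorModel().getPixelSize();              // bits per pixel
        int bytespp = bpp / 8;                                        // bytes per pixel
        int bytespl = (int) Math.ceil(image.getWidth() * bpp / 8.0);  // bytes per line
        System.out.println("bpp is:" + bpp + ";bytespp is:" + bytespp + ";bytespl is:" + bytespl);

        // Initialize the engine
        ITessAPI.TessBaseAPI handle = TessAPI1.TessBaseAPICreate();
        try {
            if (TessAPI1.TessBaseAPIInit3(handle, datapath, language) != 0) {
                System.out.println("could not initialize tesseract with datapath " + datapath);
                return "no detected words!!";
            }
            TessAPI1.TessBaseAPISetPageSegMode(handle, ITessAPI.TessPageSegMode.PSM_AUTO);
            Pointer utf8Text = TessAPI1.TessBaseAPIRect(handle, buf, bytespp, bytespl,
                    0, 0, image.getWidth(), image.getHeight());
            String result = utf8Text.getString(0);
            TessAPI1.TessDeleteText(utf8Text);
            System.out.println("==============================================");
            System.out.println("result is:" + result);
            System.out.println("==============================================");
            if (result.isEmpty()) {
                System.out.println("no detected words!!");
            }
            return result;
        } finally {
            // Release the native handle even when recognition fails
            TessAPI1.TessBaseAPIDelete(handle);
        }
    } catch (Exception ex) {
        ex.printStackTrace();
    }
    return "no detected words!!";
}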
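The bytes-per-pixel and bytes-per-line arithmetic used in the test method can be tried in isolation with nothing but the JDK. The sketch below uses a hypothetical class name, `ImageStride`, and repeats the same formulas on in-memory images. It assumes the buffer handed to Tesseract has the same pixel layout that the image's color model reports; if a conversion step changes the layout, these values would need to be taken from the converted image instead.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper mirroring the stride arithmetic from the OCR test method. */
final class ImageStride {
    /** Bits per pixel as reported by the image's color model. */
    static int bitsPerPixel(BufferedImage image) {
        return image.getColorModel().getPixelSize();
    }

    /** Whole bytes per pixel; integer division yields 0 for images with fewer than 8 bits per pixel. */
    static int bytesPerPixel(BufferedImage image) {
        return bitsPerPixel(image) / 8;
    }

    /** Bytes per scan line, rounded up to a whole byte. */
    static int bytesPerLine(BufferedImage image) {
        return (int) Math.ceil(image.getWidth() * bitsPerPixel(image) / 8.0);
    }

    public static void main(String[] args) {
        BufferedImage[] samples = {
            new BufferedImage(10, 4, BufferedImage.TYPE_BYTE_GRAY),
            new BufferedImage(10, 4, BufferedImage.TYPE_3BYTE_BGR),
            new BufferedImage(10, 4, BufferedImage.TYPE_BYTE_BINARY)
        };
        for (BufferedImage img : samples) {
            System.out.println("bpp=" + bitsPerPixel(img)
                    + " bytespp=" + bytesPerPixel(img)
                    + " bytespl=" + bytesPerLine(img));
        }
    }
}
```

For a 10-pixel-wide image this gives 10 bytes per line at 8 bits per pixel, 30 at 24 bits, and 2 at 1 bit, where the bytes-per-pixel value becomes 0.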
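Note: datapath must be set to the parent directory of tessdata (here /usr/share/tesseract/, which contains the tessdata directory), not to tessdata itself.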
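A wrong datapath is easy to catch before the native library is called. The sketch below is one possible check, using a hypothetical class `TessdataCheck`: it assumes the language data is stored as `<datapath>/tessdata/<language>.traineddata`, which matches the `eng.traineddata` file in the package listing, and it uses only the JDK.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical pre-flight check for the datapath and language used by the OCR method. */
final class TessdataCheck {
    /**
     * Returns true if datapath/tessdata/<language>.traineddata exists as a regular file.
     * datapath is expected to be the PARENT of the tessdata directory.
     */
    static boolean hasLanguage(Path datapath, String language) {
        Path trained = datapath.resolve("tessdata").resolve(language + ".traineddata");
        return Files.isRegularFile(trained);
    }

    public static void main(String[] args) {
        Path datapath = Paths.get("/usr/share/tesseract");
        System.out.println("chi_sim available: " + hasLanguage(datapath, "chi_sim"));
        // Passing the tessdata directory itself is the mistake the note warns about:
        // the check then looks for tessdata/tessdata/chi_sim.traineddata.
        System.out.println("wrong level: " + hasLanguage(datapath.resolve("tessdata"), "chi_sim"));
    }
}
```

Calling `hasLanguage` before the engine is initialized turns a silent misconfiguration into an explicit message.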
3. Commands for finding where yum installed the files
# List the installed packages
$ rpm -qa | grep tesseract
tesseract-langpack-chi_sim-3.04.00-3.el7.noarch
tesseract-3.04.00-3.el7.x86_64
# List the files installed by a package
$ rpm -ql tesseract-3.04.00-3.el7.x86_64
/usr/bin/ambiguous_words
/usr/bin/classifier_tester
/usr/bin/cntraining
/usr/bin/combine_tessdata
/usr/bin/dawg2wordlist
/usr/bin/mftraining
/usr/bin/set_unicharset_properties
/usr/bin/shapeclustering
/usr/bin/tesseract
/usr/bin/text2image
/usr/bin/unicharset_extractor
/usr/bin/wordlist2dawg
/usr/lib64/libtesseract.so.3
/usr/lib64/libtesseract.so.3.0.4
/usr/share/doc/tesseract-3.04.00
/usr/share/doc/tesseract-3.04.00/AUTHORS
/usr/share/doc/tesseract-3.04.00/ChangeLog
/usr/share/doc/tesseract-3.04.00/NEWS
/usr/share/doc/tesseract-3.04.00/README
/usr/share/doc/tesseract-3.04.00/eurotext.tif
/usr/share/doc/tesseract-3.04.00/phototest.tif
/usr/share/licenses/tesseract-3.04.00
/usr/share/licenses/tesseract-3.04.00/COPYING
/usr/share/man/man1/ambiguous_words.1.gz
/usr/share/man/man1/cntraining.1.gz
/usr/share/man/man1/combine_tessdata.1.gz
/usr/share/man/man1/dawg2wordlist.1.gz
/usr/share/man/man1/mftraining.1.gz
/usr/share/man/man1/shapeclustering.1.gz
/usr/share/man/man1/tesseract.1.gz
/usr/share/man/man1/unicharset_extractor.1.gz
/usr/share/man/man1/wordlist2dawg.1.gz
/usr/share/man/man5/unicharambigs.5.gz
/usr/share/man/man5/unicharset.5.gz
/usr/share/tesseract
/usr/share/tesseract/tessdata
/usr/share/tesseract/tessdata/configs
/usr/share/tesseract/tessdata/configs/ambigs.train
/usr/share/tesseract/tessdata/configs/api_config
/usr/share/tesseract/tessdata/configs/bigram
/usr/share/tesseract/tessdata/configs/box.train
/usr/share/tesseract/tessdata/configs/box.train.stderr
/usr/share/tesseract/tessdata/configs/digits
/usr/share/tesseract/tessdata/configs/hocr
/usr/share/tesseract/tessdata/configs/inter
/usr/share/tesseract/tessdata/configs/kannada
/usr/share/tesseract/tessdata/configs/linebox
/usr/share/tesseract/tessdata/configs/logfile
/usr/share/tesseract/tessdata/configs/makebox
/usr/share/tesseract/tessdata/configs/pdf
/usr/share/tesseract/tessdata/configs/quiet
/usr/share/tesseract/tessdata/configs/rebox
/usr/share/tesseract/tessdata/configs/strokewidth
/usr/share/tesseract/tessdata/configs/unlv
/usr/share/tesseract/tessdata/eng.cube.bigrams
/usr/share/tesseract/tessdata/eng.cube.fold
/usr/share/tesseract/tessdata/eng.cube.lm
/usr/share/tesseract/tessdata/eng.cube.nn
/usr/share/tesseract/tessdata/eng.cube.params
/usr/share/tesseract/tessdata/eng.cube.size
/usr/share/tesseract/tessdata/eng.cube.word-freq
/usr/share/tesseract/tessdata/eng.tesseract_cube.nn
/usr/share/tesseract/tessdata/eng.traineddata
/usr/share/tesseract/tessdata/pdf.ttf
/usr/share/tesseract/tessdata/tessconfigs
/usr/share/tesseract/tessdata/tessconfigs/batch
/usr/share/tesseract/tessdata/tessconfigs/batch.nochop
/usr/share/tesseract/tessdata/tessconfigs/matdemo
/usr/share/tesseract/tessdata/tessconfigs/msdemo
/usr/share/tesseract/tessdata/tessconfigs/nobatch
/usr/share/tesseract/tessdata/tessconfigs/segdemo
List the symbols exported by a .so file (for example /usr/lib64/libtesseract.so.3 from the listing above):
nm -D xxx.so