Image text recognition in wxPython with the pytesser module
Source: Internet. Editor: 程序博客网. Time: 2024/06/06 02:09
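Pytesser: OCR in Python using the Tesseract engine from Google
pytesser is a Python module, hosted on Google Code, that wraps Google's open-source Tesseract OCR engine. Import it in Python and you can convert the text in an image into a string.
Link: https://code.google.com/p/pytesser/
pytesser does not do the recognition itself: your Python code calls pytesser, and pytesser in turn runs tesseract on the image.
The full procedure is as follows:
1. First, download pytesser from code.google.com: https://code.google.com/p/pytesser/downloads/detail?name=pytesser_v0.0.1.zip
It needs no installation; unzip it into the \Lib\site-packages\ folder of your Python installation and use it directly.
The pytesser archive includes tesseract.exe and the English language data (by default only English is recognized), plus a few sample images, so it works as soon as it is unzipped.
You can test it with the following code: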
>>> from pytesser import *
>>> image = Image.open('fnord.tif')  # Open image object using PIL
>>> print image_to_string(image)     # Run tesseract.exe on image
fnord
>>> print image_file_to_string('fnord.tif')
fnord

from pytesser import *
#im = Image.open('fnord.tif')
#im = Image.open('phototest.tif')
#im = Image.open('eurotext.tif')
im = Image.open('fonts_test.png')
text = image_to_string(im)
print text
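Note: this module requires the PIL library.
2. Improving a low recognition rate
You can enhance the image (for example, its contrast) or convert it to black and white; either can raise the recognition rate noticeably: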
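import ImageEnhance  # part of PIL; not imported by "from pytesser import *"
enhancer = ImageEnhance.Contrast(image1)  # image1 is an image already opened with Image.open
image2 = enhancer.enhance(4)              # a factor above 1 increases the contrast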
Then call image_to_string on image2 to run the recognition again.
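The black-and-white conversion mentioned above comes down to thresholding each grayscale pixel. The following is a minimal sketch in plain Python, not part of pytesser: the function name to_black_and_white and the threshold of 128 are assumptions chosen for illustration, and the image is represented as a nested list of 0-255 values rather than a PIL image. It is written for Python 3.

```python
def to_black_and_white(pixels, threshold=128):
    """Map every grayscale value (0-255) to 0 (black) or 255 (white).

    `pixels` is a list of rows, each row a list of integers.
    The threshold is a hypothetical default; tune it per image.
    """
    return [
        [255 if value >= threshold else 0 for value in row]
        for row in pixels
    ]


if __name__ == "__main__":
    sample = [[12, 200, 130],
              [127, 128, 255]]
    print(to_black_and_white(sample))
```

With PIL, the same idea would typically be applied by converting the image to grayscale with image.convert('L') and then mapping the pixel values with image.point before passing the result to image_to_string.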
3. Recognizing other languages
tesseract is a command-line program. Its usage is:
tesseract imagename outbase [-l lang] [-psm N] [configfile...]
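imagename is the name of the input image.
outbase is the base name of the output text file; the result is written to outbase.txt.
-l lang sets the language to recognize; the default is English.
For details, see http://tesseract-ocr.googlecode.com/svn-history/r725/trunk/doc/tesseract.1.html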
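The usage line above maps directly onto an argument list for subprocess. The sketch below only builds that list and does not run tesseract; the helper name build_tesseract_args and its keyword parameters are assumptions made for illustration, not pytesser functions. It is written for Python 3.

```python
def build_tesseract_args(image_name, out_base, lang=None, psm=None,
                         exe_name="tesseract"):
    """Build: tesseract imagename outbase [-l lang] [-psm N].

    Optional flags are appended only when a value is given, so that
    tesseract falls back to its own defaults otherwise.
    """
    args = [exe_name, image_name, out_base]
    if lang:
        args += ["-l", lang]
    if psm is not None:
        args += ["-psm", str(psm)]
    return args


if __name__ == "__main__":
    # The resulting list could be handed to subprocess.Popen(args).
    print(build_tesseract_args("temp.bmp", "temp", lang="chi_sim", psm=6))
```

Keeping the arguments as a list, rather than one joined string, avoids quoting problems when a file name contains spaces.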
Other languages can be recognized with the following steps:
(1) Download the data pack for the language you need:
https://code.google.com/p/tesseract-ocr/downloads/list
Put the language pack into pytesser's tessdata folder.
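Before calling tesseract with a new language, it can help to confirm that the pack really landed in the tessdata folder. This is a small sketch under the assumption that a pack is a single file named after its language code plus .traineddata; the helper names language_pack_path and has_language_pack are hypothetical and not part of pytesser. It is written for Python 3.

```python
import os


def language_pack_path(tessdata_dir, lang):
    """Return the expected path of a language pack, e.g. chi_sim.traineddata."""
    return os.path.join(tessdata_dir, lang + ".traineddata")


def has_language_pack(tessdata_dir, lang):
    """True if the pack file for `lang` exists in the tessdata folder."""
    return os.path.isfile(language_pack_path(tessdata_dir, lang))


if __name__ == "__main__":
    # "tessdata" here is an assumed relative path to pytesser's data folder.
    for code in ("eng", "chi_sim", "chi_tra"):
        print(code, has_language_pack("tessdata", code))
```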
Next, modify the parameters in pytesser.py. Here is an example:
"""OCR in Python using the Tesseract engine from Google
http://code.google.com/p/pytesser/
by Michael J.T. O'Kelly
V 0.0.2, 5/26/08"""

import Image
import subprocess
import os
import StringIO

import util
import errors

tesseract_exe_name = 'dlltest'    # Name of executable to be called at command line
scratch_image_name = "temp.bmp"   # This file must be .bmp or other Tesseract-compatible format
scratch_text_name_root = "temp"   # Leave out the .txt extension
_cleanup_scratch_flag = True      # Temporary files cleaned up after OCR operation
_language = ""                    # Tesseract uses English if language is not given
_pagesegmode = ""                 # Tesseract uses fully automatic page segmentation if psm is not given (psm is available in v3.01)
_working_dir = os.getcwd()

def call_tesseract(input_filename, output_filename, language, pagesegmode):
    """Calls external tesseract.exe on input file (restrictions on types),
    outputting output_filename+'txt'"""
    current_dir = os.getcwd()
    error_stream = StringIO.StringIO()
    try:
        os.chdir(_working_dir)
        args = [tesseract_exe_name, input_filename, output_filename]
        if len(language) > 0:
            args.append("-l")
            args.append(language)
        if len(str(pagesegmode)) > 0:
            args.append("-psm")
            args.append(str(pagesegmode))
        try:
            proc = subprocess.Popen(args)
        except (TypeError, AttributeError):
            proc = subprocess.Popen(args, shell=True)
        retcode = proc.wait()
        if retcode != 0:
            error_text = error_stream.getvalue()
            errors.check_for_errors(error_stream_text = error_text)
    finally:  # Guarantee that we return to the original directory
        error_stream.close()
        os.chdir(current_dir)

def image_to_string(im, lang = _language, psm = _pagesegmode, cleanup = _cleanup_scratch_flag):
    """Converts im to file, applies tesseract, and fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        util.image_to_scratch(im, scratch_image_name)
        call_tesseract(scratch_image_name, scratch_text_name_root, lang, psm)
        result = util.retrieve_result(scratch_text_name_root)
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return result

def image_file_to_string(filename, lang = _language, psm = _pagesegmode, cleanup = _cleanup_scratch_flag, graceful_errors=True):
    """Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
    converts to compatible format and then applies tesseract.  Fetches resulting text.
    If cleanup=True, delete scratch files after operation.
    Parameter lang specifies used language. If lang is empty, English is used.
    Page segmentation mode parameter psm is available in Tesseract 3.01.
    psm values are:
    0 = Orientation and script detection (OSD) only.
    1 = Automatic page segmentation with OSD.
    2 = Automatic page segmentation, but no OSD, or OCR
    3 = Fully automatic page segmentation, but no OSD. (Default)
    4 = Assume a single column of text of variable sizes.
    5 = Assume a single uniform block of vertically aligned text.
    6 = Assume a single uniform block of text.
    7 = Treat the image as a single text line.
    8 = Treat the image as a single word.
    9 = Treat the image as a single word in a circle.
    10 = Treat the image as a single character."""
    try:
        try:
            call_tesseract(filename, scratch_text_name_root, lang, psm)
            result = util.retrieve_result(scratch_text_name_root)
        except errors.Tesser_General_Exception:
            if graceful_errors:
                im = Image.open(filename)
                # Pass lang and psm through; passing cleanup alone would put it in the lang slot
                result = image_to_string(im, lang, psm, cleanup)
            else:
                raise
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return result

if __name__=='__main__':
    im = Image.open('phototest.tif')
    text = image_to_string(im, cleanup=False)
    print text
    text = image_to_string(im, psm=2, cleanup=False)
    print text
    try:
        text = image_file_to_string('fnord.tif', graceful_errors=False)
    except errors.Tesser_General_Exception, value:
        print "fnord.tif is incompatible filetype.  Try graceful_errors=True"
        #print value
    text = image_file_to_string('fnord.tif', graceful_errors=True, cleanup=False)
    print "fnord.tif contents:", text
    text = image_file_to_string('fonts_test.png', graceful_errors=True)
    print text
    text = image_file_to_string('fonts_test.png', lang="eng", psm=4, graceful_errors=True)
    print text
That is the version provided in the project source. If all you need is to recognize other languages, adding a single language parameter is enough. Here is my version:
"""OCR in Python using the Tesseract engine from Google
http://code.google.com/p/pytesser/
by Michael J.T. O'Kelly
V 0.0.1, 3/10/07"""

import Image
import subprocess

import util
import errors

tesseract_exe_name = 'tesseract'  # Name of executable to be called at command line
scratch_image_name = "temp.bmp"   # This file must be .bmp or other Tesseract-compatible format
scratch_text_name_root = "temp"   # Leave out the .txt extension
cleanup_scratch_flag = True       # Temporary files cleaned up after OCR operation

def call_tesseract(input_filename, output_filename, language):
    """Calls external tesseract.exe on input file (restrictions on types),
    outputting output_filename+'txt'"""
    args = [tesseract_exe_name, input_filename, output_filename, "-l", language]
    proc = subprocess.Popen(args)
    retcode = proc.wait()
    if retcode != 0:
        errors.check_for_errors()

def image_to_string(im, cleanup = cleanup_scratch_flag, language = "eng"):
    """Converts im to file, applies tesseract, and fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        util.image_to_scratch(im, scratch_image_name)
        call_tesseract(scratch_image_name, scratch_text_name_root, language)
        text = util.retrieve_text(scratch_text_name_root)
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return text

def image_file_to_string(filename, cleanup = cleanup_scratch_flag, graceful_errors=True, language = "eng"):
    """Applies tesseract to filename; or, if image is incompatible and graceful_errors=True,
    converts to compatible format and then applies tesseract.  Fetches resulting text.
    If cleanup=True, delete scratch files after operation."""
    try:
        try:
            call_tesseract(filename, scratch_text_name_root, language)
            text = util.retrieve_text(scratch_text_name_root)
        except errors.Tesser_General_Exception:
            if graceful_errors:
                im = Image.open(filename)
                # Pass language through so the fallback path does not revert to English
                text = image_to_string(im, cleanup, language)
            else:
                raise
    finally:
        if cleanup:
            util.perform_cleanup(scratch_image_name, scratch_text_name_root)
    return text

if __name__=='__main__':
    im = Image.open('phototest.tif')
    text = image_to_string(im)
    print text
    try:
        text = image_file_to_string('fnord.tif', graceful_errors=False)
    except errors.Tesser_General_Exception, value:
        print "fnord.tif is incompatible filetype.  Try graceful_errors=True"
        print value
    text = image_file_to_string('fnord.tif', graceful_errors=True)
    print "fnord.tif contents:", text
    text = image_file_to_string('fonts_test.png', graceful_errors=True)
    print text
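When calling image_to_string, just add the matching language argument as the last parameter: chi_sim for Simplified Chinese, chi_tra for Traditional Chinese.
In other words, the argument is the XXX in the name of the downloaded XXX.traineddata file. If the Chinese pack you downloaded is chi_sim.traineddata, the argument is chi_sim: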
text = image_to_string(self.im, language = 'chi_sim')
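With that, image recognition is complete.
One more note: the Chinese text may be recognized correctly but display as garbled characters. In that case, convert text to the Chinese encoding you are using, for example:
text.decode("utf8")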
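The decoding step can be wrapped in a small helper that tries more than one encoding. This is a sketch only, written for Python 3, where the raw OCR output would be a bytes object: the function name decode_ocr_output is hypothetical, and the gbk fallback is an assumption about which Chinese encoding might be in use.

```python
def decode_ocr_output(raw, encodings=("utf-8", "gbk")):
    """Decode raw OCR bytes, trying each encoding in order.

    Text that is already decoded is returned unchanged. If every
    encoding fails, undecodable bytes are replaced rather than raising.
    """
    if not isinstance(raw, bytes):
        return raw
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode(encodings[0], errors="replace")


if __name__ == "__main__":
    print(decode_ocr_output("中文".encode("utf-8")))
    print(decode_ocr_output("中文".encode("gbk")))
```

Trying UTF-8 first is a deliberate choice: bytes in other encodings rarely form valid UTF-8 by accident, so a successful UTF-8 decode is a fairly reliable signal.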