Installing and Using Tesseract

Source: Internet · Editor: 程序博客网 · Time: 2024/06/07 15:22

Mac

brew install tesseract

This is a short writeup of the working process I came up with for command-line OCR of a non-OCR’d PDF with searchable PDF output on OS X, after running into a thousand little gotchas. ¹

Software Installation

Install homebrew (if you haven’t already).

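Install ImageMagick along with Ghostscript, which it needs to read PDFs. Current Homebrew no longer accepts the --with-libtiff and --with-ghostscript options that older versions of this command used, so install Ghostscript as its own formula:

brew install imagemagick ghostscript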

Install Tesseract with all languages:
```
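# Current Homebrew has dropped the --all-languages option;
# the extra language data is packaged as the tesseract-lang formula.
brew install tesseract tesseract-lang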
```
Install pdftk server from the package installer.
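
Before moving on, it can help to confirm that each tool actually ended up on your PATH. The following is a minimal sketch rather than part of the original workflow: `missing_tools` is a hypothetical helper name, and it assumes the binaries are called `convert`, `tesseract`, and `pdftk`, as in the commands used in this writeup.

```shell
# Hypothetical helper: print each named command that is not found on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Running `missing_tools convert tesseract pdftk` should print nothing once everything is installed; any name it prints still needs installing.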

Processing Workflow

I’m going to assume you have a non-OCR’d PDF you want to convert into a searchable PDF.

Split and convert the PDF with ImageMagick convert:

convert -density 300 input.pdf -type Grayscale -compress lzw -background white +matte -depth 32 page_%05d.tif

OCR the pages with Tesseract: ² ³

for i in page_*.tif; do echo "$i"; tesseract "$i" "$(basename "$i" .tif)" pdf; done
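
If a long run is interrupted, rerunning the loop above repeats every page. Here is a hedged sketch of a restartable variant. `ocr_pages` is a hypothetical function name, and the sketch assumes the `page_*.tif` naming from the convert step and that an existing `.pdf` for a page means that page already finished.

```shell
# Hypothetical helper: OCR every page_*.tif in the current directory,
# skipping pages whose PDF already exists.
ocr_pages() {
  for page in page_*.tif; do
    [ -e "$page" ] || continue            # the glob matched nothing
    base=$(basename "$page" .tif)
    [ -f "$base.pdf" ] && continue        # assume this page is already done
    echo "$page"
    tesseract "$page" "$base" pdf || return 1
  done
}
```

Call `ocr_pages` from the directory holding the page images. A page whose PDF was only partially written would be skipped under this assumption, so delete any suspect PDF before rerunning.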

Join your individual PDF files into a single, searchable PDF with pdftk: ⁴
```
pdftk page_*.pdf cat output merged.pdf
```
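
Because pdftk concatenates whatever `page_*.pdf` matches, a page that failed OCR would be silently missing from merged.pdf. The sketch below lists any page image that has no corresponding PDF, so you can check before merging. It assumes the file naming used above, and `pages_without_pdf` is a hypothetical helper name. The zero-padded `%05d` numbering from the convert step is what keeps the shell glob in page order.

```shell
# Hypothetical helper: print every page_*.tif that has no matching .pdf yet.
pages_without_pdf() {
  for page in page_*.tif; do
    [ -e "$page" ] || continue                 # the glob matched nothing
    [ -f "${page%.tif}.pdf" ] || echo "$page"
  done
}
```

An empty result means every page image has a PDF and the merge can go ahead.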

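If the source image is low resolution, upscale it and convert it to grayscale before running OCR:

convert 9.png -resize 3000% -type Grayscale input9.tif

tesseract input9.tif output9 -l eng

If -l is omitted, Tesseract defaults to eng (English):

tesseract input9.png output9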
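
The language passed with -l has to be installed, so it may be worth checking first. `tesseract --list-langs` prints the installed language codes. The sketch below wraps it in a hypothetical helper, `has_lang`, and assumes the output lists one code per line.

```shell
# Hypothetical helper: succeed if the given language code is installed.
# Assumes `tesseract --list-langs` prints one language code per line;
# stderr is merged in case the list is written there.
has_lang() {
  tesseract --list-langs 2>&1 | grep -qx "$1"
}
```

For example, `has_lang eng && tesseract input9.tif output9 -l eng` runs OCR only when the English data is present.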

0 0

Tesseract安装使用

↳ Command-Line OCR with Tesseract on Mac OS X

Software Installation

Processing Workflow