ImageMagick image recognition & data scraping (reposted from: http://michael-roshen.iteye.com/blog/1982817)
Installation:
sudo apt-get install imagemagick
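ImageMagick is a software suite for creating, editing, and compositing images. It can read, convert, and write images in many formats. It can crop images, replace colors, apply a range of effects, rotate and combine images, and draw text, lines, polygons, ellipses, and curves onto an image, as well as stretch and rotate them. ImageMagick is free software: the full source code is open, and it may be freely used, copied, modified, and distributed. It runs on most operating systems.
To check whether a given format is supported:
identify -list format | grep PNG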
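The same check can be done from Ruby by scanning the text that `identify -list format` prints. The following is a minimal sketch, not part of the original post: `format_mode` is a hypothetical helper, and it assumes the listing has the columns Format, Module, Mode, Description, with the mode written as three characters such as `rw-`.

```ruby
# Look up one format name in the text printed by `identify -list format`.
# Returns the mode column (for example "rw-") or nil when the format is not listed.
# Assumption: each entry looks like "  PNG* PNG  rw-  description".
def format_mode(listing, name)
  pattern = /^\s*#{Regexp.escape(name)}\*?\s+\S+\s+([r-][w-][+-])/
  match = listing.match(pattern)
  match && match[1]
end

# Possible usage (requires ImageMagick to be installed):
#   listing = `identify -list format`
#   puts format_mode(listing, 'PNG').inspect
```

An `r` in the first position of the returned mode would indicate that the format can be read, and a `w` in the second that it can be written.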
sudo apt-get install tesseract-ocr
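OCR (Optical Character Recognition) is the process of analyzing the text in an image file and extracting it as characters.
Tesseract is an open-source OCR engine. It was originally developed at HP Labs, later contributed to the open-source community, and then improved, debugged, optimized, and re-released by Google.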
gem install mini_magick
mini_magick is a Ruby wrapper around the ImageMagick commands. Basic usage looks like this:
image = MiniMagick::Image.open("input.jpg")
image.resize "100x100"
image.write "output.jpg"

image = MiniMagick::Image.open("http://www.google.com/images/logos/logo.png")
image.resize "5x5"
image.format "gif"
image.write "localcopy.gif"
https://github.com/minimagick/minimagick
gem install rtesseract
rtesseract is a Ruby wrapper around tesseract-ocr.
Recognizing a CAPTCHA:
RTesseract.new(img.path).to_s
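The string returned by OCR often carries stray whitespace or punctuation, so it usually helps to normalize it before submitting it anywhere. A minimal sketch, not from the original post: `clean_captcha_text` is a hypothetical helper, and it assumes the CAPTCHA consists only of ASCII letters and digits.

```ruby
# Keep only ASCII letters and digits from raw OCR output.
# Returns nil when the input is nil or nothing usable remains.
# Assumption: the CAPTCHA alphabet is A-Z, a-z, 0-9.
def clean_captcha_text(raw)
  return nil if raw.nil?

  cleaned = raw.gsub(/[^A-Za-z0-9]/, '')
  cleaned.empty? ? nil : cleaned
end

# Possible usage with the call above:
#   text = clean_captcha_text(RTesseract.new(img.path).to_s)
```

Returning nil for an empty result lets the caller decide whether to fetch a fresh CAPTCHA image and try again.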
If installation on Ubuntu 12.04 fails with the error shown in the log below, install libmagickwand-dev first:
sudo apt-get install libmagickwand-dev
zhaol-a@ubuntu:~/gcj_project$ gem install rtesseract
/usr/share/ruby-rvm/rubies/ruby-1.9.3-p392/lib/ruby/1.9.1/yaml.rb:56:in `<top (required)>':
It seems your ruby installation is missing psych (for YAML output).
To eliminate this warning, please install libyaml and reinstall your ruby.
Building native extensions. This could take a while...
ERROR: Error installing rtesseract:
ERROR: Failed to build gem native extension.
/usr/share/ruby-rvm/rubies/ruby-1.9.3-p392/bin/ruby extconf.rb
checking for Ruby version >= 1.8.5... yes
checking for gcc... yes
checking for Magick-config... no
Can't install RMagick 2.13.2. Can't find Magick-config in /usr/share/ruby-rvm/gems/ruby-1.9.3-p392/bin:/usr/share/ruby-rvm/gems/ruby-1.9.3-p392@global/bin:/usr/share/ruby-rvm/rubies/ruby-1.9.3-p392/bin:/usr/share/ruby-rvm/gems/ruby-1.9.3-p392/bin:/usr/share/ruby-rvm/gems/ruby-1.9.3-p392@global/bin:/usr/share/ruby-rvm/rubies/ruby-1.9.3-p392/bin:/usr/share/ruby-rvm/bin:/usr/lib/lightdm/lightdm:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
*** extconf.rb failed ***
Could not create Makefile due to some reason, probably lack of
necessary libraries and/or headers. Check the mkmf.log file for more
details. You may need configuration options.
Provided configuration options:
--with-opt-dir
--without-opt-dir
--with-opt-include
--without-opt-include=${opt-dir}/include
--with-opt-lib
--without-opt-lib=${opt-dir}/lib
--with-make-prog
--without-make-prog
--srcdir=.
--curdir
--ruby=/usr/share/ruby-rvm/rubies/ruby-1.9.3-p392/bin/ruby
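The key line in the log is "checking for Magick-config... no": the build script could not find the `Magick-config` executable in any directory on PATH. A minimal sketch of running the same kind of check before installing the gem; `find_in_path` is a hypothetical helper, not part of RMagick or rtesseract.

```ruby
# Return the full path of the first executable file called `name` found in
# the directories listed in `path_env`, or nil when there is none.
def find_in_path(name, path_env = ENV.fetch('PATH', ''))
  path_env.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?

    candidate = File.join(dir, name)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

if find_in_path('Magick-config')
  puts 'Magick-config found on PATH.'
else
  puts 'Magick-config not found on PATH; try installing libmagickwand-dev.'
end
```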
An example of scraping data with mechanize and nokogiri, including CAPTCHA recognition:
# encoding: utf-8
require 'json'
require 'mechanize'
require 'nokogiri'
require 'spreadsheet'
require 'rtesseract'
require 'mini_magick'
require 'open-uri'
require 'uri'
require 'iconv'
require 'active_support/all'

# Enlarge and simplify the image to improve recognition accuracy.
def parse_img_to_str(img_url)
  img = MiniMagick::Image.open(img_url)
  img.resize '200%x200%'               # enlarge
  img.colorspace("GRAY")               # convert to grayscale
  img.monochrome                       # reduce to black and white
  str = RTesseract.new(img.path).to_s  # run OCR
  File.unlink(img.path)                # delete the temporary file
  if str.nil?
    return nil
  else
    return str.strip.to_f              # the recognized text is expected to be a number
  end
end

# Retrying on errors reduces failures caused by network latency and timeouts,
# but cannot avoid them completely.
# Work out the request URL and parameters by inspecting the site's requests.
# max_retries is an added safeguard (the original retried without limit).
def get_json_data(agent, url, params, max_retries = 3)
  attempts = 0
  begin
    response = agent.post(url, params)
    json_data = JSON.parse(response.body)
    return json_data
  rescue StandardError
    attempts += 1
    retry if attempts <= max_retries   # restart from the beginning of the block
    raise
  end
end

p "waiting................."
agent = Mechanize.new
agent.open_timeout = 15
file_path = File.dirname(__FILE__)

# After logging in to the target site, export the cookies from Firefox to cookies.txt.
cookies_file_path = "/home/username/project/cookies.txt"
# Load the cookies so the login step can be skipped.
agent.cookie_jar.load_cookiestxt(cookies_file_path)

# Example: fetch the list of years.
# `area` and `gov_mat_ttile_url` are not defined in this excerpt.
title_params = {"area" => area, "province" => "北京"}
years_info = get_json_data(agent, gov_mat_ttile_url, title_params)["result"]
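The retry logic in `get_json_data` can be factored into a reusable helper that gives up after a fixed number of attempts and pauses between them, so a site that is down does not cause an endless loop. A minimal sketch, not from the original post: `with_retries`, `max_attempts`, and `base_delay` are hypothetical names, and the linearly growing pause is just one possible choice.

```ruby
# Run the given block, retrying on StandardError up to max_attempts times.
# Waits base_delay * attempt_number seconds before each retry and re-raises
# the last error once the attempts are used up.
def with_retries(max_attempts: 3, base_delay: 1.0)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StandardError
    raise if attempts >= max_attempts

    sleep(base_delay * attempts)
    retry
  end
end

# Possible usage with the objects from the script above:
#   json_data = with_retries { JSON.parse(agent.post(url, params).body) }
```

Rescuing `StandardError` rather than everything keeps interrupts such as Ctrl-C from being swallowed by the retry loop.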