Python Scrapy crawler (to be continued)


0. Crawlers

Scrapy 轻松定制网络爬虫 (Scrapy: building a custom web crawler with ease)

0.1 The two parts of a crawler:

1. Downloading web pages

  • Make full use of the local bandwidth
  • Schedule requests to different sites so as not to overload the remote servers
  • DNS lookups
  • Follow common conventions such as robots.txt

2. Processing the pages

  • Fetching dynamic content
  • Spider traps
  • Content de-duplication
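
Scrapy, introduced below, covers most of these concerns through its settings. A minimal sketch of a project's settings.py (the values are illustrative, not recommendations):

# settings.py -- illustrative values only
ROBOTSTXT_OBEY = True                 # honour robots.txt
DOWNLOAD_DELAY = 1.0                  # pause between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # limit the load placed on any one server
DNSCACHE_ENABLED = True               # cache DNS lookups
# Request de-duplication (by URL fingerprint) is enabled by default.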

1. Scrapy

1.1 Installing Scrapy

pip install scrapy
pip install service_identity

Without service_identity installed you will see a warning:

warning::0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.

Traceback (most recent call last):
  File "C:\Users\2015\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
URLError: <urlopen error [errno 10051]
Solution (see https://github.com/scrapy/scrapy/issues/1054): disable the S3 download handler in settings.py:

DOWNLOAD_HANDLERS = {
    's3': None,
}

1.2 Twisted

Scrapy uses Twisted, an asynchronous networking library, to handle network communication. The overall architecture is shown in the figure below:

[Scrapy architecture diagram]
The green lines are the data flow. Crawling starts from the initial URLs: the Scheduler hands them to the Downloader, which fetches the pages and passes them to the Spider for parsing. The Spider produces two kinds of results. One is links that need further crawling, such as the "next page" links mentioned earlier; these are sent back to the Scheduler. The other is data to be saved, which goes to the Item Pipeline, where post-processing (detailed analysis, filtering, storage, and so on) takes place. In addition, various middlewares can be plugged into the data path to perform whatever processing is needed.
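
In code, these two kinds of results are simply Requests and items yielded from a spider callback. A minimal sketch (the site is the quotes.toscrape.com example from the Scrapy tutorial; selectors and field names are illustrative):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Data to be saved: goes on to the Item Pipeline.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # A link that needs further crawling: goes back to the Scheduler.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)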

2. The first Scrapy crawler

Scrapy Tutorial

# In the folder where you want the project (e.g. D:/xx/yy), Shift + right-click,
# open a command prompt, and run:
scrapy startproject projectname
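
The command generates a project skeleton roughly like the following (as described in the Scrapy Tutorial; details vary slightly between versions):

projectname/
    scrapy.cfg            # deploy configuration file
    projectname/          # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # put your spiders here
            __init__.py

A spider placed under spiders/ is then run from the project directory with: scrapy crawl spidername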

3. Simulating a browser to handle JavaScript

Click a Button in Scrapy
selenium with scrapy for dynamic page

from scrapy import Spider, Request
from selenium import webdriver

class northshoreSpider(Spider):
    name = 'xxx'
    allowed_domains = ['www.example.org']
    start_urls = ['https://www.example.org']

    def __init__(self):
        self.driver = webdriver.Firefox()   # open a real browser to render the JS

    def parse(self, response):
        self.driver.get('https://www.example.org/abc')
        while True:
            try:
                # keep clicking the "next" button until it is no longer found
                next = self.driver.find_element_by_xpath('//*[@id="BTN_NEXT"]')
                url = 'http://www.example.org/abcd'
                yield Request(url, callback=self.parse2)
                next.click()
            except:
                break
        self.driver.close()

    def parse2(self, response):
        print 'you are here!'
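
One of the references below, "passing selenium response url to scrapy", suggests a related idea: hand the Selenium-rendered HTML back to Scrapy's selectors instead of issuing a second Request. A rough sketch under that assumption (spider name, URL, and XPath are placeholders):

from scrapy import Spider
from scrapy.http import HtmlResponse
from selenium import webdriver

class RenderedSpider(Spider):
    name = 'rendered_example'                 # placeholder name
    start_urls = ['https://www.example.org']  # placeholder URL

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        # Wrap the JS-rendered page so normal Scrapy selectors work on it.
        rendered = HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
        )
        for title in rendered.xpath('//h2/a/text()').extract():   # placeholder XPath
            yield {'title': title}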

4. Simulating login with the FormRequest module (not used here)
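
Although it is not used here, a login with FormRequest.from_response typically looks roughly like the following (the URL, form field names, and failure check are placeholders, following the usual pattern from the Scrapy documentation):

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_example'
    start_urls = ['http://www.example.com/users/login.php']  # placeholder URL

    def parse(self, response):
        # Fill in the login form found on the page and submit it.
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        if b'authentication failed' in response.body:   # placeholder failure check
            self.logger.error('Login failed')
            return
        # ... continue crawling as a logged-in user here ...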

5. Crawling JD.com product reviews

start_urls = ["http://item.jd.com/1217499.html", ]
# It seems that when crawling the review URLs directly through the AJAX
# interface, you have to visit the product page above first? Possibly because
# a Referer: http://item.jd.com/1217499.html header is expected.

unicode(response.body.decode(response.encoding)).encode('utf-8')
# This line is commonly used to convert the response body to UTF-8.
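
If the Referer guess above is right, the header can be set explicitly when requesting the review interface. A sketch under that assumption (the comment URL and the JSON structure are placeholders, not JD's real interface):

import json
import scrapy

class JdCommentSpider(scrapy.Spider):
    name = 'jd_comments'
    start_urls = ['http://item.jd.com/1217499.html']

    def parse(self, response):
        # Hypothetical AJAX endpoint for the reviews; the real URL and its
        # parameters have to be found in the browser's network panel.
        comment_url = 'http://example.com/comments?productId=1217499&page=0'
        yield scrapy.Request(
            comment_url,
            headers={'Referer': response.url},   # send the product page as Referer
            callback=self.parse_comments,
        )

    def parse_comments(self, response):
        body = response.body.decode(response.encoding)   # decode to unicode first
        data = json.loads(body)                          # assuming a JSON payload
        for comment in data.get('comments', []):
            yield {'content': comment.get('content')}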

To be continued

References:
passing selenium response url to scrapy
Downloader Middleware (下载器中间件)
Scrapy Tutorial
Scrapy at a glance (初窥Scrapy)

On Stack Overflow, about handling AJAX with Scrapy:
Can scrapy be used to scrape dynamic content from websites that are using AJAX?
Scrapy follow pagination AJAX Request - POST
using scrapy to scrap asp.net website with javascript buttons and ajax requests
