Python Scrapy crawler (to be continued)


0. Crawlers

Scrapy 轻松定制网络爬虫 (Scrapy: building a custom web crawler with ease)

0.1 The two parts of a crawler:

1. Downloading web pages

  • Make full use of the local bandwidth
  • Schedule requests to different sites so as not to overload the remote servers
  • DNS lookups
  • Follow common conventions such as robots.txt

2. Processing the pages

  • Fetching dynamic content
  • Spider traps
  • Content de-duplication
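
Scrapy, introduced below, covers most of these concerns through its settings. A minimal sketch of a project's settings.py (the values are illustrative, not recommendations):

# settings.py -- illustrative values only
ROBOTSTXT_OBEY = True                 # honour robots.txt
DOWNLOAD_DELAY = 1.0                  # pause between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # limit the load placed on any one server
DNSCACHE_ENABLED = True               # cache DNS lookups
# Request de-duplication (by URL fingerprint) is enabled by default.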

1. Scrapy

1.1 Installing Scrapy

pip install scrapy
pip install service_identity

Without service_identity installed you will see a warning:

warning::0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.

Traceback (most recent call last):
  File "C:\Users\2015\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
URLError: <urlopen error [errno 10051]
Solution (see https://github.com/scrapy/scrapy/issues/1054): disable the S3 download handler in settings.py:

DOWNLOAD_HANDLERS = {
    's3': None,
}

1.2 Twisted

Scrapy uses Twisted, an asynchronous networking library, to handle network communication. The overall architecture is shown in the figure below:

[Scrapy architecture diagram]
The green lines are the data flow. Crawling starts from the initial URLs: the Scheduler hands them to the Downloader, which fetches the pages and passes them to the Spider for parsing. The Spider produces two kinds of results. One is links that need further crawling, such as the "next page" links mentioned earlier; these are sent back to the Scheduler. The other is data to be saved, which goes to the Item Pipeline, where post-processing (detailed analysis, filtering, storage, and so on) takes place. In addition, various middlewares can be plugged into the data path to perform whatever processing is needed.
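
In code, these two kinds of results are simply Requests and items yielded from a spider callback. A minimal sketch (the site is the quotes.toscrape.com example from the Scrapy tutorial; selectors and field names are illustrative):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Data to be saved: goes on to the Item Pipeline.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # A link that needs further crawling: goes back to the Scheduler.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)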

2. The first Scrapy crawler

Scrapy Tutorial

# In the folder where you want the project (e.g. D:/xx/yy), Shift + right-click,
# open a command prompt, and run:
scrapy startproject projectname
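
The command generates a project skeleton roughly like the following (as described in the Scrapy Tutorial; details vary slightly between versions):

projectname/
    scrapy.cfg            # deploy configuration file
    projectname/          # the project's Python module
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # put your spiders here
            __init__.py

A spider placed under spiders/ is then run from the project directory with: scrapy crawl spidername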

3. Simulating a browser to handle JavaScript

Click a Button in Scrapy
selenium with scrapy for dynamic page

from scrapy import Spider, Request
from selenium import webdriver

class northshoreSpider(Spider):
    name = 'xxx'
    allowed_domains = ['www.example.org']
    start_urls = ['https://www.example.org']

    def __init__(self):
        self.driver = webdriver.Firefox()   # open a real browser to render the JS

    def parse(self, response):
        self.driver.get('https://www.example.org/abc')
        while True:
            try:
                # keep clicking the "next" button until it is no longer found
                next = self.driver.find_element_by_xpath('//*[@id="BTN_NEXT"]')
                url = 'http://www.example.org/abcd'
                yield Request(url, callback=self.parse2)
                next.click()
            except:
                break
        self.driver.close()

    def parse2(self, response):
        print 'you are here!'
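
One of the references below, "passing selenium response url to scrapy", suggests a related idea: hand the Selenium-rendered HTML back to Scrapy's selectors instead of issuing a second Request. A rough sketch under that assumption (spider name, URL, and XPath are placeholders):

from scrapy import Spider
from scrapy.http import HtmlResponse
from selenium import webdriver

class RenderedSpider(Spider):
    name = 'rendered_example'                 # placeholder name
    start_urls = ['https://www.example.org']  # placeholder URL

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)
        # Wrap the JS-rendered page so normal Scrapy selectors work on it.
        rendered = HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
        )
        for title in rendered.xpath('//h2/a/text()').extract():   # placeholder XPath
            yield {'title': title}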

4. Simulating login with the FormRequest module (not used here)
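
Although it is not used here, a login with FormRequest.from_response typically looks roughly like the following (the URL, form field names, and failure check are placeholders, following the usual pattern from the Scrapy documentation):

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login_example'
    start_urls = ['http://www.example.com/users/login.php']  # placeholder URL

    def parse(self, response):
        # Fill in the login form found on the page and submit it.
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        if b'authentication failed' in response.body:   # placeholder failure check
            self.logger.error('Login failed')
            return
        # ... continue crawling as a logged-in user here ...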

5. Crawling JD.com product reviews

start_urls = ["http://item.jd.com/1217499.html", ]
# It seems that when crawling the review URLs directly through the AJAX
# interface, you have to visit the product page above first? Possibly because
# a Referer: http://item.jd.com/1217499.html header is expected.

unicode(response.body.decode(response.encoding)).encode('utf-8')
# This line is commonly used to convert the response body to UTF-8.
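
If the Referer guess above is right, the header can be set explicitly when requesting the review interface. A sketch under that assumption (the comment URL and the JSON structure are placeholders, not JD's real interface):

import json
import scrapy

class JdCommentSpider(scrapy.Spider):
    name = 'jd_comments'
    start_urls = ['http://item.jd.com/1217499.html']

    def parse(self, response):
        # Hypothetical AJAX endpoint for the reviews; the real URL and its
        # parameters have to be found in the browser's network panel.
        comment_url = 'http://example.com/comments?productId=1217499&page=0'
        yield scrapy.Request(
            comment_url,
            headers={'Referer': response.url},   # send the product page as Referer
            callback=self.parse_comments,
        )

    def parse_comments(self, response):
        body = response.body.decode(response.encoding)   # decode to unicode first
        data = json.loads(body)                          # assuming a JSON payload
        for comment in data.get('comments', []):
            yield {'content': comment.get('content')}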

To be continued

References:
passing selenium response url to scrapy
Downloader Middleware (下载器中间件)
Scrapy Tutorial
Scrapy at a glance (初窥Scrapy)

On Stack Overflow, about handling AJAX with Scrapy:
Can scrapy be used to scrape dynamic content from websites that are using AJAX?
Scrapy follow pagination AJAX Request - POST
using scrapy to scrap asp.net website with javascript buttons and ajax requests
