Python Scrapy Crawler (to be continued)
0. Crawlers
Scrapy: Easily Build a Custom Web Crawler
0.1 The two parts of a crawler:
1. Downloading web pages
- make the most of the local bandwidth
- schedule requests against different sites so as to lighten the load on their servers
- DNS lookups
- follow the usual conventions (e.g. robots.txt); see the settings sketch after this list
2. Processing the pages
- fetching dynamic content
- spider traps
- content de-duplication
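Scrapy covers several of these concerns through plain configuration. A minimal sketch of the relevant knobs in a project's settings.py (the concrete values below are illustrative assumptions, not recommendations):

# settings.py -- politeness and de-duplication knobs in Scrapy

# Honour robots.txt (conventions)
ROBOTSTXT_OBEY = True

# Spread the load across sites (scheduling / server load)
CONCURRENT_REQUESTS_PER_DOMAIN = 8
DOWNLOAD_DELAY = 0.5          # seconds between requests to the same site

# Cache DNS lookups in-process (DNS queries)
DNSCACHE_ENABLED = True

# Already-seen request URLs are dropped by the scheduler's duplicate
# filter (one facet of de-duplication; this is Scrapy's default class)
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'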
1. Scrapy
1.1 Installing Scrapy
pip install scrapy
pip install service_identity
Without service_identity installed you will see this warning:
warning::0: UserWarning: You do not have a working installation of the service_identity module: 'No module named service_identity'. Please install it from <https://pypi.python.org/pypi/service_identity> and make sure all of its dependencies are satisfied. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification. Many valid certificate/hostname mappings may be rejected.
A second problem you may hit is a timeout at startup:
Traceback (most recent call last):
  File "C:\Users\2015\Anaconda\lib\site-packages\boto\utils.py", line 210, in retry_url
    r = opener.open(req, timeout=timeout)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 431, in open
    response = self._open(req, data)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 449, in _open
    '_open', req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 409, in _call_chain
    result = func(*args)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1227, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "C:\Users\2015\Anaconda\lib\urllib2.py", line 1197, in do_open
    raise URLError(err)
URLError: <urlopen error timed out>
URLError: <urlopen error [errno 10051]
Solution (https://github.com/scrapy/scrapy/issues/1054): disable the S3 download handler in settings.py:

DOWNLOAD_HANDLERS = {
    's3': None,
}
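The timeout appears to come from boto (note boto\utils.py in the traceback), which the S3 handler pulls in and which tries to contact Amazon's metadata service at startup; on a machine with no route to it, the request hangs until it times out, so disabling the unused handler sidesteps the probe entirely.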
1.2 Twisted
Scrapy uses Twisted, an asynchronous networking library, to handle network communication. In the overall architecture diagram (green lines marking the data flow), crawling starts from the initial URLs: the Scheduler hands each request to the Downloader, and the downloaded page is passed to the Spider for analysis. The Spider produces two kinds of results. One is links that need further crawling, such as the "next page" links analysed earlier; these are sent back to the Scheduler. The other is data to be saved; this goes to the Item Pipeline, the place where data is post-processed (detailed analysis, filtering, storage, and so on). In addition, various middlewares can be installed along the channels the data flows through, to perform whatever processing is needed.
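To make the Item Pipeline stage concrete, here is a minimal sketch of a pipeline that both filters and stores items (the 'title' field name is an assumption for illustration):

# pipelines.py -- a minimal Item Pipeline: filter, then store
import json
from scrapy.exceptions import DropItem

class FilterAndStorePipeline(object):

    def open_spider(self, spider):
        self.fp = open('items.jl', 'w')

    def close_spider(self, spider):
        self.fp.close()

    def process_item(self, item, spider):
        if not item.get('title'):           # filtering: discard incomplete items
            raise DropItem('missing title')
        self.fp.write(json.dumps(dict(item)) + '\n')  # storage: one JSON object per line
        return item

A pipeline is switched on in settings.py, e.g. ITEM_PIPELINES = {'projectname.pipelines.FilterAndStorePipeline': 300}.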
2. A first Scrapy crawler
Scrapy Tutorial
# In the directory where the crawler should live (e.g. D:/xx/yy), Shift+right-click,
# open a command prompt there, and run:
scrapy startproject projectname
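startproject only generates the project skeleton; the spider itself goes into the spiders/ directory. A minimal sketch, assuming the quotes.toscrape.com practice site used by the official tutorial (the CSS selectors match that site's markup, not a general recipe):

# projectname/spiders/quotes_spider.py -- a minimal first spider
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # one item per quote block on the page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # hand the "next page" link back to the Scheduler
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Run it from the project directory with: scrapy crawl quotes -o quotes.json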
3. Driving a browser to render JavaScript
Click a Button in Scrapy
selenium with scrapy for dynamic page
from selenium import webdriver
from scrapy import Spider, Request

class northshoreSpider(Spider):
    name = 'xxx'
    allowed_domains = ['www.example.org']
    start_urls = ['https://www.example.org']

    def __init__(self):
        self.driver = webdriver.Firefox()   # a real browser, so JavaScript gets executed

    def parse(self, response):
        self.driver.get('https://www.example.org/abc')
        while True:
            try:
                # locate the JS-driven pagination button in the live DOM
                next = self.driver.find_element_by_xpath('//*[@id="BTN_NEXT"]')
                url = 'http://www.example.org/abcd'
                yield Request(url, callback=self.parse2)
                next.click()
            except:
                # button gone: no more pages
                break
        self.driver.close()

    def parse2(self, response):
        print 'you are here!'
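One caveat with this sketch: if an exception escapes parse(), driver.close() is never reached and the Firefox process leaks. A common variant is to override the spider's closed() hook (called by Scrapy when the spider finishes, whatever the reason) and call self.driver.quit() there instead.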
4. Simulated login with FormRequest (not used here)
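Although this post never uses it, FormRequest is Scrapy's built-in way to submit login forms. A minimal sketch, assuming a hypothetical login page and field names:

# sketch of a login flow with FormRequest (URL and field names are made up)
import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['http://www.example.com/users/login']  # assumed login page

    def parse(self, response):
        # from_response() copies the form's hidden fields (e.g. CSRF token)
        return FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},  # assumed field names
            callback=self.after_login,
        )

    def after_login(self, response):
        if 'authentication failed' in response.body:  # assumed failure marker
            self.logger.error('login failed')
            return
        # logged in: session cookies are carried by subsequent requests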
5. Crawling JD.com product comments
start_urls = ["http://item.jd.com/1217499.html", ]
# It seems that when crawling the comment URLs directly through the AJAX interface,
# you first have to take a stroll through the product page above? Perhaps because a
# Referer header is expected: Referer: http://item.jd.com/1217499.html

unicode(response.body.decode(response.encoding)).encode('utf-8')  # the usual one-liner for converting the page's encoding to UTF-8
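If the Referer guess is right, the detour through the product page can be skipped by setting the header explicitly. A sketch under that assumption (the comment endpoint below is a placeholder, not JD's real API):

# sketch: hit an AJAX comment endpoint with an explicit Referer header
import scrapy

class JDCommentSpider(scrapy.Spider):
    name = 'jd_comments'

    def start_requests(self):
        comment_url = 'http://example.com/comments?productId=1217499'  # placeholder URL
        yield scrapy.Request(
            comment_url,
            headers={'Referer': 'http://item.jd.com/1217499.html'},
            callback=self.parse_comments,
        )

    def parse_comments(self, response):
        # normalise to UTF-8 with the same one-liner as above
        body_utf8 = response.body.decode(response.encoding).encode('utf-8')
        self.logger.info('got %d bytes of comment data', len(body_utf8))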
To be continued
References:
passing selenium response url to scrapy
Downloader Middleware (下载器中间件)
Scrapy Tutorial
Scrapy at a Glance (初窥Scrapy)
On Stack Overflow, on parsing AJAX with Scrapy:
Can scrapy be used to scrape dynamic content from websites that are using AJAX?
Scrapy follow pagination AJAX Request - POST
using scrapy to scrap asp.net website with javascript buttons and ajax requests