Scrapy in Practice, Part 1

What is Scrapy?

"Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data." -- the official description.
My own take: it crawls web pages and turns what it grabs into structured data. You only need to worry about your crawling logic and the extraction logic for each page; the framework handles everything else.

Installing Scrapy

# install build dependencies (CentOS/yum)
yum -y update
yum groupinstall -y development
yum install -y zlib-dev openssl-devel sqlite-devel bzip2-devel libffi-devel python-devel libxslt-devel
# install setuptools from source
cd /pkg
wget --no-check-certificate https://pypi.python.org/packages/source/s/setuptools/setuptools-1.4.2.tar.gz
tar -xvf setuptools-1.4.2.tar.gz
cd setuptools-1.4.2
python setup.py install
# install pip, then Scrapy
curl https://raw.github.com/pypa/pip/master/contrib/get-pip.py | python -
pip install scrapy

Run scrapy version:

[root@jianzhi-dev ~]# scrapy version
2015-12-15 09:04:30 [scrapy] INFO: Scrapy 1.0.3 started (bot: scrapybot)
2015-12-15 09:04:30 [scrapy] INFO: Optional features available: ssl, http11
2015-12-15 09:04:30 [scrapy] INFO: Overridden settings: {}

If you see output similar to the above, Scrapy is installed correctly.
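
You can also verify the installation from Python itself; Scrapy exposes its version as scrapy.__version__, so a quick check (assuming the package imports cleanly) is:

python -c "import scrapy; print(scrapy.__version__)"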

A quick tour of the project structure

Run scrapy startproject pn, where pn is the project name (substitute whatever fits your own project). When it finishes, a pn directory is created under the current directory with the following layout:
.
├── pn
│   ├── __init__.py
│   ├── items.py      // where data structures are defined; used to turn scraped page data into structured items
│   ├── pipelines.py  // every scraped item passes through the pipelines, e.g. to save items to MySQL (see the sketch after this tree)
│   ├── settings.py   // project settings, e.g. throttling the crawl speed
│   └── spiders       // where the crawling logic lives
│       └── __init__.py
└── scrapy.cfg
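
As a rough sketch of what such a pipeline could look like, here is a minimal example that saves each item to MySQL. The database, table, and credentials are made-up placeholders, and MySQLdb is just one possible client library:

# pipelines.py -- minimal sketch; database, table, and credentials below are hypothetical
import MySQLdb

class PnPipeline(object):
    def open_spider(self, spider):
        # one connection for the whole crawl
        self.conn = MySQLdb.connect(host='localhost', user='root',
                                    passwd='secret', db='pn', charset='utf8')

    def process_item(self, item, spider):
        # every item yielded by a spider passes through here
        cur = self.conn.cursor()
        cur.execute("INSERT INTO pages (title, url) VALUES (%s, %s)",
                    (item['title'], item['url']))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.conn.close()

For Scrapy to actually run it, the pipeline must be registered in settings.py, e.g. ITEM_PIPELINES = {'pn.pipelines.PnPipeline': 300}.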

A first simple example

This example scrapes the URLs and titles from the Baidu Baike results page for a search on 刘德华 (Andy Lau).

Define the data structure in items.py:

import scrapy

class PnItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
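
An item behaves much like a dict, which makes it easy to inspect; a quick illustrative snippet (not part of the project files):

from pn.items import PnItem

item = PnItem()
item['title'] = u'some title'
item['url'] = u'http://example.com'
print(dict(item))   # {'url': u'http://example.com', 'title': u'some title'}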

Create your own spider class, which holds the crawling logic:

cd spiders/
vim pn_spider.py

# -*- coding: UTF-8 -*-
import scrapy
from pn.items import PnItem

class PnSpider(scrapy.spiders.Spider):
    name = "pn"
    start_urls = [
        "http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8"
    ]

    def parse(self, response):
        for sel in response.xpath("//dl[@class='search-list']/dd"):
            item = PnItem()
            item['title'] = sel.xpath('a/text()').extract()[0]
            item['url'] = sel.xpath('a/@href').extract()[0]
            yield item

After saving, run scrapy crawl pn to start the crawl; it produces output like the following:

2015-12-15 10:04:23 [scrapy] INFO: Scrapy 1.0.3 started (bot: pn)
2015-12-15 10:04:23 [scrapy] INFO: Optional features available: ssl, http11
2015-12-15 10:04:23 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'pn.spiders', 'SPIDER_MODULES': ['pn.spiders'], 'BOT_NAME': 'pn'}
2015-12-15 10:04:49 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-12-15 10:04:49 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-12-15 10:04:49 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-12-15 10:04:49 [scrapy] INFO: Enabled item pipelines:
2015-12-15 10:04:49 [scrapy] INFO: Spider opened
2015-12-15 10:04:49 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-12-15 10:04:49 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-12-15 10:04:55 [scrapy] DEBUG: Crawled (200) <GET http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8> (referer: None)
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'_\u767e\u5ea6\u767e\u79d1', 'url': u'http://baike.baidu.com/subview/1758/18233157.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u6768\u4e3d\u5a1f(', 'url': u'http://baike.baidu.com/subview/872134/8550376.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u56db\u5927\u5929\u738b(\u9999\u6e2f\u56db\u5927\u5929\u738b)_\u767e\u5ea6\u767e\u79d1', 'url': u'http://baike.baidu.com/subview/20129/5747579.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u6f14\u5531\u4f1a99_\u767e\u5ea6\u767e\u79d1', 'url': u'http://baike.baidu.com/view/757747.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'ALways', 'url': u'http://baike.baidu.com/view/10726576.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u5144\u5f1f\u4e4b\u751f\u6b7b\u540c\u76df_\u767e\u5ea6\u767e\u79d1', 'url': u'http://baike.baidu.com/view/1182768.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u6797\u5bb6\u680b_\u767e\u5ea6\u767e\u79d1', 'url': u'http://baike.baidu.com/view/19592.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u6768\u4e3d\u5a1f\u4e8b\u4ef6_\u767e\u5ea6\u767e\u79d1', 'url': u'http://baike.baidu.com/view/1047445.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u5200\u5251\u7b11(1994\u5e74\u9ec4\u6cf0\u6765\u6267\u5bfc\u7535\u5f71)_\u767e\u5ea6\u767e\u79d1', 'url': u'http://baike.baidu.com/subview/1064013/6839067.htm'}
2015-12-15 10:04:55 [scrapy] DEBUG: Scraped from <200 http://baike.baidu.com/searchword/?pic=1&fr=tieba&word=%E5%88%98%E5%BE%B7%E5%8D%8E&ie=utf-8>
{'title': u'\u81f3\u5c0a\u65e0\u4e0a\u2161\u4e4b\u6c38\u9738\u5929\u4e0b_\u767e\u5ea6\u767e\u79d1', 'url': u'http://baike.baidu.com/view/3908825.htm'}
2015-12-15 10:04:55 [scrapy] INFO: Closing spider (finished)
2015-12-15 10:04:55 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 281,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 6822,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 12, 15, 2, 4, 55, 242394),
 'item_scraped_count': 10,
 'log_count/DEBUG': 12,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2015, 12, 15, 2, 4, 49, 176917)}
2015-12-15 10:04:55 [scrapy] INFO: Spider closed (finished)
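
Rather than fishing items out of the log, you can also let Scrapy write them to a file with its built-in feed exports, for example:

scrapy crawl pn -o items.json

which dumps every scraped item into items.json as JSON.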

Walking through the code

class PnSpider(scrapy.spiders.Spider):
Defines the spider class.

name = "pn"
Gives the spider its name; it must match the name used when running scrapy crawl pn.

start_urls
The initial URLs the spider starts crawling from.

def parse(self, response):
The callback invoked after a page has been crawled; this is where the response is processed. Besides items, it can also yield follow-up requests, as sketched below.
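
A minimal sketch of following links from parse (the parse_detail callback and its XPath are hypothetical additions, not part of the example above):

def parse(self, response):
    for href in response.xpath("//dl[@class='search-list']/dd/a/@href").extract():
        # queue each result link for a second, hypothetical callback
        yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)

def parse_detail(self, response):
    # process each detail page here
    pass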

"//dl[@class='search-list']/dd"
This is XPath syntax. XPath is, in short, a query language for documents; it is how you pinpoint the data you want on a page. It is worth reading up on by itself; a standalone example follows.
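
You can try XPath expressions outside of a crawl by using Scrapy's Selector directly (an illustrative snippet with made-up HTML):

from scrapy.selector import Selector

html = "<dl class='search-list'><dd><a href='http://example.com/1.htm'>First</a></dd></dl>"
sel = Selector(text=html)
print(sel.xpath("//dl[@class='search-list']/dd/a/text()").extract())  # [u'First']
print(sel.xpath("//dl[@class='search-list']/dd/a/@href").extract())   # [u'http://example.com/1.htm']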

item['title'] = sel.xpath('a/text()').extract()[0]
item['url'] = sel.xpath('a/@href').extract()[0]
Saves the extracted data into the item structure you defined earlier.
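
One caveat: extract() returns a list, so indexing with [0] raises an IndexError whenever the XPath matches nothing. Scrapy 1.0 also provides extract_first(), which returns None for an empty match:

# safer than extract()[0] when a node may be missing
item['title'] = sel.xpath('a/text()').extract_first()
item['url'] = sel.xpath('a/@href').extract_first()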
