About the Scrapy crawler framework

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 19:50

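I. Choose a website

Suppose we want to extract the url, name, description and size of every file added today on the Mininova site.

The URL is http://www.mininova.org/today

II. Define the data

Define the data to be scraped using Scrapy Items.

Example (a BT file, i.e. a BitTorrent file):

[Python]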

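III. Write the spider

1. Look at the source code of the start URL.

2. Find the pattern in the URLs (for example, http://www.mininova.org/tor/ followed by a number; the regular expression "/tor/\d+" can be used to extract the URLs of all files).

3. Build XPath expressions to select the data we need: name, description and size.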
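The URL pattern from step 2 can be tried out with Python's standard re module before it goes into a spider. This is only a sketch: the hrefs list below is invented to stand in for the links found on the listing page.

```python
import re

# Pattern from step 2: "/tor/" followed by one or more digits
TORRENT_LINK = re.compile(r'/tor/\d+')

# Hypothetical hrefs standing in for links found on the listing page
hrefs = ['/tor/2657665', '/cat/4', '/sub/35', '/tor/2657700', '/today']

# Keep only the links that point at a torrent page
torrent_paths = [h for h in hrefs if TORRENT_LINK.search(h)]
torrent_urls = ['http://www.mininova.org' + p for p in torrent_paths]
print(torrent_urls)
```

Category links such as /cat/4 are filtered out because they do not contain /tor/ followed by digits.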

[HTML source]

<h1>Darwin - The Evolution Of An Exhibition</h1>
<h2>Description:</h2>
<div id="description">
Short documentary made for Plymouth City Museum and Art Gallery regarding the setup of an exhibit about Charles Darwin in conjunction with the 200th anniversary of his birth.
...
<div id="specifications">
<p>
<strong>Category:</strong>
<a href="/cat/4">Movies</a> > <a href="/sub/35">Documentary</a>
</p>
<p>
<strong>Total size:</strong>
150.62 megabyte</p>

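From the code above, we can see that the name is inside the <h1> tag.

Its XPath expression is: //h1/text()

The description is in the div tag with id="description".

Its XPath expression is: //div[@id='description']

The size is in the second p tag of the div with id="specifications".

Its XPath expression is: //div[@id='specifications']/p[2]/text()[2]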
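To check the three selections without running a crawl, the sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath. It is a stand-in for Scrapy's selectors, not the same engine. PAGE is an assumed, well-formed reduction of the fragment shown above (the real page is larger and may not parse as XML), and because ElementTree has no text() step, the size is read from the .tail of the <strong> element instead.

```python
import xml.etree.ElementTree as ET

# Assumed, well-formed reduction of the page fragment shown above
PAGE = """
<html><body>
<h1>Darwin - The Evolution Of An Exhibition</h1>
<div id="description">Short documentary made for Plymouth City Museum and Art Gallery.</div>
<div id="specifications">
<p><strong>Category:</strong> Movies</p>
<p><strong>Total size:</strong>
150.62 megabyte</p>
</div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Counterpart of //h1/text()
name = root.find(".//h1").text
# Counterpart of //div[@id='description']
description = root.find(".//div[@id='description']").text
# Counterpart of //div[@id='specifications']/p[2]/text()[2]:
# the second text node of the paragraph is the text that follows <strong>
size = root.find(".//div[@id='specifications']/p[2]/strong").tail.strip()

print(name)
print(description)
print(size)
```

Inside a Scrapy callback the original expressions can be passed unchanged to response.xpath(), which implements full XPath 1.0.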

Finally, the spider code is as follows (Python):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class TorrentItem(scrapy.Item):
    # Item with the four fields to extract, defined here so the file runs on its own
    url = scrapy.Field()
    name = scrapy.Field()
    description = scrapy.Field()
    size = scrapy.Field()


class MininovaSpider(CrawlSpider):
    name = 'mininova'
    allowed_domains = ['mininova.org']
    start_urls = ['http://www.mininova.org/today']
    # Follow every link matching /tor/<digits> and pass the page to parse_torrent
    rules = [Rule(LinkExtractor(allow=[r'/tor/\d+']), callback='parse_torrent')]

    def parse_torrent(self, response):
        torrent = TorrentItem()
        torrent['url'] = response.url
        torrent['name'] = response.xpath("//h1/text()").extract()
        torrent['description'] = response.xpath("//div[@id='description']").extract()
        torrent['size'] = response.xpath("//div[@id='specifications']/p[2]/text()[2]").extract()
        return torrent

IV. Run the spider to extract the data

Save the scraped data in JSON format to the file scraped_data.json:

scrapy crawl mininova -o scraped_data.json

This uses feed exports to generate the JSON file.

[Scrapy ships with feed exports and supports multiple serialization formats and storage backends.]
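The JSON feed is a single array with one object per scraped item, so it can be read back with the standard json module. The sketch below writes a small made-up feed to a temporary directory to stay self-contained; sample_feed and load_items are illustrative names, not part of Scrapy.

```python
import json
import os
import tempfile

# Hypothetical feed contents standing in for a real scraped_data.json
sample_feed = [
    {"url": "http://www.mininova.org/tor/1", "name": ["First"], "size": ["1 megabyte"]},
    {"url": "http://www.mininova.org/tor/2", "name": ["Second"], "size": ["2 megabyte"]},
]


def load_items(path):
    """Read a JSON feed file and return its items as a list of dicts."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


with tempfile.TemporaryDirectory() as tmp:
    feed_path = os.path.join(tmp, 'scraped_data.json')
    # Write the stand-in feed, then read it back the way a consumer would
    with open(feed_path, 'w', encoding='utf-8') as f:
        json.dump(sample_feed, f)
    items = load_items(feed_path)

urls = [item['url'] for item in items]
print(urls)
```

With a real crawl, load_items('scraped_data.json') would be called directly on the file written by the command above.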

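V. Review the scraped data

Selectors return lists, so every field filled with extract() is a list of strings, even when only one node matches.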
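Because each extracted field is a list, code that consumes the items usually unwraps it. A minimal sketch, assuming a helper named first (not a Scrapy function) and a made-up item dict shaped like the spider's output:

```python
def first(values, default=None):
    """Return the first string of an extracted list, stripped, or default if the list is empty."""
    return values[0].strip() if values else default


# Hypothetical item as the spider would produce it: every XPath field is a list
torrent = {
    'url': 'http://www.mininova.org/tor/1',
    'name': ['Darwin - The Evolution Of An Exhibition'],
    'size': ['\n150.62 megabyte'],
    'description': [],
}

name = first(torrent['name'])
size = first(torrent['size'])
description = first(torrent['description'], default='')
print(name, size, repr(description))
```

An empty list means the XPath matched nothing, which is why the helper takes a default instead of indexing blindly.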
