Scrapy Getting Started Example

Documentation: https://doc.scrapy.org/en/latest/intro/tutorial.html

1. Create a project:

C:\Users\Gunner>scrapy startproject tutorial

This automatically creates the project files in the current directory.
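
The generated layout looks roughly like this (exact files may vary slightly between Scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/          # directory where your spiders live
            __init__.py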

2. Define a spider

Add a quotes_spider.py file under the tutorial/spiders directory.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Where:

- name defines the spider's name;
- the start_requests() method defines the URLs to start crawling from and sets parse as the callback;
- the parse() method handles the response for each crawled page.

3. Run the spider (from the project's top-level directory):

scrapy crawl quotes
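
As an optional aside (not part of the original tutorial), the spider can also be launched from a plain Python script via Scrapy's CrawlerProcess; a minimal sketch, assuming it is run from the project root so that the tutorial.spiders.quotes_spider module is importable:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from tutorial.spiders.quotes_spider import QuotesSpider

# Load the project's settings.py and run the spider defined above.
process = CrawlerProcess(get_project_settings())
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl finishes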

4. Extract data:

The best way to learn data extraction is to use the Scrapy shell, for example:

scrapy shell "http://quotes.toscrape.com/page/1/"

2017-05-30 16:46:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0412DD10>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x0412DCD0>
[s]   spider     <DefaultSpider 'default' at 0x43de570>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

You can use the response.css() method to select different elements:

>>> response.css('title')
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]
>>> response.css('title').extract()
[u'<title>Quotes to Scrape</title>']
>>> response.css('title').extract_first()
u'<title>Quotes to Scrape</title>'
>>> response.css('title')[0].extract()
u'<title>Quotes to Scrape</title>'
>>>
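
To get only the text inside the element rather than the whole tag, append the ::text pseudo-element (the same value is used by the re() examples below):

>>> response.css('title::text').extract_first()
u'Quotes to Scrape'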

In addition, the re() method extracts data with regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
[u'Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
[u'Quotes']
>>>

To view the response in a browser:

>>> view(response)
True

5. Use XPath

Besides CSS, Scrapy also supports XPath for locating and extracting data. In fact, CSS selectors are converted to XPath internally.

>>> response.xpath('//title')
[<Selector xpath='//title' data=u'<title>Quotes to Scrape</title>'>]
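
For comparison with the CSS queries above, extracting just the title text with XPath looks like this (the value mirrors the CSS examples on the same page):

>>> response.xpath('//title/text()').extract_first()
u'Quotes to Scrape'

A predicate such as //div[@class="quote"] selects the same elements as the div.quote CSS selector used in the next step.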

6. Use yield to extract data, as in quotes1.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes1"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Running scrapy crawl quotes1 prints the extracted items: text, author, and tags.

7. Save the data:

scrapy crawl quotes1 -o quotes.json

Or save it in the streaming JSON Lines format, which stays valid even when records are appended across runs:

scrapy crawl quotes1 -o quotes.jl
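
As an illustration (not part of the tutorial), the resulting quotes.jl file can be read back with a few lines of Python, one JSON object per line:

import json

quotes = []
with open('quotes.jl') as f:
    for line in f:
        # Each line is a complete JSON object: {"text": ..., "author": ..., "tags": [...]}
        quotes.append(json.loads(line))

print(len(quotes))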

8. Get the link to the next page:

response.css('li.next a::attr(href)').extract_first()

::attr() is used to get an attribute of the selected tag.
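
The same ::attr() pseudo-element works for any attribute; for instance, the hrefs of the tag links inside the quote blocks could be collected like this (an illustrative query, output not shown):

>>> response.css('div.tags a.tag::attr(href)').extract()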

9. Follow the next-page link, as in quotes2.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

You can also use response.follow(next_page, callback=self.parse), which accepts relative URLs directly, as shown in the sketch below.
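
A minimal sketch of the same parse() method rewritten with response.follow, so the urljoin() call is no longer needed (everything else is unchanged from quotes2.py):

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        # response.follow() resolves a relative href against the current URL itself.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)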
