Scrapy Getting Started Example

Documentation: https://doc.scrapy.org/en/latest/intro/tutorial.html

1. Create a project:

C:\Users\Gunner>scrapy startproject tutorial

This automatically creates the project files in the current directory.
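
The generated layout looks roughly like this (exact files may vary slightly between Scrapy versions):

tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/          # directory where your spiders live
            __init__.py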

2. Define a spider

Add a quotes_spider.py file under the tutorial/spiders directory.

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

Where:

- name defines the spider's name;
- the start_requests() method defines the URLs to start crawling from and sets parse as the callback;
- the parse() method handles the response for each crawled page.

3. Run the spider (from the project's top-level directory):

scrapy crawl quotes
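
As an optional aside (not part of the original tutorial), the spider can also be launched from a plain Python script via Scrapy's CrawlerProcess; a minimal sketch, assuming it is run from the project root so that the tutorial.spiders.quotes_spider module is importable:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from tutorial.spiders.quotes_spider import QuotesSpider

# Load the project's settings.py and run the spider defined above.
process = CrawlerProcess(get_project_settings())
process.crawl(QuotesSpider)
process.start()  # blocks until the crawl finishes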

4. Extract data:

The best way to learn data extraction is to use the Scrapy shell, for example:

scrapy shell "http://quotes.toscrape.com/page/1/"

2017-05-30 16:46:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0412DD10>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com/page/1/>
[s]   response   <200 http://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x0412DCD0>
[s]   spider     <DefaultSpider 'default' at 0x43de570>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>>

You can use the response.css() method to select different elements:

>>> response.css('title')
[<Selector xpath=u'descendant-or-self::title' data=u'<title>Quotes to Scrape</title>'>]
>>> response.css('title').extract()
[u'<title>Quotes to Scrape</title>']
>>> response.css('title').extract_first()
u'<title>Quotes to Scrape</title>'
>>> response.css('title')[0].extract()
u'<title>Quotes to Scrape</title>'
>>>
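
To get only the text inside the element rather than the whole tag, append the ::text pseudo-element (the same value is used by the re() examples below):

>>> response.css('title::text').extract_first()
u'Quotes to Scrape'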

In addition, the re() method extracts data with regular expressions:

>>> response.css('title::text').re(r'Quotes.*')
[u'Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
[u'Quotes']
>>>

To view the response in a browser:

>>> view(response)
True

5. Use XPath

Besides CSS, Scrapy also supports XPath for locating and extracting data. In fact, CSS selectors are converted to XPath internally.

>>> response.xpath('//title')
[<Selector xpath='//title' data=u'<title>Quotes to Scrape</title>'>]
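
For comparison with the CSS queries above, extracting just the title text with XPath looks like this (the value mirrors the CSS examples on the same page):

>>> response.xpath('//title/text()').extract_first()
u'Quotes to Scrape'

A predicate such as //div[@class="quote"] selects the same elements as the div.quote CSS selector used in the next step.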

6. Use yield to extract data, as in quotes1.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes1"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Running scrapy crawl quotes1 prints the extracted items: text, author, and tags.

7. Save the data:

scrapy crawl quotes1 -o quotes.json

Or save it in the streaming JSON Lines format, which stays valid even when records are appended across runs:

scrapy crawl quotes1 -o quotes.jl
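
As an illustration (not part of the tutorial), the resulting quotes.jl file can be read back with a few lines of Python, one JSON object per line:

import json

quotes = []
with open('quotes.jl') as f:
    for line in f:
        # Each line is a complete JSON object: {"text": ..., "author": ..., "tags": [...]}
        quotes.append(json.loads(line))

print(len(quotes))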

8. Get the link to the next page:

response.css('li.next a::attr(href)').extract_first()

::attr() is used to get an attribute of the selected tag.
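
The same ::attr() pseudo-element works for any attribute; for instance, the hrefs of the tag links inside the quote blocks could be collected like this (an illustrative query, output not shown):

>>> response.css('div.tags a.tag::attr(href)').extract()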

9. Follow the next-page link, as in quotes2.py:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

You can also use response.follow(next_page, callback=self.parse), which accepts relative URLs directly, as shown in the sketch below.
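
A minimal sketch of the same parse() method rewritten with response.follow, so the urljoin() call is no longer needed (everything else is unchanged from quotes2.py):

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

        # response.follow() resolves a relative href against the current URL itself.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)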
