Learning Scrapy, Part 2

Source: Internet · Editor: 程序博客网 · Date: 2024/06/06 13:04

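Continuing from the previous post: we defined a custom spider, and Scrapy calls its start_requests method, which returns Request objects (not a response). The scheduler queues those requests; once each one has been downloaded, Scrapy builds a Response object and passes it to the request's callback. The parse() method we defined afterwards overrides the method of the same name on the parent class scrapy.Spider and is the default callback, so Scrapy calls it automatically and you never need to call it yourself.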

Today's topics:

1. You can run scrapy from the shell to fetch a URL directly

    scrapy shell "http://quotes.toscrape.com/page/1/"

Inside that shell you can use CSS selectors to locate the elements you want (through the response object, which is available at this point):

>>> response.css('title')

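To pull the data out of title, append the extract() method, e.g. response.css('title').extract(). This returns a list of strings containing the whole <title> element; select 'title::text' instead to get only the text inside it.

Regular expression matching is also supported through re(), e.g. response.css('title::text').re(''), with the pattern passed as the argument.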

2. Storing the data

The simplest way is to store the data as JSON: scrapy crawl *** -o ***.json. This generates a JSON file containing the scraped data.

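Here is the catch: when you run the same command again, Scrapy appends the new data to the existing file instead of overwriting it, and two JSON arrays written back to back are no longer valid JSON. So how do you keep adding to the same file?

Use JSON Lines. Each item is written as its own line, so new records can simply be appended: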

scrapy crawl *** -o ***.jl
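
To see why appending hurts a .json file but not a .jl file, here is a small sketch that uses only the standard json module (no Scrapy involved). The helper names and the sample items are made up for illustration; the two loop iterations stand in for two crawl runs writing to the same files.

```python
import json
import os
import tempfile


def append_json_array(path, items):
    """Append a whole JSON array to the file, like a repeated JSON export."""
    with open(path, "a", encoding="utf-8") as f:
        json.dump(items, f)


def append_json_lines(path, items):
    """Append one JSON object per line (the JSON Lines format)."""
    with open(path, "a", encoding="utf-8") as f:
        for item in items:
            f.write(json.dumps(item) + "\n")


def read_json_lines(path):
    """Parse a JSON Lines file one line at a time."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def demo():
    # Hypothetical items from two separate crawl runs.
    run1 = [{"text": "first quote"}]
    run2 = [{"text": "second quote"}]
    with tempfile.TemporaryDirectory() as tmp:
        json_path = os.path.join(tmp, "quotes.json")
        jl_path = os.path.join(tmp, "quotes.jl")
        for run in (run1, run2):
            append_json_array(json_path, run)
            append_json_lines(jl_path, run)
        try:
            with open(json_path, encoding="utf-8") as f:
                json.load(f)
            json_ok = True
        except json.JSONDecodeError:
            # Two arrays back to back are not a single JSON document.
            json_ok = False
        return json_ok, read_json_lines(jl_path)
```

Calling demo() returns whether the appended .json file still parses, together with the records read back from the .jl file.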
If you want to perform more complex things with the scraped items, you can write an Item Pipeline.
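
As a rough idea of what such a pipeline looks like: it is an ordinary Python class with a process_item(item, spider) method that Scrapy calls for every scraped item, and the method returns the (possibly modified) item. The class name CleanQuotePipeline and the cleanup it performs are my own illustration, not something from the tutorial; it assumes items are dicts with a 'text' key wrapped in curly quotes.

```python
class CleanQuotePipeline:
    """Hypothetical pipeline: strips the curly quotes wrapped around 'text'."""

    def process_item(self, item, spider):
        text = item.get("text")
        if text is not None:
            # Remove leading/trailing curly double quotes and spaces.
            item["text"] = text.strip("\u201c\u201d ")
        # Returning the item hands it on to the next pipeline or the exporter.
        return item
```

To activate a pipeline you register it in the project's ITEM_PIPELINES setting, for example ITEM_PIPELINES = {"myproject.pipelines.CleanQuotePipeline": 300}, where myproject is a placeholder for your own project module.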
3. Extracting links (href)
    <ul class="pager">
        <li class="next">
            <a href="/page/2/">Next <span aria-hidden="true">&rarr;</span></a>
        </li>
    </ul>
You can use response.css('li.next a::attr(href)').extract_first(), which returns '/page/2/'.
Sample code:
    import scrapy


    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = [
            'http://quotes.toscrape.com/page/1/',
        ]

        def parse(self, response):
            for quote in response.css('div.quote'):
                yield {
                    'text': quote.css('span.text::text').extract_first(),
                    'author': quote.css('small.author::text').extract_first(),
                    'tags': quote.css('div.tags a.tag::text').extract(),
                }

            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

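In short: while parsing the first page, the spider extracts the link to the next page and then issues another request for it through Request.
Note that the extracted URL is relative, which is why response.urljoin() is used: it joins the extracted URL onto the URL of the current page to produce an absolute one.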
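
The joining itself can be tried without Scrapy. As far as I know, response.urljoin() is essentially a wrapper around the standard library's urllib.parse.urljoin with the response's own URL as the base, so this sketch uses that function directly; the variable names are mine.

```python
from urllib.parse import urljoin

# URL of the page being parsed, and the relative href extracted from it.
page_url = "http://quotes.toscrape.com/page/1/"
next_href = "/page/2/"

# A leading slash makes the href relative to the site root.
absolute = urljoin(page_url, next_href)

# An href that is already absolute is returned unchanged.
unchanged = urljoin(page_url, "http://quotes.toscrape.com/page/3/")
```

Here absolute ends up as 'http://quotes.toscrape.com/page/2/', which is the form scrapy.Request needs.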
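The documentation describes Scrapy's link-following mechanism like this:
What you see here is Scrapy's mechanism of following links: when you yield a Request in a callback method,
Scrapy will schedule that request to be sent and register a callback method to be executed when that request finishes.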
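
To make that mechanism concrete, here is a toy model of it in plain Python. It is not Scrapy's engine: the Request class, the FAKE_PAGES dict and the crawl() loop are stand-ins I made up, and "downloading" is just a dictionary lookup. It only shows the idea that a callback yields either items or new requests, and that each yielded request is queued and later handed to its own callback.

```python
from collections import deque

# Fake site: URL -> (quotes on the page, link to the next page or None).
FAKE_PAGES = {
    "/page/1/": (["quote A", "quote B"], "/page/2/"),
    "/page/2/": (["quote C"], None),
}


class Request:
    """Stand-in for scrapy.Request: a URL plus the callback for its response."""

    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse(response):
    """Callback: yield the page's items, then a request for the next page."""
    items, next_page = response
    for text in items:
        yield {"text": text}
    if next_page is not None:
        yield Request(next_page, callback=parse)


def crawl(start_url):
    """Toy scheduler loop: run queued requests until none are left."""
    queue = deque([Request(start_url, callback=parse)])
    scraped = []
    while queue:
        request = queue.popleft()
        response = FAKE_PAGES[request.url]  # pretend download
        for result in request.callback(response):
            if isinstance(result, Request):
                queue.append(result)  # schedule the follow-up request
            else:
                scraped.append(result)  # collect the scraped item
    return scraped
```

crawl("/page/1/") walks both fake pages and stops on its own when the last page yields no further request, which is the same reason the real spider stops at the last page.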
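That is all it takes to loop through the next-page links until the last page. This only works if every page has the same layout, though; if the layouts differ, you have to define the matching extraction rules yourself.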
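Finally, here is an example where the pages are laid out differently. The crawl starts on the quotes listing page.
Clicking the (about) link after an author's name opens that author's detail page.
We want to extract the author information from that page (the name, birth date and bio in the code that follows). The two pages have different layouts, so we have to define a separate parsing method. (PS: all my examples come from the official documentation; if my explanation is unclear, go and read the docs directly.)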
    import scrapy


    class AuthorSpider(scrapy.Spider):
        name = 'author'

        start_urls = ['http://quotes.toscrape.com/']

        def parse(self, response):
            # follow links to author pages
            for href in response.css('.author + a::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href),
                                     callback=self.parse_author)

            # follow pagination links
            next_page = response.css('li.next a::attr(href)').extract_first()
            if next_page is not None:
                next_page = response.urljoin(next_page)
                yield scrapy.Request(next_page, callback=self.parse)

        def parse_author(self, response):
            def extract_with_css(query):
                return response.css(query).extract_first().strip()

            yield {
                'name': extract_with_css('h3.author-title::text'),
                'birthdate': extract_with_css('.author-born-date::text'),
                'bio': extract_with_css('.author-description::text'),
            }

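Interestingly, Scrapy does not request the same page twice, which effectively avoids duplicated data. That matters here because many quotes link to the same author page.
This can be configured by the setting DUPEFILTER_CLASS.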
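
The idea behind such a duplicate filter is simple enough to sketch. The class below is a simplified illustration of my own, not Scrapy's actual filter (the default, as far as I know, is scrapy.dupefilters.RFPDupeFilter, which fingerprints requests more carefully); it just remembers a hash of the method and URL of every request it has seen.

```python
import hashlib


class SeenRequestFilter:
    """Simplified illustration of a duplicate filter (not Scrapy's code)."""

    def __init__(self):
        self.fingerprints = set()

    def fingerprint(self, method, url):
        """Reduce a request to a fixed-size hash of its method and URL."""
        key = f"{method.upper()} {url}"
        return hashlib.sha1(key.encode("utf-8")).hexdigest()

    def request_seen(self, method, url):
        """Return True if this request was seen before, else remember it."""
        fp = self.fingerprint(method, url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False
```

A scheduler would call request_seen() before queueing a request and skip the request when it returns True. In Scrapy itself, a single request can opt out of filtering by passing dont_filter=True to scrapy.Request.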

0 0