Notes 1 - Scrapy

Source: Internet · Editor: 程序博客网 · Date: 2024/06/10 05:40
  1. Creating a new Scrapy project
  2. Writing a spider to crawl a site and extract data
  3. Exporting the scraped data using the command line
  4. Changing spider to recursively follow links
  5. Using spider arguments
1. Create a Scrapy project
scrapy startproject tutorial
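Running this command generates a project skeleton. In recent Scrapy versions the generated layout looks roughly like the following (shown here from memory, assuming the project name tutorial):

```shell
scrapy startproject tutorial
# tutorial/
#     scrapy.cfg            # deploy/configuration file
#     tutorial/             # the project's Python module
#         __init__.py
#         items.py          # item definitions
#         middlewares.py    # spider/downloader middlewares
#         pipelines.py      # item pipelines
#         settings.py       # project settings
#         spiders/          # put spider code (e.g. quotes_spider.py) here
#             __init__.py
```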
2. Write a spider to crawl a site and extract data
quotes_spider.py
3. Export the scraped data from the command line
scrapy crawl quotes
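On its own, `scrapy crawl quotes` only runs the spider; to actually export the yielded items, add the `-o` feed option and Scrapy picks the format from the file extension. The output filenames below are just examples:

```shell
scrapy crawl quotes -o quotes.json   # one JSON array with all items
scrapy crawl quotes -o quotes.jl     # JSON Lines: one item per line, safe to append to
```

For repeated runs, JSON Lines is the more robust choice, since appending to a `.json` file would produce invalid JSON.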
4. Change the spider to recursively follow links: after yielding the items on a page, the spider extracts the next-page link and yields a new request for it, so crawling continues until no next page exists.
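The `response.follow(next_page, ...)` call in the spider code below accepts a relative link and resolves it against the current page's URL. The resolution behaves like the standard library's `urljoin`, which this small stdlib-only illustration uses (the URLs are the tutorial site's):

```python
from urllib.parse import urljoin

# 'li.next a::attr(href)' on quotes.toscrape.com yields a relative
# link such as '/page/2/'; joining it with the current page URL
# gives the absolute URL of the next page to request.
current = 'http://quotes.toscrape.com/page/1/'
next_href = '/page/2/'
print(urljoin(current, next_href))  # → http://quotes.toscrape.com/page/2/
```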
5. Use spider arguments: values passed with -a on the scrapy crawl command line become attributes of the spider instance.
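Spider arguments are passed with `-a key=value`; the spider code below reads the `tag` attribute via `getattr` and, if present, crawls only that tag's page. `humor` here is just an example value:

```shell
# Sets self.tag = 'humor' on the spider, so it starts from
# http://quotes.toscrape.com/tag/humor
scrapy crawl quotes -a tag=humor
```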
To experiment with selectors interactively before putting them in the spider, open the Scrapy shell:
scrapy shell 'http://quotes.toscrape.com/page/1/'
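Inside the shell, the downloaded page is available as `response`, so the CSS expressions used by the spider can be tested directly. A sketch of a typical session:

```shell
# Quote the URL so the shell does not interpret special characters in it.
scrapy shell 'http://quotes.toscrape.com/page/1/'
# At the interactive prompt:
#   >>> response.css('span.text::text').extract_first()        # text of the first quote
#   >>> response.css('small.author::text').extract_first()     # its author
#   >>> response.css('li.next a::attr(href)').extract_first()  # e.g. '/page/2/'
```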
Spider code:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        # An optional 'tag' spider argument (passed with -a tag=...)
        # narrows the crawl to a single tag's pages.
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }
        # Follow the next-page link (relative URL) until there is none.
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)