Detailed Usage of the Scrapy Crawler Framework

Source: Internet | Editor: 程序博客网 | Time: 2024/06/11 05:16

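Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing.
Scrapy's appeal is that it is a framework, so anyone can adapt it to their own needs. It also provides base classes for several kinds of spiders, such as BaseSpider (named scrapy.Spider in current releases) and the sitemap spider, and newer versions add support for crawling Web 2.0 sites.
For installing Scrapy, see the blog post at http://blog.csdn.net/qq_29186489/article/details/78736945; installation is not covered here.
This article walks through scraping a single website to demonstrate how the Scrapy framework is used.
Target site: Quotes to Scrape, at http://quotes.toscrape.com/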

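Workflow

Fetch the first page
Request the URL of the first page and get its source code for analysis.
Extract the content and the link to the next page
Parse the source code, extract the content on the page, and get the next-page link for further crawling.
Crawl page by page
Request the next page, parse its content, and request the link to the page after it.
Save the results
Save the scraped data to a file in a specific format, or store it in a database.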
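
The four steps above can be sketched without Scrapy as a plain loop. This is only an illustration under stated assumptions: FAKE_SITE, crawl, and the page contents are hypothetical stand-ins for real HTTP responses, and Scrapy itself schedules requests asynchronously rather than in a simple loop.

```python
import json

# Hypothetical in-memory "site": each URL maps to its quotes and next-page link.
FAKE_SITE = {
    "/": {"quotes": ["quote A", "quote B"], "next": "/page/2/"},
    "/page/2/": {"quotes": ["quote C"], "next": None},
}


def crawl(start_url):
    """Yield every quote, following next-page links until there are none."""
    url = start_url
    while url is not None:
        page = FAKE_SITE[url]          # step 1: "request" the page
        for quote in page["quotes"]:   # step 2: extract the content
            yield quote
        url = page["next"]             # step 3: move on to the next page


# Step 4: save the results in a specific format (JSON here).
results = list(crawl("/"))
serialized = json.dumps(results)
print(serialized)
```

The loop ends when a page has no next link, which is the same stopping condition a real spider needs on the last page of the site.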

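Code and Commands

Commands
1: Create a Scrapy project named quotetutorial
scrapy startproject quotetutorial
2: Change into the generated project directory
cd quotetutorial
3: Generate a spider named quotes that crawls quotes.toscrape.com
scrapy genspider quotes quotes.toscrape.com
4: Run the spider with the command scrapy crawl quotes (the original post showed a screenshot of the output here).
Code
1: Define the item: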

import scrapy


class QuotetutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

2: Write the spider, which processes the returned page and iterates over the quotes on it

# -*- coding: utf-8 -*-
import scrapy

from quotetutorial.items import QuotetutorialItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    # Process the HTML of each page returned for a request
    def parse(self, response):
        quotes = response.css(".quote")
        for quote in quotes:
            item = QuotetutorialItem()
            item["text"] = quote.css(".text::text").extract_first()
            item["author"] = quote.css(".author::text").extract_first()
            item["tags"] = quote.css(".tags .tag::text").extract()
            yield item

        # Follow the "next" link; the last page has none, so stop there
        next_page = response.css(".pager .next a::attr(href)").extract_first()
        if next_page is not None:
            url = response.urljoin(next_page)
            yield scrapy.Request(url=url, callback=self.parse)
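
The next-page href on this site is a relative path, which is why the spider passes it through response.urljoin before building a request. Below is a minimal sketch of that resolution step, using the standard library's urllib.parse.urljoin as a stand-in for response.urljoin. The two href values are assumed examples of what the next link looks like.

```python
from urllib.parse import urljoin

# URL of the page currently being parsed (the spider's start URL).
page_url = "http://quotes.toscrape.com/"

# Assumed example of the relative href found in the "next" link.
next_href = "/page/2/"

absolute = urljoin(page_url, next_href)
print(absolute)  # http://quotes.toscrape.com/page/2/

# From page 2, the same call resolves the following page.
following = urljoin(absolute, "/page/3/")
print(following)  # http://quotes.toscrape.com/page/3/
```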

3: Implement the pipelines, which process each item the spider returns

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.exceptions import DropItem


class QuotetutorialPipeline(object):
    def __init__(self):
        # Maximum number of characters kept from each quote
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item["text"]) > self.limit:
                item["text"] = item["text"][0:self.limit].rstrip() + "..."
            return item
        else:
            # DropItem must be raised, not returned, for the item to be discarded
            raise DropItem("Missing Text")


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DB")
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        # Use the item class name as the collection name
        name = item.__class__.__name__
        # insert_one replaces Collection.insert, which was removed in PyMongo 4
        self.db[name].insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()
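
To see what the truncation rule in QuotetutorialPipeline does to a quote, the same logic can be tried outside Scrapy. This is a sketch: truncate is a hypothetical helper that mirrors the slicing in process_item and is not part of the project.

```python
def truncate(text, limit=50):
    """Cut text to `limit` characters and append "..." if it was longer."""
    if len(text) > limit:
        # rstrip() avoids a dangling space before the ellipsis
        return text[0:limit].rstrip() + "..."
    return text


short = truncate("Be yourself.")
long_ = truncate("x" * 49 + " " + "y" * 10)
print(short)
print(len(long_))
```

Text at or under the limit passes through unchanged. Longer text is cut at the limit, any trailing whitespace left by the cut is stripped, and the ellipsis is appended, so the stored string can be up to three characters longer than the limit.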

4: Enable the pipelines in the settings file

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'quotetutorial.pipelines.QuotetutorialPipeline': 300,
    'quotetutorial.pipelines.MongoPipeline': 301
}
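
Two details of this configuration are worth spelling out. The numbers decide the order: Scrapy runs pipelines from the lowest value to the highest, so items are truncated before they reach MongoDB. MongoPipeline also reads MONGO_URI and MONGO_DB from the settings, and the ITEM_PIPELINES entry alone does not define them. Below is a sketch of both points. The connection string and database name are assumed values for a local MongoDB instance, not taken from the original project.

```python
# Assumed values for a local MongoDB instance; adjust to your setup.
MONGO_URI = "mongodb://localhost:27017"
MONGO_DB = "quotetutorial"

ITEM_PIPELINES = {
    'quotetutorial.pipelines.QuotetutorialPipeline': 300,
    'quotetutorial.pipelines.MongoPipeline': 301,
}

# Illustration only: list the pipelines in the order their priorities imply
# (lower numbers run first).
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```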

The complete code can be downloaded from: https://gitee.com/TianYaBenXiong/scrapy_code/tree/master
