A First Look at the Scrapy Framework

Source: Internet | Published by: 趣零免费域名 | Editor: 程序博客网 | Time: 2024/05/29 14:39

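I have a bad memory and keep forgetting how to write the Scrapy code I have already written once.
So, after starting from scratch for the second time, I decided to write this post as a record.
It is aimed at complete beginners.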

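First of all, this post skips installing Python and Scrapy and goes straight to creating a new project:

scrapy startproject SCF       # SCF is the name of the new project

Apart from the spider file you write yourself, Scrapy generates everything below for you.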

Basic structure of a Scrapy project

-- SCF
   -- spiders
      -- __init__.py
      -- func_spider.py      # your own spider; the file name is up to you
   -- __init__.py
   -- items.py               # defines the item objects
   -- middlewares.py
   -- pipelines.py           # processes the scraped items
   -- settings.py            # configuration file

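items.py

Python is an object-oriented language, so the idea is simple: whatever you want to scrape is an item, and items.py is where you define the fields of those items. Each class is one type of item.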

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ScfItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    function = scrapy.Field()
    includes = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class ScfPipeline(object):
    def __init__(self):
        # Text mode ('w' rather than 'wb'), so that the str built from
        # json.dumps() can also be written on Python 3
        self.file = open('StandardC.txt', 'w')

    # process_item() does the actual work on each item;
    # here it writes the item to the file as one line of JSON
    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item

    # close_spider() runs when the spider closes, usually to release resources
    def close_spider(self, spider):
        self.file.close()
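Since process_item() writes one JSON object per line, the output can be read back line by line. Here is a minimal sketch of that, assuming the two fields defined in items.py. The sample records and the helper name load_items are made up for illustration, and an in-memory buffer stands in for StandardC.txt.

```python
import io
import json


def load_items(lines):
    """Parse JSON-lines text (one JSON object per line) into a list of dicts."""
    return [json.loads(line) for line in lines if line.strip()]


# Hypothetical records in the shape the pipeline writes.
sample = [
    {"function": "abort", "includes": "stdlib.h"},
    {"function": "printf", "includes": "stdio.h"},
]

# Write them the same way process_item() does: json.dumps(...) + "\n".
buffer = io.StringIO()
for record in sample:
    buffer.write(json.dumps(record) + "\n")

# Read them back, as one would from open('StandardC.txt').
buffer.seek(0)
items = load_items(buffer)
print(items[0]["function"], "->", items[0]["includes"])
```

With the real file, the same helper could be called as load_items(open('StandardC.txt')).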

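A pipeline has to be enabled in the settings file. The number is its priority: pipelines with smaller numbers run first (by convention the values are kept in the 0-1000 range), so an item can be passed through several pipelines in sequence.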

# settings.py

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'SCF.pipelines.ScfPipeline': 1,  # smaller numbers run first
}
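The ordering rule can be illustrated without Scrapy. The sketch below is a plain-Python stand-in, not Scrapy's internal implementation: the two stage functions and the ITEM_PIPELINES-style dict are hypothetical, and each stage receives whatever the previous stage returned.

```python
def split_includes(item):
    """Hypothetical first stage: turn the space-separated headers into a list."""
    return dict(item, includes=item["includes"].split())


def count_includes(item):
    """Hypothetical second stage: count the headers (expects a list)."""
    return dict(item, count=len(item["includes"]))


# Same shape as ITEM_PIPELINES: stage -> priority number.
pipelines = {
    count_includes: 300,
    split_includes: 100,
}


def run_pipelines(item, pipelines):
    """Pass the item through every stage, smallest number first."""
    for stage in sorted(pipelines, key=pipelines.get):
        item = stage(item)
    return item


result = run_pipelines({"function": "printf", "includes": "stdio.h stdarg.h"}, pipelines)
print(result)
```

If the two numbers were swapped, count_includes would see the raw string and count its characters instead of the headers, which is why the order matters.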

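The first spider: func_spider.py

Out of idle curiosity, I wanted to see which header files the standard C library functions belong to...
So I decided to scrape that from the man pages...
Let's begin.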

import re

import scrapy

from SCF.items import ScfItem        # import the item (model) defined earlier


class FuncSpider(scrapy.Spider):
    name = 'function'                # spider name, used when running the spider
    allowed_domains = ['man7.org']   # domains the spider is allowed to crawl
    start_urls = ['http://man7.org/linux/man-pages/dir_section_3.html']   # where the crawl starts

    # Matches "#include <xxx.h>" and captures the header name; compiled once
    include_re = re.compile(r"#include\s+<(\w+[\/\w+]*\.h)>")

    # Responses for start_urls are handled by parse() by default
    def parse(self, response):
        # Selectors, XPath and similar tools can be used to locate elements in the page
        funcNames = response.xpath("//table[1]//td[@valign='top']/a/@href").extract()
        for i in funcNames:
            # Complete the relative address into an absolute one
            url = 'http://man7.org/linux/man-pages' + i[1:]
            # Crawl the second-level page; callback names the function that handles it
            yield scrapy.Request(url=url, callback=self.parse_url)

    # Parser for the second-level pages
    def parse_url(self, response):
        item = ScfItem()            # create the item
        lines = response.xpath("//pre//text()").extract()
        res = []
        for i in lines:
            res.extend(self.include_re.findall(i))
        res = sorted(set(res))      # de-duplicate; sorted for a stable order
        item['function'] = response.url.split("/")[-1].split(".")[0]
        item['includes'] = " ".join(res)
        yield item                  # hand the item over to the pipelines

Selector and XPath syntax is not the focus of this post, so I won't go into it here.
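The string handling in the spider can be tried on its own, without Scrapy or a network connection. This is a rough sketch under a few assumptions: the sample href and synopsis text are invented (not copied from a real man page), the three helper names are hypothetical, and urljoin from the standard library is swapped in for the spider's manual string concatenation.

```python
import re
from urllib.parse import urljoin

# Same pattern as the spider, written as a raw string.
INCLUDE_RE = re.compile(r"#include\s+<(\w+[\/\w+]*\.h)>")

START_URL = "http://man7.org/linux/man-pages/dir_section_3.html"


def absolute_url(href):
    """Resolve a relative href against the start page."""
    return urljoin(START_URL, href)


def extract_includes(text):
    """Return the sorted, de-duplicated header names found in the text."""
    return sorted(set(INCLUDE_RE.findall(text)))


def function_name(url):
    """Take the last path segment and drop everything after the first dot."""
    return url.split("/")[-1].split(".")[0]


# Assumed sample data, for illustration only.
href = "./man3/open_memstream.3.html"
synopsis = """
#include <stdio.h>
#include <sys/types.h>
#include <stdio.h>
"""

url = absolute_url(href)
print(url)
print(function_name(url), extract_includes(synopsis))
```

For an href of the form ./man3/..., urljoin gives the same result as the concatenation used in the spider; inside a spider, response.urljoin(href) would do the same job.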

Running the spider

Scrapy runs spiders from the command line:

scrapy crawl function         # function is the spider's name

So how do you run the spider when you want to debug it in an IDE?
Scrapy provides a function for exactly that.

Create a new script, debug.py:

#!/usr/bin/python
from scrapy.cmdline import execute

execute()            # execute() from scrapy.cmdline runs the Scrapy command given in the script's arguments

In the IDE's debug configuration, pass crawl function as the script arguments, and you can then use the IDE's debugger.