The Scrapy Crawler Framework


Scrapy Project Directory Structure

Creating a project automatically generates a project folder:

scrapy startproject firstpjt

View the file structure:

.
└── firstpjt
    ├── firstpjt              # core directory
    │   ├── __init__.py       # project initialization
    │   ├── items.py          # data container file
    │   ├── middlewares.py
    │   ├── pipelines.py      # further processing of the data in items
    │   ├── __pycache__
    │   ├── settings.py       # settings
    │   └── spiders
    │       ├── __init__.py   # spider initialization
    │       └── __pycache__
    └── scrapy.cfg            # configuration file

5 directories, 7 files

Scrapy Project Management

Options of the scrapy startproject command

Use scrapy startproject -h to view the help:

Usage
=====
  scrapy startproject <project_name> [project_dir]

Create new project

Options
=======
--help, -h              show this help message and exit

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure
  1. --logfile=FILE:
    Specifies the log file to use; FILE is the path of the log file.

scrapy startproject firstpjt --logfile="../logf.text"

This creates the log file in the parent directory of the current directory.
  2. --loglevel=LEVEL, -L LEVEL:
    Controls the level of the log messages; the default is DEBUG.

Level      Meaning
CRITICAL   The most severe errors
ERROR      Errors that must be handled immediately
WARNING    Warnings, i.e. potential errors
INFO       Informational messages
DEBUG      Debugging output, used during development

scrapy startproject firstpjt --loglevel=WARNING

Common Tool Commands

Global Commands

Global commands can be run directly, without a project. Running scrapy -h lists all global commands, and scrapy <command> -h shows the details of a specific command.

fetch

Displays the crawling process: it fetches a given URL with the Scrapy downloader and prints the result.
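For example, the following fetches a page with the Scrapy downloader while suppressing the log (the URL is only an illustration):

scrapy fetch --nolog http://www.sina.com.cn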

runspider

Runs a spider file on its own, without relying on a Scrapy project.
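For example, assuming a self-contained spider saved as myspider.py (the filename is just a placeholder):

scrapy runspider myspider.py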

settings

Views Scrapy's settings.
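For example, reading a single setting value:

scrapy settings --get BOT_NAME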

shell

Starts Scrapy's interactive shell.
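For example, open the shell on a page and try out a selector there (the URL is only an illustration):

scrapy shell "http://www.sina.com.cn"
>>> response.xpath('/html/head/title/text()').extract()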

startproject

Creates a new project.

version

Shows version information.

view

Downloads a page and opens it in the browser, showing the page as Scrapy sees it.
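For example (the URL is only an illustration):

scrapy view http://www.sina.com.cn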

Project Commands

Project commands must be run from inside a project. Running scrapy -h inside the project shows the commands available there.

bench

Tests the performance of the local hardware.

genspider

Generates a new spider file directly from a spider template.

scrapy startproject first   # create the project
cd first                    # enter the project directory
scrapy -h                   # view the commands available inside the project

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Next, use the genspider command to quickly create a spider from a template.

scrapy genspider -h   # view the details of the command
scrapy genspider -l   # list the templates currently available

# the available templates are:
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

scrapy genspider -t basic xiaoyao sina.com.cn   # create a spider from the basic template

# view the directory again
.
├── first
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── __pycache__
│   │   ├── __init__.cpython-35.pyc
│   │   └── settings.cpython-35.pyc
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   └── __init__.cpython-35.pyc
│       └── xiaoyao.py
└── scrapy.cfg

check

Runs contract checks on a spider file.
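For example, checking the xiaoyao spider created above:

scrapy check xiaoyao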

crawl

Starts a specific spider.

scrapy crawl xiaoyao

list

Lists the spiders currently available.

edit

Opens a spider file directly in an editor.

parse

Fetches the specified URL and processes and analyzes it with the corresponding spider.
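For example, a plausible invocation that parses one of the Sina pages with the xiaoyao spider and its parse callback:

scrapy parse --spider=xiaoyao -c parse http://mil.news.sina.com.cn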

Writing the Item

Open items.py; it looks like this:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class FirstItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    # fill in the fields:
    urlname = scrapy.Field()
    urladdr = scrapy.Field()

items.py is a container for the scraped data: it has field names and field values, much like a database table. XPath extracts the content of tags directly, serving roughly the same purpose as regular expressions.

>>> import scrapy
>>> class person(scrapy.Item):
...     name = scrapy.Field()
...
>>> xiaoyaogege = person(name='lixiaoyao')
>>> print(xiaoyaogege)
{'name': 'lixiaoyao'}
>>> print(xiaoyaogege['name'])
lixiaoyao
>>> print(type(xiaoyaogege))
<class '__main__.person'>

Writing the Spider

Open the spider file xiaoyao.py under spiders/. The xiaoyao spider will crawl three Sina pages and return each page's title and address:

# -*- coding: utf-8 -*-
import scrapy
from first.items import FirstItem


class XiaoyaoSpider(scrapy.Spider):
    name = 'xiaoyao'
    allowed_domains = ['sina.com.cn']
    # the military, education, and technology channels
    start_urls = ['http://mil.news.sina.com.cn',
                  'http://edu.sina.com.cn/gaokao',
                  'http://tech.sina.com.cn']

    def parse(self, response):
        item = FirstItem()  # construct an item
        item['urlname'] = response.xpath('/html/head/title/text()')
        print(item['urlname'])

Run the spider from the terminal with the crawl command:

scrapy crawl xiaoyao --nolog    # --nolog suppresses the log output
whc@whc-ThinkPad-E455:~/code/7.27/first$ scrapy crawl xiaoyao --nolog
[<Selector xpath='/html/head/title/text()' data='军事频道_最多军迷首选的军事门户_新浪网'>]
[<Selector xpath='/html/head/title/text()' data='新浪科技_新浪网'>]
[<Selector xpath='/html/head/title/text()' data='2017高考_2017高考政策_高考频道_新浪教育_新浪网'>]

By default, the URLs a spider crawls are taken from the start_urls list. If the URLs we want to crawl live in a different list, we need to override the start_requests() method; otherwise that method will only look for URLs in the default place.

    urls = ['http://www.baidu.com',
            'http://www.jd.com',
            'http://www.sina.com.cn']

    # override start_requests()
    def start_requests(self):
        for url in self.urls:
            yield self.make_requests_from_url(url)
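make_requests_from_url is a legacy helper; an equivalent sketch that yields scrapy.Request objects directly (assuming the callback is the spider's parse() method shown earlier):

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse)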

XPath

<html>
    <head>
        <title>Crawler</title>
    </head>
    <body>
        <h1>What is a crawler?</h1>
        <p>First, .......</p>
        <p>Second, .......</p>
    </body>
</html>
  1. Use / to select a tag, for example:
    /html/head/title/text()
    text() gets the text inside the tag; this extracts "Crawler".
  2. Use // to extract every occurrence of a tag, for example:
    //p
    extracts all the p tags.
  3. Get tags whose attribute has a specific value:
    //z[@x="y"]
    gets all z tags whose x attribute equals y.
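These expressions can be tried out with Scrapy's Selector. A minimal sketch against the HTML fragment above, where a class attribute is added to the first p tag purely so the attribute pattern has something to match:

from scrapy.selector import Selector

html = '''
<html>
    <head><title>Crawler</title></head>
    <body>
        <h1>What is a crawler?</h1>
        <p class="intro">First, .......</p>
        <p>Second, .......</p>
    </body>
</html>
'''

sel = Selector(text=html)
print(sel.xpath('/html/head/title/text()').extract())    # ['Crawler']
print(sel.xpath('//p/text()').extract())                  # text of every p tag
print(sel.xpath('//p[@class="intro"]/text()').extract())  # only p tags whose class is "intro"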

Passing Parameters to the Spider Class

Parameters can be passed to a Spider class via the -a option. First override the constructor __init__() and set a variable in it to receive the parameter; multiple pages can also be crawled this way by passing them in as a single parameter. The code is as follows:

    # override the constructor
    def __init__(self, myurl=None, *args, **kwargs):
        super(XiaoyaoSpider, self).__init__(*args, **kwargs)
        myurllist = myurl.split('|')
        for i in myurllist:
            print("Site to crawl: %s" % i)
        self.urls = myurllist

whc@whc-ThinkPad-E455:~/code/7.27/first$ scrapy crawl xiaoyao -a myurl="http://www.baidu.com" --nolog
Site to crawl: http://www.baidu.com

Using XMLFeedSpider to Parse XML Sources
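A minimal sketch of an XMLFeedSpider, assuming a hypothetical feed URL and tag names and reusing the FirstItem defined earlier:

from scrapy.spiders import XMLFeedSpider
from first.items import FirstItem


class MyXmlSpider(XMLFeedSpider):
    name = 'myxmlspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/feed.xml']  # hypothetical XML feed
    iterator = 'iternodes'   # the default iterator
    itertag = 'item'         # the tag to iterate over

    # called once for each <item> node in the feed
    def parse_node(self, response, node):
        item = FirstItem()
        item['urlname'] = node.xpath('title/text()').extract()
        item['urladdr'] = node.xpath('link/text()').extract()
        return item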

Learning to Use CSVFeedSpider
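Similarly, a minimal sketch of a CSVFeedSpider, assuming a hypothetical CSV URL whose columns match the FirstItem fields:

from scrapy.spiders import CSVFeedSpider
from first.items import FirstItem


class MyCsvSpider(CSVFeedSpider):
    name = 'mycsvspider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/data.csv']  # hypothetical CSV feed
    delimiter = ','                     # field separator
    headers = ['urlname', 'urladdr']    # column names of the CSV

    # called once for each row, with the row as a dict keyed by the headers
    def parse_row(self, response, row):
        item = FirstItem()
        item['urlname'] = row['urlname']
        item['urladdr'] = row['urladdr']
        return item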

Running Multiple Scrapy Spiders at Once

Create a project and, inside it, create three spiders from a template:

whc@whc-ThinkPad-E455:~/code/7.28/multispd$ scrapy genspider -t basic xiaoyao_1 sina.com.cn
Created spider 'xiaoyao_1' using template 'basic' in module:
  multispd.spiders.xiaoyao_1
whc@whc-ThinkPad-E455:~/code/7.28/multispd$ scrapy genspider -t basic xiaoyao_2 sina.com.cn
Created spider 'xiaoyao_2' using template 'basic' in module:
  multispd.spiders.xiaoyao_2
whc@whc-ThinkPad-E455:~/code/7.28/multispd$ scrapy genspider -t basic xiaoyao_3 sina.com.cn
Created spider 'xiaoyao_3' using template 'basic' in module:
  multispd.spiders.xiaoyao_3

This is implemented by modifying the source code of the crawl command, which can be found at http://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py

import os
from scrapy.commands import ScrapyCommand
from scrapy.utils.conf import arglist_to_dict
from scrapy.utils.python import without_none_values
from scrapy.exceptions import UsageError


class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return "[options] <spider>"

    def short_desc(self):
        return "Run a spider"

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)
        if opts.output:
            if opts.output == '-':
                self.settings.set('FEED_URI', 'stdout:', priority='cmdline')
            else:
                self.settings.set('FEED_URI', opts.output, priority='cmdline')
            feed_exporters = without_none_values(
                self.settings.getwithbase('FEED_EXPORTERS'))
            valid_output_formats = feed_exporters.keys()
            if not opts.output_format:
                opts.output_format = os.path.splitext(opts.output)[1].replace(".", "")
            if opts.output_format not in valid_output_formats:
                raise UsageError("Unrecognized output format '%s', set one"
                                 " using the '-t' switch or as a file extension"
                                 " from the supported list %s" % (opts.output_format,
                                                                  tuple(valid_output_formats)))
            self.settings.set('FEED_FORMAT', opts.output_format, priority='cmdline')

    # run() decides which spider gets started; to run all spiders, this is the method to modify
    def run(self, args, opts):
        if len(args) < 1:
            raise UsageError()
        elif len(args) > 1:
            raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        spname = args[0]

        self.crawler_process.crawl(spname, **opts.spargs)
        self.crawler_process.start()

Create a new mycmd directory under the project directory, and create the files mycrawl.py and __init__.py inside it.

.
├── multispd
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── mycmd
│   │   ├── __init__.py
│   │   ├── mycrawl.py
│   │   └── __pycache__
│   │       ├── __init__.cpython-35.pyc
│   │       └── mycrawl.cpython-35.pyc
│   ├── pipelines.py
│   ├── __pycache__
│   │   ├── __init__.cpython-35.pyc
│   │   └── settings.cpython-35.pyc
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       ├── __pycache__
│       │   ├── __init__.cpython-35.pyc
│       │   ├── xiaoyao_1.cpython-35.pyc
│       │   ├── xiaoyao_2.cpython-35.pyc
│       │   └── xiaoyao_3.cpython-35.pyc
│       ├── xiaoyao_1.py
│       ├── xiaoyao_2.py
│       └── xiaoyao_3.py
└── scrapy.cfg

Copy the crawl source code into mycrawl.py and modify it as follows:

    # the main changes are in this method
    def run(self, args, opts):
        # if len(args) < 1:
        #     raise UsageError()
        # elif len(args) > 1:
        #     raise UsageError("running 'scrapy crawl' with more than one spider is no longer supported")
        # spname = args[0]
        #
        # self.crawler_process.crawl(spname, **opts.spargs)
        # self.crawler_process.start()

        spd_loader_list = self.crawler_process.spider_loader.list()
        # iterate over all spiders
        for spdname in spd_loader_list or args:
            self.crawler_process.crawl(spdname, **opts.spargs)
            print("Starting spider: " + spdname)
        self.crawler_process.start()

Modify settings.py to add the command to the project:

COMMANDS_MODULE = 'multispd.mycmd'

Check with scrapy -h in the terminal that the mycrawl command now appears, then run it:

whc@whc-ThinkPad-E455:~/code/7.28/multispd$ scrapy mycrawl --nolog
Starting spider: xiaoyao_1
Starting spider: xiaoyao_2
Starting spider: xiaoyao_3

Avoiding Bans

Disabling Cookies

Some sites identify users through their cookies. Disabling local cookies prevents the site from recognizing the session.

This only requires a change in settings.py:

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# These two lines control cookies; simply uncomment the second one:

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

Setting a Download Delay

Some sites analyze how frequently we request their pages; crawling too fast gets us flagged as a bot. Setting a download delay solves this problem.

Open settings.py and make the relevant change:

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# Uncomment the last line to set the delay, in seconds (the default is 0; the example value here is 3)

Using an IP Pool

Steps:

  1. Create a downloader middleware in the project
  2. Set the IP selection rule in the downloader middleware
  3. Configure the downloader middleware and the IP pool in settings.py

Code:

First write a small crawler that scrapes proxy IPs:

import re
import urllib.request

# match an IP address and its port in the page's <td> cells
pattern = r'<td>((\d+\.)+\d+).*?(\d+)</td>'
headers = ('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64) Ap'
                         'pleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/59.0.3071.115 Safari/537.36')
opener = urllib.request.build_opener()
opener.addheaders = [headers]
urllib.request.install_opener(opener)


def getip():
    url = 'http://www.xicidaili.com/nn/1'
    html = urllib.request.urlopen(url).read()
    html = str(html)
    try:
        result = re.compile(pattern, re.S).findall(html)
        striplist = []
        for i in result:
            strip = i[0] + ':' + i[2]   # "ip:port"
            striplist.append(strip)
        return striplist
    except Exception as e:
        print(str(e))
  1. Modify settings.py:

# settings.py
from multispd import getip   # import the getip() function above (the module path must match where it is saved)

IPPOOL = []
iplist = getip()
for i in iplist:
    ipdict = {}
    ipdict['ipaddr'] = i
    IPPOOL.append(ipdict)
  2. Write the middleware file (middlewares.py):

import random

from scrapy import signals
from multispd.settings import IPPOOL
from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware


class IPPOOLS(HttpProxyMiddleware):

    def __init__(self, ip=''):
        self.ip = ip

    def process_request(self, request, spider):
        # pick a random IP from the pool and use it as the proxy for this request
        thisip = random.choice(IPPOOL)
        print("Current proxy IP: " + thisip["ipaddr"])
        request.meta['proxy'] = 'http://' + thisip["ipaddr"]
  3. Configure settings.py again:

DOWNLOADER_MIDDLEWARES = {
    # 'multispd.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 123,
    'multispd.middlewares.IPPOOLS': 125,
}

Using a User-Agent Pool
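By analogy with the IP pool, a minimal sketch, assuming a UAPOOL list in settings.py and a middleware class named UAMID (the names, user-agent strings, and priorities are illustrative; newer Scrapy versions import UserAgentMiddleware from scrapy.downloadermiddlewares.useragent instead of scrapy.contrib):

# settings.py: an illustrative pool of user-agent strings
UAPOOL = [
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.78 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/603.2.4 (KHTML, like Gecko) Version/10.1.1 Safari/603.2.4',
]

# middlewares.py
import random

from multispd.settings import UAPOOL
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware


class UAMID(UserAgentMiddleware):

    def __init__(self, ua=''):
        self.ua = ua
        super(UAMID, self).__init__()

    def process_request(self, request, spider):
        # pick a random user-agent from the pool and set it on this request
        thisua = random.choice(UAPOOL)
        print("Current User-Agent: " + thisua)
        request.headers['User-Agent'] = thisua

# settings.py: register the middleware
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': 123,
    'multispd.middlewares.UAMID': 125,
}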

Other approaches exist as well, for example distributed crawling.
