Learning Scrapy: Creating My First Project


Originally published on my Sina blog, migrated here. Permanent link: http://blog.csdn.net/zhanh1218/article/details/21460139

Original post by @The_Third_Wave. Updated from time to time; please point out any mistakes.

Follow on Sina Weibo: @The_Third_Wave

If this post helps you, please bookmark it rather than repost it, for the sake of a cleaner web. If you do repost, keep this notice and the link above.

If you have not installed Scrapy yet, see the installation post first (link in the original post).

I. Open a DOS (command prompt) window and check the scrapy command-line help:
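The help screenshot from the original post is not reproduced here. As a rough sketch, running scrapy with -h (or with no arguments) prints the usage line and the available sub-commands such as startproject, crawl, shell, fetch and version:

D:\Users\admin\workspace> scrapy -h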


Create the project with the following command:
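The command screenshot is also missing from the original. Judging from the directory tree below and the workspace path used later in this post, the project was created with something like:

D:\Users\admin\workspace> scrapy startproject tutorial_f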

Enter the project folder and you will see the following layout:
tutorial_f/
    scrapy.cfg
    tutorial_f/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
• scrapy.cfg: the project configuration file
• tutorial_f/: the project's Python module; you'll later import your code from here
• tutorial_f/items.py: the project's items file
• tutorial_f/pipelines.py: the project's pipelines file
• tutorial_f/settings.py: the project's settings file
• tutorial_f/spiders/: the directory where you'll later put your spiders

II. Defining our Item

Edit items.py:
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class TutorialFItem(Item):
    # define the fields for your item here like:
    # name = Field()
    pass


class SinaItem(Item):
    title = Field()
    link = Field()
    desc = Field()
    This may look like extra ceremony at first, but it is actually much clearer than a hand-written crawler, and defining items this way is what lets you use Scrapy's other convenient components, which rely on the declarations in items.py.

III. Our first Spider (using sina news as the example)

    Spiders are user-written classes used to scrape information from a domain (or group of domains).
    We will use the class scrapy.spider.Spider. The following three must be defined:
    name: identifies the Spider. It must be unique, that is, you can’t set the same name for different Spiders.
    start_urls: is a list of URLs where the Spider will begin to crawl from. So, the first pages downloaded will be those listed here. The subsequent URLs will be generated successively from data contained in the start URLs.
    parse() :is a method of the spider, which will be called with the downloaded Response object of each start URL. The response is passed to the method as the first and only argument.
    This method is responsible for parsing the response and extracting data. (Request objects are mentioned here; Scrapy probably handles cookies and connection pooling for us!)
    The parse() method is in charge of processing the response and returning scraped data (as Item objects) and more URLs to follow (as Request objects).
    Now let's create the first spider: a new file sina_spider.py under the tutorial_f/spiders directory. The code is given further below, after a note on a version problem I hit.
Note: following Scrapy Documentation Release 0.21.0 (Scrapy developers, January 09, 2014) gave me an error:
    from scrapy.spider import Spider fails, because I am running Scrapy 0.20.2, where the class has not yet been renamed to Spider as in 0.21.0 and is still called BaseSpider. Here is the help output:

>>> help (scrapy.spider)
Help on module scrapy.spider in scrapy:

NAME
    scrapy.spider - Base class for Scrapy spiders

FILE
    c:\python27\lib\site-packages\scrapy-0.20.2-py2.7.egg\scrapy\spider.py

DESCRIPTION
    See documentation in docs/topics/spiders.rst

CLASSES
    __builtin__.object
        ObsoleteClass
    scrapy.utils.trackref.object_ref(__builtin__.object)
        BaseSpider
    
    class BaseSpider(scrapy.utils.trackref.object_ref)
     |  Base class for scrapy spiders. All spiders must inherit from this
     |  class.
     |  
     |  Method resolution order:
     |      BaseSpider
     |      scrapy.utils.trackref.object_ref
     |      __builtin__.object
     |  
     |  Methods defined here:
     |  
     |  __init__(self, name=None, **kwargs)
     |  
     |  __repr__ = __str__(self)
     |  
     |  __str__(self)
     |  
     |  log(self, message, level=10, **kw)
     |      Log the given messages at the given log level. Always use this
     |      method to send log messages from your spider
     |  
     |  make_requests_from_url(self, url)
     |  
     |  parse(self, response)
     |  
     |  set_crawler(self, crawler)
     |  
     |  start_requests(self)
     |  
     |  ----------------------------------------------------------------------
     |  Class methods defined here:
     |  
     |  handles_request(cls, request) from __builtin__.type
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)
     |  
     |  crawler
     |  
     |  settings
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
     |  name = None
     |  
     |  ----------------------------------------------------------------------
     |  Static methods inherited from scrapy.utils.trackref.object_ref:
     |  
     |  __new__(cls, *args, **kwargs)
    
    class ObsoleteClass(__builtin__.object)
     |  Methods defined here:
     |  
     |  __getattr__(self, name)
     |  
     |  __init__(self, message)
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  __dict__
     |      dictionary for instance variables (if defined)
     |  
     |  __weakref__
     |      list of weak references to the object (if defined)

DATA
    spiders =
>>> 
The code:
#! /usr/bin/python
# -*- coding: utf-8 -*-

try:
    from scrapy.spider import Spider as Spider  # Scrapy 0.21.0 and later
except ImportError:
    from scrapy.spider import BaseSpider as Spider  # Scrapy 0.20.2; earlier versions untested


class SinaSpider(Spider):
    name = "sina"
    allowed_domains = ["sina.com.cn"]
    start_urls = ["http://www.sina.com.cn/"]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
Crawling: start the crawl
    Go into the D:\Users\admin\workspace\tutorial_f\ folder
    Run scrapy crawl sina
    And, sure enough, an encoding error appears:

Note: 1. Python is extremely sensitive to indentation, including the indentation of comments...
      2. How to fix the error:
Error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xb0 in position 1: ordinal not in range(128)
A once-and-for-all fix (see http://www.trumanliu.com/windows-scrapy-install/) is to create a sitecustomize.py in Python's Lib\site-packages folder (sitecustomize.py is a special script; Python will try to import it on startup, so any code in it will be run automatically) containing:
import sys
sys.setdefaultencoding('gb2312')  # sina pages are encoded as gb2312
I don't think this approach is a good idea: not every site is encoded as gb2312. The better fix is to put the override in my own file instead, adding:
import sys
reload(sys)  # needed because site.py removes setdefaultencoding at startup
sys.setdefaultencoding('gb2312')
That is, the spider code becomes:
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
reload(sys)
sys.setdefaultencoding("gb2312")  # Eclipse's checker flags setdefaultencoding as an error; this tripped me up for a long time!

try:
    from scrapy.spider import Spider as Spider  # Scrapy 0.21.0 and later
except ImportError:
    from scrapy.spider import BaseSpider as Spider  # Scrapy 0.20.2; earlier versions untested


class SinaSpider(Spider):
    name = "sina"
    allowed_domains = ["sina.com.cn"]
    start_urls = ["http://www.sina.com.cn/"]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
Running it now works; the console output is shown in the screenshot from the original post (omitted here).

You will now see a new file in the project directory (screenshot omitted); it contains the HTML source of the sina home page.

IV. Going deeper

1. Trying Selectors in the Shell (IPython should be installed first)
Go to the top-level project directory and run:
scrapy shell "http://www.sina.com.cn/", which gives output like the following (screenshot omitted):

    response.body shows the page source (be warned, there is a lot of it) and response.headers shows the response headers.
    From the shell banner you can see that sel is the entry point for XPath parsing; let's give it a quick try, as sketched below:
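The original screenshot of the shell session is not reproduced. As a minimal sketch (the exact results depend on whatever is on the live sina front page at the time), selector expressions look like this:

>>> sel.xpath('//title/text()').extract()        # page title, as a list of unicode strings
>>> sel.xpath('//ul/li/a/text()').extract()[:5]  # text of the first few links inside <ul><li>
>>> sel.xpath('//ul/li/a/@href').extract()[:5]   # and their href attributes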

Note: once the code is written the next step is crawling, but do not let the spider run wild. With no rate limit, Scrapy can put considerable pressure on the target server, which may get the crawler banned. A simple way to throttle is to randomize the delay between page requests, which reduces the chance of being rejected. Add two settings to the project's settings.py:

DOWNLOAD_DELAY=0.5
RANDOMIZE_DOWNLOAD_DELAY=True

DOWNLOAD_DELAY = 0.5 delays each request by 500 ms, and RANDOMIZE_DOWNLOAD_DELAY = True randomizes that delay to between 0.5 and 1.5 times DOWNLOAD_DELAY. The two settings work as a pair: RANDOMIZE_DOWNLOAD_DELAY has no effect if DOWNLOAD_DELAY is not set.

2. Extracting the data
    You can check the type of the downloaded page with type(response.body), as in the quick check below.
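For example, in the Scrapy shell (Python 2.7 with Scrapy 0.20.x, where response.body is a plain byte string):

>>> type(response.body)
<type 'str'>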
    Here is the spider code:
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
reload(sys)
sys.setdefaultencoding("gb2312")

try:
    from scrapy.spider import Spider as Spider  # Scrapy 0.21.0 and later
except ImportError:
    from scrapy.spider import BaseSpider as Spider  # Scrapy 0.20.2; earlier versions untested

from scrapy.selector import Selector


class SinaSpider(Spider):
    name = "sina"
    allowed_domains = ["sina.com.cn"]
    start_urls = ["http://www.sina.com.cn/"]

    def parse(self, response):
        # filename = response.url.split("/")[-2]
        # open(filename, 'wb').write(response.body)
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        for site in sites:
            try:
                title = site.xpath('a/text()').extract()[0]
                link = site.xpath('a/@href').extract()[0]
                desc = site.xpath('text()').extract()[0]
                print title, link, desc
            except IndexError:  # skip <li> elements without a link or text
                pass
You will see output like the screenshot (there is far too much of it; only the tail end was captured).

3. Using our item
    Item objects are custom python dicts; you can access the values of their fields (attributes of the class we defined earlier) using the standard dict syntax like:
>>> item = SinaItem()
>>> item['title'] = 'Example title'
>>> item['title']
'Example title'
Here is the code (as before, I have modified it and added brief comments):
#!/usr/bin/python
# -*- coding: utf-8 -*-

import sys
reload(sys)
sys.setdefaultencoding("gb2312")

try:
    from scrapy.spider import Spider as Spider  # Scrapy 0.21.0 and later
except ImportError:
    from scrapy.spider import BaseSpider as Spider  # Scrapy 0.20.2; earlier versions untested

from scrapy.selector import Selector  # page 15 of the docs
from tutorial_f.items import SinaItem  # page 15 of the docs


class SinaSpider(Spider):
    name = "sina"
    allowed_domains = ["sina.com.cn"]
    start_urls = ["http://www.sina.com.cn/"]

    def parse(self, response):
        # filename = response.url.split("/")[-2]
        # open(filename, 'wb').write(response.body)
        sel = Selector(response)
        sites = sel.xpath('//ul/li')
        items = []  # collect the scraped items in a list
        for site in sites:
            try:
                item = SinaItem()
                item['title'] = site.xpath('a/text()').extract()[0]
                item['link'] = site.xpath('a/@href').extract()[0]
                item['desc'] = site.xpath('text()').extract()[0]
                items.append(item)
            except IndexError:  # skip <li> elements without a link or text
                pass
        return items
The run output (screenshot omitted) shows that the scraped values are now stored in the item dicts.
4. Storing the scraped data
    The simplest way to store the scraped data is by using the Feed exports, with the following command:
    scrapy crawl sina -o items.json -t json
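As a quick sanity check (assuming the command above was run from the project directory and produced items.json), the exported feed is a JSON list of objects with title, link and desc fields, and can be read back like this:

import json

with open('items.json') as f:
    items = json.load(f)

print len(items)         # number of items scraped from the front page
print items[0]['title']  # title field of the first item (if any were scraped)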

V. Closing remarks

    Next up: Item Pipelines and the rest of the framework. That is all for this post; leave a comment if anything is unclear.
    The complete project can be downloaded here: http://share.weiyun.com/3e424a67623aef7e33b86aa59896f149
    Original work by Mr. Zhan. Please credit the source and keep this link when reposting.


