Learning Scrapy 1

  • ipython is an enhanced Python interactive shell with syntax highlighting, auto-completion, built-in helper functions, and more.
    pip install ipython
  • XPath indices start at 1, not 0, …[1]
  • Limit the number of items scraped: scrapy crawl manual -s CLOSESPIDER_ITEMCOUNT=90

UR2IM process

The basic crawling workflow: UR2IM (URL, Request, Response, Items, More URLs)

  • URL
    scrapy shell is Scrapy's interactive console, used to quickly test Scrapy against a page.
    Start it with scrapy shell 'http://scrapy.org'
    It returns a set of objects that can be worked with through ipython:

$ scrapy shell 'http://scrapy.org' --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x101ade4a8>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 https://scrapy.org/>
[s]   settings   <scrapy.settings.Settings object at 0x1028b09e8>
[s]   spider     <DefaultSpider 'default' at 0x102b531d0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
  • Request and response
    Work with the response object.
    Print the first 50 characters of the response body:
    >>> response.body[:50]

  • The item
    Extract the data from the response and put it into the corresponding item, using XPath.

A typical page contains a logo, search boxes, buttons, and other elements.
What we actually need is the specific information, such as names and phone numbers.
Locate those elements and extract them (copy the XPath from the browser, then simplify it).

Use
response.xpath('//h1/text()').extract()
to extract the text of all h1 elements on the page.

Using //h1/text() extracts only the text content.
Here we assume there is a single h1 element; a page should ideally have exactly one h1, for SEO (Search Engine Optimization) reasons.

If the page element is <h1 itemprop="name" class="space-mbs">...</h1>,
it can also be extracted with //*[@itemprop="name"][1]/text().
XPath indices start at 1, not 0.
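As a quick illustration, both expressions can be tried directly on a scrapy Selector. This is a minimal sketch; the inline HTML fragment is invented for demonstration and is not from the book's site.

from scrapy import Selector

# Invented HTML fragment for demonstration only
html = '<html><body><h1 itemprop="name" class="space-mbs">Example Title</h1></body></html>'
sel = Selector(text=html)

print(sel.xpath('//h1/text()').extract())                      # ['Example Title']
print(sel.xpath('//*[@itemprop="name"][1]/text()').extract())  # ['Example Title']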

CSS selectors

response.css('.ad-price')
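The same kind of quick test works with CSS; ::text is Scrapy's CSS extension for selecting text nodes, and .ad-price is just the example class used above, with an invented fragment:

from scrapy import Selector

# Invented fragment; '.ad-price' mirrors the example class above
sel = Selector(text='<span class="ad-price">1,500,000</span>')
print(sel.css('.ad-price::text').extract())   # ['1,500,000']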

Commonly needed selections:

Primary fields | XPath expression
title          | //*[@itemprop="name"][1]/text()
price          | //*[@itemprop="price"][1]/text()
description    | //*[@itemprop="description"][1]/text()
address        | //*[@itemtype="http://schema.org/Place"][1]/text()
image_urls     | //*[@itemprop="image"][1]/@src

A Scrapy Project

scrapy startproject properties
Directory structure:

├── properties
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg

Planning the items

Plan the data you want to capture; you will not necessarily use all of it, so feel free to add fields.

from scrapy.item import Item, Field


class PropertiesItem(Item):
    # Primary fields
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()

    # Calculated fields
    images = Field()
    location = Field()

    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()

Writing the spider

Create a new spider with scrapy genspider mydomain mydomain.com
The default template:

import scrapy


class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['http://web/']

    def parse(self, response):
        pass

After modification it looks as follows.
start_urls holds the target URLs.
self gives access to the spider's built-in helpers; the log() method prints whatever you pass it, for example:
self.log(response.xpath('//@src').extract())

import scrapy


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )

    def parse(self, response):
        self.log("title: %s" % response.xpath(
            '//*[@itemprop="name"][1]/text()').extract())
        self.log("price: %s" % response.xpath(
            '//*[@itemprop="price"][1]/text()').re('[.0-9]+'))
        self.log("description: %s" % response.xpath(
            '//*[@itemprop="description"][1]/text()').extract())
        self.log("address: %s" % response.xpath(
            '//*[@itemtype="http://schema.org/'
            'Place"][1]/text()').extract())
        self.log("image_urls: %s" % response.xpath(
            '//*[@itemprop="image"][1]/@src').extract())

Start the spider from the project directory with scrapy crawl.
Alternatively, you can use scrapy parse,
which fetches the given URL and processes it with the spider that handles it.
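For example, using the demo-server URL from the snippets above (substitute your own spider name and URL):
scrapy crawl basic
scrapy parse --spider=basic http://web:9312/properties/property_000000.html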

Populating the item

In the spider basic.py, import the item class:
from properties.items import PropertiesItem
and fill each item field from the corresponding response data.

item = PropertiesItem()
item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()

The complete version:

import scrapy
from helloworld.items import PropertiesItem


class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['https://www.iana.org/domains/reserved']

    def parse(self, response):
        item = PropertiesItem()
        item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()
        # Return the populated item so it reaches the feed exporters/pipelines
        return item

Saving the output

When running the spider, save the output to a file by specifying the format and path:
scrapy crawl basic -o items.json    (JSON)
scrapy crawl basic -o items.xml    (XML)
scrapy crawl basic -o items.csv    (CSV)
scrapy crawl basic -o "ftp://user:pass@ftp.scrapybook.com/items.jl"    (JSON Lines, uploaded to FTP)
scrapy crawl basic -o "s3://aws_key:aws_secret@scrapybook/items.json"    (JSON, uploaded to S3)

Simplifying parse with ItemLoader

ItemLoader(item=..., response=...) takes an item instance and the response; fields are then added with XPath expressions:

def parse(self, response):
    l = ItemLoader(item=PropertiesItem(), response=response)
    l.add_xpath('title', '//*[@itemprop="name"][1]/text()')

There are also various processors:
Join() merges multiple extracted values into one.
MapCompose() applies Python functions to each value.
MapCompose(unicode.strip) removes leading and trailing whitespace characters.
MapCompose(unicode.strip, unicode.title) same as above, but also title-cases the results.
MapCompose(float) converts strings to numbers.
MapCompose(lambda i: i.replace(',', ''), float) converts strings to numbers, ignoring possible ',' characters.
MapCompose(lambda i: urlparse.urljoin(response.url, i)) turns relative URLs into absolute URLs.
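A minimal sketch of how these processors behave on their own (Python 2 is assumed, to match the book's unicode.strip usage; the input strings are invented for illustration):

from scrapy.loader.processors import MapCompose, Join

# Strip whitespace, then title-case each value
proc = MapCompose(unicode.strip, unicode.title)
print(proc([u'  hello world  ']))        # [u'Hello World']

# Join multiple values into a single space-separated string
print(Join()([u'first', u'second']))     # u'first second'

# Drop thousands separators, then convert to float
to_float = MapCompose(lambda i: i.replace(',', ''), float)
print(to_float([u'1,234.5']))            # [1234.5]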

add_value adds a single literal value to an item field:

def parse(self, response):
    # The loader line from the previous snippet, repeated here for completeness
    l = ItemLoader(item=PropertiesItem(), response=response)
    l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                MapCompose(unicode.strip, unicode.title))
    l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                MapCompose(lambda i: i.replace(',', ''), float),
                re='[,.0-9]+')
    l.add_xpath('description', '//*[@itemprop="description"]'
                '[1]/text()', MapCompose(unicode.strip), Join())
    l.add_xpath('address',
                '//*[@itemtype="http://schema.org/Place"][1]/text()',
                MapCompose(unicode.strip))
    l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
                MapCompose(lambda i: urlparse.urljoin(response.url, i)))
    # Housekeeping fields added with add_value
    l.add_value('url', response.url)
    l.add_value('project', self.settings.get('BOT_NAME'))
    l.add_value('spider', self.name)
    l.add_value('server', socket.gethostname())
    l.add_value('date', datetime.datetime.now())

The complete spider:

from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from properties.items import PropertiesItem
import datetime
import urlparse
import socket
import scrapy


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]

    # Start on a property page
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )

    def parse(self, response):
        """ This function parses a property page.

        @url http://web:9312/properties/property_000000.html
        @returns items 1
        @scrapes title price description address image_urls
        @scrapes url project spider server date
        """
        # Create the loader using the response
        l = ItemLoader(item=PropertiesItem(), response=response)

        # Load fields using XPath expressions
        l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                    MapCompose(lambda i: i.replace(',', ''), float),
                    re='[,.0-9]+')
        l.add_xpath('description', '//*[@itemprop="description"]'
                    '[1]/text()',
                    MapCompose(unicode.strip), Join())
        l.add_xpath('address',
                    '//*[@itemtype="http://schema.org/Place"]'
                    '[1]/text()',
                    MapCompose(unicode.strip))
        l.add_xpath('image_urls', '//*[@itemprop="image"]'
                    '[1]/@src', MapCompose(
                        lambda i: urlparse.urljoin(response.url, i)))

        # Housekeeping fields
        l.add_value('url', response.url)
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())

        return l.load_item()
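The @url, @returns, and @scrapes lines in the docstring are Scrapy contracts; they can be verified with scrapy check basic.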

Multiple URLs

When the site has several pages,
the URLs can be listed manually one by one:

start_urls = (
    'http://web:9312/properties/property_000000.html',
    'http://web:9312/properties/property_000001.html',
    'http://web:9312/properties/property_000002.html',
)

Alternatively, put the URLs in a file and read them:

start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]

A crawl can move in two directions:
- Horizontal: from an index page to the next index page, where the layout is basically the same.
- Vertical: from an index page to a specific item page, where the layout changes, for example from a listing page to a product detail page.

urlparse.urljoin(base, url) is the standard Python way to join two URLs.
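A quick illustration (urlparse is the Python 2 module name used in the book's code; in Python 3 the same function lives in urllib.parse, and the URLs follow the demo-server pattern used above):

import urlparse

print(urlparse.urljoin('http://web:9312/properties/index_00000.html',
                       'property_000000.html'))
# http://web:9312/properties/property_000000.html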

Find the set of varying URLs for the horizontal crawl:

urls = response.xpath('//*[@itemprop="url"]/@href').extract()
# [u'property_000000.html', ... u'property_000029.html']

Then combine them with urljoin:

[urlparse.urljoin(response.url, i) for i in urls]
# [u'http://..._000000.html', ... /property_000029.html']

urls = response.xpath('//*[@itemprop="url"]/@href').extract()
[urlparse.urljoin(response.url, i) for i in urls]

Horizontal and vertical crawling

Obtain both the page (index) URLs and the item (product) URLs.
The only difference is which URLs you extract; then combine the two:

def parse(self, response):
    # Requires: from scrapy.http import Request and import urlparse
    # Get the URLs of further index pages (horizontal crawl)
    next_selector = response.xpath('//*[contains(@class,'
                                   '"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))

    # Get the URLs of item pages (vertical crawl)
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)
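Requests yielded without an explicit callback are handled by parse() again (the default callback), so the spider keeps moving horizontally across index pages, while requests created with callback=self.parse_item hand the item pages to parse_item().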

scrapy genspider -t crawl webname web.org
generates a CrawlSpider:

...
class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://www.web/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        ...

In rules, a Rule without a callback simply follows the links it extracts, while a Rule with callback='parse_item' passes the downloaded pages to parse_item instead of the default parsing:

rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item'),
)
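Putting these rules into context, here is a sketch of what the full CrawlSpider might look like; the start URL and the parse_item body are assumptions pieced together from the earlier snippets, not the book's exact code. A Rule without a callback follows its links by default, while a Rule with a callback does not follow further unless follow=True is passed.

import datetime
import socket

from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose
from scrapy.spiders import CrawlSpider, Rule

from properties.items import PropertiesItem


class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    # Assumed index page on the book's demo server
    start_urls = ['http://web:9312/properties/index_00000.html']

    rules = (
        # Horizontal crawl: follow "next" page links (no callback,
        # so follow defaults to True)
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
        # Vertical crawl: item pages are handed to parse_item
        Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
             callback='parse_item'),
    )

    def parse_item(self, response):
        # Same field loading as in the basic spider above
        # (Python 2, to match the book's unicode.strip usage)
        l = ItemLoader(item=PropertiesItem(), response=response)
        l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                    MapCompose(lambda i: i.replace(',', ''), float),
                    re='[,.0-9]+')
        l.add_value('url', response.url)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())
        return l.load_item()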