Learning Scrapy 1
- IPython is an enhanced Python interactive shell with syntax highlighting, tab completion, built-in helper functions, and more.
pip install ipython
- XPath indexing starts at 1, not 0, so [1] selects the first match (see the short sketch after this list).
- Limit how many items a crawl collects:
scrapy crawl manual -s CLOSESPIDER_ITEMCOUNT=90
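A tiny standalone illustration of the 1-based indexing mentioned above (hypothetical HTML, using scrapy.Selector directly):

```python
from scrapy import Selector

sel = Selector(text='<ul><li>first</li><li>second</li></ul>')
print(sel.xpath('//li[1]/text()').extract())   # ['first'] -- [1] is the first <li>, not the second
```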
UR2IM process
The basic scraping workflow: UR2IM (URL, Request, Response, Items, More URLs).
- URL
scrapy shell is Scrapy's interactive console, useful for quickly testing requests and selectors.
Launch it with scrapy shell 'http://scrapy.org'; it fetches the page and drops you into an IPython session with the resulting objects ready to use:
```
$ scrapy shell 'http://scrapy.org' --nolog
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x101ade4a8>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 https://scrapy.org/>
[s]   settings   <scrapy.settings.Settings object at 0x1028b09e8>
[s]   spider     <DefaultSpider 'default' at 0x102b531d0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()                     Shell help (print this help)
[s]   view(response)              View response in a browser
```
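The shortcuts listed in the banner can be used right away, for example:

```
>>> fetch('http://scrapy.org')   # fetch a URL again and update request/response in place
>>> view(response)               # open the current response in your browser
```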
Request and response
You can work with the response object directly. For example, print the first 50 characters of the body:

```
>>> response.body[:50]
```
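A few more things worth trying on the response in the same session (output omitted; these are standard Scrapy response members):

```
>>> response.status                              # 200, matching the banner above
>>> response.headers['Content-Type']             # headers behave like a dictionary
>>> response.xpath('//h1/text()').extract()      # a first taste of XPath extraction
```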
The item
Extract the data you want from the response and store it in the corresponding item fields, using XPath.
A typical page contains logos, search boxes, buttons, and similar chrome; what we actually want are the specific details such as names, phone numbers, and so on.
Locate the target element (copy its XPath from the browser's developer tools and then simplify it) and extract the value.
Use response.xpath('//h1/text()').extract() to extract every h1 element on the page; the /text() step in //h1/text() returns only the text content.
Here we assume there is a single h1 element; ideally a page should have only one h1 anyway, for SEO (Search Engine Optimization) reasons.
If the element is <h1 itemprop="name" class="space-mbs">...</h1>, it can also be extracted with //*[@itemprop="name"][1]/text().
Again, XPath indices start at 1, not 0.
CSS selectors work as well:
response.css('.ad-price')
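The two selector styles can be mixed freely; the following are roughly equivalent ways of grabbing the text inside an element whose class is ad-price (illustrative markup):

```python
response.css('.ad-price::text').extract()
response.xpath('//*[contains(@class, "ad-price")]/text()').extract()
```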
Typical field selections:

Primary fields | XPath expression
--- | ---
title | //*[@itemprop="name"][1]/text()
price | //*[@itemprop="price"][1]/text()
description | //*[@itemprop="description"][1]/text()
address | //*[@itemtype="http://schema.org/Place"][1]/text()
image_urls | //*[@itemprop="image"][1]/@src
A Scrapy Project
scrapy startproject properties
The directory structure it creates:

```
├── properties
│   ├── __init__.py
│   ├── items.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       └── __init__.py
└── scrapy.cfg
```
Planning the items
Plan out the data you want to capture; not all of it has to be used right away, so feel free to add fields.
```python
from scrapy.item import Item, Field


class PropertiesItem(Item):
    # Primary fields
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()

    # Calculated fields
    images = Field()
    location = Field()

    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()
```
Writing the spider
Create a new spider with scrapy genspider mydomain mydomain.com
The default template:
```python
import scrapy


class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['http://web/']

    def parse(self, response):
        pass
```
The modified version is shown below: start_urls points at the target URL, and the spider's built-in self.log() method is used to print everything that was extracted, e.g. self.log(response.xpath('//@src').extract()).
```python
import scrapy


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )

    def parse(self, response):
        self.log("title: %s" % response.xpath(
            '//*[@itemprop="name"][1]/text()').extract())
        self.log("price: %s" % response.xpath(
            '//*[@itemprop="price"][1]/text()').re('[.0-9]+'))
        self.log("description: %s" % response.xpath(
            '//*[@itemprop="description"][1]/text()').extract())
        self.log("address: %s" % response.xpath(
            '//*[@itemtype="http://schema.org/'
            'Place"][1]/text()').extract())
        self.log("image_urls: %s" % response.xpath(
            '//*[@itemprop="image"][1]/@src').extract())
```
Run it from the project directory with scrapy crawl basic.
Alternatively, use scrapy parse, which fetches a given URL and processes it with the spider that handles it.
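For instance, against the book's sample page (the --spider option just makes the spider choice explicit):

```
scrapy parse --spider=basic http://web:9312/properties/property_000000.html
```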
Populating the item
In the spider basic.py, import the item class:
from properties.items import PropertiesItem
Then assign each field from the corresponding extracted value:
```python
item = PropertiesItem()
item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()
```
The full spider:

```python
import scrapy
from helloworld.items import PropertiesItem


class BasicSpider(scrapy.Spider):
    name = 'basic'
    allowed_domains = ['web']
    start_urls = ['https://www.iana.org/domains/reserved']

    def parse(self, response):
        item = PropertiesItem()
        item['title'] = response.xpath('//*[@id="main_right"]/h1').extract()
        return item  # return the item so Scrapy can collect/export it
```
Saving the output
When running the spider, the scraped items can be written straight to a file by passing the format/path with -o:

```
scrapy crawl basic -o items.json    # JSON
scrapy crawl basic -o items.xml     # XML
scrapy crawl basic -o items.csv     # CSV
scrapy crawl basic -o "ftp://user:pass@ftp.scrapybook.com/items.jl"     # JSON Lines, uploaded over FTP
scrapy crawl basic -o "s3://aws_key:aws_secret@scrapybook/items.json"   # JSON, uploaded to S3
```
ItemLoader: simplifying parse()
ItemLoader(item=..., response=...) takes an item and the response; fields are then filled in with XPath expressions via add_xpath():
```python
def parse(self, response):
    l = ItemLoader(item=PropertiesItem(), response=response)
    l.add_xpath('title', '//*[@itemprop="name"][1]/text()')
```
There are also various processors:

Processor | What it does
--- | ---
Join() | Joins multiple extracted values into one.
MapCompose(f, ...) | Applies one or more Python functions to each value.
MapCompose(unicode.strip) | Removes leading and trailing whitespace characters.
MapCompose(unicode.strip, unicode.title) | Same as above, but also title-cases the result.
MapCompose(float) | Converts strings to numbers.
MapCompose(lambda i: i.replace(',', ''), float) | Converts strings to numbers, ignoring possible ',' characters.
MapCompose(lambda i: urlparse.urljoin(response.url, i)) | Converts relative URLs to absolute URLs.
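A minimal sketch of how these processors behave on their own, outside any spider (assuming a Scrapy version that still exposes them under scrapy.loader.processors; newer releases ship them in the itemloaders package, and str stands in for the book's Python-2 unicode):

```python
from scrapy.loader.processors import MapCompose, Join

# MapCompose applies each function, in order, to every value in the input list
strip_and_title = MapCompose(str.strip, str.title)
print(strip_and_title(['  three bedroom flat ']))       # ['Three Bedroom Flat']

to_number = MapCompose(lambda i: i.replace(',', ''), float)
print(to_number(['1,234.56']))                          # [1234.56]

# Join concatenates all values into a single string (space-separated by default)
print(Join()(['first paragraph', 'second paragraph']))  # 'first paragraph second paragraph'
```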
add_value
add_value() sets a field to a literal value rather than an XPath result (handy for housekeeping fields):
```python
def parse(self, response):
    l = ItemLoader(item=PropertiesItem(), response=response)  # create the loader as before
    l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                MapCompose(unicode.strip, unicode.title))
    l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                MapCompose(lambda i: i.replace(',', ''), float),
                re='[,.0-9]+')
    l.add_xpath('description', '//*[@itemprop="description"]'
                '[1]/text()', MapCompose(unicode.strip), Join())
    l.add_xpath('address',
                '//*[@itemtype="http://schema.org/Place"][1]/text()',
                MapCompose(unicode.strip))
    l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
                MapCompose(lambda i: urlparse.urljoin(response.url, i)))
    l.add_value('url', response.url)
    l.add_value('project', self.settings.get('BOT_NAME'))
    l.add_value('spider', self.name)
    l.add_value('server', socket.gethostname())
    l.add_value('date', datetime.datetime.now())
```
The complete spider:
```python
from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from properties.items import PropertiesItem
import datetime
import urlparse
import socket
import scrapy


class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["web"]

    # Start on a property page
    start_urls = (
        'http://web:9312/properties/property_000000.html',
    )

    def parse(self, response):
        """ This function parses a property page.

        @url http://web:9312/properties/property_000000.html
        @returns items 1
        @scrapes title price description address image_urls
        @scrapes url project spider server date
        """

        # Create the loader using the response
        l = ItemLoader(item=PropertiesItem(), response=response)

        # Load fields using XPath expressions
        l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                    MapCompose(lambda i: i.replace(',', ''), float),
                    re='[,.0-9]+')
        l.add_xpath('description', '//*[@itemprop="description"]'
                    '[1]/text()',
                    MapCompose(unicode.strip), Join())
        l.add_xpath('address',
                    '//*[@itemtype="http://schema.org/Place"]'
                    '[1]/text()',
                    MapCompose(unicode.strip))
        l.add_xpath('image_urls', '//*[@itemprop="image"]'
                    '[1]/@src',
                    MapCompose(
                        lambda i: urlparse.urljoin(response.url, i)))

        # Housekeeping fields
        l.add_value('url', response.url)
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())

        return l.load_item()
```
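Note that the book's code targets Python 2 (unicode, urlparse). On Python 3, the same loader calls would look roughly like this (a sketch; l and response are the loader and response from parse() above, and the processor import path depends on your Scrapy version):

```python
from urllib.parse import urljoin                        # replaces Python 2's urlparse.urljoin
from scrapy.loader.processors import MapCompose, Join   # newer Scrapy: from itemloaders.processors import ...

l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
            MapCompose(str.strip, str.title))           # str stands in for the Python-2 unicode type
l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
            MapCompose(lambda i: urljoin(response.url, i)))
```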
Multiple URLs
When there are many pages to crawl, the URLs can simply be listed by hand:

```python
start_urls = (
    'http://web:9312/properties/property_000000.html',
    'http://web:9312/properties/property_000001.html',
    'http://web:9312/properties/property_000002.html',
)
```

Or they can be kept in a file and read at startup:

```python
start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]
```
A crawl can proceed in two directions:
- Horizontal: from one index page to the next; the page layout stays essentially the same.
- Vertical: from an index page into a specific item page; the layout changes, e.g. from a listing page to a product detail page.
urlparse.urljoin(base, url)
is the standard Python way of combining a base URL with a relative one.
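For example, in a Python 2 session (the index URL here is just illustrative; on Python 3 use from urllib.parse import urljoin):

```python
>>> import urlparse
>>> urlparse.urljoin('http://web:9312/properties/index_00000.html', 'property_000000.html')
'http://web:9312/properties/property_000000.html'
```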
To crawl horizontally, first collect the URLs found on the page:

```python
>>> urls = response.xpath('//*[@itemprop="url"]/@href').extract()
>>> urls
[u'property_000000.html', ... u'property_000029.html']
```

Then turn them into absolute URLs with urljoin:

```python
>>> [urlparse.urljoin(response.url, i) for i in urls]
[u'http://..._000000.html', ... /property_000029.html']
```

Put together:

```python
urls = response.xpath('//*[@itemprop="url"]/@href').extract()
[urlparse.urljoin(response.url, i) for i in urls]
```
Crawling in both directions
Collect both the pagination (index) URLs and the product URLs; the parse method only has to extract the two kinds of URLs and combine them:
```python
def parse(self, response):
    # Request comes from scrapy (from scrapy import Request)

    # Get the next index URLs and yield Requests (horizontal crawling)
    next_selector = response.xpath('//*[contains(@class,'
                                   '"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))

    # Get item URLs and yield Requests (vertical crawling)
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)
```
scrapy genspider -t crawl webname web.org
generates a spider based on the CrawlSpider template:
```python
...
class EasySpider(CrawlSpider):
    name = 'easy'
    allowed_domains = ['web']
    start_urls = ['http://www.web/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        ...
```
In rules, setting a callback means the pages matched by that rule are handed to parse_item() rather than merely being followed; a rule without a callback only follows the extracted links:
```python
rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item'),
)
```