Scrapy Learning Notes IV - Spiders
A Spider defines how a site is crawled: which requests to make, how to follow links, and how to extract data from the responses. For example, a spider that has to log in first can override start_requests() to return a FormRequest:
```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass
```
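The logged_in callback above is left as a stub. A minimal sketch of what it might do, assuming the goal is simply to follow every link on the post-login page and hand each response to a hypothetical parse_page callback (the spider name and URLs are illustrative assumptions):

```python
import scrapy

class LoginSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = 'login_example'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # Follow every link on the page reached after logging in,
        # handing each response to another callback.
        for href in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_page)

    def parse_page(self, response):
        # Hypothetical callback: extract whatever data the page contains.
        self.logger.info('Visited %s', response.url)
```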
name  # the spider's name; Scrapy locates the spider by this name, so it must be unique
allowed_domains  # optional list of domains the spider is allowed to crawl
start_urls  # the URLs crawling starts from when no particular URLs are specified
custom_settings  # per-spider settings that take precedence over the project-wide defaults (see the sketch after the examples below)
crawler  # the Crawler object this spider is bound to; set by the from_crawler() class method
settings  # the Settings object the spider runs with
logger  # a Python logger created with the spider's name
from_crawler(crawler, *args, **kwargs)  # class method Scrapy uses to create the spider
start_requests()  # returns the first Requests to crawl; by default one Request per URL in start_urls, with parse() as the callback
parse(response)  # the default callback; processes the Response, extracts data, and may return further Requests (other callbacks can be defined as well)
log(message[, level, component])  # wrapper that sends a log message through the spider's logger
closed(reason)  # called when the spider is closed
Examples:
```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        self.logger.info('A response from %s just arrived!', response.url)
```
```python
import scrapy

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield {"title": h3}
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
```
```python
import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    name = 'example.com'
    allowed_domains = ['example.com']

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/1.html', self.parse)
        yield scrapy.Request('http://www.example.com/2.html', self.parse)
        yield scrapy.Request('http://www.example.com/3.html', self.parse)

    def parse(self, response):
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(url, callback=self.parse)
```
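The attributes custom_settings, settings, logger, and closed() do not appear in the examples above. A minimal sketch of how they fit together; the spider name, setting values, and stats key are illustrative assumptions, not part of the original notes:

```python
import scrapy

class PoliteSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = 'polite'
    start_urls = ['http://www.example.com/']

    # Per-spider settings that override the project defaults.
    custom_settings = {
        'DOWNLOAD_DELAY': 1.0,
        'CONCURRENT_REQUESTS': 4,
    }

    def parse(self, response):
        # self.settings exposes the merged settings the spider runs with;
        # self.logger is a logger named after the spider.
        self.logger.info('delay=%s url=%s',
                         self.settings.getfloat('DOWNLOAD_DELAY'), response.url)
        # self.crawler is the Crawler bound by from_crawler(); here we bump a stats counter.
        self.crawler.stats.inc_value('pages_seen')

    def closed(self, reason):
        # Called once when the spider finishes ('finished', 'cancelled', ...).
        self.logger.info('Spider closed: %s', reason)
```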