Exploring the use of scrapy.Selector
Source: Internet  Posted: 中国等级观念知乎  Editor: 程序博客网  Time: 2024/05/17 01:04
Note
The examples in this article come from the official Scrapy documentation, which readers may want to look at first:
https://doc.scrapy.org/en/0.14/topics/selectors.html
Getting started
Open a terminal and enter:
# scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
This produces the following output:
2017-04-30 13:51:21 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2017-04-30 13:51:21 [scrapy.utils.log] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2017-04-30 13:51:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.corestats.CoreStats']
2017-04-30 13:51:22 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-04-30 13:51:22 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-04-30 13:51:22 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-04-30 13:51:22 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2017-04-30 13:51:22 [scrapy.core.engine] INFO: Spider opened
2017-04-30 13:51:22 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://doc.scrapy.org/en/latest/_static/selectors-sample1.html> from <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
2017-04-30 13:51:23 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://doc.scrapy.org/en/latest/_static/selectors-sample1.html> (referer: None)
2017-04-30 13:51:25 [traitlets] DEBUG: Using default logger
2017-04-30 13:51:25 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x00000000049F1C18>
[s]   item       {}
[s]   request    <GET http://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   response   <200 https://doc.scrapy.org/en/latest/_static/selectors-sample1.html>
[s]   settings   <scrapy.settings.Settings object at 0x00000000049F1978>
[s]   spider     <DefaultSpider 'default' at 0x4d494a8>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]:
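The output lists the objects available in the shell:
the scrapy module, crawler, item, request, response, settings, spider
The page we just fetched, including its source, is held in the response object.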
In [1]: type(response)  # inspect the response object
Out[1]: scrapy.http.response.html.HtmlResponse
The xpath method on response is already enough to extract data. It returns a SelectorList object:
In [3]: response.xpath('//a')
Out[3]:
[<Selector xpath='//a' data=u'<a href="image1.html">Name: My image 1 <'>,
 <Selector xpath='//a' data=u'<a href="image2.html">Name: My image 2 <'>,
 <Selector xpath='//a' data=u'<a href="image3.html">Name: My image 3 <'>,
 <Selector xpath='//a' data=u'<a href="image4.html">Name: My image 4 <'>,
 <Selector xpath='//a' data=u'<a href="image5.html">Name: My image 5 <'>]
In [4]: response.xpath('//a/text()')
Out[4]:
[<Selector xpath='//a/text()' data=u'Name: My image 1 '>,
 <Selector xpath='//a/text()' data=u'Name: My image 2 '>,
 <Selector xpath='//a/text()' data=u'Name: My image 3 '>,
 <Selector xpath='//a/text()' data=u'Name: My image 4 '>,
 <Selector xpath='//a/text()' data=u'Name: My image 5 '>]
In [5]: response.xpath('//a/text()').extract()
Out[5]:
[u'Name: My image 1 ',
 u'Name: My image 2 ',
 u'Name: My image 3 ',
 u'Name: My image 4 ',
 u'Name: My image 5 ']
A SelectorList also supports regular expressions through its re method, which returns a list of strings:
In [10]: response.xpath('//a[contains(@href, "image")]/text()').re(r'Name:\s*(.*)')
Out[10]:
[u'My image 1 ',
 u'My image 2 ',
 u'My image 3 ',
 u'My image 4 ',
 u'My image 5 ']
A SelectorList is simply a list of Selector objects. Both Selector and SelectorList have an xpath method, and in both cases it returns a SelectorList, so queries can be nested:
In [13]: links = response.xpath('//a[contains(@href, "image")]')
In [15]: for index, link in enumerate(links):
    ...:     args = (index, link.xpath('@href').extract()[0], link.xpath('img/@src').extract()[0])
    ...:     print('Link number %d points to url %s and image %s' % args)
Output:
Link number 0 points to url image1.html and image image1_thumb.jpg
Link number 1 points to url image2.html and image image2_thumb.jpg
Link number 2 points to url image3.html and image image3_thumb.jpg
Link number 3 points to url image4.html and image image4_thumb.jpg
Link number 4 points to url image5.html and image image5_thumb.jpg
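Some of the APIs shown in the official documentation linked above (written for version 0.14) no longer apply to the Scrapy 1.3.3 used here.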
XPath syntax
1. Selecting nodes
Commonly used path expressions:
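2. Predicates
A predicate is written inside square brackets and is used to find a specific node, or a node that contains a specified value.
Examples: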
3. Wildcards
XPath uses wildcards to select XML elements whose names are not known in advance.
4. Selecting several paths
The "|" operator selects several paths in a single expression.
5. XPath axes
An axis defines a set of nodes relative to the current node.
6. Functions
XPath functions make fuzzy (partial) matching much easier.