Applying Scrapy
Source: Internet · Editor: 程序博客网 · Published: 2024/06/01 16:25
The spider below crawls two DMOZ listing pages and extracts the title, link, and description of each entry. Note that it uses the pre-1.0 Scrapy API (BaseSpider, HtmlXPathSelector), which has since been replaced by scrapy.Spider and response.xpath().

# This package will contain the spiders of your Scrapy project.
#
# Please refer to the documentation for information on how to create and manage
# your spiders.
from scrapy.spider import BaseSpider           # legacy API (pre-1.0 Scrapy)
from scrapy.selector import HtmlXPathSelector  # legacy API (pre-1.0 Scrapy)
from mySpider.items import DmozItem

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/",
    ]

    def parse(self, response):
        # Wrap the page body in the (legacy) XPath selector.
        hxs = HtmlXPathSelector(text=response.body)
        items = []
        # Each <li> under a <ul> is one catalog entry.
        for sel in hxs.select('//ul/li'):
            item = DmozItem()
            item['title'] = sel.select('a/text()').extract()
            item['link'] = sel.select('a/@href').extract()
            item['desc'] = sel.select('text()').extract()
            items.append(item)
        return items
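The heart of parse() is three relative XPath queries per list item. To show what each selector returns without running Scrapy, the same extraction can be sketched with the standard library's ElementTree against a toy fragment shaped like a dmoz listing page (the HTML and URLs here are made up for illustration):

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed fragment shaped like the dmoz listing pages.
html = """
<ul>
  <li><a href="http://example.com/book1">Book One</a> - a short description</li>
  <li><a href="http://example.com/book2">Book Two</a> - another description</li>
</ul>
"""

root = ET.fromstring(html.strip())
items = []
for li in root.findall("li"):          # mirrors hxs.select('//ul/li')
    a = li.find("a")
    items.append({
        "title": a.text,               # mirrors sel.select('a/text()')
        "link": a.get("href"),         # mirrors sel.select('a/@href')
        "desc": a.tail.strip(),        # roughly mirrors sel.select('text()')
    })

print(items)
```

In the real spider, extract() returns a list of matches rather than a single string, which is why the item fields hold lists.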