Crawler Practice (4): A Simple Scrapy Exercise
Preparation
Configuration file: settings.py
```python
BOT_NAME = 'scrapyTest'

SPIDER_MODULES = ['scrapyTest.spiders']
NEWSPIDER_MODULE = 'scrapyTest.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'scrapyTest (+http://www.yourdomain.com)'

# Obey robots.txt rules -- disabled here so the crawl is not restricted
ROBOTSTXT_OBEY = False
```
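If the target site starts throttling you, Scrapy's built-in politeness settings are worth adding here as well. The values below are illustrative suggestions, not part of the original project:

```python
# Optional throttling (illustrative values, not from the original project)
DOWNLOAD_DELAY = 0.5                  # seconds between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 8    # cap on parallel requests per domain
RETRY_TIMES = 3                       # retry failed requests a few times
```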
Prepare a pool of browser User-Agent strings
```python
agents = [
    "Mozilla/2.02E (Win95; U)",
    "Mozilla/3.01Gold (Win95; I)",
    # ...
]
```
Analyze cookies (iReader does not appear to use cookie-based anti-crawling)
""" 换Cookie """ cookie = { 'Hm_lpvt_2583df02aa8541db9378beae2ed00ba0': '1502265076', 'Hm_lvt_2583df02aa8541db9378beae2ed00ba0': '1502263527', 'ZyId': 'ada56e4598ab89a9944f' }
Logging (to record problems)
```python
import logging

# Raise the requests logger to WARNING so its debug chatter stays out of the log
logging.getLogger("requests").setLevel(logging.WARNING)

logging.basicConfig(
    level=logging.DEBUG,
    format='%(asctime)s %(filename)s[line:%(lineno)d] %(levelname)s %(message)s',
    datefmt='%a, %d %b %Y %H:%M:%S',
    filename='cataline.log',
    filemode='w')
```

For example:

```python
logging.info('crawling URL ' + view_url)
```
Writing the main spider class
Based on the analysis of iReader, the spider first collects the URLs that need to be crawled and pushes them onto the pending queue. The strategy is recursive: category pages yield book-list pages, and each list page yields its own next page.
```python
# (Excerpt from the spider class; Request and Selector come from
# scrapy.http and scrapy.selector respectively.)

# Start URLs: three top-level category pages
start_urls = [
    "http://www.ireader.com/index.php?ca=booksort.index&pca=booksort.index&pid=92",
    "http://www.ireader.com/index.php?ca=booksort.index&pca=booksort.index&pid=10",
    "http://www.ireader.com/index.php?ca=booksort.index&pca=booksort.index&pid=68"
]

def start_requests(self):
    for ph_type in self.start_urls:
        yield Request(url=ph_type, callback=self.parse_type_key)

# Walk every category
def parse_type_key(self, response):
    selector = Selector(response)
    types = selector.xpath('//div[@class="difgenre"]')[1].xpath('.//div[@class="right"]/ul/li')
    for book_type in types:
        type_url = book_type.xpath('.//a/@href')[0].extract()
        logging.info('category ' + type_url)
        yield Request(url=type_url, callback=self.parse_ph_key)

# For each category, follow every book link, then the next-page link
def parse_ph_key(self, response):
    selector = Selector(response)
    lis = selector.xpath('//ul[@class="newShow"]/li')
    for li in lis:
        view_url = li.xpath('.//a/@href')[0].extract()
        logging.info('crawling URL ' + view_url)
        yield Request(url=view_url, callback=self.parse_content)
    # extract_first() returns None on the last page, which ends the recursion
    # (indexing [0] as the original did raises IndexError when no link exists)
    url_next = selector.xpath('//a[@class="down"]/@href').extract_first()
    if url_next:
        logging.info('next page ' + url_next)
        yield Request(url=url_next, callback=self.parse_ph_key)
```
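The XPath expressions above are easiest to verify interactively before wiring them into callbacks; `scrapy shell` drops you into a session with `response` already bound. A quick check of the category selector, for example:

```python
# Run: scrapy shell "http://www.ireader.com/index.php?ca=booksort.index&pca=booksort.index&pid=92"
# Then, inside the shell:
response.xpath('//div[@class="difgenre"]')[1].xpath('.//div[@class="right"]/ul/li//a/@href').extract()
```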
Once the addresses are collected, each one is downloaded and parsed (the code is incomplete):
```python
# Parse a single book's detail page (also a method of the spider class;
# urlparse is the Python 2 standard-library module)
def parse_content(self, response):
    logging.debug('crawling ' + response.url)
    item = ScrapytestItem()
    # Book id: the "bid" parameter of the page's query string
    item['_id'] = dict([(k, v[0]) for k, v in urlparse.parse_qs(
        urlparse.urlparse(response.url).query).items()])['bid']
    # Current URL
    item['url'] = response.url
    # Title
    item['title'] = response.selector.xpath('//div[@class="bookname"]/h2/a/text()')[0].extract().decode('utf-8')
    item['tag'] = response.selector.xpath('//div[@class="bookL"]/s/text()')[0].extract().decode('utf-8')
    try:
        # Rating
        item['rate'] = response.selector.xpath('//div[@class="bookname"]/span/text()')[0].extract().decode('utf-8')
        # Number of raters: keep the number before the "人" in "N人"
        item['num_rate'] = response.selector.xpath('//div[@class="bookinf01"]/p/span[@class="manyMan"]/text()')[0].extract().decode('utf-8').split('人')[0]
    except Exception:
        item['rate'] = ''
        item['num_rate'] = ''
    yield item
```
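The `_id` line is the densest part: it splits the query string and keeps the `bid` value. A standalone sketch of the same logic (the example URL is made up; on Python 3 these functions live in `urllib.parse`):

```python
import urlparse  # Python 2; on Python 3: from urllib import parse as urlparse

url = 'http://www.ireader.com/index.php?ca=bookdetail.index&bid=12345'  # hypothetical
query = urlparse.parse_qs(urlparse.urlparse(url).query)
print(query['bid'][0])  # -> '12345'
```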
Item field definitions
Define the fields needed when parsing the HTML (matching parse_content in the spider class):
```python
import scrapy

class ScrapytestItem(scrapy.Item):
    url = scrapy.Field()
    _id = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    num_word = scrapy.Field()
    press = scrapy.Field()
    num_rate = scrapy.Field()
    rate = scrapy.Field()
    tag = scrapy.Field()
    img = scrapy.Field()
    des = scrapy.Field()
    price = scrapy.Field()
    similar = scrapy.Field()
```
The pipeline class (storing the data)
MongoDB is used for storage.
```python
import pymongo

from scrapyTest.items import ScrapytestItem

class ScrapytestPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient("localhost", 27017)
        db = client["book"]
        self.book = db["book"]

    def process_item(self, item, spider):
        """ Check the item type and store it in MongoDB """
        if isinstance(item, ScrapytestItem):
            try:
                self.book.insert(dict(item))
            except Exception:
                pass
        return item
```
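Two caveats worth noting: `insert()` is deprecated in pymongo 3+, and the bare `except ... pass` silently swallows duplicate-key errors on `_id`. A sketch of a more robust body (assuming pymongo >= 3):

```python
def process_item(self, item, spider):
    if isinstance(item, ScrapytestItem):
        # Upsert on _id: re-crawling a book updates it instead of raising
        self.book.replace_one({'_id': item['_id']}, dict(item), upsert=True)
    return item
```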
Middlewares
Only two middlewares are used here, and both are simple:
```python
import random

class UserAgentMiddleware(object):
    """ Rotate the User-Agent """
    def process_request(self, request, spider):
        agent = random.choice(agents)  # "agents" is the pool defined earlier
        request.headers["User-Agent"] = agent

class CookiesMiddleware(object):
    """ Set the cookie """
    cookie = {
        'Hm_lpvt_2583df02aa8541db9378beae2ed00ba0': '1502265076',
        'Hm_lvt_2583df02aa8541db9378beae2ed00ba0': '1502263527',
        'ZyId': 'ada56e4598ab89a9944f'
    }

    def process_request(self, request, spider):
        request.cookies = self.cookie
```
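CookiesMiddleware currently puts the same fixed cookie on every request. If you later collect cookies for several sessions, rotating them is a small change; a sketch (`cookie_pool` is hypothetical, not in the original project):

```python
import random

class CookiesMiddleware(object):
    """ Rotate among several session cookies """
    cookie_pool = [
        {'ZyId': 'ada56e4598ab89a9944f'},
        # ... more session cookie dicts
    ]

    def process_request(self, request, spider):
        request.cookies = random.choice(self.cookie_pool)
```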
To enable the middlewares, register them in settings.py:
```python
# The number is the middleware's order: process_request runs in ascending order
DOWNLOADER_MIDDLEWARES = {
    'scrapyTest.middlewares.UserAgentMiddleware': 401,
    'scrapyTest.middlewares.CookiesMiddleware': 402,
}
```
Results
Add a launch entry point:
```python
from scrapy import cmdline

cmdline.execute("scrapy crawl scrapyTest".split())
```
Complete directory structure
In the end, 91,914 records were crawled.
Problem analysis
- No detailed exception handling (see the errback sketch after this list)
- The traversal covers too little data; only fields visible on the page are captured
- Logging is used rather loosely; consider splitting logs by hour and by situation
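On the first point, Scrapy's `errback` parameter gives failed requests a natural landing place. A minimal sketch (the handler name `on_request_error` is illustrative, not from the original project):

```python
import logging

from scrapy.http import Request

def start_requests(self):
    for url in self.start_urls:
        yield Request(url, callback=self.parse_type_key,
                      errback=self.on_request_error)

def on_request_error(self, failure):
    # Invoked on DNS failures, timeouts, connection errors, etc.
    logging.error('request failed: ' + failure.request.url)
```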
GitHub | complete code
leason | personal blog