Customizing Scrapy's RetryMiddleware
Source: Internet  Editor: 程序博客网  Date: 2024/06/05 05:24
Spider repo: https://github.com/Karmenzind/EasyGoSpider
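The requirements here are:
- When the returned JSON reports a failure in its "code" field, put the request back on the retry queue. In the code below, {"code": 0} is the success case and any other value triggers a retry.
- If the JSON indicates that the cookie has been banned, correct the cookie list.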
A comment in the RetryMiddleware source says:
Failed pages are collected on the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non failed) pages.
In addition, the Scrapy docs describe process_response for downloader middlewares in general as follows:
If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().
When a Request object is returned, the response never reaches the spider's parse_item method, so nothing needs to be handled there.
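To make the quoted behaviour concrete, here is a toy model of a process_response chain. It is only a sketch and does not use Scrapy: ToyRequest, ToyResponse, RetryOnEmpty, Recorder and run_chain are hypothetical stand-ins invented for illustration, and the real downloader does considerably more.

```python
class ToyRequest:
    """Hypothetical stand-in for a request; not scrapy.Request."""
    def __init__(self, url, retries=0):
        self.url = url
        self.retries = retries


class ToyResponse:
    """Hypothetical stand-in for a response; not scrapy.http.Response."""
    def __init__(self, url, body):
        self.url = url
        self.body = body


class RetryOnEmpty:
    """Returns a new request when the body is empty, otherwise passes the response on."""
    def process_response(self, request, response):
        if response.body == "" and request.retries < 2:
            return ToyRequest(request.url, request.retries + 1)
        return response


class Recorder:
    """Records every response that reaches it, to show which ones get through."""
    def __init__(self):
        self.seen = []

    def process_response(self, request, response):
        self.seen.append(response.url)
        return response


def run_chain(middlewares, request, response):
    """Return ("reschedule", request) or ("spider", response)."""
    for mw in middlewares:
        result = mw.process_response(request, response)
        if isinstance(result, ToyRequest):
            # The chain is halted: later middlewares and the spider never see the response.
            return "reschedule", result
        response = result
    return "spider", response
```

Running an empty response through run_chain([RetryOnEmpty(), recorder], ...) yields a "reschedule" outcome and leaves recorder.seen empty, which mirrors why the spider callback needs no extra handling.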
The final modified version is:
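import datetime
import json

# On Scrapy < 1.0 the module is scrapy.contrib.downloadermiddleware.retry
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

# mongo_cli is not defined in this snippet; it is presumably the project's
# pymongo database handle and must be imported from wherever the repo defines it.

# 该用户访问次数过多 ("this user has made too many requests")
BANNED_MSG = u"\u8be5\u7528\u6237\u8bbf\u95ee\u6b21\u6570\u8fc7\u591a"


class LocalRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        # Customization starts here: retry on the API's own failure code.
        resp_dct = json.loads(response.body)
        if resp_dct.get('code') != 0:
            reason = "Code is not 0."
            if resp_dct.get("data") == BANNED_MSG:
                # The cookie used for this request is banned: drop it from the
                # spider's pool and record the failure date in MongoDB.
                banned_cookie = response.request.cookies
                reason = "%s has been BANNED today." % banned_cookie
                spider.logger.warning(reason)
                spider.cookies.remove(banned_cookie)
                mongo_cli.cookies.find_one_and_update(
                    {"cookie": banned_cookie},
                    {"$set": {"FailedDate": str(datetime.date.today())}})
            return self._retry(request, reason, spider) or response
        return response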
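The response-inspection part of the middleware can be tried out without Scrapy or MongoDB. The following is a minimal sketch under that assumption: classify_body and BANNED_MESSAGE are hypothetical names that do not appear in the original code, and the function only mirrors the two checks on "code" and "data".

```python
import json

# 该用户访问次数过多 ("this user has made too many requests")
BANNED_MESSAGE = u"\u8be5\u7528\u6237\u8bbf\u95ee\u6b21\u6570\u8fc7\u591a"


def classify_body(body):
    """Return (should_retry, cookie_banned) for a raw JSON response body.

    Hypothetical helper: it mirrors the middleware's checks without
    touching the retry queue, the cookie list, or the database.
    """
    data = json.loads(body)
    if data.get("code") == 0:
        # Success case: nothing to retry.
        return False, False
    # Any other code (including a missing one) is retried; the ban flag
    # is set only when the "data" field carries the ban message.
    return True, data.get("data") == BANNED_MESSAGE
```

Keeping the decision in a pure function like this makes it easy to check the retry and ban conditions before wiring them into process_response.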
Enable it in settings:
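DOWNLOADER_MIDDLEWARES = {
    # On Scrapy < 1.0 this path is scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    # The class name must match the definition exactly: LocalRetryMiddleware
    "EasyGoSpider.middleware.LocalRetryMiddleware": 302,
}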
The retry times and the filter rules also need to be adjusted; those were already configured earlier, so they are omitted here.
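For readers who have not configured these yet, a settings fragment along the following lines is one possibility. RETRY_ENABLED, RETRY_TIMES and RETRY_HTTP_CODES are standard Scrapy settings, but the values shown are illustrative assumptions and not the ones used in the original project, which the post does not list.

```python
# settings.py (illustrative values, not taken from the original project)
RETRY_ENABLED = True
RETRY_TIMES = 5  # assumed value; Scrapy's default is 2
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]  # assumed list of status codes to retry
```

As for filtering, requests produced by the stock _retry method are marked dont_filter = True, so the duplicate filter does not drop them.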