Customizing RetryMiddleware in Scrapy

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 05:24

Spider repo: https://github.com/Karmenzind/EasyGoSpider

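The requirements here are:

  • When the `code` field of the returned JSON is not 0 (the check the middleware code performs), put the request back into the retry queue
  • If the JSON carries a message saying the cookie has been banned, fix up the cookie list accordingly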
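
The two requirements can be sketched as a small standalone classifier, independent of Scrapy. This is only an illustration: `classify_body` and `BANNED_MESSAGE` are hypothetical names, the retry condition mirrors the check in the middleware code (retry whenever `code` is not 0), and treating an unparseable body as retryable is an added assumption.

```python
import json

# The ban message the middleware compares against: "该用户访问次数过多"
# (this user has made too many requests).
BANNED_MESSAGE = u"\u8be5\u7528\u6237\u8bbf\u95ee\u6b21\u6570\u8fc7\u591a"


def classify_body(body):
    """Return 'ok', 'retry', or 'banned' for a raw JSON response body.

    Hypothetical helper for illustration; not part of Scrapy or the repo.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        # Assumption: a body that is not valid JSON is worth retrying.
        return "retry"
    if not isinstance(payload, dict):
        return "retry"
    if payload.get("code") == 0:
        return "ok"
    if payload.get("data") == BANNED_MESSAGE:
        # The cookie used for this request should be dropped from the pool.
        return "banned"
    return "retry"
```

Keeping the decision in a pure function like this makes it easy to unit test without starting a crawl; the middleware then only has to act on the returned label.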

A comment in the RetryMiddleware source reads:

Failed pages are collected on the scraping process and rescheduled at the end, once the spider has finished crawling all regular (non failed) pages.

In addition, the Scrapy docs describe process_response for downloader middleware in general as follows:

If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().

When a Request object is returned, the response never reaches the spider's parse_item method, so nothing needs to be handled there.
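
To make that contract concrete, here is a toy model of the behavior the docs describe. It is not Scrapy's implementation: `ToyRequest`, `ToyResponse`, `run_chain`, and the two hooks are hypothetical names, and the chain is simplified to a plain list of functions.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    url: str


@dataclass
class ToyResponse:
    url: str
    body: str


def retry_on_error(request, response):
    # Returning a request instead of the response asks for a re-download.
    if response.body == "error":
        return ToyRequest(request.url)
    return response


def add_marker(request, response):
    # An ordinary hook that passes a (modified) response along.
    return ToyResponse(response.url, response.body + "!")


def run_chain(hooks, request, response):
    """Pass the response through each hook; stop as soon as one returns a request."""
    for hook in hooks:
        result = hook(request, response)
        if isinstance(result, ToyRequest):
            # Chain halted: the request would be rescheduled, and the
            # response would never reach the spider callback.
            return result
        response = result
    # Every hook returned a response: it would go on to the spider callback.
    return response
```

With `[retry_on_error, add_marker]` as the chain, a response whose body is `"error"` comes back as a `ToyRequest` and `add_marker` never runs, which is why the spider callback needs no retry logic of its own.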

The final modified version:

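import datetime
import json

# Scrapy >= 1.0 path; older releases used scrapy.contrib.downloadermiddleware.retry
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.utils.response import response_status_message

# mongo_cli (the project's MongoDB handle) is defined elsewhere in the repo.

# "该用户访问次数过多": this user has made too many requests
BANNED_MSG = u"\u8be5\u7528\u6237\u8bbf\u95ee\u6b21\u6570\u8fc7\u591a"


class LocalRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status in self.retry_http_codes:
            reason = response_status_message(response.status)
            return self._retry(request, reason, spider) or response
        # customization starts here
        resp_dct = json.loads(response.body)
        if resp_dct.get('code') != 0:
            reason = "Code is not 0."
            if resp_dct.get("data") == BANNED_MSG:
                # request.cookies is the cookie this request was sent with
                banned_cookie = request.cookies
                reason = "%s has been BANNED today." % banned_cookie
                spider.logger.warning(reason)
                # another in-flight request may already have removed it
                if banned_cookie in spider.cookies:
                    spider.cookies.remove(banned_cookie)
                mongo_cli.cookies.find_one_and_update(
                    {"cookie": banned_cookie},
                    {"$set": {"FailedDate": str(datetime.date.today())}})
            return self._retry(request, reason, spider) or response
        return response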

Activate it in settings:

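DOWNLOADER_MIDDLEWARES = {
    # Scrapy >= 1.0 path; older releases used
    # scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware
    "scrapy.downloadermiddlewares.redirect.RedirectMiddleware": None,
    # the class name must match the definition exactly: LocalRetryMiddleware
    "EasyGoSpider.middleware.LocalRetryMiddleware": 302,
}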

The retry times and the filter rules also need to be changed. Those were already configured earlier, so they are omitted here.
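
For readers who have not set these up yet, the retry side might look like the sketch below. `RETRY_ENABLED`, `RETRY_TIMES`, and `RETRY_HTTP_CODES` are standard Scrapy settings, but the values are illustrative assumptions rather than the ones used in the repo. The filter rules are not shown because the post does not say what they were.

```python
# settings.py (sketch; the values are assumptions, not taken from the repo)

# Keep retrying enabled so the retry machinery inherited by the
# custom middleware is active.
RETRY_ENABLED = True

# Maximum number of retries per request, in addition to the first download.
RETRY_TIMES = 5

# HTTP status codes that trigger a retry; the middleware reads these
# into self.retry_http_codes.
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]
```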
