Scrapy Signals

Source: Internet  Editor: 程序博客网  Time: 2024/06/16 17:55

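Signals provide a mechanism for invoking callbacks when certain events occur, for example when a spider opens or when an Item is scraped. You connect a signal to a callback with the crawler.signals.connect() method. Scrapy, in the version described here, has 11 signals, and perhaps the easiest way to understand them is to watch them in action. To that end I created a spider project whose main purpose is to log every method call. The spider itself is simple: it yields two Items and then raises an exception, and the Item Pipeline raises a DropItem exception while processing the second Item: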

def parse(self, response):
    for i in range(2):
        item = HooksasyncItem()
        item['name'] = "Hello %d" % i
        yield item
    raise Exception("dead")

The complete spider project can be found here.
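As a rough illustration of how the "Extension, signals.… fired" log lines could be produced, here is a minimal sketch of an extension that connects one logging callback to each of the 11 signals through crawler.signals.connect(). It is not the original project's code: the names HooksExtension, SIGNAL_NAMES, fired and _handlers are hypothetical, and the Scrapy import is placed inside from_crawler() only so the sketch can be loaded without Scrapy installed (a top-level import is the usual choice).

```python
import logging

logger = logging.getLogger(__name__)

# The 11 signal names that appear in this article (assumed to exist as
# attributes of scrapy.signals).
SIGNAL_NAMES = (
    "engine_started", "engine_stopped",
    "spider_opened", "spider_idle", "spider_closed", "spider_error",
    "request_scheduled", "response_received", "response_downloaded",
    "item_scraped", "item_dropped",
)


class HooksExtension:
    """Hypothetical extension that logs every signal it receives."""

    def __init__(self):
        self.fired = []      # signal names, in the order they fired
        self._handlers = []  # strong references to the callbacks

    @classmethod
    def from_crawler(cls, crawler):
        # Lazy import: an assumption made for this sketch only.
        from scrapy import signals

        ext = cls()
        for name in SIGNAL_NAMES:
            handler = ext._make_handler(name)
            # Signal connections are weak by default, so the callback is
            # also stored on the extension to keep it alive.
            ext._handlers.append(handler)
            crawler.signals.connect(handler, signal=getattr(signals, name))
        return ext

    def _make_handler(self, name):
        # A handler may declare only the arguments it needs; this one
        # declares none and just records and logs the signal name.
        def handler():
            self.fired.append(name)
            logger.info("Extension, signals.%s fired", name)
        return handler
```

Such a class would be enabled through the EXTENSIONS setting, keyed by its import path (which depends on where the class lives in your project) with an order value such as 100.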

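With this project we can better understand when each signal is sent. Look at the output below, paying attention to the comments between the log lines: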

$ scrapy crawl test
... many lines ...
# First we get those two signals...
INFO: Extension, signals.spider_opened fired
INFO: Extension, signals.engine_started fired
# Then for each URL we get a request_scheduled signal
INFO: Extension, signals.request_scheduled fired
...
# when download completes we get response_downloaded
INFO: Extension, signals.response_downloaded fired
INFO: DownloaderMiddleware process_response called for example.com
# Work between response_downloaded and response_received
INFO: Extension, signals.response_received fired
INFO: SpiderMiddleware process_spider_input called for example.com
# here our parse() method gets called... and then SpiderMiddleware used
INFO: SpiderMiddleware process_spider_output called for example.com
# For every Item that goes through pipelines successfully...
INFO: Extension, signals.item_scraped fired
# For every Item that gets dropped using the DropItem exception...
INFO: Extension, signals.item_dropped fired
# If your spider throws something else...
INFO: Extension, signals.spider_error fired
# ... the above process repeats for each URL
# ... till we run out of them. then...
INFO: Extension, signals.spider_idle fired
# by hooking spider_idle you can schedule further Requests. If you don't
# the spider closes.
INFO: Closing spider (finished)
INFO: Extension, signals.spider_closed fired
# ... stats get printed
# and finally engine gets stopped.
INFO: Extension, signals.engine_stopped fired

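Having only 11 signals may seem limiting, but every default Scrapy middleware is implemented with them, so they should be sufficient. Note that for every signal except spider_idle, spider_error, request_scheduled, response_received, and response_downloaded, your handler can return a Deferred object instead of an actual value.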
