Using REST APIs in Scrapy Pipelines

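Much of the functionality we want to use in Scrapy pipelines is offered in the form of REST APIs. In the following sections, we will see how to access such functionality.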


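treq is a Python package that offers the same functionality as the Python requests package, but is intended for Twisted-based applications. It lets us perform GET, POST, and other HTTP operations easily. To install it, run pip install treq.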





$ curl http://es:9200
{
    "name" : "Living Brain",
    "cluster_name" : "elasticsearch",
    "version" : { ... },
    "tagline" : "You Know, for Search"
}

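Opening http://localhost:9200 in a browser on our machine gives the same result. If we visit http://localhost:9200/properties/property/_search, the response shows that ES made an attempt but found no index related to properties. We have now used ES's REST API.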

$ curl -XDELETE http://es:9200/properties


@defer.inlineCallbacks
def process_item(self, item, spider):
    data = json.dumps(dict(item), ensure_ascii=False).encode("utf-8")
    yield, data)
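To get a feel for what ends up in data, the sketch below serializes a made-up item in the same way. The item contents and the serialize_item helper are hypothetical; only the json.dumps(..., ensure_ascii=False).encode("utf-8") call is taken from the snippet above.

```python
import json


def serialize_item(item):
    """Serialize a dict-like item to UTF-8 encoded JSON bytes."""
    # ensure_ascii=False keeps non-ASCII characters as they are instead of
    # turning them into \uXXXX escapes; encode() then produces raw bytes.
    return json.dumps(dict(item), ensure_ascii=False).encode("utf-8")


# Hypothetical item, for illustration only
item = {"title": ["Café near Angel"], "price": [1030.0]}
payload = serialize_item(item)
```

The result is a bytes object that can be sent as the body of an HTTP request, and decoding it gives back the original structure.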





ITEM_PIPELINES = {
    'properties.pipelines.tidyup.TidyUp': 100,
    '': 800,
}
ES_PIPELINE_URL = 'http://es:9200/properties/property'


$ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=90
...
INFO: Enabled item pipelines: EsWriter...
INFO: Closing spider (closespider_itemcount)...
    'item_scraped_count': 106,


If we rerun scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=1000, the average latency rises from 0.78 s to 0.81 s, because the time spent in pipelines rises from 0.12 s to 0.15 s. Throughput stays at around 25 Items per second.

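Inserting Items into a database one at a time is not a good idea. Databases typically provide bulk-insert mechanisms that are orders of magnitude more efficient, and we should use those instead. In other words, we should either batch the inserts or perform them as a post-processing step at the end of the crawl. Many people still use Item Pipelines to insert data into databases; if you do, the right way is to use Twisted APIs rather than the usual blocking ones.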

A pipeline that geocodes using the Google Geocoding API

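Each property has an area, and we would like to geocode it, that is, find its coordinates (latitude and longitude). With these coordinates we can place each property on a map or sort properties by distance. Building this functionality ourselves would require a sophisticated database, sophisticated text matching, and sophisticated spatial computations. By using the Google Geocoding API, we avoid developing all of this from scratch. Try opening the following URL in a browser, or fetch it with curl: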

$ curl ""
{
    "results" : [
        ...
        "formatted_address" : "London, UK",
        "geometry" : {
            ...
            "location" : {
                "lat" : 51.5073509,
                "lng" : -0.1277583
            },
        "location_type" : "APPROXIMATE",
        ...
    ],
    "status" : "OK"
}


We can also reach the Google Geocoding API with treq. A few lines of code are enough to find the location of an address (see geo.py in the pipelines directory):

@defer.inlineCallbacks
def geocode(self, address):
    endpoint = 'http://web:9312/maps/api/geocode/json'
    parms = [('address', address), ('sensor', 'false')]
    response = yield treq.get(endpoint, params=parms)
    content = yield response.json()
    geo = content['results'][0]["geometry"]["location"]
    defer.returnValue({"lat": geo["lat"], "lon": geo["lng"]})

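This function first builds a URL much like the one we used before, but pointing at a fake implementation, which makes execution faster, usable offline, and more predictable. You can also use endpoint = '' to hit Google's servers directly, but keep in mind that they enforce strict request limits. The address and sensor values passed in the params argument of treq's get() are URL-encoded. treq.get() returns a Deferred, which we yield so that execution resumes when the response arrives. The second yield, on response.json(), waits until the response has been fully loaded and converted to a Python object. We then look up the location in the first result, format it as a dict, and return it with defer.returnValue(), which is the appropriate way to return a value from a method that uses inlineCallbacks. If anything goes wrong, the method raises an exception and Scrapy reports it to us.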
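The two data-handling steps of geocode() can be tried without any network access. The sketch below uses only the standard library: urlencode stands in for the encoding that treq performs internally (treq's exact output may differ), and the sample response is a trimmed-down, hand-written version of the London example shown earlier. The helper names build_query and extract_location are made up for this illustration.

```python
import json
from urllib.parse import urlencode


def build_query(address):
    """URL-encode the same (name, value) pairs that geocode() passes as params."""
    return urlencode([("address", address), ("sensor", "false")])


def extract_location(content):
    """Take the first result's coordinates and rename 'lng' to 'lon'."""
    geo = content["results"][0]["geometry"]["location"]
    return {"lat": geo["lat"], "lon": geo["lng"]}


# Hand-written sample body, modelled on the London response
sample = json.loads("""
{"results": [{"formatted_address": "London, UK",
              "geometry": {"location": {"lat": 51.5073509,
                                        "lng": -0.1277583}}}],
 "status": "OK"}
""")

query = build_query("Greenwich, London")
location = extract_location(sample)
```

Note how the comma and the space in the address are escaped in query, and how the result uses the key lon rather than Google's lng.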


item["location"] = yield self.geocode(item["address"][0])


ITEM_PIPELINES = {
    ...
    'properties.pipelines.geo.GeoPipeline': 400,


$ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=90 -L DEBUG
...
{'address': [u'Greenwich, London'],
...
'image_urls': [u'http://web:9312/images/i06.jpg'],
'location': {'lat': 51.482577, 'lon': -0.007659},
'price': [1030.0],
...

The Item now has a location field. If we switch to the real Google Geocoding API, an exception like the following shows up before long:

File "pipelines/" in geocode (content['status'], address))
Exception: Unexpected status="OVER_QUERY_LIMIT" for address="*London"


The Geocoder API documentation spells out its limits: "Users of the free API: 2500 requests per 24 hour period, 5 requests per second". Even the paid version is limited to 10 requests per second, so this discussion still applies.
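A quick back-of-the-envelope calculation with the figures quoted above shows why both limits matter: a crawler that runs at the full per-second rate uses up the free daily quota in a little over eight minutes. The function name below is made up for this illustration.

```python
def seconds_until_quota_exhausted(daily_quota, requests_per_second):
    """Time needed to spend the daily quota at a constant request rate."""
    return daily_quota / requests_per_second


# Free tier figures quoted from the documentation: 2500 per day, 5 per second
free_tier_seconds = seconds_until_quota_exhausted(2500, 5)
free_tier_minutes = free_tier_seconds / 60
```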



from twisted.internet import defer, task


class Throttler(object):
    """
    A simple throttler helps you limit the number of requests you make
    to a limited resource
    """

    def __init__(self, rate):
        """It will callback at most ``rate`` enqueued things per second"""
        self.queue = []
        self.looping_call = task.LoopingCall(self._allow_one)
        self.looping_call.start(1. / float(rate))

    def stop(self):
        """Stop the throttler"""
        self.looping_call.stop()

    def throttle(self):
        """
        Call this function to get a deferred that will become available
        in some point in the future in accordance with the throttling rate
        """
        d = defer.Deferred()
        self.queue.append(d)
        return d

    def _allow_one(self):
        """Makes deferred callbacks periodically"""
        if self.queue:
            self.queue.pop(0).callback(None)


class GeoPipeline(object):
    def __init__(self, stats):
        self.throttler = Throttler(5)  # 5 Requests per second

    def close_spider(self, spider):
        self.throttler.stop()

Just before accessing the limited resource (that is, before calling geocode() in process_item()), we need to yield the Throttler's throttle() method:

yield self.throttler.throttle()
item["location"] = yield self.geocode(item["address"][0])

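At the first yield, the code pauses and resumes once enough time has passed. For example, if 11 Deferreds are queued at that moment and the rate limit is 5 requests per second, the code resumes after about 11/5 = 2.2 s.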
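The arithmetic can be checked with a toy model that needs neither Twisted nor a real clock. SimulatedThrottler below is a hypothetical stand-in, not the Throttler class itself: it releases one queued entry per tick and reports when each entry would fire. It ignores where in the tick cycle the requests arrive, so real waits may be slightly shorter.

```python
from collections import deque


class SimulatedThrottler:
    """Toy, clock-free model of a rate limiter: one release per tick."""

    def __init__(self, rate):
        self.rate = float(rate)
        self.queue = deque()

    def throttle(self, name):
        """Queue an entry that wants to use the limited resource."""
        self.queue.append(name)

    def run(self):
        """Release entries in FIFO order; return (seconds, name) pairs."""
        released = []
        tick = 0
        while self.queue:
            tick += 1
            # The n-th tick happens n / rate seconds after the start
            released.append((tick / self.rate, self.queue.popleft()))
        return released


throttler = SimulatedThrottler(5)
for i in range(11):
    throttler.throttle("request-%d" % i)
schedule = throttler.run()
last_time, last_name = schedule[-1]
```

The eleventh request is released at the eleventh tick, which at 5 ticks per second is where the 2.2 s figure comes from.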


from twisted.internet import defer, reactor


class DeferredCache(object):
    """
    A cache that always returns a value, an error or a deferred
    """

    def __init__(self, key_not_found_callback):
        """Takes as an argument """
        self.records = {}
        self.deferreds_waiting = {}
        self.key_not_found_callback = key_not_found_callback

    @defer.inlineCallbacks
    def find(self, key):
        """
        This function either returns something directly from the cache or it
        calls ``key_not_found_callback`` to evaluate a value and return it.
        Uses deferreds to do this in a non-blocking manner.
        """
        # This is the deferred for this call
        rv = defer.Deferred()

        if key in self.deferreds_waiting:
            # We have other instances waiting for this key. Queue
            self.deferreds_waiting[key].append(rv)
        else:
            # We are the only guy waiting for this key right now.
            self.deferreds_waiting[key] = [rv]

            if key not in self.records:
                # If we don't have a value for this key we will evaluate it
                # using key_not_found_callback.
                try:
                    value = yield self.key_not_found_callback(key)

                    # If the evaluation succeeds then the action for this key
                    # is to call deferred's callback with value as an argument
                    # (using Python closures)
                    self.records[key] = lambda d: d.callback(value)
                except Exception as e:
                    # If the evaluation fails with an exception then the
                    # action for this key is to call deferred's errback with
                    # the exception as an argument (Python closures again).
                    # Python 3 unbinds ``e`` when the except block ends, so
                    # the closure captures a separate name.
                    error = e
                    self.records[key] = lambda d: d.errback(error)

            # At this point we have an action for this key in self.records
            action = self.records[key]

            # Note that due to ``yield key_not_found_callback``, many
            # deferreds might have been added in deferreds_waiting[key] in
            # the meanwhile
            # For each of the deferreds waiting for this key....
            for d in self.deferreds_waiting.pop(key):
                # ...perform the action later from the reactor thread
                reactor.callFromThread(action, d)

        value = yield rv
        defer.returnValue(value)


  • self.deferreds_waiting: a queue of Deferreds waiting for the value of a given key
  • self.records: a dict populated with key-action pairs

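If we look at the implementation of find(), we see that when the given key is not found in self.records, a predefined function is called to retrieve the value (yield self.key_not_found_callback(key)). That callback may raise an exception. How can we store either the returned value or the exception in a compact way? Since Python treats functions as first-class values, we can store small functions (lambda expressions) in self.records that call either the Deferred's callback or its errback, depending on whether an exception was raised. The value or the exception is attached to the lambda at the time it is defined. This binding of variables to a function is called a closure, and it is one of the most distinctive and powerful features of most functional programming languages.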
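The closure idea can be isolated from Twisted entirely. In the sketch below, FakeDeferred is a minimal stand-in that only records what it was fired with, and make_action and lookup are hypothetical helpers; the point is only that each stored lambda remembers the outcome that existed when it was created.

```python
class FakeDeferred:
    """Minimal stand-in for a Deferred: records what it was fired with."""

    def __init__(self):
        self.result = None
        self.error = None

    def callback(self, value):
        self.result = value

    def errback(self, error):
        self.error = error


def make_action(compute, key):
    """Run compute(key) once and return a closure that replays the outcome."""
    try:
        value = compute(key)
    except Exception as exc:
        # Keep our own reference; Python 3 unbinds 'exc' after this block
        error = exc
        return lambda d: d.errback(error)
    return lambda d: d.callback(value)


def lookup(address):
    """Hypothetical lookup that raises KeyError for unknown addresses."""
    known = {"London": {"lat": 51.5073509, "lon": -0.1277583}}
    return known[address]


# One action per key, whether the lookup succeeded or failed
records = {key: make_action(lookup, key) for key in ("London", "Atlantis")}

hit, miss = FakeDeferred(), FakeDeferred()
records["London"](hit)
records["Atlantis"](miss)
```

Both outcomes are stored in the same dict and applied in the same way, which is what keeps the cache code compact.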


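The rest of find() gives us a mechanism that prevents race conditions. If a lookup for a key is already in progress, this is recorded in the self.deferreds_waiting dict. In that case we do not call key_not_found_callback() again; instead we add the request to the queue of Deferreds waiting for that key. When key_not_found_callback() returns, the value for the key is available, and we can fire every Deferred that is waiting on it. We could call action(d) directly instead of going through reactor.callFromThread(), but then we would have to handle downstream exceptions ourselves, and we might end up with unnecessarily long Deferred chains.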
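The same "only the first caller does the lookup" behaviour can be demonstrated with the standard library's asyncio instead of Twisted. This is a different framework and a much smaller design than DeferredCache, so treat it as an analogy rather than a port; SharedLookupCache and slow_geocode are names invented for this sketch.

```python
import asyncio


class SharedLookupCache:
    """Concurrent lookups for the same key share a single task."""

    def __init__(self, loader):
        self.loader = loader  # coroutine function: key -> value
        self.tasks = {}       # key -> Task holding the value or the exception

    async def find(self, key):
        if key not in self.tasks:
            # The first caller starts the lookup; later callers reuse it
            self.tasks[key] = asyncio.ensure_future(self.loader(key))
        return await self.tasks[key]


calls = []


async def slow_geocode(address):
    """Hypothetical loader that records every real lookup it performs."""
    calls.append(address)
    await asyncio.sleep(0.01)
    return {"address": address}


async def demo():
    cache = SharedLookupCache(slow_geocode)
    # Five lookups for the same key are started before any of them finishes
    return await asyncio.gather(*(cache.find("London") for _ in range(5)))


results = asyncio.run(demo())
```

All five callers receive a result, yet slow_geocode runs only once.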


import traceback

import treq
from twisted.internet import defer


class GeoPipeline(object):
    """A pipeline that geocodes addresses using Google's API"""

    @classmethod
    def from_crawler(cls, crawler):
        """Create a new instance and pass it crawler's stats object"""
        return cls(crawler.stats)

    def __init__(self, stats):
        """Initialize empty cache and stats object"""
        self.stats = stats
        self.cache = DeferredCache(self.cache_key_not_found_callback)
        self.throttler = Throttler(5)  # 5 Requests per second

    def close_spider(self, spider):
        """Stop the throttler"""
        self.throttler.stop()

    @defer.inlineCallbacks
    def geocode(self, address):
        """
        This method makes a call to Google's geocoding API. You shouldn't
        call this more than 5 times per second
        """
        # The url for this API
        #endpoint = ''
        endpoint = 'http://web:9312/maps/api/geocode/json'

        # Do the call
        parms = [('address', address), ('sensor', 'false')]
        response = yield treq.get(endpoint, params=parms)

        # Decode the response as json
        content = yield response.json()

        # If the status isn't ok, return it as a string
        if content['status'] != 'OK':
            raise Exception('Unexpected status="%s" for address="%s"' %
                            (content['status'], address))

        # Extract the address and geo-point and set item's fields
        geo = content['results'][0]["geometry"]["location"]

        # Return the final value
        defer.returnValue({"lat": geo["lat"], "lon": geo["lng"]})

    @defer.inlineCallbacks
    def cache_key_not_found_callback(self, address):
        """
        This method makes an API call while respecting throttling limits.
        It also retries attempts that fail due to limits.
        """
        self.stats.inc_value('geo_pipeline/misses')

        while True:
            # Wait enough to adhere to throttling policies
            yield self.throttler.throttle()

            # Do the API call
            try:
                value = yield self.geocode(address)
                defer.returnValue(value)

                # Success
                break
            except Exception as e:
                if 'status="OVER_QUERY_LIMIT"' in str(e):
                    # Retry in this case
                    self.stats.inc_value('geo_pipeline/retries')
                    continue
                # Propagate the rest
                raise

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        """
        Pipeline's main method. Uses inlineCallbacks to do
        asynchronous REST requests
        """
        if "location" in item:
            # Set by previous step (spider or pipeline). Don't do anything
            # apart from increasing stats
            self.stats.inc_value('geo_pipeline/already_set')
            defer.returnValue(item)
            return

        # The item has to have the address field set
        assert ("address" in item) and (len(item["address"]) > 0)

        # Extract the address from the item.
        try:
            item["location"] = yield self.cache.find(item["address"][0])
        except:
            self.stats.inc_value('geo_pipeline/errors')
            print(traceback.format_exc())

        # Return the item for the next stage
        defer.returnValue(item)


ITEM_PIPELINES = {
    'properties.pipelines.tidyup.TidyUp': 100,
    '': 800,
    # DISABLE 'properties.pipelines.geo.GeoPipeline': 400,
    'properties.pipelines.geo2.GeoPipeline': 400,
}


$ scrapy crawl easy -s CLOSESPIDER_ITEMCOUNT=1000
...
Scraped... 15.8 items/s, avg latency: 1.74 s and avg time in pipelines: 0.94 s
Scraped... 32.2 items/s, avg latency: 1.76 s and avg time in pipelines: 0.97 s
Scraped... 25.6 items/s, avg latency: 0.76 s and avg time in pipelines: 0.14 s
...
: Dumping Scrapy stats:...
    'geo_pipeline/misses': 35,
    'item_scraped_count': 1019,

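As we can see, latency is high right after startup because the crawler has to populate the cache, but it then returns to its previous value. The final stats show 35 cache misses, which is exactly the number of distinct locations used in the demo dataset. Evidently, there were 1019 - 35 = 984 cache hits. If we use the real Google API and slightly raise the request rate, for example by changing Throttler(5) to Throttler(10), retries are recorded in geo_pipeline/retries. If an error occurs, for example because the API cannot find a location, an exception is raised and counted in geo_pipeline/errors. If an Item's location has already been set, this shows up in geo_pipeline/already_set. Finally, to check the property records in ES, open http://localhost:9200/properties/property/_search in a browser. Some entries have a location value, for example {..."location": {"lat": 51.5269736, "lon": -0.0667204}...}, just as we expect.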


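Now that we have locations, we can sort search results by distance. The following HTTP POST request returns properties with "Angel" in the title, sorted by their distance from the point {51.54, -0.19}: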

$ curl http://es:9200/properties/property/_search -d '{
    "query" : {"term" : { "title" : "angel" } },
    "sort": [{"_geo_distance": {
        "location": {"lat": 51.54, "lon": -0.19},
        "order": "asc",
        "unit": "km",
        "distance_type": "plane"}}]}'

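There is one problem, though. Running this now produces the error message "failed to find mapper for [location] for geo distance based sort". It means that the location field does not have a suitable type for spatial operations. To set the right type, we have to override the default manually. First, save the auto-detected mapping to a file: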

$ curl 'http://es:9200/properties/_mapping/property' > property.txt


"location":{"properties":{"lat":{"type":"double"},"lon": {"type":"double"}}}


"location": {"type": "geo_point"}


$ curl -XDELETE 'http://es:9200/properties'
$ curl -XPUT 'http://es:9200/properties'
$ curl -XPUT 'http://es:9200/properties/_mapping/property' --data @property.txt
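Changing the location entry from the auto-detected lat/lon doubles to geo_point can also be scripted rather than done by hand. The sketch below works on an in-memory dict; the use_geo_point helper and the sample mapping fragment are hypothetical, and the real contents of property.txt will contain more fields and may be nested differently.

```python
import json


def use_geo_point(mapping, field="location"):
    """Return a copy of mapping where every `field` entry is a geo_point."""
    if isinstance(mapping, dict):
        return {
            key: {"type": "geo_point"} if key == field
            else use_geo_point(value, field)
            for key, value in mapping.items()
        }
    if isinstance(mapping, list):
        return [use_geo_point(value, field) for value in mapping]
    return mapping


# Hypothetical fragment of an auto-detected mapping
detected = json.loads("""
{"property": {"properties": {
    "title": {"type": "string"},
    "location": {"properties": {"lat": {"type": "double"},
                                "lon": {"type": "double"}}}}}}
""")

fixed = use_geo_point(detected)
```

The function builds a new structure and leaves the input untouched, so the original mapping stays available for comparison.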


