Source Code Analysis Reference: Queue

Source: Internet. Editor: 程序博客网. Date: 2024/05/21 17:16

queue.py

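This file implements several container classes. They interact with Redis frequently and use the serializer defined in picklecompat, covered earlier. The containers are largely the same; the difference is that one is a queue, one is a stack, and one is a priority queue. The scheduler object instantiates one of the three to schedule requests. For example, if SpiderQueue is chosen as the scheduling queue type, requests are scheduled first in, first out, whereas with SpiderStack they are scheduled last in, first out.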
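The difference between the queue and the stack comes down to which end of the Redis list is popped. The sketch below illustrates this with `FakeRedisList`, a hypothetical in-memory stand-in written for this example (it is not a Redis client); only the `lpush`/`rpop`/`lpop` names mirror the Redis list commands used in queue.py.

```python
from collections import deque


class FakeRedisList:
    """Hypothetical in-memory stand-in for a single Redis list."""

    def __init__(self):
        self._items = deque()

    def lpush(self, value):
        # Like Redis LPUSH: insert at the head (left end) of the list.
        self._items.appendleft(value)

    def rpop(self):
        # Like Redis RPOP: remove from the tail; None when the list is empty.
        return self._items.pop() if self._items else None

    def lpop(self):
        # Like Redis LPOP: remove from the head; None when the list is empty.
        return self._items.popleft() if self._items else None


def drain(pop):
    """Call pop() until it returns None and collect the results in order."""
    out = []
    while True:
        item = pop()
        if item is None:
            return out
        out.append(item)


queue = FakeRedisList()
stack = FakeRedisList()
for name in ["a", "b", "c"]:
    queue.lpush(name)
    stack.lpush(name)

fifo_order = drain(queue.rpop)  # SpiderQueue style: LPUSH + RPOP
lifo_order = drain(stack.lpop)  # SpiderStack style: LPUSH + LPOP
```

Both containers push with LPUSH; SpiderQueue pops from the opposite end (RPOP) and SpiderStack pops from the same end (LPOP).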

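As the SpiderQueue implementation shows, its push function works like that of the other containers. The request being pushed is first converted to a dict by Scrapy's request_to_dict (a Request object is fairly complex, with both methods and attributes, and is hard to serialize directly), then serialized to a byte string with the serializer from picklecompat, and finally stored in Redis under a specific key (the key is the same for every instance of a given spider). A call to pop reads the value stored under that key (a list) and takes the element that was pushed earliest, which gives first-in, first-out order.

These container classes are what the scheduler uses to hold the requests it schedules. A scheduler is instantiated on every host and corresponds one-to-one with a spider, so in a distributed run there are multiple instances of one spider and multiple instances of one scheduler on different hosts. Because all schedulers use the same container type, all containers connect to the same Redis server, and all of them read and write data under a key built from the spider name plus "queue", the spider instances on different hosts share a single request pool. This is how the distributed crawlers get unified scheduling.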
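The serialize-then-store flow and the shared key can be sketched without a Redis server. In the example below, `shared_store` and `TinyQueue` are hypothetical names invented for illustration, a plain dict stands in for the Redis server, request dicts are written by hand instead of coming from request_to_dict, and the standard `pickle` module stands in for picklecompat (whose contents are not shown in this section).

```python
import pickle
from collections import deque

# Hypothetical stand-in for the one shared Redis server: key -> list of bytes.
shared_store = {}


class TinyQueue:
    """Simplified FIFO queue keyed by spider name (illustration only)."""

    def __init__(self, store, spider_name, key="%(spider)s:queue"):
        self.store = store
        # Same key templating as Base.__init__: the spider name fills the key.
        self.key = key % {"spider": spider_name}

    def push(self, request_dict):
        # Serialize the dict to bytes, then add it at the head of the list.
        data = pickle.dumps(request_dict)
        self.store.setdefault(self.key, deque()).appendleft(data)

    def pop(self):
        # Take the oldest entry from the tail and turn it back into a dict.
        items = self.store.get(self.key)
        if not items:
            return None
        return pickle.loads(items.pop())


# Two "hosts" run the same spider, so both queues resolve to the same key.
host_a = TinyQueue(shared_store, "myspider")
host_b = TinyQueue(shared_store, "myspider")

host_a.push({"url": "http://example.com/1", "method": "GET"})
host_a.push({"url": "http://example.com/2", "method": "GET"})

first = host_b.pop()   # host B receives the request host A pushed first
second = host_a.pop()  # host A receives the next one
```

Because the key depends only on the spider name, any instance connected to the same store sees the same pending requests, regardless of which instance pushed them.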

from scrapy.utils.reqser import request_to_dict, request_from_dict

from . import picklecompat


class Base(object):
    """Per-spider queue/stack base class"""

    def __init__(self, server, spider, key, serializer=None):
        """Initialize per-spider redis queue.

        Parameters:
            server -- redis connection
            spider -- spider instance
            key -- key for this queue (e.g. "%(spider)s:queue")
        """
        if serializer is None:
            # Backward compatibility.
            # TODO: deprecate pickle.
            serializer = picklecompat
        if not hasattr(serializer, 'loads'):
            raise TypeError("serializer does not implement 'loads' function: %r"
                            % serializer)
        if not hasattr(serializer, 'dumps'):
            # The message takes a single %r so it matches the single argument.
            raise TypeError("serializer does not implement 'dumps' function: %r"
                            % serializer)

        self.server = server
        self.spider = spider
        self.key = key % {'spider': spider.name}
        self.serializer = serializer

    def _encode_request(self, request):
        """Encode a request object"""
        obj = request_to_dict(request, self.spider)
        return self.serializer.dumps(obj)

    def _decode_request(self, encoded_request):
        """Decode a request previously encoded"""
        obj = self.serializer.loads(encoded_request)
        return request_from_dict(obj, self.spider)

    def __len__(self):
        """Return the length of the queue"""
        raise NotImplementedError

    def push(self, request):
        """Push a request"""
        raise NotImplementedError

    def pop(self, timeout=0):
        """Pop a request"""
        raise NotImplementedError

    def clear(self):
        """Clear queue/stack"""
        self.server.delete(self.key)


class SpiderQueue(Base):
    """Per-spider FIFO queue"""

    def __len__(self):
        """Return the length of the queue"""
        return self.server.llen(self.key)

    def push(self, request):
        """Push a request"""
        self.server.lpush(self.key, self._encode_request(request))

    def pop(self, timeout=0):
        """Pop a request"""
        if timeout > 0:
            data = self.server.brpop(self.key, timeout)
            if isinstance(data, tuple):
                data = data[1]
        else:
            data = self.server.rpop(self.key)
        if data:
            return self._decode_request(data)


class SpiderPriorityQueue(Base):
    """Per-spider priority queue abstraction using redis' sorted set"""

    def __len__(self):
        """Return the length of the queue"""
        return self.server.zcard(self.key)

    def push(self, request):
        """Push a request"""
        data = self._encode_request(request)
        score = -request.priority
        # We don't use zadd method as the order of arguments change depending on
        # whether the class is Redis or StrictRedis, and the option of using
        # kwargs only accepts strings, not bytes.
        self.server.execute_command('ZADD', self.key, score, data)

    def pop(self, timeout=0):
        """
        Pop a request
        timeout not support in this queue class
        """
        # use atomic range/remove using multi/exec
        pipe = self.server.pipeline()
        pipe.multi()
        pipe.zrange(self.key, 0, 0).zremrangebyrank(self.key, 0, 0)
        results, count = pipe.execute()
        if results:
            return self._decode_request(results[0])


class SpiderStack(Base):
    """Per-spider stack"""

    def __len__(self):
        """Return the length of the stack"""
        return self.server.llen(self.key)

    def push(self, request):
        """Push a request"""
        self.server.lpush(self.key, self._encode_request(request))

    def pop(self, timeout=0):
        """Pop a request"""
        if timeout > 0:
            data = self.server.blpop(self.key, timeout)
            if isinstance(data, tuple):
                data = data[1]
        else:
            data = self.server.lpop(self.key)
        if data:
            return self._decode_request(data)


__all__ = ['SpiderQueue', 'SpiderPriorityQueue', 'SpiderStack']
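SpiderPriorityQueue stores each request with the score `-request.priority`. A Redis sorted set is ordered by ascending score, so the member at rank 0 is the one with the highest priority. A minimal sketch of that ordering follows, assuming an in-memory stand-in: `FakeSortedSet`, `push_with_priority`, and `pop_lowest` are hypothetical names made up for this illustration, and byte strings stand in for serialized requests.

```python
class FakeSortedSet:
    """Hypothetical in-memory stand-in for one Redis sorted set."""

    def __init__(self):
        # member -> score; members are unique, as in a Redis sorted set.
        self._scores = {}

    def zadd(self, score, member):
        self._scores[member] = score

    def pop_lowest(self):
        # Mirrors ZRANGE key 0 0 followed by ZREMRANGEBYRANK key 0 0:
        # take the member with the lowest score, ties broken by member order.
        if not self._scores:
            return None
        member = min(self._scores, key=lambda m: (self._scores[m], m))
        del self._scores[member]
        return member


def push_with_priority(zset, data, priority):
    # Same idea as SpiderPriorityQueue.push: negate the priority so that
    # a higher priority ends up with a lower score and therefore a lower rank.
    zset.zadd(-priority, data)


pending = FakeSortedSet()
push_with_priority(pending, b"low", 0)
push_with_priority(pending, b"high", 10)
push_with_priority(pending, b"mid", 5)

pop_order = [pending.pop_lowest() for _ in range(3)]
```

In the real class the read and the removal are issued inside one MULTI/EXEC pipeline so that two hosts cannot pop the same member; the stand-in above is single-threaded and does not model that.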

