Python: a multi-threaded crawler with Queue vs. a multi-process crawler with JoinableQueue


A multi-threaded crawler using Queue, and a multi-process crawler using JoinableQueue.

I'll use my own CSDN blog as the example (facepalm — does this count as inflating my own page views? Haha).

This is the multi-threaded crawler; the code is fairly simple and commented:

# -*-coding:utf-8-*-
"""ayou"""
import requests
from requests.exceptions import HTTPError, ConnectionError
from bs4 import BeautifulSoup, NavigableString
import Queue
import threading
import time

# AyouBlog class
# get_page_url collects the URLs of every blog post
class AyouBlog():
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0",
        }
        self.s = requests.session()

    def get_page_url(self):
        urls_set = set()
        url = "http://blog.csdn.net/u013055678?viewmode=contents"
        try:
            html = self.s.get(url, headers=self.headers)
        except HTTPError as e:
            print(str(e))
            return str(e)
        except ConnectionError as e:
            print(str(e))
            return str(e)
        try:
            soup = BeautifulSoup(html.content, "lxml")
            page_div = soup.find_all("span", {"class": "link_title"})
            for url in page_div:
                a_url = "http://blog.csdn.net" + url.find("a").attrs["href"]
                urls_set.add(a_url)
        except AttributeError as e:
            print(str(e))
            return str(e)
        return urls_set

# ThreadUrl subclasses threading.Thread
# run() takes URLs out of the queue one by one, opens each page and prints the post title
class ThreadUrl(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.s = requests.session()
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0",
        }

    def run(self):
        while not self.queue.empty():
            host = self.queue.get()
            try:
                html = self.s.get(host, headers=self.headers)
            except HTTPError as e:
                print(str(e))
                return str(e)
            except ConnectionError as e:
                print(str(e))
                return str(e)
            try:
                soup = BeautifulSoup(html.content, "lxml")
                class_div = soup.find("span", {"class": "link_title"})
                print((class_div.text).strip())
            except AttributeError as e:
                print(str(e))
                return str(e)
            except NavigableString as e:
                print(str(e))
                return str(e)
            self.queue.task_done()

def main():
    # create the queue
    queue = Queue.Queue()
    # put the URLs into the queue
    p = AyouBlog()
    for url in p.get_page_url():
        print(url)
        queue.put(url)
    # start the worker threads
    for i in range(7):
        t = ThreadUrl(queue)
        t.setDaemon(True)
        t.start()
    # block until every queued URL has been marked done
    queue.join()

if __name__ == "__main__":
    start = time.time()
    main()
    print("Elapsed Time: %s" % (time.time() - start))
The resulting elapsed time:


Now let's see how long it takes with only one thread.

Just change range(7) in the main function to range(1).

The run takes about 11 seconds.

More threads does not automatically mean faster — the data set is small, after all. You will find that 4, 5, 6 or 7 threads all take roughly the same, around 3 seconds.


With a small amount of data the code above is fine. But at work I was crawling over three thousand pages, more than sixty thousand records, with two queues and two thread groups of 20 threads each. The data does get crawled, but then the whole process just hangs there: no error is raised and it never exits. Yet when I crawl only a few hundred pages (a few thousand records), it exits normally without any errors. I don't know what is going on, and in the end I had to rewrite it with Scrapy. If you know how to solve this, please tell me, thanks.
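One pattern that might avoid the hang — only a sketch under assumptions, not something verified on the full sixty-thousand-record crawl: in the run() above, any exception makes the worker return after it has already pulled a URL off the queue, so task_done() is never called for that item and queue.join() can then wait forever; checking queue.empty() before get() can also misfire once several workers compete for the last items. The worker below pairs every successful get() with exactly one task_done(); the class name RobustWorker and the handle callable are made up for the illustration.

# A defensive worker loop: a sketch only, assuming the same Queue-based setup as above.
# Every successful get() is paired with exactly one task_done(), even when the
# request or the parsing raises, so queue.join() can always return.
import Queue          # Python 2 module, matching the code above
import threading

class RobustWorker(threading.Thread):      # hypothetical name, not from the original code
    def __init__(self, queue, handle):
        threading.Thread.__init__(self)
        self.queue = queue
        self.handle = handle               # callable that fetches and parses one URL

    def run(self):
        while True:
            try:
                # Do not test empty() first -- another worker may grab the last
                # item in between.  A short timeout lets the thread stop once
                # the queue stays empty.
                host = self.queue.get(timeout=3)
            except Queue.Empty:
                break
            try:
                self.handle(host)
            except Exception as e:         # never skip task_done() on failure
                print(str(e))
            finally:
                self.queue.task_done()

With this shape the main thread can still call queue.join() exactly as before; whether it actually cures the hang on the big crawl I can't say.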



Below is the multi-process crawler code, with comments:

# -*-coding:utf-8-*-
"""ayou"""
import requests
from requests.exceptions import HTTPError, ConnectionError
from bs4 import BeautifulSoup, NavigableString
from multiprocessing import Process, JoinableQueue
import time

# AyouBlog class
# get_page_url collects the URLs of every blog post
class AyouBlog():
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0",
        }
        self.s = requests.session()

    def get_page_url(self):
        urls_set = set()
        url = "http://blog.csdn.net/u013055678?viewmode=contents"
        try:
            html = self.s.get(url, headers=self.headers)
        except HTTPError as e:
            print(str(e))
            return str(e)
        except ConnectionError as e:
            print(str(e))
            return str(e)
        try:
            soup = BeautifulSoup(html.content, "lxml")
            page_div = soup.find_all("span", {"class": "link_title"})
            for url in page_div:
                a_url = "http://blog.csdn.net" + url.find("a").attrs["href"]
                urls_set.add(a_url)
        except AttributeError as e:
            print(str(e))
            return str(e)
        return urls_set

# ThreadUrl subclasses Process
# run() takes URLs out of the JoinableQueue one by one, opens each page and prints the post title
class ThreadUrl(Process):
    def __init__(self, queue):
        super(ThreadUrl, self).__init__()
        self.queue = queue
        self.s = requests.session()
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) Gecko/20100101 Firefox/48.0",
        }

    def run(self):
        while not self.queue.empty():
            host = self.queue.get()
            try:
                html = self.s.get(host, headers=self.headers)
            except HTTPError as e:
                print(str(e))
                return str(e)
            except ConnectionError as e:
                print(str(e))
                return str(e)
            try:
                soup = BeautifulSoup(html.content, "lxml")
                class_div = soup.find("span", {"class": "link_title"})
                print((class_div.text).strip())
            except AttributeError as e:
                print(str(e))
                return str(e)
            except NavigableString as e:
                print(str(e))
                return str(e)
            self.queue.task_done()

def main():
    # list of worker processes
    worker_list = list()
    # create the queue
    queue = JoinableQueue()
    # put the URLs into the queue
    p = AyouBlog()
    for url in p.get_page_url():
        print(url)
        queue.put(url)
    # start the worker processes
    for i in range(3):
        t = ThreadUrl(queue)
        worker_list.append(t)
        t.start()
    # block until every queued URL has been marked done
    queue.join()
    # shut the workers down (is this actually redundant?)
    for w in worker_list:
        w.terminate()

if __name__ == "__main__":
    start = time.time()
    main()
    print("Elapsed Time: %s" % (time.time() - start))
The resulting elapsed time:


The processes run quite fast, but the overhead of spawning processes is considerably higher.

It is the same problem mentioned above: with little data this code works fine, but with sixty-odd thousand records, using two JoinableQueues and two process groups of 5 processes each, the data does get crawled, yet all the processes simply hang there with no error and never exit. When I crawl only a few hundred pages (a few thousand records), everything exits normally without any errors. I have no idea what is going on.

I set breakpoints and couldn't spot anything wrong. I don't know whether the Queue / JoinableQueue never actually becomes empty at the end, or whether it does empty and the threads/processes keep fighting over it and then die; I can't figure it out.

If anyone knows what is happening and how to fix it, please let me know. Thanks.
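For what it's worth, here is a sentinel ("poison pill") sketch for the multiprocessing version — again only a guess at a sturdier structure, not a verified fix. The names RobustProcess, handle and SENTINEL are made up for the illustration. empty() on a multiprocessing queue is documented as unreliable, so instead of polling it, each worker stops when it receives a sentinel value, and task_done() is called for every item it takes (including the sentinel), so queue.join() can return.

# Sentinel-based workers for a JoinableQueue: a sketch only, assuming the same setup as above.
from multiprocessing import Process, JoinableQueue

SENTINEL = None                            # hypothetical marker meaning "no more work"

class RobustProcess(Process):              # hypothetical name, not from the original code
    def __init__(self, queue, handle):
        super(RobustProcess, self).__init__()
        self.queue = queue
        self.handle = handle               # callable that fetches and parses one URL

    def run(self):
        while True:
            host = self.queue.get()        # block until an item (or the sentinel) arrives
            try:
                if host is SENTINEL:
                    break                  # the finally clause still runs task_done()
                self.handle(host)
            except Exception as e:         # never skip task_done() on failure
                print(str(e))
            finally:
                self.queue.task_done()

# In main(), one sentinel per worker would be queued after the real URLs, e.g.:
#     for i in range(3):
#         queue.put(SENTINEL)
# then queue.join(), and finally w.join() on each worker instead of w.terminate().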



