WSWP (Web Scraping with Python) Notes, Part 5: Concurrent Crawlers
Source: Internet | Editor: 程序博客网 | Time: 2024/06/05 01:05
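The previous notes built a link crawler, a data-scraping crawler, and a cache. All of them download pages serially: a new download starts only after the previous one has finished. That is tolerable for small sites, but far too slow for large ones.
This note builds concurrent crawlers step by step, first with multiple threads and then with multiple processes.
The starting point is Alexa's list of the one million most popular websites, which can be downloaded as a single zip archive from http://s3.amazonaws.com/alexa-static/top-1m.csv.zip.
The callback below reads the contents of that archive and extracts the URLs: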
# alexaCB.py
import csv
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile

# absolute import, matching the one used in processCrawler.py
from mongoCache import MongoCache


class AlexaCallback:
    def __init__(self, maxUrls=1000):
        self.maxUrls = maxUrls
        self.seedUrl = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'

    def __call__(self, url, html):
        if url == self.seedUrl:
            urls = []
            cache = MongoCache()
            # html must be the raw bytes of the zip archive, not decoded text
            with ZipFile(BytesIO(html)) as zf:
                csvFilename = zf.namelist()[0]
                # zf.open() yields bytes; wrap it so csv.reader receives text
                rows = csv.reader(TextIOWrapper(zf.open(csvFilename), encoding='utf-8'))
                for _, website in rows:
                    if 'http://' + website not in cache:
                        urls.append('http://' + website)
                        if len(urls) == self.maxUrls:
                            break
            return urls
To use it with the crawler developed earlier, pass an AlexaCallback instance as the scrapeCallback argument.
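The zip-handling step of the callback can be tried without downloading anything. The following is a minimal sketch, assuming Python 3; the helper names build_sample_zip and extract_urls and the three sample rows are made up for illustration and are not part of the original code. It builds a tiny archive in memory, shaped like top-1m.csv.zip, and parses it the same way the callback does.

```python
import csv
from io import BytesIO, TextIOWrapper
from zipfile import ZipFile


def build_sample_zip():
    """Build a tiny in-memory archive shaped like top-1m.csv.zip (made-up rows)."""
    buffer = BytesIO()
    with ZipFile(buffer, 'w') as zf:
        zf.writestr('top-1m.csv', '1,example.com\n2,example.org\n3,example.net\n')
    return buffer.getvalue()


def extract_urls(zip_bytes, max_urls=2):
    """Return up to max_urls 'http://' URLs from the first CSV in the archive."""
    urls = []
    with ZipFile(BytesIO(zip_bytes)) as zf:
        csv_filename = zf.namelist()[0]
        with zf.open(csv_filename) as raw:
            # the archive member is a byte stream; decode it before parsing
            for _, website in csv.reader(TextIOWrapper(raw, encoding='utf-8')):
                urls.append('http://' + website)
                if len(urls) == max_urls:
                    break
    return urls


sample_urls = extract_urls(build_sample_zip())
print(sample_urls)
```

The cache lookup is left out here so the sketch stays self-contained; in the real callback each URL is also checked against MongoCache before it is added.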
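Multithreaded crawler
Multithreaded programming is relatively straightforward in Python. We can keep a queue structure similar to the one in the link crawler developed earlier, but run the crawl loop in several threads so that the links are downloaded in parallel. The code is as follows: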
# threadCrawler.py
import time
import threading
import urllib.parse

from downloader import Downloader

SLEEP_TIME = 3


def threadCrawler(seedUrl, delay=5, cache=None, scrapeCallback=None, userAgent='wswp',
                  proxies=None, numRetries=1, maxThreads=10, timeout=60):
    """
    Crawl this website in multiple threads
    """
    crawlQueue = [seedUrl]
    # the URLs that have been seen
    seen = set([seedUrl])
    downloader = Downloader(cache=cache, delay=delay, userAgent=userAgent,
                            proxies=proxies, numRetries=numRetries, timeout=timeout)

    def processQueue():
        while True:
            try:
                url = crawlQueue.pop()
            except IndexError:
                # the queue is empty, so this thread can stop
                break
            else:
                html = downloader(url)
                if scrapeCallback:
                    try:
                        links = scrapeCallback(url, html) or []
                    except Exception as e:
                        print('Error in callback for: {}:{}'.format(url, e))
                    else:
                        for link in links:
                            link = normalize(seedUrl, link)
                            if link not in seen:
                                seen.add(link)
                                crawlQueue.append(link)

    # wait for all download threads to finish
    threads = []
    while threads or crawlQueue:
        # the crawl is still active: drop the threads that have stopped
        # (rebuilding the list avoids removing items while iterating over it)
        threads = [thread for thread in threads if thread.is_alive()]
        while len(threads) < maxThreads and crawlQueue:
            # can start some more threads
            thread = threading.Thread(target=processQueue)
            # daemon threads let the main thread exit on ctrl-c
            thread.daemon = True
            thread.start()
            threads.append(thread)
        # sleep temporarily so the CPU can focus execution on other threads
        time.sleep(SLEEP_TIME)


def normalize(seedUrl, link):
    """
    Normalize this URL by removing the fragment and adding the domain
    """
    link, _ = urllib.parse.urldefrag(link)
    return urllib.parse.urljoin(seedUrl, link)
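While there are URLs to crawl, the main loop keeps creating threads until the pool reaches maxThreads. A thread stops early if the queue holds no more URLs when it tries to fetch the next one.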
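The thread lifecycle described above can be observed without any network access. This is a minimal sketch under stated assumptions: FAKE_SITE is a made-up link graph that stands in for the downloader and the callback, and the lock is an addition for clarity that the original code does not use.

```python
import threading

# Hypothetical link graph standing in for real pages: page -> links found on it.
FAKE_SITE = {
    '/': ['/a', '/b'],
    '/a': ['/b', '/c'],
    '/b': ['/'],
    '/c': [],
}


def crawl(seed, max_threads=3):
    crawl_queue = [seed]
    seen = {seed}
    visited = []
    lock = threading.Lock()  # assumption: added here, not in the original

    def process_queue():
        while True:
            try:
                url = crawl_queue.pop()
            except IndexError:
                break  # queue empty: this worker stops early
            links = FAKE_SITE.get(url, [])  # stands in for download + callback
            with lock:
                visited.append(url)
                for link in links:
                    if link not in seen:
                        seen.add(link)
                        crawl_queue.append(link)

    threads = []
    while threads or crawl_queue:
        # keep only the workers that are still running
        threads = [t for t in threads if t.is_alive()]
        while len(threads) < max_threads and crawl_queue:
            t = threading.Thread(target=process_queue, daemon=True)
            t.start()
            threads.append(t)
        for t in threads:
            t.join(timeout=0.01)  # short wait instead of a fixed sleep
    return visited


pages = crawl('/')
print(sorted(pages))
```

Each page is visited exactly once because a link is queued only the first time it is seen. The order of visits can differ between runs, which is why the result is sorted before printing.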
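Multiprocess crawler
To improve performance further, the threaded crawler is extended to support multiple processes. So far the crawl queue has lived in local memory, so other processes cannot contribute to the same crawl. To fix this, the queue has to move to storage that other processes can access. Storing the queue separately also means that crawlers on different servers can cooperate on the same crawl. For a more robust queue, consider a dedicated message-passing tool such as Celery. Here MongoDB is reused as the separate store. The MongoDB-backed queue is implemented as follows: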
# mongoQueue.py
from datetime import datetime, timedelta

from pymongo import MongoClient, errors


class MongoQueue:
    # possible states of a download
    OUTSTANDING, PROCESSING, COMPLETE = range(3)

    def __init__(self, client=None, timeout=300):
        """
        :param client: a pymongo MongoClient; a default local client is created if None
        :param timeout: seconds after which an unfinished job is treated as stalled
        """
        self.client = MongoClient() if client is None else client
        self.db = self.client.cache
        self.timeout = timeout

    def __bool__(self):
        """
        Return True if there are more jobs to process
        (Python 3 uses __bool__; __nonzero__ is the Python 2 name)
        """
        record = self.db.crawlQueue.find_one(
            {'status': {'$ne': self.COMPLETE}}
        )
        return True if record else False

    def push(self, url):
        """
        Add a new URL to the queue if it does not exist yet
        """
        try:
            self.db.crawlQueue.insert_one({'_id': url, 'status': self.OUTSTANDING})
        except errors.DuplicateKeyError:
            # the URL is already in the queue
            pass

    def pop(self):
        """
        Get an outstanding URL from the queue and set its status to processing.
        If the queue is empty a KeyError exception is raised.
        """
        record = self.db.crawlQueue.find_one_and_update(
            {'status': self.OUTSTANDING},
            {'$set': {'status': self.PROCESSING, 'timestamp': datetime.now()}}
        )
        if record:
            return record['_id']
        else:
            self.repair()
            raise KeyError

    def peek(self):
        record = self.db.crawlQueue.find_one({'status': self.OUTSTANDING})
        if record:
            return record['_id']

    def complete(self, url):
        self.db.crawlQueue.update_one({'_id': url}, {'$set': {'status': self.COMPLETE}})

    def repair(self):
        """
        Release stalled jobs
        """
        record = self.db.crawlQueue.find_one_and_update(
            {
                'timestamp': {'$lt': datetime.now() - timedelta(seconds=self.timeout)},
                'status': {'$ne': self.COMPLETE}
            },
            {'$set': {'status': self.OUTSTANDING}}
        )
        if record:
            print('Released:', record['_id'])

    def clear(self):
        self.db.crawlQueue.drop()
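The code above defines three states for a URL: OUTSTANDING, PROCESSING, and COMPLETE. A newly added URL is OUTSTANDING; when it is popped from the queue for download it becomes PROCESSING; when the download finishes it becomes COMPLETE. Much of the code deals with URLs that are popped from the queue but never finish, for example because the process handling them was killed. To guard against this, the queue takes a timeout parameter, 300 seconds by default. In the repair method, a URL whose processing has taken longer than timeout is assumed to have failed, and its status is reset to OUTSTANDING so that it can be processed again.
The multiprocess crawler is implemented as follows: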
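The state transitions can be traced without a running MongoDB server. The sketch below is an illustration only: InMemoryQueue is a made-up dict-backed stand-in, the now parameter is added so that time can be simulated, and repair only looks at PROCESSING jobs, which is a simplification of the MongoDB query.

```python
from datetime import datetime, timedelta

OUTSTANDING, PROCESSING, COMPLETE = range(3)


class InMemoryQueue:
    """Dict-backed stand-in for the MongoDB queue (illustration only)."""

    def __init__(self, timeout=300):
        self.timeout = timeout
        self.jobs = {}  # url -> {'status': ..., 'timestamp': ...}

    def push(self, url):
        # plays the role of the unique _id: an existing URL is left untouched
        self.jobs.setdefault(url, {'status': OUTSTANDING, 'timestamp': None})

    def pop(self, now=None):
        now = now or datetime.now()
        for url, job in self.jobs.items():
            if job['status'] == OUTSTANDING:
                job.update(status=PROCESSING, timestamp=now)
                return url
        # nothing outstanding: release one stalled job, then report "empty"
        self.repair(now)
        raise KeyError('no outstanding url')

    def complete(self, url):
        self.jobs[url]['status'] = COMPLETE

    def repair(self, now=None):
        now = now or datetime.now()
        cutoff = now - timedelta(seconds=self.timeout)
        for url, job in self.jobs.items():
            if job['status'] == PROCESSING and job['timestamp'] < cutoff:
                job['status'] = OUTSTANDING
                return url
        return None


queue = InMemoryQueue(timeout=300)
queue.push('http://example.com')
queue.push('http://example.com')        # duplicate, ignored
start = datetime(2024, 1, 1, 12, 0, 0)
first = queue.pop(now=start)            # OUTSTANDING -> PROCESSING
# Pretend the worker died. 301 seconds later a pop finds nothing outstanding,
# so it runs repair() and raises KeyError, as the MongoDB version does.
try:
    queue.pop(now=start + timedelta(seconds=301))
    stalled = False
except KeyError:
    stalled = True
second = queue.pop(now=start + timedelta(seconds=302))  # the released job
queue.complete(second)                  # PROCESSING -> COMPLETE
```

The stalled URL becomes available again only on the pop after the one that triggered repair, so a worker that gets a KeyError should not assume the crawl is over.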
# processCrawler.py
import time
import urllib.parse
import threading
import multiprocessing

from mongoCache import MongoCache
from mongoQueue import MongoQueue
from downloader import Downloader

SLEEP_TIME = 1


def threadedCrawler(seedUrl, delay=5, cache=None, scrapeCallback=None, userAgent='wswp',
                    proxies=None, numRetries=1, maxThreads=10, timeout=60):
    """
    Crawl using multiple threads, with the queue shared through MongoDB
    """
    crawlQueue = MongoQueue()
    # NOTE: every worker process runs these two lines, so a process that starts
    # late can wipe URLs already queued by the others
    crawlQueue.clear()
    crawlQueue.push(seedUrl)
    downloader = Downloader(cache=cache, delay=delay, userAgent=userAgent,
                            proxies=proxies, numRetries=numRetries, timeout=timeout)

    def processQueue():
        while True:
            try:
                # pop() marks the URL as being processed
                url = crawlQueue.pop()
            except KeyError:
                # currently no URLs to process
                break
            else:
                html = downloader(url)
                if scrapeCallback:
                    try:
                        links = scrapeCallback(url, html) or []
                    except Exception as e:
                        print('Error in callback for: {}:{}'.format(url, e))
                    else:
                        for link in links:
                            # add this new link to the queue
                            crawlQueue.push(normalize(seedUrl, link))
                crawlQueue.complete(url)

    # wait for all download threads to finish
    threads = []
    while threads or crawlQueue:
        # drop the threads that have stopped
        threads = [thread for thread in threads if thread.is_alive()]
        while len(threads) < maxThreads and crawlQueue.peek():
            # can start some more threads
            thread = threading.Thread(target=processQueue)
            thread.daemon = True
            thread.start()
            threads.append(thread)
        time.sleep(SLEEP_TIME)


def processCrawler(args, **kwargs):
    numCpus = multiprocessing.cpu_count()
    print('Starting {} processes'.format(numCpus))
    processes = []
    for i in range(numCpus):
        p = multiprocessing.Process(target=threadedCrawler, args=[args], kwargs=kwargs)
        p.start()
        processes.append(p)
    # wait for processes to complete
    for p in processes:
        p.join()


def normalize(seedUrl, link):
    link, _ = urllib.parse.urldefrag(link)
    return urllib.parse.urljoin(seedUrl, link)
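The multiprocess crawler replaces Python's built-in list with the new MongoDB-backed queue. The queue handles duplicate URLs internally, so the separate seen set is no longer needed. Finally, complete() is called once a URL has been processed, to record that it was handled successfully.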