How to Build a Distributed Crawler: The Basics

Source: Internet  Published by: vue.js和jquery  Editor: 程序博客网  Time: 2024/05/29 08:27

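The previous post covered the basics of Celery. This post walks step by step through using Celery to build a distributed crawler. The target this time is the official Celery documentation.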

First, create a directory named distributedspider, then create a file workers.py inside it with the following content:

from celery import Celery

app = Celery(
    'crawl_task',
    include=['tasks'],
    broker='redis://223.129.0.190:6379/1',
    backend='redis://223.129.0.190:6379/2',
)

# The official docs recommend JSON as the message serialization format
app.conf.update(
    CELERY_TIMEZONE='Asia/Shanghai',
    CELERY_ENABLE_UTC=True,
    CELERY_ACCEPT_CONTENT=['json'],
    CELERY_TASK_SERIALIZER='json',
    CELERY_RESULT_SERIALIZER='json',
)

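The code above initializes the Celery instance. The include argument lists the modules to import when the Celery app is initialized, that is, the files containing the functions registered as remotely callable tasks. Next we write the task function. Create a file tasks.py with the following content: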

import requests
from bs4 import BeautifulSoup

from workers import app


@app.task
def crawl(url):
    print('Crawling URL {}'.format(url))
    resp_text = requests.get(url).text
    soup = BeautifulSoup(resp_text, 'html.parser')
    # Return the text of the first <h1> element on the page
    return soup.find('h1').text

Its job is simple: fetch the given URL and extract the text of the page's h1 element.
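One caveat with the task above: soup.find('h1') returns None when a page has no h1 element, so .text would raise an AttributeError on such a page. The sketch below illustrates the same "text of the first h1" extraction with that case handled. To stay runnable without third-party packages, it swaps BeautifulSoup for the standard library's html.parser; FirstH1Parser and extract_h1 are hypothetical names introduced here, not part of the original code.

```python
from html.parser import HTMLParser


class FirstH1Parser(HTMLParser):
    """Collects the text inside the first <h1> element of a document."""

    def __init__(self):
        super().__init__()
        self.in_h1 = False   # True while the parser is inside the first <h1>
        self.found = False   # True once a first <h1> start tag has been seen
        self.chunks = []     # text fragments gathered inside that <h1>

    def handle_starttag(self, tag, attrs):
        # Only the first <h1> is tracked; later ones are ignored.
        if tag == 'h1' and not self.found:
            self.found = True
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == 'h1':
            self.in_h1 = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <em>) still arrives here.
        if self.in_h1:
            self.chunks.append(data)


def extract_h1(html):
    """Return the first <h1> text, or None when the page has no <h1>."""
    parser = FirstH1Parser()
    parser.feed(html)
    parser.close()
    if not parser.found:
        return None
    return ''.join(parser.chunks).strip()
```

With BeautifulSoup, a comparable guard would be to check the result of soup.find('h1') for None before reading .text.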

Finally, create a file task_dispatcher.py with the following content:

from workers import app

url_list = [
    'http://docs.celeryproject.org/en/latest/getting-started/introduction.html',
    'http://docs.celeryproject.org/en/latest/getting-started/brokers/index.html',
    'http://docs.celeryproject.org/en/latest/getting-started/first-steps-with-celery.html',
    'http://docs.celeryproject.org/en/latest/getting-started/next-steps.html',
    'http://docs.celeryproject.org/en/latest/getting-started/resources.html',
    'http://docs.celeryproject.org/en/latest/userguide/application.html',
    'http://docs.celeryproject.org/en/latest/userguide/tasks.html',
    'http://docs.celeryproject.org/en/latest/userguide/canvas.html',
    'http://docs.celeryproject.org/en/latest/userguide/workers.html',
    'http://docs.celeryproject.org/en/latest/userguide/daemonizing.html',
    'http://docs.celeryproject.org/en/latest/userguide/periodic-tasks.html'
]


def manage_crawl_task(urls):
    for url in urls:
        app.send_task('tasks.crawl', args=(url,))


if __name__ == '__main__':
    manage_crawl_task(url_list)
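To make the roles of the three files more concrete, here is a toy, in-process model of the pattern. A registry maps task names to functions (roughly the role of include=['tasks'] and @app.task), a dispatcher enqueues (name, args) messages (roughly the role of app.send_task), and worker threads pull messages and run the named function. This is only an illustration built on the standard library's queue and threading modules and says nothing about how Celery is implemented internally. The names registry, task, send_task, worker, and run are hypothetical stand-ins, and fake_crawl replaces the real HTTP request.

```python
import queue
import threading

registry = {}                 # task name -> function
task_queue = queue.Queue()    # stand-in for the Redis broker
results = {}                  # stand-in for the result backend
results_lock = threading.Lock()


def task(name):
    """Register a function under a name, loosely like @app.task."""
    def decorator(func):
        registry[name] = func
        return func
    return decorator


@task('tasks.crawl')
def fake_crawl(url):
    # Placeholder for the real fetch-and-parse step; no network access.
    return 'title of ' + url


def send_task(name, args):
    """Enqueue a message; the sender only needs the task's name."""
    task_queue.put((name, args))


def worker():
    """Run queued tasks until a None sentinel arrives, then exit."""
    while True:
        message = task_queue.get()
        if message is None:
            break
        name, args = message
        result = registry[name](*args)
        with results_lock:
            results[args[0]] = result


def run(urls, num_workers=3):
    """Start workers, dispatch one task per URL, and wait for completion."""
    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        send_task('tasks.crawl', (url,))
    for _ in threads:
        task_queue.put(None)  # one sentinel per worker
    for t in threads:
        t.join()
    return dict(results)
```

In the real setup the workers run as separate processes, possibly on other machines, and the dispatcher only sends messages through the broker. That is why task_dispatcher.py can refer to the task by the string 'tasks.crawl' without importing tasks.py itself.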