Scrapy Distributed Application Learning Notes (1)


1. Create the project. Run the command scrapy startproject webbot, which generates the following directories and files:

  1. scrapy.cfg: the project configuration file
  2. webbot/: the project's Python source module
  3. webbot/items.py: defines the item classes used to hold the scraped data
  4. webbot/pipelines.py: defines the pipeline classes that clean and process the data (a minimal sketch follows this list)
  5. webbot/settings.py: the project settings file
  6. webbot/spiders/: holds the spiders you develop (you can define more than one)
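As a concrete example of what goes into pipelines.py, here is a minimal cleaning pipeline. This is only a sketch: the WebbotPipeline name and the whitespace-stripping logic are illustrative assumptions, not part of the generated skeleton (which is an empty stub).

# webbot/pipelines.py -- hypothetical example
class WebbotPipeline(object):
    def process_item(self, item, spider):
        # XPath extraction returns lists of strings; strip surrounding whitespace
        for field, values in item.items():
            if isinstance(values, list):
                item[field] = [v.strip() for v in values if hasattr(v, 'strip')]
        return item

To activate it, the class would be registered in webbot/settings.py:

ITEM_PIPELINES = {'webbot.pipelines.WebbotPipeline': 300}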

2. Deploy the project to the scrapyd service

scrapy deploy

Output like the following indicates success:

{"status": "ok", "project": "webbot", "version": "1417871576", "spiders": 0}

List the configured deploy targets (scrapy deploy -l):

default              http://localhost:6800/
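This target comes from the [deploy] section of scrapy.cfg. Assuming the default local scrapyd instance, the relevant part of the file would look like this:

[deploy]
url = http://localhost:6800/
project = webbot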

3. Define the items

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class WebbotItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    date = scrapy.Field()
    author = scrapy.Field()
    content = scrapy.Field()
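Items behave like dictionaries; a quick interactive check (hypothetical session) shows how the fields are set and read:

>>> from webbot.items import WebbotItem
>>> item = WebbotItem()
>>> item['title'] = [u'first post']
>>> item['title']
[u'first post']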

4. Write the first spider

    1. The fields to scrape were already defined in items.py in step 3 above.

    2. Next, write your spider under the spiders/ directory:

# -*- coding: utf-8 -*-
import scrapy
from webbot.items import WebbotItem
from scrapy.selector import Selector

class QqSpider(scrapy.Spider):
    name = "qq"
    start_urls = [
        "http://mycq.qq.com/t-910767-1.htm",
        "http://mycq.qq.com/t-946048-1.htm",
    ]

    def parse(self, response):
        sel = Selector(response)
        items = []
        # each forum post is a div under #postlist whose id starts with "post_"
        sites = sel.xpath('//div[@id="postlist"]/div[contains(@id, "post_")]')
        for site in sites:
            item = WebbotItem()
            # note: the notes use the same XPath for title and content
            item['title'] = site.xpath('.//td[@class="t_f"]/text()').extract()
            item['date'] = site.xpath('.//span[@class="tm"]/em/text() | .//span[@class="gray mlm"]/text()').extract()
            item['author'] = site.xpath('.//a[@class="lblue f14"]/text()').extract()
            item['content'] = site.xpath('.//td[@class="t_f"]/text()').extract()
            items.append(item)
        return items
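Before running the full crawl, the XPath expressions can be tried interactively in the Scrapy shell (the URL is from the notes above; the page structure may have changed since):

scrapy shell "http://mycq.qq.com/t-910767-1.htm"
>>> response.xpath('//div[@id="postlist"]/div[contains(@id, "post_")]')
>>> response.xpath('//a[@class="lblue f14"]/text()').extract()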

    3. Test run:

scrapy crawl qq

    If the scraped items appear in the crawl log without errors, the spider works.
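To inspect the output more conveniently, Scrapy's built-in feed export can write the scraped items to a file:

scrapy crawl qq -o items.json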


5. Publish the spider to scrapyd so that scrapyd can schedule and run it

     1. Add the project:

scrapy deploy -p webbot

      2. Output like the following indicates success:

Packing version 1417947207
Deploying to project "webbot" in http://localhost:6800/addversion.json
Server response (200):
{"status": "ok", "project": "webbot", "version": "1417947207", "spiders": 1}
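You can double-check the upload through scrapyd's JSON API (responses abbreviated; exact fields vary with the scrapyd version):

curl http://localhost:6800/listprojects.json
# {"status": "ok", "projects": ["webbot"]}
curl "http://localhost:6800/listspiders.json?project=webbot"
# {"status": "ok", "spiders": ["qq"]}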

       3. Schedule the spider on the scrapyd service:

curl http://localhost:6800/schedule.json -d project=webbot -d spider=qq
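scrapyd answers with a job id; the jobid value below is illustrative:

{"status": "ok", "jobid": "6487ec79947edab326d6db28a2d86511e8247444"}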

       4. Open http://localhost:6800/ in a browser to see the jobs you just scheduled and their results.
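The same job information is available without a browser through the listjobs.json endpoint:

curl "http://localhost:6800/listjobs.json?project=webbot"
# {"status": "ok", "pending": [], "running": [], "finished": [...]}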


