Crawling and Storing Wandoujia App Data with Scrapy + MongoDB


Based on Python 2.7, this post uses Scrapy to crawl Wandoujia app data, including fields such as the app name, size, and download count, and stores it in a MongoDB database. The steps are as follows:

1. Create the Scrapy project and write the spider

Create a new crawler project with the scrapy command:

scrapy startproject ChannelCrawler
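The command generates a project skeleton roughly like the following (the exact contents vary a little between Scrapy versions, but this is the layout the rest of this post refers to):

ChannelCrawler/
    scrapy.cfg            # deployment configuration
    ChannelCrawler/       # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # downloader / spider middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider code lives here
            __init__.py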

After the project has been generated, define the Item describing the scraped data structure in items.py. For the four fields to be crawled, items.py looks like this:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class AppInfo(scrapy.Item):
    name = scrapy.Field()
    size = scrapy.Field()
    downloadTimes = scrapy.Field()
    description = scrapy.Field()
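For reference, an AppInfo item behaves like a dictionary restricted to the declared fields. The short sketch below (made-up values, purely to illustrate the interface) shows how the spider and pipeline will use it:

from ChannelCrawler.items import AppInfo

item = AppInfo()
item['name'] = u'example app'        # placeholder value, for illustration only
item['downloadTimes'] = u'1000'
print(item['name'])                  # fields read back like dict entries
print(dict(item))                    # plain dict, as the pipeline will store it
# item['author'] = u'x' would raise KeyError: only declared fields are accepted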

Then write the spider in __init__.py under the spiders folder. The code is shown below; it first crawls all app categories and then crawls the detailed data of each app:

# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.
import scrapy

from ChannelCrawler.items import AppInfo


class wandoujiaAppCrawler(scrapy.Spider):
    name = "wandoujiaAppCrawler"

    def start_requests(self):
        urls = [
            "http://www.wandoujia.com/apps",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parseCategory)

    # Crawl all app categories
    def parseCategory(self, response):
        for pageUrl in response.css('li.parent-cate a::attr(href)').extract():
            yield scrapy.Request(url=pageUrl, callback=self.parse)

    def parse(self, response):
        for app in response.css('li.card'):
            item = AppInfo()
            item['name'] = app.css('div.app-desc h2 a::text').extract_first()
            item['downloadTimes'] = app.css('div.app-desc div.meta span::text').extract_first()
            # relative XPath (note the leading dot) so it only searches inside this card
            item['size'] = app.xpath('.//div[@class="app-desc"]/div/span[3]/text()').extract_first()
            yield item
        # Follow the pagination links
        next_pages = response.css('div.page-wp a::attr(href)').extract()
        for page in next_pages:
            yield scrapy.Request(url=page, callback=self.parse)
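Before running the full crawl, the selectors used in parse can be tried out interactively in the Scrapy shell. A quick sanity-check session looks roughly like this (the page structure may of course have changed since this was written):

scrapy shell "http://www.wandoujia.com/apps"
# then, at the shell prompt:
response.css('li.parent-cate a::attr(href)').extract()[:3]    # category links
app = response.css('li.card')[0]                               # first app card
app.css('div.app-desc h2 a::text').extract_first()             # its name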

The spider collects four fields from each app entry: the app name, download count, size, and description, as shown on the Wandoujia page:
[Figure: Wandoujia app list page, showing the name, download count, size, and description of each app]

2. Configuring a proxy

Crawling can be done through a proxy by adding a proxy downloader middleware. First, write the proxy middleware in middlewares.py:

# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://proxy.yourproxy:8001"

Then register the middleware in Scrapy's settings file by adding the following to settings.py:

DOWNLOADER_MIDDLEWARES = {
    'ChannelCrawler.middlewares.ProxyMiddleware': 100,
}

That completes the proxy setup. For a more detailed walkthrough of proxy configuration, see my other post on configuring proxies in Scrapy.
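If the proxy requires authentication, the same kind of middleware can also attach a Proxy-Authorization header. A minimal sketch, assuming a hypothetical user:pass credential and the same proxy address as above (Python 2.7 string handling, to match the rest of the post):

# -*- coding: utf-8 -*-
import base64


class AuthProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://proxy.yourproxy:8001"
        # 'user:pass' is a placeholder credential, replace with your own
        auth = base64.b64encode('user:pass')
        request.headers['Proxy-Authorization'] = 'Basic ' + auth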

3. Configuring MongoDB

To store the data, first configure the database settings in settings.py, then write the MongoDB code in a pipeline. Start by adding the following to settings.py:

ITEM_PIPELINES = {
    'ChannelCrawler.pipelines.MongoPipeline': 300,
}

MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_COLLECTION = "wandoujiaApps"
MONGODB_DB = "scrapyTest"

This registers the MongoDB pipeline, and the settings beneath it give the MongoDB server address, port, database name, and collection name to write to. Next, add the MongoDB code to pipelines.py:

# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.conf import settings

# class ChannelcrawlerPipeline(object):
#     def process_item(self, item, spider):
#         return item


class MongoPipeline(object):
    def __init__(self):
        # Connect to MongoDB using the values configured in settings.py
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        # Convert each item to a plain dict and write it to the collection
        self.collection.insert(dict(item))
        return item
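As an aside, scrapy.conf and pymongo's collection.insert() are both deprecated in newer releases. On a current Scrapy/pymongo stack, a roughly equivalent pipeline would read the same settings through from_crawler and call insert_one; the following is a sketch of that variant, not what this post actually ran:

import pymongo


class MongoPipeline(object):
    def __init__(self, server, port, db, collection):
        self.server = server
        self.port = port
        self.db_name = db
        self.collection_name = collection

    @classmethod
    def from_crawler(cls, crawler):
        # Read the same MONGODB_* settings as above, via the crawler object
        s = crawler.settings
        return cls(s.get('MONGODB_SERVER'), s.getint('MONGODB_PORT'),
                   s.get('MONGODB_DB'), s.get('MONGODB_COLLECTION'))

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.server, self.port)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item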

With this in place, the spider automatically writes the crawled items to MongoDB as it runs. Run scrapy crawl wandoujiaAppCrawler to start the spider and store the data in MongoDB.
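Once the crawl has finished, the stored documents can be checked with a few lines of pymongo, using the same database and collection names configured above:

import pymongo

client = pymongo.MongoClient("localhost", 27017)
collection = client["scrapyTest"]["wandoujiaApps"]
print(collection.count())                 # how many apps were stored
for doc in collection.find().limit(3):    # peek at a few documents
    print(doc)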
