Scrapy + MongoDB: Crawling Wandoujia App Data and Storing It in MongoDB
Based on Python 2.7, this post uses Scrapy to crawl fields such as the name, size, and download count of Wandoujia apps and stores them in a MongoDB database. The steps are as follows:
1. Create the Scrapy project and write the spider
Create a new crawler project with the scrapy command:
scrapy startproject ChannelCrawler
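For orientation, startproject generates a skeleton roughly like the following (exact files can vary slightly across Scrapy versions):

ChannelCrawler/
    scrapy.cfg            # deploy configuration
    ChannelCrawler/       # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider/downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider code lives here
            __init__.py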
After the project is generated, define the Item that models the scraped data in items.py. With the four fields to crawl, items.py looks like this:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class AppInfo(scrapy.Item):
    name = scrapy.Field()
    size = scrapy.Field()
    downloadTimes = scrapy.Field()
    description = scrapy.Field()
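An Item behaves like a dict with a fixed set of keys. As a quick sanity check (a minimal sketch, not part of the project code; the sample values are placeholders):

# -*- coding: utf-8 -*-
from ChannelCrawler.items import AppInfo

item = AppInfo()
item['name'] = u'SomeApp'         # declared fields accept assignment
item['downloadTimes'] = u'1000+'
print dict(item)                  # items convert to plain dicts for storage
# item['author'] = 'x'            # would raise KeyError: field not declared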
Then write the spider in __init__.py under the spiders folder. The code is below; the flow is to first crawl all of the app categories and then crawl the detailed app data:
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.

import scrapy
from ChannelCrawler.items import AppInfo


class wandoujiaAppCrawler(scrapy.Spider):
    name = "wandoujiaAppCrawler"

    def start_requests(self):
        urls = [
            "http://www.wandoujia.com/apps",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parseCategory)

    # Crawl all app category pages
    def parseCategory(self, response):
        for pageUrl in response.css('li.parent-cate a::attr(href)').extract():
            yield scrapy.Request(url=pageUrl, callback=self.parse)

    def parse(self, response):
        for app in response.css('li.card'):
            item = AppInfo()
            item['name'] = app.css('div.app-desc h2 a::text').extract_first()
            item['downloadTimes'] = app.css('div.app-desc div.meta span::text').extract_first()
            # relative XPath (./) so the lookup stays inside this app card
            item['size'] = app.xpath('./div[@class="app-desc"]/div/span[3]/text()').extract_first()
            yield item
        # Crawl the next pages
        next_pages = response.css('div.page-wp a::attr(href)').extract()
        for page in next_pages:
            yield scrapy.Request(url=page, callback=self.parse)
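Before running the full crawl, the CSS and XPath selectors can be tried out interactively in scrapy shell (a sketch; the Wandoujia page structure may have changed since this was written, so the selectors are illustrative):

scrapy shell "http://www.wandoujia.com/apps"
>>> response.css('li.parent-cate a::attr(href)').extract()[:3]   # category links
>>> app = response.css('li.card')[0]
>>> app.css('div.app-desc h2 a::text').extract_first()           # app name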
The main content crawled is four fields from the app pages: the app name, download count, size, and description, as they appear on the Wandoujia listing page.
2. Configuring a proxy
A proxy middleware can be used for the crawl. The first configuration step is to write the proxy middleware in middlewares.py, whose code is as follows:
# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/spider-middleware.html

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://proxy.yourproxy:8001"
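If the proxy requires authentication, the credentials can be attached via the Proxy-Authorization header. A minimal sketch, assuming hypothetical user/password values:

# -*- coding: utf-8 -*-
import base64

class AuthProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://proxy.yourproxy:8001"
        # "user:password" is a placeholder -- substitute real credentials
        auth = base64.b64encode("user:password")
        request.headers['Proxy-Authorization'] = 'Basic ' + auth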
Then register the middleware in the Scrapy configuration by adding the following to settings.py (the value 100 is the middleware's order in the chain; lower values run closer to the engine):
DOWNLOADER_MIDDLEWARES = {
    'ChannelCrawler.middlewares.ProxyMiddleware': 100,
}
The proxy is now configured. For more detail on proxy configuration, see my other post on how to configure proxies in Scrapy.
3. Configuring the MongoDB database
To set up data storage, first configure the database in settings.py, then write the MongoDB code in pipelines.py. Start by adding the following to settings.py:
ITEM_PIPELINES = {
    'ChannelCrawler.pipelines.MongoPipeline': 300,
}
MONGODB_SERVER = "localhost"
MONGODB_PORT = 27017
MONGODB_COLLECTION = "wandoujiaApps"
MONGODB_DB = "scrapyTest"
This registers the MongoDB pipeline, and the settings below it specify the MongoDB database and collection names the data will be written to. Then add the MongoDB pipeline to pipelines.py, with code as follows:
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo
from scrapy.conf import settings


class MongoPipeline(object):
    def __init__(self):
        connection = pymongo.MongoClient(
            settings['MONGODB_SERVER'],
            settings['MONGODB_PORT']
        )
        db = connection[settings['MONGODB_DB']]
        self.collection = db[settings['MONGODB_COLLECTION']]

    def process_item(self, item, spider):
        self.collection.insert(dict(item))
        return item
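Note that scrapy.conf and collection.insert() are deprecated in newer releases; under Scrapy >= 1.0 and pymongo >= 3 an equivalent pipeline would look roughly like this (a sketch, not the original post's code):

# -*- coding: utf-8 -*-
import pymongo


class MongoPipeline(object):
    def __init__(self, server, port, db_name, collection_name):
        self.server = server
        self.port = port
        self.db_name = db_name
        self.collection_name = collection_name

    @classmethod
    def from_crawler(cls, crawler):
        # read the same settings without the deprecated scrapy.conf import
        s = crawler.settings
        return cls(s['MONGODB_SERVER'], s['MONGODB_PORT'],
                   s['MONGODB_DB'], s['MONGODB_COLLECTION'])

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.server, self.port)
        self.collection = self.client[self.db_name][self.collection_name]

    def process_item(self, item, spider):
        # insert_one replaces the insert() call deprecated in pymongo 3
        self.collection.insert_one(dict(item))
        return item

    def close_spider(self, spider):
        self.client.close()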
With this in place, the spider automatically stores the crawled content in MongoDB as it runs. Execute scrapy crawl wandoujiaAppCrawler to run the spider and save the data to MongoDB.
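Once a crawl has finished, the stored documents can be checked from Python with pymongo, using the database and collection names configured above (a quick verification sketch):

# -*- coding: utf-8 -*-
import pymongo

client = pymongo.MongoClient("localhost", 27017)
coll = client["scrapyTest"]["wandoujiaApps"]
print coll.count()       # number of stored app records
print coll.find_one()    # one sample document with name/downloadTimes/size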