Scrapy crawler: scraping weather data and storing it as txt, JSON, and other formats
1. Create the Scrapy project
scrapy startproject weather
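This command generates the standard project skeleton. The exact contents vary slightly across Scrapy versions, but it looks roughly like this:

weather/
    scrapy.cfg
    weather/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py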
2. Create the spider file
scrapy genspider wuhanSpider wuhan.tianqi.com
3. The files in the Scrapy project
(1) items.py
import scrapy

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    cityDate = scrapy.Field()
    week = scrapy.Field()
    img = scrapy.Field()
    temperature = scrapy.Field()
    weather = scrapy.Field()
    wind = scrapy.Field()
(2) wuhanSpider.py
# -*- coding: utf-8 -*-
import scrapy
from weather.items import WeatherItem

class WuhanspiderSpider(scrapy.Spider):
    name = "wuHanSpider"
    allowed_domains = ["tianqi.com"]
    # build one start URL per city
    citys = ['wuhan', 'shanghai']
    start_urls = []
    for city in citys:
        start_urls.append('http://' + city + '.tianqi.com/')

    def parse(self, response):
        # each div.tqshow1 block holds one day's forecast
        subSelector = response.xpath('//div[@class="tqshow1"]')
        items = []
        for sub in subSelector:
            item = WeatherItem()
            cityDates = ''
            for cityDate in sub.xpath('./h3//text()').extract():
                cityDates += cityDate
            item['cityDate'] = cityDates
            item['week'] = sub.xpath('./p//text()').extract()[0]
            item['img'] = sub.xpath('./ul/li[1]/img/@src').extract()[0]
            temps = ''
            for temp in sub.xpath('./ul/li[2]//text()').extract():
                temps += temp
            item['temperature'] = temps
            item['weather'] = sub.xpath('./ul/li[3]//text()').extract()[0]
            item['wind'] = sub.xpath('./ul/li[4]//text()').extract()[0]
            items.append(item)
        return items
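Before writing the pipelines, the XPath expressions used in parse() can be checked interactively with Scrapy's shell. A quick sketch; tianqi.com's markup may have changed since this post was written, in which case the tqshow1 selector simply returns nothing:

scrapy shell 'http://wuhan.tianqi.com/'
>>> response.xpath('//div[@class="tqshow1"]')
>>> response.xpath('//div[@class="tqshow1"]/h3//text()').extract()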
(3) The pipelines, which process the items returned by the spider. Three variants are used: pipelines.py (txt), pipelines2json.py (JSON), and pipelines2mysql.py (MySQL), matching the entries in settings.py below.

import time
import os.path
import urllib2

# pipelines.py: store the scraped data in a txt file
class WeatherPipeline(object):
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.txt'
        with open(fileName, 'a') as fp:
            fp.write(item['cityDate'].encode('utf8') + '\t')
            fp.write(item['week'].encode('utf8') + '\t')
            imgName = os.path.basename(item['img'])
            fp.write(imgName + '\t')
            if not os.path.exists(imgName):
                # download the weather icon; use a separate handle so the
                # txt file handle fp is not shadowed and closed early
                with open(imgName, 'wb') as imgFp:
                    response = urllib2.urlopen(item['img'])
                    imgFp.write(response.read())
            fp.write(item['temperature'].encode('utf8') + '\t')
            fp.write(item['weather'].encode('utf8') + '\t')
            fp.write(item['wind'].encode('utf8') + '\n\n')
        time.sleep(1)
        return item
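Note that this pipeline targets Python 2 (urllib2, writing encoded bytes to a text-mode file). Under Python 3 the icon download would use urllib.request instead; a minimal sketch of just that step, reusing imgName and item from process_item above:

import os.path
from urllib.request import urlopen

# Python 3 equivalent of the icon download step
if not os.path.exists(imgName):
    with open(imgName, 'wb') as imgFp:
        imgFp.write(urlopen(item['img']).read())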
import time
import json
import codecs

# pipelines2json.py: store the scraped data in a JSON file
class WeatherPipeline(object):
    def process_item(self, item, spider):
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + '.json'
        with codecs.open(fileName, 'a', encoding='utf8') as fp:
            line = json.dumps(dict(item), ensure_ascii=False) + '\n'
            fp.write(line)
        return item
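For JSON output specifically, a custom pipeline is not strictly required: Scrapy's built-in feed export can serialize the items returned by the spider directly, for example:

scrapy crawl wuHanSpider -o weather.json

In newer Scrapy versions, setting FEED_EXPORT_ENCODING = 'utf-8' in settings.py keeps the Chinese text readable instead of escaped.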
import MySQLdb
import os.path

# pipelines2mysql.py: store the scraped data in a MySQL database
class WeatherPipeline(object):
    def process_item(self, item, spider):
        cityDate = item['cityDate'].encode('utf8')
        week = item['week'].encode('utf8')
        img = os.path.basename(item['img'])
        temperature = item['temperature'].encode('utf8')
        weather = item['weather'].encode('utf8')
        wind = item['wind'].encode('utf8')
        conn = MySQLdb.connect(host='localhost', port=3306, user='crawlUSER',
                               passwd='crawl123', db='scrapyDB', charset='utf8')
        cur = conn.cursor()
        cur.execute("INSERT INTO weather(cityDate,week,img,temperature,weather,wind) "
                    "VALUES (%s,%s,%s,%s,%s,%s)",
                    (cityDate, week, img, temperature, weather, wind))
        cur.close()
        conn.commit()
        conn.close()
        return item
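This pipeline assumes the scrapyDB database and a weather table already exist; the original post does not show the schema. A one-off setup sketch matching the six inserted columns (the column types are assumptions) could look like this:

import MySQLdb

# create the table the pipeline INSERTs into; column names come from the
# INSERT statement above, the types are guesses
conn = MySQLdb.connect(host='localhost', port=3306, user='crawlUSER',
                       passwd='crawl123', db='scrapyDB', charset='utf8')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS weather (
        id INT AUTO_INCREMENT PRIMARY KEY,
        cityDate VARCHAR(100),
        week VARCHAR(50),
        img VARCHAR(100),
        temperature VARCHAR(50),
        weather VARCHAR(50),
        wind VARCHAR(100)
    )
""")
conn.commit()
cur.close()
conn.close()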
(4) settings.py, which decides which pipeline classes process the scraped data
BOT_NAME = 'weather'

SPIDER_MODULES = ['weather.spiders']
NEWSPIDER_MODULE = 'weather.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'weather (+http://www.yourdomain.com)'

#### user add
ITEM_PIPELINES = {
    'weather.pipelines.WeatherPipeline': 1,
    'weather.pipelines2json.WeatherPipeline': 2,
    'weather.pipelines2mysql.WeatherPipeline': 3,
}
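One more note on settings: the time.sleep(1) in the txt pipeline slows the crawl down by stalling item processing. Scrapy's own request throttling could be configured here instead; the value below is arbitrary:

# throttle at the downloader level rather than sleeping inside a pipeline
DOWNLOAD_DELAY = 1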
(5) Run the spider
scrapy crawl wuHanSpider
(6) Results
1. The txt data
2. The JSON data
3. The data stored in the MySQL database