Scrapy -- Crawling the Weather URLs for Every Region in China


Create a Scrapy project:

scrapy startproject get_url
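
The startproject command scaffolds a project layout roughly like the one below (the exact set of generated files depends on the Scrapy version); the spiders/ directory is where the next step puts the new spider:

get_url/
    scrapy.cfg
    get_url/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py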

Create a new spider inside the project:

/scrapy_project/get_url/get_url/spiders$ scrapy genspider myspider weather.com.cn
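
genspider produces a spider skeleton roughly like the following (the exact template varies between Scrapy versions); the next step replaces its start_urls and parse() with the real logic:

# -*- coding: utf-8 -*-
import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['weather.com.cn']
    start_urls = ['http://weather.com.cn/']

    def parse(self, response):
        pass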

Edit myspider.py so that its contents look like this:

# -*- coding: utf-8 -*-
import os

import scrapy


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['weather.com.cn']
    start_urls = []
    base_url = 'http://www.weather.com.cn/weather/'

    # Generate the URLs to crawl.
    # There are 34 provincial-level regions in total
    # (municipalities, provincial capitals and special administrative regions).
    for i in range(1, 35, 1):
        # Municipalities and special administrative regions
        if i < 5:
            # Districts within the municipality or special administrative region
            for j in range(1, 20, 1):
                num_str = str(101000000 + i*10000 + j*100)
                start_urls.append(base_url + num_str + '.shtml')
        else:
            # j is the number of cities under a province;
            # Sichuan and Guangdong have the most, with 21 each.
            for j in range(1, 23, 1):
                # k is the number of counties under a city;
                # Hebei>Baoding has the most, with 26.
                for k in range(1, 27, 1):
                    num_str = str(101000000 + i*10000 + j*100 + k)
                    start_urls.append(base_url + num_str + '.shtml')

    def parse(self, response):
        if response.status == 200 and len(response.body) > 1000:
            url = response.url
            city_code = url[url.rfind('/')+1:url.rfind('.')]
            # Codes below 101050100 are municipalities or special administrative
            # regions; 101320000 is Hong Kong, 101330000 is Macau.
            if (int(city_code) < 101050100) or (101320000 < int(city_code) < 101340000):
                city = response.xpath('//div[@class="crumbs fl"]/a/text()').extract()[0]
                region = response.xpath('//div[@class="crumbs fl"]/span/text()').extract()[1]
                city_name = city + '>' + region
            else:
                province = response.xpath('//div[@class="crumbs fl"]/a/text()').extract()[0]
                city = response.xpath('//div[@class="crumbs fl"]/a/text()').extract()[1]
                region = response.xpath('//div[@class="crumbs fl"]/span/text()').extract()[2]
                city_name = province + '>' + city + '>' + region
            with open(r'out.txt', 'a+') as write_file:
                write_file.write('\'' + city_name + '\':\'' + url + '\',' + os.linesep)
            return
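
The nine-digit number in each URL is simply assembled from the loop indices. As a quick sanity check of the formula (purely illustrative arithmetic, the values come from the loops above):

# City code formula used above: 101000000 + i*10000 + j*100 (+ k for county pages),
# where i is the provincial-level region, j the city and k the county.
code = 101000000 + 1*10000 + 1*100   # i=1, j=1: Beijing's urban districts
print(code)                          # 101010100
print('http://www.weather.com.cn/weather/' + str(code) + '.shtml')
# matches the first entry shown in out.txt below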

Run the spider:

/scrapy_project/get_url/get_url/spiders$ scrapy runspider myspider.py

When the crawl finishes, an out.txt file appears in the same directory as myspider.py, with contents like this:

'北京>城区':'http://www.weather.com.cn/weather/101010100.shtml',
'北京>通州':'http://www.weather.com.cn/weather/101010600.shtml',
'北京>顺义':'http://www.weather.com.cn/weather/101010400.shtml',
'北京>朝阳':'http://www.weather.com.cn/weather/101010300.shtml',
'北京>怀柔':'http://www.weather.com.cn/weather/101010500.shtml',
......

With this mapping from regions to URLs in hand, you can move on to crawling the actual weather data, for example by first loading out.txt back into a dict, as sketched below.
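
A minimal sketch for reading the mapping back into Python, assuming out.txt contains exactly the 'name':'url', lines shown above (the file name and format come from parse() in myspider.py):

# -*- coding: utf-8 -*-
# Load out.txt produced by the spider into a {region: url} dict.
city_urls = {}
with open('out.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip().rstrip(',')
        if not line:
            continue
        # Split on the first colon, which sits between the quoted name and the quoted URL.
        name, _, url = line.partition(':')
        city_urls[name.strip("'")] = url.strip("'")

print(city_urls.get('北京>城区'))
# http://www.weather.com.cn/weather/101010100.shtml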