Python Practice: Scraping Weather Information (1)

Source: Internet. Editor: 程序博客网. Time: 2024/04/19 07:49

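I have long wanted to build something small in Python, but I am not yet familiar with the language. One plan I have had for a while is to scrape weather information. Today, the last day of the National Day holiday, is a first step on that plan.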

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#filename:weather2_try.py
# Python 2 script (print statement syntax)
import re
import requests

#url = 'http://www.weather.com.cn/weather/101190401.shtml'
url = 'http://www.weather.com.cn/weather/101190201.shtml'  # page to scrape
page = requests.get(url)
text = page.text    # fetch the page and keep its body in text
#print text
print '*' * 50

#district: get the location
tar0 = r'<\w\d>.*'           # match the line holding the district, to make extraction easier
district = re.compile(tar0)
dis = district.search(text)  # find the first matching line
s0 = dis.group()             # the whole matched string
print s0[4:6]                # print fixed character positions, i.e. the district name

#date: get the date
tar2 = r'<td\s[a-z]{5}=.*'
date = re.compile(tar2)
time = date.search(text)
s2 = time.group()
t2 = s2.split('"')           # split the matched line on double quotes
print t2[7],t2[8][1:6]

#weather: get the weather
tar3 = r'<td\s[a-z]{5}="\d\d%"><.*>'
weat = re.compile(tar3)
weather = weat.search(text)
s3 = weather.group()
t3 = s3.split(">")
r = t3[2]
print r[0:2]

The output is as follows:

Type "copyright", "credits" or "license()" for more information.
>>> ================================ RESTART ================================
>>>
**************************************************
常熟
农历九月初三 7日星期一
大雨
>>> ================================ RESTART ================================
>>>
**************************************************
无锡
农历九月初三 7日星期一
中到        # slicing by character position truncates the text; the regex should capture it in one step
>>>

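In any case, the information can now be extracted. The surrounding design and the rest of the weather details will be completed later.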

2013-10-8

A further refinement:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#filename:weather2_try.py
# Python 2 script (print statement syntax)
import re
import requests

url = 'http://www.weather.com.cn/weather/101190401.shtml'
page = requests.get(url)
text = page.text
print '*'*50

#district
tar0 = r'<title>(.*)<\/title>'  # use a capture group around the title text
district = re.compile(tar0)
dis = district.search(text)
print dis.group(1)              # take the content matched inside ( )

Output:

>>> ================================ RESTART ================================
>>>
**************************************************
苏州天气预报-今日_明日_一周天气预报:8日星期二夜间至9日星期三白天  阵雨转多云  18/25℃
>>>

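This avoids the string post-processing step, so a change in character positions on the page no longer forces a change to the program.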
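To make the difference concrete, here is a minimal sketch that runs both approaches on two inline strings instead of the live page. The sample titles are assumptions modelled on the titles printed in this post, the helper names are hypothetical, and the sketch uses Python 3 print() syntax, unlike the Python 2 scripts above.

```python
import re

# Sample <title> elements modelled on titles printed in this post
# (assumed markup, not fetched from the site).
PAGES = [
    "<html><head><title>苏州天气预报-今日_明日_一周天气预报</title></head></html>",
    "<html><head><title>哈尔滨天气预报-今日_明日_一周天气预报</title></head></html>",
]

TITLE_RE = re.compile(r"<title>(.*?)</title>")


def title_by_group(html):
    """Return the text inside <title>...</title>, or None if there is none."""
    match = TITLE_RE.search(html)
    return match.group(1) if match else None


def city_by_slice(html):
    """Fragile approach: assume the city name is always two characters long."""
    start = html.index("<title>") + len("<title>")
    return html[start:start + 2]


for page in PAGES:
    print(title_by_group(page), "|", city_by_slice(page))
```

The capture group returns the full title for both pages, while the fixed slice yields 苏州 for the first page and only 哈尔 for the second, which is the same kind of truncation seen with 中到 earlier.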

2013-10-9:

What if we also want the date?

Scraping the date on its own

<div class="weatherMain">
  <div class="weatherLeft">
    <div class="weatherTop">
      <h1 class="weatheH1"  id="live">
        今天是2013年10月8日 星期二 农历九月初四  <!--today5-->
        <select class="weatherSelect" onchange="MM_jumpMenu('parent',this,0)">
          <option>相关地区</option>

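The page presents the information as shown above. How should the regular expression be written so that it extracts "今天是2013年10月8日 星期二 农历九月初四"?
The goal is to extract it directly, with as little string post-processing as possible. Matching <h1 class="weatheH1" id="live"> is easy, but I did not know how to capture the following line, which contains Chinese characters and is preceded by whitespace.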

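#!/usr/bin/env python
#-*- coding: utf-8 -*-
#filename:weather_1.py
# Python 2 script (print statement syntax)
import re
import requests

url = 'http://www.weather.com.cn/weather/101120201.shtml'
page = requests.get(url)
text = page.text
# How the pattern works:
#   <h1[^>]+id="live">   the opening <h1 ...> tag that carries id="live"
#   \s*                  whitespace after the tag, including the line break
#   ([^<]+)              group 1: everything up to the next '<'
#   \s*<                 optional whitespace, then the next '<'
tar1 = r'<h1[^>]+id="live">\s*([^<]+)\s*<'
date = re.compile(tar1)
dat = date.search(text)
#dat = re.search(tar1, text, re.I)
print dat.group(1)

>>> ================================ RESTART ================================
>>>
今天是2013年10月9日 星期三 农历九月初五
>>>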

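The output basically meets the requirement, except that it is followed by two blank lines. They appear because the greedy group ([^<]+) also consumes the whitespace and line breaks before the next tag, leaving nothing for the final \s* to match.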
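One way to drop the trailing whitespace without a second string-processing step is to make the group non-greedy, so that the final \s* absorbs the whitespace instead. The sketch below is only an illustration under that assumption: it runs on a fragment abridged from the page source quoted above rather than on the live page, uses Python 3 print() syntax, and extract_date is a hypothetical helper name.

```python
import re

# Fragment abridged from the page source quoted earlier in this post.
HTML = '''<h1 class="weatheH1"  id="live">
        今天是2013年10月8日 星期二 农历九月初四  <!--today5-->
        <select class="weatherSelect">'''

# <h1[^>]+id="live">  the opening tag, whatever other attributes it has
# \s*                 whitespace after the tag, including the newline
# ([^<]+?)            group 1: as few non-'<' characters as possible
# \s*<                trailing whitespace, then the next '<'
DATE_RE = re.compile(r'<h1[^>]+id="live">\s*([^<]+?)\s*<')

# The pattern used in the script above, kept for comparison.
GREEDY_RE = re.compile(r'<h1[^>]+id="live">\s*([^<]+)\s*<')


def extract_date(html):
    """Return the date text without surrounding whitespace, or None."""
    match = DATE_RE.search(html)
    return match.group(1) if match else None


print(repr(extract_date(HTML)))
print(repr(GREEDY_RE.search(HTML).group(1)))
```

The first line printed has no trailing spaces; the second, from the greedy pattern, still ends with the whitespace that precedes the comment tag. Calling .strip() on the greedy result would work as well, at the cost of the extra string step the post is trying to avoid.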

Addendum, 10-22

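The earlier attempts suffered mainly from my not understanding regular expressions. After studying them I came back to this problem: it became simpler, and the matching can be more precise.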

#!/usr/bin/env python
#-*- coding: utf-8 -*-
#filename:weather_listID.py
# Python 2 script (print statement syntax)
import re
import requests

def All_city(location_code, location):   # get the city weather one by one
    url = 'http://www.weather.com.cn/weather/101' + location_code + '.shtml'
    page = requests.get(url)
    text = page.text
    tar0 = r'<title>(.*)-(.*)_(.*)_(.*):(.*)<\/title>'  # five capture groups
    dis = re.compile(tar0).search(text)
    print dis.group(1), dis.group(5)     # city title and forecast text
    print '\r'

def One_example():   # get one of the page title weather (uses the global text)
    tar0 = r'<title>(.*)-(.*)_(.*)_(.*):(.*)<\/title>'  # five capture groups
    dis = re.compile(tar0).search(text)
    print dis.group(1), dis.group(5)     # city title and forecast text

def Date():          # get the date of today (uses the global text)
    tar1 = r'<h1[^>]+id="live">\s*([^<]+)\s*<'
    dat = re.search(tar1, text, re.I)
    print dat.group(1)

url = 'http://www.weather.com.cn/weather/101190401.shtml'  # any one city page
page = requests.get(url)
text = page.text

#get http list,then city code and city name
# capture what follows 101 in each link: the city code and the city name
tar9 = u'<li><span><a href="http://www.weather.com.cn/weather/101(.*)</a>'
# the original passed re.M as the second argument of findall(); on a compiled
# pattern that argument is a start position, not a flag, so it is dropped here
htt = re.compile(tar9).findall(text)
length_list = len(htt)          # length of the http list
#print htt[1].encode('gbk')     # enable if the console prints garbled text

Date()
print '***********************'
One_example()
print '***********************'
# split each captured link on a keyword pattern to get the city code and name
for i in range(1, length_list):
    location = re.compile('\.shtml.*\">').split(htt[i])
    print "CITY NAME: %s \t CITY CODE:%s  " % (location[1], location[0])
    All_city(location[0], location[1])   # fetch and print that city's weather

Output:

>>> ================================ RESTART ================================
>>>
今天是2013年10月22日 星期二 农历九月十八        # output of Date()
***********************
苏州天气预报 22日星期二  多云  22/15℃        # one sample page; everything else builds on it
***********************
CITY NAME: 石家庄  CITY CODE:090101
石家庄天气预报 22日星期二  多云  19/7℃
CITY NAME: 昆明  CITY CODE:290101
昆明天气预报 22日星期二  阵雨转中雨  17/13℃
CITY NAME: 济南  CITY CODE:120101
济南天气预报 22日星期二  晴转多云  22/11℃
CITY NAME: 西安  CITY CODE:110101
西安天气预报 22日星期二  晴  22/9℃
CITY NAME: 深圳  CITY CODE:280601
深圳天气预报 22日星期二  多云  29/21℃
CITY NAME: 武汉  CITY CODE:200101
武汉天气预报 22日星期二  多云  24/11℃
CITY NAME: 海口  CITY CODE:310101
海口天气预报 22日星期二  多云  28/23℃
CITY NAME: 哈尔滨  CITY CODE:050101
哈尔滨天气预报 22日星期二  雾转霾  10/4℃
CITY NAME: 三亚  CITY CODE:310201
三亚天气预报 22日星期二  多云  30/24℃

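The city list could also be taken apart by a single pattern with two capture groups, instead of one findall followed by a split per entry. The sketch below is an assumption-laden illustration: the sample markup is reconstructed from the patterns used in the script (the target attribute is hypothetical), city_links is a made-up helper name, and the code uses Python 3 print() syntax.

```python
import re

# Assumed markup, reconstructed from the patterns used in the script above;
# the real page may carry different attributes on the <a> tag.
SAMPLE = '''
<li><span><a href="http://www.weather.com.cn/weather/101090101.shtml" target="_blank">石家庄</a></span></li>
<li><span><a href="http://www.weather.com.cn/weather/101290101.shtml" target="_blank">昆明</a></span></li>
'''

# Group 1: the digits after 101 (city code). Group 2: the link text (city name).
LINK_RE = re.compile(
    r'<a href="http://www\.weather\.com\.cn/weather/101(\d+)\.shtml"[^>]*>([^<]+)</a>'
)


def city_links(html):
    """Return a list of (city_code, city_name) tuples found in html."""
    return LINK_RE.findall(html)


for code, name in city_links(SAMPLE):
    print("CITY NAME: %s \t CITY CODE:%s" % (name, code))
```

With two groups, findall returns one tuple per link, so no second regular expression is needed to separate the code from the name.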
I also packaged this program into a Windows executable with python mysetup.py py2exe, and it runs without problems (screenshot omitted).

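A basic way of fetching the weather has taken shape, but much can still be improved: switching to multiple threads so that fetching is faster; scraping only provincial capitals, or going down to county level; showing the 3-day or 7-day forecast; offering a weather lookup for a specified city; and plenty of other tasks.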
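As a rough idea of what the multithreaded version might look like, here is a minimal sketch built on concurrent.futures from the Python 3 standard library (it is not part of the Python 2 standard library). The names fetch_all and fake_fetch are hypothetical, and fake_fetch stands in for a real download so that the sketch runs without a network connection.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(city_codes, fetch_one, max_workers=8):
    """Call fetch_one(code) for each code on a thread pool.

    city_codes must be a list (it is iterated twice).
    Returns a dict mapping each code to the value fetch_one returned.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as city_codes
        results = pool.map(fetch_one, city_codes)
        return dict(zip(city_codes, results))


def fake_fetch(code):
    """Stand-in for a real page download and title parse."""
    return "forecast for 101" + code


forecasts = fetch_all(["090101", "290101", "120101"], fake_fetch)
for code, text in forecasts.items():
    print(code, text)
```

In a real run, fetch_one would wrap the requests.get call and the title regex from All_city and return the text instead of printing it, so that output from different threads does not interleave.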