Fetching Nationwide Middle School Data (Province/City/County) from Renren


A recent project needed data on every middle school in China, and compiling it by hand was clearly impractical. I then came across the article "Scraping Renren's university database with HttpClient" (http://www.iteye.com/topic/826988) and wondered whether the middle-school data could be pulled from Renren the same way. The steps are as follows:

1. Install the HttpFox add-on in Firefox, which monitors HTTP requests.

2. Log in to Renren, open your basic profile settings, and click to edit your school information.

3. Open HttpFox and click Start to begin capturing HTTP traffic, as shown below:



4. In the school-information section, click "High School". A school-selection dialog pops up; watch the HTTP requests that HttpFox captures.





Note the highlighted row: that is the school data coming back from the server. The Content tab shows the response body, a chunk of plain HTML. Pasting the request URL into a browser renders the same page, which leads to a useful conclusion: Renren does not protect these resources with session authentication, which saves a lot of trouble.



Click through a few other provinces and compare:


Each province's request hits a different URL. On closer inspection, every city maps to one HTML file, while the top two levels, provinces and cities, can all be found in the file cityArray.js, whose structure looks like this:

var _city_1=["110101:\u4e1c\u57ce\u533a","110102:\u897f\u57ce\u533a","110103:\u5d07\u6587\u533a","110104:\u5ba3\u6b66\u533a","110105:\u671d\u9633\u533a","110106:\u4e30\u53f0\u533a","110107:\u77f3\u666f\u5c71\u533a","110108:\u6d77\u6dc0\u533a","110109:\u95e8\u5934\u6c9f\u533a","110111:\u623f\u5c71\u533a","110112:\u901a\u5dde\u533a","110113:\u987a\u4e49\u533a","110114:\u660c\u5e73\u533a","110115:\u5927\u5174\u533a","110116:\u6000\u67d4\u533a","110117:\u5e73\u8c37\u533a","110228:\u5bc6\u4e91\u53bf","110229:\u5ef6\u5e86\u53bf"];var _city_2=["310101:\u9ec4\u6d66\u533a","310103:\u5362\u6e7e\u533a","310104:\u5f90\u6c47\u533a","310105:\u957f\u5b81\u533a","310106:\u9759\u5b89\u533a","310107:\u666e\u9640\u533a","310108:\u95f8\u5317\u533a","310109:\u8679\u53e3\u533a","310110:\u6768\u6d66\u533a","310112:\u95f5\u884c\u533a","310113:\u5b9d\u5c71\u533a","310114:\u5609\u5b9a\u533a","310115:\u6d66\u4e1c\u65b0\u533a","310116:\u91d1\u5c71\u533a","310117:\u677e\u6c5f\u533a","310118:\u9752\u6d66\u533a","310119:\u5357\u6c47\u533a","310120:\u5949\u8d24\u533a","310230:\u5d07\u660e\u53bf"];var _city_3=["120101:\u548c\u5e73\u533a","120102:\u6cb3\u4e1c\u533a","120103:\u6cb3\u897f\u533a","120104:\u5357\u5f00\u533a","120105:\u6cb3\u5317\u533a","120106:\u7ea2\u6865\u533a","120107:\u5858\u6cbd\u533a","120108:\u6c49\u6cbd\u533a","120109:\u5927\u6e2f\u533a","120110:\u4e1c\u4e3d\u533a","120111:\u897f\u9752\u533a","120112:\u6d25\u5357\u533a","120113:\u5317\u8fb0\u533a","120114:\u6b66\u6e05\u533a","120115:\u5b9d\u577b\u533a","120221:\u5b81\u6cb3\u53bf","120223:\u9759\u6d77\u53bf","120225:\u84df\u53bf"];
Each line holds the city data for one province, with the Chinese names escaped as \uXXXX sequences. They can be decoded programmatically; in Python 2, an escape sequence such as \u897f\u57ce can be decoded like this:

s="\u6cb3"s.decode('unicode_escape')

From this file we can reconstruct the full two-level province/city hierarchy; the code is below:

def getProvinceData():
    content = open("/home/xiyang/workspace/school-data/cityArray.js")
    # split out the city-level ids and names
    partten = re.compile("(\d+):([\w\d\\\\]+)")
    provinceList = []
    for line in content.readlines():
        data = partten.findall(line)
        citys = []
        province = {}
        for s in data:
            if len(s[0]) == 4:  # a city
                citys.append({"id": s[0], "name": s[1].decode('unicode_escape')})
        province_id = len(data[0][0]) == 4 and data[0][0] or data[0][0][0:4]
        # only keep the provinces listed in provinceMap
        if provinceMap.has_key(int(province_id)):
            province['id'] = province_id
            province['name'] = provinceMap[int(province_id)]
            province['citys'] = citys
            provinceList.append(province)
    return provinceList
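As a quick sanity check, the regular expression used above can be exercised on a shortened sample line (a sketch in Python 3 syntax; the line is trimmed to two Beijing districts for illustration):

```python
import re

# a trimmed line in the cityArray.js format (two districts of Beijing)
line = 'var _city_1=["110101:\\u4e1c\\u57ce\\u533a","110102:\\u897f\\u57ce\\u533a"];'
partten = re.compile(r"(\d+):([\w\\]+)")  # same pattern as in getProvinceData
data = partten.findall(line)
for code, escaped in data:
    name = escaped.encode("latin-1").decode("unicode_escape")
    print(code, name)  # 110101 东城区 / 110102 西城区
```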
HttpFox shows that each city corresponds to one HTML file containing all of that city's counties/districts and schools. The request URL looks like http://support.renren.com/juniorschool/1101.html, and a fragment of the returned data looks like this:

<ul id="schoolCityQuList" class="module-qulist"><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370102')">历下区</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370103')">市中区</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370104')">槐荫区</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370105')">天桥区</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370112')">历城区</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370113')">长清区</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370124')">平阴县</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370125')">济阳县</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370126')">商河县</a></li><li><a href="#highschool_anchor"  onclick="SchoolComponent.tihuan('city_qu_370181')">章丘市</a></li></ul><ul id="city_qu_370102" style="display:none;"><li><a onclick='if(SchoolComponent.cl_school){return SchoolComponent.cl_school(event,40019572)}' href="40019572">山师大附中</a></li><li><a onclick='if(SchoolComponent.cl_school){return SchoolComponent.cl_school(event,40033777)}' href="40033777">济南二十四中</a></li><li><a onclick='if(SchoolComponent.cl_school){return SchoolComponent.cl_school(event,40033962)}' href="40033962">济宁育才学校</a></li>

These pages can be fetched conveniently with urllib2:

# fetch the school list for one city-level region; for a municipality
# this is the list for the whole municipality
def getTownHtml(town_id):
    try:
        url = "http://support.renren.com/juniorschool/%s.html" % town_id
        print "fetching:", url
        return urllib2.urlopen(url).read()
    except:
        print "network error!"
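The bare `except` above swallows every failure silently and gives up after one try. A slightly more defensive sketch (a hypothetical helper, shown with Python 3's urllib.request; the timeout and retry count are made-up values) could add a timeout and simple retries:

```python
import time
import urllib.request  # Python 3 home of urllib2's urlopen

def get_town_html(town_id, retries=3, timeout=10):
    # hypothetical variant of getTownHtml with a timeout and retry loop
    url = "http://support.renren.com/juniorschool/%s.html" % town_id
    for attempt in range(retries):
        try:
            return urllib.request.urlopen(url, timeout=timeout).read()
        except OSError as e:
            print("attempt %d failed: %s" % (attempt + 1, e))
            time.sleep(1)
    return None  # give up after all retries
```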

5. There are several possible approaches to parsing this data:

  • Parse the HTML directly with jQuery in the browser, then save the result with the file APIs. jQuery makes the parsing itself easy, but file operations in Chrome and Firefox are awkward, so I did not pursue this.
  • Extract the data with regular expressions. Pulling out the counties this way is simple enough, but extracting the complete structure gets unwieldy.
  • Use an HTML parser to walk the document structure and extract the data. This is the approach I settled on.
Python ships several HTML parsing tools, such as HTMLParser, sgmllib, and htmllib, but they are all event-driven and awkward for data with this structure. I ended up using BeautifulSoup, which makes this format trivial to parse. Here is the code:
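For comparison, here is what the event-driven style looks like with the standard-library HTMLParser (Python 3 syntax, where the module lives at html.parser; the fragment is trimmed from the sample HTML above). It works, but the state bookkeeping is clumsier than BeautifulSoup's declarative queries:

```python
from html.parser import HTMLParser

class DistrictParser(HTMLParser):
    """Collect the text of every <a href="#highschool_anchor"> link."""
    def __init__(self):
        super().__init__()
        self.in_district = False   # are we inside a district link?
        self.districts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and ("href", "#highschool_anchor") in attrs:
            self.in_district = True

    def handle_data(self, data):
        if self.in_district:
            self.districts.append(data)
            self.in_district = False

fragment = ('<ul><li><a href="#highschool_anchor" '
            'onclick="SchoolComponent.tihuan(\'city_qu_370102\')">历下区</a></li></ul>')
p = DistrictParser()
p.feed(fragment)
print(p.districts)  # ['历下区']
```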
# get the schools of every county/district in one city-level region
def getCitySchool(content):
    soup = BeautifulSoup(content)
    # middle-school data for this city
    citySchoolData = []
    # the list of counties/districts
    townlist = soup.findAll('a', href="#highschool_anchor")
    for town in townlist:
        d = {}
        d['name'] = getUnicodeStr(town.string)
        d['id'] = town['onclick'][24:38]
        townSchools = []
        # the middle-school list of each county
        for school in soup.find('ul', id=d['id']).findChildren('a'):
            townSchools.append(getUnicodeStr(school.string))
        d['schoollist'] = townSchools
        citySchoolData.append(d)
    return citySchoolData
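The magic numbers in `town['onclick'][24:38]` slice the district id out of the onclick attribute. For example:

```python
import re

# an onclick value as it appears in the fetched HTML
onclick = "SchoolComponent.tihuan('city_qu_370102')"
# the hard-coded slice: characters 24..37 are exactly the quoted id
print(onclick[24:38])  # city_qu_370102
# a less brittle alternative would be a regex on the quoted argument
print(re.search(r"'([^']+)'", onclick).group(1))  # city_qu_370102
```

The slice only works because every id has the same length; the regex version survives a change in id length.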
The function above parses the HTML content and returns every county/district and school for the given city.

The result after running it:


From here you can process the data however your project requires.
The complete code is as follows:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
#============================================
# Author: sdlgxxy@gmail.com
# date: 2012-12-29
# description: parse Renren's nationwide middle-school data
# approach:
# 1. Get the nationwide province/city data (download cityArray.js
#    and parse it with a regular expression)
# 2. Each city (including municipalities) maps to one HTML file
#    containing all of its counties/districts and schools;
#    download them with the urllib2 module
# 3. Parse the downloaded pages with BeautifulSoup and extract the data
# 4. Store the data in MongoDB
#============================================
import urllib2
import re
from BeautifulSoup import BeautifulSoup
from pymongo import MongoClient

db_host = "127.0.0.1"
db_port = 27017
db_name = "openclass"

provinceMap = {
    "北京": 1101, "上海": 3101, "天津": 1201, "重庆": 5001,
    "黑龙江": 2301, "吉林": 2201, "辽宁": 2101, "山东": 3701,
    "山西": 1401, "陕西": 6101, "河北": 1301, "河南": 4101,
    "湖北": 4201, "湖南": 4301, "海南": 4601, "江苏": 3201,
    "江西": 3601, "广东": 4401, "广西": 4501, "云南": 5301,
    "贵州": 5201, "四川": 5101, "内蒙古": 1501, "宁夏": 6401,
    "甘肃": 6201, "青海": 6301, "西藏": 5401, "新疆": 6501,
    "安徽": 3401, "浙江": 3301, "福建": 3501, "香港": 8101,
}
# invert the map to {id: name}
provinceMap = dict([[v, k] for k, v in provinceMap.items()])

# decode HTML numeric entities such as &#21271;&#20140; into a unicode string
def getUnicodeStr(s):
    name = []
    for word in s.split(";"):
        try:
            name.append(unichr(int(word[2:])))
        except:
            pass
    return "".join(name)

# fetch the school list for one city-level region; for a municipality
# this is the list for the whole municipality
def getTownHtml(town_id):
    try:
        url = "http://support.renren.com/juniorschool/%s.html" % town_id
        print "fetching:", url
        return urllib2.urlopen(url).read()
    except:
        print "network error!"

def getProvinceData():
    content = open("/home/xiyang/workspace/school-data/cityArray.js")
    # split out the city-level ids and names
    partten = re.compile("(\d+):([\w\d\\\\]+)")
    provinceList = []
    for line in content.readlines():
        data = partten.findall(line)
        citys = []
        province = {}
        for s in data:
            if len(s[0]) == 4:  # a city
                citys.append({"id": s[0], "name": s[1].decode('unicode_escape')})
        province_id = len(data[0][0]) == 4 and data[0][0] or data[0][0][0:4]
        # only keep the provinces listed in provinceMap
        if provinceMap.has_key(int(province_id)):
            province['id'] = province_id
            province['name'] = provinceMap[int(province_id)]
            province['citys'] = citys
            provinceList.append(province)
    return provinceList

# get the schools of every county/district in one city-level region
def getCitySchool(content):
    soup = BeautifulSoup(content)
    # middle-school data for this city
    citySchoolData = []
    # the list of counties/districts
    townlist = soup.findAll('a', href="#highschool_anchor")
    for town in townlist:
        d = {}
        d['name'] = getUnicodeStr(town.string)
        d['id'] = town['onclick'][24:38]
        townSchools = []
        # the middle-school list of each county
        for school in soup.find('ul', id=d['id']).findChildren('a'):
            townSchools.append(getUnicodeStr(school.string))
        d['schoollist'] = townSchools
        citySchoolData.append(d)
    return citySchoolData

conn = MongoClient(db_host)
db = conn.openclass
juniorschool = db.juniorschool

if __name__ == "__main__":
    provinceList = getProvinceData()
    print provinceList
    for province in provinceList:
        citys = province['citys']
        if citys:  # has cities, so not a municipality
            for city in citys:
                data = {"id": city['id'], "name": city['name'],
                        "data": getCitySchool(getTownHtml(city['id']))}
                print "insert into mongodb:", city['name']
                juniorschool.insert(data)
        else:  # a municipality
            data = {"id": province['id'], "name": province['name'],
                    "data": getCitySchool(getTownHtml(province['id']))}
            print "insert into mongodb:", province['name']
            juniorschool.insert(data)
After running the code, the data is stored in MongoDB, as shown below:
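Each document in the juniorschool collection then has roughly this shape (an illustrative example assembled from the sample HTML above, not actual program output):

```python
# one city's document as stored in the juniorschool collection (illustrative;
# ids and names taken from the Jinan sample fragment earlier in the article)
doc = {
    "id": "3701",                    # city id derived from cityArray.js
    "name": "济南",
    "data": [                        # one entry per county/district
        {
            "id": "city_qu_370102",  # district id from the onclick attribute
            "name": "历下区",
            "schoollist": ["山师大附中", "济南二十四中"],
        },
    ],
}
print(doc["data"][0]["schoollist"])
```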