Fetching and Parsing Web Pages with Python


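(1) Install the third-party library httplib2
First, download the httplib2 package for Python from http://code.google.com/p/httplib2/downloads/list. Next, open a DOS window, change to the directory where httplib2 was unpacked, and run the command: python setup.py install. That completes the installation. Then add the library in PyDev: Window -> Preferences -> PyDev -> Editor -> Interpreter-Python -> Libraries -> New Folder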
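
One quick way to confirm that the installation worked is to ask the interpreter whether it can locate the module. The sketch below uses only the standard library. The helper name `is_installed` is made up for this illustration and is not part of httplib2.

```python
import importlib.util


def is_installed(module_name):
    """Return True if the interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "httplib2" is the library installed in step (1).
    print("httplib2 installed:", is_installed("httplib2"))
```

If this prints False inside PyDev but True in a terminal, the interpreter configured in PyDev is probably not the one the library was installed into.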

http://docs.python.org/library/index.html documents the various library modules.

(2) The following example fetches the page at http://guangzhou.8684.cn/x_24f5dad9 and then uses regular expressions to extract the bus route and bus stop information from it.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
'''
Created on 2013-8-26
@author: chenll
'''
'''
[Tool requirement] Fetch the route data and stop data for Guangzhou bus route 1.
'''
import re

import httplib2


# Fetch the HTML page content
def getContent():
    h = httplib2.Http(".cache")
    resp, content = h.request("http://guangzhou.8684.cn/x_24f5dad9",
                              headers={'cache-control': 'no-cache'})
    # The script assumes the page is GBK-encoded; decode the response bytes into text
    return content.decode('gbk')


def featch():
    content = getContent()
    # start: <div class="hc_d3" id="show1">
    # end: <h2 class="hc_re">
    startIndex = content.index('<div class="hc_d3" id="show1">')
    endIndex = content.index('<h2 class="hc_re">')
    subContent = content[startIndex:endIndex]
    reg = r'[\s\S]*<h2\s*class="hc_p6">([\s\S]*)<span\s*id="ad581"></span></h2>\s*<p\s*class="hc_p7"><span>([\S]*)</span>\s*<span>([\s\S]*)</span>\s*<span>([\s\S]*)</span>\s*<a href="[\s\S]*">[\s\S]*</a>\s*</p>\s*<p\s*class="hc_p8">([\s\S]*)'
    match = re.match(reg, subContent)
    if match:
        # Route name
        lineName = match.group(1)
        # Route type
        lineType = match.group(2)
        # First departure time
        lineTime = match.group(3)
        # Fare
        tickect = match.group(4)
        # Stop information
        stationInfo = match.group(5)
        reg = r'\s*<i>去程：</i>([\s\S]*)<i>回程：</i>([\s\S]*)'
        match1 = re.match(reg, stationInfo)
        if match1:
            # Outbound direction (去程)
            qc = match1.group(1)
            qcArray = qc.split('-')
            for each in qcArray:
                reg = r'\s*<a\s*href="[\s\S]*">([\s\S]*)</a>\s*'
                match2 = re.match(reg, each)
                if match2:
                    # Outbound stop
                    print(match2.group(1))
            # Return direction (回程)
            hc = match1.group(2)
            hcArray = hc.split('-')
            for each in hcArray:
                reg = r'\s*<a\s*href="[\s\S]*">([\s\S]*)</a>\s*'
                match2 = re.match(reg, each)
                if match2:
                    # Return stop
                    print(match2.group(1))


# Define the main entry function
def main():
    featch()


if __name__ == '__main__':
    main()
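
The stop-extraction step can be tried offline, without fetching the page. The sketch below is an illustration, not the author's code: `SAMPLE` is a hypothetical fragment that imitates the markup the script expects (the real page may differ), `parse_stations` is a made-up helper name, and it swaps the split-on-'-' loop for a non-greedy `re.findall`, which is one alternative way to pull out each link's text.

```python
import re

# Hypothetical fragment imitating the markup the script expects; the real page may differ.
SAMPLE = (
    '<i>去程：</i>'
    '<a href="/z_1">Station A</a>-<a href="/z_2">Station B</a>-<a href="/z_3">Station C</a>'
    '<i>回程：</i>'
    '<a href="/z_3">Station C</a>-<a href="/z_2">Station B</a>-<a href="/z_1">Station A</a>'
)

# Non-greedy capture of the text inside each <a href="...">...</a> element.
ANCHOR_TEXT = re.compile(r'<a\s+href="[^"]*">(.*?)</a>', re.S)


def parse_stations(station_info):
    """Return (outbound, return) lists of stop names, or two empty lists if no match."""
    match = re.match(r'\s*<i>去程：</i>(.*)<i>回程：</i>(.*)', station_info, re.S)
    if not match:
        return [], []
    outbound = ANCHOR_TEXT.findall(match.group(1))
    inbound = ANCHOR_TEXT.findall(match.group(2))
    return outbound, inbound


if __name__ == "__main__":
    out, back = parse_stations(SAMPLE)
    print(out)
    print(back)
```

Because the non-greedy `(.*?)` stops at the first `</a>`, each link is captured separately, so a stop name that itself contains a hyphen would not be split apart.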