Extracting Links from a Web Page with Python

Source: Internet | Posted by: 外国人淘宝 | Editor: 程序博客网 | Time: 2024/04/28 03:15

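I needed to scrape some pages from the web, and I had been meaning to learn Python anyway. I started with a concise Python tutorial (Python简明教程): it does not cover much, but it gets you up and running quickly. I have always believed that learning by example is the most effective approach, so working through a real task, scraping web pages, seemed like a better way to deepen my Python than reading alone.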

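Python ships with libraries for all sorts of tasks, which makes most operations convenient. This script uses Python 2's urllib2 and sgmllib libraries. For processing HTML, Python provides three modules in total: sgmllib, htmllib, and HTMLParser. This article uses sgmllib, but from what I have read, the third-party tool BeautifulSoup is actually the best choice because it can cope with poorly formed HTML. So BeautifulSoup is what I plan to study next.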
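To give a rough idea of how these event-driven parsers work, here is a minimal sketch. It assumes Python 3, where sgmllib and htmllib no longer exist and HTMLParser lives in html.parser. The class name AnchorCollector and the sample HTML are made up for illustration and are not part of the original script.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag the parser encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []  # per-instance list, so parsers do not share state

    def handle_starttag(self, tag, attrs):
        # The parser calls this once per opening tag.
        # `tag` arrives lowercased; `attrs` is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Hypothetical sample input; note the <p> tag is never closed.
sample_html = '<p>See <a href="http://example.com/a">A</a> and <a href="/b">B</a>'

collector = AnchorCollector()
collector.feed(sample_html)
collector.close()
print(collector.hrefs)
```

The parser never builds a tree. It simply fires a callback for each tag it sees, which is the same model sgmllib uses with its do_a style handler methods.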

(2) Script code

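    # Python 2 script: urllib2 and sgmllib are Python 2 modules.
    import urllib2
    import sgmllib


    class LinksParser(sgmllib.SGMLParser):
        """Collects the absolute links found in <a href="..."> tags."""

        def __init__(self):
            sgmllib.SGMLParser.__init__(self)
            # Keep the list on the instance; a class-level list would be
            # shared by every LinksParser object.
            self.urls = []

        def do_a(self, attrs):
            # SGMLParser calls do_a for every <a> tag, passing its
            # attributes as a list of (name, value) pairs.
            for name, value in attrs:
                if (name == 'href' and value.startswith('http')
                        and value not in self.urls):
                    self.urls.append(value)
                    print value
                    return


    if __name__ == "__main__":
        p = LinksParser()
        f = urllib2.urlopen('http://www.baidu.com')
        value = f.read()
        print value      # dump the raw page
        p.feed(value)    # parse it; do_a fires for each link
        for url in p.urls:
            print url
        f.close()
        p.close()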
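The script above only runs on Python 2, because urllib2 and sgmllib were both removed in Python 3. As a hedged sketch of how the same idea might look on Python 3, the version below swaps in urllib.request and html.parser. The helper names extract_links and fetch_links and the default utf-8 decoding are assumptions made for illustration; they are not from the original article.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinksParser(HTMLParser):
    """Collects unique links starting with 'http' from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Same filter as the original script: absolute links only, no duplicates.
            if (name == "href" and value
                    and value.startswith("http")
                    and value not in self.urls):
                self.urls.append(value)


def extract_links(html):
    """Parse an HTML string and return the collected links in page order."""
    parser = LinksParser()
    parser.feed(html)
    parser.close()
    return parser.urls


def fetch_links(url, encoding="utf-8"):
    """Download `url` and return its links.

    Assumption: the page decodes as `encoding`; undecodable bytes are replaced.
    """
    with urlopen(url) as response:
        html = response.read().decode(encoding, errors="replace")
    return extract_links(html)
```

Calling fetch_links('http://www.baidu.com') and printing each returned item should reproduce the behavior of the original script, minus the raw page dump. Keeping the parsing in extract_links means it can be tried on a literal HTML string without touching the network.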

Reposted from: http://blog.csdn.net/cscmaker/article/details/8730153