Parsing HTML tags with Python

Source: Internet | Published by: 云协作软件 | Editor: 程序博客网 | Time: 2024/05/16 19:46

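A web page often carries too much information. Reading it by eye is error-prone, and clicking a link on the page replaces the information that was there. In such cases it would help if a program could analyze the page automatically. Python's HTMLParser handles this well, although it only extracts the content; what analysis to do next depends on each person's needs.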

    try:
        from html.parser import HTMLParser  # Python 3
    except ImportError:
        from HTMLParser import HTMLParser   # Python 2

    class MyHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.links = []  # href values collected from <a> tags

        def handle_starttag(self, tag, attrs):
            # HTMLParser lowercases tag and attribute names,
            # so <A Href="..."> is matched as well.
            if tag == "a":
                for (variable, value) in attrs:
                    if variable == "href":
                        self.links.append(value)

    if __name__ == "__main__":
        html_code = """
        <a href="www.google.com">google.com</a>
        <A Href="www.sina.com.cn">Sina</a>
        """
        hp = MyHTMLParser()
        hp.feed(html_code)
        hp.close()
        print(hp.links)

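First, define a custom class MyHTMLParser that inherits from HTMLParser and overrides the handle_starttag() method. Then pass the HTML content to a MyHTMLParser object through its feed() method, and finally call close() on it. That is all.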
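The same pattern extends to other handler methods. As a minimal sketch, assuming Python 3 and that the text between the opening and closing tags is also wanted, the example below additionally overrides handle_data() and handle_endtag() to pair each href with its link text. The names LinkTextParser and extract_links are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class LinkTextParser(HTMLParser):
    """Collect (href, text) pairs for every <a> element that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []      # finished (href, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text chunks seen inside that <a>

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text while we are inside an <a> with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html_code):
    """Hypothetical helper: parse html_code and return the collected pairs."""
    parser = LinkTextParser()
    parser.feed(html_code)
    parser.close()
    return parser.links


if __name__ == "__main__":
    sample = '<a href="www.google.com">google.com</a> <A Href="www.sina.com.cn">Sina</a>'
    print(extract_links(sample))
```

Keeping the parsing state (the current href and its text chunks) on the parser object is one simple design; nested or malformed anchors would need more careful handling than this sketch provides.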

Running it in eshell:

    $ python htmlparser.py
    ['www.google.com', 'www.sina.com.cn']

0 0