Python_API_Structured Markup Processing Tools_sgmllib.SGMLParser.feed

Source: Internet. Editor: 程序博客网. Time: 2024/05/22 01:49

API documentation:

SGMLParser.feed(data)
Feed some text to the parser. It is processed insofar as it consists of complete elements; incomplete data is buffered until more data is fed or close() is called.

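Explanation:

Parameters:

data: the document string to be processed

Description:

Parses the document contained in the string data. This method does not guarantee that the whole HTML document has been processed when it returns: only complete elements are handled, and any incomplete trailing markup is buffered until more content arrives.

Once there is no more content, call close() to flush the buffer and force all remaining data to be fully processed.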
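The buffering described above can be observed by feeding a document in pieces. sgmllib exists only in Python 2, so this sketch assumes Python 3 and swaps in html.parser.HTMLParser, whose feed() and close() are documented with the same buffering behaviour. The names LinkCollector and feed_in_chunks are hypothetical, and the example.com URL is a placeholder.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href values of <a> tags (Python 3 stand-in for SGMLParser)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, as in sgmllib.
        if tag == 'a':
            self.urls.extend(v for k, v in attrs if k == 'href' and v)


def feed_in_chunks(chunks):
    """Feed each chunk and record which URLs are known after every feed()."""
    parser = LinkCollector()
    snapshots = []
    for chunk in chunks:
        parser.feed(chunk)
        snapshots.append(list(parser.urls))
    parser.close()  # flush whatever is still buffered
    return snapshots, parser.urls


# The first chunk ends in the middle of the <a> tag, so nothing is reported
# until the second chunk completes the element.
snapshots, urls = feed_in_chunks(['<a href="http://exam', 'ple.com/">link</a>'])
print(snapshots)  # [[], ['http://example.com/']]
print(urls)       # ['http://example.com/']
```

The empty first snapshot shows that feed() held the incomplete tag back instead of reporting a truncated URL.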

Example:

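#! /usr/bin/env python
# coding=utf-8
# Python 2 only: sgmllib and urllib.urlopen were removed in Python 3.

import urllib
from sgmllib import SGMLParser


class URLLister(SGMLParser):
    def reset(self):
        # reset() is called by SGMLParser.__init__, so the list is created here.
        SGMLParser.reset(self)
        self.urls = []

    def start_a(self, attrs):
        # Called for every <a> start tag; attrs is a list of (name, value) pairs.
        href = [v for k, v in attrs if k == 'href']
        if href:
            self.urls.extend(href)


usock = urllib.urlopen('http://www.baidu.com')
parser = URLLister()
try:
    # feed() may leave incomplete trailing markup in its buffer.
    parser.feed(usock.read())
finally:
    usock.close()
# close() flushes the buffer and forces the remaining data to be processed.
parser.close()

for url in parser.urls:
    print url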

Output:

http://www.baidu.com/gaoji/preferences.html
http://passport.baidu.com/?login&tpl=mn
https://passport.baidu.com/?reg&tpl=mn
http://news.baidu.com
http://tieba.baidu.com
http://zhidao.baidu.com
http://mp3.baidu.com
http://image.baidu.com
http://video.baidu.com
http://map.baidu.com
#
#
#
http://hi.baidu.com
http://baike.baidu.com
http://www.hao123.com
/more/
http://utility.baidu.com/traf/click.php?id=215&url=http://www.baidu.com
javascript:void(0)
http://e.baidu.com/?refer=888
http://top.baidu.com
http://home.baidu.com
http://ir.baidu.com
/duty/
http://www.miibeian.gov.cn