Fetching a Web Page and Parsing Its HTML

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 20:49

I am working through Liao Xuefeng's Python tutorial, in the section on commonly used built-in modules, HTMLParser:
http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001407500818913cef22f247dbd4699921fe9d309727a20000

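Exercise: pick a web page, for example https://www.python.org/events/python-events/, view its source in a browser and copy it, then try parsing the HTML to print the dates, names, and locations of the conferences listed on the official Python site.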

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date    : 2017-06-01 09:08:30
# @Author  : kk (zwk.patrick@foxmail.com)
# @Link    : blog.csdn.net/PatrickZheng

# Python 2 code: uses the HTMLParser and urllib modules and print statements.
import HTMLParser, urllib


class MyHTMLParser(HTMLParser.HTMLParser):
    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        # In each list, element 0 is a flag ("currently inside this element");
        # the collected values follow from index 1 onward.
        self._title = [False]
        self._time = [False]
        self._place = [False]
        self.time = ''   # accumulates the text of the current <time> element

    def _attr(self, attrlist, attrname):
        # Return the value of attribute attrname, or None if it is absent.
        for attr in attrlist:
            if attr[0] == attrname:
                return attr[1]
        return None

    def handle_starttag(self, tag, attrs):
        #print('<%s>' % tag)
        if tag == 'h3' and self._attr(attrs, 'class') == 'event-title':
            self._title[0] = True
        if tag == 'time':
            self._time[0] = True
        if tag == 'span' and self._attr(attrs, 'class') == 'event-location':
            self._place[0] = True

    def handle_endtag(self, tag):
        # </time> ends the accumulation
        if tag == 'time':
            self._time.append(self.time)  # store the complete time string in self._time
            self.time = ''                # reset self.time
            self._time[0] = False

    def handle_startendtag(self, tag, attrs):
        #print('<%s/>' % tag)
        pass

    def handle_data(self, data):
        #print('data: %s' % data)
        if self._title[0] == True:
            self._title.append(data)
            self._title[0] = False
        if self._time[0] == True:
            self.time += data             # append to the time string
        if self._place[0] == True:
            self._place.append(data)
            self._place[0] = False

    def handle_comment(self, comment):
        #print('<!-- %s -->' % comment)
        pass

    def handle_entityref(self, name):
        # Any entity inside <time> is written as '-' (the page uses &ndash;)
        if self._time[0] == True:
            self.time += '-'

    def handle_charref(self, name):
        #print('&#%s:' % name)
        pass

    def show_content(self):
        for n in range(1, len(self._title)):
            print 'Title: %s' % self._title[n]
            print 'Time:  %s' % self._time[n]
            print 'Place: %s' % self._place[n]
            print '--------------------------------------'


html = ''
# Open the page before the try block so that page is always defined in finally
page = urllib.urlopen('https://www.python.org/events/python-events/')
try:
    html = page.read()  # read the page content
finally:
    page.close()

parser = MyHTMLParser()
parser.feed(html)
parser.show_content()

Output:

Title: PyCon Taiwan 2017
Time:  06 June - 12 June  2017
Place: Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan
--------------------------------------
Title: PyCon CZ 2017
Time:  09 June - 12 June  2017
Place: Prague, Czechia
--------------------------------------
Title: PythonDay Mexico
Time:  10 June - 11 June  2017
Place: Isabel la Católica 51, Centro, 06010 Mexico City, Mexico
--------------------------------------
Title: PyParis 2017
Time:  12 June - 14 June  2017
Place: Paris, France
--------------------------------------
Title: PyCon Israel 2017
Time:  12 June - 15 June  2017
Place: Wahl Center, Max VeAnna Webb st., Ramat Gan, Israel
--------------------------------------
Title: PyData Berlin 2017
Time:  30 June - 03 July  2017
Place: Treskowallee 8, 10318 Berlin, Germany
--------------------------------------
Title: PyConWEB 2017
Time:  27 May - 29 May  2017
Place: Munich, Germany
--------------------------------------
Title: PyDataBCN 2017
Time:  19 May - 22 May  2017
Place: Barcelona, Spain
--------------------------------------
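
The code in this post is written for Python 2. For anyone following along on Python 3, here is a rough sketch of the same idea using html.parser and urllib.request. It is not a drop-in replacement for the code above. The names EventParser, parse_events, fetch and SAMPLE_HTML are made up for this illustration, and SAMPLE_HTML is a hand-written fragment that only imitates the markup the code relies on (h3 with class event-title, time, and span with class event-location); the live page may look different. The sketch parses the inline sample so that it runs without network access.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

# Hypothetical fragment imitating the markup assumed in this post.
SAMPLE_HTML = """
<ul>
  <li>
    <h3 class="event-title"><a href="/events/1/">PyCon Taiwan 2017</a></h3>
    <p><time datetime="2017-06-06">06 June &ndash; 12 June
       <span class="say-no-more"> 2017</span></time>
       <span class="event-location">Taipei, Taiwan</span></p>
  </li>
  <li>
    <h3 class="event-title"><a href="/events/2/">PyCon CZ 2017</a></h3>
    <p><time datetime="2017-06-09">09 June &ndash; 12 June
       <span class="say-no-more"> 2017</span></time>
       <span class="event-location">Prague, Czechia</span></p>
  </li>
</ul>
"""


class EventParser(HTMLParser):
    """Collect title, time and place of each event into self.events."""

    def __init__(self):
        super().__init__()   # convert_charrefs=True: entities arrive as plain text
        self.events = []
        self._field = None   # field being captured: 'title', 'time' or 'place'
        self._tag = None     # tag name whose end tag closes the current field
        self._depth = 0      # nesting depth of same-named tags inside the field
        self._buf = []       # text chunks seen inside the current field

    def _begin(self, field, tag):
        self._field, self._tag, self._depth, self._buf = field, tag, 0, []

    def handle_starttag(self, tag, attrs):
        if self._field is not None:
            if tag == self._tag:
                self._depth += 1   # e.g. a <span> nested in the location <span>
            return
        css_class = dict(attrs).get("class")
        if tag == "h3" and css_class == "event-title":
            self._begin("title", tag)
        elif tag == "time":
            self._begin("time", tag)
        elif tag == "span" and css_class == "event-location":
            self._begin("place", tag)

    def handle_data(self, data):
        if self._field is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if self._field is None or tag != self._tag:
            return
        if self._depth:
            self._depth -= 1
            return
        # Collapse whitespace and turn the en dash (&ndash;) into '-'.
        text = " ".join("".join(self._buf).split()).replace("\u2013", "-")
        if self._field == "title":
            self.events.append({"title": text, "time": "", "place": ""})
        elif self.events:
            self.events[-1][self._field] = text
        self._field = None


def parse_events(html):
    parser = EventParser()
    parser.feed(html)
    parser.close()
    return parser.events


def fetch(url):
    # Not called below, so the sketch runs offline. UTF-8 is an assumption.
    with urlopen(url) as page:
        return page.read().decode("utf-8")


events = parse_events(SAMPLE_HTML)
for event in events:
    print("Title: %s" % event["title"])
    print("Time:  %s" % event["time"])
    print("Place: %s" % event["place"])
    print("--------------------------------------")
```

To try it against the real page, replace SAMPLE_HTML with fetch('https://www.python.org/events/python-events/'). Unlike the Python 2 version, this sketch collects all the text inside an element until its end tag and collapses the whitespace, so a title or location split across nested tags is still captured in full.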