Fetching a Web Page and Parsing Its HTML
Source: Internet  Editor: 程序博客网  Time: 2024/06/05 20:49
I have been working through Liao Xuefeng's Python tutorial, in the section on the commonly used built-in module HTMLParser:
http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001407500818913cef22f247dbd4699921fe9d309727a20000
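Exercise: pick a web page, for example https://www.python.org/events/python-events/, view its source in a browser and copy it, then try parsing the HTML to print the date, name, and location of each conference published on the official Python site.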
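Before the solution, it may help to see which callbacks the parser fires and in what order. The following is a minimal sketch in Python 3, where the module is named `html.parser`; the `EventTracer` class and the `SAMPLE` fragment are hypothetical and only imitate the kind of markup the exercise deals with.

```python
from html.parser import HTMLParser  # Python 3 name of Python 2's HTMLParser module


class EventTracer(HTMLParser):
    """Record every callback the parser fires, in order."""

    def __init__(self):
        # convert_charrefs=False keeps entities such as &ndash; as separate
        # handle_entityref calls instead of merging them into the text.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(('start', tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(('end', tag))

    def handle_data(self, data):
        self.events.append(('data', data))

    def handle_entityref(self, name):
        self.events.append(('entity', name))


# Hypothetical fragment, made up for illustration.
SAMPLE = '<time datetime="2017-06-06">06 June &ndash; 12 June</time>'

tracer = EventTracer()
tracer.feed(SAMPLE)
tracer.close()
for event in tracer.events:
    print(event)
```

The text inside `<time>` arrives in pieces (data, entity, data), which is why a parser for this page has to concatenate fragments until the closing tag shows up.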
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Date : 2017-06-01 09:08:30
# @Author : kk (zwk.patrick@foxmail.com)
# @Link : blog.csdn.net/PatrickZheng
# Python 2 script: the HTMLParser module and urllib.urlopen were renamed in Python 3.

import HTMLParser, urllib


class MyHTMLParser(HTMLParser.HTMLParser):

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        # Element 0 of each list is an "inside this tag" flag;
        # the collected values are appended from index 1 onward.
        self._title = [False]
        self._time = [False]
        self._place = [False]
        self.time = ''  # accumulates the pieces of one <time> element

    def _attr(self, attrlist, attrname):
        # Return the value of the named attribute, or None if it is absent.
        for attr in attrlist:
            if attr[0] == attrname:
                return attr[1]
        return None

    def handle_starttag(self, tag, attrs):
        #print('<%s>' % tag)
        if tag == 'h3' and self._attr(attrs, 'class') == 'event-title':
            self._title[0] = True
        if tag == 'time':
            self._time[0] = True
        if tag == 'span' and self._attr(attrs, 'class') == 'event-location':
            self._place[0] = True

    def handle_endtag(self, tag):
        # </time> ends the concatenation
        if tag == 'time':
            self._time.append(self.time)  # store the complete time string
            self.time = ''  # reset the accumulator
            self._time[0] = False

    def handle_startendtag(self, tag, attrs):
        #print('<%s/>' % tag)
        pass

    def handle_data(self, data):
        #print('data: %s' % data)
        if self._title[0]:
            self._title.append(data)
            self._title[0] = False
        if self._time[0]:
            self.time += data  # append this piece of the time
        if self._place[0]:
            self._place.append(data)
            self._place[0] = False

    def handle_comment(self, comment):
        #print('<!-- %s -->' % comment)
        pass

    def handle_entityref(self, name):
        # Any entity inside <time> is written as '-'; on this page it is &ndash;
        if self._time[0]:
            self.time += '-'

    def handle_charref(self, name):
        #print('&#%s:' % name)
        pass

    def show_content(self):
        for n in range(1, len(self._title)):
            print 'Title: %s' % self._title[n]
            print 'Time: %s' % self._time[n]
            print 'Place: %s' % self._place[n]
            print '--------------------------------------'


# Open the page first, so that close() is only called on a page that was opened.
page = urllib.urlopen('https://www.python.org/events/python-events/')
try:
    html = page.read()  # read the page content
finally:
    page.close()

parser = MyHTMLParser()
parser.feed(html)
parser.show_content()
Output:
Title: PyCon Taiwan 2017
Time: 06 June - 12 June 2017
Place: Academia Sinica, 128 Academia Road, Section 2, Nankang, Taipei 11529, Taiwan
--------------------------------------
Title: PyCon CZ 2017
Time: 09 June - 12 June 2017
Place: Prague, Czechia
--------------------------------------
Title: PythonDay Mexico
Time: 10 June - 11 June 2017
Place: Isabel la Católica 51, Centro, 06010 Mexico City, Mexico
--------------------------------------
Title: PyParis 2017
Time: 12 June - 14 June 2017
Place: Paris, France
--------------------------------------
Title: PyCon Israel 2017
Time: 12 June - 15 June 2017
Place: Wahl Center, Max VeAnna Webb st., Ramat Gan, Israel
--------------------------------------
Title: PyData Berlin 2017
Time: 30 June - 03 July 2017
Place: Treskowallee 8, 10318 Berlin, Germany
--------------------------------------
Title: PyConWEB 2017
Time: 27 May - 29 May 2017
Place: Munich, Germany
--------------------------------------
Title: PyDataBCN 2017
Time: 19 May - 22 May 2017
Place: Barcelona, Spain
--------------------------------------
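The same idea can be sketched in Python 3 with `html.parser`. This is not the author's script: the `EventParser` class collects one dict per event instead of three parallel lists, and the `SAMPLE` markup is a hypothetical fragment that assumes the tag and class names the script above looks for (`h3.event-title`, `time`, `span.event-location`). It parses an inline string so it runs without network access.

```python
from html.parser import HTMLParser


class EventParser(HTMLParser):
    """Collect one dict per event (title, time, place)."""

    def __init__(self):
        # With convert_charrefs=True (the Python 3 default), &ndash; is
        # converted to the en dash character and delivered via handle_data.
        super().__init__(convert_charrefs=True)
        self.events = []
        self._field = None     # field currently being captured, if any
        self._end_tag = None   # tag whose closing ends the capture
        self._buffer = []      # text pieces of the current field

    def _start(self, field, end_tag):
        self._field, self._end_tag, self._buffer = field, end_tag, []

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get('class')
        if tag == 'h3' and css_class == 'event-title':
            # A title opens a new event record.
            self.events.append({'title': '', 'time': '', 'place': ''})
            self._start('title', 'h3')
        elif tag == 'time':
            self._start('time', 'time')
        elif tag == 'span' and css_class == 'event-location':
            self._start('place', 'span')

    def handle_data(self, data):
        if self._field is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._field is not None and tag == self._end_tag:
            # Join the pieces and collapse runs of whitespace.
            text = ' '.join(''.join(self._buffer).split())
            if self.events:  # ignore fields seen before the first title
                self.events[-1][self._field] = text
            self._field = None


# Hypothetical markup, assumed to resemble the events page.
SAMPLE = '''
<ul>
  <li>
    <h3 class="event-title"><a href="/events/1/">PyCon Taiwan 2017</a></h3>
    <p>
      <time datetime="2017-06-06">06 June &ndash; 12 June <span class="say-no-more"> 2017</span></time>
      <span class="event-location">Taipei, Taiwan</span>
    </p>
  </li>
  <li>
    <h3 class="event-title"><a href="/events/2/">PyCon CZ 2017</a></h3>
    <p>
      <time datetime="2017-06-09">09 June &ndash; 12 June <span class="say-no-more"> 2017</span></time>
      <span class="event-location">Prague, Czechia</span>
    </p>
  </li>
</ul>
'''

parser = EventParser()
parser.feed(SAMPLE)
parser.close()
for event in parser.events:
    print('Title: %(title)s\nTime: %(time)s\nPlace: %(place)s' % event)
    print('-' * 38)
```

Keeping each event in a single dict avoids the index bookkeeping of parallel lists, and a missing field leaves an empty string rather than shifting later entries. To run it against the live page, the HTML could be read with `urllib.request.urlopen(...)` and decoded before calling `feed`, assuming the page markup still matches these class names.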