Scraping Douban's Now-Playing Movie Information (Implemented with HTMLParser)
Source: Internet  Editor: 程序博客网  Time: 2024/05/04 07:15
from urllib import request
from html.parser import HTMLParser
import json


class MovieParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.movies = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples; look up one value by name.
        def _attr(attrlist, attrname):
            for attr in attrlist:
                if attr[0] == attrname:
                    return attr[1]
            return None

        # Each now-playing movie is an <li> carrying data-* attributes.
        if tag == 'li' and _attr(attrs, 'data-title') and _attr(attrs, 'data-category') == 'nowplaying':
            movie = {}
            movie['title'] = _attr(attrs, 'data-title')
            movie['score'] = _attr(attrs, 'data-score')
            movie['director'] = _attr(attrs, 'data-director')
            movie['actors'] = _attr(attrs, 'data-actors')
            self.movies.append(movie)
            print('%(title)s| %(score)s| %(director)s| %(actors)s' % movie)


def nowplaying(url):
    # Send a browser-like User-Agent with the request.
    req = request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36')
    s = request.urlopen(req).read()
    parser = MovieParser()
    parser.feed(s.decode('utf-8'))
    return parser.movies


if __name__ == "__main__":
    url = "https://movie.douban.com/nowplaying/wuhan/"
    movies = nowplaying(url)
    print('%s' % json.dumps(movies, sort_keys=True, indent=4, separators=(',', ': ')))
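If you are not familiar with html.parser, the following excerpt from the official documentation explains it (only a small part is quoted here).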
HTMLParser.handle_starttag(tag, attrs)

This method is called to handle the start of a tag (e.g. <div id="main">).

The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag's <> brackets. The name will be translated to lower case, and quotes in the value have been removed, and character and entity references have been replaced.

For instance, for the tag <A HREF="https://www.cwi.nl/">, this method would be called as handle_starttag('a', [('href', 'https://www.cwi.nl/')]).

All entity references from html.entities are replaced in the attribute values.
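To make the description above concrete, here is a minimal sketch that records exactly what handle_starttag receives. The class name AttrRecorder and the sample markup are made up for illustration; only HTMLParser, feed and close come from the standard library.

```python
from html.parser import HTMLParser


class AttrRecorder(HTMLParser):
    """Hypothetical helper: stores every (tag, attrs) pair seen by handle_starttag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # tag arrives lower-cased; attrs is a list of (name, value) tuples.
        self.seen.append((tag, attrs))


recorder = AttrRecorder()
# Upper-case names and an entity reference, to show the normalisation.
recorder.feed('<A HREF="https://www.cwi.nl/" TITLE="R&amp;D">CWI</A>')
recorder.close()

tag, attrs = recorder.seen[0]
print(tag)    # a
print(attrs)  # [('href', 'https://www.cwi.nl/'), ('title', 'R&D')]

# Converting the list to a dict is often more convenient than scanning it.
attr_map = dict(attrs)
print(attr_map.get('href'))  # https://www.cwi.nl/
```

Note that dict(attrs) keeps only the last value if an attribute name is repeated, which is usually acceptable for lookups like the ones in the scraper.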
As a basic example, below is a simple HTML parser that uses the HTMLParser class to print out start tags, end tags, and data as they are encountered:
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        print("Encountered some data :", data)


parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>')
The output will then be:
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
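The same callback is what drives the Douban scraper at the top of this post. As a rough offline sketch, the snippet below feeds a hand-written fragment through a parser that filters li tags by their data-* attributes, so the logic can be tried without a network request. The class name NowPlayingParser and all sample values are invented for illustration and are not real Douban markup; the attribute names simply mirror the ones used in the scraper.

```python
from html.parser import HTMLParser

# Invented sample data; real pages will contain many more attributes.
SAMPLE_HTML = (
    '<ul>'
    '<li data-title="Sample Movie A" data-score="8.1" data-director="Director A" '
    'data-actors="Actor 1 / Actor 2" data-category="nowplaying">A</li>'
    '<li data-title="Sample Movie B" data-score="7.0" data-director="Director B" '
    'data-actors="Actor 3" data-category="upcoming">B</li>'
    '<li class="nav">not a movie</li>'
    '</ul>'
)


class NowPlayingParser(HTMLParser):
    """Hypothetical variant of MovieParser that uses dict(attrs) for lookups."""

    def __init__(self):
        super().__init__()
        self.movies = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        # Keep only <li> tags that describe a now-playing movie.
        if (tag == 'li' and attr_map.get('data-title')
                and attr_map.get('data-category') == 'nowplaying'):
            self.movies.append({
                'title': attr_map.get('data-title'),
                'score': attr_map.get('data-score'),
                'director': attr_map.get('data-director'),
                'actors': attr_map.get('data-actors'),
            })


sample_parser = NowPlayingParser()
sample_parser.feed(SAMPLE_HTML)
sample_parser.close()

for movie in sample_parser.movies:
    print('%(title)s| %(score)s| %(director)s| %(actors)s' % movie)
```

With this sample input only the first li passes the filter, since the second has a different data-category and the third has no data-title.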