Scraper——BeautifulSoup and LXML
Source: Internet | Editor: 程序博客网 | Date: 2024/06/10 06:49
Besides regular expressions, a crawler can parse pages with the BeautifulSoup package or the lxml module. This post introduces both approaches.
1. The BeautifulSoup package
It offers richer functionality than regular expressions, and the resulting code is more concise and readable. However, because the package is written in pure Python, it runs more slowly.
```python
# Scraping with BeautifulSoup
# Official docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
from bs4 import BeautifulSoup

# BeautifulSoup can repair malformed HTML
broken_html = '<ul class=country><li>Area<li>Population</ul>'
soup = BeautifulSoup(broken_html, "html.parser")
fixed_html = soup.prettify()  # the repaired HTML
# print(fixed_html)

# retrieving elements
ul = soup.find('ul', attrs={'class': 'country'})
# print(ul.find('li'))
# print(ul.find_all('li'))

# Now use the same approach to extract the country's area
import urllib.request
import urllib.error


def download(url, user_agent="wswp", num_retries=2):
    print("Download:", url)
    headers = {"User-Agent": user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print("Download error:", e.reason)
        html = None
        if num_retries > 0:
            # retry only on 5xx server errors
            if hasattr(e, "code") and 500 <= e.code < 600:
                return download(url, user_agent, num_retries - 1)
    return html


if __name__ == "__main__":
    url = "http://example.webscraping.com/view/United-Kingdom-239"
    html = download(url)
    soup = BeautifulSoup(html, "html.parser", from_encoding="utf-8")
    # first locate the parent element,
    tr = soup.find(attrs={'id': 'places_area__row'})
    # then the child element that holds the area,
    td = tr.find(attrs={'class': 'w2p_fw'})
    # and finally print the child's text
    area = td.text
    print(area)
```
Summary: although BeautifulSoup code is longer than a regular expression, it is not hard to follow, and it is easier to construct and understand. It is also more robust to small layout changes such as extra whitespace or added tag attributes.
2. The lxml module
This module provides a CSS selector, but the cssselect package must be installed before using it; otherwise an error is raised.
```python
# Scraping with lxml
# lxml is a Python wrapper around the libxml2 XML parsing library.
# It parses faster than BeautifulSoup because libxml2 is written in C.
import urllib.request
import urllib.error

import lxml.html

# Step 1: lxml can also normalise broken HTML into a consistent format
# broken_html = '<ul class=country><li>Area<li>Population</ul>'
# tree = lxml.html.fromstring(broken_html)
# fixed_html = lxml.html.tostring(tree, pretty_print=True)
# print(fixed_html)


def download(url, user_agent="wswp", num_retries=2):
    print("Download:", url)
    headers = {"User-Agent": user_agent}
    request = urllib.request.Request(url, headers=headers)
    try:
        html = urllib.request.urlopen(request).read()
    except urllib.error.URLError as e:
        print("Download error:", e.reason)
        html = None
        if num_retries > 0:
            # retry only on 5xx server errors
            if hasattr(e, "code") and 500 <= e.code < 600:
                return download(url, user_agent, num_retries - 1)
    return html


if __name__ == "__main__":
    url = "http://example.webscraping.com/view/United-Kingdom-239"
    html = download(url)
    tree = lxml.html.fromstring(html)
    # Note: recent lxml releases no longer bundle the CSS selector;
    # install it separately with `pip install cssselect`
    td = tree.cssselect('tr#places_area__row > td.w2p_fw')[0]
    area = td.text_content()
    print(area)
```
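If installing cssselect is not an option, lxml's built-in XPath support can express the same query directly, with no extra package. A minimal offline sketch on a hypothetical stand-in for the page's area row (the fragment and its text are invented for illustration):

```python
import lxml.html

# hypothetical fragment mimicking the target page's area row
html = ('<table><tr id="places_area__row">'
        '<td class="w2p_fw">244,820 square kilometres</td></tr></table>')
tree = lxml.html.fromstring(html)

# XPath ships with lxml itself; no cssselect needed
td = tree.xpath('//tr[@id="places_area__row"]/td[@class="w2p_fw"]')[0]
print(td.text_content())  # 244,820 square kilometres
```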