When parsing pages with Python's BeautifulSoup library, choose a parser with the appropriate level of fault tolerance

Source: Internet | Editor: 程序博客网 | Time: 2024/05/21 18:33
# -*- coding: utf-8 -*-
"""
filename : net_csdn_bbs_topics392225180.py
author: hu@daonao.com QQ: 443089607 weixin: huzhenghui weibo: http://weibo.com/443089607
category : BeautifulSoup
original url : http://bbs.csdn.net/topics/392225180
original title : How to use BeautifulSoup to extract the text of the div tag in the example
title : When parsing pages with Python's BeautifulSoup library, choose a parser with the appropriate level of fault tolerance
csdn blog url :
weibo article url :
weibo message url :
To see the indentation that Python strictly requires displayed clearly, please read the post on the blog.
See the comments in the source code for the detailed explanation.
"""
# standard import
import logging
import sys

import bs4

logging.basicConfig(level=logging.DEBUG, format='%(asctime)s - %(levelname)s - %(message)s')
logging.debug('start')
# 3.6.0
logging.debug('python version : %s', sys.version)
# 4.5.3
logging.debug('bs4.__version__ : %s', bs4.__version__)

# Sample page: note the stray </div>, which has no matching opening tag.
STR_HTML_PAGE = """<html><body></div>BeautifulSoup<p></p></body></html>"""

# Parser : Python standard library
# Usage : BeautifulSoup(markup, "html.parser")
# Advantage : built into the Python standard library
# Advantage : moderate speed
# Advantage : good fault tolerance
# Disadvantage : poor fault tolerance in versions before Python 2.7.3 or 3.2.2
BEAUTIFULSOUP_HTML_PARSER = bs4.BeautifulSoup(STR_HTML_PAGE, 'html.parser')
logging.debug('html.parser')
# Recorded output: the text and the <p> tag end up outside <html>.
"""
<html>
 <body>
 </body>
</html>
BeautifulSoup
<p>
</p>
"""
print(BEAUTIFULSOUP_HTML_PARSER.prettify())
sys.stdout.flush()

# Parser : lxml HTML parser
# Install : pip install lxml
# Usage : BeautifulSoup(markup, "lxml")
# Advantage : fast
# Advantage : good fault tolerance
# Disadvantage : requires an external C library
BEAUTIFULSOUP_LXML = bs4.BeautifulSoup(STR_HTML_PAGE, 'lxml')
logging.debug('lxml')
# Recorded output: the stray </div> is dropped and the content stays in <body>.
"""
<html>
 <body>
  BeautifulSoup
  <p>
  </p>
 </body>
</html>
"""
print(BEAUTIFULSOUP_LXML.prettify())
sys.stdout.flush()

# Parser : lxml XML parser
# Install : pip install lxml
# Usage : BeautifulSoup(markup, ["lxml", "xml"])
# Usage : BeautifulSoup(markup, "xml")
# Advantage : fast
# Advantage : the only parser that supports XML
# Disadvantage : requires an external C library
BEAUTIFULSOUP_XML = bs4.BeautifulSoup(STR_HTML_PAGE, 'xml')
logging.debug('xml')
# Recorded output: the content lands inside <html> but outside <body>.
"""
<?xml version="1.0" encoding="utf-8"?>
<html>
 <body>
 </body>
 BeautifulSoup
 <p>
 </p>
</html>
"""
print(BEAUTIFULSOUP_XML.prettify())
sys.stdout.flush()

# Parser : html5lib
# Install : pip install html5lib
# Usage : BeautifulSoup(markup, "html5lib")
# Advantage : best fault tolerance
# Advantage : parses the document the way a browser does
# Advantage : produces a document in HTML5 format
# Disadvantage : slow
# Disadvantage : depends on an external Python package
BEAUTIFULSOUP_HTML5LIB = bs4.BeautifulSoup(STR_HTML_PAGE, 'html5lib')
logging.debug('html5lib')
# Recorded output: as in a browser, an empty <head> is added.
"""
<html>
 <head>
 </head>
 <body>
  BeautifulSoup
  <p>
  </p>
 </body>
</html>
"""
print(BEAUTIFULSOUP_HTML5LIB.prettify())
# end of file
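
The four parsers disagree above because the sample page contains a closing `</div>` with no matching opening tag. As a rough pre-check before choosing a parser, the sketch below uses only the standard library's `html.parser.HTMLParser` tokenizer to list such stray closing tags. The names `StrayEndTagFinder` and `find_stray_end_tags` are hypothetical helpers introduced here for illustration; they are not part of BeautifulSoup, and this is not how BeautifulSoup repairs markup internally.

```python
from html.parser import HTMLParser

# Same markup as STR_HTML_PAGE in the script above.
STR_HTML_PAGE = "<html><body></div>BeautifulSoup<p></p></body></html>"


class StrayEndTagFinder(HTMLParser):
    """Hypothetical helper: records end tags that have no open start tag."""

    def __init__(self):
        super().__init__()
        self.open_tags = []       # stack of tag names that are currently open
        self.stray_end_tags = []  # end tags seen without a matching start tag

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Pop back to (and including) the most recent matching start tag,
            # discarding any unclosed tags opened after it.
            while self.open_tags.pop() != tag:
                pass
        else:
            self.stray_end_tags.append(tag)


def find_stray_end_tags(markup):
    """Return the names of end tags in markup that close nothing."""
    finder = StrayEndTagFinder()
    finder.feed(markup)
    finder.close()
    return finder.stray_end_tags


print(find_stray_end_tags(STR_HTML_PAGE))
```

If the returned list is not empty, the markup is malformed in a way that the parsers may repair differently, so it is worth comparing their `prettify()` output as the script above does before settling on one.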