html5lib-python doc
Source: Internet. Published: aws caffe. Editor: 程序博客网. Time: 2024/06/05 07:55
http://html5lib.readthedocs.org/en/latest/
Overview
html5lib is a pure-Python library for parsing HTML. It is designed to conform to the WHATWG HTML specification, as implemented by all major web browsers.
Usage
Simple usage follows this pattern:
import html5lib

with open("mydocument.html", "rb") as f:
    document = html5lib.parse(f)
or:
import html5lib

document = html5lib.parse("<p>Hello World!")
By default, the document will be an xml.etree element instance. Whenever possible, html5lib chooses the accelerated ElementTree implementation (i.e. xml.etree.cElementTree on Python 2.x).
Two other tree types are supported: xml.dom.minidom and lxml.etree. To use an alternative format, specify the name of a treebuilder:
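As a small illustration of the default tree type, the sketch below parses a fragment and inspects the result with ordinary ElementTree calls. It assumes html5lib's default behaviour of placing HTML elements in the XHTML namespace; the variable names are only for illustration.

```python
import html5lib

document = html5lib.parse("<p>Hello World!")

# The returned object is the root <html> element. With default settings
# the tag names carry the XHTML namespace in "{namespace}name" form.
XHTML = "{http://www.w3.org/1999/xhtml}"
root_tag = document.tag

# Standard ElementTree lookups work, as long as the namespace is included.
paragraph = document.find(".//" + XHTML + "p")
paragraph_text = paragraph.text

print(root_tag)
print(paragraph_text)
```

Note that the parser supplies the html, head and body elements that the input fragment omitted, which is why the paragraph is found below the root rather than being the root itself.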
import html5lib

with open("mydocument.html", "rb") as f:
    lxml_etree_document = html5lib.parse(f, treebuilder="lxml")
When using html5lib with urllib2 (Python 2), the charset from HTTP should be passed into html5lib as follows:
from contextlib import closing
from urllib2 import urlopen
import html5lib

with closing(urlopen("http://example.com/")) as f:
    document = html5lib.parse(f, encoding=f.info().getparam("charset"))
When using html5lib with urllib.request (Python 3), the charset from HTTP should be passed into html5lib as follows:
from urllib.request import urlopen
import html5lib

with urlopen("http://example.com/") as f:
    document = html5lib.parse(f, encoding=f.info().get_content_charset())
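To see what get_content_charset() hands to the parser without making a network request, the following sketch builds a header object by hand. On Python 3 the object returned by f.info() is a subclass of email.message.Message, so the same method is available there; the helper name charset_from_content_type is hypothetical and exists only for this example.

```python
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset declared in a Content-Type value, or None."""
    headers = Message()
    headers["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None
    # when the header carries no charset parameter.
    return headers.get_content_charset()


declared = charset_from_content_type("text/html; charset=UTF-8")
missing = charset_from_content_type("text/html")

print(declared)
print(missing)
```

When the server declares no charset, the value passed as the encoding is None, and the parser has to determine the encoding from the document itself.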
To have more control over the parser, create a parser object explicitly. For instance, to make the parser raise exceptions on parse errors, use:
import html5lib

with open("mydocument.html", "rb") as f:
    parser = html5lib.HTMLParser(strict=True)
    document = parser.parse(f)
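A minimal sketch of what strict mode looks like in practice, assuming the exception class is html5lib.html5parser.ParseError and that a non-strict parser records its errors in an errors list. The input has no doctype, which the specification treats as a parse error.

```python
import html5lib
from html5lib.html5parser import ParseError

markup = "<p>Hello World!"  # no doctype, so this is a parse error

# Strict parser: the first parse error is raised as an exception.
strict_parser = html5lib.HTMLParser(strict=True)
try:
    strict_parser.parse(markup)
    raised = False
except ParseError as error:
    raised = True
    print("strict parser raised:", error)

# Default parser: the same input is accepted and the errors are collected.
lenient_parser = html5lib.HTMLParser()
lenient_parser.parse(markup)
error_count = len(lenient_parser.errors)
print("errors recorded by the default parser:", error_count)
```

Strict mode is mostly useful for validating markup you control; for content fetched from the web, the default error-recovering behaviour is usually what you want.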
When you’re instantiating parser objects explicitly, pass a treebuilder class as the tree keyword argument to use an alternative document format:
import html5lib

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")
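The resulting object can then be used through the usual xml.dom.minidom interface. The sketch below is illustrative only and assumes that element lookups by tag name behave as they do for any other minidom document.

```python
import html5lib

parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("dom"))
minidom_document = parser.parse("<p>Hello World!")

# Ordinary DOM calls apply: find the paragraph and read its text node.
paragraphs = minidom_document.getElementsByTagName("p")
first_text = paragraphs[0].firstChild.data

print(len(paragraphs))
print(first_text)
```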
More documentation is available at http://html5lib.readthedocs.org/.
Installation
html5lib works on CPython 2.6+, CPython 3.2+ and PyPy. To install it, use:
$ pip install html5lib
Optional Dependencies
The following third-party libraries may be used for additional functionality:
- datrie can be used to improve parsing performance (though in almost all cases the improvement is marginal);
- lxml is supported as a tree format (for both building and walking) under CPython (but not PyPy, where it is known to cause segfaults);
- genshi has a treewalker (but not a builder);
- charade can be used as a fallback when the character encoding cannot be determined; chardet, from which it was forked, can also be used on Python 2; and
- ordereddict can be used under Python 2.6 (collections.OrderedDict is used instead on later versions) to serialize attributes in alphabetical order.
Bugs
Please report any bugs on the issue tracker.
Tests
Unit tests require the nose library and can be run using the nosetests command in the root directory; ordereddict is required under Python 2.6. All should pass.
Test data are contained in a separate html5lib-tests repository and included as a submodule, so for git checkouts they must be initialized:
$ git submodule init
$ git submodule update
If you have all compatible Python implementations available on your system, you can run tests on all of them using the tox utility, which can be found on PyPI.
Questions?
There’s a mailing list available for support on Google Groups, html5lib-discuss, though you may get a quicker response asking on IRC in #whatwg on irc.freenode.net.
- The moving parts
- Tree builders
- Tree walkers
- Tree adapters
- Encoding discovery
- Tokenizers
- Change Log
- 0.9999
- 0.999
- 0.99
- 1.0b3
- 1.0b2
- 1.0b1
- 0.95
- 0.90
- 0.11.1
- 0.11
- 0.10
- 0.9
- 0.2
- License