Getting to Know Gumbo, Google's Open-Source HTML5 Parsing Engine
Source: Internet. Editor: 程序博客网. Time: 2024/05/20 01:44
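Google has open-sourced Gumbo, an HTML5 parsing library implemented in pure C. It is good news for anyone who studies browsers, and HTML5 in particular.
Here are the features as described online:
- Fully conformant with the HTML5 specification
- Robust, and resilient when handling malformed input
- A simple API that can easily be wrapped by other languages
- Support for source locations and pointers back to the original text
- Small, with no external dependencies
- Passes all html5lib-0.95 tests
- Tested on over 2.5 billion pages from Google's index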
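To make the "source locations and pointers back to the original text" feature more concrete, here is a minimal, library-free sketch of the general idea: a parsed piece is just a pointer and a length into the input buffer, and a line/column position can be recovered from that pointer. The names `Piece`, `SourcePos`, `find_piece` and `piece_position` are hypothetical and chosen for illustration only; they are not Gumbo's own definitions, and Gumbo's real types may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types for illustration; NOT Gumbo's actual definitions. */
typedef struct {
    const char *data;   /* points into the original buffer, not a copy */
    size_t length;      /* number of bytes in the slice */
} Piece;

typedef struct {
    unsigned line;      /* 1-based line number */
    unsigned column;    /* 1-based column, counted in bytes */
    size_t offset;      /* 0-based byte offset from the buffer start */
} SourcePos;

/* Return a slice covering the first occurrence of `needle` in `source`,
   or an empty slice (data == NULL) if it does not occur. */
Piece find_piece(const char *source, const char *needle) {
    Piece p = { NULL, 0 };
    const char *hit = strstr(source, needle);
    if (hit != NULL) {
        p.data = hit;
        p.length = strlen(needle);
    }
    return p;
}

/* Recover line, column and offset of a slice by walking the original
   buffer from its start up to the slice's first byte. */
SourcePos piece_position(const char *source, Piece piece) {
    SourcePos pos = { 1, 1, 0 };
    assert(piece.data != NULL && piece.data >= source);
    for (const char *c = source; c < piece.data; ++c) {
        if (*c == '\n') {
            pos.line++;
            pos.column = 1;
        } else {
            pos.column++;
        }
    }
    pos.offset = (size_t)(piece.data - source);
    return pos;
}
```

Because the slice never copies the text, a tool built this way can report where a tag sits in the original file, or rewrite the source in place.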
I won't pass judgment before trying it myself, so enough talk: let's take a look.
https://github.com/google/gumbo-parser#gumbo---a-pure-c-html5-parser
The master.zip download is under 1 MB.
Once the download finishes, the first step is a trial build under Cygwin.
huareal@gpx /cygdrive/f/pbase/gumbo
# ls
COPYING Makefile.in aclocal.m4 config.sub depcomp gumbo.pc.in missing src
Doxyfile README.md autom4te.cache configure docs install-sh python testdata
Makefile.am THANKS config.guess configure.ac examples ltmain.sh setup.py tests
huareal@gpx /cygdrive/f/pbase/gumbo
First, run
./configure
to configure the build:
huareal@gpx /cygdrive/f/pbase/gumbo
# ./configure
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.exe
checking for suffix of executables... .exe
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking for gcc... gcc
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for gcc option to accept ISO C99... -std=gnu99
checking how to run the C preprocessor... gcc -std=gnu99 -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking stddef.h usability... yes
checking stddef.h presence... yes
checking for stddef.h... yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking for strings.h... (cached) yes
checking for inline... inline
checking for size_t... yes
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for style of include used by make... GNU
checking dependency style of gcc -std=gnu99... gcc3
checking dependency style of g++... gcc3
checking whether make supports nested variables... yes
checking build system type... i686-pc-cygwin
checking host system type... i686-pc-cygwin
checking how to print strings... printf
checking for a sed that does not truncate output... /usr/bin/sed
checking for fgrep... /usr/bin/grep -F
checking for ld used by gcc -std=gnu99... /usr/i686-pc-cygwin/bin/ld.exe
checking if the linker (/usr/i686-pc-cygwin/bin/ld.exe) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 8192
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking how to convert i686-pc-cygwin file names to i686-pc-cygwin format... func_convert_file_noop
checking how to convert i686-pc-cygwin file names to toolchain format... func_convert_file_noop
checking for /usr/i686-pc-cygwin/bin/ld.exe option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... file_magic ^x86 archive import|^x86 DLL
checking for dlltool... dlltool
checking how to associate runtime and link libraries... func_cygming_dll_for_implib
checking for ar... ar
checking for archiver @FILE support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc -std=gnu99 object... ok
checking for sysroot... no
checking for mt... no
checking if : is a manifest tool... no
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc -std=gnu99 supports -fno-rtti -fno-exceptions... no
checking for gcc -std=gnu99 option to produce PIC... -DDLL_EXPORT -DPIC
checking if gcc -std=gnu99 PIC flag -DDLL_EXPORT -DPIC works... yes
checking if gcc -std=gnu99 static flag -static works... no
checking if gcc -std=gnu99 supports -c -o file.o... yes
checking if gcc -std=gnu99 supports -c -o file.o... (cached) yes
checking whether the gcc -std=gnu99 linker (/usr/i686-pc-cygwin/bin/ld.exe) supports shared libraries... yes
checking whether -lc should be explicitly linked in... yes
checking dynamic linker characteristics... Win32 ld.exe
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... g++ -E
checking for ld used by g++... /usr/i686-pc-cygwin/bin/ld.exe
checking if the linker (/usr/i686-pc-cygwin/bin/ld.exe) is GNU ld... yes
checking whether the g++ linker (/usr/i686-pc-cygwin/bin/ld.exe) supports shared libraries... yes
checking for g++ option to produce PIC... -DDLL_EXPORT -DPIC
checking if g++ PIC flag -DDLL_EXPORT -DPIC works... yes
checking if g++ static flag -static works... no
checking if g++ supports -c -o file.o... yes
checking if g++ supports -c -o file.o... (cached) yes
checking whether the g++ linker (/usr/i686-pc-cygwin/bin/ld.exe) supports shared libraries... yes
checking dynamic linker characteristics... Win32 ld.exe
checking how to hardcode library paths into programs... immediate
configure: creating ./config.status
config.status: creating Makefile
config.status: creating gumbo.pc
config.status: executing depfiles commands
config.status: executing libtool commands
That is quite a lot of configuration checks.
Next, run
make install
huareal@gpx /cygdrive/f/pbase/gumbo
# make install
CC src/libgumbo_la-attribute.lo
CC src/libgumbo_la-char_ref.lo
CC src/libgumbo_la-error.lo
CC src/libgumbo_la-parser.lo
src/parser.c: In function 'handle_in_table_text':
src/parser.c:3049: warning: array subscript has type 'char'
src/parser.c: In function 'is_special_node':
src/parser.c:1550: warning: control reaches end of non-void function
CC src/libgumbo_la-string_buffer.lo
CC src/libgumbo_la-string_piece.lo
CC src/libgumbo_la-tag.lo
src/tag.c: In function 'gumbo_tag_from_original_text':
src/tag.c:205: warning: array subscript has type 'char'
CC src/libgumbo_la-tokenizer.lo
CC src/libgumbo_la-utf8.lo
CC src/libgumbo_la-util.lo
CC src/libgumbo_la-vector.lo
CCLD libgumbo.la
Creating library file: .libs/libgumbo.dll.a
CXX examples/clean_text.o
CXXLD clean_text.exe
CXX examples/find_links.o
CXXLD find_links.exe
CC examples/get_title.o
CCLD get_title.exe
CXX examples/positions_of_class.o
CXXLD positions_of_class.exe
make[1]: Entering directory `/cygdrive/f/pbase/gumbo'
test -z "/usr/local/lib" || /usr/bin/mkdir -p "/usr/local/lib"
/bin/sh ./libtool --mode=install /usr/bin/install -c libgumbo.la '/usr/local/lib'
libtool: install: /usr/bin/install -c .libs/libgumbo.dll.a /usr/local/lib/libgumbo.dll.a
/usr/bin/install: cannot create regular file `/usr/local/lib/libgumbo.dll.a': Permission denied
make[1]: *** [install-libLTLIBRARIES] Error 1
make[1]: Leaving directory `/cygdrive/f/pbase/gumbo'
make: *** [install-am] Error 2
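The install failed, most likely because the current user has no write permission on /usr/local/lib/.
After fixing the permissions on the directories under /usr/local in Cygwin, run make install again: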
huareal@gpx /cygdrive/f/pbase/gumbo
# make install
make[1]: Entering directory `/cygdrive/f/pbase/gumbo'
test -z "/usr/local/lib" || /usr/bin/mkdir -p "/usr/local/lib"
/bin/sh ./libtool --mode=install /usr/bin/install -c libgumbo.la '/usr/local/lib'
libtool: install: /usr/bin/install -c .libs/libgumbo.dll.a /usr/local/lib/libgumbo.dll.a
libtool: install: base_file=`basename libgumbo.la`
libtool: install: dlpath=`/bin/sh 2>&1 -c '. .libs/'libgumbo.la'i; echo cyggumbo-1.dll'`
libtool: install: dldir=/usr/local/lib/`dirname ../bin/cyggumbo-1.dll`
libtool: install: test -d /usr/local/lib/../bin || mkdir -p /usr/local/lib/../bin
libtool: install: /usr/bin/install -c .libs/cyggumbo-1.dll /usr/local/lib/../bin/cyggumbo-1.dll
libtool: install: chmod a+x /usr/local/lib/../bin/cyggumbo-1.dll
libtool: install: if test -n '' && test -n 'strip --strip-unneeded'; then eval 'strip --strip-unneeded /usr/loc
libtool: install: /usr/bin/install -c .libs/libgumbo.lai /usr/local/lib/libgumbo.la
libtool: install: /usr/bin/install -c .libs/libgumbo.a /usr/local/lib/libgumbo.a
libtool: install: chmod 644 /usr/local/lib/libgumbo.a
libtool: install: ranlib /usr/local/lib/libgumbo.a
test -z "/usr/local/include" || /usr/bin/mkdir -p "/usr/local/include"
/usr/bin/install -c -m 644 src/gumbo.h '/usr/local/include'
test -z "/usr/local/lib/pkgconfig" || /usr/bin/mkdir -p "/usr/local/lib/pkgconfig"
/usr/bin/install -c -m 644 gumbo.pc '/usr/local/lib/pkgconfig'
make[1]: Leaving directory `/cygdrive/f/pbase/gumbo'
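The installation appears to have succeeded.
Next, download gtest-1.6.0.zip, unzip it, and symlink it into the gumbo source tree as gtest: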
huareal@gpx /cygdrive/f/pbase/gtest-1.6.0
# ls
CHANGES CONTRIBUTORS Makefile.am README build-aux codegear configure.ac include make samples src xcode
CMakeLists.txt COPYING Makefile.in aclocal.m4 cmake configure fused-src m4 msvc scripts test
huareal@gpx /cygdrive/f/pbase/gtest-1.6.0
# cd ../
huareal@gpx /cygdrive/f/pbase
# ls
credis-0.2.3 gtest-1.6.0 gumbo http_load http_load-12mar2006.tar.tar linux rediscppclient
huareal@gpx /cygdrive/f/pbase
# cd gumbo
huareal@gpx /cygdrive/f/pbase/gumbo
# ln -s ../gtest-1.6.0 gtest
huareal@gpx /cygdrive/f/pbase/gumbo
Next, run
make check
The console output is long, so only the tail end is captured here:
[ RUN ] CharRefTest.NamedReplacement
[ OK ] CharRefTest.NamedReplacement (0 ms)
[ RUN ] CharRefTest.NamedReplacementNoSemicolon
[ OK ] CharRefTest.NamedReplacementNoSemicolon (0 ms)
[ RUN ] CharRefTest.NamedReplacementWithInvalidUtf8
[ OK ] CharRefTest.NamedReplacementWithInvalidUtf8 (0 ms)
[ RUN ] CharRefTest.NamedReplacementInvalid
[ OK ] CharRefTest.NamedReplacementInvalid (0 ms)
[ RUN ] CharRefTest.AdditionalAllowedChar
[ OK ] CharRefTest.AdditionalAllowedChar (0 ms)
[ RUN ] CharRefTest.InAttribute
[ OK ] CharRefTest.InAttribute (1 ms)
[ RUN ] CharRefTest.MultiChars
[ OK ] CharRefTest.MultiChars (1 ms)
[ RUN ] CharRefTest.CharAfter
[ OK ] CharRefTest.CharAfter (1 ms)
[----------] 16 tests from CharRefTest (33 ms total)
[----------] 1 test from GumboAttributeTest
[ RUN ] GumboAttributeTest.GetAttribute
[ OK ] GumboAttributeTest.GetAttribute (0 ms)
[----------] 1 test from GumboAttributeTest (2 ms total)
[----------] Global test environment tear-down
[==========] 156 tests from 8 test cases ran. (727 ms total)
[ PASSED ] 156 tests.
PASS: gumbo_test.exe
=============
1 test passed
=============
make[1]: Nothing to be done for `check-local'.
make[1]: Leaving directory `/cygdrive/f/pbase/gumbo'
The tests pass.
The next step is to write a small program modeled on the bundled examples.