Getting to Know Gumbo, Google's Open-Source HTML5 Parsing Engine

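Google has open-sourced Gumbo, an HTML5 parsing library implemented in pure C. This is good news for anyone who works on browsers, and especially for people studying HTML5.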

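Here are the features as described online:

  • Fully conformant with the HTML5 specification
  • Robust and resilient when handling malformed input
  • A simple API that is easy to wrap from other languages
  • Support for source locations and pointers back to the original text
  • Lightweight, with no external dependencies
  • Passes all html5lib-0.95 tests
  • Tested on over 2.5 billion pages from Google's index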

I have not used it yet, so I will hold off on opinions. Enough talk, let's take a look:

https://github.com/google/gumbo-parser#gumbo---a-pure-c-html5-parser

The master.zip download is under 1 MB.

After downloading, I first tried building it under Cygwin:

huareal@gpx /cygdrive/f/pbase/gumbo
# ls
COPYING      Makefile.in  aclocal.m4      config.sub    depcomp   gumbo.pc.in  missing   src
Doxyfile     README.md    autom4te.cache  configure     docs      install-sh   python    testdata
Makefile.am  THANKS       config.guess    configure.ac  examples  ltmain.sh    setup.py  tests

huareal@gpx /cygdrive/f/pbase/gumbo
First, run

./configure

to configure the build:

huareal@gpx /cygdrive/f/pbase/gumbo
# ./configure
checking for g++... g++
checking whether the C++ compiler works... yes
checking for C++ compiler default output file name... a.exe
checking for suffix of executables... .exe
checking whether we are cross compiling... no
checking for suffix of object files... o
checking whether we are using the GNU C++ compiler... yes
checking whether g++ accepts -g... yes
checking for gcc... gcc
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ISO C89... none needed
checking for gcc option to accept ISO C99... -std=gnu99
checking how to run the C preprocessor... gcc -std=gnu99 -E
checking for grep that handles long lines and -e... /usr/bin/grep
checking for egrep... /usr/bin/grep -E
checking for ANSI C header files... yes
checking for sys/types.h... yes
checking for sys/stat.h... yes
checking for stdlib.h... yes
checking for string.h... yes
checking for memory.h... yes
checking for strings.h... yes
checking for inttypes.h... yes
checking for stdint.h... yes
checking for unistd.h... yes
checking stddef.h usability... yes
checking stddef.h presence... yes
checking for stddef.h... yes
checking for stdlib.h... (cached) yes
checking for string.h... (cached) yes
checking for strings.h... (cached) yes
checking for inline... inline
checking for size_t... yes
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /usr/bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking for style of include used by make... GNU
checking dependency style of gcc -std=gnu99... gcc3
checking dependency style of g++... gcc3
checking whether make supports nested variables... yes
checking build system type... i686-pc-cygwin
checking host system type... i686-pc-cygwin
checking how to print strings... printf
checking for a sed that does not truncate output... /usr/bin/sed
checking for fgrep... /usr/bin/grep -F
checking for ld used by gcc -std=gnu99... /usr/i686-pc-cygwin/bin/ld.exe
checking if the linker (/usr/i686-pc-cygwin/bin/ld.exe) is GNU ld... yes
checking for BSD- or MS-compatible name lister (nm)... /usr/bin/nm -B
checking the name lister (/usr/bin/nm -B) interface... BSD nm
checking whether ln -s works... yes
checking the maximum length of command line arguments... 8192
checking whether the shell understands some XSI constructs... yes
checking whether the shell understands "+="... yes
checking how to convert i686-pc-cygwin file names to i686-pc-cygwin format... func_convert_file_noop
checking how to convert i686-pc-cygwin file names to toolchain format... func_convert_file_noop
checking for /usr/i686-pc-cygwin/bin/ld.exe option to reload object files... -r
checking for objdump... objdump
checking how to recognize dependent libraries... file_magic ^x86 archive import|^x86 DLL
checking for dlltool... dlltool
checking how to associate runtime and link libraries... func_cygming_dll_for_implib
checking for ar... ar
checking for archiver @FILE support... @
checking for strip... strip
checking for ranlib... ranlib
checking command to parse /usr/bin/nm -B output from gcc -std=gnu99 object... ok
checking for sysroot... no
checking for mt... no
checking if : is a manifest tool... no
checking for dlfcn.h... yes
checking for objdir... .libs
checking if gcc -std=gnu99 supports -fno-rtti -fno-exceptions... no
checking for gcc -std=gnu99 option to produce PIC... -DDLL_EXPORT -DPIC
checking if gcc -std=gnu99 PIC flag -DDLL_EXPORT -DPIC works... yes
checking if gcc -std=gnu99 static flag -static works... no
checking if gcc -std=gnu99 supports -c -o file.o... yes
checking if gcc -std=gnu99 supports -c -o file.o... (cached) yes
checking whether the gcc -std=gnu99 linker (/usr/i686-pc-cygwin/bin/ld.exe) supports shared libraries... yes
checking whether -lc should be explicitly linked in... yes
checking dynamic linker characteristics... Win32 ld.exe
checking how to hardcode library paths into programs... immediate
checking whether stripping libraries is possible... yes
checking if libtool supports shared libraries... yes
checking whether to build shared libraries... yes
checking whether to build static libraries... yes
checking how to run the C++ preprocessor... g++ -E
checking for ld used by g++... /usr/i686-pc-cygwin/bin/ld.exe
checking if the linker (/usr/i686-pc-cygwin/bin/ld.exe) is GNU ld... yes
checking whether the g++ linker (/usr/i686-pc-cygwin/bin/ld.exe) supports shared libraries... yes
checking for g++ option to produce PIC... -DDLL_EXPORT -DPIC
checking if g++ PIC flag -DDLL_EXPORT -DPIC works... yes
checking if g++ static flag -static works... no
checking if g++ supports -c -o file.o... yes
checking if g++ supports -c -o file.o... (cached) yes
checking whether the g++ linker (/usr/i686-pc-cygwin/bin/ld.exe) supports shared libraries... yes
checking dynamic linker characteristics... Win32 ld.exe
checking how to hardcode library paths into programs... immediate
configure: creating ./config.status
config.status: creating Makefile
config.status: creating gumbo.pc
config.status: executing depfiles commands
config.status: executing libtool commands

That is quite a long list of configuration checks.

Next, run

make install

 huareal@gpx /cygdrive/f/pbase/gumbo
# make install
  CC     src/libgumbo_la-attribute.lo
  CC     src/libgumbo_la-char_ref.lo
  CC     src/libgumbo_la-error.lo
  CC     src/libgumbo_la-parser.lo
src/parser.c: In function 'handle_in_table_text':
src/parser.c:3049: warning: array subscript has type 'char'
src/parser.c: In function 'is_special_node':
src/parser.c:1550: warning: control reaches end of non-void function
  CC     src/libgumbo_la-string_buffer.lo
  CC     src/libgumbo_la-string_piece.lo
  CC     src/libgumbo_la-tag.lo
src/tag.c: In function 'gumbo_tag_from_original_text':
src/tag.c:205: warning: array subscript has type 'char'
  CC     src/libgumbo_la-tokenizer.lo
  CC     src/libgumbo_la-utf8.lo
  CC     src/libgumbo_la-util.lo
  CC     src/libgumbo_la-vector.lo
  CCLD   libgumbo.la
Creating library file: .libs/libgumbo.dll.a
  CXX    examples/clean_text.o
  CXXLD  clean_text.exe
  CXX    examples/find_links.o
  CXXLD  find_links.exe
  CC     examples/get_title.o
  CCLD   get_title.exe
  CXX    examples/positions_of_class.o
  CXXLD  positions_of_class.exe
make[1]: Entering directory `/cygdrive/f/pbase/gumbo'
test -z "/usr/local/lib" || /usr/bin/mkdir -p "/usr/local/lib"
 /bin/sh ./libtool   --mode=install /usr/bin/install -c   libgumbo.la '/usr/local/lib'
libtool: install: /usr/bin/install -c .libs/libgumbo.dll.a /usr/local/lib/libgumbo.dll.a
/usr/bin/install: cannot create regular file `/usr/local/lib/libgumbo.dll.a': Permission denied
make[1]: *** [install-libLTLIBRARIES] Error 1
make[1]: Leaving directory `/cygdrive/f/pbase/gumbo'
make: *** [install-am] Error 2

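The install failed. The library itself compiled and linked, but the current user apparently lacks write permission on /usr/local/lib/.

After fixing the permissions on the directories under usr/local in Cygwin, I ran make install again: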

huareal@gpx /cygdrive/f/pbase/gumbo
# make install
make[1]: Entering directory `/cygdrive/f/pbase/gumbo'
test -z "/usr/local/lib" || /usr/bin/mkdir -p "/usr/local/lib"
 /bin/sh ./libtool   --mode=install /usr/bin/install -c   libgumbo.la '/usr/local/lib'
libtool: install: /usr/bin/install -c .libs/libgumbo.dll.a /usr/local/lib/libgumbo.dll.a
libtool: install: base_file=`basename libgumbo.la`
libtool: install:  dlpath=`/bin/sh 2>&1 -c '. .libs/'libgumbo.la'i; echo cyggumbo-1.dll'`
libtool: install:  dldir=/usr/local/lib/`dirname ../bin/cyggumbo-1.dll`
libtool: install:  test -d /usr/local/lib/../bin || mkdir -p /usr/local/lib/../bin
libtool: install:  /usr/bin/install -c .libs/cyggumbo-1.dll /usr/local/lib/../bin/cyggumbo-1.dll
libtool: install:  chmod a+x /usr/local/lib/../bin/cyggumbo-1.dll
libtool: install:  if test -n '' && test -n 'strip --strip-unneeded'; then eval 'strip --strip-unneeded /usr/loc
libtool: install: /usr/bin/install -c .libs/libgumbo.lai /usr/local/lib/libgumbo.la
libtool: install: /usr/bin/install -c .libs/libgumbo.a /usr/local/lib/libgumbo.a
libtool: install: chmod 644 /usr/local/lib/libgumbo.a
libtool: install: ranlib /usr/local/lib/libgumbo.a
test -z "/usr/local/include" || /usr/bin/mkdir -p "/usr/local/include"
 /usr/bin/install -c -m 644 src/gumbo.h '/usr/local/include'
test -z "/usr/local/lib/pkgconfig" || /usr/bin/mkdir -p "/usr/local/lib/pkgconfig"
 /usr/bin/install -c -m 644 gumbo.pc '/usr/local/lib/pkgconfig'
make[1]: Leaving directory `/cygdrive/f/pbase/gumbo'

The installation appears to have succeeded.
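
The `array subscript has type 'char'` warnings in the first `make install` log did not stop the build, but they point at a common C pitfall. On Cygwin this warning typically appears when a plain `char` is passed to a `<ctype.h>` classification macro, which is implemented as a table lookup. Whether that is exactly what `src/parser.c:3049` and `src/tag.c:205` do is an assumption here, since those lines were not inspected. A minimal sketch of the usual fix, using a hypothetical helper `count_whitespace` that is not part of Gumbo:

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper for illustration only; it is not a Gumbo function.
 * Counts the whitespace bytes in the first `length` bytes of `text`. */
static size_t count_whitespace(const char *text, size_t length) {
    size_t count = 0;
    for (size_t i = 0; i < length; ++i) {
        /* Plain char may be signed, so bytes >= 0x80 (for example UTF-8
         * continuation bytes) can be negative. Passing a negative value
         * other than EOF to isspace() is undefined behavior, so convert
         * to unsigned char first. */
        if (isspace((unsigned char)text[i])) {
            ++count;
        }
    }
    return count;
}
```

The cast matters most for an HTML parser, because real pages routinely contain non-ASCII bytes.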

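Gumbo's unit tests depend on googletest, so next I downloaded gtest-1.6.0.zip, unpacked it, and symlinked it into the Gumbo source tree as gtest: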

huareal@gpx /cygdrive/f/pbase/gtest-1.6.0
# ls
CHANGES         CONTRIBUTORS  Makefile.am  README      build-aux  codegear   configure.ac  include  make  samples  src   xcode
CMakeLists.txt  COPYING       Makefile.in  aclocal.m4  cmake      configure  fused-src     m4       msvc  scripts  test

huareal@gpx /cygdrive/f/pbase/gtest-1.6.0
# cd ../

huareal@gpx /cygdrive/f/pbase
# ls
credis-0.2.3  gtest-1.6.0  gumbo  http_load  http_load-12mar2006.tar.tar  linux  rediscppclient

huareal@gpx /cygdrive/f/pbase
# cd gumbo

huareal@gpx /cygdrive/f/pbase/gumbo
# ln -s ../gtest-1.6.0 gtest

huareal@gpx /cygdrive/f/pbase/gumbo

Next, run

make check

The console output is long, so only the tail end is captured here:

[ RUN      ] CharRefTest.NamedReplacement
[       OK ] CharRefTest.NamedReplacement (0 ms)
[ RUN      ] CharRefTest.NamedReplacementNoSemicolon
[       OK ] CharRefTest.NamedReplacementNoSemicolon (0 ms)
[ RUN      ] CharRefTest.NamedReplacementWithInvalidUtf8
[       OK ] CharRefTest.NamedReplacementWithInvalidUtf8 (0 ms)
[ RUN      ] CharRefTest.NamedReplacementInvalid
[       OK ] CharRefTest.NamedReplacementInvalid (0 ms)
[ RUN      ] CharRefTest.AdditionalAllowedChar
[       OK ] CharRefTest.AdditionalAllowedChar (0 ms)
[ RUN      ] CharRefTest.InAttribute
[       OK ] CharRefTest.InAttribute (1 ms)
[ RUN      ] CharRefTest.MultiChars
[       OK ] CharRefTest.MultiChars (1 ms)
[ RUN      ] CharRefTest.CharAfter
[       OK ] CharRefTest.CharAfter (1 ms)
[----------] 16 tests from CharRefTest (33 ms total)

[----------] 1 test from GumboAttributeTest
[ RUN      ] GumboAttributeTest.GetAttribute
[       OK ] GumboAttributeTest.GetAttribute (0 ms)
[----------] 1 test from GumboAttributeTest (2 ms total)

[----------] Global test environment tear-down
[==========] 156 tests from 8 test cases ran. (727 ms total)
[  PASSED  ] 156 tests.
PASS: gumbo_test.exe
=============
1 test passed
=============
make[1]: Nothing to be done for `check-local'.
make[1]: Leaving directory `/cygdrive/f/pbase/gumbo'

All tests passed.

Next, let's write a program modeled on the bundled examples: