BeautifulSoup fails to parse a web page's full HTML

Source: Internet | Editor: 程序博客网 | Time: 2024/06/05 07:25

Environment: Python 2.7.3

>>> html = gethtml('http://www.joiway.com/')
>>> soup = BeautifulSoup(html)
>>> soup.find_all("a",href=True)
[]
>>> soup.find_all("a")
[]
>>> soup.find_all("link")
[<link href="http://oss.aliyuncs.com/jianzhimao/web-res/icon/jianzhimao-logo-min.png" rel="icon" type="image/x-icon"/>, <link href="http://oss.aliyuncs.com/jianzhimao/web-res/icon/jianzhimao-logo-min.png" resl="shortcut icon" type="image/x-icon"/>, <link href="/templets/default/style/style.css" rel="stylesheet" type="text/css">\n<!--\u2013[if lt IE9]-->\n<script>\n(function() {\n    if (!\n    /*@cc_on!@*/\n    0) return;\n    var e = "abbr, article, aside, audio, canvas, datalist, details, dialog, eventsource, figure, figcaption, footer, header, hgroup, main, mark, menu, meter, nav, output, progress, section, time, video".split(', ');\n    var i= e.length;\n    while (i--){\n        document.createElement(e[i])\n    }\n})()\n</script>\n</link>]

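As shown above, soup.find_all() finds no a tags, even though viewing the site's source in Chrome shows that a tags are present. The gethtml function (source omitted here) works correctly; printing soup on its own shows that only part of the HTML was parsed.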

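Cause: no third-party HTML parser was installed, so BeautifulSoup(html), which names no parser, fell back to the default one, Python's built-in html.parser. On Python 2.7.3 that default parser tolerates malformed markup poorly.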
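To check whether this cause applies on a given machine, it can help to see which parsers are importable. The following is a minimal sketch, assuming Python 3 (the session above used 2.7.3); the helper names installed_parsers and likely_default_parser are made up for illustration. The preference order is the one the Beautiful Soup documentation describes for the case where no parser is named.

```python
import importlib.util

# Order in which Beautiful Soup prefers HTML parsers when none is named,
# according to its documentation: lxml, then html5lib, then html.parser.
PREFERENCE = ["lxml", "html5lib", "html.parser"]


def installed_parsers():
    """Return the names from PREFERENCE that can be imported here."""
    available = []
    for name in PREFERENCE:
        # find_spec returns None when a top-level package is not installed.
        if importlib.util.find_spec(name) is not None:
            available.append(name)
    return available


def likely_default_parser():
    """Guess which parser BeautifulSoup(html) would pick (hypothetical helper)."""
    # html.parser ships with Python, so the list is never empty.
    return installed_parsers()[0]
```

If likely_default_parser() returns "html.parser", no third-party parser is installed and the situation matches the one described above.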

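Fix: pip install html5lib (or lxml). Once one of them is installed, BeautifulSoup prefers it over the built-in parser when no parser is named.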
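Rather than relying on whichever parser happens to be installed, the parser can also be named explicitly as the second argument to BeautifulSoup. The sketch below assumes Python 3 with bs4 installed. The function parse_with_fallback, the SAMPLE_HTML string, and the /jobs link are hypothetical and not taken from the original site. It tries html5lib first, then lxml, then the built-in html.parser.

```python
# Hypothetical sample page: a stylesheet link, an IE conditional comment,
# and one anchor in the body.
SAMPLE_HTML = """<html><head>
<link href="/templets/default/style/style.css" rel="stylesheet" type="text/css">
<!--[if lt IE 9]><script>var e = "abbr, article";</script><![endif]-->
</head><body><a href="/jobs">Jobs</a></body></html>"""


def parse_with_fallback(html, parsers=("html5lib", "lxml", "html.parser")):
    """Parse html with the first parser in `parsers` that is installed.

    Returns a (soup, parser_name) tuple.
    """
    # Imported lazily so this sketch can be loaded even where bs4 is absent.
    from bs4 import BeautifulSoup, FeatureNotFound

    for name in parsers:
        try:
            return BeautifulSoup(html, name), name
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is installed")
```

A call such as soup, used = parse_with_fallback(html) followed by soup.find_all("a", href=True) then shows both the links found and which parser produced them, which makes differences between parsers easier to spot.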

Reference: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#id5

0 0