Extracting tag data from a page with XPath

Source: Internet | Published by: 算法导论吃透后的水平 | Editor: 程序博客网 | Time: 2024/06/05 15:44

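While writing code I have gradually found that XPath really is easier to use than re, especially for pages with a complex, deeply nested structure. In those cases XPath is clearly more efficient to work with than regular expressions, and writing a complete, correct, and efficient regular expression that is actually usable takes time. So today, when parsing page data, I dropped regular expressions entirely and used XPath to parse the page content, and it worked quite well.

There is one problem, though: the number of matches XPath returns does not quite agree with the number I get by opening the page source and searching with "Ctrl+F". I went through the matches in the page one by one and found that some were not counted; in particular, some tags nested inside a block of data were left out. I do not yet know what causes this. My test program is below, and I would welcome any advice from fellow developers: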

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Written for Python 2 (urllib.urlopen and the print statement).
import lxml
import urllib
from lxml import etree


def tags_count(Html):
    htmlcount = 0
    scriptcount = 0
    iframecount = 0
    framecount = 0
    hrefcount = 0
    embedcount = 0
    objectcount = 0
    divcount = 0
    formcount = 0
    form_methodcount = 0
    form_actioncount = 0
    count = {}
    count_list = []
    # Parse the page source into an element tree
    page = etree.HTML(Html)
    htmltag_list = page.xpath('//html')
    htmlcount = len(htmltag_list)
    scripttag_list = page.xpath('//script')
    scriptcount = len(scripttag_list)
    iframetag_list = page.xpath('//iframe')
    iframecount = len(iframetag_list)
    frame_list = page.xpath('//frame')
    framecount = len(frame_list)
    # Note: href is an attribute, not an element, so '//href' matches
    # nothing; '//@href' would select the href attributes instead.
    hreftag_list = page.xpath('//href')
    hrefcount = len(hreftag_list)
    embedtag_list = page.xpath('//embed')
    embedcount = len(embedtag_list)
    objecttag_list = page.xpath('//object')
    objectcount = len(objecttag_list)
    divtag_list = page.xpath('//div')
    divcount = len(divtag_list)
    form_list = page.xpath('//form')
    formcount = len(form_list)
    # Note: the form count is only appended to count_list below,
    # it is not stored in the count dictionary.
    count["html"] = htmlcount
    count["script"] = scriptcount
    count["iframe"] = iframecount
    count["frame"] = framecount
    count["href"] = hrefcount
    count["embed"] = embedcount
    count["object"] = objectcount
    count["div"] = divcount
    count_list.append(htmlcount)
    count_list.append(scriptcount)
    count_list.append(iframecount)
    count_list.append(framecount)
    count_list.append(hrefcount)
    count_list.append(embedcount)
    count_list.append(objectcount)
    count_list.append(divcount)
    count_list.append(formcount)
    return count, count_list


if __name__ == '__main__':
    url = 'http://www.baidu.com'
    Html = urllib.urlopen(url).read()
    count, count_list = tags_count(Html)
    print count
    print count_list

Here is the result:

{'script': 1, 'frame': 0, 'object': 0, 'html': 1, 'href': 0, 'iframe': 0, 'embed': 0, 'div': 18}
[1, 1, 0, 0, 0, 0, 0, 18, 1]
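
One possible explanation for the mismatch with "Ctrl+F" is that a plain text search also counts tag-like strings inside script text and HTML comments, while the parser does not turn those into elements. The sketch below illustrates this on a small made-up page. It assumes Python 3 with lxml installed; SAMPLE_HTML and compare_counts are hypothetical names introduced for this illustration and are not part of the test program above. It also shows that href, being an attribute, is selected with //@href rather than //href. This is only an illustration under those assumptions, not a diagnosis of the actual page.

```python
from lxml import etree

# Hypothetical sample page, made up for this illustration.
SAMPLE_HTML = """
<html>
  <head>
    <script>
      // The markup below is only a string inside the script text.
      document.write("<div class='ad'>hello<\\/div>");
    </script>
  </head>
  <body>
    <!-- <div>commented-out block</div> -->
    <div id="outer">
      <div id="inner"><a href="/a">first</a></div>
    </div>
    <a href="/b">second</a>
  </body>
</html>
"""


def compare_counts(html_text):
    """Compare a raw substring search with XPath element counts."""
    page = etree.HTML(html_text)
    return {
        # What a "Ctrl+F" style search for "<div" sees in the source text
        "raw_text_div": html_text.count("<div"),
        # Real div elements in the parsed tree
        "xpath_div": len(page.xpath("//div")),
        # Elements named href (there is no such HTML element)
        "xpath_href_element": len(page.xpath("//href")),
        # href attributes on any element
        "xpath_href_attribute": len(page.xpath("//@href")),
    }


if __name__ == "__main__":
    for name, value in compare_counts(SAMPLE_HTML).items():
        print(name, value)
```

On this sample the text search counts the `<div` inside the script string and the comment as well, so it reports more matches than `//div` does. Comparing the two sets of matches on the real page in the same way may show which occurrences the parser skips.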

Below are some recommended links to articles about installing XPath-related plugins in the Firefox browser to help with parsing pages:

http://blog.csdn.net/qiyueqinglian/article/details/49280221

http://blog.sina.com.cn/s/blog_5aefba9a0100csy8.html

http://blog.csdn.net/talking12391239/article/details/17349685

http://blog.csdn.net/sxl0727tu/article/details/51897693

http://www.cnblogs.com/swllow/p/6373253.html

http://www.myexception.cn/open-source/407398.html

http://www.jianshu.com/p/512f4b501ba2

http://download.csdn.net/detail/haozi409/9313945

0 0