Extracting tag data from a page with XPath

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 15:44

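    While writing code I have gradually found that XPath really is easier to use than re, especially when the page has a fairly complex, deeply nested data structure. In those cases XPath is clearly more efficient to work with than regular expressions, and writing a practical regular expression that is complete, correct, and efficient takes some time. So when parsing page data today I dropped regular expressions altogether and used XPath to parse the page content, and it worked quite well. There is one problem, though: the counts I get from XPath do not quite match the number of hits I get when I open the page source and search with Ctrl+F. Going through the matches in the page one by one, I found that some of them were not counted, including some tags nested inside a pile of other data. I am not yet sure what causes this. Below is my test program; advice from fellow developers is welcome: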

#!/usr/bin/python
# -*- coding: utf-8 -*-
# Python 2 script (urllib.urlopen and the print statement).
import urllib

from lxml import etree


def tags_count(Html):
    """Count selected tags in an HTML string; return (dict, list) of counts."""
    count = {}
    count_list = []
    page = etree.HTML(Html)

    htmlcount = len(page.xpath('//html'))
    scriptcount = len(page.xpath('//script'))
    iframecount = len(page.xpath('//iframe'))
    framecount = len(page.xpath('//frame'))
    # href is an attribute, not a tag: '//href' only matches elements
    # literally named <href>, so this stays 0 on ordinary HTML.
    hrefcount = len(page.xpath('//href'))
    embedcount = len(page.xpath('//embed'))
    objectcount = len(page.xpath('//object'))
    divcount = len(page.xpath('//div'))
    formcount = len(page.xpath('//form'))

    count["html"] = htmlcount
    count["script"] = scriptcount
    count["iframe"] = iframecount
    count["frame"] = framecount
    count["href"] = hrefcount
    count["embed"] = embedcount
    count["object"] = objectcount
    count["div"] = divcount
    # formcount is not stored in the dict; it only appears in count_list.

    count_list.append(htmlcount)
    count_list.append(scriptcount)
    count_list.append(iframecount)
    count_list.append(framecount)
    count_list.append(hrefcount)
    count_list.append(embedcount)
    count_list.append(objectcount)
    count_list.append(divcount)
    count_list.append(formcount)
    return count, count_list


if __name__ == '__main__':
    url = 'http://www.baidu.com'
    Html = urllib.urlopen(url).read()
    count, count_list = tags_count(Html)
    print count
    print count_list
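A note on the program above: href is an attribute rather than an element, so the XPath '//href' looks for elements named href and finds none on a normal page. The following is a minimal Python 3 sketch, not the author's code, that drives the counts from a list of tag names and uses '//*[@href]' to count elements that carry an href attribute. The names TAGS and count_tags and the sample markup are assumptions made for illustration.

```python
from lxml import etree

# Element names to count; mirrors the tags used in the program above.
TAGS = ["html", "script", "iframe", "frame", "embed", "object", "div", "form"]


def count_tags(html):
    """Return {name: count} for each tag in TAGS, plus elements with an href attribute."""
    page = etree.HTML(html)
    counts = {tag: len(page.xpath("//" + tag)) for tag in TAGS}
    # href is an attribute, so select elements that have it instead of '//href'.
    counts["href"] = len(page.xpath("//*[@href]"))
    return counts


if __name__ == "__main__":
    # Hypothetical sample markup, used here instead of fetching a live page.
    sample = """
    <html><head><link rel="stylesheet" href="a.css"></head>
    <body>
      <div><a href="/x">x</a><div><form action="/s" method="get"></form></div></div>
    </body></html>
    """
    print(count_tags(sample))
```

Keeping the tag names in one list means a tag such as form cannot be counted in one result and forgotten in the other. The original program's own output is shown next.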

Here are the results:

{'script': 1, 'frame': 0, 'object': 0, 'html': 1, 'href': 0, 'iframe': 0, 'embed': 0, 'div': 18}
[1, 1, 0, 0, 0, 0, 0, 18, 1]
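One plausible reason why counts like these differ from a Ctrl+F search of the page source: Ctrl+F matches raw text, while XPath counts elements in the tree built by the parser. Markup that sits inside an HTML comment or inside a script string shows up in a text search but is never turned into an element. The sketch below (Python 3, with hypothetical sample markup and helper names raw_count and xpath_count) illustrates the gap; it is not a diagnosis of the specific page tested above.

```python
import re

from lxml import etree

# Hypothetical sample page: "<div" appears 4 times in the raw text,
# but only 2 of those become elements after parsing.
SAMPLE = """<html><body>
<div id="a"><div id="b">nested</div></div>
<!-- <div id="commented-out"></div> -->
<script>document.write('<div id="from-js">');</script>
</body></html>"""


def raw_count(html, tag):
    """Count textual occurrences of an opening tag, roughly what Ctrl+F on '<tag' finds."""
    return len(re.findall(r"<%s\b" % re.escape(tag), html, flags=re.IGNORECASE))


def xpath_count(html, tag):
    """Count elements named `tag` in the tree that lxml builds from the markup."""
    return len(etree.HTML(html).xpath("//" + tag))


if __name__ == "__main__":
    print("raw text matches:", raw_count(SAMPLE, "div"))
    print("parsed elements: ", xpath_count(SAMPLE, "div"))
```

The HTML a script downloads may also differ from what the browser shows in its source view, for example when the server varies its response by client, so it can help to run the text search on the exact string that was passed to etree.HTML.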

Below are several recommended articles on installing XPath-related Firefox add-ons to help with parsing pages:

http://blog.csdn.net/qiyueqinglian/article/details/49280221

http://blog.sina.com.cn/s/blog_5aefba9a0100csy8.html

http://blog.csdn.net/talking12391239/article/details/17349685

http://blog.csdn.net/sxl0727tu/article/details/51897693

http://www.cnblogs.com/swllow/p/6373253.html

http://www.myexception.cn/open-source/407398.html

http://www.jianshu.com/p/512f4b501ba2

http://download.csdn.net/detail/haozi409/9313945
