Filtering Tags Out of HTML Documents with Python

Source: Internet. Editor: 程序博客网. Date: 2024/06/06 08:29

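While practicing web scraping recently, I needed to extract the body text from HTML documents. The methods I tried are summarized below.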

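Method 1:

The lxml.html.clean module provides a Cleaner class for cleaning up HTML pages. It can remove embedded or script content, special tags, CSS style annotations, and more. In lxml 5.2 and later this module lives in the separate lxml_html_clean package, which must be installed for the import to work.

Note: page_structure and safe_attrs_only both default to True. Set them to False to keep the page intact; otherwise the Cleaner also strips the structural tags of the page (such as html, head, and title) and every attribute that is not on its safe list.

Cleaner parameter reference: http://lxml.de/api/lxml.html.clean.Cleaner-class.html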

    from lxml.html.clean import Cleaner
    import requests

    url = 'http://www.csh.com.cn/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    html = requests.get(url, headers=headers).content

    # Remove the tags we do not need
    cleaner = Cleaner(style=True, scripts=True, comments=True, javascript=True,
                      page_structure=False, safe_attrs_only=False)
    # clean_html() returns a str when given a str (this assumes the page is UTF-8)
    content = cleaner.clean_html(html.decode('utf-8'))

    # The tags filtered above are gone from the output; all other tags are still present.
    print(content)

Method 2:

Filter tags with regular expressions.

(1) Filter all tags:

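    import re
    import requests

    url = 'http://www.csh.com.cn/'
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
    # Decode to str so the str pattern below can be applied (this assumes the page is UTF-8)
    html = requests.get(url, headers=headers).content.decode('utf-8')

    # Match and remove every tag
    reg = re.compile(r'<[^>]*>')
    content = reg.sub('', html).replace('\n', '').replace(' ', '')

    # The result is the page text with no tag markup left. Text inside script and
    # style elements is kept, because only the tags themselves are removed.
    print(content)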
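To see what the catch-all pattern does and does not remove without fetching a page, here is a small sketch on an inline HTML string. The sample markup and the helper name strip_all_tags are made up for illustration. It collapses whitespace instead of deleting every space, a deliberate change from the snippet above so that English words stay separated, and it shows that the text inside a script element survives.

```python
import html
import re

TAG_RE = re.compile(r'<[^>]*>')   # any tag: '<', then everything up to the next '>'
SPACE_RE = re.compile(r'\s+')     # runs of whitespace, including newlines


def strip_all_tags(markup):
    """Hypothetical helper: drop every tag, decode entities, collapse whitespace."""
    text = TAG_RE.sub(' ', markup)   # use a space so neighbouring words do not fuse
    text = html.unescape(text)       # turn &amp; into &, &lt; into <, and so on
    return SPACE_RE.sub(' ', text).strip()


sample = '<p>Tom &amp; Jerry</p>\n<script>var x = 1;</script><b>bold</b> text'
result = strip_all_tags(sample)
print(result)   # Tom & Jerry var x = 1; bold text
```

The script body "var x = 1;" is still in the output, so for real pages the script and style elements usually need to be removed before the generic tag pattern is applied.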

(2) Filter specific tags:

In my tests this approach failed to remove some script tags completely, so use it as your situation requires.

    import re

    def filter_tags(htmlstr):
        re_cdata = re.compile(r'//<!\[CDATA\[[^>]*//\]\]>', re.I)  # CDATA sections
        re_script = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)  # script elements
        re_style = re.compile(r'<\s*style[^>]*>[^<]*<\s*/\s*style\s*>', re.I)  # style elements
        re_br = re.compile(r'<br\s*?/?>')  # line breaks
        re_h = re.compile(r'</?\w+[^>]*>')  # any other HTML tag
        re_comment = re.compile(r'<!--[^>]*-->')  # HTML comments
        blank_line = re.compile(r'\n+')

        # Apply the patterns in order
        s = re_cdata.sub('', htmlstr)   # remove CDATA
        s = re_script.sub('', s)        # remove script elements
        s = re_style.sub('', s)         # remove style elements
        s = re_br.sub('\n', s)          # turn <br> into a newline
        s = re_h.sub('', s)             # remove the remaining HTML tags
        s = re_comment.sub('', s)       # remove HTML comments
        s = blank_line.sub('\n', s)     # collapse consecutive blank lines
        return s
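One likely reason some script blocks slip through is that `[^<]*` stops at the first `<` inside the script body, for example in `if (a < b)`. A hedged variant, assuming a regex-only approach is still wanted, matches the body non-greedily with re.S so that it can contain `<` and span several lines. The function name filter_tags_lenient and the sample string are made up for illustration. Regular expressions are still not an HTML parser, so malformed markup can defeat this version too.

```python
import re

# The script pattern from the function above, for comparison.
ORIGINAL_SCRIPT = re.compile(r'<\s*script[^>]*>[^<]*<\s*/\s*script\s*>', re.I)

# Non-greedy '.*?' plus re.S lets the body contain '<' and newlines.
RE_SCRIPT = re.compile(r'<\s*script[^>]*>.*?<\s*/\s*script\s*>', re.I | re.S)
RE_STYLE = re.compile(r'<\s*style[^>]*>.*?<\s*/\s*style\s*>', re.I | re.S)
RE_COMMENT = re.compile(r'<!--.*?-->', re.S)
RE_BR = re.compile(r'<br\s*/?>', re.I)
RE_TAG = re.compile(r'</?\w+[^>]*>')
RE_BLANK = re.compile(r'\n\s*\n+')


def filter_tags_lenient(htmlstr):
    """Hypothetical variant of filter_tags with non-greedy body matching."""
    s = RE_SCRIPT.sub('', htmlstr)   # remove script elements, body included
    s = RE_STYLE.sub('', s)          # remove style elements, body included
    s = RE_COMMENT.sub('', s)        # remove comments before the generic tag pass
    s = RE_BR.sub('\n', s)           # turn <br> into a newline
    s = RE_TAG.sub('', s)            # remove the remaining tags
    return RE_BLANK.sub('\n', s).strip()


sample = (
    '<script>if (a < b) { run(); }</script>\n'
    '<style>p { color: red; }</style>\n'
    '<!-- a > b -->\n'
    '<p>First line<br/>Second line</p>'
)

# The original pattern finds no match because the body contains '<'.
print(ORIGINAL_SCRIPT.search(sample))   # None

cleaned = filter_tags_lenient(sample)
print(cleaned)
```

With this sample the cleaned text is the two lines "First line" and "Second line"; the script, style, and comment content are all gone.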

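Method 3:

BeautifulSoup offers three methods for filtering specific tags:

clear(): removes the contents of the current tag and leaves the tag itself in place.

extract(): removes the current tag from the document tree and returns it.

decompose(): removes the current node from the document tree and destroys it completely.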

    from bs4 import BeautifulSoup

    markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'

    # clear() removes the contents of the current tag
    soup = BeautifulSoup(markup, 'html.parser')
    a_clear = soup.a
    i_clear = soup.i.clear()

    # extract() removes the current tag from the tree and returns it
    soup = BeautifulSoup(markup, 'html.parser')
    a_extract = soup.a
    i_extract = soup.i.extract()

    # decompose() removes the current node from the tree and destroys it
    soup = BeautifulSoup(markup, 'html.parser')
    a_decompose = soup.a
    i_decompose = soup.i.decompose()

    # Output
    print(a_clear)       # <a href="http://example.com/">I linked to <i></i></a>
    print(i_clear)       # None
    print(a_extract)     # <a href="http://example.com/">I linked to </a>
    print(i_extract)     # <i>example.com</i>
    print(a_decompose)   # <a href="http://example.com/">I linked to </a>
    print(i_decompose)   # None
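Applied to the original goal of extracting body text, a minimal sketch might remove every script and style element with decompose() and then read what is left with get_text(). The sample markup is made up for illustration, and the built-in 'html.parser' backend is an assumption chosen to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

sample = (
    '<html><head><title>Demo</title>'
    '<style>p { color: red; }</style></head>'
    '<body><script>var x = 1;</script>'
    '<p>Hello <b>world</b></p></body></html>'
)

soup = BeautifulSoup(sample, 'html.parser')

# find_all() accepts a list of tag names; decompose() removes each match in place.
for tag in soup.find_all(['script', 'style']):
    tag.decompose()

# get_text() joins the remaining strings; strip=True trims each piece first.
text = soup.get_text(separator=' ', strip=True)
print(text)   # Demo Hello world
```

decompose() fits here because the removed elements are not needed afterwards; extract() would be the choice if the script or style content had to be inspected separately.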

These methods are offered for reference only; pick whichever suits your own situation best.

Reference article:

python-27: clear(), extract(), decompose(): http://www.w2bc.com/article/89892
