Fetching and Extracting HTML Content with HtmlParser

Source: Internet. Editor: 程序博客网. Time: 2024/04/30 09:06

I. Fetching HTML content from the web

1. Fetching HTML content from a URL:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public static String getHtmlContent(String urlstr) {
    /*
     * Approach:
     * 1. Read the page: URL -> openStream -> InputStreamReader -> BufferedReader -> readLine
     * 2. Detect the character encoding automatically with cpdetector:
     *    http://sourceforge.jp/projects/sfnet_cpdetector/
     */
    if (StringUtil.isEmpty(urlstr)) return null;
    StringBuilder result = new StringBuilder();
    try {
        String charset = getCode(urlstr);
        URL url = new URL(urlstr);
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(url.openStream(), charset))) {
            String temp;
            // Test for null only. StringUtil.isNotEmpty(temp) must not be used here,
            // because temp can be "" (a blank line) before the page has ended.
            while ((temp = br.readLine()) != null) {
                result.append(temp).append("\n");
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result.toString();
}
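The point about blank lines can be checked without any network access. The following is a minimal sketch and not part of the original code: the class name ReadLinesDemo and the method readAll are hypothetical, and an in-memory stream stands in for url.openStream().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class ReadLinesDemo {
    // Hypothetical helper: reads every line from the stream and keeps blank lines.
    static String readAll(InputStream in, String charset) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new InputStreamReader(in, charset))) {
            String line;
            // readLine() returns "" for a blank line and null only at end of stream,
            // so null is the only safe stop condition.
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The page contains a blank line between the two paragraphs.
        byte[] page = "<p>first</p>\n\n<p>second</p>".getBytes(StandardCharsets.UTF_8);
        System.out.print(readAll(new ByteArrayInputStream(page), "UTF-8"));
    }
}
```

A loop that stopped on an empty string would return only the first paragraph here; stopping on null returns both.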

2. Automatically detecting the character encoding of the page source

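import java.net.URL;
import java.nio.charset.Charset;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;

public static String getCode(String url) {
    // Add the cpdetector jar, register a JChardetFacade detector with the
    // CodepageDetectorProxy, then call detectCodepage. See the library's
    // documentation for details.
    String result = "";
    if (StringUtil.isEmpty(url)) return null;

    CodepageDetectorProxy cdp = CodepageDetectorProxy.getInstance();
    cdp.add(JChardetFacade.getInstance());
    try {
        Charset detected = cdp.detectCodepage(new URL(url));
        if (detected != null) {
            result = detected.toString();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}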
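Statistical detection can guess wrong on short pages, so it may help to also look at the charset that the server or the page itself declares. The sketch below is an addition, not part of the original approach: the class DeclaredCharset and the method find are hypothetical, and the Content-Type header value or the page head is assumed to be available as a plain string.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DeclaredCharset {
    // Matches charset=GBK, charset="utf-8", charset = 'ISO-8859-1', and similar forms.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the declared charset name if the JVM supports it,
    // otherwise the supplied fallback.
    static String find(String text, String fallback) {
        if (text == null) return fallback;
        Matcher m = CHARSET.matcher(text);
        if (m.find() && isSupported(m.group(1))) {
            return m.group(1);
        }
        return fallback;
    }

    private static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(find("text/html; charset=ISO-8859-1", "UTF-8"));
        System.out.println(find("<meta charset=\"utf-8\">", "ISO-8859-1"));
    }
}
```

One possible design is to prefer the declared charset when it is present and supported, and to fall back to the detector's guess otherwise.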

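3. Summary: how to add the library and quickly work out which classes are needed

The package ships with a readme, binary-release.txt; reading it carefully is enough to get started.

CodepageDetectorProxy is a proxy class implemented as a singleton; the API documentation has

This shows that the proxy class needs a detector to delegate to, so we look up ICodepageDetector and find:

Each of these implementing classes serves a specific purpose, so from the names alone one can guess that JChardetFacade… is the most likely candidate.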

II. Extracting content with regular expressions

Add this method to StringUtil:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public static String getContentUseRegex(String regexString, String content, int index) {
    String result = "";
    if (isEmpty(regexString) || isEmpty(content)) return result;

    Pattern pattern = Pattern.compile(regexString);
    Matcher matcher = pattern.matcher(content);
    // Return the requested capture group of the first match only.
    if (matcher.find()) {
        result = matcher.group(index);
    }
    return result;
}

Test:

@Test
public void getContentUseRegexTest() {
    // <h1 itemprop="headline">习近平在中非合作论坛约翰内斯堡峰会上总结讲话</h1>
    String source = "<h1 itemprop=\"headline\">习近平在中非合作论坛约翰内斯堡峰会上总结讲话</h1>";
    String regex = "<h1(.*)itemprop=\\\"headline(.*)\\\">(.*)</h1>";
    String str = StringUtil.getContentUseRegex(regex, source, 3);
    System.out.println(str);

    // <div class="time" id="pubtime_baidu" itemprop="datePublished" content="2015-12-06T08:35:00+08:00">2015-12-06 08:35:00</div>
    source = "<div class=\"time\" id=\"pubtime_baidu\" itemprop=\"datePublished\" content=\"2015-12-06T08:35:00+08:00\">2015-12-06 08:35:00</div>";
    regex = "<div(.*)itemprop=\\\"datePublished\\\"(.*)>(.*)</div>";
    str = StringUtil.getContentUseRegex(regex, source, 3);
    System.out.println(str);
}
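getContentUseRegex returns only the first match, and its greedy (.*) groups can capture too much when one string contains several matching tags. The following sketch shows one way to collect every match using a reluctant quantifier. It is an illustration rather than part of the original StringUtil: the class RegexExtract and the method findAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexExtract {
    // Hypothetical helper: returns the given capture group of every match, in order.
    static List<String> findAll(String regex, String content, int group) {
        List<String> results = new ArrayList<>();
        if (regex == null || content == null) return results;
        Matcher m = Pattern.compile(regex).matcher(content);
        while (m.find()) {
            results.add(m.group(group));
        }
        return results;
    }

    public static void main(String[] args) {
        String html = "<h1 itemprop=\"headline\">First</h1>"
                + "<h1 itemprop=\"headline\">Second</h1>";
        // Reluctant (.*?) stops at the nearest </h1>, giving one entry per tag.
        System.out.println(findAll("<h1[^>]*itemprop=\"headline\"[^>]*>(.*?)</h1>", html, 1));
        // Greedy (.*) runs to the last </h1>, so both headlines land in one entry.
        System.out.println(findAll("<h1[^>]*itemprop=\"headline\"[^>]*>(.*)</h1>", html, 1));
    }
}
```

Using [^>]* instead of (.*) inside the opening tag also keeps the attribute match from running past the end of that tag.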

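III. Extracting content with htmlparser

Introduction: htmlparser is a pure-Java library for parsing HTML (an application of the Standard Generalized Markup Language). It does not depend on any other Java libraries and is mainly used to transform or extract HTML. It is designed to parse HTML quickly and to tolerate malformed markup. Download: http://sourceforge.net/projects/htmlparser/files/Integration-Builds/2.0-20060923/

1. Add htmlparser.jar and htmllexer.jar to the classpath

2. Wrap node-text retrieval in a helper method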

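import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;

public static String getContentUseParse(String urlstr, String encoding, String tag,
                                        String attrName, String attrVal) {
    /*
     * Approach: use the htmlparser library and call Parser.parse(AndFilter).
     * NodeFilter is an interface; AndFilter, TagNameFilter and HasAttributeFilter
     * all implement it. AndFilter can wrap other filters in layers, for example:
     *   AndFilter andFilter = new AndFilter(new TagNameFilter("h1"),
     *                                       new HasAttributeFilter("itemprop", "headline"));
     * Parsing returns a NodeList, from which the text is read.
     */
    String result = "";
    AndFilter andFilter = null;
    if (StringUtil.isEmpty(urlstr)) return result;
    if (StringUtil.isEmpty(encoding)) encoding = "utf-8";
    try {
        Parser parser = new Parser(urlstr);
        parser.setEncoding(encoding);
        if (StringUtil.isNotEmpty(attrName) && StringUtil.isNotEmpty(attrVal)) {
            // Match on tag name, attribute name and attribute value
            andFilter = new AndFilter(new TagNameFilter(tag),
                                      new HasAttributeFilter(attrName, attrVal));
        } else if (StringUtil.isNotEmpty(attrName) && StringUtil.isEmpty(attrVal)) {
            // Match on tag name and attribute name
            andFilter = new AndFilter(new TagNameFilter(tag),
                                      new HasAttributeFilter(attrName));
        } else {
            // Match on tag name only
            NodeFilter[] nodeFilters = new NodeFilter[1];
            nodeFilters[0] = new TagNameFilter(tag);
            andFilter = new AndFilter(nodeFilters);
        }
        NodeList nodeLists = parser.parse(andFilter);
        parser.reset();
        // Guard against an empty result instead of relying on the catch block
        if (nodeLists != null && nodeLists.size() > 0) {
            Node node = nodeLists.elementAt(0);
            result = node.toPlainTextString();
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return result;
}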

3. Test:

@Test
public void getHtmlContentUseParseTest() {
    // <div class="time" id="pubtime_baidu" itemprop="datePublished" content="2015-12-06T08:35:00+08:00">2015-12-06 08:35:00</div>
    // <h1 itemprop="headline">习近平在中非合作论坛约翰内斯堡峰会上总结讲话</h1>
    String encoding = HtmlUtil.getCode("http://news.sohu.com/20151206/n429917146.shtml");
    String str = HtmlUtil.getContentUseParse("http://news.sohu.com/20151206/n429917146.shtml",
            encoding, "h1", "itemprop", "headline");
    System.out.println(str);
}

1 0