[Building a Vertical Search Engine, Part 10] Filters in HtmlParser in Practice

Source: Internet  Posted under: app data scraping  Editor: 程序博客网  Date: 2024/06/03 09:26

Types of Filter:


Predicate filters:

TagNameFilter
HasAttributeFilter
HasChildFilter
HasParentFilter
HasSiblingFilter
IsEqualFilter


Logical-operation filters:

AndFilter
NotFilter
OrFilter
XorFilter


Other filters:

NodeClassFilter
StringFilter
LinkStringFilter
LinkRegexFilter
RegexFilter
CssSelectorNodeFilter


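This post covers TagNameFilter, HasChildFilter, and HasAttributeFilter, and shows how to combine them. The first example uses a single TagNameFilter to extract the <p> tags of a page: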

package org.algorithm;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class FilterImg {
    public static void main(String[] args) throws ParserException {
        // Fetch and parse the page at the given URL.
        Parser parser = new Parser("http://smart.huanqiu.com/roll/2016-08/9351546.html");
        // Match every <p> tag.
        NodeFilter filter = new TagNameFilter("p");
        NodeList nodes = parser.extractAllNodesThatMatch(filter);
        // Print the first matching node, if any.
        Node source = nodes.elementAt(0);
        String sou = "";
        if (source != null) {
            sou = source.toString();
        }
        System.out.println(sou);
    }
}

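Scenario 1:
Suppose you want to extract the links on a page that contain an image. The approach is simple: use a TagNameFilter for the link tag, a HasChildFilter for the image, and an AndFilter to combine the two. The code is as follows: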

package org.algorithm;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasChildFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class FilterImg {
    public static void main(String[] args) throws ParserException {
        // Fetch and parse the page at the given URL.
        Parser parser = new Parser("http://smart.huanqiu.com/roll/2016-08/9351546.html");
        // Match <a> tags that have an <img> tag as a child.
        NodeFilter filter = new AndFilter(
                new TagNameFilter("a"),
                new HasChildFilter(new TagNameFilter("img")));
        NodeList nodes = parser.extractAllNodesThatMatch(filter);
        // Print the first matching node, if any.
        Node source = nodes.elementAt(0);
        String sou = "";
        if (source != null) {
            sou = source.toString();
        }
        System.out.println(sou);
    }
}
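To make the composition in Scenario 1 easier to reason about, here is a minimal, library-free sketch of the same idea. The Tag and TagFilter types, the helper methods, and the sample tree are all hypothetical stand-ins invented for illustration; they are not HtmlParser's classes, and the real implementation may differ in detail. The sketch only shows the pattern: each filter answers yes or no for one node, and a combining filter delegates to the filters it wraps.

```java
import java.util.ArrayList;
import java.util.List;

public class FilterCompositionSketch {
    // Hypothetical stand-in for a parsed tag; not HtmlParser's Node type.
    static final class Tag {
        final String name;
        final List<Tag> children = new ArrayList<>();

        Tag(String name, Tag... kids) {
            this.name = name;
            for (Tag k : kids) children.add(k);
        }
    }

    // One yes/no decision per node, in the spirit of a node filter.
    interface TagFilter {
        boolean accept(Tag tag);
    }

    // Accepts tags with the given name, ignoring case.
    static TagFilter tagName(String name) {
        return tag -> tag.name.equalsIgnoreCase(name);
    }

    // Accepts tags that have at least one direct child accepted by inner.
    static TagFilter hasChild(TagFilter inner) {
        return tag -> tag.children.stream().anyMatch(inner::accept);
    }

    // Accepts tags that both wrapped filters accept.
    static TagFilter and(TagFilter left, TagFilter right) {
        return tag -> left.accept(tag) && right.accept(tag);
    }

    // Depth-first walk that collects every tag the filter accepts.
    static List<Tag> extractAll(Tag root, TagFilter filter) {
        List<Tag> out = new ArrayList<>();
        collect(root, filter, out);
        return out;
    }

    private static void collect(Tag tag, TagFilter filter, List<Tag> out) {
        if (filter.accept(tag)) out.add(tag);
        for (Tag child : tag.children) collect(child, filter, out);
    }

    // Made-up tree: <body><a><img></a><a><span></a><p><img></p></body>
    static Tag sampleTree() {
        return new Tag("body",
                new Tag("a", new Tag("img")),
                new Tag("a", new Tag("span")),
                new Tag("p", new Tag("img")));
    }

    public static void main(String[] args) {
        TagFilter imageLinks = and(tagName("a"), hasChild(tagName("img")));
        List<Tag> matches = extractAll(sampleTree(), imageLinks);
        System.out.println(matches.size());
    }
}
```

In the sample tree only the first link qualifies: the second link has no image child, and the paragraph has an image but is not a link.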

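Scenario 2:
Given page markup such as <div class="f"><li class="m">, how do you extract its content? This is not hard either, and again takes three filters: TagNameFilter, HasAttributeFilter, and AndFilter. The code below demonstrates the pattern by matching <p> tags that carry a title attribute; for the markup above, you would swap in the corresponding tag name and use the two-argument form HasAttributeFilter("class", "m"):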

package org.algorithm;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public class FilterImg {
    public static void main(String[] args) throws ParserException {
        // Fetch and parse the page at the given URL.
        Parser parser = new Parser("http://smart.huanqiu.com/roll/2016-08/9351546.html");
        // Match <p> tags that have a title attribute (any value).
        NodeFilter filter = new AndFilter(
                new TagNameFilter("p"),
                new HasAttributeFilter("title"));
        NodeList nodes = parser.extractAllNodesThatMatch(filter);
        // Print the first matching node, if any.
        Node source = nodes.elementAt(0);
        String sou = "";
        if (source != null) {
            sou = source.toString();
        }
        System.out.println(sou);
    }
}
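The <div class="f"><li class="m"> case from the start of Scenario 2 also involves a parent relationship, which the listing above does not exercise. Below is a small library-free sketch of how matching on an attribute value and on the parent could fit together. Everything in it (Tag, TagFilter, the helper names, the sample tree) is a hypothetical stand-in written for illustration, not HtmlParser's API. With the real library, the corresponding building blocks would presumably be HasAttributeFilter("class", "m") and HasParentFilter from the list at the top of this post, combined with AndFilter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AttributeFilterSketch {
    // Hypothetical stand-in for a parsed tag; not HtmlParser's Node type.
    static final class Tag {
        final String name;
        final Map<String, String> attrs = new HashMap<>();
        final List<Tag> children = new ArrayList<>();
        Tag parent;

        Tag(String name, String cssClass, Tag... kids) {
            this.name = name;
            if (cssClass != null) attrs.put("class", cssClass);
            for (Tag k : kids) {
                k.parent = this;
                children.add(k);
            }
        }
    }

    interface TagFilter {
        boolean accept(Tag tag);
    }

    static TagFilter tagName(String name) {
        return t -> t.name.equalsIgnoreCase(name);
    }

    // Accepts tags whose attribute `key` is present and equals `value`.
    static TagFilter hasAttribute(String key, String value) {
        return t -> value.equals(t.attrs.get(key));
    }

    // Accepts tags whose direct parent is accepted by inner.
    static TagFilter hasParent(TagFilter inner) {
        return t -> t.parent != null && inner.accept(t.parent);
    }

    // Accepts tags that every given filter accepts.
    static TagFilter and(TagFilter... parts) {
        return t -> {
            for (TagFilter p : parts) {
                if (!p.accept(t)) return false;
            }
            return true;
        };
    }

    // Depth-first walk that collects every tag the filter accepts.
    static List<Tag> extractAll(Tag root, TagFilter filter) {
        List<Tag> out = new ArrayList<>();
        collect(root, filter, out);
        return out;
    }

    private static void collect(Tag tag, TagFilter filter, List<Tag> out) {
        if (filter.accept(tag)) out.add(tag);
        for (Tag child : tag.children) collect(child, filter, out);
    }

    // Made-up tree: one div.f holding li.m and li.x, and one div.g holding li.m.
    static Tag sampleTree() {
        return new Tag("body", null,
                new Tag("div", "f", new Tag("li", "m"), new Tag("li", "x")),
                new Tag("div", "g", new Tag("li", "m")));
    }

    // <li class="m"> whose direct parent is <div class="f">.
    static TagFilter targetFilter() {
        return and(
                tagName("li"),
                hasAttribute("class", "m"),
                hasParent(and(tagName("div"), hasAttribute("class", "f"))));
    }

    public static void main(String[] args) {
        System.out.println(extractAll(sampleTree(), targetFilter()).size());
    }
}
```

Of the three list items in the sample tree, only the one that has class "m" and sits directly under the div with class "f" passes all three conditions.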