HTMLParser Study Notes (Part 2)

Source: Internet  Editor: 程序博客网  Date: 2024/05/17 08:07

Predicate-style filters. These classes are used together with Parser; the examples below show how.

1. TagNameFilter
TagNameFilter is the easiest filter to understand: it filters nodes by tag name.


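            // Classes used: org.htmlparser.Parser, org.htmlparser.Node, org.htmlparser.NodeFilter,
            // org.htmlparser.filters.TagNameFilter, org.htmlparser.util.NodeList.
            // URL holds the address of the page to parse; message() is a print helper defined elsewhere (not shown).
            Parser parser = new Parser(URL);
            NodeFilter filter = new TagNameFilter("DIV");
            NodeList nodes = parser.extractAllNodesThatMatch(filter);

            if (nodes != null) {
                for (int i = 0; i < nodes.size(); i++) {
                    Node textnode = nodes.elementAt(i);

                    message("getText:" + textnode.getText());
                    // Printing the node itself dumps the tag together with its children.
                    System.out.println(textnode);
                    message("=================================================");
                }
            }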

Result:

Tag (294[4,0],313[4,19]): div id="top_main"
  Txt (313[4,19],319[5,4]): \n    
  Tag (319[5,4],339[5,24]): div id="logoindex"
    Txt (339[5,24],349[6,8]): \n        
    Rem (349[6,8],360[6,19]): 这是注释
    Txt (360[6,19],391[8,0]): \n        白泽居-www.baizeju.com\n
    Tag (391[8,0],424[8,33]): a href="http://www.baizeju.com"
      Txt (424[8,33],443[8,52]): 白泽居-www.baizeju.com
      End (443[8,52],447[8,56]): /a
    Txt (447[8,56],453[9,4]): \n    
    End (453[9,4],459[9,10]): /div
  Txt (459[9,10],486[11,0]): \n    白泽居-www.baizeju.com\n
  End (486[11,0],492[11,6]): /div


getText:div id="top_main"
=================================================
Tag (319[5,4],339[5,24]): div id="logoindex"
  Txt (339[5,24],349[6,8]): \n        
  Rem (349[6,8],360[6,19]): 这是注释
  Txt (360[6,19],391[8,0]): \n        白泽居-www.baizeju.com\n
  Tag (391[8,0],424[8,33]): a href="http://www.baizeju.com"
    Txt (424[8,33],443[8,52]): 白泽居-www.baizeju.com
    End (443[8,52],447[8,56]): /a
  Txt (447[8,56],453[9,4]): \n    
  End (453[9,4],459[9,10]): /div


getText:div id="logoindex"
=================================================
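
The output above lists the nested DIV "logoindex" twice: once inside the dump of "top_main" and once as a match of its own. A minimal sketch of why, in plain Java without the HTMLParser library: the Tag class, extractIds and samplePage below are hypothetical stand-ins invented for illustration, and the tree is reduced to the tags that appear in the output. It only assumes that every node of the tree is tested against the filter, which is what the result suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class TagNameSketch {
    /** Hypothetical stand-in for a parsed tag; not an HTMLParser class. */
    static final class Tag {
        final String name;
        final String id;
        final List<Tag> children;

        Tag(String name, String id, Tag... children) {
            this.name = name;
            this.id = id;
            this.children = Arrays.asList(children);
        }
    }

    /** Depth-first walk that collects the id of every tag the filter accepts, in document order. */
    static List<String> extractIds(Tag root, Predicate<Tag> filter) {
        List<String> out = new ArrayList<>();
        collect(root, filter, out);
        return out;
    }

    private static void collect(Tag tag, Predicate<Tag> filter, List<String> out) {
        if (filter.test(tag)) {
            out.add(tag.id);
        }
        for (Tag child : tag.children) {
            collect(child, filter, out);
        }
    }

    /** The sample page from this article, reduced to its tags (ids for html, body and a are made up). */
    static Tag samplePage() {
        Tag link = new Tag("a", "link");
        Tag logo = new Tag("div", "logoindex", link);
        Tag top = new Tag("div", "top_main", logo);
        Tag body = new Tag("body", "body", top);
        return new Tag("html", "html", body);
    }

    public static void main(String[] args) {
        // Name-based filter, modelled as a case-insensitive comparison.
        Predicate<Tag> isDiv = t -> t.name.equalsIgnoreCase("DIV");
        System.out.println(extractIds(samplePage(), isDiv)); // [top_main, logoindex]
    }
}
```

Both DIVs are returned because the walk does not stop at the first match; it keeps descending into the children of a matched tag.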

2. HasChildFilter

Modify the code as follows:

Parser parser = new Parser( URL );
NodeFilter innerFilter = new TagNameFilter ("DIV");
NodeFilter filter = new HasChildFilter(innerFilter);
NodeList nodes = parser.extractAllNodesThatMatch(filter);
Output:
getText:body 
=================================================
getText:div id="top_main"
=================================================
As you can see, the output consists of the two tag nodes that have a DIV tag as a child (body has the child DIV "top_main", and "top_main" has the child DIV "logoindex").

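Note that HasChildFilter has another constructor:
public HasChildFilter (NodeFilter filter, boolean recursive)
If recursive is false, only the first-level children are tested. In the previous example, body and top_main each have a DIV node among their first-level children, so both matched. If we call the constructor like this instead: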
NodeFilter filter = new HasChildFilter( innerFilter, true );

Output:
getText:html xmlns="http://www.w3.org/1999/xhtml"
=================================================
getText:body 
=================================================
getText:div id="top_main"
=================================================
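The output now contains one more entry, html xmlns="http://www.w3.org/1999/xhtml". This is the node for the whole HTML page (the root node). It has no DIV node directly beneath it, but its child node body does, so it is matched as well.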
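
The difference between the two modes can be sketched in plain Java, without the HTMLParser library. This is a simplified model and not the library's source: the Tag class and the hasChild, matches and samplePage helpers are hypothetical names invented here, and the tree is reduced to the tags that appear in the outputs above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class HasChildSketch {
    /** Hypothetical stand-in for a parsed tag; not an HTMLParser class. */
    static final class Tag {
        final String name;
        final String id;
        final List<Tag> children;

        Tag(String name, String id, Tag... children) {
            this.name = name;
            this.id = id;
            this.children = Arrays.asList(children);
        }
    }

    /** True when a direct child (or, if recursive, any deeper descendant) passes the inner filter. */
    static boolean hasChild(Tag tag, Predicate<Tag> inner, boolean recursive) {
        for (Tag child : tag.children) {
            if (inner.test(child)) {
                return true;
            }
            if (recursive && hasChild(child, inner, true)) {
                return true;
            }
        }
        return false;
    }

    /** Walks the whole tree and returns the ids of the tags for which hasChild holds. */
    static List<String> matches(Tag root, Predicate<Tag> inner, boolean recursive) {
        List<String> out = new ArrayList<>();
        collect(root, inner, recursive, out);
        return out;
    }

    private static void collect(Tag tag, Predicate<Tag> inner, boolean recursive, List<String> out) {
        if (hasChild(tag, inner, recursive)) {
            out.add(tag.id);
        }
        for (Tag child : tag.children) {
            collect(child, inner, recursive, out);
        }
    }

    /** The sample page from this article, reduced to its tags (ids for html, body and a are made up). */
    static Tag samplePage() {
        Tag link = new Tag("a", "link");
        Tag logo = new Tag("div", "logoindex", link);
        Tag top = new Tag("div", "top_main", logo);
        Tag body = new Tag("body", "body", top);
        return new Tag("html", "html", body);
    }

    public static void main(String[] args) {
        Predicate<Tag> isDiv = t -> t.name.equalsIgnoreCase("DIV");
        Tag page = samplePage();
        System.out.println(matches(page, isDiv, false)); // [body, top_main]
        System.out.println(matches(page, isDiv, true));  // [html, body, top_main]
    }
}
```

With recursive set to false, html is skipped because its only child, body, is not a DIV. With recursive set to true, the search continues below body and finds "top_main", so html is included too.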




