Building a Personalized Crawler with HtmlParser, Day 2


Task 1:

Extract all the links on a page.

    import java.net.URL;
    import org.htmlparser.beans.LinkBean;

    // Collect every link on the page with LinkBean
    LinkBean lb = new LinkBean();
    lb.setURL("http://sthaboutme.sinaapp.com/");
    URL[] urls = lb.getLinks();
    for (int i = 0; i < urls.length; i++)
        System.out.println(urls[i]);
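LinkBean hides the parsing work: it fetches the page and returns every URL referenced by an `<a href="...">` tag. To illustrate just the idea (this is *not* how HtmlParser is implemented, and the class name and sample HTML below are made up for the demo), a naive regex-based extractor looks like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefDemo {
    // Naive href extraction; LinkBean does real HTML parsing,
    // this only illustrates the "collect every link" idea
    static final Pattern HREF = Pattern.compile("href=\"([^\"]*)\"");

    static List<String> links(String html) {
        List<String> out = new ArrayList<String>();
        Matcher m = HREF.matcher(html);
        while (m.find())
            out.add(m.group(1));   // group(1) is the URL inside the quotes
        return out;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://sthaboutme.sinaapp.com/\">home</a>"
                    + "<a href=\"http://sthaboutme.sinaapp.com/about\">about</a>";
        System.out.println(links(html));
    }
}
```

Unlike this sketch, LinkBean also resolves relative URLs against the page's base URL, which is why it returns `java.net.URL` objects rather than raw strings.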

Task 2:

Extract the links on a page that match a given pattern.


    import org.htmlparser.NodeFilter;
    import org.htmlparser.Parser;
    import org.htmlparser.filters.LinkRegexFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    try {
        Parser parser = new Parser("http://sthaboutme.sinaapp.com");
        // Regex pattern: the trailing "?" makes the final "/" optional
        String matchPattern = "http://sthaboutme.sinaapp.com/?";
        NodeFilter filter = new LinkRegexFilter(matchPattern);
        NodeList nlist = parser.extractAllNodesThatMatch(filter);
        System.out.println(nlist.size());
        for (int i = 0; i < nlist.size(); i++) {
            LinkTag link = (LinkTag) nlist.elementAt(i);
            System.out.println(link.getLink());
        }
    } catch (ParserException e) {
        e.printStackTrace();
    }
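LinkRegexFilter accepts a link when its URL contains a match for the given regular expression (a find-style match, assuming htmlparser's usual behavior). The effect of the pattern above can be checked with the standard library alone; the class name and sample URLs here are illustrative:

```java
import java.util.regex.Pattern;

public class LinkRegexDemo {
    // Same pattern as in Task 2: the trailing "?" makes the final "/" optional
    static final Pattern P = Pattern.compile("http://sthaboutme.sinaapp.com/?");

    static boolean matches(String url) {
        // find(): accept if the pattern occurs anywhere in the URL
        return P.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("http://sthaboutme.sinaapp.com"));          // true: root, no slash
        System.out.println(matches("http://sthaboutme.sinaapp.com/archives")); // true: sub-page
        System.out.println(matches("http://example.com/"));                    // false: other host
    }
}
```

Because the match is a substring search, any link under the site (not just the home page) passes the filter, which is exactly what a crawler restricted to one domain wants.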

Task 3:

Extract the links on a page that satisfy multiple conditions: here, keep links that contain "http://" while discarding in-page anchor links that contain "#", by combining the two filters with an AndFilter.

    import org.htmlparser.Node;
    import org.htmlparser.NodeFilter;
    import org.htmlparser.Parser;
    import org.htmlparser.filters.AndFilter;
    import org.htmlparser.filters.LinkRegexFilter;
    import org.htmlparser.filters.StringFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;
    import org.htmlparser.util.ParserException;

    try {
        Parser parser = new Parser("http://sthaboutme.sinaapp.com");
        String strContain = "http://"; // keep links containing this
        String strNotContain = "#";    // drop links containing this
        NodeFilter filter1 = new LinkRegexFilter(strContain);
        // Invert StringFilter: accept a link only if its URL
        // does NOT contain mPattern (the "#" above)
        NodeFilter filter2 = new StringFilter(strNotContain) {
            public boolean accept(Node node) {
                boolean ret = true;
                if (LinkTag.class.isAssignableFrom(node.getClass())) {
                    String link = ((LinkTag) node).getLink();
                    if (link.indexOf(mPattern) > -1) {
                        ret = false;
                    }
                }
                return ret;
            }
        };
        AndFilter andFilter = new AndFilter(filter1, filter2);
        NodeList nlist = parser.extractAllNodesThatMatch(andFilter);
        System.out.println(nlist.size());
        for (int i = 0; i < nlist.size(); i++) {
            LinkTag link = (LinkTag) nlist.elementAt(i);
            System.out.println(link.getLink());
        }
    } catch (ParserException e) {
        e.printStackTrace();
    }
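Stripped of the HtmlParser plumbing, the AndFilter composition above boils down to a simple predicate on the link URL: keep it if it contains "http://" and does not contain "#". A stdlib-only sketch of that accept logic (the class name and sample links are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class MultiConditionDemo {
    // Mirrors AndFilter(filter1, filter2):
    // contains "http://" AND does not contain "#"
    static boolean accept(String link) {
        return link.contains("http://") && !link.contains("#");
    }

    public static void main(String[] args) {
        String[] links = {
            "http://sthaboutme.sinaapp.com/",
            "http://sthaboutme.sinaapp.com/post#comments", // anchor link, dropped
            "mailto:admin@example.com"                     // no "http://", dropped
        };
        List<String> kept = new ArrayList<String>();
        for (String link : links)
            if (accept(link))
                kept.add(link);
        System.out.println(kept);
    }
}
```

Expressing each condition as its own NodeFilter, as the task does, keeps the pieces reusable: the same inverted "#" filter can be AND-ed with any other filter in later crawls.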