Heritrix's Topical Crawling Strategies


Heritrix topical crawling strategies fall into two main categories: link-based and content-based.

They map to two extension points: extending FrontierScheduler (which decides whether a candidate URI gets scheduled into the frontier; each scheduled URI is then handled by a crawler thread) and extending Extractor (which decides what links to extract from a page's content).

I. Extending FrontierScheduler

1. Create the class org.archive.crawler.postprocessor.MyFrontierScheduler.

Subclass FrontierScheduler and override its schedule method so that only URIs containing "news" survive the filter:

package org.archive.crawler.postprocessor;

import org.archive.crawler.datamodel.CandidateURI;

public class MyFrontierScheduler extends FrontierScheduler {

    private static final long serialVersionUID = -1074778906898000967L;

    /**
     * @param name Name of this filter.
     */
    public MyFrontierScheduler(String name) {
        super(name);
    }

    @Override
    protected void schedule(CandidateURI caUri) {
        // Only pass URIs whose string form contains "news" on to the frontier.
        if (caUri.toString().contains("news")) {
            System.out.println(caUri.toString());
            getController().getFrontier().schedule(caUri);
        }
    }
}


2. Add this class to the process.options file under conf.
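The entry pairs the fully qualified class name with a display name, separated by a pipe:

org.archive.crawler.postprocessor.MyFrontierScheduler|MyFrontierScheduler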

3. Configure the crawl in the Heritrix web UI, making sure the seed (initial) address contains "news".
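For example, a seed such as http://news.sina.com.cn/ satisfies this requirement, since its URI string contains "news".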


II. Extending Extractor

1. Add a class MyExtractor in org.archive.extractor.

2. Override the extract method, which receives a CrawlURI parameter (curi).

CrawlURI is the enriched form of a candidate URI (it wraps the HttpRecorder, Link objects, and so on). Its HttpRecorder gives access to the page content as a CharSequence. Here is how the API documentation introduces the CrawlURI class:

Represents a candidate URI and the associated state it collects as it is crawled.

Core state is in instance variables but a flexible attribute list is also available. Use this 'bucket' to carry custom processing extracted data and state across CrawlURI processing. See the CandidateURI.putString(String, String), CandidateURI.getString(String), etc.
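That putString/getString "bucket" is handy for topical crawling: one processor can tag a URI and a later processor in the chain can read the tag back. A minimal sketch, assuming the Heritrix classes above are on the classpath (the key "my:topic" is hypothetical, chosen only for illustration):

import org.archive.crawler.datamodel.CrawlURI;

public class TopicTagger {
    static void tag(CrawlURI curi) {
        curi.putString("my:topic", "news");  // stash custom state on the URI
    }
    static String readTag(CrawlURI curi) {
        return curi.getString("my:topic");   // read it back later in the chain
    }
}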

Using CrawlURI, we can first obtain the page text, then search it for the links we want, and finally append those links to the URI's outlink queue.

 

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.archive.crawler.datamodel.CrawlURI;
import org.archive.crawler.extractor.Extractor;
import org.archive.crawler.extractor.Link;
import org.archive.io.ReplayCharSequence;
import org.archive.util.HttpRecorder;

public class MyExtractor extends Extractor {

    private static final long serialVersionUID = -963034874191929396L;

    // Matches anchor tags of the form <a href="XXXX" ...=... >
    // Example targets:
    //   http://news.sina.com.cn/c/nd/2015-12-08/doc-ifxmhqac0214384.shtml
    //   http://mil.news.sina.com.cn/jqs/2015-12-08/doc-ifxmnurf8411968.shtml
    private String HREF = "<a(.*)href\\s*=\\s*(\"([^\"]*)\"|([^\\s>]*))(.*)>";
    private String sinaUrl = "http://(.*)news.sina.com.cn(.*).shtml";

    public MyExtractor(String name, String description) {
        super(name, description);
    }

    public MyExtractor(String name) {
        super(name, "sina extractor");
    }

    @Override
    protected void extract(CrawlURI curi) {
        String url = "";
        try {
            HttpRecorder hr = curi.getHttpRecorder();
            if (null == hr) {
                throw new Exception("httprecorder is null");
            }
            ReplayCharSequence rc = hr.getReplayCharSequence();
            if (null == rc) {
                return;
            }
            String context = rc.toString();
            Pattern pattern = Pattern.compile(HREF, Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(context);
            while (matcher.find()) {
                url = matcher.group(2);
                url = url.replace("\"", "");
                if (url.matches(sinaUrl)) {
                    System.out.println(url);
                    // Append the matching link to this URI's outlink queue.
                    curi.createAndAddLinkRelativeToBase(url, context, Link.NAVLINK_HOP);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
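To sanity-check the two regular expressions outside of a crawl, here is a small standalone sketch (the HTML fragment is invented for illustration; the second link reuses one of the sample Sina article URLs above):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefRegexDemo {
    public static void main(String[] args) {
        String href = "<a(.*)href\\s*=\\s*(\"([^\"]*)\"|([^\\s>]*))(.*)>";
        String sinaUrl = "http://(.*)news.sina.com.cn(.*).shtml";
        // Invented page fragment; one anchor per line, since '.' does not match '\n'.
        String page = "<a href=\"http://example.com/x\">other</a>\n"
                + "<a href=\"http://news.sina.com.cn/c/nd/2015-12-08/doc-ifxmhqac0214384.shtml\">news</a>";
        Matcher m = Pattern.compile(href, Pattern.CASE_INSENSITIVE).matcher(page);
        while (m.find()) {
            String url = m.group(2).replace("\"", "");
            if (url.matches(sinaUrl)) {
                System.out.println(url); // prints only the news.sina.com.cn article URL
            }
        }
    }
}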

PS: The MyExtractor(String) constructor must be present, because Heritrix instantiates the extractor through this single-argument constructor by default.



3. Add the class to conf/process.options, then tick it in the extractor section of the web UI. Note that this class must be placed below HTTP (the fetch processor) in the chain, so that the downloaded content is available when it runs.
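Assuming the class sits in the org.archive.extractor package from step 1, the process.options entry would follow the same ClassName|DisplayName pattern as before:

org.archive.extractor.MyExtractor|MyExtractor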


