Heritrix 3.3.0 Source Reading: URI Filtering Rules


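In "Heritrix 3.3.0 Source Reading: Configuring URI Filtering Rules in crawler-beans.cxml", we saw the classes that Heritrix 3.3.0 configures to decide whether a URI is accepted. The goal of this post is to read the source code and understand:

(1) how a single URI rule class works;

(2) how a series of URI rule classes work together.

Let's start with the first question.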

(1)

Every URI rule class must extend the abstract class DecideRule:

package org.archive.modules.deciderules;

import java.io.Serializable;

import org.archive.modules.CrawlURI;
import org.archive.spring.HasKeyedProperties;
import org.archive.spring.KeyedProperties;

public abstract class DecideRule implements Serializable, HasKeyedProperties {
    // A thread-safe map that stores this rule's key/value settings
    protected KeyedProperties kp = new KeyedProperties();
    public KeyedProperties getKeyedProperties() {
        return kp;
    }

    {
        setEnabled(true);
    }
    public boolean getEnabled() {
        return (Boolean) kp.get("enabled");
    }
    public void setEnabled(boolean enabled) {
        kp.put("enabled", enabled);
    }

    protected String comment = "";
    public String getComment() {
        return comment;
    }
    public void setComment(String comment) {
        this.comment = comment;
    }

    public DecideRule() {
    }

    /**
     * Makes a decision for a URI.
     * @param uri
     * @return
     */
    public DecideResult decisionFor(CrawlURI uri) {
        // If enabled is false, return DecideResult.NONE
        if (!getEnabled()) {
            return DecideResult.NONE;
        }
        // innerDecide is the method that actually makes the decision
        DecideResult result = innerDecide(uri);

        // This check looks redundant to me; if anyone knows what it is for,
        // please let me know
        if (result == DecideResult.NONE) {
            return result;
        }
        return result;
    }

    /**
     * The method that actually makes the decision.
     * @param uri
     * @return
     */
    protected abstract DecideResult innerDecide(CrawlURI uri);

    /**
     * Useful when this rule can only ever produce one decision.
     * @param uri
     * @return
     */
    public DecideResult onlyDecision(CrawlURI uri) {
        return null;
    }

    /**
     * Tells whether a URI is accepted.
     * @param uri
     * @return
     */
    public boolean accepts(CrawlURI uri) {
        // Compares the result of decisionFor with DecideResult.ACCEPT
        // to determine whether the URI is accepted
        return DecideResult.ACCEPT == decisionFor(uri);
    }
}
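The value of enabled determines whether a rule processes URIs at all: true means it does, false means it does not. The method that returns a rule's decision for a given URI is decisionFor. When enabled is false, it returns NONE immediately (the meaning of NONE is given just below); when enabled is true, it calls innerDecide to process the URI. innerDecide is implemented in subclasses. onlyDecision also deserves a mention: it is useful when a rule can only ever return one kind of decision. The base implementation returns null, meaning the rule has no single fixed decision.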
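The gating behavior described above can be sketched in isolation. This is a minimal illustration, not Heritrix code: `Decision`, `Rule`, and `AcceptAll` are simplified stand-ins invented here, and a plain `String` takes the place of `CrawlURI`.

```java
// Simplified stand-ins for illustration only; these are NOT the Heritrix classes.
final class EnabledGateSketch {
    enum Decision { ACCEPT, NONE, REJECT }

    /** Same shape as DecideRule: a public gate around an abstract decision. */
    abstract static class Rule {
        private boolean enabled = true;

        public void setEnabled(boolean enabled) {
            this.enabled = enabled;
        }

        public final Decision decisionFor(String uri) {
            // A disabled rule abstains instead of deciding.
            return enabled ? innerDecide(uri) : Decision.NONE;
        }

        public boolean accepts(String uri) {
            // Only an explicit ACCEPT counts; NONE is not acceptance.
            return decisionFor(uri) == Decision.ACCEPT;
        }

        protected abstract Decision innerDecide(String uri);
    }

    /** Hypothetical rule that accepts every URI. */
    static class AcceptAll extends Rule {
        @Override
        protected Decision innerDecide(String uri) {
            return Decision.ACCEPT;
        }
    }

    public static void main(String[] args) {
        Rule rule = new AcceptAll();
        System.out.println(rule.decisionFor("http://example.com/")); // ACCEPT
        rule.setEnabled(false);
        System.out.println(rule.decisionFor("http://example.com/")); // NONE
        System.out.println(rule.accepts("http://example.com/"));     // false
    }
}
```

Note that once the rule is disabled, accepts returns false even though the rule would otherwise accept everything, because NONE is not ACCEPT.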


Next, let's look at DecideResult, which keeps showing up in DecideRule:

package org.archive.modules.deciderules;

/**
 * The decision of a DecideRule.
 *
 * @author pjack
 */
public enum DecideResult {

    /** Indicates the URI was accepted. */
    ACCEPT,

    /** Indicates the URI was neither accepted nor rejected. */
    NONE,

    /** Indicates the URI was rejected. */
    REJECT;

    /**
     * Inverts the result: ACCEPT becomes REJECT and REJECT becomes ACCEPT;
     * NONE is returned unchanged.
     * @param result
     * @return
     */
    public static DecideResult invert(DecideResult result) {
        switch (result) {
            case ACCEPT:
                return REJECT;
            case REJECT:
                return ACCEPT;
            default:
                return result;
        }
    }
}
Its purpose is clear at a glance, so I won't say more about it.

Next, let's pick two concrete subclasses of DecideRule. First, RejectDecideRule, which is the first concrete rule in the configuration:

package org.archive.modules.deciderules;

import org.archive.modules.CrawlURI;

/**
 * This class returns DecideResult.REJECT for every URI.
 */
public class RejectDecideRule extends DecideRule {

    private static final long serialVersionUID = 3L;

    @Override
    protected DecideResult innerDecide(CrawlURI uri) {
        return DecideResult.REJECT;
    }

    @Override
    public DecideResult onlyDecision(CrawlURI uri) {
        return DecideResult.REJECT;
    }
}

This rule overrides DecideRule's innerDecide and onlyDecision methods. Its short code makes it obvious that it returns REJECT for every URI.

Now let's look at TooManyHopsDecideRule:

package org.archive.modules.deciderules;

import org.archive.modules.CrawlURI;

/**
 * Rule REJECTs any CrawlURIs whose total number of hops (length of the
 * hopsPath string, traversed links of any type) is over a threshold.
 * Otherwise returns PASS.
 *
 * In other words: the rule rejects every CrawlURI whose hop count (depth)
 * is greater than the threshold, and neither accepts nor rejects any other
 * CrawlURI.
 *
 * @author gojomo
 */
public class TooManyHopsDecideRule extends PredicatedDecideRule {

    private static final long serialVersionUID = 3L;

    /** default for this class is to REJECT */
    {
        setDecision(DecideResult.REJECT);
    }

    /**
     * Max path depth for which this filter will match.
     * The initializer below sets the default maximum depth.
     */
    {
        setMaxHops(20);
    }
    public int getMaxHops() {
        return (Integer) kp.get("maxHops");
    }
    public void setMaxHops(int maxHops) {
        kp.put("maxHops", maxHops);
    }

    /**
     * Usual constructor.
     */
    public TooManyHopsDecideRule() {
    }

    /**
     * Evaluate whether given object is over the threshold number of
     * hops.
     *
     * That is, checks whether the given CrawlURI exceeds the configured
     * maximum depth.
     *
     * @param object
     * @return true if the mx-hops is exceeded
     */
    @Override
    protected boolean evaluate(CrawlURI uri) {
        return uri.getHopCount() > getMaxHops();
    }
}
To explain this class, we also need to look at the code of its direct parent class:

package org.archive.modules.deciderules;

import org.archive.modules.CrawlURI;

/**
 * Rule which applies the configured decision only if a
 * test evaluates to true. Subclasses override evaluate()
 * to establish the test.
 *
 * In other words: the configured decision is applied only when evaluate
 * returns true. Subclasses must override evaluate.
 *
 * @author gojomo
 */
public abstract class PredicatedDecideRule extends DecideRule {

    {
        setDecision(DecideResult.ACCEPT);
    }
    public DecideResult getDecision() {
        return (DecideResult) kp.get("decision");
    }
    public void setDecision(DecideResult decision) {
        kp.put("decision", decision);
    }

    public PredicatedDecideRule() {
    }

    @Override
    protected DecideResult innerDecide(CrawlURI uri) {
        if (evaluate(uri)) {
            return getDecision();
        }
        return DecideResult.NONE;
    }

    protected abstract boolean evaluate(CrawlURI object);
}
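PredicatedDecideRule overrides DecideRule's innerDecide, and innerDecide in turn delegates the test to evaluate, which is overridden in TooManyHopsDecideRule. When evaluate returns true, innerDecide returns the configured decision; otherwise it returns NONE. TooManyHopsDecideRule sets that decision to REJECT, so it returns REJECT when a URI's hop count is greater than the configured maximum (20 by default) and NONE for every other URI.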
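The predicate-then-decision flow can be checked with a small standalone sketch. It is only an approximation of the two classes above: `HopLimitSketch`, `Decision`, and `decide` are names made up for this example, and the hop count is passed in as a plain `int` instead of being read from a `CrawlURI`.

```java
// Simplified stand-in for illustration only; NOT the Heritrix classes.
final class HopLimitSketch {
    enum Decision { ACCEPT, NONE, REJECT }

    /**
     * Same two-step shape as PredicatedDecideRule.innerDecide: apply the
     * configured decision only when the predicate holds, otherwise abstain.
     */
    static Decision decide(int hopCount, int maxHops, Decision configured) {
        boolean overLimit = hopCount > maxHops; // plays the role of evaluate()
        return overLimit ? configured : Decision.NONE;
    }

    public static void main(String[] args) {
        int maxHops = 20; // the default set in TooManyHopsDecideRule
        System.out.println(decide(3, maxHops, Decision.REJECT));  // NONE
        System.out.println(decide(20, maxHops, Decision.REJECT)); // NONE
        System.out.println(decide(21, maxHops, Decision.REJECT)); // REJECT
    }
}
```

A URI exactly at the limit is not rejected, because the comparison is strictly greater-than.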

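Reading these classes shows that the method a rule uses to reach its result is innerDecide. When we need to write our own URI rule class, we only need to extend DecideRule and override innerDecide.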
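As a rough idea of what such a custom rule might look like, here is a sketch built on simplified stand-in types so that it runs without Heritrix. `Rule`, `Decision`, and `RejectSuffixRule` are hypothetical names; a real rule would extend `DecideRule`, take a `CrawlURI`, and return a `DecideResult`.

```java
import java.util.Locale;

// Simplified stand-ins for illustration only; NOT the Heritrix classes.
final class CustomRuleSketch {
    enum Decision { ACCEPT, NONE, REJECT }

    /** Stand-in for DecideRule, reduced to the single method we override. */
    abstract static class Rule {
        public Decision decisionFor(String uri) {
            return innerDecide(uri);
        }

        protected abstract Decision innerDecide(String uri);
    }

    /** Hypothetical custom rule: reject URIs that end with a given suffix. */
    static class RejectSuffixRule extends Rule {
        private final String suffix;

        RejectSuffixRule(String suffix) {
            this.suffix = suffix.toLowerCase(Locale.ROOT);
        }

        @Override
        protected Decision innerDecide(String uri) {
            // Abstain (NONE) on anything this rule has no opinion about,
            // so that other rules in a sequence can still decide.
            return uri.toLowerCase(Locale.ROOT).endsWith(suffix)
                    ? Decision.REJECT
                    : Decision.NONE;
        }
    }

    public static void main(String[] args) {
        Rule rule = new RejectSuffixRule(".zip");
        System.out.println(rule.decisionFor("http://example.com/a.ZIP"));  // REJECT
        System.out.println(rule.decisionFor("http://example.com/a.html")); // NONE
    }
}
```

Returning NONE instead of ACCEPT for non-matching URIs is a deliberate choice in this sketch: it leaves the final say to whatever other rules are configured.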

Next, let's see how the multiple rules in a sequence work together.

(2)

Let's look at the DecideRuleSequence class:

package org.archive.modules.deciderules;

import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.archive.modules.CrawlURI;
import org.archive.modules.SimpleFileLoggerProvider;
import org.archive.modules.net.CrawlHost;
import org.archive.modules.net.ServerCache;
import org.json.JSONObject;
import org.springframework.beans.factory.BeanNameAware;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.context.Lifecycle;

public class DecideRuleSequence extends DecideRule implements BeanNameAware, Lifecycle {
    final private static Logger LOGGER =
            Logger.getLogger(DecideRuleSequence.class.getName());
    private static final long serialVersionUID = 3L;

    protected transient Logger fileLogger = null;

    /**
     * If enabled, log decisions to file named logs/{spring-bean-id}.log. Format
     * is: [timestamp] [decisive-rule-num] [decisive-rule-class] [decision]
     * [uri] [extraInfo]
     *
     * Relies on Spring Lifecycle to initialize the log. Only top-level
     * beans get the Lifecycle treatment from Spring, so bean must be top-level
     * for logToFile to work. (This is true of other modules that support
     * logToFile, and anything else that uses Lifecycle, as well.)
     */
    {
        setLogToFile(false);
    }
    public boolean getLogToFile() {
        return (Boolean) kp.get("logToFile");
    }
    public void setLogToFile(boolean enabled) {
        kp.put("logToFile", enabled);
    }

    /**
     * Whether to include the "extra info" field for each entry in crawl.log.
     * "Extra info" is a json object with entries "host", "via", "source" and
     * "hopPath".
     */
    protected boolean logExtraInfo = false;
    public boolean getLogExtraInfo() {
        return logExtraInfo;
    }
    public void setLogExtraInfo(boolean logExtraInfo) {
        this.logExtraInfo = logExtraInfo;
    }

    // provided by CrawlerLoggerModule which is in heritrix-engine, inaccessible
    // from here, thus the need for the SimpleFileLoggerProvider interface
    protected SimpleFileLoggerProvider loggerModule;
    public SimpleFileLoggerProvider getLoggerModule() {
        return this.loggerModule;
    }
    @Autowired
    public void setLoggerModule(SimpleFileLoggerProvider loggerModule) {
        this.loggerModule = loggerModule;
    }

    @SuppressWarnings("unchecked")
    public List<DecideRule> getRules() {
        return (List<DecideRule>) kp.get("rules");
    }
    /**
     * The list of rules is injected here.
     * @param rules
     */
    public void setRules(List<DecideRule> rules) {
        kp.put("rules", rules);
    }

    protected ServerCache serverCache;
    public ServerCache getServerCache() {
        return this.serverCache;
    }
    @Autowired
    public void setServerCache(ServerCache serverCache) {
        this.serverCache = serverCache;
    }

    /**
     * The method that actually makes the decision.
     * As this method shows, a decision other than DecideResult.NONE from a
     * rule later in the chain overrides the decision of an earlier rule.
     */
    public DecideResult innerDecide(CrawlURI uri) {
        // The rule that actually determined the result
        DecideRule decisiveRule = null;
        int decisiveRuleNumber = -1;
        // By default, neither reject nor accept
        DecideResult result = DecideResult.NONE;
        List<DecideRule> rules = getRules();
        int max = rules.size();

        for (int i = 0; i < max; i++) {
            DecideRule rule = rules.get(i);
            if (rule.onlyDecision(uri) != result) {
                DecideResult r = rule.decisionFor(uri);
                if (LOGGER.isLoggable(Level.FINEST)) {
                    LOGGER.finest("DecideRule #" + i + " " +
                            rule.getClass().getName() + " returned " + r + " for url: " + uri);
                }
                if (r != DecideResult.NONE) {
                    result = r;
                    decisiveRule = rule;
                    decisiveRuleNumber = i;
                }
            }
        }

        decisionMade(uri, decisiveRule, decisiveRuleNumber, result);

        return result;
    }

    /**
     * Called after the decision for a CrawlURI has been made.
     * @param uri
     * @param decisiveRule
     * @param decisiveRuleNumber
     * @param result
     */
    protected void decisionMade(CrawlURI uri, DecideRule decisiveRule,
            int decisiveRuleNumber, DecideResult result) {
        if (fileLogger != null) {
            JSONObject extraInfo = null;
            if (logExtraInfo) {
                CrawlHost crawlHost = getServerCache().getHostFor(uri.getUURI());
                String host = "-";
                if (crawlHost != null) {
                    host = crawlHost.fixUpName();
                }

                extraInfo = new JSONObject();
                extraInfo.put("hopPath", uri.getPathFromSeed());
                extraInfo.put("via", uri.getVia());
                extraInfo.put("seed", uri.getSourceTag());
                extraInfo.put("host", host);
            }

            fileLogger.info(decisiveRuleNumber
                    + " " + decisiveRule.getClass().getSimpleName()
                    + " " + result
                    + " " + uri
                    + (extraInfo != null ? " " + extraInfo : ""));
        }
    }

    protected String beanName;
    public String getBeanName() {
        return this.beanName;
    }
    @Override
    public void setBeanName(String name) {
        this.beanName = name;
    }

    protected boolean isRunning = false;
    @Override
    public boolean isRunning() {
        return isRunning;
    }
    @Override
    public void start() {
        // Set up the file logger
        if (getLogToFile() && fileLogger == null) {
            fileLogger = loggerModule.setupSimpleLog(getBeanName());
        }
        isRunning = true;
    }
    @Override
    public void stop() {
        isRunning = false;
    }
}
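This class is also a subclass of DecideRule, and it overrides innerDecide. The implementation shows that whenever a later rule returns a result other than NONE, the new result overwrites the old one. A rule whose onlyDecision equals the current result is skipped, since running it could not change the outcome. Now we finally understand this comment in the configuration file: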

<!-- SCOPE: rules for which discovered URIs to crawl; order is very  
      important because last decision returned other than 'NONE' wins. -->


So, when we customize our own URI filtering rules, we need to control not only the behavior of innerDecide but also the order of the rules.
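To see why order matters, the "last decision other than NONE wins" loop can be reduced to a few lines. This is a sketch under simplifying assumptions: rules are modeled as `Function<String, Decision>` lambdas, `RuleSequenceSketch` and `Decision` are names invented here, and the onlyDecision shortcut and logging of DecideRuleSequence are left out.

```java
import java.util.List;
import java.util.function.Function;

// Simplified stand-in for illustration only; NOT the Heritrix classes.
final class RuleSequenceSketch {
    enum Decision { ACCEPT, NONE, REJECT }

    /** Last decision other than NONE wins, as in DecideRuleSequence.innerDecide. */
    static Decision decide(List<Function<String, Decision>> rules, String uri) {
        Decision result = Decision.NONE;
        for (Function<String, Decision> rule : rules) {
            Decision r = rule.apply(uri);
            if (r != Decision.NONE) {
                result = r; // a later rule overrides an earlier one
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Two hypothetical rules: reject everything, accept one host.
        Function<String, Decision> rejectAll = uri -> Decision.REJECT;
        Function<String, Decision> acceptExample =
                uri -> uri.contains("example.com") ? Decision.ACCEPT : Decision.NONE;

        String uri = "http://example.com/page";
        System.out.println(decide(List.of(rejectAll, acceptExample), uri)); // ACCEPT
        System.out.println(decide(List.of(acceptExample, rejectAll), uri)); // REJECT
        System.out.println(
                decide(List.of(rejectAll, acceptExample), "http://other.org/")); // REJECT
    }
}
```

The same two rules give opposite answers for the same URI depending on their order, which is why a reject-everything rule makes sense at the start of a sequence and not at the end.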

(Since each class is short and simple, I have pasted all of the code here.)
