Source code: a first read of injectedScore()


The InjectMapper inside the Injector class contains this statement:

try {
  scfilters.injectedScore(value, datum);
} catch (ScoringFilterException e) {
  if (LOG.isWarnEnabled()) {
    LOG.warn("Cannot filter injected score for url " + url
        + ", using default (" + e.getMessage() + ")");
  }
}

It calls a method on an instance of the ScoringFilters class: public void injectedScore(Text url, CrawlDatum datum) throws ScoringFilterException

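Its job: using the score as the metric, compute (initialize) the score of every seed url that has made it through substitution (normalization) and filtering.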


Next, the ScoringFilters class. First of all it implements the ScoringFilter interface, and it holds a field private ScoringFilter[] filters;

Now look at the source of the public void injectedScore(Text url, CrawlDatum datum) throws ScoringFilterException method:

 /** Calculate a new initial score, used when injecting new pages. */
  public void injectedScore(Text url, CrawlDatum datum) throws ScoringFilterException {
    for (int i = 0; i < this.filters.length; i++) {
      this.filters[i].injectedScore(url, datum);
    }
  }

So ScoringFilters does no scoring of its own in injectedScore: the work is done by calling injectedScore() on each filter in the private ScoringFilter[] filters; field, one after another.
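The delegation can be sketched in isolation. This is a minimal illustration, not Nutch code: `Filter`, `CompositeFilter` and the `float[]` score holder are hypothetical stand-ins for ScoringFilter, ScoringFilters and CrawlDatum, and the checked ScoringFilterException is left out to keep it short.

```java
// Hypothetical stand-in for ScoringFilter: one hook that may adjust a score.
interface Filter {
    void injectedScore(String url, float[] score);
}

// Hypothetical stand-in for ScoringFilters: it implements the same interface
// and forwards each call to every wrapped filter, in array order.
class CompositeFilter implements Filter {
    private final Filter[] filters;

    CompositeFilter(Filter... filters) {
        this.filters = filters;
    }

    @Override
    public void injectedScore(String url, float[] score) {
        for (int i = 0; i < this.filters.length; i++) {
            this.filters[i].injectedScore(url, score);
        }
    }
}

class CompositeDemo {
    // Runs two toy filters over an initial score of 1.0 and returns the result.
    static float run() {
        Filter doubler = (url, s) -> s[0] *= 2.0f;
        Filter plusOne = (url, s) -> s[0] += 1.0f;
        float[] score = {1.0f};
        new CompositeFilter(doubler, plusOne).injectedScore("http://example.com/", score);
        return score[0];
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints 3.0: (1.0 * 2) + 1
    }
}
```

Swapping the two toy filters would give 4.0 instead, which is why the order of the filters array matters as soon as more than one filter is active.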

That makes it well worth knowing how the contents of private ScoringFilter[] filters are decided.

Note the ScoringFilters constructor, public ScoringFilters(Configuration conf).

Here is the source of public ScoringFilters(Configuration conf). (Note: I added some test code for the experiment.)

public ScoringFilters(Configuration conf) {
    super(conf);
    ObjectCache objectCache = ObjectCache.get(conf);

    // Read the scoring.filter.order option from the configuration
    // (conf/nutch-default.xml or conf/nutch-site.xml) to get the filter order
    String order = conf.get("scoring.filter.order");

    // Check the cache for an already ordered filters array; this pays off from the second use onward
    this.filters = (ScoringFilter[]) objectCache.getObject(ScoringFilter.class.getName());


    if (this.filters == null) { // first use of ScoringFilters: nothing cached yet

      // orderedFilters holds the class names from scoring.filter.order, in the configured order
      String[] orderedFilters = null;
      if (order != null && !order.trim().equals("")) {
        orderedFilters = order.split("\\s+");
      }


      try {
      //test by kaiwii
      System.out.println("try block");
        ExtensionPoint point = PluginRepository.get(conf).getExtensionPoint(ScoringFilter.X_POINT_ID);
        if (point == null) throw new RuntimeException(ScoringFilter.X_POINT_ID + " not found.");
        Extension[] extensions = point.getExtensions();
        HashMap<String, ScoringFilter> filterMap =
          new HashMap<String, ScoringFilter>();
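        // The extension point looked up via ScoringFilter.X_POINT_ID gives access to every
        // ScoringFilter implementation. The mapping between implementations and the extension
        // point is built by walking each plugin's plugin.xml (see the PluginRepository class).
        for (int i = 0; i < extensions.length; i++) {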

          Extension extension = extensions[i];
          //for test by kaiwii
          System.out.println("extension_id "+i+extension.getId());
          System.out.println("extension_clazz "+i+extension.getClazz());
          // Plugins are loaded lazily, so the extension obtained above only carries
          // descriptor information about its plugin.
          // The filter is actually instantiated here.
          ScoringFilter filter = (ScoringFilter) extension.getExtensionInstance();
          if (!filterMap.containsKey(filter.getClass().getName())) { // collect every filter instance in a map keyed by class name
            filterMap.put(filter.getClass().getName(), filter);
          }
        }
        if (orderedFilters == null) { // scoring.filter.order is empty in the configuration
          objectCache.setObject(ScoringFilter.class.getName(), filterMap.values().toArray(new ScoringFilter[0]));
        } else {
          ScoringFilter[] filter = new ScoringFilter[orderedFilters.length];
          for (int i = 0; i < orderedFilters.length; i++) { // copy the filters into a temporary array in the order given by orderedFilters
            filter[i] = filterMap.get(orderedFilters[i]);
          }
          objectCache.setObject(ScoringFilter.class.getName(), filter);
        }
      } catch (PluginRuntimeException e) {
        throw new RuntimeException(e);
      }

      // This is simply the array that was stored in the cache just above
      this.filters = (ScoringFilter[]) objectCache.getObject(ScoringFilter.class.getName());
      //for test by kaiwii
      System.out.println("filters content:kaiwii want to know:");
      for(ScoringFilter v:this.filters){
      System.out.println("filter:"+v.toString());
      }
    }
  }

By now it should be roughly clear which ScoringFilter implementations end up in the array.
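The ordering step is easier to see without the plugin machinery around it. Below is a minimal sketch, assuming plain strings as placeholders for filter instances; `OrderSketch`, `resolve`, `sample` and the `example.*` class names are hypothetical. The constructor uses a HashMap, whose iteration order is not guaranteed; the sketch uses a LinkedHashMap only so that the empty-order case gives a predictable result.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

class OrderSketch {
    // Same decision as in the constructor: a null or blank order means
    // "use every available filter"; otherwise split on whitespace and
    // look up each configured class name.
    static String[] resolve(String order, Map<String, String> available) {
        if (order == null || order.trim().equals("")) {
            return available.values().toArray(new String[0]);
        }
        String[] names = order.split("\\s+");
        String[] result = new String[names.length];
        for (int i = 0; i < names.length; i++) {
            // As in the listing, a name with no matching filter yields null.
            result[i] = available.get(names[i]);
        }
        return result;
    }

    // Hypothetical class names mapped to placeholder "instances".
    static Map<String, String> sample() {
        Map<String, String> available = new LinkedHashMap<>();
        available.put("example.FilterA", "A");
        available.put("example.FilterB", "B");
        return available;
    }

    public static void main(String[] args) {
        Map<String, String> available = sample();
        System.out.println(Arrays.toString(resolve("example.FilterB example.FilterA", available))); // [B, A]
        System.out.println(Arrays.toString(resolve("", available)));                                // [A, B]
        System.out.println(Arrays.toString(resolve("example.Missing", available)));                 // [null]
    }
}
```

The last case is worth noticing: in the listing above, a class name in scoring.filter.order that matches no loaded filter leaves a null slot in the array, and the loop in injectedScore() would then fail on it with a NullPointerException.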

One more thing to look at is the scoring.filter.order option in the configuration (conf/nutch-default.xml or conf/nutch-site.xml):

<!-- scoring filters properties -->


<property>
  <name>scoring.filter.order</name>
  <value></value>
  <description>The order in which scoring filters are applied.
  This may be left empty (in which case all available scoring
  filters will be applied in the order defined in plugin-includes
  and plugin-excludes), or a space separated list of implementation
  classes.
  </description>
</property>

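From the description above you can tell what happens when the option is empty: which ScoringFilter implementations get called is decided by plugin.includes and plugin.excludes. As for the order, the description says it follows those two options, but note that the constructor above collects the filters in a HashMap, so with several scoring plugins enabled the order of filterMap.values() is not something I would rely on.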

Let's take the default settings as an example. First, scoring.filter.order is empty.

Next, plugin.includes is set as follows (the relevant entry is scoring-opic):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

Finally, plugin.excludes is empty.
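To see why only scoring-opic qualifies under these defaults, here is a small sketch. It assumes that a plugin id is enabled when it matches the plugin.includes regular expression as a whole and does not match plugin.excludes; that is my understanding of PluginRepository, but it is not shown in this post. `PluginIncludeSketch` and `isEnabled` are hypothetical names, and scoring-link is used only as an example of an id that is not listed.

```java
import java.util.regex.Pattern;

class PluginIncludeSketch {
    // The default plugin.includes value quoted above.
    static final String INCLUDES =
        "protocol-http|urlfilter-regex|parse-(text|html|js|tika)|index-(basic|anchor)"
        + "|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic"
        + "|urlnormalizer-(pass|regex|basic)";

    // plugin.excludes is empty in the default configuration.
    static final String EXCLUDES = "";

    // Assumption: whole-string match against includes, and no match against excludes.
    static boolean isEnabled(String pluginId) {
        boolean included = Pattern.compile(INCLUDES).matcher(pluginId).matches();
        boolean excluded = Pattern.compile(EXCLUDES).matcher(pluginId).matches();
        return included && !excluded;
    }

    public static void main(String[] args) {
        System.out.println("scoring-opic: " + isEnabled("scoring-opic")); // true
        System.out.println("scoring-link: " + isEnabled("scoring-link")); // false
        System.out.println("parse-html:   " + isEnabled("parse-html"));   // true
    }
}
```

Under that assumption, adding another scoring plugin means extending the scoring-opic entry in plugin.includes, for example to scoring-(opic|link).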

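With the modified test code in place, here is what the console shows when running the Injector class (the relevant lines are the ones mentioning OPICScoringFilter):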

Injector: starting at 2011-09-04 09:11:13
Injector: crawlDb: crawldb
Injector: urlDir: urls.txt
Injector: Converting injected urls to crawl db entries.
try block
extension_id 0org.apache.nutch.scoring.opic.OPICScoringFilter
extension_clazz 0org.apache.nutch.scoring.opic.OPICScoringFilter
filters content:kaiwii want to know:
filter:org.apache.nutch.scoring.opic.OPICScoringFilter@12a3793

using method:regexNormalize()
kaiwii want to know confiFilenull
use global rules
Injector: Merging injected urls into crawl db.
Injector: finished at 2011-09-04 09:11:19, elapsed: 00:00:05

The console output above shows that only org.apache.nutch.scoring.opic.OPICScoringFilter was called.

So, to use particular ScoringFilter implementations, follow the description of scoring.filter.order:

The order in which scoring filters are applied.
  This may be left empty (in which case all available scoring
  filters will be applied in the order defined in plugin-includes
  and plugin-excludes), or a space separated list of implementation
  classes.

Summary: get familiar with the plugin mechanism!