Understanding how FetchTime is set in Nutch's CrawlDatum


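I misread this yesterday. In fact, for a URL that was fetched successfully, the update() phase sets the FetchTime to FetchTime + FetchInterval. From that point on, FetchTime no longer records when the page was successfully fetched; it is the time of the next fetch. If you try to crawl the URL again before the new FetchTime is reached, the program filters the URL out.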
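The filtering described above can be sketched in a few lines. This is only a simplified model, not Nutch's code: the class `DueUrlFilterSketch`, its methods, and the in-memory map standing in for the crawl database are all hypothetical names made up for illustration. The only assumption taken from the text is that a URL is skipped while the current time is still earlier than its stored FetchTime.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of the filtering described above; these are not Nutch classes.
class DueUrlFilterSketch {

    /** A URL is due once the current time has reached its stored next-fetch time. */
    static boolean isDue(long nextFetchTimeMs, long nowMs) {
        return nextFetchTimeMs <= nowMs;
    }

    /** Returns the URLs whose stored next-fetch time is not in the future. */
    static List<String> selectDue(Map<String, Long> nextFetchTimes, long nowMs) {
        List<String> due = new ArrayList<>();
        for (Map.Entry<String, Long> e : nextFetchTimes.entrySet()) {
            if (isDue(e.getValue(), nowMs)) {
                due.add(e.getKey());
            }
        }
        return due;
    }

    public static void main(String[] args) {
        // Hypothetical crawl database: URL -> next fetch time in milliseconds.
        Map<String, Long> crawlDb = new LinkedHashMap<>();
        crawlDb.put("http://a.example/", 1_000L);
        crawlDb.put("http://b.example/", 9_000L);

        // At time 5000 only the first URL has reached its next fetch time.
        System.out.println(selectDue(crawlDb, 5_000L)); // prints [http://a.example/]
    }
}
```

The second URL stays in the map but is not selected, which mirrors what happens to a freshly fetched page: it remains in the database and simply is not picked up again until its FetchTime has passed.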

In the reduce function of CrawlDbReducer:

    case CrawlDatum.STATUS_FETCH_SUCCESS:         // successful fetch
    case CrawlDatum.STATUS_FETCH_REDIR_TEMP:      // successful fetch, redirected
    case CrawlDatum.STATUS_FETCH_REDIR_PERM:
    case CrawlDatum.STATUS_FETCH_NOTMODIFIED:     // successful fetch, notmodified
      // determine the modification status
      int modified = FetchSchedule.STATUS_UNKNOWN;
      if (fetch.getStatus() == CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
        modified = FetchSchedule.STATUS_NOTMODIFIED;
      } else {
        if (oldSet && old.getSignature() != null && signature != null) {
          if (SignatureComparator._compare(old.getSignature(), signature) != 0) {
            modified = FetchSchedule.STATUS_MODIFIED;
          } else {
            modified = FetchSchedule.STATUS_NOTMODIFIED;
          }
        }
      }
      // set the schedule
      System.err.println("1:result.fetchtime="+result.getFetchTime());
      result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,
          prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(), modified);
      // set the result status and signature
      System.err.println("2:result.fetchtime="+result.getFetchTime());
      if (modified == FetchSchedule.STATUS_NOTMODIFIED) {
        result.setStatus(CrawlDatum.STATUS_DB_NOTMODIFIED);
        if (oldSet) result.setSignature(old.getSignature());
      } else {
        switch (fetch.getStatus()) {
        case CrawlDatum.STATUS_FETCH_SUCCESS:
          result.setStatus(CrawlDatum.STATUS_DB_FETCHED);
          break;
        case CrawlDatum.STATUS_FETCH_REDIR_PERM:
          result.setStatus(CrawlDatum.STATUS_DB_REDIR_PERM);
          break;
        case CrawlDatum.STATUS_FETCH_REDIR_TEMP:
          result.setStatus(CrawlDatum.STATUS_DB_REDIR_TEMP);
          break;
        default:
          LOG.warn("Unexpected status: " + fetch.getStatus() + " resetting to old status.");
          if (oldSet) result.setStatus(old.getStatus());
          else result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
        }
        result.setSignature(signature);
        if (metaFromParse != null) {
          for (Entry<Writable, Writable> e : metaFromParse.entrySet()) {
            result.getMetaData().put(e.getKey(), e.getValue());
          }
        }
      }
      // if fetchInterval is larger than the system-wide maximum, trigger
      // an unconditional recrawl. This prevents the page to be stuck at
      // NOTMODIFIED state, when the old fetched copy was already removed with
      // old segments.
      if (maxInterval < result.getFetchInterval())
        result = schedule.forceRefetch((Text)key, result, false);
      break;

Tracing the printed FetchTime of result shows that the value changes after the call to schedule.setFetchSchedule(), so it is certainly this function that changes the FetchTime stored in the URL's state object, the CrawlDatum.

The FetchSchedule implementation that CrawlDbReducer calls here is the DefaultFetchSchedule class. Its source code:

    public class DefaultFetchSchedule extends AbstractFetchSchedule {

      @Override
      public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
              long prevFetchTime, long prevModifiedTime,
              long fetchTime, long modifiedTime, int state) {
        datum = super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
            fetchTime, modifiedTime, state);
        // fall back to the default interval if none is set
        if (datum.getFetchInterval() == 0 ) {
          datum.setFetchInterval(defaultInterval);
        }
        // next fetch time = this fetch time (ms) + interval (s) * 1000
        datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
        datum.setModifiedTime(modifiedTime);
        return datum;
      }
    }

As you can see, the class has only one method, setFetchSchedule(), and it ends by setting the datum's FetchTime with datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
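To make the arithmetic in that last expression concrete, here is a small standalone sketch. It assumes, as the `* 1000` and the `(long)` cast suggest, that the fetch time is in milliseconds and the interval is an int number of seconds. The class and method names are hypothetical, and the timestamp and the 30-day interval (2592000 seconds) are illustrative values only. The second method drops the cast to show why it matters: the int multiplication overflows for intervals longer than about 24 days.

```java
// Standalone illustration of fetchTime + (long) interval * 1000; not Nutch code.
class FetchTimeArithmeticSketch {

    /** Same expression as in setFetchSchedule(): the interval is widened to long first. */
    static long nextFetchTime(long fetchTimeMs, int fetchIntervalSec) {
        return fetchTimeMs + (long) fetchIntervalSec * 1000;
    }

    /** Deliberately wrong variant: the multiplication is done in int and can overflow. */
    static long nextFetchTimeWithoutCast(long fetchTimeMs, int fetchIntervalSec) {
        return fetchTimeMs + fetchIntervalSec * 1000;
    }

    public static void main(String[] args) {
        long fetchTimeMs = 1_400_000_000_000L; // hypothetical fetch timestamp (ms)
        int intervalSec = 2_592_000;           // 30 days in seconds, illustrative

        // 2,592,000 s * 1000 = 2,592,000,000 ms, which does not fit in an int.
        System.out.println(nextFetchTime(fetchTimeMs, intervalSec));            // prints 1402592000000
        System.out.println(nextFetchTimeWithoutCast(fetchTimeMs, intervalSec)); // prints 1398297032704
    }
}
```

With the cast, the next fetch time lands 30 days after the fetch. Without it, the result is earlier than the fetch itself, so the URL would look due again immediately.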