How FetchTime is set on CrawlDatum in Nutch
Source: Internet  Editor: 程序博客网  Time: 2024/05/14 03:30
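I misread this yesterday. In fact, for a URL that was fetched successfully, the update() phase sets the next FetchTime to FetchTime + FetchInterval. From that point on, FetchTime no longer records when the page was successfully fetched; it is the time at which the next fetch is due. If you try to crawl the URL again before the new FetchTime is reached, the program filters the URL out.
The relevant branch of the reduce function in CrawlDbReducer: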
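The rule described above can be sketched in a few lines. This is a standalone illustration, not Nutch code: the class `FetchDueSketch` and its methods `nextFetchTime` and `isDue` are hypothetical names, and the "not yet due, so skip" comparison is an assumption about how the filtering behaves, based on the description in this post.

```java
// Standalone sketch; FetchDueSketch and its methods are hypothetical, not Nutch classes.
class FetchDueSketch {

    /** Next fetch time in ms: the time of the successful fetch plus the interval (in seconds). */
    static long nextFetchTime(long fetchedAtMs, int intervalSeconds) {
        return fetchedAtMs + (long) intervalSeconds * 1000L;
    }

    /** Assumed filter rule: a URL is selected only when its stored fetch time is not in the future. */
    static boolean isDue(long storedFetchTimeMs, long nowMs) {
        return storedFetchTimeMs <= nowMs;
    }

    public static void main(String[] args) {
        long fetchedAt = 1_000_000L;                // time of the successful fetch (ms)
        long next = nextFetchTime(fetchedAt, 60);   // 60-second interval, for illustration only

        // 30 seconds after the fetch: the stored time is still in the future, so the URL is skipped.
        System.out.println(isDue(next, fetchedAt + 30_000L));
        // 60 seconds after the fetch: the stored time has been reached, so the URL is eligible.
        System.out.println(isDue(next, fetchedAt + 60_000L));
    }
}
```

The point of the sketch is that after update() the stored value is a due date, so a crawl attempted too early sees a time in the future and leaves the URL alone.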
    case CrawlDatum.STATUS_FETCH_SUCCESS:       // successful fetch
    case CrawlDatum.STATUS_FETCH_REDIR_TEMP:    // successful fetch, redirected
    case CrawlDatum.STATUS_FETCH_REDIR_PERM:
    case CrawlDatum.STATUS_FETCH_NOTMODIFIED:   // successful fetch, notmodified
      // determine the modification status
      int modified = FetchSchedule.STATUS_UNKNOWN;
      if (fetch.getStatus() == CrawlDatum.STATUS_FETCH_NOTMODIFIED) {
        modified = FetchSchedule.STATUS_NOTMODIFIED;
      } else {
        if (oldSet && old.getSignature() != null && signature != null) {
          if (SignatureComparator._compare(old.getSignature(), signature) != 0) {
            modified = FetchSchedule.STATUS_MODIFIED;
          } else {
            modified = FetchSchedule.STATUS_NOTMODIFIED;
          }
        }
      }
      // set the schedule
      // (debug output added for tracing: FetchTime before the call)
      System.err.println("1:result.fetchtime=" + result.getFetchTime());
      result = schedule.setFetchSchedule((Text)key, result, prevFetchTime,
          prevModifiedTime, fetch.getFetchTime(), fetch.getModifiedTime(), modified);
      // set the result status and signature
      // (debug output added for tracing: FetchTime after the call)
      System.err.println("2:result.fetchtime=" + result.getFetchTime());
      if (modified == FetchSchedule.STATUS_NOTMODIFIED) {
        result.setStatus(CrawlDatum.STATUS_DB_NOTMODIFIED);
        if (oldSet) result.setSignature(old.getSignature());
      } else {
        switch (fetch.getStatus()) {
        case CrawlDatum.STATUS_FETCH_SUCCESS:
          result.setStatus(CrawlDatum.STATUS_DB_FETCHED);
          break;
        case CrawlDatum.STATUS_FETCH_REDIR_PERM:
          result.setStatus(CrawlDatum.STATUS_DB_REDIR_PERM);
          break;
        case CrawlDatum.STATUS_FETCH_REDIR_TEMP:
          result.setStatus(CrawlDatum.STATUS_DB_REDIR_TEMP);
          break;
        default:
          LOG.warn("Unexpected status: " + fetch.getStatus() + " resetting to old status.");
          if (oldSet) result.setStatus(old.getStatus());
          else result.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
        }
        result.setSignature(signature);
        if (metaFromParse != null) {
          for (Entry<Writable, Writable> e : metaFromParse.entrySet()) {
            result.getMetaData().put(e.getKey(), e.getValue());
          }
        }
      }
      // if fetchInterval is larger than the system-wide maximum, trigger
      // an unconditional recrawl. This prevents the page to be stuck at
      // NOTMODIFIED state, when the old fetched copy was already removed with
      // old segments.
      if (maxInterval < result.getFetchInterval())
        result = schedule.forceRefetch((Text)key, result, false);
      break;
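Printing result's FetchTime before and after the call shows that the value changes once schedule.setFetchSchedule() has run, so it is this function that modifies the FetchTime held in CrawlDatum, the class that records the URL's state.
In CrawlDbReducer, the FetchSchedule implementation being called is DefaultFetchSchedule. Its source code: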
    public class DefaultFetchSchedule extends AbstractFetchSchedule {

      @Override
      public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
              long prevFetchTime, long prevModifiedTime,
              long fetchTime, long modifiedTime, int state) {
        datum = super.setFetchSchedule(url, datum, prevFetchTime, prevModifiedTime,
            fetchTime, modifiedTime, state);
        // fall back to the default interval when none is set
        if (datum.getFetchInterval() == 0) {
          datum.setFetchInterval(defaultInterval);
        }
        // next fetch time = this fetch time + interval (seconds converted to ms)
        datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
        datum.setModifiedTime(modifiedTime);
        return datum;
      }
    }
As you can see, the class has only one method, setFetchSchedule(), and it is this method that finally sets the datum's FetchTime, via datum.setFetchTime(fetchTime + (long)datum.getFetchInterval() * 1000);
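To make the arithmetic concrete, here is a small standalone sketch of that expression. Everything in it is an assumption for illustration: the class `IntervalCastSketch` is hypothetical, the interval is treated as an int number of seconds, and the 30-day value stands in for whatever default interval is actually configured. It also shows why the (long) cast matters once the interval is converted to milliseconds.

```java
// Standalone sketch; IntervalCastSketch is a hypothetical class, not part of Nutch.
class IntervalCastSketch {

    // Assumed interval for illustration: 30 days expressed in seconds (2,592,000).
    static final int INTERVAL_SECONDS = 30 * 24 * 60 * 60;

    /** Mirrors the expression in the post: widen to long before multiplying by 1000. */
    static long withCast(long fetchTimeMs, int intervalSeconds) {
        return fetchTimeMs + (long) intervalSeconds * 1000;
    }

    /** Same expression without the cast: the multiplication is done in int and wraps around. */
    static long withoutCast(long fetchTimeMs, int intervalSeconds) {
        return fetchTimeMs + intervalSeconds * 1000;
    }

    public static void main(String[] args) {
        long fetchTime = 0L; // use 0 so the result is just the interval in ms
        System.out.println("with cast:    " + withCast(fetchTime, INTERVAL_SECONDS));
        System.out.println("without cast: " + withoutCast(fetchTime, INTERVAL_SECONDS));
    }
}
```

With a 30-day interval the millisecond value exceeds Integer.MAX_VALUE, so the uncast version yields a negative offset and would place the next fetch time in the past.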