crawler4j Source Code Analysis (Part 2): Frontier


In this part we look at the design and implementation of the Frontier, crawler4j's URL management mechanism.

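Overall, crawler4j's Frontier provides the following features, most of which are basics that any URL queue should have:

(1) Fetching, storing, and deleting URLs.

(2) Counting different kinds of URLs, such as the total number of URLs put into the queue and the total number of URLs already processed.

(3) Recovering URLs that the previous crawl did not finish (URLs that had been taken out of the queue but not yet processed).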

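URLs are stored in Oracle's Berkeley DB Java Edition (JE). Internally JE is built on a B-tree and supports duplicate keys, transactions, and locking, so lookups and inserts are fast and it handles high concurrency, which means synchronization is not a concern. In crawler4j, besides the URL queue itself, the structure used for URL deduplication, the record of URLs currently being processed, and the counters are all implemented on top of JE.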

Let's start with the first and second features.

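All URLs waiting to be crawled are kept in a pending queue, WorkQueues. URLs in this queue are ordered by priority, then depth, then docID: the higher the priority (that is, the lower the priority number), the smaller the depth, and the smaller the docID, the earlier a URL is crawled. The priority can be assigned by the user according to some criterion, the depth is the level at which the URL was discovered during the crawl, and the docID reflects the order of discovery. It follows that once a new URL is inserted into the queue, its position is already determined. This ordering is maintained by JE itself: each URL is stored as bytes, and JE sorts the inserted URLs by the key built from the three values above.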

    /*
     * The key that is used for storing URLs determines the order
     * they are crawled. Lower key values results in earlier crawling.
     * Here our keys are 6 bytes. The first byte comes from the URL priority.
     * The second byte comes from depth of crawl at which this URL is first found.
     * The rest of the 4 bytes come from the docid of the URL. As a result,
     * URLs with lower priority numbers will be crawled earlier. If priority
     * numbers are the same, those found at lower depths will be crawled earlier.
     * If depth is also equal, those found earlier (therefore, smaller docid) will
     * be crawled earlier.
     */
    protected DatabaseEntry getDatabaseEntryKey(WebURL url) {
        byte[] keyData = new byte[6];
        keyData[0] = url.getPriority();
        keyData[1] = (url.getDepth() > Byte.MAX_VALUE ? Byte.MAX_VALUE : (byte) url.getDepth());
        Util.putIntInByteArray(url.getDocid(), keyData, 2);
        return new DatabaseEntry(keyData);
    }

    public void put(WebURL url) throws DatabaseException {
        DatabaseEntry value = new DatabaseEntry();
        webURLBinding.objectToEntry(url, value);
        Transaction txn;
        if (resumable) {
            txn = env.beginTransaction(null, null);
        } else {
            txn = null;
        }
        urlsDB.put(txn, getDatabaseEntryKey(url), value);
        if (resumable) {
            if (txn != null) {
                txn.commit();
            }
        }
    }
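To make the effect of the 6-byte key concrete, here is a small standalone sketch that does not use JE at all. It assumes that keys are compared byte by byte as unsigned values and that the docid is written big-endian; `KeyOrderDemo`, `key`, and `crawlOrder` are hypothetical names introduced only for this illustration, and a `TreeMap` stands in for the sorted database.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.Comparator;
import java.util.TreeMap;

class KeyOrderDemo {
    // Builds a 6-byte key: [priority][depth, capped at 127][docid, big-endian].
    // Assumption: this mirrors the layout described in the comment above.
    static byte[] key(byte priority, int depth, int docid) {
        ByteBuffer buf = ByteBuffer.allocate(6); // ByteBuffer is big-endian by default
        buf.put(priority);
        buf.put((byte) Math.min(depth, Byte.MAX_VALUE));
        buf.putInt(docid);
        return buf.array();
    }

    // Returns the labels in the order a byte-wise unsigned comparison yields.
    static String crawlOrder() {
        // Assumption: the store sorts keys as unsigned bytes, left to right.
        Comparator<byte[]> unsignedOrder = Arrays::compareUnsigned;
        TreeMap<byte[], String> queue = new TreeMap<>(unsignedOrder);
        queue.put(key((byte) 1, 0, 1), "A"); // larger priority number, found first
        queue.put(key((byte) 0, 3, 4), "B"); // priority 0, depth 3
        queue.put(key((byte) 0, 1, 9), "C"); // priority 0, depth 1, later docid
        queue.put(key((byte) 0, 1, 2), "D"); // priority 0, depth 1, earlier docid
        return String.join(",", queue.values());
    }

    public static void main(String[] args) {
        System.out.println(crawlOrder()); // prints D,C,B,A
    }
}
```

Note that under an unsigned comparison a negative priority byte such as -1 (0xFF) would sort after every non-negative one, so this sketch only uses non-negative priorities.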

Each time a new URL is inserted, the scheduled-pages counter is incremented by one.

    if (newScheduledPage > 0) {
        scheduledPages += newScheduledPage;
        counters.increment(Counters.ReservedCounterNames.SCHEDULED_PAGES, newScheduledPage);
    }

    if (maxPagesToFetch < 0 || scheduledPages < maxPagesToFetch) {
        workQueues.put(url);
        scheduledPages++;
        counters.increment(Counters.ReservedCounterNames.SCHEDULED_PAGES);
    }
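The interaction between the scheduled counter and the page limit can be tried out in isolation. The following is only a minimal in-memory sketch: `ScheduleLimitDemo` is a hypothetical class, URLs are plain strings, and an `ArrayDeque` replaces the JE-backed queue. The one thing it takes from the excerpt above is the condition that a negative `maxPagesToFetch` means no limit.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class ScheduleLimitDemo {
    private final Queue<String> workQueue = new ArrayDeque<>(); // stand-in for WorkQueues
    private final long maxPagesToFetch; // negative means "no limit"
    private long scheduledPages = 0;

    ScheduleLimitDemo(long maxPagesToFetch) {
        this.maxPagesToFetch = maxPagesToFetch;
    }

    // Enqueues the URL and bumps the counter unless the limit is already reached.
    // Returns true if the URL was accepted.
    synchronized boolean schedule(String url) {
        if (maxPagesToFetch < 0 || scheduledPages < maxPagesToFetch) {
            workQueue.add(url);
            scheduledPages++;
            return true;
        }
        return false;
    }

    synchronized long getScheduledPages() {
        return scheduledPages;
    }

    public static void main(String[] args) {
        ScheduleLimitDemo frontier = new ScheduleLimitDemo(2);
        System.out.println(frontier.schedule("http://example.com/a"));
        System.out.println(frontier.schedule("http://example.com/b"));
        System.out.println(frontier.schedule("http://example.com/c")); // rejected: limit is 2
        System.out.println(frontier.getScheduledPages());
    }
}
```

Because the counter is checked at scheduling time rather than at fetch time, URLs beyond the limit never enter the queue in this sketch.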

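When URLs are taken out of WorkQueues, they are first put into InProcessPagesDB, a database that records URLs that are being processed but have not yet finished. In addition, each time a URL finishes processing, the processed-pages counter is incremented by one: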

    List<WebURL> curResults = workQueues.get(max);
    workQueues.delete(curResults.size());
    if (inProcessPages != null) {
        for (WebURL curPage : curResults) {
            inProcessPages.put(curPage);
        }
    }

    public void setProcessed(WebURL webURL) {
        counters.increment(ReservedCounterNames.PROCESSED_PAGES);
        if (inProcessPages != null) {
            if (!inProcessPages.removeURL(webURL)) {
                logger.warn("Could not remove: " + webURL.getURL() + " from list of processed pages.");
            }
        }
    }
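The hand-off between the pending queue and the in-process set can be modeled with ordinary collections. This is a rough sketch under simplifying assumptions: `InProcessDemo` and its methods are hypothetical, URLs are strings, and nothing is persisted. It shows the two steps from the excerpts above: taking a batch moves URLs into the in-process set, and marking a URL as processed removes it and increments the counter.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class InProcessDemo {
    private final Deque<String> workQueue = new ArrayDeque<>(); // stand-in for WorkQueues
    private final Set<String> inProcess = new HashSet<>();      // stand-in for InProcessPagesDB
    private long processedPages = 0;

    synchronized void schedule(String url) {
        workQueue.addLast(url);
    }

    // Takes up to max URLs from the queue and records them as in process.
    synchronized List<String> getNextURLs(int max) {
        List<String> batch = new ArrayList<>();
        while (batch.size() < max && !workQueue.isEmpty()) {
            String url = workQueue.pollFirst();
            inProcess.add(url);
            batch.add(url);
        }
        return batch;
    }

    // Counts the URL as processed; returns false if it was not tracked as in process.
    synchronized boolean setProcessed(String url) {
        processedPages++;
        return inProcess.remove(url);
    }

    synchronized int inProcessCount() {
        return inProcess.size();
    }

    synchronized long getProcessedPages() {
        return processedPages;
    }

    public static void main(String[] args) {
        InProcessDemo frontier = new InProcessDemo();
        frontier.schedule("a");
        frontier.schedule("b");
        frontier.schedule("c");
        List<String> batch = frontier.getNextURLs(2);
        frontier.setProcessed(batch.get(0));
        // One URL is finished, one is still in process, one is still queued.
        System.out.println(frontier.inProcessCount() + " in process, "
                + frontier.getProcessedPages() + " processed");
    }
}
```

If the crawler stopped right after the `main` method above, the URL left in the in-process set is exactly the kind of entry a resumable crawl would need to pick up again.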

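Now consider resumable crawling: how are the URLs that the previous crawl left unfinished recovered? As mentioned above, every URL currently being processed is kept in InProcessPagesDB. This data structure extends WorkQueues and, like it, is stored in JE, so any URLs still unfinished when a crawl ends remain in this database. The recovery flow is shown in the figure below: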

The recovery flow above runs when the Frontier is created.
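The recovery step can be sketched roughly as follows. This is an illustration under stated assumptions, not crawler4j's actual code: `ResumeDemo` is a hypothetical class, two `TreeMap`s keyed by docid stand in for the persisted WorkQueues and InProcessPagesDB, and the sketch assumes that recovery means moving every leftover in-process URL back into the pending queue at construction time.

```java
import java.util.TreeMap;

class ResumeDemo {
    final TreeMap<Integer, String> workQueue; // docid -> URL, stand-in for WorkQueues
    final TreeMap<Integer, String> inProcess; // docid -> URL, stand-in for InProcessPagesDB
    final int rescheduled;                    // how many URLs were recovered

    // The two maps play the role of databases that survived the previous run.
    ResumeDemo(TreeMap<Integer, String> storedWorkQueue,
               TreeMap<Integer, String> storedInProcess) {
        this.workQueue = storedWorkQueue;
        this.inProcess = storedInProcess;
        this.rescheduled = rescheduleUnfinished();
    }

    // Moves every URL that was taken but never finished back into the queue.
    // Because entries are keyed, each one returns to its sorted position.
    private int rescheduleUnfinished() {
        int count = inProcess.size();
        workQueue.putAll(inProcess);
        inProcess.clear();
        return count;
    }

    public static void main(String[] args) {
        TreeMap<Integer, String> queue = new TreeMap<>();
        queue.put(5, "http://example.com/e");
        TreeMap<Integer, String> unfinished = new TreeMap<>();
        unfinished.put(2, "http://example.com/b");
        unfinished.put(3, "http://example.com/c");

        ResumeDemo frontier = new ResumeDemo(queue, unfinished);
        System.out.println(frontier.rescheduled + " rescheduled, queue keys: "
                + frontier.workQueue.keySet());
    }
}
```

Running the recovery in the constructor matches the statement that it happens when the Frontier is created, so worker threads only ever see a queue that already contains the recovered URLs.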

That's all for the Frontier. In the next part we will look at how the Fetcher works.

0 0