Heritrix3: Controlling Crawl Links

/**
 * An interface for URI Frontiers.
 *
 * <p>A URI Frontier is a pluggable module in Heritrix that maintains the
 * internal state of the crawl. This includes (but is not limited to):
 * <ul>
 *     <li>What URIs have been discovered
 *     <li>What URIs are being processed (fetched)
 *     <li>What URIs have been processed
 *     <li>In what order unprocessed URIs will be processed
 * </ul>
 *
 * <p>The Frontier is also responsible for enforcing any politeness restrictions
 * that may have been applied to the crawl, such as limiting simultaneous
 * connections to the same host, server or IP number to 1 (or any other fixed
 * amount), delays between connections, etc.
 *
 * <p>A URIFrontier is created by the
 * {@link org.archive.crawler.framework.CrawlController CrawlController}, which
 * is in turn responsible for providing access to it. Most significant among
 * those modules interested in the Frontier are the
 * {@link org.archive.crawler.framework.ToeThread ToeThreads}, which perform the
 * actual work of processing a URI.
 *
 * <p>The methods defined in this interface are those required to get URIs for
 * processing, report the results of processing back (ToeThreads) and to get
 * access to various statistical data along the way. The statistical data is
 * of interest to {@link org.archive.crawler.framework.StatisticsTracker
 * Statistics Tracking} modules. A couple of additional methods are provided
 * to be able to inspect and manipulate the Frontier at runtime.
 *
 * <p>The statistical data exposed by this interface is:
 * <ul>
 *     <li> {@link #discoveredUriCount() Discovered URIs}
 *     <li> {@link #queuedUriCount() Queued URIs}
 *     <li> {@link #finishedUriCount() Finished URIs}
 *     <li> {@link #succeededFetchCount() Successfully processed URIs}
 *     <li> {@link #failedFetchCount() Failed to process URIs}
 *     <li> {@link #disregardedUriCount() Disregarded URIs}
 *     <li> {@link #totalBytesWritten() Total bytes written}
 * </ul>
 *
 * <p>In addition the frontier may optionally implement an interface that
 * exposes information about hosts.
 *
 * <p>Furthermore any implementation of the URI Frontier should trigger
 * {@link org.archive.crawler.event.CrawlURIDispositionEvent
 * CrawlURIDispositionEvents} on the ApplicationContext to allow
 * statistics modules or other interested observers to collect info
 * about each completed URI's processing.
 *
 * <p>All URI Frontiers inherit from
 * {@link org.archive.crawler.settings.ModuleType ModuleType}
 * and therefore creating settings follows the usual pattern of pluggable modules
 * in Heritrix.
 *
 * @author Gordon Mohr
 * @author Kristinn Sigurdsson
 *
 * @see org.archive.crawler.framework.CrawlController
 * @see org.archive.crawler.framework.CrawlController#fireCrawledURIDisregardEvent(CrawlURI)
 * @see org.archive.crawler.framework.CrawlController#fireCrawledURIFailureEvent(CrawlURI)
 * @see org.archive.crawler.framework.CrawlController#fireCrawledURINeedRetryEvent(CrawlURI)
 * @see org.archive.crawler.framework.CrawlController#fireCrawledURISuccessfulEvent(CrawlURI)
 * @see org.archive.crawler.framework.StatisticsTracker
 * @see org.archive.crawler.framework.ToeThread
 * @see org.archive.crawler.settings.ModuleType
 */
public interface Frontier extends Lifecycle, MultiReporter {
    /**
     * Get the next URI that should be processed; if no suitable URI is
     * available, return null.
     *
     * @return the next URI that should be processed.
     * @throws InterruptedException
     */
    CrawlURI next() throws InterruptedException;

    /**
     * Returns true if there are no more URIs to crawl.
     *
     * @return true if the frontier contains no more URIs to crawl.
     */
    boolean isEmpty();

    /**
     * Schedule the given URI for crawling right away.
     *
     * @param caURI The URI to schedule.
     *
     * @see CrawlURI#setSchedulingDirective(int)
     */
    public void schedule(CrawlURI caURI);

    /**
     * Report, from a ToeThread, that crawling a URI has finished.
     *
     * <p>This method is synchronized.
     *
     * @param cURI The URI that has finished processing.
     */
    public void finished(CrawlURI cURI);
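Taken together, schedule(), next() and finished() define the lifecycle that a worker thread drives. Below is a minimal sketch of that loop, not actual ToeThread code: process() is a hypothetical stand-in for the processing chain.

    // Illustrative only: a simplified worker loop over the Frontier contract;
    // real ToeThreads are more involved. 'process()' is a hypothetical helper.
    void workerLoop(Frontier frontier) throws InterruptedException {
        while (true) {
            CrawlURI curi = frontier.next();   // null if no suitable URI right now
            if (curi == null) {
                continue;
            }
            try {
                process(curi);                 // run the URI through the processing chain
            } finally {
                frontier.finished(curi);       // always report back, success or failure
            }
        }
    }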
    /**
     * Number of <i>discovered</i> URIs.
     *
     * @return Number of discovered URIs.
     */
    public long discoveredUriCount();

    /**
     * Number of URIs <i>queued</i> up and waiting for processing.
     *
     * @return Number of queued URIs.
     */
    public long queuedUriCount();

    /**
     * @return Number of URIs not currently queued/eligible but scheduled for future
     */
    public long futureUriCount();

    /**
     * Ordinal position of the 'deepest' URI eligible
     * for crawling. Essentially, the length of the longest
     * frontier internal queue.
     *
     * @return long URI count to deepest URI
     */
    public long deepestUri(); // aka longest queue

    /**
     * Average depth of the last URI in all eligible queues.
     * That is, the average length of all eligible queues.
     *
     * @return long average depth of last URIs in queues
     */
    public long averageDepth(); // aka average queue length

    /**
     * Ratio of number of threads that would theoretically allow
     * maximum crawl progress (if each was as productive as current
     * threads), to current number of threads.
     *
     * @return float congestion ratio
     */
    public float congestionRatio(); // multiple of threads needed for max progress

    /**
     * Number of URIs that have <i>finished</i> processing.
     *
     * <p>Includes both those that were processed successfully and failed to be
     * processed (excluding those that failed but will be retried). Does not
     * include those URIs that have been 'forgotten' (deemed out of scope when
     * trying to fetch, most likely due to the operator changing the scope
     * definition).
     *
     * @return Number of finished URIs.
     */
    public long finishedUriCount();

    /**
     * Number of <i>successfully</i> processed URIs.
     *
     * <p>Any URI that was processed successfully. This includes URIs that
     * returned 404s and other error codes that do not originate within the
     * crawler.
     *
     * @return Number of <i>successfully</i> processed URIs.
     */
    public long succeededFetchCount();

    /**
     * Number of URIs that <i>failed</i> to process.
     *
     * <p>URIs that could not be processed because of some error or failure in
     * the processing chain. Can include failure to acquire prerequisites, to
     * establish a connection with the host and any number of other problems.
     * Does not count those that will be retried, only those that have
     * permanently failed.
     *
     * @return Number of URIs that failed to process.
     */
    public long failedFetchCount();

    /**
     * Number of URIs that were scheduled at one point but have been
     * <i>disregarded</i>.
     *
     * <p>Counts any URI that is scheduled only to be disregarded
     * because it is determined to lie outside the scope of the crawl. Most
     * commonly this will be due to robots.txt exclusions.
     *
     * @return The number of URIs that have been disregarded.
     */
    public long disregardedUriCount();
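These counters make it easy to log a one-line progress summary. A minimal sketch, assuming a frontier reference; per the Javadoc above, the finished count covers both successes and permanent failures, while disregarded URIs are tracked separately.

    // Illustrative only: periodic progress line built from the statistics above.
    System.out.printf("discovered=%d queued=%d finished=%d (ok=%d fail=%d) disregarded=%d%n",
            frontier.discoveredUriCount(),
            frontier.queuedUriCount(),
            frontier.finishedUriCount(),     // successes + permanent failures
            frontier.succeededFetchCount(),
            frontier.failedFetchCount(),
            frontier.disregardedUriCount());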
    /**
     * Load URIs from a file, for scheduling and/or considered-included
     * status (if from a recovery log).
     *
     * <p>The 'params' Map describes the source file to use and options
     * in effect regarding its format and handling. Significant keys
     * are:
     *
     * <p>"path": full path to source file. If the path ends '.gz', it
     * will be considered to be GZIP compressed.
     * <p>"format": one of "onePer", "crawlLog", or "recoveryLog"
     * <p>"forceRevisit": if non-null, URIs will be force-scheduled even
     * if already considered included
     * <p>"scopeSchedules": if non-null, any URI imported will be checked
     * against the frontier's configured scope before scheduling
     *
     * <p>If the "format" is "recoveryLog", 7 more keys are significant:
     *
     * <p>"includeSuccesses": if non-null, success lines ("Fs") in the log
     * will be considered-included. (Usually, this is the aim of
     * a recovery-log import.)
     * <p>"includeFailures": if non-null, failure lines ("Ff") in the log
     * will be considered-included. (Sometimes, this is desired.)
     * <p>"includeScheduleds": if non-null, scheduled lines ("F+") in the
     * log will be considered-included. (Atypical, but an option for
     * completeness.)
     * <p>"scopeIncludes": if non-null, any of the above will be checked
     * against the frontier's configured scope before consideration
     *
     * <p>"scheduleSuccesses": if non-null, success lines ("Fs") in the log
     * will be schedule-attempted. (Atypical, as all successes
     * are preceded by "F+" lines.)
     * <p>"scheduleFailures": if non-null, failure lines ("Ff") in the log
     * will be schedule-attempted. (Atypical, as all failures
     * are preceded by "F+" lines.)
     * <p>"scheduleScheduleds": if non-null, scheduled lines ("F+") in the
     * log will be schedule-attempted. (Usually, this is the aim of a
     * recovery-log import.)
     *
     * TODO: add parameter for auto-unpause-at-good-time
     *
     * @param params Map describing source file and options as above
     * @throws IOException If problems occur reading file.
     * @throws JSONException
     */
    public void importURIs(String params) throws IOException;

    /**
     * Import URIs from the given file (in recover-log-like format, with
     * a 3-character 'type' tag preceding a URI with optional hops/via).
     *
     * If 'includeOnly' is true, the URIs will only be imported into
     * the frontier's alreadyIncluded structure, without being queued.
     *
     * Only imports URIs if their first tag field matches the acceptTags
     * pattern.
     *
     * @param source File recovery log file to use (may be .gz compressed)
     * @param applyScope whether to apply crawl scope to URIs
     * @param includeOnly whether to only add to included filter, not schedule
     * @param forceFetch whether to force fetching, even if already seen
     *        (ignored if includeOnly is set)
     * @param acceptTags String regex; only lines whose first field
     *        match will be included
     * @return number of lines in recovery log (for reference)
     * @throws IOException
     */
    public long importRecoverFormat(File source, boolean applyScope,
            boolean includeOnly, boolean forceFetch, String acceptTags)
        throws IOException;
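Given the @throws JSONException above, the params string is presumably a JSON object carrying the keys listed in the Javadoc. A hedged sketch of a typical recovery-log import follows; the path is hypothetical and the exact accepted encoding may vary by implementation.

    // Illustrative only: import a recovery log, marking success lines as
    // considered-included and re-scheduling the scheduled ("F+") lines.
    String params = "{"
            + "\"path\": \"/crawls/job1/logs/recover.gz\","   // hypothetical path
            + "\"format\": \"recoveryLog\","
            + "\"includeSuccesses\": true,"
            + "\"scheduleScheduleds\": true,"
            + "\"scopeIncludes\": true"
            + "}";
    frontier.importURIs(params);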
    /**
     * Get a <code>URIFrontierMarker</code> initialized with the given
     * regular expression at the 'start' of the Frontier.
     *
     * @param regex The regular expression that URIs within the frontier must
     *              match to be considered within the scope of this marker
     * @param inCacheOnly If set to true, only those URIs within the frontier
     *              that are stored in cache (usually this means in memory
     *              rather than on disk, but that is an implementation
     *              detail) will be considered. Others will be entirely
     *              ignored, as if they don't exist. This is useful for quick
     *              peeks at the top of the URI list.
     * @return A URIFrontierMarker that is set for the 'start' of the frontier's
     *              URI list.
     */
//    public FrontierMarker getInitialMarker(String regex,
//                                           boolean inCacheOnly);

    /**
     * Returns a list of all uncrawled URIs starting from a specified marker
     * until <code>numberOfMatches</code> is reached.
     *
     * <p>Any encountered URI that has not been successfully crawled, terminally
     * failed, disregarded or is currently being processed is included. As
     * there may be duplicates in the frontier, there may also be duplicates
     * in the report. Thus this includes both discovered and pending URIs.
     *
     * <p>The list is a set of strings containing the URI strings. If verbose is
     * true the string will include some additional information (path to URI
     * and parent).
     *
     * <p>The <code>URIFrontierMarker</code> will be advanced to the position at
     * which its maximum number of matches found is reached. Reusing it for
     * subsequent calls will thus effectively get the 'next' batch. Making
     * any changes to the frontier can invalidate the marker.
     *
     * <p>While the order returned is consistent, it does <i>not</i> have any
     * explicit relation to the likely order in which they may be processed.
     *
     * <p><b>Warning:</b> It is unsafe to make changes to the frontier while
     * this method is executing. The crawler should be in a paused state before
     * invoking it.
     *
     * @param marker
     *            A marker specifying from what position in the Frontier the
     *            list should begin.
     * @param numberOfMatches
     *            how many URIs to add at most to the list before returning it
     * @param verbose
     *            if set to true the strings returned will contain additional
     *            information about each URI beyond their names.
     * @return a list of all pending URIs falling within the specification
     *            of the marker
     * @throws InvalidFrontierMarkerException when the
     *            <code>URIFrontierMarker</code> does not match the internal
     *            state of the frontier. Tolerance for this can vary
     *            considerably from one URIFrontier implementation to the next.
     * @see FrontierMarker
     * @see #getInitialMarker(String, boolean)
     */
    public CompositeData getURIsList(
            String marker,
            int numberOfMatches,
            String regex,
            boolean verbose);
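A hedged sketch of inspecting pending URIs; whether a null marker denotes the 'start' of the frontier is implementation-specific, and the regex is purely illustrative.

    // Illustrative only: peek at up to 50 pending URIs matching a pattern.
    // Per the warning above, pause the crawl before inspecting the frontier.
    frontier.pause();
    CompositeData batch = frontier.getURIsList(null, 50, "https?://example\\.org/.*", false);
    frontier.unpause();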
    /**
     * Delete any URI that matches the given regular expression from the list
     * of discovered and pending URIs. This does not prevent them from being
     * rediscovered.
     *
     * <p>Any encountered URI that has not been successfully crawled, terminally
     * failed, disregarded or is currently being processed is considered to be
     * a pending URI.
     *
     * <p><b>Warning:</b> It is unsafe to make changes to the frontier while
     * this method is executing. The crawler should be in a paused state before
     * invoking it.
     *
     * @param match A regular expression; any URIs that match it will be
     *              deleted.
     * @return The number of URIs deleted
     */
    public long deleteURIs(
            String queueRegex,
            String match);

    /**
     * Notify Frontier that a CrawlURI has been deleted outside of the
     * normal next()/finished() lifecycle.
     *
     * @param curi Deleted CrawlURI.
     */
    public void deleted(CrawlURI curi);

    /**
     * Notify Frontier that it should consider the given URI as if
     * already scheduled.
     *
     * @param curi CrawlURI instance to add to the Already Included set.
     */
    public void considerIncluded(CrawlURI curi);

    /**
     * Notify Frontier that it should not release any URIs, instead
     * holding all threads, until instructed otherwise.
     */
    public void pause();

    /**
     * Resumes the release of URIs to crawl, allowing worker
     * ToeThreads to proceed.
     */
    public void unpause();

    /**
     * Notify Frontier that it should end the crawl, giving
     * any worker ToeThread that asks for a next() an
     * EndedException.
     */
    public void terminate();

    /**
     * @return Return the instance of {@link FrontierJournal} that
     * this Frontier is using. May be null if no journaling.
     */
    public FrontierJournal getFrontierJournal();

    /**
     * @param cauri CrawlURI for which we're to calculate and
     * set class key.
     * @return Classkey for <code>cauri</code>.
     */
    public String getClassKey(CrawlURI cauri);

    /**
     * Return the internally-configured crawl 'scope' (rules for
     * deciding whether a URI is crawled or not).
     */
    public DecideRule getScope();

    /**
     * Request that Frontier allow crawling to begin. Usually
     * just unpauses Frontier, if paused.
     */
    public void run();

    /**
     * Get the 'frontier group' (usually queue) for the given
     * CrawlURI.
     *
     * @param curi CrawlURI to find matching group
     * @return FrontierGroup for the CrawlURI
     */
    public FrontierGroup getGroup(CrawlURI curi);

    /**
     * Generic interface representing the internal groupings
     * of a Frontier's URIs -- usually queues. Currently only
     * offers the HasCrawlSubstats interface.
     */
    public interface FrontierGroup
    extends FetchStats.HasFetchStats, FetchStats.CollectsFetchStats, IdentityCacheable {
    }

    /**
     * Request the Frontier reach the given state as soon as possible. (Only
     * when a later notification is given to the CrawlController has the state
     * actually been reached.)
     *
     * @param target Frontier.State to pursue
     */
    public void requestState(State target);

    /**
     * Enumeration of possible target states.
     */
    public enum State {
        RUN,    // juggle/prioritize/emit; usual state
        EMPTY,  // running/ready but no URIs queued/scheduled
        HOLD,   // NOT YET USED: enter a consistent, stable, checkpointable state ASAP
        PAUSE,  // enter a stable state where no URIs are in-progress; unlike
                // HOLD, requires all in-process URIs to complete
        FINISH  // end and cleanup; may not return to any other state after
                // this state is requested/reached
    }
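A common runtime-manipulation pattern, following the warnings above: pause, prune matching URIs, then resume. A minimal sketch; the regular expressions are hypothetical, and ".*" is assumed here to target all queues.

    // Illustrative only: pause, prune unwanted URIs, then resume.
    frontier.requestState(Frontier.State.PAUSE);  // waits for in-process URIs to complete
    // ... after the CrawlController reports the paused state was reached ...
    long removed = frontier.deleteURIs(".*", "^https?://[^/]*\\.example\\.com/calendar/.*");
    frontier.requestState(Frontier.State.RUN);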
    /**
     * Inform frontier that a block of processing that should complete atomically
     * with respect to checkpoints is about to begin. Callers should ensure an
     * endDisposition() call soon follows; a mismatch risks freezing the frontier
     * if a checkpoint is requested.
     *
     * @param curi
     */
    public void beginDisposition(CrawlURI curi);

    /**
     * Inform frontier the processing signalled by an earlier pending
     * beginDisposition() call has finished. Implementors should be resilient
     * against extra endDisposition calls, as callers dealing with exceptional
     * conditions need to be free to call this 'just in case'.
     */
    public void endDisposition();
}
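Since a mismatched beginDisposition() can freeze the frontier when a checkpoint is requested, a try/finally bracket is the natural calling pattern. A minimal sketch; dispose() is a hypothetical stand-in for the work that must be atomic:

    // Illustrative only: keep begin/endDisposition balanced even on exceptions.
    void atomicDisposition(Frontier frontier, CrawlURI curi) {
        frontier.beginDisposition(curi);
        try {
            dispose(curi);              // hypothetical work, atomic w.r.t. checkpoints
        } finally {
            frontier.endDisposition();  // extra calls are tolerated per the contract above
        }
    }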
