Notes on analyzing the Nutch source code: the Crawl and Fetcher classes (1)

Source: Internet. Published by: 手写速记知乎. Editor: 程序博客网. Time: 2024/06/04 18:08

These notes only record my own thinking. If you find mistakes, please point them out. Thank you!

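Because Nutch's crawl command is bin/nutch crawl ..., it is easy to find the entry point of the crawl program: the main method of the Crawl class. The relevant code of that method is as follows: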

Path rootUrlDir = null;
Path dir = new Path("crawl-" + getDate());
int threads = job.getInt("fetcher.threads.fetch", 10);
int depth = 5;
long topN = Long.MAX_VALUE;
String indexerName = "lucene";
String solrUrl = null;

These declarations supply default values, which are used when a parameter is not given on the command line.

for (int i = 0; i < args.length; i++) {
  if ("-dir".equals(args[i])) {
    dir = new Path(args[i+1]);
    i++;
  } else if ("-threads".equals(args[i])) {
    threads = Integer.parseInt(args[i+1]);
    i++;
  } else if ("-depth".equals(args[i])) {
    depth = Integer.parseInt(args[i+1]);
    i++;
  } else if ("-topN".equals(args[i])) {
    topN = Integer.parseInt(args[i+1]);
    i++;
  } else if ("-solr".equals(args[i])) {
    indexerName = "solr";
    solrUrl = StringUtils.lowerCase(args[i + 1]);
    i++;
  } else if (args[i] != null) {
    rootUrlDir = new Path(args[i]);
  }
}

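This loop overwrites the defaults with whatever was given on the command line. dir is the path where the crawl data is stored; threads is the number of fetcher threads; depth is the crawl depth, that is, how many levels of hyperlinks are followed starting from the seed URLs; topN limits the download to the top N qualifying pages. Any argument that is not one of these flags is taken as rootUrlDir, the directory holding the seed URLs.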
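To make the flag handling concrete, here is a minimal, self-contained sketch of the same parsing pattern. It is not the Nutch code: CrawlOptions is a hypothetical holder class, plain strings stand in for Hadoop's Path, the default directory name is a placeholder for "crawl-" + getDate(), and the -solr branch is left out. It also parses topN with Long.parseLong, whereas the excerpt uses Integer.parseInt.

```java
// Hypothetical holder for the parsed options; not a Nutch class.
class CrawlOptions {
    String rootUrlDir = null;
    String dir = "crawl-default";   // placeholder for "crawl-" + getDate()
    int threads = 10;
    int depth = 5;
    long topN = Long.MAX_VALUE;

    // Same shape as the loop in the excerpt: each flag consumes the argument
    // that follows it; anything else is taken as the seed URL directory.
    static CrawlOptions parse(String[] args) {
        CrawlOptions o = new CrawlOptions();
        for (int i = 0; i < args.length; i++) {
            if ("-dir".equals(args[i])) {
                o.dir = args[++i];
            } else if ("-threads".equals(args[i])) {
                o.threads = Integer.parseInt(args[++i]);
            } else if ("-depth".equals(args[i])) {
                o.depth = Integer.parseInt(args[++i]);
            } else if ("-topN".equals(args[i])) {
                o.topN = Long.parseLong(args[++i]);
            } else if (args[i] != null) {
                o.rootUrlDir = args[i];
            }
        }
        return o;
    }

    public static void main(String[] args) {
        CrawlOptions o = parse(new String[] {
            "urls", "-dir", "crawl", "-depth", "3", "-topN", "50"});
        System.out.println("rootUrlDir=" + o.rootUrlDir + " dir=" + o.dir
            + " threads=" + o.threads + " depth=" + o.depth + " topN=" + o.topN);
    }
}
```

As in the excerpt, a flag given as the last argument with no value after it would fail with an ArrayIndexOutOfBoundsException; the sketch does not add validation.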

injector.inject(crawlDb, rootUrlDir);
int i;
for (i = 0; i < depth; i++) {
  // generate new segment
  Path[] segs = generator.generate(crawlDb, segments, -1, topN,
      System.currentTimeMillis());
  if (segs == null) {
    LOG.info("Stopping at depth=" + i + " - no more URLs to fetch.");
    break;
  }

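injector.inject injects the URLs to be crawled into crawlDb. The loop then runs at most depth rounds; in each round the generator produces a new segment of URLs to fetch, and the loop stops early if no segment is produced because there are no more URLs to fetch.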
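The control flow of this loop can be sketched as a small in-memory simulation. Everything here is an assumption made for illustration: DepthLoopSketch and its generate method are hypothetical stand-ins, a queue of strings replaces crawlDb, and nothing is actually fetched. Only the shape (at most depth rounds, at most topN URLs per round, stop when the generator returns null) follows the excerpt.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class DepthLoopSketch {
    // Hypothetical stand-in for generator.generate: takes up to topN URLs
    // from the frontier, or returns null when nothing is left to fetch.
    static List<String> generate(Deque<String> frontier, long topN) {
        if (frontier.isEmpty()) {
            return null;
        }
        List<String> segment = new ArrayList<>();
        while (!frontier.isEmpty() && segment.size() < topN) {
            segment.add(frontier.poll());
        }
        return segment;
    }

    // Runs at most `depth` rounds and returns how many rounds produced a segment.
    static int crawl(Deque<String> frontier, int depth, long topN) {
        int i;
        for (i = 0; i < depth; i++) {
            List<String> segment = generate(frontier, topN);
            if (segment == null) {
                System.out.println("Stopping at depth=" + i + " - no more URLs to fetch.");
                break;
            }
            // A real crawl would fetch the segment here and feed newly
            // discovered links back into the frontier; this sketch only prints.
            System.out.println("round " + i + ": would fetch " + segment);
        }
        return i;
    }

    public static void main(String[] args) {
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add("http://example.com/a");
        frontier.add("http://example.com/b");
        frontier.add("http://example.com/c");
        crawl(frontier, 5, 2);
    }
}
```

Because the sketch never adds new links to the frontier, it runs out of URLs quickly; in a real crawl each round typically discovers more links, so depth is usually what ends the loop.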

fetcher.fetch(segs[0], threads, org.apache.nutch.fetcher.Fetcher.isParsing(conf)); // fetch it

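This line fetches the URLs in the generated segment. In the Crawl class it is the only line directly related to the crawler itself, so we leave the Crawl class here for now and look at how fetcher carries out the fetching.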

Go straight to the fetch method of the Fetcher class:

checkConfiguration();
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
long start = System.currentTimeMillis();
if (LOG.isInfoEnabled()) {
  LOG.info("Fetcher: starting at " + sdf.format(start));
  LOG.info("Fetcher: segment: " + segment);
}
// set the actual time for the timelimit relative
// to the beginning of the whole job and not of a specific task
// otherwise it keeps trying again if a task fails
long timelimit = getConf().getLong("fetcher.timelimit.mins", -1);
if (timelimit != -1) {
  timelimit = System.currentTimeMillis() + (timelimit * 60 * 1000);
  LOG.info("Fetcher Timelimit set for : " + timelimit);
  getConf().setLong("fetcher.timelimit.mins", timelimit);
}

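The method first checks the configuration, creates a SimpleDateFormat for printing timestamps, and saves the current time in start. It then checks whether info-level logging is enabled and, if so, logs the start time and the segment. Nutch uses LOG4J for logging. I came across LOG4J when I was helping my teacher Guo write a book, but my own programs seemed too small to need it, so I never touched it again. But I digress. With the saved start time, the time taken by each step can be recorded.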

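The timelimit variable holds the time limit for fetching. It is read from fetcher.timelimit.mins in minutes, not seconds, and -1 means no limit. If a limit is set, it is converted into an absolute deadline in milliseconds (the current time plus the limit), and getConf().setLong("fetcher.timelimit.mins", timelimit); writes that deadline back into the configuration under the same key. As the code comment says, this makes the limit relative to the start of the whole job rather than to a single task, so a failed task that is retried does not get a fresh limit.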
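The conversion from a limit in minutes to an absolute deadline is simple arithmetic, sketched below. TimeLimitSketch and both of its methods are hypothetical names for illustration; the expired check in particular is an assumption about how such a deadline would be consulted later, not code taken from Nutch.

```java
import java.util.concurrent.TimeUnit;

class TimeLimitSketch {
    // Converts a limit in minutes into an absolute deadline in epoch
    // milliseconds. -1 means "no limit" and is passed through unchanged.
    static long toDeadlineMillis(long limitMins, long nowMillis) {
        if (limitMins == -1) {
            return -1;
        }
        return nowMillis + TimeUnit.MINUTES.toMillis(limitMins);
    }

    // Assumed check: the deadline has passed if one is set and now >= deadline.
    static boolean expired(long deadlineMillis, long nowMillis) {
        return deadlineMillis != -1 && nowMillis >= deadlineMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long deadline = toDeadlineMillis(30, now);
        System.out.println("deadline is " + (deadline - now) + " ms from now");
        System.out.println("expired now? " + expired(deadline, now));
    }
}
```

Storing an absolute deadline instead of a duration means every task that reads the value sees the same cutoff, regardless of when that task itself started.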

JobConf job = new NutchJob(getConf());
job.setJobName("fetch " + segment);
job.setInt("fetcher.threads.fetch", threads);
job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
job.setBoolean("fetcher.parse", parsing);
// for politeness, don't permit parallel execution of a single task
job.setSpeculativeExecution(false);
FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.GENERATE_DIR_NAME));
job.setInputFormat(InputFormat.class);
job.setMapRunnerClass(Fetcher.class);
FileOutputFormat.setOutputPath(job, segment);
job.setOutputFormat(FetcherOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(NutchWritable.class);
JobClient.runJob(job);

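The code above uses Hadoop's distributed computing to fetch pages in a distributed way. Crawling the Internet is a huge undertaking: a whole-web crawl usually needs several servers running for several days, so distributed computing is the natural choice. The job object can be seen as a unit of work that encapsulates the settings the fetch needs, such as the number of threads, the segment name, whether to parse, and the input and output paths. The last line, JobClient.runJob(job);, submits this job for distributed execution and waits for it to finish.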

long end = System.currentTimeMillis();
LOG.info("Fetcher: finished at " + sdf.format(end) + ", elapsed: " + TimingUtil.elapsedTime(start, end));

These are the last two statements of the fetch method; they log the finish time and the time the fetch took.
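The start/end timing pattern used by fetch can be sketched on its own as follows. ElapsedSketch and its elapsed helper are hypothetical and only stand in for TimingUtil.elapsedTime; the HH:mm:ss output format is an assumption for this sketch, not a statement about what the Nutch utility prints.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

class ElapsedSketch {
    // Hypothetical helper: formats the difference between two millisecond
    // timestamps as HH:mm:ss (hours are not wrapped at 24).
    static String elapsed(long startMillis, long endMillis) {
        long totalSeconds = (endMillis - startMillis) / 1000;
        long hours = totalSeconds / 3600;
        long minutes = (totalSeconds % 3600) / 60;
        long seconds = totalSeconds % 60;
        return String.format("%02d:%02d:%02d", hours, minutes, seconds);
    }

    public static void main(String[] args) throws InterruptedException {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
        long start = System.currentTimeMillis();
        System.out.println("starting at " + sdf.format(new Date(start)));

        Thread.sleep(1200); // placeholder for the real work

        long end = System.currentTimeMillis();
        System.out.println("finished at " + sdf.format(new Date(end))
            + ", elapsed: " + elapsed(start, end));
    }
}
```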