Nutch Source Code Analysis --- 1
Nutch Source Code Analysis --- inject
Starting with this chapter we analyze the Nutch 1.12 source code. Nutch crawls pages in five steps: inject, generate, fetch, parse, and updatedb. This chapter looks at the inject command. The Nutch tutorial on the official site gives the following example:
bin/nutch inject crawl/crawldb urls
The seed.txt file in the urls directory contains the initial (seed) URLs.
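As a minimal illustration (the URLs below are placeholders; one URL per line, optionally followed by tab-separated metadata, which is covered later with processMetaData), seed.txt might contain:

http://nutch.apache.org/
http://example.com/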
After building the Nutch source, the nutch script in the runtime/local/bin/ directory contains the following snippet:
...elif [ "$COMMAND" = "inject" ] ; then CLASS=org.apache.nutch.crawl.Injectorelif [ "$COMMAND" = "generate" ] ; then CLASS=org.apache.nutch.crawl.Generatorelif [ "$COMMAND" = "fetch" ] ; then CLASS=org.apache.nutch.fetcher.Fetcherelif [ "$COMMAND" = "parse" ] ; then CLASS=org.apache.nutch.parse.ParseSegmentelif [ "$COMMAND" = "updatedb" ] ; then CLASS=org.apache.nutch.crawl.CrawlDb...exec "${EXEC_CALL[@]}" $CLASS "$@"
EXEC_CALL holds the command that launches the Java program, so the inject command ultimately runs the main function of org.apache.nutch.crawl.Injector, roughly equivalent to java -cp <classpath> org.apache.nutch.crawl.Injector crawl/crawldb urls.
Injector::main
public static void main(String[] args) throws Exception {
  int res = ToolRunner.run(NutchConfiguration.create(), new Injector(), args);
  System.exit(res);
}
ToolRunner is a Hadoop utility; this code eventually invokes the run function of the Injector class.
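As background, ToolRunner.run behaves roughly like the sketch below (simplified from Hadoop's actual implementation, which additionally handles a null Configuration): it strips the generic Hadoop options out of args, applies them to the Configuration, and passes the remaining arguments to the Tool.

// Simplified sketch of Hadoop's ToolRunner.run: parse the generic
// options (-D key=value, -files, -libjars, ...) into the Configuration,
// hand the Configuration to the Tool, then call Tool.run with the
// remaining tool-specific arguments (here: "crawl/crawldb" and "urls").
public static int run(Configuration conf, Tool tool, String[] args)
    throws Exception {
  GenericOptionsParser parser = new GenericOptionsParser(conf, args);
  tool.setConf(conf);
  String[] toolArgs = parser.getRemainingArgs();
  return tool.run(toolArgs);
}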
Injector::main->Injector::run
public int run(String[] args) throws Exception {
  ...
  inject(new Path(args[0]), new Path(args[1]), overwrite, update);
  ...
}

public void inject(Path crawlDb, Path urlDir, boolean overwrite,
    boolean update) throws IOException, ClassNotFoundException,
    InterruptedException {
  ...
  Configuration conf = getConf();
  conf.setLong("injector.current.time", System.currentTimeMillis());
  conf.setBoolean("db.injector.overwrite", overwrite);
  conf.setBoolean("db.injector.update", update);
  conf.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);

  FileSystem fs = FileSystem.get(conf);
  Path current = new Path(crawlDb, CrawlDb.CURRENT_NAME);
  if (!fs.exists(current))
    fs.mkdirs(current);
  Path tempCrawlDb = new Path(crawlDb,
      "crawldb-" + Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));
  Path lock = new Path(crawlDb, CrawlDb.LOCK_NAME);
  LockUtil.createLockFile(fs, lock, false);

  Job job = Job.getInstance(conf, "inject " + urlDir);
  job.setJarByClass(Injector.class);
  job.setMapperClass(InjectMapper.class);
  job.setReducerClass(InjectReducer.class);
  job.setOutputFormatClass(MapFileOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);
  job.setSpeculativeExecution(false);

  MultipleInputs.addInputPath(job, current, SequenceFileInputFormat.class);
  MultipleInputs.addInputPath(job, urlDir, KeyValueTextInputFormat.class);
  FileOutputFormat.setOutputPath(job, tempCrawlDb);

  job.waitForCompletion(true);
  CrawlDb.install(job, crawlDb);
}
The crawlDb argument passed in is crawl/crawldb (args[0]), and urlDir is the urls directory (args[1]).
The Hadoop Configuration is obtained via getConf() and the injector-related options are set on it: the current time, the overwrite and update flags, and the flag suppressing _SUCCESS marker files.
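These Configuration values are read back on the task side when the mapper and reducer are set up. A hedged sketch of how InjectMapper's setup might pick them up (db.fetch.interval.default and db.score.injected are standard Nutch properties, but the exact field names here are assumptions):

// Hedged sketch of InjectMapper.setup: pull the values that inject()
// stored in the Configuration back out inside the map task.
@Override
protected void setup(Context context) {
  Configuration conf = context.getConfiguration();
  curTime = conf.getLong("injector.current.time", System.currentTimeMillis());
  interval = conf.getInt("db.fetch.interval.default", 2592000);
  scoreInjected = conf.getFloat("db.score.injected", 1.0f);
}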
Under crawl/crawldb, the current directory is created if it does not already exist, a temporary "crawldb-<random number>" path is chosen as the job output (it is promoted to current and cleaned up later), and a .locked lock file is created via LockUtil.createLockFile.
Next a Job is created: the Mapper and Reducer classes are set, and MultipleInputs registers two input sources, the existing data under the current directory and the text files under the urls directory. waitForCompletion then submits the job to the Hadoop framework and blocks until it finishes.
Finally CrawlDb's install function is executed, which rotates the old and current directories and removes the lock file.
Once the job has been submitted to the Hadoop framework, InjectMapper's map function is called first.
InjectMapper::map
public void map(Text key, Writable value, Context context)
    throws IOException, InterruptedException {
  if (value instanceof Text) {
    String url = key.toString().trim();
    url = filterNormalize(url);
    if (url == null) {
      context.getCounter("injector", "urls_filtered").increment(1);
    } else {
      CrawlDatum datum = new CrawlDatum();
      datum.setStatus(CrawlDatum.STATUS_INJECTED);
      datum.setFetchTime(curTime);
      datum.setScore(scoreInjected);
      datum.setFetchInterval(interval);
      String metadata = value.toString().trim();
      if (metadata.length() > 0)
        processMetaData(metadata, datum, url);
      key.set(url);
      scfilters.injectedScore(key, datum);
      context.getCounter("injector", "urls_injected").increment(1);
      context.write(key, datum);
    }
  } else if (value instanceof CrawlDatum) {
    CrawlDatum datum = (CrawlDatum) value;
    String url = filterNormalize(key.toString());
    key.set(url);
    context.write(key, datum);
  }
}
As analyzed above, the inject function registered two input sources with the Hadoop framework, so the map function handles two cases. The key parameter is the URL, and the value is either the metadata that follows the URL on a seed line or a previously stored CrawlDatum.
When value is of type Text, the record comes from the seed.txt file under the urls directory. In this case the URL is read and filterNormalize is called to normalize it into a canonical form (a null return means the URL was filtered out). A CrawlDatum is then created, processMetaData handles the metadata attached to the URL, and scfilters, of type ScoringFilters, assigns the URL an initial score via injectedScore. Finally, Hadoop's Context.write hands the record on to the Reducer.
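filterNormalize is not shown in the excerpt above; conceptually it chains the configured URL normalizers and URL filters, roughly as in the sketch below (the urlNormalizers and filters field names are assumptions based on the Nutch 1.x plugin APIs):

// Hedged sketch of filterNormalize: run the URL through the normalizer
// chain, then through the filter chain; null means the URL was rejected
// (or malformed) and the caller counts it as urls_filtered.
private String filterNormalize(String url) {
  if (url != null) {
    try {
      url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
      url = filters.filter(url); // null if any URLFilter rejects it
    } catch (Exception e) {
      url = null;
    }
  }
  return url;
}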
When value is of type CrawlDatum, the URL has already been processed in an earlier run (the record comes from the existing CrawlDb); in that case the URL is only normalized before the record is passed on to the Reducer.
So regardless of the input source, the map function ultimately emits records whose key is the URL and whose value is a CrawlDatum, which the Reducer then processes.
InjectMapper::map->processMetaData
private void processMetaData(String metadata, CrawlDatum datum, String url) {
  String[] splits = metadata.split(TAB_CHARACTER);
  for (String split : splits) {
    int indexEquals = split.indexOf(EQUAL_CHARACTER);
    String metaname = split.substring(0, indexEquals);
    String metavalue = split.substring(indexEquals + 1);
    if (metaname.equals(nutchScoreMDName)) {
      datum.setScore(Float.parseFloat(metavalue));
    } else if (metaname.equals(nutchFetchIntervalMDName)) {
      datum.setFetchInterval(Integer.parseInt(metavalue));
    } else if (metaname.equals(nutchFixedFetchIntervalMDName)) {
      int fixedInterval = Integer.parseInt(metavalue);
      if (fixedInterval > -1) {
        datum.getMetaData().put(Nutch.WRITABLE_FIXED_INTERVAL_KEY,
            new FloatWritable(fixedInterval));
        datum.setFetchInterval(fixedInterval);
      }
    } else {
      datum.getMetaData().put(new Text(metaname), new Text(metavalue));
    }
  }
}
TAB_CHARACTER defaults to "\t" and EQUAL_CHARACTER defaults to "=". processMetaData splits the metadata on TAB_CHARACTER into individual entries; each entry is then split at the equals sign into a name (metaname) and a value (metavalue), which are applied to the CrawlDatum.
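For example (nutch.score and nutch.fetchInterval are the default key names behind nutchScoreMDName and nutchFetchIntervalMDName; both are configurable), a seed line with tab-separated fields such as

http://nutch.apache.org/	nutch.score=2.5	nutch.fetchInterval=2592000	foo=bar

would set the datum's score to 2.5 and its fetch interval to 2592000 seconds, while the unrecognized key foo would land in the datum's metadata map as Text("foo") -> Text("bar").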
After the map phase finishes, the Hadoop framework calls InjectReducer's reduce function:
InjectReducer::reduce
public void reduce(Text key, Iterable<CrawlDatum> values, Context context)
    throws IOException, InterruptedException {
  for (CrawlDatum val : values) {
    if (val.getStatus() == CrawlDatum.STATUS_INJECTED) {
      injected.set(val);
      injected.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
      injectedSet = true;
    } else {
      old.set(val);
      oldSet = true;
    }
  }
  CrawlDatum result;
  if (injectedSet && (!oldSet || overwrite)) {
    result = injected;
  } else {
    result = old;
    if (injectedSet && update) {
      old.putAllMetaData(injected);
      old.setScore(injected.getScore() != scoreInjected
          ? injected.getScore() : old.getScore());
      old.setFetchInterval(injected.getFetchInterval() != interval
          ? injected.getFetchInterval() : old.getFetchInterval());
    }
  }
  context.write(key, result);
}
In short, the reduce function either overwrites the CrawlDatum previously stored for a URL, or merely merges the injected information into the existing one via putAllMetaData, setScore and setFetchInterval without replacing it. For example, if a seed URL already exists in the CrawlDb and overwrite is false, the old datum is kept; if update is set, its metadata is merged, and its score and fetch interval are taken from the injected datum only where they differ from the injected defaults (scoreInjected and interval).
Once reduce has succeeded, the results are written to the HDFS output path registered earlier (the tempCrawlDb directory). Let us briefly look at how a CrawlDatum is serialized: CrawlDatum implements the write function of Hadoop's WritableComparable.
CrawlDatum::write
public void write(DataOutput out) throws IOException {
  out.writeByte(CUR_VERSION); // store current version
  out.writeByte(status);
  out.writeLong(fetchTime);
  out.writeByte(retries);
  out.writeInt(fetchInterval);
  out.writeFloat(score);
  out.writeLong(modifiedTime);
  if (signature == null) {
    out.writeByte(0);
  } else {
    out.writeByte(signature.length);
    out.write(signature);
  }
  if (metaData != null && metaData.size() > 0) {
    out.writeBoolean(true);
    metaData.write(out);
  } else {
    out.writeBoolean(false);
  }
}
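Because the job uses MapFileOutputFormat, each reducer partition ends up as a MapFile under tempCrawlDb: a directory holding a data SequenceFile plus an index. As a hedged illustration (the part file name below is an assumption, since it depends on the reducer count; bin/nutch readdb crawl/crawldb -dump <dir> is the supported way to inspect the CrawlDb), the records can be read back like this:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.util.NutchConfiguration;

public class CrawlDbDump {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // The part name is an assumption; it depends on the reducer count.
    Path data = new Path("crawl/crawldb/current/part-r-00000/data");
    SequenceFile.Reader reader =
        new SequenceFile.Reader(conf, SequenceFile.Reader.file(data));
    Text key = new Text();
    CrawlDatum value = new CrawlDatum();
    while (reader.next(key, value)) {
      System.out.println(key + "\t" + value); // CrawlDatum has a readable toString
    }
    reader.close();
  }
}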
Now back to CrawlDb's install function. Once Hadoop has finished processing the data, this function performs the final bookkeeping:
public static void install(Job job, Path crawlDb) throws IOException {
  Configuration conf = job.getConfiguration();
  boolean preserveBackup = conf.getBoolean("db.preserve.backup", true);
  FileSystem fs = FileSystem.get(conf);
  Path old = new Path(crawlDb, "old");
  Path current = new Path(crawlDb, CURRENT_NAME);
  Path tempCrawlDb = org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
      .getOutputPath(job);
  FSUtils.replace(fs, old, current, true);
  FSUtils.replace(fs, current, tempCrawlDb, true);
  Path lock = new Path(crawlDb, LOCK_NAME);
  LockUtil.removeLockFile(fs, lock);
  if (!preserveBackup && fs.exists(old)) {
    fs.delete(old, true);
  }
}

public static void replace(FileSystem fs, Path current, Path replacement,
    boolean removeOld) throws IOException {
  Path old = new Path(current + ".old");
  if (fs.exists(current)) {
    fs.rename(current, old);
  }
  fs.rename(replacement, current);
  if (fs.exists(old) && removeOld) {
    fs.delete(old, true);
  }
}

public static boolean removeLockFile(FileSystem fs, Path lockFile)
    throws IOException {
  return fs.delete(lockFile, false);
}
The install function replaces the old directory with the previous current directory, replaces current with the latest tempCrawlDb (the "crawldb-<random number>" directory), and then removes the lock file.
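After a successful inject, the layout under crawl/crawldb therefore looks roughly like this (an illustration, not verbatim output; old is kept only while db.preserve.backup is true, its default):

crawl/crawldb/
  current/            <- the job output, promoted from crawldb-<random number>
    part-r-00000/
      data            <- SequenceFile of <Text url, CrawlDatum> records
      index           <- MapFile index into data
  old/                <- the previous current directory, kept as a backup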