Reading the Nutch 2.2.3 Source: InjectorJob
InjectorJob is the first job in a Nutch crawl. It reads seed URLs, normalizes and filters them, assigns each one an initial score and fetch interval, and injects the results into the web database.
Nutch stores seed URLs one per line; a URL may be followed by tab-separated name=value metadata pairs, in this structure:

    http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source
The program entry point is:
    int res = ToolRunner.run(NutchConfiguration.create(), new InjectorJob(), args);
NutchConfiguration builds the Hadoop configuration for Nutch and is not covered here; the focus is the InjectorJob class. InjectorJob defines an inner UrlMapper class that extends Mapper and implements the map() function; there is no custom reduce function.
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      urlNormalizers = new URLNormalizers(context.getConfiguration(),
          URLNormalizers.SCOPE_INJECT);
      interval = context.getConfiguration().getInt("db.fetch.interval.default", 2592000);
      filters = new URLFilters(context.getConfiguration());
      scfilters = new ScoringFilters(context.getConfiguration());
      scoreInjected = context.getConfiguration().getFloat("db.score.injected", 1.0f);
      curTime = context.getConfiguration().getLong("injector.current.time",
          System.currentTimeMillis());
    }
This is the Mapper's setup() method, which initializes the per-task state: the URL normalizers (urlNormalizers), the default fetch interval in seconds (interval; 2592000 s = 30 days), the URL and scoring filter chains (URLFilters, ScoringFilters), the default injected score (scoreInjected), and the injection timestamp (curTime).
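Each of these lookups relies on Hadoop Configuration's fall-back-to-default semantics: the second argument is returned when the key is absent. A minimal sketch of that pattern, using a plain Map as a stand-in for Configuration (ConfDefaults and its methods are illustrative names, not a Nutch or Hadoop API):

```java
import java.util.HashMap;
import java.util.Map;

public class ConfDefaults {
    // Return the parsed value for key, or the supplied default when absent.
    static int getInt(Map<String, String> conf, String key, int dflt) {
        String v = conf.get(key);
        return v == null ? dflt : Integer.parseInt(v);
    }

    static float getFloat(Map<String, String> conf, String key, float dflt) {
        String v = conf.get(key);
        return v == null ? dflt : Float.parseFloat(v);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("db.score.injected", "2.5");
        // db.fetch.interval.default is absent, so the 30-day default applies
        System.out.println(getInt(conf, "db.fetch.interval.default", 2592000));
        System.out.println(getFloat(conf, "db.score.injected", 1.0f));
    }
}
```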
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String url = value.toString().trim(); // value is line of text
      if (url != null && (url.length() == 0 || url.startsWith("#"))) {
        /* Ignore line that start with # */
        return;
      }
      // if tabs : metadata that could be stored
      // must be name=value and separated by \t
      float customScore = -1f;
      int customInterval = interval;
      Map<String, String> metadata = new TreeMap<String, String>();
      if (url.indexOf("\t") != -1) {
        String[] splits = url.split("\t");
        url = splits[0];
        for (int s = 1; s < splits.length; s++) {
          // find separation between name and value
          int indexEquals = splits[s].indexOf("=");
          if (indexEquals == -1) {
            // skip anything without a =
            continue;
          }
          String metaname = splits[s].substring(0, indexEquals);
          String metavalue = splits[s].substring(indexEquals + 1);
          if (metaname.equals(nutchScoreMDName)) {
            try {
              customScore = Float.parseFloat(metavalue);
            } catch (NumberFormatException nfe) {
            }
          } else if (metaname.equals(nutchFetchIntervalMDName)) {
            try {
              customInterval = Integer.parseInt(metavalue);
            } catch (NumberFormatException nfe) {
            }
          } else
            metadata.put(metaname, metavalue);
        }
      }
      try {
        url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
        url = filters.filter(url); // filter the url
      } catch (Exception e) {
        LOG.warn("Skipping " + url + ":" + e);
        url = null;
      }
      if (url == null) {
        context.getCounter("injector", "urls_filtered").increment(1);
        return;
      } else { // if it passes
        String reversedUrl = TableUtil.reverseUrl(url); // collect it
        WebPage row = WebPage.newBuilder().build();
        row.setFetchTime(curTime);
        row.setFetchInterval(customInterval);
        // now add the metadata
        Iterator<String> keysIter = metadata.keySet().iterator();
        while (keysIter.hasNext()) {
          String keymd = keysIter.next();
          String valuemd = metadata.get(keymd);
          row.getMetadata().put(new Utf8(keymd),
              ByteBuffer.wrap(valuemd.getBytes()));
        }
        if (customScore != -1)
          row.setScore(customScore);
        else
          row.setScore(scoreInjected);
        try {
          scfilters.injectedScore(url, row);
        } catch (ScoringFilterException e) {
          if (LOG.isWarnEnabled()) {
            LOG.warn("Cannot filter injected score for url " + url
                + ", using default (" + e.getMessage() + ")");
          }
        }
        context.getCounter("injector", "urls_injected").increment(1);
        row.getMarkers().put(DbUpdaterJob.DISTANCE, new Utf8(String.valueOf(0)));
        Mark.INJECT_MARK.putMark(row, YES_STRING);
        context.write(reversedUrl, row);
      }
    }
This is the main map() function; it does the following.
1. Clean up each input URL line, trimming the surrounding whitespace:
    String url = value.toString().trim();
Empty lines and lines beginning with '#' are ignored:
    if (url != null && (url.length() == 0 || url.startsWith("#"))) {
      /* Ignore line that start with # */
      return;
    }
2. Given the known seed-line format

    http://www.nutch.org/ \t nutch.score=10 \t nutch.fetchInterval=2592000 \t userType=open_source

the metadata is parsed as follows:
    Map<String, String> metadata = new TreeMap<String, String>();
    if (url.indexOf("\t") != -1) {
      String[] splits = url.split("\t");
      url = splits[0];
      for (int s = 1; s < splits.length; s++) {
        // find separation between name and value
        int indexEquals = splits[s].indexOf("=");
        if (indexEquals == -1) {
          // skip anything without a =
          continue;
        }
        String metaname = splits[s].substring(0, indexEquals);
        String metavalue = splits[s].substring(indexEquals + 1);
        if (metaname.equals(nutchScoreMDName)) {
          try {
            customScore = Float.parseFloat(metavalue);
          } catch (NumberFormatException nfe) {
          }
        } else if (metaname.equals(nutchFetchIntervalMDName)) {
          try {
            customInterval = Integer.parseInt(metavalue);
          } catch (NumberFormatException nfe) {
          }
        } else
          metadata.put(metaname, metavalue);
      }
    }
The line is first split on '\t'; the first token is obviously the URL itself. Each remaining token is then split on '=' into a name and a value: nutch.score and nutch.fetchInterval are recognized specially, and everything else goes into the metadata map.
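The same parsing logic can be sketched as a standalone class (SeedLine and its field names are illustrative, not Nutch APIs; the defaults mirror the ones set up in the mapper):

```java
import java.util.Map;
import java.util.TreeMap;

// Standalone sketch of the seed-line parsing done in UrlMapper.map().
public class SeedLine {
    String url;
    float score = -1f;             // -1f means "no nutch.score given"
    int fetchInterval = 2592000;   // default db.fetch.interval.default (30 days)
    Map<String, String> metadata = new TreeMap<>();

    static SeedLine parse(String line) {
        SeedLine s = new SeedLine();
        String[] splits = line.trim().split("\t");
        s.url = splits[0];
        for (int i = 1; i < splits.length; i++) {
            int eq = splits[i].indexOf('=');
            if (eq == -1) continue;  // skip anything without '='
            String name = splits[i].substring(0, eq);
            String value = splits[i].substring(eq + 1);
            if (name.equals("nutch.score")) {
                try { s.score = Float.parseFloat(value); } catch (NumberFormatException ignored) { }
            } else if (name.equals("nutch.fetchInterval")) {
                try { s.fetchInterval = Integer.parseInt(value); } catch (NumberFormatException ignored) { }
            } else {
                s.metadata.put(name, value);
            }
        }
        return s;
    }

    public static void main(String[] args) {
        SeedLine s = parse("http://www.nutch.org/\tnutch.score=10\tuserType=open_source");
        System.out.println(s.url + " score=" + s.score + " meta=" + s.metadata);
    }
}
```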
    try {
      url = urlNormalizers.normalize(url, URLNormalizers.SCOPE_INJECT);
      url = filters.filter(url); // filter the url
    } catch (Exception e) {
      LOG.warn("Skipping " + url + ":" + e);
      url = null;
    }
    if (url == null) {
      context.getCounter("injector", "urls_filtered").increment(1);
      return;
    } else { // if it passes
      String reversedUrl = TableUtil.reverseUrl(url); // collect it
      WebPage row = WebPage.newBuilder().build();
      row.setFetchTime(curTime);
      row.setFetchInterval(customInterval);
      // now add the metadata
      Iterator<String> keysIter = metadata.keySet().iterator();
      while (keysIter.hasNext()) {
        String keymd = keysIter.next();
        String valuemd = metadata.get(keymd);
        row.getMetadata().put(new Utf8(keymd),
            ByteBuffer.wrap(valuemd.getBytes()));
      }
      if (customScore != -1)
        row.setScore(customScore);
      else
        row.setScore(scoreInjected);
      try {
        scfilters.injectedScore(url, row);
      } catch (ScoringFilterException e) {
        if (LOG.isWarnEnabled()) {
          LOG.warn("Cannot filter injected score for url " + url
              + ", using default (" + e.getMessage() + ")");
        }
      }
      context.getCounter("injector", "urls_injected").increment(1);
      row.getMarkers().put(DbUpdaterJob.DISTANCE, new Utf8(String.valueOf(0)));
      Mark.INJECT_MARK.putMark(row, YES_STRING);
      context.write(reversedUrl, row);
    }
The URL is then normalized and filtered. If either step rejects it (url == null), the urls_filtered counter is incremented and the record is dropped; otherwise processing continues as follows.
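The convention here is worth spelling out: normalizers rewrite the URL into a canonical form, while filters return null to signal rejection. A simplified sketch of that pipeline (UrlPipeline and both methods are stand-ins for illustration, not the real plugin chains):

```java
import java.util.Locale;

public class UrlPipeline {
    // Illustrative normalizer: strip any "#fragment" and lower-case the URL.
    static String normalize(String url) {
        int hash = url.indexOf('#');
        if (hash != -1) url = url.substring(0, hash);
        return url.toLowerCase(Locale.ROOT);
    }

    // Illustrative filter: reject non-http(s) URLs by returning null,
    // mimicking how URLFilters signals rejection.
    static String filter(String url) {
        return (url.startsWith("http://") || url.startsWith("https://")) ? url : null;
    }

    public static void main(String[] args) {
        for (String u : new String[] { "HTTP://Example.com/a#top", "ftp://example.com/f" }) {
            System.out.println(u + " -> " + filter(normalize(u)));
        }
    }
}
```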
First the URL is reversed:
    String org.apache.nutch.util.TableUtil.reverseUrl(String urlString)
        throws MalformedURLException

    Reverses a url's domain. This form is better for storing in hbase, because
    scans within the same domain are faster. E.g.
    "http://bar.foo.com:8983/to/index.html?a=b" becomes
    "com.foo.bar:8983:http/to/index.html?a=b".
reverseUrl is used here; its API documentation explains it clearly.
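The core idea can be re-implemented in a few lines (ReverseUrlDemo is a simplified sketch; the real TableUtil.reverseUrl handles more edge cases such as hosts given as IP addresses):

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ReverseUrlDemo {
    // Reverse the dot-separated host parts so URLs from the same domain
    // sort (and therefore scan) together: bar.foo.com -> com.foo.bar.
    static String reverseUrl(String urlString) {
        try {
            URL url = new URL(urlString);
            String[] parts = url.getHost().split("\\.");
            StringBuilder sb = new StringBuilder();
            for (int i = parts.length - 1; i >= 0; i--) {
                sb.append(parts[i]);
                if (i > 0) sb.append('.');
            }
            if (url.getPort() != -1) sb.append(':').append(url.getPort());
            sb.append(':').append(url.getProtocol());
            sb.append(url.getFile()); // path plus query string
            return sb.toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // -> com.foo.bar:8983:http/to/index.html?a=b
        System.out.println(reverseUrl("http://bar.foo.com:8983/to/index.html?a=b"));
    }
}
```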
    Iterator<String> keysIter = metadata.keySet().iterator();
    while (keysIter.hasNext()) {
      String keymd = keysIter.next();
      String valuemd = metadata.get(keymd);
      row.getMetadata().put(new Utf8(keymd),
          ByteBuffer.wrap(valuemd.getBytes()));
    }
Here the metadata TreeMap is traversed, and each key/value pair is written into the row's metadata (keys as Utf8, values as ByteBuffer-wrapped bytes).
Next the injected score is set (the custom score if one was given, otherwise the default) and passed through the scoring filters; finally the reversed URL and the WebPage row are written to the context.
Next comes the run() method, which is responsible for creating the job:
    public Map<String, Object> run(Map<String, Object> args) throws Exception {
      getConf().setLong("injector.current.time", System.currentTimeMillis());
      Path input;
      Object path = args.get(Nutch.ARG_SEEDDIR);
      if (path instanceof Path) {
        input = (Path) path;
      } else {
        input = new Path(path.toString());
      }
      numJobs = 1;
      currentJobNum = 0;
      currentJob = NutchJob.getInstance(getConf(), "inject " + input);
      FileInputFormat.addInputPath(currentJob, input);
      currentJob.setMapperClass(UrlMapper.class);
      currentJob.setMapOutputKeyClass(String.class);
      currentJob.setMapOutputValueClass(WebPage.class);
      currentJob.setOutputFormatClass(GoraOutputFormat.class);
      DataStore<String, WebPage> store = StorageUtils.createWebStore(
          currentJob.getConfiguration(), String.class, WebPage.class);
      GoraOutputFormat.setOutput(currentJob, store, true);
      // NUTCH-1471 Make explicit which datastore class we use
      Class<? extends DataStore<Object, Persistent>> dataStoreClass =
          StorageUtils.getDataStoreClass(currentJob.getConfiguration());
      LOG.info("InjectorJob: Using " + dataStoreClass
          + " as the Gora storage class.");
      currentJob.setReducerClass(Reducer.class);
      currentJob.setNumReduceTasks(0);
      currentJob.waitForCompletion(true);
      ToolUtil.recordJobStatus(null, currentJob, results);
      // NUTCH-1370 Make explicit #URLs injected @runtime
      long urlsInjected = currentJob.getCounters()
          .findCounter("injector", "urls_injected").getValue();
      long urlsFiltered = currentJob.getCounters()
          .findCounter("injector", "urls_filtered").getValue();
      LOG.info("InjectorJob: total number of urls rejected by filters: "
          + urlsFiltered);
      LOG.info("InjectorJob: total number of urls injected after normalization and filtering: "
          + urlsInjected);
      return results;
    }
It first resolves the input path (the seed directory), then creates the map job. currentJob is an instantiated NutchJob, whose map/reduce settings are configured as follows:
    currentJob.setMapperClass(UrlMapper.class);
    currentJob.setMapOutputKeyClass(String.class);
    currentJob.setMapOutputValueClass(WebPage.class);
    currentJob.setOutputFormatClass(GoraOutputFormat.class);
GoraOutputFormat comes from Apache Gora and defines an output format that writes records to a Gora DataStore.
    DataStore<String, WebPage> store = StorageUtils.createWebStore(
        currentJob.getConfiguration(), String.class, WebPage.class);
    GoraOutputFormat.setOutput(currentJob, store, true);
This creates a Gora DataStore instance keyed by String with WebPage values, and binds it as the output destination of the current job.
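Which concrete DataStore backs the job is configured outside the code, typically in gora.properties. An illustrative snippet selecting HBase as the backing store (the exact store class depends on your deployment):

```properties
# Select HBase as the default Gora datastore (illustrative example)
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
```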
    currentJob.setReducerClass(Reducer.class);
    currentJob.setNumReduceTasks(0);
    currentJob.waitForCompletion(true);
    ToolUtil.recordJobStatus(null, currentJob, results);
This disables the reduce phase (the identity Reducer with zero reduce tasks), waits for the job to finish, and records the job status.
A further run() overload then launches the injector job proper; it is not covered here.