Implementation Principles of ChainMapper/ChainReducer

Source: Internet. Editor: 程序博客网. Time: 2024/04/30 04:19

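ChainMapper/ChainReducer was introduced mainly to support linear chains of Mappers. That is, the Map or Reduce phase may contain several Mappers that behave like a Linux pipeline: the output of one Mapper is fed directly into the next Mapper as its input, so the job takes the form [MAP+ / REDUCE MAP*]. Figure 1 shows a typical use case. In the Map phase, the data is processed by Mapper1 and then by Mapper2. In the Reduce phase, after shuffle and sort, the data is handed to the corresponding Reducer. The Reducer's output is not written directly to HDFS; it is passed to another Mapper, whose results are written to the final HDFS output directory.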

Figure 1. An example application of ChainMapper/ChainReducer
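The flow described for Figure 1 can be sketched in plain Java, without any Hadoop classes. This is only an illustration under simplifying assumptions: the class `ChainPipelineSketch`, the methods `run` and `demo`, and the word-count logic are all hypothetical, every Mapper here emits exactly one pair per input pair (a real Mapper may emit none or many), and a `TreeMap` stands in for shuffle and sort.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.UnaryOperator;

// Plain-Java illustration only; no Hadoop classes are involved.
class ChainPipelineSketch {

    // Runs a [MAP+ / REDUCE MAP*] chain over in-memory key/value pairs.
    static List<Map.Entry<String, String>> run(
            List<Map.Entry<String, String>> input,
            List<UnaryOperator<Map.Entry<String, String>>> mapPhase,
            BiFunction<String, List<String>, Map.Entry<String, String>> reducer,
            List<UnaryOperator<Map.Entry<String, String>>> afterReduce) {

        // Map phase: each record flows through every Mapper in order, like a pipe.
        // The TreeMap groups values by sorted key, standing in for shuffle and sort.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> record : input) {
            Map.Entry<String, String> current = record;
            for (UnaryOperator<Map.Entry<String, String>> mapper : mapPhase) {
                current = mapper.apply(current);
            }
            grouped.computeIfAbsent(current.getKey(), k -> new ArrayList<>())
                   .add(current.getValue());
        }

        // Reduce phase: the single Reducer runs once per key, and its output
        // is piped through the trailing Mappers before being collected.
        List<Map.Entry<String, String>> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : grouped.entrySet()) {
            Map.Entry<String, String> current =
                    reducer.apply(group.getKey(), group.getValue());
            for (UnaryOperator<Map.Entry<String, String>> mapper : afterReduce) {
                current = mapper.apply(current);
            }
            output.add(current);
        }
        return output;
    }

    // Hypothetical job: Mapper1 lower-cases the value, Mapper2 re-keys by word,
    // the Reducer counts occurrences, Mapper3 upper-cases the key.
    static List<Map.Entry<String, String>> demo() {
        List<Map.Entry<String, String>> input = List.of(
                Map.entry("0", "Apple"), Map.entry("1", "banana"), Map.entry("2", "apple"));
        UnaryOperator<Map.Entry<String, String>> mapper1 =
                e -> Map.entry(e.getKey(), e.getValue().toLowerCase());
        UnaryOperator<Map.Entry<String, String>> mapper2 =
                e -> Map.entry(e.getValue(), "1");
        BiFunction<String, List<String>, Map.Entry<String, String>> reducer =
                (key, values) -> Map.entry(key, String.valueOf(values.size()));
        UnaryOperator<Map.Entry<String, String>> mapper3 =
                e -> Map.entry(e.getKey().toUpperCase(), e.getValue());
        return run(input, List.of(mapper1, mapper2), reducer, List.of(mapper3));
    }

    public static void main(String[] args) {
        demo().forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```

In this sketch the intermediate pairs never leave memory between Mappers, which mirrors the point of the chain: only the output of the last stage is written out.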

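Note that in any single MapReduce job, the Map and Reduce phases may contain any number of Mappers, but there can be only one Reducer. The computation shown in Figure 2 therefore cannot be completed with ChainMapper/ChainReducer in a single job and has to be split into two MapReduce jobs.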

Figure 2. A scenario where ChainMapper/ChainReducer does not apply
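The one-Reducer restriction can be expressed as a shape check on the sequence of stages. The sketch below is a hypothetical helper, not part of Hadoop: it encodes a job as a string of `M` (Mapper) and `R` (Reducer) characters and tests it against the [MAP+ / REDUCE MAP*] form. The stage strings used here are made-up examples.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; Hadoop offers no such class.
class ChainShapeCheck {

    // 'M' = Mapper, 'R' = Reducer.
    // One or more Mappers, exactly one Reducer, then zero or more Mappers.
    private static final Pattern SINGLE_JOB = Pattern.compile("M+RM*");

    // Returns true if the stage sequence fits [MAP+ / REDUCE MAP*].
    static boolean fitsSingleChainJob(String stages) {
        return SINGLE_JOB.matcher(stages).matches();
    }

    public static void main(String[] args) {
        System.out.println(fitsSingleChainJob("MMRM"));  // two Mappers, Reducer, one Mapper
        System.out.println(fitsSingleChainJob("MRMRM")); // contains two Reducers
    }
}
```

A sequence with two Reducers, such as `MRMRM`, fails the check and would have to be split across more than one job.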

An example:

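// Old (org.apache.hadoop.mapred) API; ChainMapper and ChainReducer are in org.apache.hadoop.mapred.lib.
// conf is the job's JobConf; Mapper1, Mapper2, Mapper3 and Reducer1 are user-defined classes.
conf.setJobName("chain");
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
// Per-stage configurations, created without loading the default resources.
JobConf mapper1Conf = new JobConf(false);
JobConf mapper2Conf = new JobConf(false);
JobConf reduce1Conf = new JobConf(false);
JobConf mapper3Conf = new JobConf(false);
…
// Map phase: Mapper1 turns (LongWritable, Text) into (Text, Text), passed by value.
ChainMapper.addMapper(conf, Mapper1.class, LongWritable.class, Text.class,
    Text.class, Text.class, true, mapper1Conf);
// Mapper2 turns (Text, Text) into (LongWritable, Text), passed by reference.
ChainMapper.addMapper(conf, Mapper2.class, Text.class, Text.class,
    LongWritable.class, Text.class, false, mapper2Conf);
// Reduce phase: the single Reducer turns (LongWritable, Text) into (Text, Text).
ChainReducer.setReducer(conf, Reducer1.class, LongWritable.class, Text.class,
    Text.class, Text.class, true, reduce1Conf);
// Mapper3 runs after the Reducer; its output becomes the job output.
ChainReducer.addMapper(conf, Mapper3.class, Text.class, Text.class,
    LongWritable.class, Text.class, false, mapper3Conf);
JobClient.runJob(conf);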
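Users call addMapper to add multiple Mappers to the Map or Reduce phase. The function takes 8 parameters: the job configuration, the Mapper class, the Mapper's input key type, input value type, output key type, output value type, whether key/value pairs are passed by value, and the Mapper's own configuration. The seventh parameter deserves some explanation. Hadoop MapReduce has a convention that OutputCollector.collect(key, value) must not change the key or the value while it runs. The reason is that Mapper.map() may use the key and value again after calling OutputCollector.collect(key, value), and a change made in between could cause subtle errors. In a chain, collect() hands the pair on to the next Mapper, which might modify it, so ChainMapper lets the user choose how key/value pairs are passed. Passing by reference avoids copying objects and is more efficient, but it is safe only when the user is sure the key and value will not be modified; otherwise, pass by value.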
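The difference between the two passing modes can be shown with a small plain-Java sketch. It does not use Hadoop; `ByValueSketch`, `collect`, and `emitThenReuse` are hypothetical names, a `StringBuilder` stands in for a mutable key object, and the copy is made with a constructor purely for illustration (how Hadoop itself produces the copy is not shown here).

```java
import java.util.function.Consumer;

// Plain-Java illustration only; the names here are hypothetical.
class ByValueSketch {

    // Hands a key to the next stage, copying it first when byValue is true.
    static void collect(StringBuilder key, boolean byValue,
                        Consumer<StringBuilder> nextMapper) {
        StringBuilder handedOver = byValue ? new StringBuilder(key) : key;
        nextMapper.accept(handedOver);
    }

    // Upstream Mapper: emits its key, then reads the key again.
    // Returns what the upstream Mapper sees after the downstream one has run.
    static String emitThenReuse(boolean byValue) {
        StringBuilder key = new StringBuilder("k1");
        // The downstream Mapper mutates whatever object it receives.
        collect(key, byValue, k -> k.append("-modified"));
        return key.toString();
    }

    public static void main(String[] args) {
        System.out.println(emitThenReuse(true));  // prints "k1"
        System.out.println(emitThenReuse(false)); // prints "k1-modified"
    }
}
```

With by-value passing the upstream Mapper still sees its original key after the call; with by-reference passing it sees the downstream modification, which is exactly the hazard the seventh parameter is meant to control.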

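Excerpted from the book 《Hadoop技术内幕-深入解析Mapreduce框架设计与实现原理》 (Hadoop Internals: An In-Depth Analysis of the Design and Implementation of the MapReduce Framework).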