Hadoop: MapReduce


Hadoop MapReduce

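MapReduce originated in a paper published by Google and was later adopted by the Nutch project. The longer history of Hadoop is not covered here; this article focuses only on the core concepts.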

What?

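Hadoop MapReduce is a framework for processing very large data sets, on the order of terabytes. It makes it easy to write programs that process data in parallel across a distributed cluster, and it is scalable, fault-tolerant, and reliable.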

How?

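Hadoop MapReduce splits the input data into independent chunks and assigns them to map tasks, which run completely in parallel. The framework sorts the map outputs and feeds them to the reduce tasks. Job input and output are normally stored in a file system (typically HDFS), while map output is temporary intermediate data kept on local disk. While the job runs, the framework takes care of scheduling tasks, monitoring them, and re-executing failed ones.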

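Compute nodes and storage nodes are typically the same machines: each node runs both the MapReduce framework and HDFS. This lets the scheduler place tasks on the nodes that already hold the data, which avoids moving large volumes of data across the network and maximizes the aggregate throughput of the cluster.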

MapReduce framework components

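MapReduce uses a master/worker architecture. A single ResourceManager acts as the master and handles scheduling and resource allocation.
One NodeManager runs on each worker node, and each submitted application (job) gets its own MRAppMaster, which is responsible for seeing that job's tasks through to completion.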

The Hadoop job client then submits the job (jar/executable etc.) and configuration to the ResourceManager which then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

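In other words, the Hadoop client submits the job and its configuration to the ResourceManager, which distributes the work to the worker nodes, schedules and monitors all of the tasks, and reports status back to the client.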

For more details, see:
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YARN.html

MapReduce inputs and outputs

The data of a submitted job passes through the framework as <key, value> pairs of the following types:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)

Next, a word-count example is used to walk through how a MapReduce job executes.

Detailed MapReduce flow for WordCount

Input01.txt:
hello world

Input02.txt:
hello hadoop
hello mapreduce
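
Before turning to the Hadoop code, the data flow for these two files can be traced by hand. The following is a minimal plain-Java sketch with no Hadoop dependency; the class `WordCountTrace` and its method names are hypothetical and exist only for illustration. It mimics the three steps (map, shuffle/sort, reduce) in memory rather than showing how Hadoop implements them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// In-memory imitation of the WordCount data flow (illustrative only).
final class WordCountTrace {

    // Map step: emit one <word, 1> pair per token in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(Map.entry(itr.nextToken(), 1));
        }
        return out;
    }

    // Shuffle/sort step: group values by key; TreeMap keeps the keys sorted.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce step: sum the value list of one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run all three steps over the given input lines.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(map(line));
        }
        TreeMap<String, Integer> result = new TreeMap<>();
        shuffle(mapped).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        // The three lines of Input01.txt and Input02.txt.
        System.out.println(run(List.of("hello world", "hello hadoop", "hello mapreduce")));
        // prints {hadoop=1, hello=3, mapreduce=1, world=1}
    }
}
```

In a real job the map calls run in separate tasks on different nodes and the grouping happens across the network; the sketch only shows what each step contributes to the result.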


WordCount.java:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class WordCount {
    static Logger logger = LoggerFactory.getLogger("WordCount");

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1); // stands for one occurrence of a word
        private Text word = new Text(); // holds each word after splitting

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input: <key, value> is <byte offset of the line, line content>
            logger.info("map  key:{} value:{}", key, value);
            // Split the line that was read (value) into words
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) { // iterate over every word in the line
                this.word.set(itr.nextToken());
                context.write(this.word, one); // K: word; V: 1
            }
            // Output: <word1,1>,<word2,1>,<word1,1>....<wordn,1>
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            // Iterate over the value-list of <key, value-list>
            for (IntWritable val : values) {
                sum += val.get();
            }
            this.result.set(sum);
            context.write(key, this.result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hard-coded paths: every entry but the last is an input, the last is the output directory
        String[] otherArgs = new String[] { "F:\\hadoop\\cascading.samples\\wordcount\\data\\input_file1",
                "F:\\hadoop\\cascading.samples\\wordcount\\data\\input_file2",
                "F:\\hadoop\\cascading.samples\\wordcount\\output" };
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // the reducer doubles as the combiner
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        for (int i = 0; i < otherArgs.length - 1; i++) {
            FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
        }
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[otherArgs.length - 1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Mapper

Multiple Mapper tasks run in parallel and transform the raw input records into intermediate records (temporary data waiting to be reduced).

The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.

The Mapper implementation is passed to the job via Job.setMapperClass(Class). The framework then calls map(WritableComparable, Writable, Context) once for each key/value pair in the task's InputSplit, one pair after another.

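Applications can also override cleanup(Context) to perform any cleanup or resource release that is needed.
Finally, the map function emits its output by calling context.write(WritableComparable, Writable). The output types do not have to match the input types.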
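
The order of these calls can be illustrated with a small plain-Java sketch. `MapperLifecycle` is a hypothetical class, not the Hadoop `Mapper`; it only imitates the shape of a map task: one setup call (Hadoop's `Mapper` also offers a `setup(Context)` hook, which the text above does not mention), one map call per record, and cleanup at the end even if a map call throws. The byte offsets assume one-byte characters and a single newline per line.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for the call sequence of a map task.
final class MapperLifecycle {
    // Records the order in which the hooks were invoked.
    final List<String> calls = new ArrayList<>();

    void setup() {
        calls.add("setup");
    }

    // key = byte offset of the line, value = line content.
    void map(long offset, String line) {
        calls.add("map@" + offset);
    }

    void cleanup() {
        calls.add("cleanup");
    }

    // setup once, map once per record, cleanup once (also on failure).
    void run(List<String> lines) {
        setup();
        try {
            long offset = 0;
            for (String line : lines) {
                map(offset, line);
                offset += line.length() + 1; // +1 for the newline
            }
        } finally {
            cleanup();
        }
    }
}
```

Run over the two lines of Input02.txt, the recorded sequence is setup, map@0, map@13, cleanup.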

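Next, the Mapper's intermediate output is sorted and grouped by key, then passed to the Reducer(s), which produce the final output. Users can control this grouping by specifying a comparator via Job.setGroupingComparatorClass(Class).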

Partitioner

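Partitioning is done on the key; the typical scheme applies a hash function to it.
Each mapper splits its output into partitions, one per reduce task, so the number of partitions always equals the number of reduce tasks for the job.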

HashPartitioner is the default Partitioner.
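
As a rough illustration, hash partitioning comes down to one line of arithmetic. The sketch below is plain Java with a hypothetical class name; the formula follows the one used by Hadoop's `HashPartitioner`, but the key here is any Java object rather than a Writable.

```java
// Illustrative hash partitioning: maps a key to a reduce task index.
final class HashPartitionSketch {
    static int getPartition(Object key, int numReduceTasks) {
        // Clear the sign bit so a negative hashCode cannot produce a negative index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the result depends only on the key, every occurrence of the same key lands in the same partition, whichever mapper emitted it.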

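Users can control the partitioning by implementing a custom Partitioner, for example to route particular key/value pairs to a specific reduce task, or to send contiguous ranges of keys to the same reduce task so that the combined output is globally sorted.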
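
One way to get a globally sorted result is range partitioning. The sketch below is a plain-Java illustration under stated assumptions: `RangePartitionSketch` and its split points `"h"` and `"p"` are made up for this example, and a real job would implement Hadoop's `Partitioner` and choose split points from the actual key distribution.

```java
// Illustrative range partitioner: each partition owns a contiguous key range.
final class RangePartitionSketch {
    // Hypothetical split points: keys < "h" -> 0, "h" <= keys < "p" -> 1, the rest -> 2.
    static final String[] SPLIT_POINTS = {"h", "p"};

    static int getPartition(String key) {
        int partition = 0;
        // Advance past every split point that the key is greater than or equal to.
        while (partition < SPLIT_POINTS.length && key.compareTo(SPLIT_POINTS[partition]) >= 0) {
            partition++;
        }
        return partition;
    }
}
```

Since partition 0 only holds keys smaller than those in partition 1, and so on, concatenating the sorted outputs of the reduce tasks in partition order yields one sorted sequence.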

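Intermediate map output is stored in a simple (key-len, key, value-len, value) format. Through the Configuration, applications can control whether and how this intermediate data is compressed and which CompressionCodec is used.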

How Many Maps?

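The number of map tasks in a job is usually driven by the total number of blocks in the input files. Around 10 to 100 maps per node is a reasonable level of parallelism, and up to 300 per node has been used for very CPU-light map tasks. For example, with 10 TB of input and a block size of 128 MB, the job ends up with 10 TB / 128 MB = 81,920 map tasks. Setting MRJobConfig.NUM_MAPS on the Configuration can push the number higher, but the value is only a hint to the framework.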
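
The arithmetic can be checked with a few lines of plain Java. This is only an estimate under the assumption of one map task per block; `MapCountEstimate` is a hypothetical helper, and the real count also depends on the InputFormat, the number of files, and the split-size settings.

```java
// Illustrative estimate: one map task per input block.
final class MapCountEstimate {
    // Ceiling division: a partially filled last block still needs its own map task.
    static long estimateMaps(long totalBytes, long blockBytes) {
        return (totalBytes + blockBytes - 1) / blockBytes;
    }

    public static void main(String[] args) {
        long tenTb = 10L * 1024 * 1024 * 1024 * 1024; // 10 TB
        long block = 128L * 1024 * 1024;              // 128 MB
        System.out.println(estimateMaps(tenTb, block)); // prints 81920
    }
}
```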

Combiner

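The WordCount example also sets a Combiner, so each map task's output, once sorted by key, first passes through a local combiner (here the combiner is the same class as the Reducer).
A combiner aggregates data locally, which reduces the number of records that have to be sent on to the reducers.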
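
The effect on record count can be seen on Input02.txt alone. The following plain-Java sketch uses a hypothetical class `CombinerSketch`; it counts the raw pairs a mapper would emit and then applies a local sum, which is what a sum-style combiner amounts to for word counting.

```java
import java.util.List;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustrative local aggregation of one mapper's output.
final class CombinerSketch {
    // Number of raw <word, 1> pairs the mapper emits for these lines.
    static int rawPairCount(List<String> lines) {
        int n = 0;
        for (String line : lines) {
            n += new StringTokenizer(line).countTokens();
        }
        return n;
    }

    // Sum the 1s per word locally, as a sum combiner would.
    static TreeMap<String, Integer> combine(List<String> lines) {
        TreeMap<String, Integer> combined = new TreeMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                combined.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return combined;
    }
}
```

For the lines "hello hadoop" and "hello mapreduce", the mapper emits 4 pairs and the combiner leaves 3 records: hadoop=1, hello=2, mapreduce=1. Reusing the reducer as combiner works here because addition gives the same total no matter how the partial sums are grouped.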

Reducer (shuffle, sort, reduce)

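A Reducer reduces the set of intermediate values that share a key to a smaller set of values. The number of reduce tasks for a job is set with Job.setNumReduceTasks(int).
A Reducer is normally customized by overriding the reduce method and registering the class with Job.setReducerClass(Class).
As with map, the framework calls reduce(WritableComparable, Iterable, Context) once for each <key, (list of values)> pair in the grouped input.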

Here too, cleanup(Context) can be overridden to perform any cleanup or resource release that is needed.

A Reducer has three primary phases: shuffle, sort, and reduce.

Shuffle

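As noted above, a Reducer's input is the sorted output of the mappers. During the shuffle, the framework fetches over HTTP the partition of every mapper's output that belongs to this Reducer.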

Sort

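Because the shuffle collects output from many mappers (after partitioning, one reducer usually draws on several mappers), the combined data is not in order, so this phase sorts everything fed to the reducer by key. All of the above refers to sorting by key; values can be sorted as well (a secondary sort), but that requires extra configuration.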
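
A common way to get sorted values is to sort on a composite key and group on only part of it. The sketch below imitates that idea in plain Java; `SecondarySortSketch` and `CompositeKey` are hypothetical names. In Hadoop the two roles would be filled by comparators registered with Job.setSortComparatorClass(Class) and Job.setGroupingComparatorClass(Class), which is an assumption about how one would wire it up rather than something shown in this article.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative secondary sort: order by (key, value), group by key only.
final class SecondarySortSketch {
    // Composite key: the natural key plus the value that should also be ordered.
    static final class CompositeKey {
        final String key;
        final int value;

        CompositeKey(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    // Sort comparator: natural key first, then value.
    static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey c) -> c.key).thenComparingInt(c -> c.value);

    // Sort on the full composite key, then group on the natural key alone,
    // so the values of each group arrive already ordered.
    static Map<String, List<Integer>> sortAndGroup(List<CompositeKey> records) {
        List<CompositeKey> sorted = new ArrayList<>(records);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (CompositeKey c : sorted) {
            groups.computeIfAbsent(c.key, k -> new ArrayList<>()).add(c.value);
        }
        return groups;
    }
}
```

Given the records (b,3), (a,2), (b,1), (a,1), the groups come out as a=[1, 2] and b=[1, 3].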

Shuffle and sort run at the same time: map outputs are merged while they are being fetched.
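
Since each mapper's output already arrives sorted, the reducer side mainly has to merge sorted runs. The following plain-Java sketch shows a k-way merge with a priority queue; `MergeSketch` is a hypothetical class working on in-memory lists of keys, whereas Hadoop merges serialized records, partly on disk.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Illustrative k-way merge of already sorted runs (one run per mapper).
final class MergeSketch {
    static List<String> merge(List<List<String>> sortedRuns) {
        // Each queue entry is {runIndex, positionInRun}, ordered by the key it points at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                Comparator.comparing((int[] h) -> sortedRuns.get(h[0]).get(h[1])));
        for (int i = 0; i < sortedRuns.size(); i++) {
            if (!sortedRuns.get(i).isEmpty()) {
                heads.add(new int[] {i, 0});
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] h = heads.poll();               // smallest remaining key
            List<String> run = sortedRuns.get(h[0]);
            merged.add(run.get(h[1]));
            if (h[1] + 1 < run.size()) {
                heads.add(new int[] {h[0], h[1] + 1}); // advance within the same run
            }
        }
        return merged;
    }
}
```

Merging the runs [hello, world] and [hadoop, hello, mapreduce] gives [hadoop, hello, hello, mapreduce, world], so equal keys from different mappers end up next to each other, ready to be grouped.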

Reduce

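In this phase the framework calls reduce(WritableComparable, Iterable, Context) for each <key, (list of values)> pair.
The reduce output is typically written to the file system via Context.write(WritableComparable, Writable).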

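Note:
Unlike map output, which the framework writes to temporary local storage on its own, reducer output goes to the job's output location, so a concrete output path has to be specified.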

How Many Reduces?

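A suitable number of reduce tasks is:
0.95 (or 1.75) * number of nodes * maximum number of containers per node. (A container is the unit of system resources allocated to run a task; a node usually hosts several containers.)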

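0.95: there are slightly fewer reducers than containers, so every reducer can start right away, in a single wave, and begin fetching map output as the maps finish.
1.75: there are more reducers than containers, so they run in roughly two waves, with the faster nodes picking up more of the second wave. The Hadoop documentation describes it this way: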

With 0.95 all of the reduces can launch immediately and start
transferring map outputs as the maps finish. With 1.75 the faster
nodes will finish their first round of reduces and launch a second
wave of reduces doing a much better job of load balancing
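
As a worked example of the rule of thumb, the sketch below evaluates both factors for an assumed cluster of 10 nodes with 8 containers each. The cluster size and the class `ReduceCountEstimate` are hypothetical, and rounding to the nearest integer is a choice made here, not something the rule prescribes.

```java
// Illustrative evaluation of: factor * nodes * max containers per node.
final class ReduceCountEstimate {
    static int estimateReduces(double factor, int nodes, int maxContainersPerNode) {
        return (int) Math.round(factor * nodes * maxContainersPerNode);
    }

    public static void main(String[] args) {
        System.out.println(estimateReduces(0.95, 10, 8)); // prints 76
        System.out.println(estimateReduces(1.75, 10, 8)); // prints 140
    }
}
```

With 80 containers in total, 76 reducers fit in one wave with a few containers to spare, while 140 reducers need a second wave.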

Increasing the number of reduces increases the overall framework overhead, but it improves load balancing and lowers the cost of failures.

The scaling factors above are slightly less than whole numbers to
reserve a few reduce slots in the framework for speculative-tasks and
failed tasks.

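Note: if no reduction is needed, it is legal to set the number of reduce tasks to 0.
In that case the map output is written directly to the file system (to the output path set with FileOutputFormat.setOutputPath(Job, Path)).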

In that case the output is not sorted before it is written to the file system.

0 0