WordCount Workflow: Analysis and Summary


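Purpose of these notes:

1. Summarize and analyze the basic MapReduce workflow

2. Summarize and analyze the WordCount workflow

3. Summarize and analyze the WordCount source code

Date of notes:

October 10, 2012

By Yikun

Mail:yikunkero@gmail.com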

1 About MapReduce

1.1 Abstract

MapReduce is a programming model and an associated implementation for processing and generating large datasets that is amenable to a broad variety of real-world tasks. Users specify the computation in terms of a map and a reduce function, and the underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, handles machine failures, and schedules inter-machine communication to make efficient use of the network and disks. Programmers find the system easy to use: more than ten thousand distinct MapReduce programs have been implemented internally at Google over the past four years, and an average of one hundred thousand MapReduce jobs are executed on Google's clusters every day, processing a total of more than twenty petabytes of data per day.

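In other words: MapReduce is both a programming model and an accompanying implementation for processing and generating large datasets, and it fits a wide range of practical tasks. Once the user supplies the map and reduce functions, the runtime parallelizes the computation across a large cluster, handles machine failures, and schedules communication between machines so that network and disks are used efficiently. The abstract also reports that programmers find the system easy to use: over the preceding four years more than ten thousand distinct MapReduce programs were implemented inside Google, and on average about one hundred thousand jobs ran on Google's clusters every day, processing more than 20 PB of data per day.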

-- From "MapReduce: Simplified Data Processing on Large Clusters" (Jeffrey Dean and Sanjay Ghemawat, Google, Inc.)

1.2 Workflow overview

According to the original Google MapReduce paper, a MapReduce run goes through the following stages:

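1. Data splitting and preparation

The user program first calls the MapReduce library to split the input files into M pieces, typically 16 MB to 64 MB each (the size can be controlled through an optional parameter). The user program then starts many copies of itself on a cluster of machines.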
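
The splitting step can be sketched in plain Java. This is only an illustration under stated assumptions: `SplitPlanner` and `plan` are hypothetical names, not part of any MapReduce library, and the sketch cuts purely by byte count, whereas a real system also has to respect record boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not a MapReduce or Hadoop API.
class SplitPlanner {
    /** Returns {offset, length} pairs covering [0, fileSize) in pieces of at most splitSize bytes. */
    static List<long[]> plan(long fileSize, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<long[]> splits = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += splitSize) {
            long length = Math.min(splitSize, fileSize - offset);
            splits.add(new long[] {offset, length});
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 200 MB input with 64 MB splits gives M = 4 (three full pieces and one 8 MB tail).
        List<long[]> splits = plan(200 * mb, 64 * mb);
        System.out.println("M = " + splits.size());
        for (long[] s : splits) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

The number of pieces returned is the M used in the following stages: one map task per piece.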

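2. Map/Reduce task assignment

One of these program copies is special: the master. All other copies are workers, which receive their tasks from the master. There are M map tasks and R reduce tasks to hand out; the master assigns one map task or one reduce task to each idle worker.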
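
As a rough sketch of this bookkeeping, the master can be pictured as matching a queue of pending tasks against a queue of idle workers. Everything here (`MasterSketch`, `assign`, the task and worker names) is made up for illustration; the real master also tracks task state and worker failures, which this sketch ignores.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the master's bookkeeping; all names are invented for illustration.
class MasterSketch {
    /** Gives each idle worker one pending task until either queue runs out. */
    static List<String> assign(Deque<String> pendingTasks, Deque<String> idleWorkers) {
        List<String> assignments = new ArrayList<>();
        while (!pendingTasks.isEmpty() && !idleWorkers.isEmpty()) {
            assignments.add(idleWorkers.poll() + " <- " + pendingTasks.poll());
        }
        return assignments;
    }

    public static void main(String[] args) {
        // M = 3 map tasks and R = 2 reduce tasks, but only two idle workers right now.
        Deque<String> tasks =
            new ArrayDeque<>(List.of("map-0", "map-1", "map-2", "reduce-0", "reduce-1"));
        Deque<String> workers = new ArrayDeque<>(List.of("worker-A", "worker-B"));
        System.out.println(assign(tasks, workers)); // prints [worker-A <- map-0, worker-B <- map-1]
        System.out.println(tasks);                   // prints [map-2, reduce-0, reduce-1]
    }
}
```

The remaining tasks stay queued until a worker finishes and becomes idle again.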
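
3. Workers read their splits and turn them into intermediate data

A worker that has been assigned a map task reads the corresponding input split, parses key/value pairs out of it, and passes each pair to the user-defined Map function. The intermediate key/value pairs that the Map function produces are buffered in memory.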
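
A minimal sketch of what a map worker does with one split, using word counting as the user-defined Map function. The names `MapWorkerSketch`, `wordCountMap` and `runMapTask` are hypothetical, the split is simply a string held in memory, and the input key (for example a line offset) is omitted because the word-count map function does not use it.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.function.BiConsumer;

// Hypothetical sketch; not the Hadoop Mapper API.
class MapWorkerSketch {
    /** User-defined map function for word counting: emits (word, 1) for every token in the line. */
    static void wordCountMap(String line, BiConsumer<String, Integer> emit) {
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            emit.accept(itr.nextToken(), 1);
        }
    }

    /** Parses the split into lines, feeds each line to the map function, buffers the output in memory. */
    static List<Map.Entry<String, Integer>> runMapTask(String split) {
        List<Map.Entry<String, Integer>> buffer = new ArrayList<>();
        for (String line : split.split("\n")) {
            wordCountMap(line, (k, v) -> buffer.add(new SimpleEntry<>(k, v)));
        }
        return buffer;
    }

    public static void main(String[] args) {
        System.out.println(runMapTask("hello world\nhello hadoop"));
        // prints [hello=1, world=1, hello=1, hadoop=1]
    }
}
```

Note that the same word can appear several times in the buffer; nothing is summed at this point.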
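
4. Local write, report back to the master, prepare for Reduce

The buffered key/value pairs are divided into R regions by the partitioning function and periodically written to local disk. The locations of these pairs on local disk are reported back to the master, which is responsible for forwarding them to the reduce workers.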
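
The paper names hash(key) mod R as the default partitioning function. A small sketch of that idea follows; `PartitionSketch` is a hypothetical name, and the bit mask is just one way to keep the result non-negative when `hashCode()` is negative.

```java
// Hypothetical sketch of a hash partitioner: region = hash(key) mod R.
class PartitionSketch {
    /** Maps a key to one of numReduceTasks regions, always in the range [0, numReduceTasks). */
    static int partition(String key, int numReduceTasks) {
        // Clear the sign bit so a negative hashCode cannot produce a negative region number.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int r = 4; // R reduce tasks, an arbitrary value for this example
        for (String key : new String[] {"hello", "world", "hadoop", "hello"}) {
            System.out.println(key + " -> region " + partition(key, r));
        }
    }
}
```

Because the function depends only on the key, every occurrence of a given word ends up in the same region, no matter which map worker produced it.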
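
5. Reduce workers read and sort the data

When a reduce worker receives the storage locations from the master, it uses RPC (remote procedure calls) to read the buffered data from the disks of the machines where the map workers ran. Once a reduce worker has read all of its intermediate data, it sorts the data by key so that all values with the same key are grouped together. The sort is necessary because many different keys are mapped to the same reduce task. If the intermediate data is too large to be sorted in memory, an external sort is used.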
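
Sorting and grouping can be sketched with a `TreeMap`, which keeps its keys in sorted order. This is an in-memory illustration only (no RPC, no external sort); `ShuffleSortSketch` and `sortAndGroup` are hypothetical names, and `List.of` / `Map.entry` assume Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory sketch of the sort/group step on a reduce worker.
class ShuffleSortSketch {
    /** Groups the fetched pairs by key; the TreeMap keeps the keys sorted. */
    static TreeMap<String, List<Integer>> sortAndGroup(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Pairs as they might arrive from several map workers, in no particular order.
        List<Map.Entry<String, Integer>> fetched = List.of(
            Map.entry("world", 1), Map.entry("hello", 1),
            Map.entry("hadoop", 1), Map.entry("hello", 1));
        System.out.println(sortAndGroup(fetched));
        // prints {hadoop=[1], hello=[1, 1], world=[1]}
    }
}
```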
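
6. Reduce writes the output

The reduce worker iterates over the sorted intermediate data. For each unique intermediate key, it passes the key and the set of intermediate values associated with it to the user-defined Reduce function. The output of the Reduce function is appended to the output file of that reduce partition.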
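
A sketch of the reduce loop for word counting, starting from data that is already sorted and grouped by key. `ReduceWorkerSketch`, `sumReduce` and `runReduceTask` are hypothetical names; the output "file" is just a string here, and the tab-separated line format is an assumption of this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; not the Hadoop Reducer API.
class ReduceWorkerSketch {
    /** User-defined reduce function for word counting: sums all values of one key. */
    static int sumReduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Walks the sorted groups, calls reduce once per unique key, appends each result to the output. */
    static String runReduceTask(TreeMap<String, List<Integer>> grouped) {
        StringBuilder output = new StringBuilder();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int count = sumReduce(e.getKey(), e.getValue());
            output.append(e.getKey()).append('\t').append(count).append('\n');
        }
        return output.toString();
    }

    public static void main(String[] args) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        grouped.put("hello", List.of(1, 1));
        grouped.put("hadoop", List.of(1));
        grouped.put("world", List.of(1));
        System.out.print(runReduceTask(grouped));
        // prints three lines: "hadoop 1", "hello 2", "world 1" (tab-separated)
    }
}
```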

7. Completion

When all map and reduce tasks have finished, the master wakes up the user program. Only at this point does the MapReduce call in the user program return.

2 About the wordcount program

2.1 wordcount workflow analysis

Next, let us look at the workflow of wordcount, a program that uses MapReduce to count how often each word occurs.

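Understanding wordcount involves some knowledge of how HDFS works, so here is a brief summary of the HDFS workflow first.

The figure referred to here follows "Hadoop: The Definitive Guide". As I understand it, the client uploads large data with a Hadoop command (hadoop fs -put), and the source data is stored on the datanodes. The namenode holds only the mapping for these files, and the client uses the locations recorded in that mapping to read the file contents from the datanodes. This is how a file gets split up and how the pieces form a distributed system.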
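
The division of labor described above (blocks on the datanodes, only the mapping on the namenode) can be modeled with a toy in-memory sketch. This is not the HDFS API: `MiniHdfsSketch`, its methods, the round-robin placement and the tiny block size are all assumptions made for illustration, and replication is left out entirely.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory model; class and method names are hypothetical, not HDFS APIs.
class MiniHdfsSketch {
    private final int blockSize;
    // "datanodes": each one stores block id -> block contents
    private final List<Map<String, String>> datanodes = new ArrayList<>();
    // "namenode": file name -> which datanode holds block 0, 1, 2, ... (metadata only)
    private final Map<String, List<Integer>> namenode = new HashMap<>();

    MiniHdfsSketch(int numDatanodes, int blockSize) {
        this.blockSize = blockSize;
        for (int i = 0; i < numDatanodes; i++) {
            datanodes.add(new HashMap<>());
        }
    }

    /** Roughly what "put" does: cut the data into blocks, store them, record where they went. */
    void put(String file, String data) {
        List<Integer> locations = new ArrayList<>();
        for (int start = 0, i = 0; start < data.length(); start += blockSize, i++) {
            int node = i % datanodes.size(); // round-robin placement: an assumption of this sketch
            int end = Math.min(data.length(), start + blockSize);
            datanodes.get(node).put(file + "#" + i, data.substring(start, end));
            locations.add(node);
        }
        namenode.put(file, locations);
    }

    int blockCount(String file) {
        return namenode.get(file).size();
    }

    /** Reading: ask the "namenode" for the block locations, then fetch each block from its datanode. */
    String read(String file) {
        StringBuilder sb = new StringBuilder();
        List<Integer> locations = namenode.get(file);
        for (int i = 0; i < locations.size(); i++) {
            sb.append(datanodes.get(locations.get(i)).get(file + "#" + i));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        MiniHdfsSketch fs = new MiniHdfsSketch(3, 8);
        fs.put("/in.txt", "hello world hello hadoop");
        System.out.println(fs.blockCount("/in.txt")); // prints 3
        System.out.println(fs.read("/in.txt"));       // prints hello world hello hadoop
    }
}
```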

For WordCount specifically, I personally see two parts: the file side (splitting into blocks, reading and writing) and the program-processing side.

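2.1.1 File part (HDFS)

First, once hadoop fs -put has been executed, the data is split into blocks and distributed across the datanodes.

During processing (see the workflow that follows), the Map phase, including the intermediate files it produces, uses the local storage of the node it runs on. In other words, intermediate data is not uploaded to HDFS; only after the Reduce phase completes is the final result written to HDFS.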

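2.1.2 Program-processing part (MapReduce)

Once the split files are on the datanodes, the Map phase runs. Map mainly preprocesses the source data into <key,value> pairs and produces the intermediate files. After these are merged, the Reduce phase writes the output to HDFS, which completes the whole process.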
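
The whole flow (map, local merge, shuffle, reduce) can be simulated on a single machine in a few lines of plain Java. This sketch does not use Hadoop at all; `LocalWordCount` and its methods are hypothetical names, the splits are plain strings, and the local pre-summing stands in for the combiner step. `List.of` assumes Java 9 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-machine simulation of the word-count data flow; not Hadoop code.
class LocalWordCount {
    /** Map + combine for one split: tokenize, then pre-sum the counts per word locally. */
    static Map<String, Integer> mapAndCombine(String split) {
        Map<String, Integer> combined = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(split);
        while (itr.hasMoreTokens()) {
            combined.merge(itr.nextToken(), 1, Integer::sum); // each (word, 1) is folded in here
        }
        return combined;
    }

    /** Shuffle + reduce: collect the values per key across all map outputs, sorted by key, then sum. */
    static Map<String, Integer> shuffleAndReduce(List<Map<String, Integer>> mapOutputs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map<String, Integer> output : mapOutputs) {
            for (Map.Entry<String, Integer> e : output.entrySet()) {
                grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            result.put(e.getKey(), sum);
        }
        return result;
    }

    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map<String, Integer>> mapOutputs = new ArrayList<>();
        for (String split : splits) {
            mapOutputs.add(mapAndCombine(split));
        }
        return shuffleAndReduce(mapOutputs);
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("hello world hello", "hello hadoop")));
        // prints {hadoop=1, hello=3, world=1}
    }
}
```

In this run the first split contributes hello=2 after the local merge and the second contributes hello=1, so the reduce step only has to add two numbers for that word instead of three.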

2.2 wordcount source code walkthrough

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Description: WordCount explains by York
 * @author Hadoop Dev Group
 */
public class WordCount {

  /**
   * TokenizerMapper extends the generic class Mapper, the base class that
   * implements the map step. Its four type parameters are the input key,
   * input value, output key and output value types.
   * WritableComparable: classes that implement it can be compared with each
   * other; every class used as a key should implement this interface.
   * The Reporter of the old API, which reports the progress of the whole
   * application, is not used in this example; a Context is passed instead.
   */
  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable> {

    /**
     * IntWritable and Text are Hadoop classes that wrap Java data types.
     * Both implement WritableComparable and can be serialized, which makes
     * them easy to exchange in a distributed environment. Think of them as
     * replacements for int and String respectively.
     * "one" is the constant count 1; "word" holds the current word.
     */
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    /**
     * The map method of Mapper: void map(K1 key, V1 value, Context context).
     * It maps a single input k/v pair to intermediate k/v pairs. Output pairs
     * need not have the same types as the input pair, and one input pair may
     * map to zero or more output pairs.
     * Context collects the <k,v> pairs emitted by the Mapper;
     * context.write(k, v) adds one (k, v) pair to it.
     * With the default TextInputFormat, key is the byte offset of the line
     * and value is the text of the line.
     * Programmers mainly write the Map and Reduce functions. This map method
     * splits the line with StringTokenizer, stores each token in "word" via
     * word.set(), and writes the pair (word, 1) to the context.
     */
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    /**
     * The reduce method of Reducer:
     * void reduce(Text key, Iterable<IntWritable> values, Context context).
     * The k/v data comes from what the map function wrote to its context,
     * possibly after further processing by the combiner. The result is
     * again emitted through the context.
     */
    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    /**
     * Configuration: the map/reduce configuration class; it describes the
     * map-reduce work to be executed to the Hadoop framework.
     */
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    // Create the job with a user-defined name. In Hadoop 2 and later this
    // constructor is deprecated in favor of Job.getInstance(conf, "word count").
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);    // set the Mapper class for the job
    job.setCombinerClass(IntSumReducer.class);    // set the Combiner class for the job
    job.setReducerClass(IntSumReducer.class);     // set the Reducer class for the job
    job.setOutputKeyClass(Text.class);            // set the key class of the job output
    job.setOutputValueClass(IntWritable.class);   // set the value class of the job output
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));    // set the input path
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));  // set the output path
    System.exit(job.waitForCompletion(true) ? 0 : 1);             // run the job
  }
}
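
One detail of the map method is worth seeing in isolation: `StringTokenizer` with its default delimiters splits only on whitespace, so punctuation stays attached to the word and case is preserved. The small demo below uses only the standard library; `TokenizerDemo` is a hypothetical class name, not part of the WordCount example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; it only exercises java.util.StringTokenizer.
class TokenizerDemo {
    /** Splits a line the same way the map method does: on whitespace only. */
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Hello, hello  world.\tHello"));
        // prints [Hello,, hello, world., Hello]
    }
}
```

With this tokenization, "Hello,", "hello" and "Hello" are counted as three different words. If that is not wanted, the line would have to be normalized (for example lowercased and stripped of punctuation) before tokenizing.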