Understanding the MapReduce Model

Source: Internet | Editor: 程序博客网 | Time: 2024/06/11 19:38

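Preface

Why use MapReduce
MapReduce is popular for good reasons. It is simple, easy to implement, and scales well. With it you can easily write programs that run on many hosts at the same time, you can write the Map or Reduce programs in non-Java languages such as Ruby, Python, PHP, and C++, and you can run the same program on any cluster that has Hadoop installed, no matter how many hosts the cluster has. MapReduce suits very large data sets because the data is processed by many hosts in parallel, which usually makes the job finish faster.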

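Down to business

1. The MapReduce model
To understand MapReduce, first understand what it runs on. In Hadoop (the classic MapReduce architecture described here), the machines that execute MapReduce tasks play one of two roles: JobTracker or TaskTracker. The JobTracker manages and schedules the work, and the TaskTrackers execute it. A Hadoop cluster has only one JobTracker.
1.1 MapReduce Job
In Hadoop, every MapReduce task is initialized as a Job. Each Job has two phases, the Map phase and the Reduce phase, and these are expressed by two functions: the Map function and the Reduce function. The Map function takes input in <key,value> form and produces intermediate output that is also in <key,value> form. Hadoop gathers all values that share the same intermediate key and passes them to the Reduce function. The Reduce function takes input in <key,(list of values)> form, processes that collection of values, and emits its result, again in <key,value> form.
To make this easier to follow, label the three <key,value> pairs <k1,v1>, <k2,v2>, and <k3,v3>. The process described above can then be drawn as follows: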

            input-->k1,v1--Map-->k2,v2--Reduce-->k3,v3-->output
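The flow above can be modelled in a few lines of plain Java. This is only an in-memory sketch and not Hadoop code: the class `MapReduceSketch` and the helpers `pair`, `run`, and `demo` are hypothetical names made up for illustration, and the sample job (summing odd and even numbers) is likewise an assumption chosen to keep the example small. A `TreeMap` stands in for the grouping-by-key step that Hadoop performs between Map and Reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;

// Hypothetical in-memory model of input -> <k1,v1> -> Map -> <k2,v2> -> Reduce -> <k3,v3>.
final class MapReduceSketch {

    // Small helper to build a <key,value> pair.
    static <K, V> Map.Entry<K, V> pair(K key, V value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            List<Map.Entry<K1, V1>> input,
            BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
            BiFunction<K2, List<V2>, Map.Entry<K3, V3>> reduce) {

        // Map phase, followed by grouping: collect every v2 under its k2.
        Map<K2, List<V2>> grouped = new TreeMap<>();
        for (Map.Entry<K1, V1> in : input) {
            for (Map.Entry<K2, V2> mid : map.apply(in.getKey(), in.getValue())) {
                grouped.computeIfAbsent(mid.getKey(), k -> new ArrayList<>()).add(mid.getValue());
            }
        }

        // Reduce phase: one call per distinct k2, producing one <k3,v3>.
        List<Map.Entry<K3, V3>> output = new ArrayList<>();
        for (Map.Entry<K2, List<V2>> e : grouped.entrySet()) {
            output.add(reduce.apply(e.getKey(), e.getValue()));
        }
        return output;
    }

    // Sample job (an assumption for illustration): sum the odd and the even numbers.
    static List<Map.Entry<String, Integer>> demo() {
        // <k1,v1> = (record id, number)
        List<Map.Entry<Integer, Integer>> input =
                Arrays.asList(pair(0, 3), pair(1, 4), pair(2, 5), pair(3, 8));

        // <k2,v2> = ("odd" or "even", number)
        BiFunction<Integer, Integer, List<Map.Entry<String, Integer>>> map =
                (id, n) -> Collections.singletonList(pair(n % 2 == 0 ? "even" : "odd", n));

        // <k3,v3> = ("odd" or "even", sum of the numbers in that group)
        BiFunction<String, List<Integer>, Map.Entry<String, Integer>> reduce =
                (label, numbers) -> {
                    int sum = 0;
                    for (int n : numbers) {
                        sum += n;
                    }
                    return pair(label, sum);
                };

        return run(input, map, reduce);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [even=12, odd=8]
    }
}
```

The three type-parameter pairs of `run` correspond one-to-one to <k1,v1>, <k2,v2>, and <k3,v3>, which is the main point of the sketch; a real Job distributes the same steps across many machines.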

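1.1.2 Hadoop's Hello World program

The process described above is the core of MapReduce, and every MapReduce program has the structure shown there. Below, a concrete example illustrates how MapReduce executes.
Whatever language you first learned to program in, the first sample program you saw was probably "Hello World". Hadoop has a similar program: WordCount.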

package hadoopDemo;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

    // Map: splits each input line into words and emits <word, 1>.
    public static class WordMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        private IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String str = tokenizer.nextToken();
                word.set(str);
                context.write(word, one);
                System.out.println(str);
            }
        }
    }

    // Reduce: sums the 1s collected for each word and emits <word, count>.
    public static class WordReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable intWritable : values) {
                sum += intWritable.get();
            }
            context.write(key, new IntWritable(sum));
            System.out.println(key + ":" + new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Note: this constructor is deprecated in Hadoop 2.x in favor of Job.getInstance(conf).
        Job job = new Job(conf);
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(WordMap.class);
        job.setReducerClass(WordReduce.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.out.println(job.waitForCompletion(true));
        Thread.sleep(2000);

        // Read the first reducer's output file back and print it.
        FileSystem fileSystem = FileSystem.get(URI.create(args[1] + "/part-r-00000"), conf);
        Path path = new Path(args[1] + "/part-r-00000");
        InputStream inputStream = null;
        String str = null;
        StringBuilder builder = new StringBuilder(100);
        if (fileSystem.exists(path)) {
            inputStream = fileSystem.open(path);
            BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
            while ((str = reader.readLine()) != null) {
                builder.append(str);
                builder.append("\n");
            }
            inputStream.close();
            reader.close();
            fileSystem.close();
        }
        System.out.println(builder.toString());
    }
}

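Seeing this program, many readers will probably feel lost among all the predefined classes. In fact these classes are quite straightforward. Although WordCount contains a fair amount of code, its execution is simple. In this example the input file is first read in and handed to the Map program. Map splits the input into words and tags each word with a count of 1, forming pairs of the form <word,1>. The values that share the same key (that is, the same word) are then collected into the form <word,list of 1> and handed to Reduce, which adds up the 1s to get the number of occurrences of the word. Finally, each resulting <key,value> pair is written to HDFS in TextOutputFormat.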
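To make the three intermediate shapes concrete, here is a minimal sketch of the same data flow in plain Java, with no Hadoop dependency. The class `WordCountSketch` and its methods `mapLine`, `shuffle`, and `reduce` are hypothetical names introduced only for this illustration, and the two input lines are made-up sample data. In a real Job the grouping step is done by the framework, not by user code.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical single-machine walk-through of the WordCount data flow.
final class WordCountSketch {

    // Map step: split one line into words and emit <word,1> for each word.
    static List<Map.Entry<String, Integer>> mapLine(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Grouping step: collect the 1s of each word into <word,list of 1>, sorted by word.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    // Reduce step: add up the 1s for one word.
    static int reduce(String word, List<Integer> ones) {
        int sum = 0;
        for (int one : ones) {
            sum += one;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello world", "hello hadoop"); // sample input

        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) {
            mapped.addAll(mapLine(line));
        }
        System.out.println(mapped);  // [hello=1, world=1, hello=1, hadoop=1]

        Map<String, List<Integer>> grouped = shuffle(mapped);
        System.out.println(grouped); // {hadoop=[1], hello=[1, 1], world=[1]}

        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
        }
    }
}
```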

To describe how this data flow is set in motion, I have picked out the following lines of code:

                Configuration conf = new Configuration();
                Job job = new Job(conf);
                job.setJarByClass(WordCount.class);
                job.setJobName("wordcount");
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                job.setMapperClass(WordMap.class);
                job.setReducerClass(WordReduce.class);
                job.setInputFormatClass(TextInputFormat.class);
                job.setOutputFormatClass(TextOutputFormat.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.out.println(job.waitForCompletion(true));

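First, the initialization of the Job. The main function creates a Configuration object and passes it to Job; the resulting Job object is used to initialize the MapReduce Job. It then calls

job.setJarByClass(WordCount.class);

to set the class that runs this Job (Hadoop uses this class to locate the JAR file containing the job's code), and then calls

job.setJobName("wordcount");

to set the name of the Job. Naming the Job makes it easier to find, so that it can be monitored on the JobTracker and TaskTracker pages. Next it calls

FileInputFormat.addInputPath(job, new Path(args[0]));

and

FileOutputFormat.setOutputPath(job, new Path(args[1]));

to set the input and output paths.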

Next, using the WordCount program, let's look more closely at

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);

as well as the Map() and Reduce() methods.

InputFormat and InputSplit
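As a rough illustration of what the two format classes in WordCount do: TextInputFormat hands each line of the input to map() as a record whose key is the byte offset of the line in the file and whose value is the text of the line, which is why WordMap is declared as Mapper<LongWritable, Text, ...>. TextOutputFormat writes each <key,value> pair as one line, with key and value separated by a tab by default. The plain-Java sketch below imitates both behaviours in memory. The class `TextFormatSketch` and its methods are hypothetical names, and the sketch deliberately ignores input splits, "\r\n" line endings, and compression.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical imitation of the record shapes used by TextInputFormat and TextOutputFormat.
final class TextFormatSketch {

    // Turn file content into (byte offset, line) records, as map() would receive them.
    // Assumes UTF-8 text with "\n" line endings.
    static List<Map.Entry<Long, String>> toRecords(String fileContent) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        if (fileContent.isEmpty()) {
            return records;
        }
        long offset = 0;
        for (String line : fileContent.split("\n")) {
            records.add(new AbstractMap.SimpleEntry<>(offset, line));
            // Advance by the line's byte length plus one byte for the '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    // Format one reduce result the way it appears in the output file: key, tab, value.
    static String toOutputLine(String key, int value) {
        return key + "\t" + value;
    }

    public static void main(String[] args) {
        // Sample input (made up for this illustration).
        System.out.println(toRecords("hello world\nhello hadoop\n")); // [0=hello world, 12=hello hadoop]
        System.out.println(toOutputLine("hello", 2));
    }
}
```

In WordCount the offset key is simply ignored by map(); only the line text is tokenized.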

0 0