How MapReduce Works

Source: Internet · Editor: 程序博客网 · Time: 2024/06/05 10:54

Today I tried out how MapReduce runs a computation, so here is a short write-up.

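First we need to know two kinds of daemons. The ResourceManager schedules resources for the whole computation (cluster-wide). The NodeManager manages the resources used for computation on each individual node (per-node).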

We start the cluster's resource-management daemons with sbin/start-yarn.sh and then run jps to check:

hadoop@master:/mysoftware/hadoop-2.7.3/share/hadoop/mapreduce$ jps
1218 SecondaryNameNode
1048 NameNode
1802 Jps
1435 ResourceManager

The directory /mysoftware/hadoop-2.7.3/share/hadoop/mapreduce contains a number of jar files:

hadoop@master:/mysoftware/hadoop-2.7.3/share/hadoop/mapreduce$ ls
hadoop-mapreduce-client-app-2.7.3.jar         hadoop-mapreduce-client-jobclient-2.7.3-tests.jar
hadoop-mapreduce-client-common-2.7.3.jar      hadoop-mapreduce-client-shuffle-2.7.3.jar
hadoop-mapreduce-client-core-2.7.3.jar        hadoop-mapreduce-examples-2.7.3.jar
hadoop-mapreduce-client-hs-2.7.3.jar          lib
hadoop-mapreduce-client-hs-plugins-2.7.3.jar lib-examples
hadoop-mapreduce-client-jobclient-2.7.3.jar   sources

Running hadoop jar hadoop-mapreduce-examples-2.7.3.jar with no further arguments lists the example programs the jar contains:

An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.

The usage messages are quite user-friendly and easy to follow. The list includes a wordcount program, which we will use for the rest of this walkthrough.

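We run:

hadoop jar hadoop-mapreduce-examples-2.7.3.jar wordcount /data/test02.txt /data_out_1

Here /data/test02.txt is the input file to process and /data_out_1 is the directory where the results are written. The output directory must not exist before the job starts.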

While the job is running, we can follow it in the web UI at http://master:8088/cluster

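The job we just ran is listed here and has already finished. Clicking History shows its history records, but this requires the JobHistoryServer daemon to be started first: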

sbin/mr-jobhistory-daemon.sh start historyserver

After that, the earlier history records can be viewed in the web UI.

On the remote file system we can see the result of the job we just ran.

As you can see, a directory named data_out_1 was created. On the right is the output of wordcount, which counts how many times each word occurs in /data/test02.txt.

Next, let's use Java code to see what happens inside.

package com.yc.hadoop.mapreduce;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyWordCount {

    // By default each line of the input file becomes one record:
    //   key   = offset of the first character of the line within the file
    //   value = the content of the line

    // Map: split the data
    public static class MyWordCountMapred extends Mapper<LongWritable, Text, Text, IntWritable> {
        public static final IntWritable ONE = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Print key and value to see what map receives (see figure)
            System.out.println("--------> key:" + key + ", value:" + value);
            String[] words = value.toString().trim().split("\\s+"); // split on whitespace

            for (String word : words) {
                if (!word.isEmpty()) { // a blank line yields one empty token; skip it
                    // Emit (word, 1) through the map context
                    context.write(new Text(word), ONE);
                }
            }
        }
    }

    // Before reduce is called, the framework
    // 1. sorts the map output by key
    // 2. groups the values of identical keys into a collection,
    //    e.g. three occurrences of hello ==> key: hello  values: {1, 1, 1}
    public static class MyWordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            /* Debug only, to observe the grouped values (see figure):
            StringBuilder vstr = new StringBuilder();
            for (IntWritable v : values) {
                vstr.append(v.get() + ", ");
            }
            System.out.println("*********************>key:" + key + ", value:" + vstr);
            */

            int count = 0;
            for (IntWritable v : values) {
                count += v.get(); // add up the occurrences of this word
            }

            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();         // configuration object
        Job job = Job.getInstance(conf, "mywordcount");   // MapReduce job object

        job.setJarByClass(MyWordCount.class);             // class used to locate the job jar

        // Map step: splitting the data
        job.setMapperClass(MyWordCountMapred.class);
        job.setMapOutputKeyClass(Text.class);             // key type of the map output
        job.setMapOutputValueClass(IntWritable.class);    // value type of the map output

        // Reduce step
        job.setReducerClass(MyWordCountReducer.class);
        job.setNumReduceTasks(2);                         // number of reduce tasks, hence output files; default is 1
        job.setOutputKeyClass(Text.class);                // key type of the final output
        job.setOutputValueClass(IntWritable.class);       // value type of the final output

        // Location of the input data
        FileInputFormat.setInputPaths(job, new Path("hdfs://master:9000/data/test01.txt"));
        // Location where the results are stored
        FileOutputFormat.setOutputPath(job,
                new Path("hdfs://master:9000/data_out_" + System.currentTimeMillis()));

        // Run the job and wait for it to finish
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
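The call setNumReduceTasks(2) in the listing means each map output key has to be assigned to one of two reduce tasks. The sketch below shows the formula used by Hadoop's default HashPartitioner in plain Java. The class name PartitionSketch and the sample words are made up for illustration, and it hashes a String rather than a Hadoop Text, so the assignments it prints will not necessarily match what a real job does.

```java
class PartitionSketch {

    // Same formula as Hadoop's default HashPartitioner:
    // clear the sign bit so the value is non-negative, then take the remainder.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // Hypothetical sample words, not taken from the original input file.
        String[] words = {"hello", "world", "hadoop", "mapreduce"};
        int numReduceTasks = 2;
        for (String word : words) {
            // Reduce task i writes its results to the file part-r-0000i.
            System.out.println(word + " -> part-r-0000" + partitionFor(word, numReduceTasks));
        }
    }
}
```

Every occurrence of a given word goes to the same reduce task, which is why each word shows up in exactly one of the output files.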

Figure: the key and value received by map

Figure: the result after map has split the data
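To make the map step concrete without a cluster, here is a minimal plain-Java sketch of what the mapper sees and emits. It assumes the default text input format, where the key is the byte offset of the line and the value is the line itself. The class MapPhaseSketch, its methods, and the two sample lines are hypothetical and only mirror the logic of the map method in the listing.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class MapPhaseSketch {

    // Byte offset at which each line starts (lines separated by '\n').
    static long[] lineOffsets(String content) {
        String[] lines = content.split("\n");
        long[] offsets = new long[lines.length];
        long offset = 0;
        for (int i = 0; i < lines.length; i++) {
            offsets[i] = offset;
            offset += lines[i].getBytes(StandardCharsets.UTF_8).length + 1; // +1 for '\n'
        }
        return offsets;
    }

    // Mirrors map(): split the line on whitespace and emit (word, 1) per word.
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emitted.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        String content = "hello world\nhello hadoop\n"; // made-up sample input
        String[] lines = content.split("\n");
        long[] offsets = lineOffsets(content);
        for (int i = 0; i < lines.length; i++) {
            System.out.println("key:" + offsets[i] + ", value:" + lines[i]
                    + " -> " + map(offsets[i], lines[i]));
        }
    }
}
```

With this sample input the two lines start at offsets 0 and 12, and every word is emitted with the value 1. No counting happens at this stage.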

Figure: the reduce step

Run result:

This implements the word count. Overall, MapReduce consists of a map operation and a reduce operation.

Split first, then merge.

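The whole MapReduce process can be roughly divided into: Map -> Shuffle (sort by key) -> Combine (group the values that share a key) -> Reduce. The grouping step meant here is not Hadoop's optional Combiner class, which pre-aggregates map output on the map side before the shuffle.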
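The stages after map can be imitated in a few lines of plain Java. This is only a sketch under simplifying assumptions: everything runs in one process, a TreeMap stands in for the framework's sort-and-group step, and the class ShuffleReduceSketch and the sample words are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class ShuffleReduceSketch {

    // Shuffle + group: sort by key and collect the 1s emitted for each word.
    // Each element of emittedWords stands for one (word, 1) pair from map.
    static SortedMap<String, List<Integer>> shuffle(List<String> emittedWords) {
        SortedMap<String, List<Integer>> grouped = new TreeMap<>();
        for (String word : emittedWords) {
            grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return grouped;
    }

    // Reduce: add up the grouped values of one key.
    static int reduce(String key, Iterable<Integer> values) {
        int count = 0;
        for (int v : values) {
            count += v;
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up map output: hello appears three times.
        List<String> emitted = Arrays.asList("hello", "world", "hello", "hadoop", "hello");
        for (Map.Entry<String, List<Integer>> e : shuffle(emitted).entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue()
                    + " -> " + reduce(e.getKey(), e.getValue()));
        }
    }
}
```

The keys come out in sorted order (hadoop, hello, world), and hello arrives at reduce with three 1s that sum to 3.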

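Map splits the data first; the pairs with the same key are then sorted and grouped. Here I only have one file, test01.txt. If /data contained two or more files, for example a test02.txt as well, and the words were counted across all of them, the process would be the same: at merge time, the per-file counts for each word are sorted and grouped once more into new key/value pairs. You could say MapReduce is a process of sorting and grouping over and over!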
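The multi-file case works because addition can be done in any order: counting each file separately and merging the partial counts gives the same totals as counting everything at once. The sketch below checks this in plain Java. MultiFileSketch and the two sample strings are hypothetical stand-ins for test01.txt and test02.txt. In Hadoop itself, partial counts like these are only produced on the map side when a combiner is configured; otherwise the raw (word, 1) pairs from all files meet in the shuffle.

```java
import java.util.Map;
import java.util.TreeMap;

class MultiFileSketch {

    // Word count of a single piece of text.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Merge two partial results by adding the counts of identical words.
    static Map<String, Integer> merge(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> total = new TreeMap<>(a);
        b.forEach((word, n) -> total.merge(word, n, Integer::sum));
        return total;
    }

    public static void main(String[] args) {
        String file1 = "hello world";   // stand-in for the content of one input file
        String file2 = "hello hadoop";  // stand-in for the content of a second input file
        Map<String, Integer> merged = merge(count(file1), count(file2));
        System.out.println(merged);
        // Same totals as counting the concatenated text in one go.
        System.out.println(merged.equals(count(file1 + " " + file2)));
    }
}
```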

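Addendum: running this Java code may raise an error related to NativeIO. Since the goal here is to get a feel for the MapReduce process, we simply work around it.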

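Copy the entire source of NativeIO into your project and find the access function. Note that the package must be created exactly as declared in the original file; do not change the path, or the copied class will not be picked up.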

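Comment out the original statement there and replace it with return true; so that the check is skipped and the run continues. In some environments the job gets past this on its own; in others you need to make this change yourself.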

That is what I have learned about MapReduce so far. There is more to dig into, so I will keep at it!
