走向云计算之MapReduce的代码辅助优化和改善

来源：互联网发布：浙江华为知乎编辑：程序博客网时间：2024/06/04 17:58

一、概述

hadoop的MapReduce在运行时，hadoop框架在幕后为我们完成了许多重要的工作，这部分内容对用户是透明的，一般我们不必去关心其运行。但是在不同的应用场景中，可能需要对其中的一些小地方进行优化或者修改，以更好的解决当前的场景问题。下面就介绍几个实际开发中可能会遇到的情况。

二、hadoop计数器

计数器是hadoop用来记录job任务的执行进度和状态的。它的作用可以理解为日志。我们通常可以在程序的某个位置插入计数器，用来记录数据或者进度的变化情况，它比日志更便利进行分析。

1、内置计数器

Hadoop其实内置了很多计数器，我们通过单词计数实例来看一下。
HDFS上的源文件：

hello youhello me

WordCount.Java：

package com.kang;import java.io.IOException;import java.util.StringTokenizer;import org.apache.hadoop.conf.Configuration;import org.apache.hadoop.fs.Path;import org.apache.hadoop.io.IntWritable;import org.apache.hadoop.io.Text;import org.apache.hadoop.mapreduce.Job;import org.apache.hadoop.mapreduce.Mapper;import org.apache.hadoop.mapreduce.Reducer;import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;public class WordCount {    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {        private final static IntWritable one = new IntWritable(1);        private Text word = new Text();        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {            StringTokenizer itr = new StringTokenizer(value.toString());            while (itr.hasMoreTokens()) {                word.set(itr.nextToken());                context.write(word, one);            }        }    }    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {        private IntWritable result = new IntWritable();        public void reduce(Text key, Iterable<IntWritable> values, Context context)                throws IOException, InterruptedException {            int sum = 0;            for (IntWritable val : values) {                sum += val.get();            }            result.set(sum);            context.write(key, result);        }    }    public static void main(String[] args) throws Exception {        Configuration conf = new Configuration();        Job job = new Job(conf, "word count");        job.setJarByClass(WordCount.class);        job.setMapperClass(TokenizerMapper.class);        job.setCombinerClass(IntSumReducer.class);        job.setReducerClass(IntSumReducer.class);        job.setOutputKeyClass(Text.class);        job.setOutputValueClass(IntWritable.class);        FileInputFormat.addInputPath(job, new Path("hdfs://sparkproject1:9000/root/input/"));        FileOutputFormat.setOutputPath(job, new Path("hdfs://sparkproject1:9000/root/output/"));        System.exit(job.waitForCompletion(true) ? 0 : 1);    }}

运行该实例，在控制台可以看到如下内容：
这里写图片描述

其详细信息如下：

17/06/14 08:55:25 INFO mapreduce.Job: Counters: 49// 表示本次job共49个计数器      File System Counters // 文件系统计数器          FILE: Number of bytes read=37        FILE: Number of bytes written=207823        FILE: Number of read operations=0        FILE: Number of large read operations=0        FILE: Number of write operations=0        HDFS: Number of bytes read=129        HDFS: Number of bytes written=19        HDFS: Number of read operations=6        HDFS: Number of large read operations=0        HDFS: Number of write operations=2    Job Counters // 作业计数器          Launched map tasks=1// 启动的map数为1          Launched reduce tasks=1// 启动的reduce数为1          Data-local map tasks=1        Total time spent by all maps in occupied slots (ms)=7223        Total time spent by all reduces in occupied slots (ms)=7891        Total time spent by all map tasks (ms)=7223        Total time spent by all reduce tasks (ms)=7891        Total vcore-seconds taken by all map tasks=7223        Total vcore-seconds taken by all reduce tasks=7891        Total megabyte-seconds taken by all map tasks=7396352        Total megabyte-seconds taken by all reduce tasks=8080384    Map-Reduce Framework//MapReduce框架计数器          Map input records=2//map读入的记录行数，读取两行记录,”hello you”,”hello me”        Map output records=4//map输出的记录行数，输出4行记录        Map output bytes=35        Map output materialized bytes=37        Input split bytes=110        Combine input records=4// 合并输入的记录数        Combine output records=3// 合并输出的记录数        Reduce input groups=3// reduce函数接收的key数量，即归并后的k2数量        Reduce shuffle bytes=37// 规约分区的字节数        Reduce input records=3// reduce输入的记录行数。<helllo,{1,1}>,<you,{1}>,<me,{1}>        Reduce output records=3//reduce输出的记录行数。<helllo,2>,<you,1>,<me,1>        Spilled Records=6        Shuffled Maps =1        Failed Shuffles=0        Merged Map outputs=1        GC time elapsed (ms)=326        CPU time spent (ms)=1090        Physical memory (bytes) snapshot=222588928        Virtual memory (bytes) snapshot=787947520        Total committed heap usage (bytes)=126754816    Shuffle Errors// Shuffle错误计数器          BAD_ID=0        CONNECTION=0        IO_ERROR=0        WRONG_LENGTH=0        WRONG_MAP=0        WRONG_REDUCE=0    File Input Format Counters // 文件输入格式计数器          Bytes Read=19 // Map从HDFS上读取的字节数，共19个字节      File Output Format Counters         Bytes Written=19//Reduce输出到HDFS上的字节数，共19个字节

上面的信息就是内置计数器的一些信息，包括：
文件系统计数器（File System Counters）
作业计数器（Job Counters）
MapReduce框架计数器（Map-Reduce Framework）
Shuffle 错误计数器（Shuffle Errors）
文件输入格式计数器（File Output Format Counters）
文件输出格式计数器（File Input Format Counters）

2、自定义计数器

Hadoop也支持自定义计数器，在Hadoop2.x中可以使用Context的getCounter()方法（其实是接口TaskAttemptContext的方法，Context继承了该接口）得到自定义计数器。

public Counter getCounter(Enum<?> counterName)：Get the Counter for the given counterNamepublic Counter getCounter(String groupName, String counterName)：Get the Counter for the given groupName and counterName

由此可见，可以通过枚举或者字符串来得到计数器。
计数器常见的方法有几下几个：

String getName()：Get the name of the counterString getDisplayName()：Get the display name of the counterlong getValue()：Get the current valuevoid setValue(long value)：Set this counter by the given valuevoid increment(long incr)：Increment this counter by the given value

现在假设我们需要对文件中的敏感词做一个统计，即对敏感词在文件中出现的次数做一个记录。这里，我们还是以下面这个文件为例：

hello youhello me

文本内容很简单，这里我们指定hello是一个敏感词，显而易见这里出现了两次hello，即两次敏感词需要记录下来。
为了实现这个功能，我们在WordCount程序的基础之上，改写Mapper类中的map方法，统计Hello出现的次数，如下代码所示：

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {        private final static IntWritable one = new IntWritable(1);        private Text word = new Text();        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {            Counter sensitiveCounter = context.getCounter("Sensitive Words:", "hello");// 创建一个组是Sensitive Words，名是hello的计数器            StringTokenizer itr = new StringTokenizer(value.toString());            while (itr.hasMoreTokens()) {                if (itr.nextToken().equalsIgnoreCase("hello")) {//如果出现了hello，则计数器值加1                      sensitiveCounter.increment(1L);                  }                word.set(itr.nextToken());                context.write(word, one);            }        }    }

其他代码不变，运行后可以看到结果如下：
这里写图片描述
发现控制台中确实多了一组计数器Sensitive Words:，其中有一个名叫hello的计数器，值为2。

三、hadoop利用自定义数据类型

1、情景描述

现在我们利用hadoop来进行手机日志内容的分析，这个日志文件的内容如下：

1363157985066   13726230503 00-FD-07-A4-72-B8:CMCC  120.196.100.82  i02.c.aliimg.com        24  27  2481    24681   2001363157995052   13826544101 5C-0E-8B-C7-F1-E0:CMCC  120.197.40.4            4   0   264 0   2001363157991076   13926435656 20-10-7A-28-CC-0A:CMCC  120.196.100.99          2   4   132 1512    2001363154400022   13926251106 5C-0E-8B-8B-B1-50:CMCC  120.197.40.4            4   0   240 0   2001363157993044   18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99  iface.qiyi.com  视频网站    15  12  1527    2106    2001363157995074   84138413    5C-0E-8B-8C-E8-20:7DaysInn  120.197.40.4    122.72.52.12        20  16  4116    1432    2001363157993055   13560439658 C4-17-FE-BA-DE-D9:CMCC  120.196.100.99          18  15  1116    954 2001363157995033   15920133257 5C-0E-8B-C7-BA-20:CMCC  120.197.40.4    sug.so.360.cn   信息安全    20  20  3156    2936    2001363157983019   13719199419 68-A1-B7-03-07-B1:CMCC-EASY 120.196.100.82          4   0   240 0   2001363157984041   13660577991 5C-0E-8B-92-5C-20:CMCC-EASY 120.197.40.4    s19.cnzz.com    站点统计    24  9   6960    690 2001363157973098   15013685858 5C-0E-8B-C7-F7-90:CMCC  120.197.40.4    rank.ie.sogou.com   搜索引擎    28  27  3659    3538    2001363157986029   15989002119 E8-99-C4-4E-93-E0:CMCC-EASY 120.196.100.99  www.umeng.com   站点统计    3   3   1938    180 2001363157992093   13560439658 C4-17-FE-BA-DE-D9:CMCC  120.196.100.99          15  9   918 4938    2001363157986041   13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4            3   3   180 180 2001363157984040   13602846565 5C-0E-8B-8B-B6-00:CMCC  120.197.40.4    2052.flash2-http.qq.com 综合门户    15  12  1938    2910    2001363157995093   13922314466 00-FD-07-A2-EC-BA:CMCC  120.196.100.82  img.qfc.cn      12  12  3008    3720    2001363157982040   13502468823 5C-0A-5B-6A-0B-D4:CMCC-EASY 120.196.100.99  y0.ifengimg.com 综合门户    57  102 7335    110349  2001363157986072   18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99  input.shouji.sogou.com  搜索引擎    21  18  9531    2412    2001363157990043   13925057413 00-1F-64-E1-E6-9A:CMCC  120.196.100.55  t3.baidu.com    搜索引擎    69  63  11058   48243   2001363157988072   13760778710 00-FD-07-A4-7B-08:CMCC  120.196.100.82          2   2   120 120 2001363157985079   13823070001 20-7C-8F-70-68-1F:CMCC  120.196.100.99          6   3   360 180 2001363157985069   13600217502 00-1F-64-E2-E8-B1:CMCC  120.196.100.55          18  138 1080    186852  200

文件的内容已经经过了优化，格式比较规整，便于学习研究。该日志文件的每个记录，一共有11个字段，例如：

1363157986072   18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99  input.shouji.sogou.com  搜索引擎    21  18  9531    2412    200

每个字段的含义如下图所示。
这里写图片描述
现在我们需要统计每个电话的流量数据信息。

2、思路分析

我们要统计这个文件中，同一手机号的流量汇总。而我们可以从所给的数据中发现，记录中有四个字段以不同的形式表示手机的流量（上行数据包数，下行数据包数，上行总流量，下行总流量），这时该如何处理呢？一个比较好的解决方法是使用面向对象的概念，我们可以自定义一个数据类型去包含这几个值，用类中的属性，来表示这几个字段，从而方便我们对数据的操作。
首先我们有未经过处理的原始文件(相当于<k1,v1>)，这个文件里存储着我需要的数据就是,那就是一个手机的流量的汇总数据（相当于<k3,v3>）,而要从原始数据获得我们最终想要的数据，这中间需要经过一个过程，对原始数据进行初步加工处理，形成中间结果(相当于<k2,V2>),而<K2,V2>这时候代表什么呢？不难看出，将所有的原始数据经过map()函数的分组排序处理后，得到一个中间结果，这个中间结果是一个键值对<K2,V2>,而这里的K2应该就是电话号码，V2就是我们的自定义类型表示手机流量，最后将中间数据经过reduce()函数的归一化处理，得到我们的最终结果。

3、源码

package com.kang;import java.io.DataInput;import java.io.DataOutput;import java.io.IOException;import org.apache.hadoop.conf.Configuration;import org.apache.hadoop.fs.Path;import org.apache.hadoop.io.LongWritable;import org.apache.hadoop.io.Text;import org.apache.hadoop.io.Writable;import org.apache.hadoop.mapreduce.Job;import org.apache.hadoop.mapreduce.Mapper;import org.apache.hadoop.mapreduce.Reducer;import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;public class KpiApp {    static final String INPUT_PATH = "hdfs://sparkproject1:9000/root/input/";    static final String OUT_PATH = "hdfs://sparkproject1:9000/root/output/";    public static void main(String[] args) throws Exception {        final Job job = new Job(new Configuration(), KpiApp.class.getSimpleName());        FileInputFormat.setInputPaths(job, INPUT_PATH);// 指定输入文件路径        job.setInputFormatClass(TextInputFormat.class);// 指定哪个类用来格式化输入文件        job.setMapperClass(MyMapper.class);// 指定自定义的Mapper类        job.setMapOutputKeyClass(Text.class);// 指定map输出的key值类型        job.setMapOutputValueClass(KpiWritable.class);//指定map输出的value值类型（我们自定义）        job.setNumReduceTasks(1);//指定reduce数目，可选        job.setReducerClass(MyReducer.class);// 指定自定义的reduce类        job.setOutputKeyClass(Text.class);// 指定reduce输出的key值类型        job.setOutputValueClass(KpiWritable.class);//指定reduce输出的value值类型        FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));//指定输出目录        job.setOutputFormatClass(TextOutputFormat.class);// 设定输出文件的格式化类        System.exit(job.waitForCompletion(true) ? 0 : 1);    }    static class MyMapper extends Mapper<LongWritable, Text, Text, KpiWritable> {        protected void map(LongWritable key, Text value,                org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, KpiWritable>.Context context)                throws IOException, InterruptedException {            final String[] splited = value.toString().split("\t");            final String msisdn = splited[1];            final Text k2 = new Text(msisdn);            final KpiWritable v2 = new KpiWritable(splited[6], splited[7], splited[8], splited[9]);            context.write(k2, v2);        };    }    static class MyReducer extends Reducer<Text, KpiWritable, Text, KpiWritable> {        //k2表示整个文件中不同的手机号码        //v2s表示该手机号在不同时段的流量的集合        protected void reduce(Text k2, java.lang.Iterable<KpiWritable> v2s,                org.apache.hadoop.mapreduce.Reducer<Text, KpiWritable, Text, KpiWritable>.Context context)                throws IOException, InterruptedException {            long upPackNum = 0L;            long downPackNum = 0L;            long upPayLoad = 0L;            long downPayLoad = 0L;            for (KpiWritable kpiWritable : v2s) {                upPackNum += kpiWritable.upPackNum;                downPackNum += kpiWritable.downPackNum;                upPayLoad += kpiWritable.upPayLoad;                downPayLoad += kpiWritable.downPayLoad;            }            final KpiWritable v3 = new KpiWritable(upPackNum + "", downPackNum + "", upPayLoad + "", downPayLoad + "");            context.write(k2, v3);        };    }}class KpiWritable implements Writable {//自定义输出数据类型，实现Writable接口    long upPackNum;    long downPackNum;    long upPayLoad;    long downPayLoad;    public KpiWritable() {    }    public KpiWritable(String upPackNum, String downPackNum, String upPayLoad, String downPayLoad) {        this.upPackNum = Long.parseLong(upPackNum);        this.downPackNum = Long.parseLong(downPackNum);        this.upPayLoad = Long.parseLong(upPayLoad);        this.downPayLoad = Long.parseLong(downPayLoad);    }    @Override    public void readFields(DataInput in) throws IOException {        this.upPackNum = in.readLong();        this.downPackNum = in.readLong();        this.upPayLoad = in.readLong();        this.downPayLoad = in.readLong();    }    @Override    public void write(DataOutput out) throws IOException {        out.writeLong(upPackNum);        out.writeLong(downPackNum);        out.writeLong(upPayLoad);        out.writeLong(downPayLoad);    }    @Override    public String toString() {        return upPackNum + "\t" + downPackNum + "\t" + upPayLoad + "\t" + downPayLoad;    }}

运行后得到的数据结果如下：

13480253104 3   3   180 18013502468823 57  102 7335    11034913560439658 33  24  2034    589213600217502 18  138 1080    18685213602846565 15  12  1938    291013660577991 24  9   6960    69013719199419 4   0   240 013726230503 24  27  2481    2468113760778710 2   2   120 12013823070001 6   3   360 18013826544101 4   0   264 013922314466 12  12  3008    372013925057413 69  63  11058   4824313926251106 4   0   240 013926435656 2   4   132 151215013685858 28  27  3659    353815920133257 20  20  3156    293615989002119 3   3   1938    18018211575961 15  12  1527    210618320173382 21  18  9531    241284138413    20  16  4116    1432

4、hadoop实现自定义数据类型的方法

Hadoop的自定制数据类型一般有两个办法，一种较为简单的是只针对值，另外一种更为完整的是对于键和值都适应的方法。

针对值的自定义类型

如果这个自定义数据类型是作为key-value中的value来用的，那么其实现较为简单，只需实现Writable接口即可：

/* DataInput and DataOutput 类是java.io的类 */public interface Writable {    void readFields(DataInput in);    void write(DataOutput out);}

下面是一个小例子：

public class Point3D implement Writable {    public float x, y, z;    public Point3D(float fx, float fy, float fz) {        this.x = fx;        this.y = fy;        this.z = fz;    }    public Point3D() {        this(0.0f, 0.0f, 0.0f);    }    public void readFields(DataInput in) throws IOException {        x = in.readFloat();        y = in.readFloat();        z = in.readFloat();    }    public void write(DataOutput out) throws IOException {        out.writeFloat(x);        out.writeFloat(y);        out.writeFloat(z);    }    public String toString() {        return Float.toString(x) + ", "            + Float.toString(y) + ", "            + Float.toString(z);    }}

针对键和值的自定义类型
对于键来说，还需要指定排序规则。对此办法是实现WritableComparable这个泛型接口，WritableComparable，顾名思义了一半是Writable，一半是Comparable。：

public interface WritableComparable<T> {    public void readFields(DataInput in);    public void write(DataOutput out);    public int compareTo(T other);}

先给出下面的简单例子：

public class Point3D inplements WritableComparable {    public float x, y, z;    public Point3D(float fx, float fy, float fz) {        this.x = fx;        this.y = fy;        this.z = fz;    }    public Point3D() {        this(0.0f, 0.0f, 0.0f);    }    public void readFields(DataInput in) throws IOException {        x = in.readFloat();        y = in.readFloat();        z = in.readFloat();    }    public void write(DataOutput out) throws IOException {        out.writeFloat(x);        out.writeFloat(y);        out.writeFloat(z);    }    public String toString() {        return Float.toString(x) + ", "            + Float.toString(y) + ", "            + Float.toString(z);    }    public float distanceFromOrigin() {        return (float) Math.sqrt( x*x + y*y +z*z);    }    public int compareTo(Point3D other) {        return Float.compareTo(            distanceFromOrigin(),             other.distanceFromOrigin());    }    public boolean equals(Object o) {        if( !(o instanceof Point3D)) {            return false;        }        Point3D other = (Point3D) o;        return this.x == o.x             && this.y == o.y             && this.z == o.z;    }    /* 实现 hashCode() 方法很重要     * Hadoop的Partitioners会用到这个方法     */    public int hashCode() {        return Float.floatToIntBits(x)            ^ Float.floatToIntBits(y)            ^ Float.floatToIntBits(z);    }}

自定义Hadoop数据类型后，需要明确告诉Hadoop来使用它们。这是 JobConf 所能担当的了。使用setOutputKeyClass() 和 setOutputValueClass()方法即可。通常(默认条件下)，这个两个方法的设置对Map和Reduce阶段的输出都起到作用，当然也有专门的 setMapOutputKeyClass() 和 setReduceOutputKeyClass() 接口用来定义map端的输入输出类型。

四、设计combine处理类

在前面关于MapReduce的原理分析中，提到了一般map的输出可以进行combine操作，其作用相当于map端的“本地educe”操作。

1、为什么要进行combine

我们知道，Hadoop框架使用Mapper将数据处理成一个个的<key,value>键值对，在网络节点间对其进行整理(shuffle)，然后使用Reducer处理数据并进行最终输出。在上述过程中，存在至少两个性能瓶颈：
（1）如果我们有10亿个数据，Mapper会生成10亿个键值对在网络间进行传输，但如果我们只是对数据求最大值，那么很明显的Mapper只需要输出当前mapper所知道的最大值即可。这样做不仅可以减轻网络压力，同样也可以大幅度提高程序效率。
（2）假设使用美国这样一个专利数据集中的国家来阐述数据倾斜这个定义，这样的数据不是一致性的或者说平衡分布的，因为大多数的专利国家都是美国，这样Mapper中的键值对和中间阶段(shuffle)的键值存在很多相同的键，而这些具有相同的键的key-value对最终会聚集于一个单一的Reducer之上，压倒这个Reducer，从而大大降低程序的性能。
基于以上两个问题，使用combine就可以较好的解决。与mapper和reducer不同的是，combiner没有默认的实现，需要显式的设置在程序中才有作用。而且并不是所有的job都适用combiner，只有操作满足结合律的才可设置combiner。combine操作类似于：opt(opt(1, 2, 3), opt(4, 5, 6))。如果opt为求和、求最大值的话，可以使用，但是如果是求中值的话，不适用。　

2、combine的原理和设计

每一个map都可能会产生大量的本地输出，Combiner的作用就是对map端的输出先做一次合并，以减少在map和reduce节点之间的数据传输量，以提高网络IO性能，是MapReduce的一种优化手段之一，其具体的作用如下所述。
（1）Combiner最基本是实现本地key的聚合，对map输出的key排序，value进行迭代。如下所示：

　map: (K1, V1) → list(K2, V2) 　combine: (K2, list(V2)) → list(K2, V2) 　reduce: (K2, list(V2)) → list(K3, V3)

（2）Combiner还有本地reduce功能（其本质上就是一个reduce），例如Hadoop自带的wordcount的例子和找出value的最大值的程序，combiner和reduce完全一致，如下所示：

map: (K1, V1) → list(K2, V2) combine: (K2, list(V2)) → list(K3, V3) reduce: (K3, list(V3)) → list(K4, V4)

对于wordcount示例我们，我们就是用了combine操作：

job.setMapperClass(TokenizerMapper.class);job.setCombinerClass(IntSumReducer.class);job.setReducerClass(IntSumReducer.class);

在最后的计数器输出上也可以看到如下信息：

Combine input records=4// 合并输入的记录数Combine output records=3// 合并输出的记录数

不过这里我们使用的是recent方法来进行combine操作，我们完全可以自行定义一个单独的combine类来进行该操作，不过定义的时候我们通常还是让其继承reduce类。如下所示

public static class MyCombiner extends            Reducer<Text, LongWritable, Text, LongWritable> {        protected void reduce(                Text key,                java.lang.Iterable<LongWritable> values,                org.apache.hadoop.mapreduce.Reducer<Text, LongWritable, Text, LongWritable>.Context context)                throws java.io.IOException, InterruptedException {            // 显示次数表示规约函数被调用了多少次，表示k2有多少个分组            System.out.println("Combiner输入分组<" + key.toString() + ",N(N>=1)>");            long count = 0L;            for (LongWritable value : values) {                count += value.get();                // 显示次数表示输入的k2,v2的键值对数量                System.out.println("Combiner输入键值对<" + key.toString() + ","                        + value.get() + ">");            }            context.write(key, new LongWritable(count));            // 显示次数表示输出的k2,v2的键值对数量            System.out.println("Combiner输出键值对<" + key.toString() + "," + count                    + ">");        };    }

你可能会有个疑虑：Combiner本身已经执行了reduce操作，为什么在Reducer阶段还要执行reduce操作？
这是因为combiner操作实际上发生在map端的（每个map都会有一个combine对应），它只能处理自己对应的map任务中的数据，不能跨map任务执行；只有reduce可以接收多个map任务处理的数据。

五、设计Partitioner处理类

通过前面的介绍我们知道Mapper最终处理的键值对<key, value>，是需要送到Reducer去合并的。合并的时候，有相同key的键/值对会送到同一个Reducer节点中进行归并。哪个key到哪个Reducer的分配过程，是由Partitioner规定的。在一些集群应用中，例如分布式缓存集群中，缓存的数据大多都是靠哈希函数来进行数据的均匀分布的，在Hadoop中也不例外。

1、MapReduce默认的Partitioner

1.2 Hadoop内置Partitioner
MapReduce的使用者通常会指定Reduce任务和Reduce任务输出文件的数量（R）。用户在中间key上使用分区函数来对数据进行分区，之后在输入到后续任务执行进程。一个默认的分区函数式使用hash方法（比如常见的：hash(key) mod R）进行分区。hash方法能够产生非常平衡的分区，鉴于此，Hadoop中自带了一个默认的分区类HashPartitioner，它继承了Partitioner类，提供了一个getPartition的方法，它的定义如下所示：

/** Partition keys by their {@link Object#hashCode()}. */public class HashPartitioner<K, V> extends Partitioner<K, V> {  /** Use {@link Object#hashCode()} to partition. */  public int getPartition(K key, V value,                          int numReduceTasks) {    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;  }}

上述代码中关键代码就一句：

return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;

这段代码实现的目的是将key均匀分布在Reduce Tasks上，例如：如果Key为Text的话，Text的hashcode方法跟String的基本一致，都是采用的Horner公式计算，得到一个int整数。但是，如果string太大的话这个int整数值可能会溢出变成负数，所以和整数的上限值Integer.MAX_VALUE（即0111111111111111）进行与运算，然后再对reduce任务个数取余，这样就可以让key均匀分布在reduce上。

2、自定义Partitioner处理方法

下面我们尝试自定义一个分区，来处理一下上面的手机日志数据。该日至内容如下：

1363157985066   13726230503 00-FD-07-A4-72-B8:CMCC  120.196.100.82  i02.c.aliimg.com        24  27  2481    24681   2001363157995052   13826544101 5C-0E-8B-C7-F1-E0:CMCC  120.197.40.4            4   0   264 0   2001363157991076   13926435656 20-10-7A-28-CC-0A:CMCC  120.196.100.99          2   4   132 1512    2001363154400022   13926251106 5C-0E-8B-8B-B1-50:CMCC  120.197.40.4            4   0   240 0   2001363157993044   18211575961 94-71-AC-CD-E6-18:CMCC-EASY 120.196.100.99  iface.qiyi.com  视频网站    15  12  1527    2106    2001363157995074   84138413    5C-0E-8B-8C-E8-20:7DaysInn  120.197.40.4    122.72.52.12        20  16  4116    1432    2001363157993055   13560439658 C4-17-FE-BA-DE-D9:CMCC  120.196.100.99          18  15  1116    954 2001363157995033   15920133257 5C-0E-8B-C7-BA-20:CMCC  120.197.40.4    sug.so.360.cn   信息安全    20  20  3156    2936    2001363157983019   13719199419 68-A1-B7-03-07-B1:CMCC-EASY 120.196.100.82          4   0   240 0   2001363157984041   13660577991 5C-0E-8B-92-5C-20:CMCC-EASY 120.197.40.4    s19.cnzz.com    站点统计    24  9   6960    690 2001363157973098   15013685858 5C-0E-8B-C7-F7-90:CMCC  120.197.40.4    rank.ie.sogou.com   搜索引擎    28  27  3659    3538    2001363157986029   15989002119 E8-99-C4-4E-93-E0:CMCC-EASY 120.196.100.99  www.umeng.com   站点统计    3   3   1938    180 2001363157992093   13560439658 C4-17-FE-BA-DE-D9:CMCC  120.196.100.99          15  9   918 4938    2001363157986041   13480253104 5C-0E-8B-C7-FC-80:CMCC-EASY 120.197.40.4            3   3   180 180 2001363157984040   13602846565 5C-0E-8B-8B-B6-00:CMCC  120.197.40.4    2052.flash2-http.qq.com 综合门户    15  12  1938    2910    2001363157995093   13922314466 00-FD-07-A2-EC-BA:CMCC  120.196.100.82  img.qfc.cn      12  12  3008    3720    2001363157982040   13502468823 5C-0A-5B-6A-0B-D4:CMCC-EASY 120.196.100.99  y0.ifengimg.com 综合门户    57  102 7335    110349  2001363157986072   18320173382 84-25-DB-4F-10-1A:CMCC-EASY 120.196.100.99  input.shouji.sogou.com  搜索引擎    21  18  9531    2412    2001363157990043   13925057413 00-1F-64-E1-E6-9A:CMCC  120.196.100.55  t3.baidu.com    搜索引擎    69  63  11058   48243   2001363157988072   13760778710 00-FD-07-A4-7B-08:CMCC  120.196.100.82          2   2   120 120 2001363157985079   13823070001 20-7C-8F-70-68-1F:CMCC  120.196.100.99          6   3   360 180 2001363157985069   13600217502 00-1F-64-E2-E8-B1:CMCC  120.196.100.55          18  138 1080    186852  200

从数据中我们可以发现，在第二列上并不是所有的数据都是手机号。例如这条数据中的84138413就不是手机号码：

1363157995074   84138413    5C-0E-8B-8C-E8-20:7DaysInn  120.197.40.4    122.72.52.12        20  16  4116    1432    200

我们任务就是在统计手机流量时，将手机号码和非手机号输出到不同的文件中。我们的分区是按手机和非手机号码来分的，所以我们可以按该字段的长度来划分，其分区操作方法如下：

static class KpiPartitioner extends HashPartitioner<Text, KpiWritable>{        @Override        public int getPartition(Text key, KpiWritable value, int numReduceTasks) {            return (key.toString().length()==11)?0:1;        }    }

我们将其应用到MapReduce处理中去，完整源码如下：

package com.kang;import java.io.DataInput;import java.io.DataOutput;import java.io.IOException;import org.apache.hadoop.conf.Configuration;import org.apache.hadoop.fs.Path;import org.apache.hadoop.io.LongWritable;import org.apache.hadoop.io.Text;import org.apache.hadoop.io.Writable;import org.apache.hadoop.mapreduce.Job;import org.apache.hadoop.mapreduce.Mapper;import org.apache.hadoop.mapreduce.Reducer;import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;public class KpiApp {    static final String INPUT_PATH = "hdfs://sparkproject1:9000/root/input/";    static final String OUT_PATH = "hdfs://sparkproject1:9000/root/output/";    public static void main(String[] args) throws Exception {        final Job job = new Job(new Configuration(), KpiApp.class.getSimpleName());        FileInputFormat.setInputPaths(job, INPUT_PATH);// 指定输入文件路径        job.setInputFormatClass(TextInputFormat.class);// 指定哪个类用来格式化输入文件        job.setMapperClass(MyMapper.class);// 指定自定义的Mapper类        job.setMapOutputKeyClass(Text.class);// 指定map输出的key值类型        job.setMapOutputValueClass(KpiWritable.class);//指定map输出的value值类型（我们自定义）        job.setPartitionerClass(KpiPartitioner.class);//指定分区操作类        job.setNumReduceTasks(2);//这里设置reduce的数目为2        job.setReducerClass(MyReducer.class);// 指定自定义的reduce类        job.setOutputKeyClass(Text.class);// 指定reduce输出的key值类型        job.setOutputValueClass(KpiWritable.class);//指定reduce输出的value值类型        FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));//指定输出目录        job.setOutputFormatClass(TextOutputFormat.class);// 设定输出文件的格式化类        System.exit(job.waitForCompletion(true) ? 0 : 1);    }    static class MyMapper extends Mapper<LongWritable, Text, Text, KpiWritable> {        protected void map(LongWritable key, Text value,                org.apache.hadoop.mapreduce.Mapper<LongWritable, Text, Text, KpiWritable>.Context context)                throws IOException, InterruptedException {            final String[] splited = value.toString().split("\t");            final String msisdn = splited[1];            final Text k2 = new Text(msisdn);            final KpiWritable v2 = new KpiWritable(splited[6], splited[7], splited[8], splited[9]);            context.write(k2, v2);        };    }    static class MyReducer extends Reducer<Text, KpiWritable, Text, KpiWritable> {        //k2表示整个文件中不同的手机号码        //v2s表示该手机号在不同时段的流量的集合        protected void reduce(Text k2, java.lang.Iterable<KpiWritable> v2s,                org.apache.hadoop.mapreduce.Reducer<Text, KpiWritable, Text, KpiWritable>.Context context)                throws IOException, InterruptedException {            long upPackNum = 0L;            long downPackNum = 0L;            long upPayLoad = 0L;            long downPayLoad = 0L;            for (KpiWritable kpiWritable : v2s) {                upPackNum += kpiWritable.upPackNum;                downPackNum += kpiWritable.downPackNum;                upPayLoad += kpiWritable.upPayLoad;                downPayLoad += kpiWritable.downPayLoad;            }            final KpiWritable v3 = new KpiWritable(upPackNum + "", downPackNum + "", upPayLoad + "", downPayLoad + "");            context.write(k2, v3);        };    }    static class KpiPartitioner extends HashPartitioner<Text, KpiWritable>{        @Override        public int getPartition(Text key, KpiWritable value, int numReduceTasks) {            return (key.toString().length()==11)?0:1;        }    }}class KpiWritable implements Writable {//自定义输出数据类型，实现Writable接口    long upPackNum;    long downPackNum;    long upPayLoad;    long downPayLoad;    public KpiWritable() {    }    public KpiWritable(String upPackNum, String downPackNum, String upPayLoad, String downPayLoad) {        this.upPackNum = Long.parseLong(upPackNum);        this.downPackNum = Long.parseLong(downPackNum);        this.upPayLoad = Long.parseLong(upPayLoad);        this.downPayLoad = Long.parseLong(downPayLoad);    }    @Override    public void readFields(DataInput in) throws IOException {        this.upPackNum = in.readLong();        this.downPackNum = in.readLong();        this.upPayLoad = in.readLong();        this.downPayLoad = in.readLong();    }    @Override    public void write(DataOutput out) throws IOException {        out.writeLong(upPackNum);        out.writeLong(downPackNum);        out.writeLong(upPayLoad);        out.writeLong(downPayLoad);    }    @Override    public String toString() {        return upPackNum + "\t" + downPackNum + "\t" + upPayLoad + "\t" + downPayLoad;    }}

运行后可以看到MapReduce产生了两个输出文件：
这里写图片描述

其中part-r-00000的内容如下：

13480253104 3   3   180 18013502468823 57  102 7335    11034913560439658 33  24  2034    589213600217502 18  138 1080    18685213602846565 15  12  1938    291013660577991 24  9   6960    69013719199419 4   0   240 013726230503 24  27  2481    2468113760778710 2   2   120 12013823070001 6   3   360 18013826544101 4   0   264 013922314466 12  12  3008    372013925057413 69  63  11058   4824313926251106 4   0   240 013926435656 2   4   132 151215013685858 28  27  3659    353815920133257 20  20  3156    293615989002119 3   3   1938    18018211575961 15  12  1527    210618320173382 21  18  9531    2412

而part-r-00001内容如下：

84138413    20  16  4116    1432

至此，我们完成了前面要求。分区的主要用处如下：
第一、根据业务需要，产生多个输出文件。
第二、多个reduce任务在并发运行，可以提高整体job的运行效率。

六、map和reduce的数量设置

map和reduce是Hadoop的核心功能，hadoop正是通过多个map和reduce的并行运行来实现任务的分布式并行计算，从这个观点来看，如果将map和reduce的数量设置为1，那么用户的任务就没有并行执行，但是map和reduce的数量也不能过多，数量过多虽然可以提高任务并行度，但是太多的map和reduce也会导致整个hadoop框架因为过度的系统资源开销而使任务失败。所以用户在提交map/reduce作业时应该在一个合理的范围内，这样既可以增强系统负载匀衡，也可以降低任务失败的开销。那么该如何设置呢？

1、map的数量

map的数量通常是由hadoop集群的DFS块大小确定的，也就是输入文件的总块数，正常的map数量的并行规模大致是每一个Node是10~100个，对于CPU消耗较小的作业可以设置Map数量为300个左右，但是由于hadoop的每一个任务在初始化时需要一定的时间，因此通常情况是每个map执行的时间至少超过1分钟。
具体的数据分片是这样的，InputFormat在默认情况下会根据hadoop集群的DFS块大小进行分片，每一个分片会由一个map任务来进行处理，当然用户还是可以通过参数mapred.min.split.size参数在作业提交客户端进行自定义设置。
还有一个重要参数就是mapred.map.tasks，这个参数设置的map数量仅仅是一个提示，只有当InputFormat决定了map任务的个数比mapred.map.tasks值小时才起作用。同样，Map任务的个数也能通过使用JobConf 的conf.setNumMapTasks(int num)方法来手动地设置。这个方法能够用来增加map任务的个数，但是不能设定任务的个数小于Hadoop系统通过分割输入数据得到的值。当然为了提高集群的并发效率，可以设置一个默认的map数量，当用户的map数量较小或者比本身自动分割的值还小时可以使用一个相对较大的默认值，从而提高整体hadoop集群的效率。

2、reduece的数量

reduce在运行时往往需要从相关map端复制数据到reduce节点来处理，因此相比于map任务。reduce节点资源是相对比较缺少的，同时相对运行较慢，正确的reduce任务的个数应该是0.95或者1.75 *（节点数 ×mapred.tasktracker.tasks.maximum参数值）。如果任务数是节点个数的0.95倍，那么所有的reduce任务能够在 map任务的输出传输结束后同时开始运行。如果任务数是节点个数的1.75倍，那么高速的节点会在完成他们第一批reduce任务计算之后开始计算第二批 reduce任务，这样的情况更有利于负载均衡。
同时需要注意增加reduce的数量虽然会增加系统的资源开销，但是可以改善负载匀衡，降低任务失败带来的负面影响。同样，Reduce任务也能够与 map任务一样，通过设定JobConf 的conf.setNumReduceTasks(int num)方法来增加任务个数。

3、reduce数量为0

有些作业不需要进行归约进行处理，那么就可以设置reduce的数量为0来进行处理，这种情况下用户的作业运行速度相对较高，map的输出会直接写入到 SetOutputPath(path)设置的输出目录，而不是作为中间结果写到本地。同时Hadoop框架在写入文件系统前并不对之进行排序。map red.tasktracker.map.tasks.maximum 这个是一个task tracker中可同时执行的map的最大个数，默认值为2，

4、总结

Map个数取决于文件分块的个数，可以手动设置Map的数量，但是必须不能小于文件分块的数量。
Reduce个数一般为0.95或1.75×（节点数 ×mapred.tasktracker.tasks.maximum参数值），map red.tasktracker.map.tasks.maximum 这个是一个task tracker中可同时执行的map的最大个数，默认值为2。也可以手动设置，增加Reduce个数。当不需要规约处理时，可以设置Reduce数量为0。

七、hadoop中的压缩

其实在进行shuffle过程中，我们可以在map端在写磁盘的时候采用压缩的方式将map的输出结果进行压缩，这也是一个减少网络开销很有效的方法。在Hadoop中已为我们提供了一些压缩算法的实现。

1、Codec

Codec是Hadoop中关于压缩，解压缩的算法的实现，在Hadoop中，codec由CompressionCode的实现来表示。下面是一些常见压缩算法实现，如下图所示：
这里写图片描述
其每种压缩算法的特点如下：

2、在MapReduce中使用压缩

输入的文件的压缩
如果输入的文件是压缩过的，那么在被MapReduce读取时，它们会被自动解压，根据文件扩展名来决定应该使用哪一个压缩解码器。
MapReduce作业的输出的压缩
MapReduce的输出属性如下所示。如果要压缩MapReduce作业的输出，请在作业配置文件中将mapred.output.compress属性设置为true。将mapred.output.compression.codec属性设置为自己打算使用的压缩编码/解码器的类名。如果为输出使用了一系列文件，可以设置mapred.output.compression.type属性来控制压缩类型，默认为RECORD，它压缩单独的记录。将它改为BLOCK，则可以压缩一组记录。由于它有更好的压缩比，所以推荐使用。
map作业输出结果的压缩
即使MapReduce应用使用非压缩的数据来读取和写入，我们也可以受益于压缩map阶段的中间输出。因为map作业的输出会被写入磁盘并通过网络传输到reducer节点，所以如果使用LZO之类的快速压缩，能得到更好的性能，因为传输的数据量大大减少了。以下代码显示了启用rnap输出压缩和设置压缩格式的配置属性。
示例：

        Configuration conf = new Configuration();        //map端输出压缩        conf.setBoolean("mapred.compress.map.output", true);        //reduce端输出压缩        conf.setBoolean("mapred.output.compress", true);        //设置压缩格式        conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);        Job job = new Job(conf, "word count");

阅读全文

0 0