MapReduce Key Points

Source: Internet  Editor: 程序博客网  Time: 2024/04/30 04:44

This article draws on Chapter 2, "MapReduce", of Hadoop: The Definitive Guide, 3rd edition.

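MapReduce: a distributed data processing model and execution environment.

HDFS: a distributed filesystem.

Hive: a distributed data warehouse. It manages data stored in HDFS and provides a SQL-based query language, which the runtime engine translates into MapReduce jobs.

HBase: a distributed, column-oriented database. It uses HDFS for underlying storage and supports both batch-style MapReduce computation and point queries (random reads).

ZooKeeper: a distributed, highly available coordination service.

Sqoop: a tool for efficient bulk data transfer between relational databases and HDFS.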

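MapReduce has two phases: the map phase and the reduce phase. Each phase takes key-value pairs as input and produces key-value pairs as output. A map class extends Mapper and overrides map(); a reduce class extends Reducer and overrides reduce().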
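The key-value contract of the two phases can be sketched without Hadoop at all. The following is a minimal in-memory illustration, not the Hadoop API: the class name KeyValueFlow, the "year,temperature" record layout, and the sample records are all assumptions made for this example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class KeyValueFlow {
    // "map": turn one input record into a (year, temperature) pair.
    // The "year,temperature" layout is an assumption for this sketch.
    static Map.Entry<String, Integer> map(String record) {
        String[] fields = record.split(",");
        return new AbstractMap.SimpleEntry<>(fields[0], Integer.parseInt(fields[1]));
    }

    // "reduce": collapse all values that share a key into one value (the maximum).
    static int reduce(String key, List<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int v : values) {
            max = Math.max(max, v);
        }
        return max;
    }

    // Run map over every record, group the pairs by key, then reduce each group.
    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String record : records) {
            Map.Entry<String, Integer> kv = map(record);
            grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("1949,111", "1950,0", "1949,78", "1950,22", "1950,-11");
        System.out.println(run(records)); // prints {1949=111, 1950=22}
    }
}
```

In a real job the grouping step is done by the framework between the two phases; only map() and reduce() are written by hand.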

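Hadoop divides a job into tasks, of which there are two kinds: map tasks and reduce tasks.

Two kinds of nodes control job execution: the jobtracker and the tasktrackers.

1. The jobtracker coordinates the job by scheduling tasks to run on tasktrackers.

2. Tasktrackers run tasks and report their progress to the jobtracker.

3. The jobtracker keeps a record of the progress of every task; if a task fails, the jobtracker reschedules it on a different tasktracker.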
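The rescheduling behaviour in point 3 can be pictured with a toy model. This is only a sketch under stated assumptions: RescheduleSketch, runTask, and the tracker names tt1 to tt3 are hypothetical, failures are supplied as a fixed list instead of being detected, and real Hadoop scheduling (heartbeats, attempt limits, data locality) is not modelled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class RescheduleSketch {
    // Try the task on each tracker in turn and return the tracker that ran it.
    // failingTrackers lists the trackers on which the task is assumed to fail.
    static String runTask(String task, List<String> trackers,
                          List<String> failingTrackers, List<String> log) {
        for (String tracker : trackers) {
            if (failingTrackers.contains(tracker)) {
                log.add(task + " failed on " + tracker);
                continue; // the coordinator reschedules on a different tracker
            }
            log.add(task + " succeeded on " + tracker);
            return tracker;
        }
        return null; // every tracker failed
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        String ranOn = runTask("map-0", Arrays.asList("tt1", "tt2", "tt3"),
                Arrays.asList("tt1"), log);
        for (String line : log) {
            System.out.println(line);
        }
        System.out.println("final tracker: " + ranOn); // prints final tracker: tt2
    }
}
```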

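About input splits:

1. MapReduce divides the input into fixed-size splits and creates one map task for each split.

2. The finer the splits, the better the load balancing; if splits are too small, though, the overhead of managing splits and creating map tasks starts to dominate.

3. A good split size tends to be the size of an HDFS block, 64 MB by default (128 MB in newer Hadoop versions). If a split spanned two blocks, it is unlikely that any single HDFS node would store both, so part of the split would have to be transferred over the network, using up valuable bandwidth.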
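The arithmetic behind the split points above can be sketched as follows. It assumes the split size is chosen as max(minSize, min(maxSize, blockSize)), which collapses to the block size under default bounds, and it uses plain ceiling division for the split count; Hadoop's own split computation may let the last split run slightly larger, so treat the counts as approximate. SplitSizeSketch and its method names are made up for this example.

```java
class SplitSizeSketch {
    // Assumed rule: the block size, clamped to the [minSize, maxSize] range.
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Number of splits (and therefore map tasks) for one file, by ceiling division.
    static long numSplits(long fileSize, long splitSize) {
        return (fileSize + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long size = splitSize(64 * mb, 1, Long.MAX_VALUE);
        System.out.println(size / mb);                  // prints 64
        System.out.println(numSplits(1024 * mb, size)); // prints 16
        System.out.println(numSplits(200 * mb, size));  // prints 4
    }
}
```

With a 64 MB split size, a 200 MB file yields three full splits plus one 8 MB remainder, so four map tasks.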

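Map output is written to the local disk, not to HDFS. Map output is intermediate: it can be discarded once the job completes, so storing it in HDFS with replication would be overkill.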

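Data flow with multiple reduce tasks:

1. MapReduce divides the input into fixed-size splits and creates one map task for each split.

2. When there are multiple reduce tasks, each map task partitions its output, creating one partition per reduce task.

3. The sorted map output is transferred over the network to the nodes running the reduce tasks.

4. On the reduce side the data is merged and then passed to the reduce function.

5. Reduce output is stored in HDFS: the first replica on the local node, the other replicas on other nodes.

Note: the data flow between map and reduce tasks is known as the shuffle.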
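Steps 2 to 4 can be imitated in memory. The sketch below is a simplification, not Hadoop code: ShuffleSketch and the sample pairs are assumptions, and the hash-modulo rule in partition() is, to my knowledge, the same one Hadoop's default HashPartitioner applies. A sorted map per partition stands in for the sort and merge steps.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

class ShuffleSketch {
    // Choose the reduce task for a key: non-negative hash modulo the reducer count.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Route each (key, value) pair to its partition. Within a partition the
    // TreeMap keeps keys sorted and collects all values of one key together.
    static List<TreeMap<String, List<Integer>>> shuffle(List<String[]> mapOutput,
                                                        int numReduceTasks) {
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            partitions.add(new TreeMap<String, List<Integer>>());
        }
        for (String[] kv : mapOutput) {
            int p = partition(kv[0], numReduceTasks);
            partitions.get(p)
                    .computeIfAbsent(kv[0], k -> new ArrayList<>())
                    .add(Integer.parseInt(kv[1]));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = Arrays.asList(
                new String[] {"1950", "0"},
                new String[] {"1949", "111"},
                new String[] {"1950", "22"},
                new String[] {"1949", "78"});
        List<TreeMap<String, List<Integer>>> partitions = shuffle(mapOutput, 2);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("reducer " + i + " receives " + partitions.get(i));
        }
    }
}
```

Because the partition depends only on the key, every value for a given year ends up at the same reducer, which is what lets each reducer compute a complete result for its keys.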

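Combiner functions

1. A combiner reduces the amount of data transferred between map and reduce by running a reduce-like step on the map side first.

2. The combiner's output forms the input to the reducer.

3. A combiner is an optimization: Hadoop does not guarantee how many times it is called (possibly zero), so the reducer's output must be the same regardless.

4. A combiner does not replace the reduce function: reduce is still needed to process records with the same key coming from different map tasks.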
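Point 3 restricts which functions are safe to use as a combiner. The sketch below checks this with plain Java: taking the maximum of per-map maxima gives the same answer as taking the maximum of all values, while taking the mean of per-map means does not. CombinerSketch and the sample readings are illustrative assumptions.

```java
import java.util.Arrays;
import java.util.List;

class CombinerSketch {
    static double max(List<Double> values) {
        double result = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            result = Math.max(result, v);
        }
        return result;
    }

    static double mean(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    public static void main(String[] args) {
        // Readings for one year, produced by two different map tasks.
        List<Double> map1 = Arrays.asList(0.0, 20.0, 10.0);
        List<Double> map2 = Arrays.asList(25.0, 15.0);
        List<Double> all = Arrays.asList(0.0, 20.0, 10.0, 25.0, 15.0);

        // max: combining per map task first does not change the result.
        System.out.println(max(all));                                  // prints 25.0
        System.out.println(max(Arrays.asList(max(map1), max(map2))));  // prints 25.0

        // mean: combining per map task first changes the result.
        System.out.println(mean(all));                                  // prints 14.0
        System.out.println(mean(Arrays.asList(mean(map1), mean(map2)))); // prints 15.0
    }
}
```

For the max-temperature job this means the reducer class itself can serve as the combiner; in the Hadoop API it is registered on the job with setCombinerClass.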

Key code for a MapReduce job:


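// MaxTempMapper.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (year, temperature) for each input line. With the default TextInputFormat
// the input key is the line's byte offset, so its type is LongWritable.
public class MaxTempMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        // Fixed-width record: the column offsets depend on the input file layout.
        String year = line.substring(21, 25);
        int temperature = Integer.parseInt(line.substring(34, 36));
        context.write(new Text(year), new IntWritable(temperature));
    }
}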

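// MaxTempReducer.java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Emits the maximum temperature seen for each year.
public class MaxTempReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxTemp = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxTemp = Math.max(value.get(), maxTemp);
        }
        context.write(key, new IntWritable(maxTemp));
    }
}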

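// MaxTemp.java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemp {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemp <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job(); // deprecated in Hadoop 2+, where Job.getInstance() is preferred
        job.setJarByClass(MaxTemp.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTempMapper.class);
        job.setReducerClass(MaxTempReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Submit the job and wait for it to finish; without this call nothing runs.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}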

0 0