HDPCD-Java Review Notes (2)

Source: Internet | Published by: 客户资料搜集软件 | Editor: 程序博客网 | Time: 2024/05/18 01:51

2. Writing MapReduce Applications

A MapReduce program consists of two main phases:

Map phase -- Data is input into the Mapper, where it is transformed and prepared for the Reducer.

Reduce phase -- Retrieves the data from the Mapper and performs the desired computations or analyses.

To write a MapReduce program, you define a Mapper class to handle the map phase and a Reducer class to handle the reduce phase.

After all of the Mappers finish executing, the intermediate <key, value> pairs go through a shuffle and sort phase, in which all the values that share a key are grouped together and sent to the same Reducer.
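The flow described above (map, then shuffle and sort, then reduce) can be modelled in plain Java without a cluster. The sketch below is only an in-memory illustration: the class name `LocalWordCountSketch` and its methods are hypothetical and are not part of the Hadoop API, and a `TreeMap` stands in for the framework's grouping and sorting by key.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of map -> shuffle/sort -> reduce (no Hadoop classes).
class LocalWordCountSketch {

    // "Map": emit a (word, 1) pair for every whitespace-separated token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // "Shuffle and sort": group all values by key; the TreeMap keeps keys sorted.
    static TreeMap<String, List<Integer>> shuffleAndSort(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // "Reduce": sum the values that were grouped under one key.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three steps in order over all input lines.
    static TreeMap<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        TreeMap<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffleAndSort(intermediate).entrySet()) {
            output.put(group.getKey(), reduce(group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(run(Arrays.asList("to be or", "not to be")));
    }
}
```

On a real cluster the grouping step is distributed across machines, but the contract seen by the reduce step is the same: one key together with all of its values.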

The number of Mappers is determined by the InputFormat: a MapReduce job runs one map task per input split.
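To make the relationship between file size and map tasks concrete, here is a rough arithmetic sketch for a single splittable file. It assumes the split size is the block size clamped between a minimum and a maximum, and that the last split may run up to 10% over the split size, which is how file-based input formats are usually described. The class `SplitCountSketch` and its method names are hypothetical; a real InputFormat may behave differently (for example with non-splittable compressed files or many small files).

```java
// Hypothetical model of split counting for one splittable file (no Hadoop classes).
class SplitCountSketch {

    // Assumed slack factor: the final split may be up to 10% larger than splitSize.
    static final double SPLIT_SLOP = 1.1;

    // Split size is the block size, clamped to the [minSize, maxSize] range.
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Counts how many splits (and therefore map tasks) a file of fileLength bytes yields.
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long remaining = fileLength;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // the tail of the file becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long size = splitSize(128 * mb, 1, Long.MAX_VALUE);
        System.out.println(countSplits(300 * mb, size)); // prints 3
        System.out.println(countSplits(130 * mb, size)); // prints 1
    }
}
```

Under these assumptions a 130 MB file with 128 MB blocks is handled by one map task rather than two, because the 2 MB tail falls within the slack.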

The number of Reducers is determined by the MapReduce job configuration, specifically the mapreduce.job.reduces property.

A Partitioner determines which <key, value> pairs are sent to which Reducer.
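As an illustration of what a Partitioner computes, the sketch below applies the hash-modulo rule used by Hadoop's default HashPartitioner: clear the sign bit of the key's hash code, then take the remainder by the number of reduce tasks. The class `HashPartitionSketch` is hypothetical, and `String.hashCode()` stands in for the hash of a Writable key, so the partition numbers printed here would not match those on a real cluster.

```java
// Hypothetical stand-alone version of hash partitioning (no Hadoop classes).
class HashPartitionSketch {

    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int partitionFor(int keyHash, int numReduceTasks) {
        return (keyHash & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Convenience overload; String.hashCode() is an assumption made for this sketch.
    static int partitionFor(String key, int numReduceTasks) {
        return partitionFor(key.hashCode(), numReduceTasks);
    }

    public static void main(String[] args) {
        int reducers = 3;
        for (String word : new String[] {"hadoop", "yarn", "hadoop", "tez"}) {
            // Equal keys always map to the same partition number.
            System.out.println(word + " -> reducer " + partitionFor(word, reducers));
        }
    }
}
```

Because the partition depends only on the key, every pair with the same key lands on the same Reducer, which is what the shuffle relies on.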

A Combiner can optionally be configured to combine the output of the Mapper, which can improve performance by reducing the network traffic of the shuffle and sort phase.
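The saving from a Combiner is easiest to see by counting records. The following sketch is a plain-Java model with a hypothetical class name, `CombinerSketch`: it takes the keys one mapper would emit as (word, 1) pairs and aggregates them locally, so fewer records need to cross the network. In a real word count job the Reducer class can typically double as the Combiner (via `job.setCombinerClass`), because summing is associative and commutative.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of map-side combining for word count (no Hadoop classes).
class CombinerSketch {

    // Each key represents one (word, 1) record emitted by a single mapper.
    // The combiner collapses them into one (word, partialSum) record per word.
    static Map<String, Integer> combine(List<String> mapperOutputKeys) {
        Map<String, Integer> combined = new TreeMap<>();
        for (String word : mapperOutputKeys) {
            combined.merge(word, 1, Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        List<String> emitted = Arrays.asList("the", "cat", "the", "the");
        Map<String, Integer> combined = combine(emitted);
        System.out.println("records before combining: " + emitted.size());  // 4
        System.out.println("records after combining:  " + combined.size()); // 2
        System.out.println(combined); // {cat=1, the=3}
    }
}
```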

The Key/Value Pairs of MapReduce

The Word Count Example

The "Hello, World" of MapReduce jobs is word count, a job that takes a text file as input and outputs every word in the file along with the number of occurrences of each word.

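import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <word, 1> for every word found in each input line.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text outputKey = new Text(); // reused to avoid per-word allocation

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // key is the byte offset of the line; value is the line itself
        String currentLine = value.toString();
        for (String word : currentLine.split("\\s+")) {
            if (word.isEmpty()) {
                continue; // leading whitespace produces an empty token
            }
            outputKey.set(word);
            context.write(outputKey, ONE);
        }
    }
}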

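import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the counts received for each word and emits <word, total>.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        IntWritable outputValue = new IntWritable(sum);
        context.write(key, outputValue);
    }
}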

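import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

// Driver: configures the word count job and submits it.
public class WordCountJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        Job job = Job.getInstance(getConf(), "WordCountJob");
        job.setJarByClass(getClass());

        // args[0] is the input path; args[1] is the output directory
        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // Types of the intermediate (map output) and final (reduce output) pairs
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) {
        try {
            int result = ToolRunner.run(new Configuration(),
                    new WordCountJob(), args);
            System.exit(result);
        } catch (Exception e) {
            e.printStackTrace();
            System.exit(1); // report failure to the caller
        }
    }
}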

Running a MapReduce Job:

1. Put the input files into HDFS.

2. If the output directory exists, delete it.

3. Use hadoop to execute the job.

4. View the output files.

hadoop jar wordcount.jar my.WordCountJob input/file.txt result

The Map Phase

Output Memory Buffer (Mapper): a circular buffer that holds the Mapper's output in already-serialized form

The size of the Mapper's output memory buffer is configurable with the mapreduce.task.io.sort.mb property. A spill, in which the buffer contents are written to disk, occurs when the buffer reaches a certain capacity configured by the mapreduce.map.sort.spill.percent property.
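The two properties combine by simple multiplication, as the sketch below shows. It assumes the commonly documented defaults of 100 (MB) for mapreduce.task.io.sort.mb and 0.80 for mapreduce.map.sort.spill.percent; the class `SpillThresholdSketch` is hypothetical and only does the arithmetic, it does not read a real job configuration.

```java
// Hypothetical helper that computes when a map-side spill would be triggered.
class SpillThresholdSketch {

    // sortMb models mapreduce.task.io.sort.mb (buffer size in megabytes);
    // spillPercent models mapreduce.map.sort.spill.percent (fraction between 0 and 1).
    static long spillThresholdBytes(int sortMb, double spillPercent) {
        long bufferBytes = (long) sortMb * 1024 * 1024;
        return Math.round(bufferBytes * spillPercent);
    }

    public static void main(String[] args) {
        // Assumed defaults: 100 MB buffer, spill at 80% full.
        System.out.println(spillThresholdBytes(100, 0.80)); // prints 83886080 (80 MB)
    }
}
```

Raising the buffer size or the spill percentage means fewer spills per map task, at the cost of more memory held by each Mapper.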

The Reduce Phase

The reduce phase can actually be broken down into three phases:

Shuffle -- Also referred to as the fetch phase, this is when Reducers retrieve the output of the Mappers using Netty. All records with the same key are sent to the same Reducer.

Sort -- This phase happens simultaneously with the shuffle phase. As the records are fetched and merged, they are sorted by key.

Reduce -- The reduce method is invoked once for each key, with that key's values presented as an iterable collection.
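The three phases above can be pictured as a k-way merge of already-sorted map outputs, followed by one reduce step per key. The sketch below is a plain-Java approximation under that assumption: `MergeAndReduceSketch`, `Cursor`, and `pair` are hypothetical names, a priority queue stands in for the framework's merge, and the "reduce" is a running sum as in word count.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical model of the reduce side: merge sorted segments, then sum per key.
class MergeAndReduceSketch {

    // Tracks the read position inside one sorted segment fetched from a mapper.
    static final class Cursor {
        final List<Map.Entry<String, Integer>> segment;
        int position = 0;

        Cursor(List<Map.Entry<String, Integer>> segment) {
            this.segment = segment;
        }

        Map.Entry<String, Integer> current() {
            return segment.get(position);
        }

        String currentKey() {
            return current().getKey();
        }
    }

    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Each input list must already be sorted by key, as map outputs are.
    static Map<String, Integer> mergeAndReduce(List<List<Map.Entry<String, Integer>>> sortedSegments) {
        Comparator<Cursor> byKey = Comparator.comparing(Cursor::currentKey);
        PriorityQueue<Cursor> heap = new PriorityQueue<>(byKey);
        for (List<Map.Entry<String, Integer>> segment : sortedSegments) {
            if (!segment.isEmpty()) {
                heap.add(new Cursor(segment));
            }
        }
        // Keys come off the heap in sorted order, so insertion order is key order.
        Map<String, Integer> reduced = new LinkedHashMap<>();
        while (!heap.isEmpty()) {
            Cursor cursor = heap.poll();
            Map.Entry<String, Integer> record = cursor.current();
            reduced.merge(record.getKey(), record.getValue(), Integer::sum);
            cursor.position++;
            if (cursor.position < cursor.segment.size()) {
                heap.add(cursor); // re-queue under the cursor's next key
            }
        }
        return reduced;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> fromMapper1 = Arrays.asList(pair("be", 1), pair("to", 2));
        List<Map.Entry<String, Integer>> fromMapper2 =
                Arrays.asList(pair("be", 1), pair("not", 1), pair("or", 1));
        // Prints {be=2, not=1, or=1, to=2}
        System.out.println(mergeAndReduce(Arrays.asList(fromMapper1, fromMapper2)));
    }
}
```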

A few open-source projects that are currently being ported onto YARN for use in Hadoop 2.x:

Tez -- Improves the execution of MapReduce jobs.

Slider -- For deploying existing distributed applications onto YARN.

Storm -- For real-time computing.

Spark -- A MapReduce-like cluster computing framework designed for low-latency iterative jobs and interactive use from an interpreter.

OpenMPI -- A high-performance message passing library that implements MPI-2.

Apache Giraph -- A graph processing platform.

YARN Components