Learning Hadoop (1)

Source: Internet | Published by: 自然语言分析 python | Editor: 程序博客网 | Time: 2024/06/03 21:54

1 Java MapReduce: Introduction and Examples

1.1 Mapper

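The map function is represented by Mapper, which declares a map() method. Mapper is an interface in the old API and a class in the new API. An example follows; the old-API equivalents are kept as comments for comparison: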

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Old API:
// class OldMaxTemperatureMapper extends MapReduceBase
//     implements Mapper<LongWritable, Text, Text, IntWritable> {
public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Sentinel value used in the data for a missing temperature reading.
    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Old API:
        // public void map(LongWritable key, Text value,
        //         OutputCollector<Text, IntWritable> output, Reporter reporter)
        //         throws IOException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') {
            // parseInt doesn't like leading plus signs (true before Java 7)
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        // Emit only readings that are present and have an acceptable quality code.
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
            // Old API:
            // output.collect(new Text(year), new IntWritable(airTemperature));
        }
    }
}

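Mapper is a generic type with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. Rather than using built-in Java types directly, Hadoop provides its own set of basic types that are optimized for network serialization. They are found in the org.apache.hadoop.io package. This example uses LongWritable (corresponding to Java's Long), Text (corresponding to Java's String), and IntWritable (corresponding to Java's Integer).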

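The map() method is passed a key and a value. We first convert the Text value, which contains one line of input, into a Java String, and then use substring() to extract the columns we are interested in.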
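The field extraction can be tried without a cluster. The sketch below applies the same substring offsets as the mapper to a synthetic record. The class name RecordParser, the helper syntheticLine(), and the sample values are assumptions made up for illustration; they are not real weather data and not part of Hadoop.

```java
import java.util.Arrays;

public class RecordParser {
    static final int MISSING = 9999;

    /** Returns the year stored at character offsets 15..18 of the record. */
    static String year(String line) {
        return line.substring(15, 19);
    }

    /** Returns the signed temperature stored at character offsets 87..91. */
    static int airTemperature(String line) {
        // Skip an explicit plus sign, as the mapper does.
        int start = (line.charAt(87) == '+') ? 88 : 87;
        return Integer.parseInt(line.substring(start, 92));
    }

    /** True when the reading is present and its quality code is acceptable. */
    static boolean isValid(String line) {
        String quality = line.substring(92, 93);
        return airTemperature(line) != MISSING && quality.matches("[01459]");
    }

    /** Builds a 93-character synthetic record (hypothetical helper, illustrative only). */
    static String syntheticLine(String year, String signedTemp, char quality) {
        char[] filler = new char[93];
        Arrays.fill(filler, '0');
        StringBuilder sb = new StringBuilder(new String(filler));
        sb.replace(15, 19, year);        // four-character year field
        sb.replace(87, 92, signedTemp);  // sign followed by four digits
        sb.setCharAt(92, quality);       // one-character quality code
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = syntheticLine("1950", "+0022", '1');
        System.out.println(year(line) + "\t" + airTemperature(line) + "\t" + isValid(line));
    }
}
```

Keeping the parsing in small static methods like this also makes it straightforward to unit-test separately from the mapper.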

In the old API, map() is also passed an OutputCollector instance, to which the output is written.

In the new API, map() is passed a Context instance instead, and output is written with context.write(new Text(year), new IntWritable(airTemperature)); this replaces the old API's output.collect(...) call.

1.2 Reduce

The reduce function is defined in a similar way, using Reducer.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Find the highest temperature recorded for this year.
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}

// Old API:
// static class OldMaxTemperatureReducer extends MapReduceBase
//         implements Reducer<Text, IntWritable, Text, IntWritable> {
//     @Override
//     public void reduce(Text key, Iterator<IntWritable> values,
//             OutputCollector<Text, IntWritable> output, Reporter reporter)
//             throws IOException {
//         int maxValue = Integer.MIN_VALUE;
//         while (values.hasNext()) {
//             maxValue = Math.max(maxValue, values.next().get());
//         }
//         output.collect(key, new IntWritable(maxValue));
//     }
// }

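Again, four formal type parameters specify the input and output types, this time for the reduce function. The input types of the reduce function must match the output types of the map function: Text and IntWritable. In this case the output types of the reduce function are also Text and IntWritable, for the year and its maximum temperature respectively.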
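To see what the reducer receives, the sketch below groups (year, temperature) pairs by key in memory and then takes the maximum of each group, mirroring the loop in reduce(). It uses plain Java collections rather than Hadoop types; the class name LocalMaxTemperature and the sample readings are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LocalMaxTemperature {

    /** Groups (key, value) pairs by key, as the framework does before calling reduce(). */
    static Map<String, List<Integer>> groupByKey(String[][] pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(pair[1]));
        }
        return grouped;
    }

    /** Same logic as the body of MaxTemperatureReducer.reduce(). */
    static int reduce(Iterable<Integer> values) {
        int maxValue = Integer.MIN_VALUE;
        for (int value : values) {
            maxValue = Math.max(maxValue, value);
        }
        return maxValue;
    }

    public static void main(String[] args) {
        // Made-up map output: (year, temperature) pairs.
        String[][] mapOutput = {
            {"1950", "0"}, {"1950", "22"}, {"1950", "-11"},
            {"1949", "111"}, {"1949", "78"}
        };
        for (Map.Entry<String, List<Integer>> e : groupByKey(mapOutput).entrySet()) {
            System.out.println(e.getKey() + "\t" + reduce(e.getValue()));
        }
    }
}
```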

1.3 MapReduce

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        // Note: new Job() is deprecated in Hadoop 2.x in favor of Job.getInstance().
        Job job = new Job();
        job.setJarByClass(MaxTemperature.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Submit the job, wait for it to finish, and exit with its status.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

// Old API:
// public static void main(String[] args) throws IOException {
//     if (args.length != 2) {
//         System.err.println("Usage: OldMaxTemperature <input path> <output path>");
//         System.exit(-1);
//     }
//     JobConf conf = new JobConf(OldMaxTemperature.class);
//     conf.setJobName("Max temperature");
//     FileInputFormat.addInputPath(conf, new Path(args[0]));
//     FileOutputFormat.setOutputPath(conf, new Path(args[1]));
//     conf.setMapperClass(OldMaxTemperatureMapper.class);
//     conf.setReducerClass(OldMaxTemperatureReducer.class);
//     conf.setOutputKeyClass(Text.class);
//     conf.setOutputValueClass(IntWritable.class);
//     JobClient.runJob(conf);
// }
// }

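The following applies to the old API. A JobConf object forms the specification of the job and gives us control over how the job is run. When we run the job on a Hadoop cluster, we package the code into a JAR file (which Hadoop distributes around the cluster). Rather than naming the JAR file explicitly, we pass a class to the JobConf constructor, and Hadoop locates the relevant JAR by looking for the JAR file that contains that class.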

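In the new API, the job is defined and initialized more concisely with Job job = new Job(); and the class used to locate the JAR is passed separately through job.setJarByClass(MaxTemperature.class).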

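Next, we specify the input and output paths. The input path is set by calling the static method addInputPath() on FileInputFormat. It can be a single file, a directory (in which case every file in the directory forms the input), or a file pattern. As the name suggests, addInputPath() can be called more than once to use input from multiple paths. The output path is set by the static method setOutputPath() on FileOutputFormat. It names the directory into which the output files of the reduce function are written. The directory must not exist before the job runs; otherwise Hadoop reports an error and refuses to run the job. This precaution prevents data loss (it would be very annoying to accidentally overwrite the output of a long-running job).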

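Next, we specify the map and reduce classes with setMapperClass() and setReducerClass().
setOutputKeyClass() and setOutputValueClass() control the output types of the map and reduce functions, which are often the same, as they are in this example. If they differ, the map output types are set with setMapOutputKeyClass() and setMapOutputValueClass().
The input types are controlled by the InputFormat class. We have not set one in this example because we are using the default, TextInputFormat.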

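After setting the classes that define the map and reduce functions, we are ready to run the job. In the old API, the static method JobClient.runJob() submits the job and waits for it to finish, writing progress information to the console. In the new-API driver shown above, job.waitForCompletion(true) plays the same role: the true argument turns on progress output, and the boolean return value indicates whether the job succeeded.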

1.4 The New Java MapReduce API

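Hadoop release 0.20.0 includes a new Java MapReduce API, designed to make the API easier to evolve in the future. The new API is type-incompatible with the old one, so existing applications must be rewritten to take advantage of it. There are several notable differences between the new and old APIs.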

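The new API favors classes over interfaces, because classes are easier to evolve. For example, a method with a default implementation can be added to a base class without breaking existing subclasses. In the new API, Mapper and Reducer are classes to extend rather than interfaces to implement.
The new API is in the org.apache.hadoop.mapreduce package (and its subpackages). The old API remains in org.apache.hadoop.mapred.
The new API makes extensive use of context objects, which let user code communicate with the MapReduce system. MapContext, for example, essentially combines the roles of JobConf, OutputCollector, and Reporter.
The new API supports both "push" and "pull" styles of iteration. In both APIs, key/value records are pushed to the mapper, but the new API additionally allows a mapper to pull records from within the map() method. The same applies to the reducer. The benefit of the pull style is that records can be processed in batches rather than one at a time.
The new API unifies configuration. The old API configures jobs through a special JobConf object, which is an extension of Hadoop's Configuration object (used for configuring daemons; see the "API Configuration" section). The new API drops this distinction, and all job configuration is done through Configuration.
In the new API, job control is performed by the Job class rather than by JobClient, which does not exist in the new API.
Output files are named slightly differently. Map outputs are named part-m-nnnnn and reduce outputs are named part-r-nnnnn (where nnnnn is an integer part number, starting from 0).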
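As a small illustration of the naming scheme in the last point, the sketch below formats a part number into a file name with five zero-padded digits. PartFileNames and partFileName() are hypothetical names used only for this example; Hadoop generates these names itself.

```java
import java.util.Locale;

public class PartFileNames {

    /** Formats an output file name; kind is "m" for map output or "r" for reduce output. */
    static String partFileName(String kind, int partition) {
        // %05d pads the part number with leading zeros to five digits.
        return String.format(Locale.ROOT, "part-%s-%05d", kind, partition);
    }

    public static void main(String[] args) {
        System.out.println(partFileName("r", 0));
        System.out.println(partFileName("m", 12));
    }
}
```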

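TIPS: When converting Mapper and Reducer classes from the old API to the new one, remember to change the signatures of map() and reduce() to the new form. If you only change the class to extend the new Mapper or Reducer, the code still compiles without errors or warnings, because the new Mapper and Reducer classes provide their own default map() and reduce() methods. Your own mapper or reducer code is then never invoked, which leads to errors that are hard to diagnose. Annotating map() and reduce() with @Override lets the compiler catch the mismatch.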
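The pitfall can be reproduced in plain Java. In the sketch below, BaseMapper is a hypothetical stand-in for a framework base class with a default map(); it is not the Hadoop class, and the parameter types are simplified. BrokenMapper keeps a different parameter list, so it overloads rather than overrides, and the caller silently runs the default implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class SignaturePitfall {

    /** Hypothetical stand-in for a framework base class with a default map(). */
    static class BaseMapper {
        // Default implementation: pass the value through unchanged.
        public void map(String key, String value, List<String> context) {
            context.add(value);
        }
    }

    /** Intended to upper-case values, but its parameter list differs from the base class. */
    static class BrokenMapper extends BaseMapper {
        // No @Override here: this is an overload, so run() below never calls it.
        public void map(String key, String value, List<String> output, Object reporter) {
            output.add(value.toUpperCase());
        }
    }

    /** Matching signature; @Override makes the compiler verify that it overrides. */
    static class FixedMapper extends BaseMapper {
        @Override
        public void map(String key, String value, List<String> context) {
            context.add(value.toUpperCase());
        }
    }

    /** Plays the role of the framework, which only knows the base signature. */
    static List<String> run(BaseMapper mapper, String... values) {
        List<String> out = new ArrayList<>();
        for (String v : values) {
            mapper.map("k", v, out);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(new BrokenMapper(), "a", "b")); // default map() runs
        System.out.println(run(new FixedMapper(), "a", "b"));  // overridden map() runs
    }
}
```

Adding @Override to the method in BrokenMapper would turn the silent problem into a compile error.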

2. How Hadoop Works

2.1 Data Flow

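A MapReduce job is a unit of work. Hadoop divides a job into tasks, of which there are two types: map tasks and reduce tasks. Hadoop divides the input data into fixed-size pieces called input splits, or simply splits.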

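Hadoop creates one map task for each split. A good split size is the size of one HDFS block, typically 64 MB (the default in older releases). Hadoop tries to run each map task on a node that holds the input data locally, which avoids using cluster bandwidth. Map tasks also write their output to the local disk rather than to HDFS, because it is intermediate output, and this is clearly more efficient than going through HDFS.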
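As a rough back-of-the-envelope aid, the sketch below estimates how many splits, and therefore map tasks, a single file produces when the split size equals the 64 MB block size mentioned above. It is a simplified estimate under that assumption: SplitEstimate and estimateSplits() are hypothetical names, and the real input format handles details such as the final partial block in its own way.

```java
public class SplitEstimate {

    // 64 MB, the block size quoted in the text.
    static final long BLOCK_SIZE = 64L * 1024 * 1024;

    /** Rough number of splits (one map task each) for a single file. */
    static long estimateSplits(long fileSizeBytes, long splitSizeBytes) {
        if (fileSizeBytes <= 0) {
            return 0;
        }
        // Ceiling division: a trailing partial piece still needs its own split.
        return (fileSizeBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long oneGigabyte = 1024L * 1024 * 1024;
        System.out.println(estimateSplits(oneGigabyte, BLOCK_SIZE));
        System.out.println(estimateSplits(200L * 1024 * 1024, BLOCK_SIZE));
    }
}
```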

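Reduce tasks cannot take advantage of data locality: the input to a single reduce task is normally the output of all the mappers. The number of reduce tasks is not governed by the size of the input. It is chosen by the user, and choosing it well takes some care.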

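When there are multiple reduce tasks, each map task partitions its output, creating one partition for each reduce task. A partition can contain many keys (and their associated values), but all the records for any given key are in a single partition. Partitioning can be controlled by a user-defined partitioning function, but normally the default partitioner (sometimes called the "partitioning function" here), which buckets keys using a hash function, works very well. The general data flow with multiple reduce tasks is shown in the figure. The figure makes it clear why the data flow between map and reduce tasks is called the shuffle: each reduce task is fed by many map tasks. The shuffle is more complicated than the figure suggests, and tuning its parameters can have a very large effect on total job execution time.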
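The hash-based partitioning described above can be sketched in a few lines. The formula below mirrors the one used by Hadoop's default HashPartitioner, but it is applied here to plain String keys, so the partition numbers are illustrative and need not match what Hadoop computes for Text keys. HashPartitionDemo and the sample keys are assumptions for this example.

```java
public class HashPartitionDemo {

    /** Assigns a key to one of numReduceTasks partitions based on its hash code. */
    static int getPartition(String key, int numReduceTasks) {
        // Mask off the sign bit so that the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        String[] keys = {"1949", "1950", "1951", "1950"};
        for (String key : keys) {
            // Repeated keys always land in the same partition.
            System.out.println(key + " -> partition " + getPartition(key, numReduceTasks));
        }
    }
}
```

Because the partition depends only on the key and the number of reduce tasks, every record with the same key reaches the same reducer, whichever map task produced it.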

2.2 Combiner

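Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output; the combiner function's output then forms the input to the reduce function.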

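Because the combiner function is an optimization, Hadoop does not guarantee how many times it will call it for a particular map output record, if at all. In other words, calling the combiner zero, one, or many times must produce the same output from the reducer. The combiner can be thought of as a local pre-aggregation step between the mapper and the reducer: map output has to be transferred from the local disk of the map node across the network to the reducers, which is comparatively slow, so shrinking it first helps.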
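Whether a function is safe to use as a combiner can be checked with a little arithmetic. The sketch below, using made-up readings from two hypothetical map tasks, shows that taking the maximum per map task first gives the same result as taking it over all values, whereas doing the same with the mean does not. CombinerCheck is an illustrative name, not a Hadoop class.

```java
public class CombinerCheck {

    /** Maximum of the given values. */
    static int max(int... values) {
        int m = Integer.MIN_VALUE;
        for (int v : values) {
            m = Math.max(m, v);
        }
        return m;
    }

    /** Arithmetic mean of the given values. */
    static double mean(double... values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // Made-up output of two map tasks for the same year:
        // task 1 emits 0, 20, 10 and task 2 emits 25, 15.
        int direct = max(0, 20, 10, 25, 15);
        int combined = max(max(0, 20, 10), max(25, 15));
        System.out.println("max:  " + direct + " vs " + combined);

        double directMean = mean(0, 20, 10, 25, 15);
        double combinedMean = mean(mean(0, 20, 10), mean(25, 15));
        System.out.println("mean: " + directMean + " vs " + combinedMean);
    }
}
```

The maximum agrees in both cases, which is why the reducer class can double as the combiner in this job. The two means differ, so a mean cannot be combined this way.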

A combiner is specified as follows:

job.setCombinerClass(MaxTemperatureReducer.class);

0 0