MapReduce Program Templates (Using the New and Old APIs)

Source: Internet. Editor: 程序博客网. Time: 2024/05/18 03:03

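I have been learning MapReduce programming recently. After reading "Hadoop in Action" and "Hadoop: The Definitive Guide" closely, I finally got a MapReduce program of my own to run. MapReduce programs are usually written by modifying and extending a template, so I am posting my templates here.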

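One more key point: the MapReduce API changed with hadoop-0.20.0, as follows:

(1) The new API favors abstract classes over interfaces. In the new API, Mapper and Reducer are abstract classes.

(2) The new API lives in the org.apache.hadoop.mapreduce package and its subpackages; the old API lives in org.apache.hadoop.mapred. When programming, take care not to mix the two packages or pick the wrong one: every import in a program should come consistently from either the new package or the old one. I did not pay attention to this when I first started writing code and ran into errors, especially when creating the map and reduce classes and when configuring the job.

(3) The new API makes extensive use of context objects. For example, MapContext essentially unifies the roles of JobConf, OutputCollector, and Reporter.

(4) The new API supports both "push" and "pull" styles of iteration. Records are still pushed to the mapper, but a mapper can also pull records itself through the context.

(5) The new API unifies configuration. The old API uses a JobConf object for job configuration; in the new API, job configuration is done through Configuration.

(6) In the new API, job control is handled by the Job class; the old API uses JobClient. This is another thing to watch for when writing code.

The points above are taken from "Hadoop: The Definitive Guide".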
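Points (1), (3), and (4) can be illustrated without a cluster. The following is a plain-Java sketch, not Hadoop code: MiniContext and MiniMapper are hypothetical stand-ins, loosely modeled on the way the new API's Mapper drives iteration through its context (nextKeyValue, getCurrentKey, getCurrentValue, write). It only aims to show why an abstract base class plus a context object makes both iteration styles possible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class PushPullSketch {

    /** Hypothetical stand-in for a context object: input cursor plus output sink. */
    static class MiniContext {
        private final Iterator<String> input;
        private String current;
        private long lineNo = -1;
        final List<String> output = new ArrayList<>();

        MiniContext(List<String> lines) { this.input = lines.iterator(); }

        /** Advances to the next record; returns false when the input is exhausted. */
        boolean nextKeyValue() {
            if (!input.hasNext()) return false;
            current = input.next();
            lineNo++;
            return true;
        }
        long getCurrentKey() { return lineNo; }
        String getCurrentValue() { return current; }
        void write(String key, String value) { output.add(key + "\t" + value); }
    }

    /** An abstract class rather than an interface, so run() can carry a default body. */
    abstract static class MiniMapper {
        abstract void map(long key, String value, MiniContext context);

        // "Push" style: the default loop hands each record to map().
        void run(MiniContext context) {
            while (context.nextKeyValue()) {
                map(context.getCurrentKey(), context.getCurrentValue(), context);
            }
        }
    }

    /** Push: only map() is overridden. */
    static class UpperCaseMapper extends MiniMapper {
        @Override
        void map(long key, String value, MiniContext context) {
            context.write(Long.toString(key), value.toUpperCase());
        }
    }

    /** Pull: run() is overridden and fetches records itself, two at a time. */
    static class PairingMapper extends MiniMapper {
        @Override
        void map(long key, String value, MiniContext context) { /* not used */ }

        @Override
        void run(MiniContext context) {
            while (context.nextKeyValue()) {
                String first = context.getCurrentValue();
                String second = context.nextKeyValue() ? context.getCurrentValue() : "";
                context.write(first, second);
            }
        }
    }

    static List<String> runWith(MiniMapper mapper, List<String> lines) {
        MiniContext context = new MiniContext(lines);
        mapper.run(context);
        return context.output;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a", "b", "c");
        System.out.println(runWith(new UpperCaseMapper(), lines));
        System.out.println(runWith(new PairingMapper(), lines));
    }
}
```

With an interface-only design, as in the old API, there is no default run() loop to override, so a mapper only ever receives records one at a time.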

I have added comments in the code on the points I consider important; I hope they are useful.

Old API template: MyJob.java

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextOutputFormat;
/*
 * The imports differ between the old and new APIs: mapred has JobConf, while
 * mapreduce only has Job. FileInputFormat and FileOutputFormat exist in both
 * packages, so mixing them causes errors. Importing everything from a single
 * API fixes it. The new-API equivalents are left here for comparison only:
 */
//import org.apache.hadoop.mapreduce.Job;
//import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
//import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
//import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
//import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * @author napoleongjc
 * @version 1.0
 */
/*
 * OutputCollector is the general mechanism the Map/Reduce framework provides
 * for collecting the data output by a Mapper or Reducer (both intermediate
 * output and the job's final output).
 * Reporter is the mechanism a Map/Reduce application uses to report progress,
 * set application-level status messages, and update Counters.
 */
public class MyJob extends Configured implements Tool {

    // Memorize the class signatures of the Map and Reduce classes.
    public static class MapClass extends MapReduceBase
            implements Mapper<Text, Text, Text, Text> {

        @Override
        public void map(Text key, Text value,
                        OutputCollector<Text, Text> output,
                        Reporter reporter) throws IOException {
            output.collect(value, key); // emit the pair with key and value swapped
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        // <K3,V3> must implement Writable so that Hadoop's serialization can
        // ship the data across the cluster. Text satisfies this.
        @Override
        public void reduce(Text key, Iterator<Text> values,
                           OutputCollector<Text, Text> output,
                           Reporter reporter) throws IOException {
            StringBuilder csv = new StringBuilder();
            while (values.hasNext()) { // the old API passes values as a java.util.Iterator
                if (csv.length() > 0) {
                    csv.append(",");
                }
                csv.append(values.next().toString());
            }
            output.collect(key, new Text(csv.toString())); // wrap the joined string in a Text before emitting
        }
    }

    /**
     * run() is the core of the program, also called the driver. It instantiates
     * and configures a JobConf object, that is, it sets up the environment
     * for the whole job.
     */
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();

        /*
         * With the new API (mapreduce package), the setup would look like this:
         */
        //Job job = new Job(conf, "MyJob");
        //Path in = new Path(args[0]);
        //Path out = new Path(args[1]);
        //FileInputFormat.setInputPaths(job, in);
        //FileOutputFormat.setOutputPath(job, out);

        /*
         * The code below is the old-API (mapred package) version.
         */
        JobConf job = new JobConf(conf, MyJob.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setJobName("MyJob");            // name of the job
        job.setMapperClass(MapClass.class); // which class is the Mapper
        job.setReducerClass(Reduce.class);  // which class is the Reducer

        // How the job splits and reads its input files.
        job.setInputFormat(KeyValueTextInputFormat.class);
        // Separator between key and value on each input line.
        job.set("key.value.separator.in.input.line", ",");
        // Format of the files the job writes. This is the output of the whole
        // job, not only of the Reducer.
        job.setOutputFormat(TextOutputFormat.class);
        // Output key/value types. If the Mapper's output types differ from the
        // Reducer's, set them with setMapOutputKeyClass / setMapOutputValueClass.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        JobClient.runJob(job); // run the job
        return 0;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJob(), args);
        System.exit(res);
    }
}
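To see what this job computes before running it on a cluster, here is a minimal plain-Java sketch that imitates the same data flow locally. It uses no Hadoop classes: InvertLocalSketch and invert are hypothetical names, the TreeMap is only a stand-in for the shuffle, and the input lines are made-up sample data. It assumes, as KeyValueTextInputFormat does, that each line is split at the first separator and that a line without a separator becomes a key with an empty value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InvertLocalSketch {

    /** Splits each line at the first comma, swaps key and value, groups, and joins. */
    static Map<String, String> invert(List<String> lines) {
        // Stand-in for the shuffle: groups values by key, keys kept sorted.
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            int sep = line.indexOf(',');
            String key = sep < 0 ? line : line.substring(0, sep);
            String value = sep < 0 ? "" : line.substring(sep + 1);
            // "map" step: emit (value, key)
            grouped.computeIfAbsent(value, k -> new ArrayList<>()).add(key);
        }
        // "reduce" step: join all values of a key into one comma-separated string
        Map<String, String> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), String.join(",", entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("3,1", "4,1", "4,2");
        invert(lines).forEach((k, v) -> System.out.println(k + "\t" + v));
    }
}
```

In this sketch the values for a key keep their input order. On a real cluster the order in which values reach the reducer is not guaranteed, so the joined string may list the same items in a different order.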

New API template: MyJobNew.java

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * This template is meant to show the differences between the two API
 * versions, which are mainly in the Mapper, the Reducer, and the driver.
 *
 * @author napoleongjc
 */
public class MyJobNew extends Configured implements Tool {

    /*
     * First, the signatures of the Map and Reduce classes are different:
     * 1. They extend Mapper and Reducer directly, instead of extending
     *    MapReduceBase and implementing Mapper/Reducer.
     * 2. Intermediate output and Reducer output go through a Context object,
     *    not OutputCollector + Reporter.
     * 3. Key/value pairs are emitted with context.write(), not the old
     *    collector.collect().
     */
    public static class MapClass
            extends Mapper<LongWritable, Text, Text, Text> { // extends Mapper directly

        @Override
        public void map(LongWritable key, Text value, Context context) // output goes through the Context
                throws IOException, InterruptedException { // InterruptedException in addition to IOException
            // Assumes every input line has the form "first,second".
            String[] citation = value.toString().split(",");
            context.write(new Text(citation[1]), new Text(citation[0])); // emit the <K2,V2> pair
        }
    }

    /**
     * A Combiner partially merges the Mapper output first, which reduces the
     * network traffic between Mapper and Reducer and saves resources.
     * 1. It is defined through the Reducer class, so it looks essentially the
     *    same as the reduce function.
     * 2. To enable it, add setCombinerClass(xxxx.class) in run().
     */
    public static class Combine extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) // Iterable, unlike the old Iterator
                throws IOException, InterruptedException {
            StringBuilder csv = new StringBuilder();
            for (Text val : values) { // so the iteration style differs as well
                if (csv.length() > 0) {
                    csv.append(",");
                }
                csv.append(val.toString());
            }
            context.write(key, new Text(csv.toString())); // emit
        }
    }

    public static class Reduce extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context) // Iterable, unlike the old Iterator
                throws IOException, InterruptedException {
            StringBuilder csv = new StringBuilder();
            for (Text val : values) { // so the iteration style differs as well
                if (csv.length() > 0) {
                    csv.append(",");
                }
                csv.append(val.toString());
            }
            context.write(key, new Text(csv.toString())); // emit
        }
    }

    /*
     * The driver is different too.
     * 1. The roles of JobConf and JobClient are taken over by Configuration
     *    and Job, so the job is configured and defined differently.
     * 2. The methods that set the input/output formats were renamed:
     *    setInputFormat -> setInputFormatClass
     * 3. Submitting the job used to be JobClient.runJob(job); now it is
     *    job.waitForCompletion(true), whose result becomes the exit code.
     */
    @Override
    public int run(String[] args) throws Exception {
        // Configuration holds the settings for the job.
        Configuration conf = getConf();

        // Job defines and controls a single job. This constructor is
        // deprecated in Hadoop 2 and later, where Job.getInstance(conf, name)
        // is preferred.
        Job job = new Job(conf, "MyJobNew");
        job.setJarByClass(MyJobNew.class);

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);
        FileInputFormat.setInputPaths(job, in);
        FileOutputFormat.setOutputPath(job, out);

        job.setMapperClass(MapClass.class);
        job.setCombinerClass(Combine.class);
        job.setReducerClass(Reduce.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Run the job and return its status. Calling System.exit() here would
        // make the return statement unreachable and bypass ToolRunner.
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new MyJobNew(), args);
        System.exit(res);
    }
}
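The template reuses the reduce logic as the combiner, which works here because joining values with commas gives the same result whether or not partial results were joined first. The plain-Java sketch below checks that property locally. CombinerSketch, join, and count are hypothetical names and the values are made-up sample data; count is included only as a contrasting case, it is not part of the template.

```java
import java.util.Arrays;
import java.util.List;

public class CombinerSketch {

    /** Same logic as the template's Combine and Reduce: join values with commas. */
    static String join(List<String> values) {
        StringBuilder csv = new StringBuilder();
        for (String value : values) {
            if (csv.length() > 0) {
                csv.append(',');
            }
            csv.append(value);
        }
        return csv.toString();
    }

    /** Contrasting reducer: counts how many values it receives. */
    static String count(List<String> values) {
        return Integer.toString(values.size());
    }

    public static void main(String[] args) {
        // Values for one key, as produced by two separate map tasks.
        List<String> fromMapper1 = Arrays.asList("3", "4");
        List<String> fromMapper2 = Arrays.asList("7");
        List<String> all = Arrays.asList("3", "4", "7");

        // Reducer alone versus combiner on each map task followed by the reducer.
        String joinDirect = join(all);
        String joinCombined = join(Arrays.asList(join(fromMapper1), join(fromMapper2)));
        System.out.println("join safe as combiner: " + joinDirect.equals(joinCombined));

        String countDirect = count(all);
        String countCombined = count(Arrays.asList(count(fromMapper1), count(fromMapper2)));
        System.out.println("count safe as combiner: " + countDirect.equals(countCombined));
    }
}
```

A reducer like count would need a separate combiner that emits partial counts and a reducer that sums them. Since the framework decides whether and how often a combiner runs, the job has to produce correct output in either case.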
