[Reading Notes] A First Look at MapReduce


Note: Most of the content in this post comes from "Hadoop: The Definitive Guide". I am recording some reading notes here for quick reference later.


Sample Data

0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999

In each record, characters 15-19 (0-based offsets, exclusive end) hold the year, characters 87-92 the signed air temperature, and character 92 the data-quality code.
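As a quick illustration, here is a standalone sketch (the class name RecordFields is just for this note) that pulls those fixed-width fields out of the first sample record; the NCDC format records temperature in tenths of a degree Celsius:

import static java.lang.System.out;

public class RecordFields {
  public static void main(String[] args) {
    // First sample record from above.
    String line = "0067011990999991950051507004+68750+023550FM-12"
        + "+038299999V0203301N00671220001CN9999999N9+00001+99999999999";
    out.println("year        = " + line.substring(15, 19)); // 1950
    out.println("temperature = " + line.substring(87, 92)); // +0000, tenths of a degree Celsius
    out.println("quality     = " + line.substring(92, 93)); // 1
  }
}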


MapReduce programming

Mapper

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Only emit readings that are present and pass the quality check.
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      output.collect(new Text(year), new IntWritable(airTemperature));
    }
  }
}
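For the five sample records above, the mapper emits these (year, temperature) pairs:

(1950, 0)
(1950, 22)
(1950, -11)
(1949, 111)
(1949, 78)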


Reducer

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    // Scan all temperatures for this year and keep the maximum.
    int maxValue = Integer.MIN_VALUE;
    while (values.hasNext()) {
      maxValue = Math.max(maxValue, values.next().get());
    }
    output.collect(key, new IntWritable(maxValue));
  }
}
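Before reduce() is called, the framework sorts and groups the map output by key, so the reducer sees (1949, [111, 78]) and (1950, [0, 22, -11]), and emits (1949, 111) and (1950, 22).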

Main Driver

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {

  public static void main(String[] args) throws IOException {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    JobConf conf = new JobConf(MaxTemperature.class);
    conf.setJobName("Max temperature");

    FileInputFormat.addInputPath(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    conf.setMapperClass(MaxTemperatureMapper.class);
    conf.setReducerClass(MaxTemperatureReducer.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    JobClient.runJob(conf);
  }
}


MapReduce New API

Hadoop 0.20.0 introduced a brand-new MapReduce API. Compared with the old MapReduce API, the differences are as follows:


1> The new API uses abstract classes instead of interfaces.

2> The new API lives in the org.apache.hadoop.mapreduce package; the old one is in org.apache.hadoop.mapred.

3> The new API uses context objects; the old one uses OutputCollector/Reporter.

4> The new API supports both "push" and "pull" styles: records can be pulled from within map() and processed in batches, while the old API only pushes records to be handled one at a time (see the sketch after this list).

5> The new API unifies configuration under Configuration; the old one uses JobConf for job configuration.

6> Job control is handled by the Job class rather than JobClient.
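
Point 4 is easiest to see in code. Below is a sketch of how the new API lets a mapper override run() and pull records itself; BatchingMapper and processBatch are illustrative names for this note, not from the book:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Overriding run() lets the mapper pull records from the context and
// process them in batches, instead of having the framework push them
// one at a time into map().
public class BatchingMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    List<String> batch = new ArrayList<String>();
    while (context.nextKeyValue()) {        // pull the next record
      batch.add(context.getCurrentValue().toString());
      if (batch.size() == 100) {            // process every 100 records together
        processBatch(batch, context);
        batch.clear();
      }
    }
    processBatch(batch, context);           // flush the final partial batch
    cleanup(context);
  }

  // Hypothetical batch handler; here it just emits one count per batch.
  private void processBatch(List<String> batch, Context context)
      throws IOException, InterruptedException {
    if (!batch.isEmpty()) {
      context.write(new Text("batch"), new IntWritable(batch.size()));
    }
  }
}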


import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewMaxTemperature {

  // Extends the abstract Mapper class (new API) instead of implementing an interface.
  static class NewMaxTemperatureMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      String year = line.substring(15, 19);
      int airTemperature;
      if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
        airTemperature = Integer.parseInt(line.substring(88, 92));
      } else {
        airTemperature = Integer.parseInt(line.substring(87, 92));
      }
      String quality = line.substring(92, 93);
      if (airTemperature != MISSING && quality.matches("[01459]")) {
        context.write(new Text(year), new IntWritable(airTemperature)); // Context replaces OutputCollector
      }
    }
  }

  static class NewMaxTemperatureReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int maxValue = Integer.MIN_VALUE;
      for (IntWritable value : values) { // values is an Iterable, not an Iterator
        maxValue = Math.max(maxValue, value.get());
      }
      context.write(key, new IntWritable(maxValue));
    }
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: NewMaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job(); // Job replaces JobConf/JobClient for job control
    job.setJarByClass(NewMaxTemperature.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(NewMaxTemperatureMapper.class);
    job.setReducerClass(NewMaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Running the Code

Run the test code in Hadoop's standalone mode:

$ javac -cp $HADOOP_INSTALL/hadoop-core-{version}.jar -d build/classes src/java/mapred/*.java
$ export HADOOP_CLASSPATH=build/classes
$ hadoop MaxTemperature input/ncdc/sample.txt output/

Console output:

11/10/03 19:52:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/10/03 19:52:37 WARN mapred.JobClient: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
11/10/03 19:52:37 INFO mapred.FileInputFormat: Total input paths to process : 1
11/10/03 19:52:37 INFO mapred.JobClient: Running job: job_local_0001
11/10/03 19:52:37 INFO mapred.MapTask: numReduceTasks: 1
11/10/03 19:52:37 INFO mapred.MapTask: io.sort.mb = 100
11/10/03 19:52:38 INFO mapred.MapTask: data buffer = 79691776/99614720
11/10/03 19:52:38 INFO mapred.MapTask: record buffer = 262144/327680
11/10/03 19:52:38 INFO mapred.MapTask: Starting flush of map output
11/10/03 19:52:38 INFO mapred.MapTask: Finished spill 0
11/10/03 19:52:38 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
11/10/03 19:52:38 INFO mapred.JobClient:  map 0% reduce 0%
11/10/03 19:52:40 INFO mapred.LocalJobRunner: file:/home/xiyu/Tuto/tomwhite-hadoop-book-32dae01/input/ncdc/sample.txt:0+529
11/10/03 19:52:40 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
11/10/03 19:52:40 INFO mapred.LocalJobRunner: 
11/10/03 19:52:40 INFO mapred.Merger: Merging 1 sorted segments
11/10/03 19:52:40 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 57 bytes
11/10/03 19:52:40 INFO mapred.LocalJobRunner: 
11/10/03 19:52:40 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
11/10/03 19:52:40 INFO mapred.LocalJobRunner: 
11/10/03 19:52:40 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
11/10/03 19:52:40 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to file:/home/xiyu/Tuto/tomwhite-hadoop-book-32dae01/ch02/output
11/10/03 19:52:41 INFO mapred.JobClient:  map 100% reduce 0%
11/10/03 19:52:43 INFO mapred.LocalJobRunner: reduce > reduce
11/10/03 19:52:43 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
11/10/03 19:52:44 INFO mapred.JobClient:  map 100% reduce 100%
11/10/03 19:52:44 INFO mapred.JobClient: Job complete: job_local_0001
11/10/03 19:52:44 INFO mapred.JobClient: Counters: 17
11/10/03 19:52:44 INFO mapred.JobClient:   File Input Format Counters 
11/10/03 19:52:44 INFO mapred.JobClient:     Bytes Read=529
11/10/03 19:52:44 INFO mapred.JobClient:   File Output Format Counters 
11/10/03 19:52:44 INFO mapred.JobClient:     Bytes Written=29
11/10/03 19:52:44 INFO mapred.JobClient:   FileSystemCounters
11/10/03 19:52:44 INFO mapred.JobClient:     FILE_BYTES_READ=1479
11/10/03 19:52:44 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=61373
11/10/03 19:52:44 INFO mapred.JobClient:   Map-Reduce Framework
11/10/03 19:52:44 INFO mapred.JobClient:     Map output materialized bytes=61
11/10/03 19:52:44 INFO mapred.JobClient:     Map input records=5
11/10/03 19:52:44 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/10/03 19:52:44 INFO mapred.JobClient:     Spilled Records=10
11/10/03 19:52:44 INFO mapred.JobClient:     Map output bytes=45
11/10/03 19:52:44 INFO mapred.JobClient:     Map input bytes=529
11/10/03 19:52:44 INFO mapred.JobClient:     SPLIT_RAW_BYTES=124
11/10/03 19:52:44 INFO mapred.JobClient:     Combine input records=0
11/10/03 19:52:44 INFO mapred.JobClient:     Reduce input records=5
11/10/03 19:52:44 INFO mapred.JobClient:     Reduce input groups=2
11/10/03 19:52:44 INFO mapred.JobClient:     Combine output records=0
11/10/03 19:52:44 INFO mapred.JobClient:     Reduce output records=2
11/10/03 19:52:44 INFO mapred.JobClient:     Map output records=5

Here job_local_0001 is the ID of the submitted job, attempt_local_0001_m_000000_0 is the single map task that was launched, and attempt_local_0001_r_000000_0 is the single reduce task.

After the job completes, a set of job counter information is printed.
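
As a sanity check, computing the maxima by hand from the five sample records (1949: 111 and 78; 1950: 0, 22 and -11) gives 111 for 1949 and 22 for 1950, which matches what the job writes to the output directory (part-00000 is the default output file name for the old API):

$ cat output/part-00000
1949	111
1950	22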