[Reading Notes] A First Look at MapReduce
Source: Internet | Editor: 程序博客网 | Date: 2024/05/22 12:48
Note: Most of this post is drawn from "Hadoop: The Definitive Guide". I am recording some reading notes here for quick reference later.
Sample Data
0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999
0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999
0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999
0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999
0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999
Each record is fixed-width (the original post color-coded the fields): the year occupies offsets 15-18 (0-based), the signed air temperature occupies offsets 87-91, and the quality code is at offset 92.
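These offsets can be verified with a tiny standalone snippet (a sketch of my own, not from the book, using the same substring positions as the mapper below):

```java
public class RecordParseCheck {
    // First sample record from this post
    static final String SAMPLE =
        "0067011990999991950051507004+68750+023550FM-12"
      + "+038299999V0203301N00671220001CN9999999N9+00001+99999999999";

    static String year(String line) { return line.substring(15, 19); }

    static int airTemperature(String line) {
        // Integer.parseInt rejects a leading '+' sign, hence the branch
        return (line.charAt(87) == '+')
             ? Integer.parseInt(line.substring(88, 92))
             : Integer.parseInt(line.substring(87, 92));
    }

    static String quality(String line) { return line.substring(92, 93); }

    public static void main(String[] args) {
        System.out.println(year(SAMPLE) + " " + airTemperature(SAMPLE)
                           + " " + quality(SAMPLE));  // prints: 1950 0 1
    }
}
```

The first record parses to year 1950, temperature 0 (tenths of a degree), quality code 1.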
MapReduce programming
Mapper
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            output.collect(new Text(year), new IntWritable(airTemperature));
        }
    }
}
Reducer
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
            maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
    }
}
Main Driver
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MaxTemperature {

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        JobConf conf = new JobConf(MaxTemperature.class);
        conf.setJobName("Max temperature");

        FileInputFormat.addInputPath(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        conf.setMapperClass(MaxTemperatureMapper.class);
        conf.setReducerClass(MaxTemperatureReducer.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        JobClient.runJob(conf);
    }
}
The New MapReduce API
Hadoop 0.20.0 introduced a brand-new MapReduce API. Compared with the old API, the differences are:
1> The new API favors abstract classes over interfaces
2> The new API lives in the org.apache.hadoop.mapreduce package; the old one is in org.apache.hadoop.mapred
3> The new API uses context objects (Context); the old one uses OutputCollector/Reporter
4> The new API supports both "push" and "pull" styles, allowing records to be pulled from within map() and processed in batches; the old API only pushes records one at a time
5> The new API unifies job configuration under Configuration; the old API uses JobConf
6> Job control is handled by the Job class rather than JobClient
import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NewMaxTemperature {

    static class NewMaxTemperatureMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final int MISSING = 9999;

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature;
            if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    static class NewMaxTemperatureReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new IntWritable(maxValue));
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: NewMaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        Job job = new Job();
        job.setJarByClass(NewMaxTemperature.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(NewMaxTemperatureMapper.class);
        job.setReducerClass(NewMaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Running the Code
Run the test code under Hadoop's standalone mode:
$ javac -cp $HADOOP_INSTALL/hadoop-core-{version}.jar -d build/classes src/java/mapred/*.java
$ export HADOOP_CLASSPATH=build/classes
$ hadoop MaxTemperature input/ncdc/sample.txt output/
Console output:
11/10/03 19:52:37 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
11/10/03 19:52:37 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
11/10/03 19:52:37 INFO mapred.FileInputFormat: Total input paths to process : 1
11/10/03 19:52:37 INFO mapred.JobClient: Running job: job_local_0001
11/10/03 19:52:37 INFO mapred.MapTask: numReduceTasks: 1
11/10/03 19:52:37 INFO mapred.MapTask: io.sort.mb = 100
11/10/03 19:52:38 INFO mapred.MapTask: data buffer = 79691776/99614720
11/10/03 19:52:38 INFO mapred.MapTask: record buffer = 262144/327680
11/10/03 19:52:38 INFO mapred.MapTask: Starting flush of map output
11/10/03 19:52:38 INFO mapred.MapTask: Finished spill 0
11/10/03 19:52:38 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
11/10/03 19:52:38 INFO mapred.JobClient: map 0% reduce 0%
11/10/03 19:52:40 INFO mapred.LocalJobRunner: file:/home/xiyu/Tuto/tomwhite-hadoop-book-32dae01/input/ncdc/sample.txt:0+529
11/10/03 19:52:40 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
11/10/03 19:52:40 INFO mapred.LocalJobRunner:
11/10/03 19:52:40 INFO mapred.Merger: Merging 1 sorted segments
11/10/03 19:52:40 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 57 bytes
11/10/03 19:52:40 INFO mapred.LocalJobRunner:
11/10/03 19:52:40 INFO mapred.Task: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
11/10/03 19:52:40 INFO mapred.LocalJobRunner:
11/10/03 19:52:40 INFO mapred.Task: Task attempt_local_0001_r_000000_0 is allowed to commit now
11/10/03 19:52:40 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to file:/home/xiyu/Tuto/tomwhite-hadoop-book-32dae01/ch02/output
11/10/03 19:52:41 INFO mapred.JobClient: map 100% reduce 0%
11/10/03 19:52:43 INFO mapred.LocalJobRunner: reduce > reduce
11/10/03 19:52:43 INFO mapred.Task: Task 'attempt_local_0001_r_000000_0' done.
11/10/03 19:52:44 INFO mapred.JobClient: map 100% reduce 100%
11/10/03 19:52:44 INFO mapred.JobClient: Job complete: job_local_0001
11/10/03 19:52:44 INFO mapred.JobClient: Counters: 17
11/10/03 19:52:44 INFO mapred.JobClient:   File Input Format Counters
11/10/03 19:52:44 INFO mapred.JobClient:     Bytes Read=529
11/10/03 19:52:44 INFO mapred.JobClient:   File Output Format Counters
11/10/03 19:52:44 INFO mapred.JobClient:     Bytes Written=29
11/10/03 19:52:44 INFO mapred.JobClient:   FileSystemCounters
11/10/03 19:52:44 INFO mapred.JobClient:     FILE_BYTES_READ=1479
11/10/03 19:52:44 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=61373
11/10/03 19:52:44 INFO mapred.JobClient:   Map-Reduce Framework
11/10/03 19:52:44 INFO mapred.JobClient:     Map output materialized bytes=61
11/10/03 19:52:44 INFO mapred.JobClient:     Map input records=5
11/10/03 19:52:44 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/10/03 19:52:44 INFO mapred.JobClient:     Spilled Records=10
11/10/03 19:52:44 INFO mapred.JobClient:     Map output bytes=45
11/10/03 19:52:44 INFO mapred.JobClient:     Map input bytes=529
11/10/03 19:52:44 INFO mapred.JobClient:     SPLIT_RAW_BYTES=124
11/10/03 19:52:44 INFO mapred.JobClient:     Combine input records=0
11/10/03 19:52:44 INFO mapred.JobClient:     Reduce input records=5
11/10/03 19:52:44 INFO mapred.JobClient:     Reduce input groups=2
11/10/03 19:52:44 INFO mapred.JobClient:     Combine output records=0
11/10/03 19:52:44 INFO mapred.JobClient:     Reduce output records=2
11/10/03 19:52:44 INFO mapred.JobClient:     Map output records=5
Here job_local_0001 is the ID of the submitted job, attempt_local_0001_m_000000_0 is the single map task that was launched, and attempt_local_0001_r_000000_0 is the single reduce task.
After the job completes, some job counter information is printed.
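The counters above (5 map input records, 2 reduce input groups, 2 reduce output records) can be predicted by a plain-Java dry run over the five sample records. The sketch below is my own check, not from the book; it applies the same quality filter and per-year max as the mapper and reducer:

```java
import java.util.Map;
import java.util.TreeMap;

public class ExpectedOutputCheck {
    // The five sample records from the top of this post
    static final String[] RECORDS = {
        "0067011990999991950051507004+68750+023550FM-12+038299999V0203301N00671220001CN9999999N9+00001+99999999999",
        "0043011990999991950051512004+68750+023550FM-12+038299999V0203201N00671220001CN9999999N9+00221+99999999999",
        "0043011990999991950051518004+68750+023550FM-12+038299999V0203201N00261220001CN9999999N9-00111+99999999999",
        "0043012650999991949032412004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+01111+99999999999",
        "0043012650999991949032418004+62300+010750FM-12+048599999V0202701N00461220001CN0500001N9+00781+99999999999",
    };

    static Map<String, Integer> maxTemps(String[] records) {
        Map<String, Integer> max = new TreeMap<>();   // sorted by year, like the shuffle
        for (String line : records) {
            String year = line.substring(15, 19);
            int temp = (line.charAt(87) == '+')       // parseInt dislikes a leading '+'
                     ? Integer.parseInt(line.substring(88, 92))
                     : Integer.parseInt(line.substring(87, 92));
            String quality = line.substring(92, 93);
            if (temp != 9999 && quality.matches("[01459]")) {
                max.merge(year, temp, Math::max);     // keep the per-year maximum
            }
        }
        return max;
    }

    public static void main(String[] args) {
        // prints: 1949	111
        //         1950	22
        maxTemps(RECORDS).forEach((y, t) -> System.out.println(y + "\t" + t));
    }
}
```

This matches the job's output: the maximum temperature (in tenths of a degree) is 111 for 1949 and 22 for 1950, i.e. two reduce output records from five map input records.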