Hadoop Basics Tutorial - Chapter 6 MapReduce Introduction (6.5 Temperature Statistics) (Draft)
Chapter 6 MapReduce Introduction
6.5 Temperature Statistics
6.5.1 Problem Description
Hadoop: The Definitive Guide (3rd Edition) contains a classic example: temperature statistics. The author, Tom White, presents the program and explains the principles, but assumes readers can already set up and deploy the basic environment for running MapReduce programs, so that part is glossed over. For beginners this is a real trap: the program simply will not run, which quickly wears down a newcomer's interest and resolve.
For the directory structure of the Java project used in this section, see Section 6.4.
Following the description in Hadoop: The Definitive Guide (3rd Edition), here is a brief restatement of the problem.
The data comes from the US National Climatic Data Center (NCDC). In the sample records, ellipses stand for unused columns, which we ignore for now.
These lines are fed to the map method as key-value pairs: the key is the line's byte offset within the file, and the value is the line itself.
To compute the statistics, we extract the year and the temperature from each line. The 4-digit year sits at character indices 15 through 18 (0-based, i.e. substring(15, 19)); the temperature, recorded in tenths of a degree, sits at indices 87 through 91, with index 87 holding the sign.
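To make these character positions concrete, here is a minimal, self-contained sketch. The record is synthetic, built inside the code purely for illustration, but the slicing is the same substring logic the mapper in 6.5.3 performs:

public class NcdcLineDemo {
    public static void main(String[] args) {
        // Build a synthetic fixed-width record just for illustration:
        // year at indices 15..18, sign at index 87, temperature digits at 88..91, quality flag at 92.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 93; i++) sb.append('0');
        sb.replace(15, 19, "1950");   // year
        sb.replace(87, 92, "-0011");  // temperature in tenths of a degree: -1.1
        sb.replace(92, 93, "1");      // quality code
        String line = sb.toString();

        // The same slicing the mapper in 6.5.3 performs
        String year = line.substring(15, 19);
        int airTemperature = (line.charAt(87) == '+')
                ? Integer.parseInt(line.substring(88, 92))
                : Integer.parseInt(line.substring(87, 92));
        String quality = line.substring(92, 93);
        System.out.println(year + " " + airTemperature + " " + quality);
    }
}

Running it prints "1950 -11 1", i.e. year 1950, a reading of -1.1 degrees, quality code 1.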
Updated 2017-06-19
Download the data: ftp://ftp.ncdc.noaa.gov/pub/data/gsod/
To keep the test manageable, we downloaded only the 2016 and 2017 data.
(1) Extract the yearly archives
[root@nb0 data]# tar -xvf gsod_2016.tar
[root@nb0 data]# tar -xvf gsod_2017.tar
[root@nb0 data]# ls
........
151130-99999-2016.op.gz 416200-99999-2016.op.gz 689230-99999-2016.op.gz 723284-93839-2017.op.gz 871210-99999-2016.op.gz
151130-99999-2017.op.gz 416200-99999-2017.op.gz 689230-99999-2017.op.gz 723290-03935-2016.op.gz 871210-99999-2017.op.gz
151170-99999-2016.op.gz 416240-99999-2016.op.gz 689250-99999-2016.op.gz 723290-03935-2017.op.gz 871271-99999-2017.op.gz
151170-99999-2017.op.gz 416240-99999-2017.op.gz 689250-99999-2017.op.gz 723300-03975-2016.op.gz 871290-99999-2016.op.gz
151180-99999-2016.op.gz 416360-99999-2016.op.gz 689260-99999-2016.op.gz 723300-03975-2017.op.gz 871290-99999-2017.op.gz
(2) Merge the files
Use zcat to decompress all the data files and concatenate them into a single ncdc.txt file.
[root@nb0 data]# zcat *.gz > ncdc.txt
[root@nb0 data]# ll |grep ncdc
-rw-r--r-- 1 root root 874976505 Jun 19 21:10 ncdc.txt
(3) Inspect the ncdc.txt file
[root@nb0 data]# head -12 ncdc.txt
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
007026 99999 20160622 94.7 7 66.7 7 9999.9 0 9999.9 0 6.2 4 0.0 7 999.9 999.9 100.4* 87.8* 0.00I 999.9 000000
007026 99999 20160623 88.3 24 69.7 24 9999.9 0 9999.9 0 6.2 24 0.0 24 999.9 999.9 98.6* 78.8* 0.00I 999.9 000000
007026 99999 20160624 80.5 24 69.3 24 9999.9 0 9999.9 0 5.8 22 0.0 24 999.9 999.9 93.2* 69.8* 99.99 999.9 010000
007026 99999 20160625 81.4 24 71.8 24 9999.9 0 9999.9 0 5.9 23 0.0 24 999.9 999.9 89.6* 73.4* 0.00I 999.9 000000
007026 99999 20160626 80.5 24 63.4 24 9999.9 0 9999.9 0 6.2 22 0.0 24 999.9 999.9 91.4* 69.8* 0.00I 999.9 000000
007026 99999 20160627 80.6 24 64.2 24 9999.9 0 9999.9 0 6.0 23 0.0 24 999.9 999.9 93.2* 68.0* 0.00I 999.9 000000
007026 99999 20160628 77.4 24 70.8 24 9999.9 0 9999.9 0 5.1 17 0.0 24 999.9 999.9 87.8* 71.6* 99.99 999.9 010000
007026 99999 20160629 74.3 16 71.9 16 9999.9 0 9999.9 0 2.7 13 0.0 16 999.9 999.9 86.0* 69.8* 0.00I 999.9 000000
STN--- WBAN YEARMODA TEMP DEWP SLP STP VISIB WDSP MXSPD GUST MAX MIN PRCP SNDP FRSHTT
007026 99999 20170210 46.9 11 19.4 11 9999.9 0 9999.9 0 6.2 11 3.6 11 8.0 13.0 53.6* 35.6* 0.00I 999.9 000000
007026 99999 20170211 54.4 24 35.0 24 9999.9 0 9999.9 0 6.2 24 4.7 24 12.0 21.0 77.0* 42.8* 0.00I 999.9 000000
(4) Data format description
ftp://ftp.ncdc.noaa.gov/pub/data/gsod/readme.txt
(5) Remove the header rows
Use sed to delete every line that matches STN:
[root@nb0 data]# sed -i '/STN/d' ncdc.txt
Check the result:
[root@nb0 data]# head -12 ncdc.txt
007026 99999 20160622 94.7 7 66.7 7 9999.9 0 9999.9 0 6.2 4 0.0 7 999.9 999.9 100.4* 87.8* 0.00I 999.9 000000
007026 99999 20160623 88.3 24 69.7 24 9999.9 0 9999.9 0 6.2 24 0.0 24 999.9 999.9 98.6* 78.8* 0.00I 999.9 000000
007026 99999 20160624 80.5 24 69.3 24 9999.9 0 9999.9 0 5.8 22 0.0 24 999.9 999.9 93.2* 69.8* 99.99 999.9 010000
007026 99999 20160625 81.4 24 71.8 24 9999.9 0 9999.9 0 5.9 23 0.0 24 999.9 999.9 89.6* 73.4* 0.00I 999.9 000000
007026 99999 20160626 80.5 24 63.4 24 9999.9 0 9999.9 0 6.2 22 0.0 24 999.9 999.9 91.4* 69.8* 0.00I 999.9 000000
007026 99999 20160627 80.6 24 64.2 24 9999.9 0 9999.9 0 6.0 23 0.0 24 999.9 999.9 93.2* 68.0* 0.00I 999.9 000000
007026 99999 20160628 77.4 24 70.8 24 9999.9 0 9999.9 0 5.1 17 0.0 24 999.9 999.9 87.8* 71.6* 99.99 999.9 010000
007026 99999 20160629 74.3 16 71.9 16 9999.9 0 9999.9 0 2.7 13 0.0 16 999.9 999.9 86.0* 69.8* 0.00I 999.9 000000
007026 99999 20170210 46.9 11 19.4 11 9999.9 0 9999.9 0 6.2 11 3.6 11 8.0 13.0 53.6* 35.6* 0.00I 999.9 000000
007026 99999 20170211 54.4 24 35.0 24 9999.9 0 9999.9 0 6.2 24 4.7 24 12.0 21.0 77.0* 42.8* 0.00I 999.9 000000
007026 99999 20170212 70.3 24 52.9 24 9999.9 0 9999.9 0 6.2 24 5.2 24 15.0 21.0 84.2* 62.6* 0.00I 999.9 000000
007026 99999 20170213 61.1 22 35.4 22 9999.9 0 9999.9 0 6.2 22 8.9 22 15.0 24.1 75.2* 50.0* 0.00I 999.9 000000
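Note that the ncdc.txt built above uses the GSOD daily-summary layout (whitespace-separated columns, documented in readme.txt), which is not the fixed-width format that the programs in 6.5.3 to 6.5.5 parse; those programs expect records like the temperature.txt file prepared in 6.5.2. Purely as an illustration of how the GSOD lines above could be parsed instead, here is a small sketch; the field positions, the Fahrenheit unit, and the 9999.9 missing-value convention are taken from the sample output and readme.txt, so verify them against your own data:

public class GsodLineDemo {
    public static void main(String[] args) {
        // One of the lines shown above; GSOD fields are whitespace-separated.
        String line = "007026 99999 20160622 94.7 7 66.7 7 9999.9 0 9999.9 0 6.2 4 "
                    + "0.0 7 999.9 999.9 100.4* 87.8* 0.00I 999.9 000000";
        String[] fields = line.trim().split("\\s+");
        String year = fields[2].substring(0, 4);          // YEARMODA -> "2016"
        double meanTemp = Double.parseDouble(fields[3]);  // TEMP -> 94.7 (Fahrenheit, per readme.txt)
        boolean missing = meanTemp == 9999.9;             // 9999.9 marks a missing TEMP value
        System.out.println(year + " " + meanTemp + " missing=" + missing);
    }
}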
6.5.2 Prepare the Data
(1) Download the data file to one of the nodes
[root@node1 ~]# ls
anaconda-ks.cfg cite75_99.txt hadoop-2.7.3.tar.gz jdk-8u112-linux-x64.tar.gz rs.txt word1.txt wordcount.jar
(2) Upload the data file to HDFS
[root@node1 ~]# hdfs dfs -mkdir -p temperature/input
[root@node1 ~]# hdfs dfs -put temperature.txt temperature/input
[root@node1 ~]# hdfs dfs -ls temperature/input
Found 1 items
-rw-r--r-- 3 root supergroup 46337829 2017-06-01 09:31 temperature/input/temperature.txt
[root@node1 ~]#
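For reference, the same preparation can also be done from Java through the HDFS FileSystem API. This is a minimal sketch under the assumption that the NameNode address is the one the jobs below use; adjust the address and paths to your cluster:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadTemperature {
    public static void main(String[] args) throws Exception {
        System.setProperty("HADOOP_USER_NAME", "root");
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://192.168.80.131:9000"); // same NameNode as in run() below
        FileSystem fs = FileSystem.get(conf);
        Path dst = new Path("/user/root/temperature/input");
        fs.mkdirs(dst);                                          // equivalent of hdfs dfs -mkdir -p
        fs.copyFromLocalFile(new Path("temperature.txt"), dst);  // equivalent of hdfs dfs -put
        System.out.println(java.util.Arrays.toString(fs.listStatus(dst)));
        fs.close();
    }
}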
6.5.3 Finding the Maximum Temperature
package cn.hadron.mrDemo;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxTemperature extends Configured implements Tool {

    // Mapper
    public static class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // 9999 indicates a missing temperature reading
        private static final int MISSING = 9999;

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature;
            if (line.charAt(87) == '+') {
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    // Reducer
    public static class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int maxValue = Integer.MIN_VALUE;
            for (IntWritable value : values) {
                maxValue = Math.max(maxValue, value.get());
            }
            context.write(key, new IntWritable(maxValue));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // read the configuration
        conf.set("fs.defaultFS", "hdfs://192.168.80.131:9000");
        Job job = Job.getInstance(conf, "MaxTemperature");
        job.setJarByClass(MaxTemperature.class);
        Path in = new Path(args[0]);   // input path
        Path out = new Path(args[1]);  // output path
        FileSystem hdfs = out.getFileSystem(conf);
        if (hdfs.isDirectory(out)) {   // delete the output path if it already exists
            hdfs.delete(out, true);
        }
        FileInputFormat.setInputPaths(job, in);    // input file(s)
        FileOutputFormat.setOutputPath(job, out);  // output directory
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job.waitForCompletion(true) ? 0 : 1; // wait for the job to finish
    }

    public static void main(String[] args) {
        System.setProperty("HADOOP_USER_NAME", "root");
        try {
            // Program arguments: input path, output path
            String[] args0 = {"/user/root/temperature/input", "/user/root/temperature/output/"};
            // Local run: pass the arguments as the args0 array inside the program
            // Cluster run: pass the arguments on the command line and use args instead
            // Here we run locally, so args0 is used
            int res = ToolRunner.run(new Configuration(), new MaxTemperature(), args0);
            System.exit(res);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
Run the program directly in Eclipse.
Output of the statistics:
[root@node1 ~]# hdfs dfs -cat /user/root/temperature/output/part-r-00000
1971 400
1972 411
1973 430
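Because taking a maximum is commutative and associative, the reducer can also safely act as a combiner, pre-aggregating each map task's output before the shuffle. This is an optional tweak that is not part of the listing above; it would slot into run() alongside the other job.set* calls:

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setCombinerClass(MaxTemperatureReducer.class); // optional: per-map-task max before the shuffle
        job.setReducerClass(MaxTemperatureReducer.class);

The same trick works for the minimum-temperature job in 6.5.4, but not for the average in 6.5.5, as explained there.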
6.5.4 Minimum Temperature
package cn.hadron.mrDemo;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinTemperature extends Configured implements Tool {

    // Mapper
    public static class MinTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        // 9999 indicates a missing temperature reading
        private static final int MISSING = 9999;

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature;
            if (line.charAt(87) == '+') {
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }

    // Reducer
    public static class MinTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int minValue = Integer.MAX_VALUE;
            for (IntWritable value : values) {
                minValue = Math.min(minValue, value.get());
            }
            context.write(key, new IntWritable(minValue));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // read the configuration
        conf.set("fs.defaultFS", "hdfs://192.168.80.131:9000");
        Job job = Job.getInstance(conf, "Min temperature");
        job.setJarByClass(MinTemperature.class);
        Path in = new Path(args[0]);   // input path
        Path out = new Path(args[1]);  // output path
        FileSystem hdfs = out.getFileSystem(conf);
        if (hdfs.isDirectory(out)) {   // delete the output path if it already exists
            hdfs.delete(out, true);
        }
        FileInputFormat.setInputPaths(job, in);    // input file(s)
        FileOutputFormat.setOutputPath(job, out);  // output directory
        job.setMapperClass(MinTemperatureMapper.class);
        job.setReducerClass(MinTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job.waitForCompletion(true) ? 0 : 1; // wait for the job to finish
    }

    public static void main(String[] args) {
        System.setProperty("HADOOP_USER_NAME", "root");
        try {
            // Program arguments: input path, output path
            String[] args0 = {"/user/root/temperature/input", "/user/root/temperature/output/"};
            // Local run: pass the arguments as the args0 array inside the program
            // Cluster run: pass the arguments on the command line and use args instead
            // Here we run locally, so args0 is used
            int res = ToolRunner.run(new Configuration(), new MinTemperature(), args0);
            System.exit(res);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
[root@node1 ~]# hdfs dfs -cat /user/root/temperature/output/part-r-00000
1971 -461
1972 -267
1973 -390
6.5.5 Average Temperature
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class AvgTemperature extends Configured implements Tool {

    // Mapper
    public static class AvgTemperatureMapper extends Mapper<LongWritable, Text, Text, Text> {
        private static final int MISSING = 9999;

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String year = line.substring(15, 19);
            int airTemperature;
            if (line.charAt(87) == '+') {
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            // Keep only valid readings, same quality codes as in MaxTemperature/MinTemperature
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new Text(String.valueOf(airTemperature)));
                //System.out.println("Mapper output <" + year + "," + airTemperature + ">");
            }
        }
    }

    // For averages, the average of partial averages is not the overall average,
    // so the reducer cannot simply be reused as a combiner.
    // The workaround is to have the combiner emit, instead of the final average,
    // the sum and the count of all values; the reducer then divides the total sum
    // by the total count to produce the average.
    // Combiner
    public static class AvgTemperatureCombiner extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            System.out.println("....AvgTemperatureCombiner....");
            double sumValue = 0;
            long numValue = 0;
            for (Text value : values) {
                sumValue += Double.parseDouble(value.toString());
                numValue++;
            }
            context.write(key, new Text(String.valueOf(sumValue) + ',' + String.valueOf(numValue)));
            System.out.println("Combiner output key-value pair <" + key + "," + sumValue + "," + numValue + ">");
        }
    }

    // java.lang.NoSuchMethodException: temperature.AvgTemperature$AvgTemperatureReducer.<init>()
    // Mapper and Reducer inner classes must be declared static, otherwise this exception is thrown.
    // Reducer
    public static class AvgTemperatureReducer extends Reducer<Text, Text, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            System.out.println("....AvgTemperatureReducer....");
            double sumValue = 0;
            long numValue = 0;
            int avgValue = 0;
            for (Text value : values) {
                String[] valueAll = value.toString().split(",");
                sumValue += Double.parseDouble(valueAll[0]);
                numValue += Integer.parseInt(valueAll[1]);
            }
            avgValue = (int) (sumValue / numValue);
            context.write(key, new IntWritable(avgValue));
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf(); // read the configuration
        conf.set("fs.defaultFS", "hdfs://192.168.11.81:9000");
        Job job = Job.getInstance(conf, "Avg Temperature");
        job.setJarByClass(AvgTemperature.class);
        Path in = new Path(args[0]);   // input path
        Path out = new Path(args[1]);  // output path
        FileSystem hdfs = out.getFileSystem(conf);
        if (hdfs.isDirectory(out)) {   // delete the output path if it already exists
            hdfs.delete(out, true);
        }
        FileInputFormat.setInputPaths(job, in);    // input file(s)
        FileOutputFormat.setOutputPath(job, out);  // output directory
        job.setMapperClass(AvgTemperatureMapper.class);
        job.setCombinerClass(AvgTemperatureCombiner.class);
        job.setReducerClass(AvgTemperatureReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        return job.waitForCompletion(true) ? 0 : 1; // wait for the job to finish
    }

    public static void main(String[] args) {
        System.setProperty("HADOOP_USER_NAME", "root");
        try {
            String[] args0 = {"/user/root/temperature/input", "/user/root/temperature/output/"};
            int res = ToolRunner.run(new Configuration(), new AvgTemperature(), args0);
            System.exit(res);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
[root@node1 ~]# hdfs dfs -cat /user/root/temperature/output/part-r-00000
17/03/04 04:13:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
1971 69
1972 147
1973 176
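The combiner comments above note that an average of partial averages is not, in general, the overall average, which is why the combiner emits a (sum, count) pair and only the reducer divides. A tiny standalone check with made-up numbers:

public class AvgCombinerDemo {
    public static void main(String[] args) {
        // Two hypothetical map-side groups of readings for the same year:
        // group A: 10, 20        group B: 30, 40, 50

        // Wrong: average the per-group averages
        double avgA = (10 + 20) / 2.0;        // 15.0
        double avgB = (30 + 40 + 50) / 3.0;   // 40.0
        double avgOfAvgs = (avgA + avgB) / 2; // 27.5

        // Right: carry (sum, count) through the combiner, divide once in the reducer
        double trueAvg = (10 + 20 + 30 + 40 + 50) / 5.0; // 30.0

        System.out.println(avgOfAvgs + " != " + trueAvg);
    }
}

Here the naive average of averages gives 27.5, while the true average of the five readings is 30.0.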