MapReduce in Practice: MaxTemperatureVeryYear on US Climate Data
Source: Internet | Editor: 程序博客网 | Date: 2024/04/29 08:30
The data comes from the US National Climatic Data Center. First, download it on Linux: open a terminal and run wget -r -c ftp://ftp.ncdc.noaa.gov/pub/data/noaa/. The full dataset is very large, so I only downloaded part of it.
Now go to the download directory, /home/gznc, and check the downloaded files (I only fetched a subset). Then decompress them all into a single text file on the desktop, here named abc.txt (any name works): zcat *.gz > /home/gznc/Desktop/abc.txt. The decompressed file appears on the desktop; double-click it to inspect the data.
The data format looks like this:
0188010010999992011010100004+70933-008667FM-12+0009ENJA V0203401N011010120019N0050001N1-00561-00981102641ADDAA106000021AY181061AY231061GF108991081081008001999999MA1999999102521MD1210171
+9999MW1851REMSYN088AAXX 01001 01001 11550 83411 11056 21098 30252 40264 52017 69901 78583 888// 333 91121
This is a single record. The year we need occupies character positions 15-18 (0-indexed, i.e. substring(15, 19)), and the temperature occupies positions 87-91, with the sign character at position 87 and the digits at 88-91. The quality code follows at position 92.
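As a quick sanity check outside Hadoop, the field extraction can be tried on the sample record above in plain Java (a minimal sketch; the string is the first sample line, split here only for readability):

```java
public class RecordParseDemo {
    public static void main(String[] args) {
        // the sample NCDC record shown above
        String line = "0188010010999992011010100004+70933-008667FM-12+0009ENJA "
                + "V0203401N011010120019N0050001N1-00561-00981102641ADDAA106000021AY181061AY231061"
                + "GF108991081081008001999999MA1999999102521MD1210171";
        String year = line.substring(15, 19);                // "2011"
        int airTemperature = (line.charAt(87) == '+')
                ? Integer.parseInt(line.substring(88, 92))   // drop the '+' sign
                : Integer.parseInt(line.substring(87, 92));  // keep the '-' sign
        String quality = line.substring(92, 93);             // quality code
        // temperature is in tenths of a degree Celsius, so -56 means -5.6 C
        System.out.println(year + " " + airTemperature + " " + quality);  // prints: 2011 -56 1
    }
}
```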
Next comes the code.

The TemperatureMapper class:
package maxTemperature;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// TemperatureMapper extends the Mapper base class
public class TemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    // Core Mapper method. Parameters:
    //   key     - byte offset of the line within the file
    //   value   - one line of the input file
    //   context - the Mapper-side context (fills the role of the old OutputCollector/Reporter)
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);  // year
        int airTemperature = 0;
        if (line.length() > 95) {  // guard against short records / index out of bounds
            if (line.charAt(87) == '+') {  // positive temperature
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {  // negative temperature (sign included)
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            // keep the record only if the reading is present and the quality code is 0, 1, 4, 5 or 9
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }
}
The TemperatureReducer class:
package maxTemperature;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;   // start from the smallest possible int
        for (IntWritable value : values) {  // find the maximum temperature for this key (year)
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));  // emit <year, max temperature>
    }
}
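The max-finding loop in the reducer can be exercised outside Hadoop with plain Java (hypothetical readings, in tenths of a degree, standing in for the values of one year key):

```java
import java.util.Arrays;
import java.util.List;

public class MaxValueDemo {
    public static void main(String[] args) {
        // hypothetical temperatures for one year key, in tenths of a degree Celsius
        List<Integer> values = Arrays.asList(-56, 122, 310, -230, 287);
        int maxValue = Integer.MIN_VALUE;  // start below any possible reading
        for (int v : values) {             // same loop as the reducer body
            maxValue = Math.max(maxValue, v);
        }
        System.out.println(maxValue);      // prints: 310
    }
}
```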
The TemperatureMain class:
package maxTemperature;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class TemperatureMain {

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();  // loads the Hadoop configuration files, e.g. core-site.xml
        // pass generic command-line options into conf, keep the rest as program arguments
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: Temperature <in> <out>");
            System.exit(2);
        }
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "MaxTemperature");  // on Hadoop 2.x, Job.getInstance(conf, "MaxTemperature") is preferred
        job.setJarByClass(TemperatureMain.class);        // main class
        job.setMapperClass(TemperatureMapper.class);     // Mapper
        job.setCombinerClass(TemperatureReducer.class);  // Combiner (max is associative, so the reducer doubles as combiner)
        job.setReducerClass(TemperatureReducer.class);   // Reducer
        job.setOutputKeyClass(Text.class);               // output key type
        job.setOutputValueClass(IntWritable.class);      // output value type
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));  // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);  // wait for completion, then exit
    }
}
Export the project as a jar file; the jar then appears on the desktop.
Next, start the cluster with start-all.sh and upload the data to HDFS.
Run the program with:

hadoop jar /home/gznc/Desktop/maxTemperature.jar maxTemperature.TemperatureMain /user/gznc/inputs/abc.txt /user/gznc/outputfile

The format is: hadoop jar <path to the jar (mine is on the local desktop)> <package.class containing the main function> <input file path on HDFS> <output path on HDFS>.
Console output after a successful run:
16/10/25 23:28:40 INFO mapreduce.Job: Running job: job_1477406455166_0002
16/10/25 23:29:06 INFO mapreduce.Job: Job job_1477406455166_0002 running in uber mode : false
16/10/25 23:29:06 INFO mapreduce.Job: map 0% reduce 0%
16/10/25 23:29:28 INFO mapreduce.Job: map 51% reduce 0%
16/10/25 23:29:31 INFO mapreduce.Job: map 100% reduce 0%
16/10/25 23:29:51 INFO mapreduce.Job: map 100% reduce 100%
16/10/25 23:29:52 INFO mapreduce.Job: Job job_1477406455166_0002 completed successfully
16/10/25 23:29:53 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=114
FILE: Number of bytes written=193971
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=53008430
HDFS: Number of bytes written=81
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=22664
Total time spent by all reduces in occupied slots (ms)=17363
Total time spent by all map tasks (ms)=22664
Total time spent by all reduce tasks (ms)=17363
Total vcore-seconds taken by all map tasks=22664
Total vcore-seconds taken by all reduce tasks=17363
Total megabyte-seconds taken by all map tasks=23207936
Total megabyte-seconds taken by all reduce tasks=17779712
Map-Reduce Framework
Map input records=229987
Map output records=229758
Map output bytes=1608306
Map output materialized bytes=114
Input split bytes=108
Combine input records=229758
Combine output records=12
Reduce input groups=12
Reduce shuffle bytes=114
Reduce input records=12
Reduce output records=12
Spilled Records=24
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=608
CPU time spent (ms)=11280
Physical memory (bytes) snapshot=307077120
Virtual memory (bytes) snapshot=1679773696
Total committed heap usage (bytes)=136122368
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=53008322
File Output Format Counters
Bytes Written=81
In Firefox, open <hostname>:18088; if the job status shows SUCCEEDED, it ran successfully.