MapReduce in Practice: MaxTemperatureVeryYear on US Climate Data
Source: Internet | Editor: 程序博客网 | Date: 2024/04/29 08:30
The data comes from the US National Climatic Data Center. First, download it on Linux: open a terminal and run wget -r -c ftp://ftp.ncdc.noaa.gov/pub/data/noaa/. The full dataset is very large, so I only downloaded part of it.
Now go to the download directory, /home/gznc, and check the downloaded files (I only fetched a subset). Then decompress them all into a single text file on the desktop, here named abc.txt (any name works): zcat *.gz > /home/gznc/Desktop/abc.txt. The decompressed file appears on the desktop; double-click it to inspect the data.
The data format looks like this:
0188010010999992011010100004+70933-008667FM-12+0009ENJA V0203401N011010120019N0050001N1-00561-00981102641ADDAA106000021AY181061AY231061GF108991081081008001999999MA1999999102521MD1210171
+9999MW1851REMSYN088AAXX 01001 01001 11550 83411 11056 21098 30252 40264 52017 69901 78583 888// 333 91121
This is a single record. The year we need occupies character positions 15-18 (0-indexed, i.e. substring(15, 19)), and the temperature occupies positions 87-91, with the sign character at position 87 and the digits at 88-91. The quality code follows at position 92.
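As a quick sanity check outside Hadoop, the field extraction can be tried on the sample record above in plain Java (a minimal sketch; the string is the first sample line, split here only for readability):

```java
public class RecordParseDemo {
    public static void main(String[] args) {
        // the sample NCDC record shown above
        String line = "0188010010999992011010100004+70933-008667FM-12+0009ENJA "
                + "V0203401N011010120019N0050001N1-00561-00981102641ADDAA106000021AY181061AY231061"
                + "GF108991081081008001999999MA1999999102521MD1210171";
        String year = line.substring(15, 19);                // "2011"
        int airTemperature = (line.charAt(87) == '+')
                ? Integer.parseInt(line.substring(88, 92))   // drop the '+' sign
                : Integer.parseInt(line.substring(87, 92));  // keep the '-' sign
        String quality = line.substring(92, 93);             // quality code
        // temperature is in tenths of a degree Celsius, so -56 means -5.6 C
        System.out.println(year + " " + airTemperature + " " + quality);  // prints: 2011 -56 1
    }
}
```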
Next comes the code.

The TemperatureMapper class:
package maxTemperature;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// TemperatureMapper extends the Mapper base class
public class TemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    // Core Mapper method. Parameters:
    //   key     - byte offset of the line within the file
    //   value   - one line of the input file
    //   context - the Mapper-side context (fills the role of the old OutputCollector/Reporter)
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);  // year
        int airTemperature = 0;
        if (line.length() > 95) {  // guard against short records / index out of bounds
            if (line.charAt(87) == '+') {  // positive temperature
                airTemperature = Integer.parseInt(line.substring(88, 92));
            } else {  // negative temperature (sign included)
                airTemperature = Integer.parseInt(line.substring(87, 92));
            }
            String quality = line.substring(92, 93);
            // keep the record only if the reading is present and the quality code is 0, 1, 4, 5 or 9
            if (airTemperature != MISSING && quality.matches("[01459]")) {
                context.write(new Text(year), new IntWritable(airTemperature));
            }
        }
    }
}
The TemperatureReducer class:
package maxTemperature;

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;   // start from the smallest possible int
        for (IntWritable value : values) {  // find the maximum temperature for this key (year)
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));  // emit <year, max temperature>
    }
}
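The max-finding loop in the reducer can be exercised outside Hadoop with plain Java (hypothetical readings, in tenths of a degree, standing in for the values of one year key):

```java
import java.util.Arrays;
import java.util.List;

public class MaxValueDemo {
    public static void main(String[] args) {
        // hypothetical temperatures for one year key, in tenths of a degree Celsius
        List<Integer> values = Arrays.asList(-56, 122, 310, -230, 287);
        int maxValue = Integer.MIN_VALUE;  // start below any possible reading
        for (int v : values) {             // same loop as the reducer body
            maxValue = Math.max(maxValue, v);
        }
        System.out.println(maxValue);      // prints: 310
    }
}
```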
The TemperatureMain class:
package maxTemperature;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class TemperatureMain {

    public static void main(String[] args)
            throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();  // loads the Hadoop configuration files, e.g. core-site.xml
        // pass generic command-line options into conf, keep the rest as program arguments
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: Temperature <in> <out>");
            System.exit(2);
        }
        @SuppressWarnings("deprecation")
        Job job = new Job(conf, "MaxTemperature");  // on Hadoop 2.x, Job.getInstance(conf, "MaxTemperature") is preferred
        job.setJarByClass(TemperatureMain.class);        // main class
        job.setMapperClass(TemperatureMapper.class);     // Mapper
        job.setCombinerClass(TemperatureReducer.class);  // Combiner (max is associative, so the reducer doubles as combiner)
        job.setReducerClass(TemperatureReducer.class);   // Reducer
        job.setOutputKeyClass(Text.class);               // output key type
        job.setOutputValueClass(IntWritable.class);      // output value type
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));  // output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);  // wait for completion, then exit
    }
}
Export the project as a jar file; the jar then appears on the desktop.
Next, start the cluster with start-all.sh and upload the data to HDFS.
Run the program with:

hadoop jar /home/gznc/Desktop/maxTemperature.jar maxTemperature.TemperatureMain /user/gznc/inputs/abc.txt /user/gznc/outputfile

The format is: hadoop jar <path to the jar (mine is on the local desktop)> <package.class containing the main function> <input file path on HDFS> <output path on HDFS>.
Console output after a successful run:
16/10/25 23:28:40 INFO mapreduce.Job: Running job: job_1477406455166_0002
16/10/25 23:29:06 INFO mapreduce.Job: Job job_1477406455166_0002 running in uber mode : false
16/10/25 23:29:06 INFO mapreduce.Job: map 0% reduce 0%
16/10/25 23:29:28 INFO mapreduce.Job: map 51% reduce 0%
16/10/25 23:29:31 INFO mapreduce.Job: map 100% reduce 0%
16/10/25 23:29:51 INFO mapreduce.Job: map 100% reduce 100%
16/10/25 23:29:52 INFO mapreduce.Job: Job job_1477406455166_0002 completed successfully
16/10/25 23:29:53 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=114
FILE: Number of bytes written=193971
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=53008430
HDFS: Number of bytes written=81
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=22664
Total time spent by all reduces in occupied slots (ms)=17363
Total time spent by all map tasks (ms)=22664
Total time spent by all reduce tasks (ms)=17363
Total vcore-seconds taken by all map tasks=22664
Total vcore-seconds taken by all reduce tasks=17363
Total megabyte-seconds taken by all map tasks=23207936
Total megabyte-seconds taken by all reduce tasks=17779712
Map-Reduce Framework
Map input records=229987
Map output records=229758
Map output bytes=1608306
Map output materialized bytes=114
Input split bytes=108
Combine input records=229758
Combine output records=12
Reduce input groups=12
Reduce shuffle bytes=114
Reduce input records=12
Reduce output records=12
Spilled Records=24
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=608
CPU time spent (ms)=11280
Physical memory (bytes) snapshot=307077120
Virtual memory (bytes) snapshot=1679773696
Total committed heap usage (bytes)=136122368
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=53008322
File Output Format Counters
Bytes Written=81
In Firefox, open <hostname>:18088; if the job status shows SUCCEEDED, it ran successfully.