Hadoop Map-Reduce Weather Example (Compressed Output)
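Implementation
We again build on the earlier example that picks out the maximum temperature.
The earlier example is described in this post: http://supercharles888.blog.51cto.com/609344/878422
This time we want the result written in a compressed format. The Map class (MaxTemperatureMapper) and the Reduce class (MaxTemperatureReducer) stay unchanged; we only need to add a few compression settings to the Configuration in the job class, namely the conf.setBoolean and conf.setClass calls in the listing that follows: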
package com.charles.parseweather.compression;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/**
 * Description: this class defines and runs the job (compressed-output version).
 *
 * @author charles.wang
 * @created May 24, 2012 5:29:12 PM
 */
public class MaxTemperatureWithCompression {

    /**
     * @param args the input path and the output path
     */
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }

        // Create the configuration for the Map-Reduce job
        Configuration conf = new Configuration();
        conf.set("hadoop.job.ugi", "hadoop-user,hadoop-user");

        // Compression-related settings:
        // write the reduce output in gzip-compressed form
        conf.setBoolean("mapred.output.compress", true);
        conf.setClass("mapred.output.compression.codec", GzipCodec.class, CompressionCodec.class);

        Job job = new Job(conf, "Get Maximum Weather Information with Compression! ^_^");

        // Set the class used to locate the job jar
        job.setJarByClass(MaxTemperatureWithCompression.class);

        // Parse the input and output arguments; both are file system paths
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Configure the job: Mapper class, Reducer class, and output types
        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
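To run this example we need an input file. Because Hadoop can recognize a compressed file automatically from the input file's extension, we create a directory structure in HDFS and upload a gzip file as the input of the map-reduce job (in the logs further down, this is /user/hadoop-user/compress-input/1901.gz).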
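The idea of picking a codec from the file extension can be sketched in plain Java. This is only an illustration, not Hadoop's implementation (in Hadoop itself, as far as I know, the lookup goes through CompressionCodecFactory). The class name CodecByExtension and the lookup table are assumptions made for this sketch; the three codec class names and their default extensions are the ones Hadoop ships.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only: a simplified stand-in for extension-based codec lookup. */
final class CodecByExtension {

    // File suffix -> codec class name. The table itself is a simplification
    // made up for this example; Hadoop builds its own from the configured codecs.
    private static final Map<String, String> CODECS = new LinkedHashMap<>();

    static {
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
    }

    /** Returns the codec class name for the path, or null if the file is treated as uncompressed. */
    static String codecFor(String path) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (path.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The input path that appears in the namenode log of this post
        System.out.println(codecFor("/user/hadoop-user/compress-input/1901.gz"));
        // A path without a known suffix is read as plain text
        System.out.println(codecFor("/user/hadoop-user/input/1901.txt"));
    }
}
```

This is why the Mapper needs no changes: the input format decompresses the .gz file before the records reach it.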
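Then we run main, passing the HDFS input file and the output directory as its two arguments.
Once the job has finished, the final output is visible in HDFS. As expected, the result is a gzip file (part-r-00000.gz in the output directory, according to the logs in the next section).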
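To check what such a gzip file contains, it can be decompressed with the JDK's java.util.zip classes once it has been copied out of HDFS. The sketch below works entirely in memory so that it runs without a cluster. The class name GzipRoundTrip and the sample record "1901\t317" are assumptions for illustration, not output taken from the job in this post.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Illustrative sketch: gzip a text record and read it back, using only the JDK. */
final class GzipRoundTrip {

    /** Compresses the text into gzip format (the same container format a .gz file uses). */
    static byte[] compress(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Decompresses gzip bytes back into text. */
    static String decompress(byte[] gzipped) {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical "year<TAB>max temperature" record, made up for this example
        String record = "1901\t317\n";
        byte[] gzipped = compress(record);
        // Every gzip stream starts with the magic bytes 0x1f 0x8b
        System.out.printf("magic: %02x %02x%n", gzipped[0] & 0xff, gzipped[1] & 0xff);
        System.out.print(decompress(gzipped));
    }
}
```

For a real output file, the ByteArrayInputStream would be replaced by a stream over the downloaded part-r-00000.gz.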
Observing the compressed-output process through the logs
The logs let us follow the whole process at a finer granularity:
namenode:
2012-05-31 13:11:08,621 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop-user,hadoop-user ip=/192.168.40.16 cmd=open src=/user/hadoop-user/compress-input/1901.gz dst=null perm=null
2012-05-31 13:11:08,754 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop-user,hadoop-user ip=/192.168.40.16 cmd=open src=/user/hadoop-user/compress-input/1901.gz dst=null perm=null
2012-05-31 13:11:08,758 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop-user,hadoop-user ip=/192.168.40.16 cmd=mkdirs src=/user/hadoop-user/compress-output/_temporary dst=null perm=hadoop-user:supergroup:rwxr-xr-x
2012-05-31 13:11:08,853 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop-user,hadoop-user ip=/192.168.40.16 cmd=open src=/user/hadoop-user/compress-input/1901.gz dst=null perm=null
2012-05-31 13:11:09,203 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop-user,hadoop-user ip=/192.168.40.16 cmd=create src=/user/hadoop-user/compress-output/_temporary/_attempt_local_0001_r_000000_0/part-r-00000.gz dst=null perm=hadoop-user:supergroup:rw-r--r--
2012-05-31 13:11:09,238 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /user/hadoop-user/compress-output/_temporary/_attempt_local_0001_r_000000_0/part-r-00000.gz. blk_-3869950436265612646_1016
2012-05-31 13:11:09,292 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 192.168.129.35:50010 is added to blk_-3869950436265612646_1016 size 29
2012-05-31 13:11:09,686 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /user/hadoop-user/compress-output/_temporary/_attempt_local_0001_r_000000_0/part-r-00000.gz is closed by DFSClient_-356100022
2012-05-31 13:11:09,692 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop-user,hadoop-user ip=/192.168.40.16 cmd=listStatus src=/user/hadoop-user/compress-output/_temporary/_attempt_local_0001_r_000000_0 dst=null perm=null
2012-05-31 13:11:09,695 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop-user,hadoop-user ip=/192.168.40.16 cmd=mkdirs src=/user/hadoop-user/compress-output dst=null perm=hadoop-user:supergroup:rwxr-xr-x
2012-05-31 13:11:09,698 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop-user,hadoop-user ip=/192.168.40.16 cmd=rename src=/user/hadoop-user/compress-output/_temporary/_attempt_local_0001_r_000000_0/part-r-00000.gz dst=/user/hadoop-user/compress-output/part-r-00000.gz perm=hadoop-user:supergroup:rw-r--r--
2012-05-31 13:11:09,699 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop-user,hadoop-user ip=/192.168.40.16 cmd=delete src=/user/hadoop-user/compress-output/_temporary/_attempt_local_0001_r_000000_0 dst=null perm=null
2012-05-31 13:11:09,703 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop-user,hadoop-user ip=/192.168.40.16 cmd=delete src=/user/hadoop-user/compress-output/_temporary dst=null perm=null
2012-05-31 13:11:51,010 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=hadoop-user,hadoop-user ip=/192.168.129.35 cmd=listStatus src=/user/hadoop-user/compress-output dst=null perm=null
datanode:
2012-05-31 13:11:08,864 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.129.35:50010, dest: /192.168.40.16:6233, bytes: 74447, op: HDFS_READ, cliID: DFSClient_-356100022, srvID: DS-1002949858-192.168.129.35-50010-1337839176422, blockid: blk_-4455870079864415553_1015
2012-05-31 13:11:09,248 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-3869950436265612646_1016 src: /192.168.40.16:6234 dest: /192.168.129.35:50010
2012-05-31 13:11:09,283 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /192.168.40.16:6234, dest: /192.168.129.35:50010, bytes: 29, op: HDFS_WRITE, cliID: DFSClient_-356100022, srvID: DS-1002949858-192.168.129.35-50010-1337839176422, blockid: blk_-3869950436265612646_1016
2012-05-31 13:11:09,283 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for block blk_-3869950436265612646_1016 terminating
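Here we can clearly see the whole process by which the gzip output file is produced in the target directory. Let N(i) denote the i-th namenode log entry and D(i) the i-th datanode log entry. Going by the timestamps, the execution order is:
N1->N2->N3->N4->D1->N5->N6->D2->D3->D4->N7->N8->...->N14
N1-N4: the namenode records the preparatory work, which includes opening the input file and creating the output directory together with its temporary subdirectory.
D1: the datanode serves the read of the input file.
N5, N6: following the naming convention and the compression setting in the configuration, the output file is created in the temporary directory (it is still empty at this point), and NameSystem then allocates a block for it on the datanode.
D2-D4: the datanode writes the final reduce result into the allocated block.
N7-N14: the namenode registers the stored block and closes the file, then moves it (cmd=rename) from the temporary directory to the location given by the second command-line argument, where it becomes the final output. The temporary directory is deleted afterwards.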
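The ordering above can be reproduced by merging the two logs on their timestamps. The following is a minimal sketch that assumes the namenode and datanode clocks are close enough to be compared directly. The class name LogInterleaver is made up for this example; the timestamps are the ones from the two logs above.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

/** Illustrative sketch: interleave namenode (N) and datanode (D) log entries by timestamp. */
final class LogInterleaver {

    // Timestamp layout used at the start of each Hadoop log line in this post
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    // Timestamps copied from the namenode log, in log order (N1..N14)
    static final String[] NAMENODE = {
        "2012-05-31 13:11:08,621", "2012-05-31 13:11:08,754", "2012-05-31 13:11:08,758",
        "2012-05-31 13:11:08,853", "2012-05-31 13:11:09,203", "2012-05-31 13:11:09,238",
        "2012-05-31 13:11:09,292", "2012-05-31 13:11:09,686", "2012-05-31 13:11:09,692",
        "2012-05-31 13:11:09,695", "2012-05-31 13:11:09,698", "2012-05-31 13:11:09,699",
        "2012-05-31 13:11:09,703", "2012-05-31 13:11:51,010"
    };

    // Timestamps copied from the datanode log, in log order (D1..D4)
    static final String[] DATANODE = {
        "2012-05-31 13:11:08,864", "2012-05-31 13:11:09,248",
        "2012-05-31 13:11:09,283", "2012-05-31 13:11:09,283"
    };

    private static final class Entry {
        final String label;
        final LocalDateTime time;

        Entry(String label, LocalDateTime time) {
            this.label = label;
            this.time = time;
        }
    }

    /** Labels the entries N1..Nn and D1..Dm, then sorts them by time. */
    static String interleave(String[] namenode, String[] datanode) {
        List<Entry> entries = new ArrayList<>();
        for (int i = 0; i < namenode.length; i++) {
            entries.add(new Entry("N" + (i + 1), LocalDateTime.parse(namenode[i], FORMAT)));
        }
        for (int i = 0; i < datanode.length; i++) {
            entries.add(new Entry("D" + (i + 1), LocalDateTime.parse(datanode[i], FORMAT)));
        }
        // List.sort is stable, so entries with equal timestamps (D3 and D4) keep their log order
        entries.sort((a, b) -> a.time.compareTo(b.time));
        return entries.stream().map(e -> e.label).collect(Collectors.joining("->"));
    }

    public static void main(String[] args) {
        System.out.println(interleave(NAMENODE, DATANODE));
    }
}
```

If the clocks on the two machines drift, entries that are only a few milliseconds apart (for example N6 and D2) could come out in a different order, so the result should be read together with the log contents rather than on its own.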
This article comes from the “平行线的凝聚” blog; please keep this attribution when reposting: http://supercharles888.blog.51cto.com/609344/883590