Hadoop 2.x I/O: MapReduce Compression
In an earlier post we covered compression in Hadoop. The data a Hadoop job handles is typically large: both the input and the output can be huge, so it is worth compressing the data produced by the map and reduce phases.
There are two ways to compress the reduce output: configure it through a Configuration object, or set it on the output via the FileOutputFormat class.
1. Compressing the reduce output (using Configuration)
With Configuration, we set mapred.output.compress to true, and set mapred.output.compression.codec to the class name of the codec we want to use. For example:
The job driver: MaxTemperatureWithCompression.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCompression {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        // These two lines are the key: enable output compression and choose the codec
        conf.setBoolean("mapred.output.compress", true);
        conf.set("mapred.output.compression.codec", GzipCodec.class.getName());
        conf.set("mapred.jar", "MaxTemperature.jar");

        Job job = Job.getInstance(conf);
        job.setJarByClass(MaxTemperatureWithCompression.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
The Mapper: MaxTemperatureMapper.java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 23);
        int airTemperature;
        if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}
The Reducer: MaxTemperatureReducer.java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}
Compile, package, and run:
[grid@tiny01 myclass]$ hadoop fs -ls /
Found 5 items
-rw-r--r--   1 grid supergroup       49252 2017-07-29 00:07 /data.txt
-rw-r--r--   1 grid supergroup  4848295796 2017-07-01 00:40 /input
drwx------   - grid supergroup           0 2017-07-01 00:42 /tmp
drwxr-xr-x   - grid supergroup           0 2017-07-01 00:42 /user
[grid@tiny01 myclass]$ hadoop jar MaxTemperature.jar MaxTemperatureWithCompression /data.txt /out
[grid@tiny01 myclass]$ hadoop fs -cat /out/part-r-00000.gz | gunzip
20160622    380
20160623    310
The properties controlling reduce-output compression (the original table here was an image; these are the standard properties):
- mapred.output.compress (boolean, default false): whether to compress the job output
- mapred.output.compression.codec (class name, default DefaultCodec): the codec to use
- mapred.output.compression.type (NONE, RECORD, or BLOCK, default RECORD): the compression granularity for SequenceFile output
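Note that the mapred.* names used above are the Hadoop 1.x names; they still work in Hadoop 2.x but are deprecated. A sketch of the same settings with the newer 2.x property names (carrying over the GzipCodec choice from the example above):

```java
Configuration conf = new Configuration();
// Hadoop 2.x property names; the deprecated 1.x equivalents are in the comments
conf.setBoolean("mapreduce.output.fileoutputformat.compress", true);   // mapred.output.compress
conf.set("mapreduce.output.fileoutputformat.compress.codec",
        GzipCodec.class.getName());                                    // mapred.output.compression.codec
conf.set("mapreduce.output.fileoutputformat.compress.type", "BLOCK");  // mapred.output.compression.type (SequenceFile output only)
```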
2. Compressing the reduce output (using FileOutputFormat)
This time we only change the driver class:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCompression2 {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Configuration conf = new Configuration();
        conf.set("mapred.jar", "MaxTemperature2.jar");

        Job job = Job.getInstance(conf);
        job.setJarByClass(MaxTemperatureWithCompression2.class);
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Add these two lines: enable output compression and choose the codec
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Run it:
[grid@tiny01 myclass]$ hadoop jar MaxTemperature2.jar MaxTemperatureWithCompression2 /data.txt /out2
[grid@tiny01 myclass]$ hadoop fs -cat /out2/part-r-00000.gz | gunzip
20160622    380
20160623    310
The output is identical.
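If we want to consume the compressed output from code rather than piping it through gunzip, the codec can be inferred from the file extension with CompressionCodecFactory (a sketch; the path is the /out output from the run above):

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class ReadCompressedOutput {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/out/part-r-00000.gz"); // output file from the run above
        // Pick the codec based on the file extension (.gz -> GzipCodec)
        CompressionCodec codec = new CompressionCodecFactory(conf).getCodec(path);
        try (InputStream in = codec.createInputStream(fs.open(path))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```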
3. Compressing the map output
Because map and reduce tasks often run on different nodes, the map output has to travel over the network. Compressing it with a fast codec such as LZO or LZ4 can therefore improve overall job performance. The properties controlling map-output compression (the original table here was an image; these are the standard properties):
- mapred.compress.map.output (boolean, default false): whether to compress map output
- mapred.map.output.compression.codec (class name, default DefaultCodec): the codec to use for map output
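With the newer Hadoop 2.x property names, enabling map-output compression through a plain Configuration can be sketched as follows (Lz4Codec is used here as an example of a fast codec that ships with Hadoop 2.x):

```java
Configuration conf = new Configuration();
conf.setBoolean("mapreduce.map.output.compress", true);     // old name: mapred.compress.map.output
conf.setClass("mapreduce.map.output.compress.codec",
        Lz4Codec.class, CompressionCodec.class);            // old name: mapred.map.output.compression.codec
Job job = Job.getInstance(conf);
```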
Alternatively, we can use the convenience methods on a JobConf object (a subclass of Configuration) to set the same configuration:
JobConf conf = new JobConf();
conf.setCompressMapOutput(true);
conf.setMapOutputCompressorClass(GzipCodec.class);
conf.set("mapred.jar", "classname.jar");
Job job = Job.getInstance(conf);
4. References
[1] Tom White. Hadoop: The Definitive Guide, Third Edition. Copyright 2013 Tom White. ISBN 978-1-449-31152-0.