HDPCD-Java Review Notes (11)
Optimizing MapReduce Jobs
Optimization Best Practices
Here are some of the best ways to increase performance:
1.Configure the number of Mappers and Reducers so that work is distributed evenly across NodeManagers.
2.Use a Combiner, which can greatly minimize network traffic.
3.Avoid instantiating new objects in the map or reduce methods. Reuse existing objects whenever possible.
4.Be careful with Strings; concatenating Strings creates new String objects behind the scenes. Make constant Strings static final fields, and prefer StringBuilder over String concatenation.
5.Avoid converting numeric types to Text when not necessary. The parsing adds unnecessary processing, and Text objects take up more serialized space than IntWritable, FloatWritable, etc.
6.Define and configure a RawComparator to avoid deserializing objects.
7.Prefer StringUtils.split over String.split .
8.Use data compression, which can minimize network traffic.
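Practices 3 through 5 above can be sketched without any Hadoop dependency. In the sketch below, MutableInt is a hypothetical stand-in for a reusable Writable such as IntWritable: the loop mutates one holder instance instead of allocating a new object per record, and a single StringBuilder is cleared and reused rather than building new Strings by concatenation.

```java
// Sketch of the object-reuse pattern from the best practices above.
// MutableInt is a hypothetical stand-in for a reusable Writable
// such as IntWritable; it is not a Hadoop class.
public class ReusePatternDemo {
    static final class MutableInt {           // reusable holder, like IntWritable
        private int value;
        void set(int value) { this.value = value; }
        int get() { return value; }
    }

    public static void main(String[] args) {
        MutableInt count = new MutableInt();   // created once, reused per record
        StringBuilder line = new StringBuilder();
        for (int record = 0; record < 3; record++) {
            count.set(record);                 // reuse instead of new MutableInt()
            line.setLength(0);                 // reuse the builder's internal buffer
            line.append("record=").append(count.get());
            System.out.println(line);
        }
    }
}
```

In a real Mapper the same pattern applies: create the output key and value objects once as fields, call set(...) on them inside the map method, and pass them to context.write.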
Optimizing the Map Phase
mapreduce.task.io.sort.mb -- The amount of memory allocated to the MapOutputBuffer. The default value is 200MB.
mapreduce.map.sort.spill.percent -- Represents a percentage of mapreduce.task.io.sort.mb that, when exceeded, records are spilled to disk. The default is 80%.
mapreduce.task.io.sort.factor -- The number of partitions to merge at one time. The default is 10.
mapreduce.map.speculative -- When set to true, the MapReduce framework may start another instance of a map task that is straggling, in case this poorly-performing task eventually fails or could be completed quicker on a different node. The default value of this property is true.
mapreduce.job.jvm.numtasks -- The number of tasks to run per JVM. The default is 1, meaning each task will run in a new JVM process. Set this value higher than 1 to reuse a JVM for multiple tasks, which can save the overhead of killing and starting up JVM processes.
mapreduce.map.output.compress -- Defaults to false, but set it to true if you want the output of the mapper to be compressed before being sent across the network. You can specify a codec using mapreduce.map.output.compress.codec.
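These map-side settings are typically tuned cluster-wide. As an illustration (the values below are examples, not recommendations), a mapred-site.xml fragment using the property names above might look like:

```xml
<!-- mapred-site.xml fragment; values are illustrative examples -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>200</value>
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
</property>
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
```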
Optimizing the Reduce Phase
mapreduce.reduce.shuffle.input.buffer.percent -- the percentage of the Reducer's memory allocated for storing map outputs during the shuffle. The default is 70%.
mapreduce.reduce.shuffle.merge.percent -- the percentage of mapreduce.reduce.shuffle.input.buffer.percent that, when exceeded, causes merging to occur on the Reducer. The default is 66%.
mapreduce.reduce.merge.inmem.threshold -- when this threshold is reached, a merge is triggered and a spill to disk occurs on the Reducer. The default is 1,000 map outputs. Setting this value to 0 has the effect of letting merges occur based on the value of mapreduce.reduce.shuffle.merge.percent.
mapreduce.reduce.input.buffer.percent -- allows map outputs to remain in memory and not be written to disk.
mapreduce.reduce.shuffle.parallelcopies -- the number of threads that a Reducer uses to retrieve the output from Mappers. The default is 5, but if you have hundreds of Mappers then this can be a bottleneck.
mapreduce.reduce.speculative -- when set to true, the MapReduce framework may start another instance of a straggling reduce task on a different node. The default value of this property is true.
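As with the map side, these reduce-side properties would normally be set in mapred-site.xml or on the job's Configuration. An illustrative fragment (values are examples, not recommendations):

```xml
<!-- mapred-site.xml fragment; values are illustrative examples -->
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>20</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>
```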
Data Compression
Data compression often has two benefits:
1.Increased speed
2.Less space needed on the filesystem
The commonly used algorithms and codecs available in Hadoop are:
Snappy -- org.apache.hadoop.io.compress.SnappyCodec
gzip -- org.apache.hadoop.io.compress.GzipCodec
bzip2 -- org.apache.hadoop.io.compress.BZip2Codec
LZO -- com.hadoop.compression.lzo.LzopCodec
DEFLATE -- org.apache.hadoop.io.compress.DefaultCodec
Data compression has several trade-offs that you need to consider, including:
Space vs. time -- While there is a gain in filesystem space or smaller network traffic, it will take additional time to compress and decompress the data.
Splittable vs. Non-splittable -- Most of the compression algorithms do not support the splitting of files, which is a major concern in MapReduce.
If the codec utilized does not support splitting, then a map task cannot take advantage of data locality. For example, if a large file is chunked across 10 DataNodes but uses Snappy compression (which is not splittable), then only one map task can process this file, which means 90% of the file needs to be transferred across the network to a single NodeManager for processing.
Of the codecs above, only bzip2 and LZO (once the file has been indexed) support splitting.
Configuring Data Compression
Enable compression and configure the codec using configuration properties, which can be defined at several levels, including:
The cluster level -- in mapred-site.xml
The application level -- by setting the properties using the Configuration instance
The runtime level -- by using command-line arguments
Here are the properties to enable and configure compression in a MapReduce job:
mapreduce.map.output.compress : set to true or false to enable or disable compression of data output by the Mapper.
mapreduce.map.output.compress.codec : defines the codec to use for the compressed map output.
mapreduce.output.fileoutputformat.compress : set to true or false to enable or disable compression of data output by the job. The default is false.
mapreduce.output.fileoutputformat.compress.codec : defines the codec to use for the compressed job output.
mapreduce.output.fileoutputformat.compress.type : if the job output is compressed SequenceFiles, this property determines how they are compressed. Valid values are RECORD, BLOCK, or NONE.
The following turns on compression for both the map output and the job output, using Snappy compression for each:

Configuration conf = job.getConfiguration();
conf.setBoolean(MRJobConfig.MAP_OUTPUT_COMPRESS, true);
conf.setClass(MRJobConfig.MAP_OUTPUT_COMPRESS_CODEC,
              SnappyCodec.class,
              CompressionCodec.class);
conf.setBoolean(FileOutputFormat.COMPRESS, true);
conf.setClass(FileOutputFormat.COMPRESS_CODEC,
              SnappyCodec.class,
              CompressionCodec.class);
Providing a RawComparator can greatly improve the performance of a large MapReduce application:
•When a Mapper writes out a <key, value> pair using the context.write method, the key and value are immediately serialized.
•During the shuffle/sort phase, these keys need to be sorted, and the ordering is determined by the compareTo method of the key class.
•Because the keys are serialized, they must first be deserialized before they can be compared to each other.
You can avoid the deserialization by writing a compare method that compares the keys in their serialized state.
You can also define a grouping RawComparator for controlling which keys are grouped together for a single call to the reduce method of a Reducer.
Defining a RawComparator
Write a class that implements the org.apache.hadoop.io.RawComparator interface.The easiest way to implement the RawComparator interface is to extend the WritableComparator class.
public class CustomerComparator extends WritableComparator {
    protected CustomerComparator() {
        super(CustomerKey.class);
    }
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Read the customer IDs directly from the serialized bytes -- no deserialization.
        int customerId1 = readInt(b1, s1);
        int customerId2 = readInt(b2, s2);
        // Integer.compare avoids the overflow risk of subtracting one ID from the other.
        return Integer.compare(customerId1, customerId2);
    }
}
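The readInt call above pulls a big-endian int straight out of the serialized key bytes. The self-contained sketch below illustrates that idea without any Hadoop dependency; readIntBE is a hypothetical helper mirroring what WritableComparator.readInt does:

```java
import java.nio.ByteBuffer;

public class RawCompareDemo {
    // Hypothetical helper mirroring WritableComparator.readInt: interpret
    // the four bytes at offset s as a big-endian int, without deserializing
    // the whole key object.
    static int readIntBE(byte[] b, int s) {
        return ByteBuffer.wrap(b, s, 4).getInt();
    }

    // Compare two serialized keys whose first four bytes hold the customer ID,
    // the same way the raw compare method does.
    static int compareSerialized(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(readIntBE(b1, s1), readIntBE(b2, s2));
    }

    public static void main(String[] args) {
        byte[] key1 = ByteBuffer.allocate(4).putInt(42).array();
        byte[] key2 = ByteBuffer.allocate(4).putInt(7).array();
        // 42 > 7, so the raw comparison is positive.
        System.out.println(compareSerialized(key1, 0, key2, 0) > 0);
    }
}
```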
Let the MapReduce job know that your particular data type is to use your RawComparator.
You have two options for configuring a sort RawComparator:
1.Define a static initializer in the class definition.
public class CustomerKey
        implements WritableComparable<CustomerKey> {
    static {
        WritableComparator.define(CustomerKey.class,
                                  new CustomerComparator());
    }
    private int customerId;
    private String zipCode;
    //remainder of class definition...
}
2.Use the setSortComparator method when configuring the Job.
job.setSortComparatorClass(CustomerComparator.class);