HDPCD-Java Review Notes (11)
Optimizing MapReduce Jobs
Optimization Best Practices
Here are some of the best ways to increase performance:
1.Configure the number of Mappers and Reducers so that work is distributed evenly across NodeManagers.
2.Use a Combiner, which can greatly minimize network traffic.
3.Avoid instantiating new objects in the map or reduce methods. Reuse existing objects whenever possible.
4.Be careful with Strings; concatenating Strings creates new String objects behind the scenes. Make constant Strings static final fields, and prefer StringBuilder over String concatenation.
5.Avoid converting numeric types to Text when not necessary. The parsing adds unnecessary processing, and Text objects take up more serialized space than IntWritable, FloatWritable, etc.
6.Define and configure a RawComparator to avoid deserializing objects.
7.Prefer StringUtils.split over String.split .
8.Use data compression, which can minimize network traffic.
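Practices 3 through 5 above can be sketched without any Hadoop dependency. In the sketch below, MutableInt is a hypothetical stand-in for a reusable Writable such as IntWritable: the loop mutates one holder instance instead of allocating a new object per record, and a single StringBuilder is cleared and reused rather than building new Strings by concatenation.

```java
// Sketch of the object-reuse pattern from the best practices above.
// MutableInt is a hypothetical stand-in for a reusable Writable
// such as IntWritable; it is not a Hadoop class.
public class ReusePatternDemo {
    static final class MutableInt {           // reusable holder, like IntWritable
        private int value;
        void set(int value) { this.value = value; }
        int get() { return value; }
    }

    public static void main(String[] args) {
        MutableInt count = new MutableInt();   // created once, reused per record
        StringBuilder line = new StringBuilder();
        for (int record = 0; record < 3; record++) {
            count.set(record);                 // reuse instead of new MutableInt()
            line.setLength(0);                 // reuse the builder's internal buffer
            line.append("record=").append(count.get());
            System.out.println(line);
        }
    }
}
```

In a real Mapper the same pattern applies: create the output key and value objects once as fields, call set(...) on them inside the map method, and pass them to context.write.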
Optimizing the Map Phase
mapreduce.task.io.sort.mb -- The amount of memory allocated to the MapOutputBuffer. The default value is 200MB.
mapreduce.map.sort.spill.percent -- Represents a percentage of mapreduce.task.io.sort.mb that, when exceeded, records are spilled to disk. The default is 80%.
mapreduce.task.io.sort.factor -- The number of partitions to merge at one time. The default is 10.
mapreduce.map.speculative -- When set to true, the MapReduce framework may start another instance of a map task that is straggling, in case this poorly-performing task eventually fails or could be completed quicker on a different node. The default value of this property is true.
mapreduce.job.jvm.numtasks -- The number of tasks to run per JVM. The default is 1, meaning each task will run in a new JVM process. Set this value higher than 1 to reuse a JVM for multiple tasks, which can save the overhead of killing and starting up JVM processes.
mapreduce.map.output.compress -- Defaults to false, but set it to true if you want the output of the mapper to be compressed before being sent across the network. You can specify a codec using mapreduce.map.output.compress.codec.
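These map-side settings are typically tuned cluster-wide. As an illustration (the values below are examples, not recommendations), a mapred-site.xml fragment using the property names above might look like:

```xml
<!-- mapred-site.xml fragment; values are illustrative examples -->
<property>
  <name>mapreduce.task.io.sort.mb</name>
  <value>200</value>
</property>
<property>
  <name>mapreduce.map.sort.spill.percent</name>
  <value>0.80</value>
</property>
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
```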
Optimizing the Reduce Phase
mapreduce.reduce.shuffle.input.buffer.percent -- the percentage of the Reducer's memory allocated for storing map outputs during the shuffle. The default is 70%.
mapreduce.reduce.shuffle.merge.percent -- the percentage of mapreduce.reduce.shuffle.input.buffer.percent that, when exceeded, causes merging to occur on the Reducer. The default is 66%.
mapreduce.reduce.merge.inmem.threshold -- when this threshold is reached, a merge is triggered and a spill to disk occurs on the Reducer. The default is 1,000 map outputs. Setting this value to 0 has the effect of letting merges occur based on the value of mapreduce.reduce.shuffle.merge.percent.
mapreduce.reduce.input.buffer.percent -- allows map outputs to remain in memory and not be written to disk.
mapreduce.reduce.shuffle.parallelcopies -- the number of threads that a Reducer uses to retrieve the output from Mappers. The default is 5, but if you have hundreds of Mappers then this can be a bottleneck.
mapreduce.reduce.speculative -- when set to true, the MapReduce framework may start another instance of a straggling reduce task on a different node. The default value of this property is true.
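As with the map side, these reduce-side properties would normally be set in mapred-site.xml or on the job's Configuration. An illustrative fragment (values are examples, not recommendations):

```xml
<!-- mapred-site.xml fragment; values are illustrative examples -->
<property>
  <name>mapreduce.reduce.shuffle.parallelcopies</name>
  <value>20</value>
</property>
<property>
  <name>mapreduce.reduce.speculative</name>
  <value>true</value>
</property>
```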
Data Compression
Data compression often has two benefits:
1.Increased speed
2.Less space needed on the filesystem
The commonly used algorithms and codecs available in Hadoop are:
Snappy -- org.apache.hadoop.io.compress.SnappyCodec
gzip -- org.apache.hadoop.io.compress.GzipCodec
bzip2 -- org.apache.hadoop.io.compress.BZip2Codec
LZO -- com.hadoop.compression.lzo.LzopCodec
DEFLATE -- org.apache.hadoop.io.compress.DefaultCodec
Data compression has several trade-offs that you need to consider, including:
Space vs. time -- While there is a gain in filesystem space or smaller network traffic, it will take additional time to compress and decompress the data.
Splittable vs. Non-splittable -- Most of the compression algorithms do not support the splitting of files, which is a major concern in MapReduce.
If the codec utilized does not support splitting, then a map task cannot take advantage of data locality. For example, if a large file is chunked across 10 DataNodes but uses Snappy compression (which is not splittable), then only one map task can process this file, which means 90% of the file needs to be transferred across the network to a single NodeManager for processing.
Of the codecs above, only bzip2 and LZO (once the file has been indexed) support splitting.
Configuring Data Compression
Enable compression and configure the codec using configuration properties, which can be defined at several levels, including:
The cluster level -- in mapred-site.xml
The application level -- by setting the properties using the Configuration instance
The runtime level -- by using command-line arguments
Here are the properties to enable and configure compression in a MapReduce job:
mapreduce.map.output.compress : set to true or false to enable or disable compression of data output by the Mapper.
mapreduce.map.output.compress.codec : defines the codec to use for the compressed map output.
mapreduce.output.fileoutputformat.compress : set to true or false to enable or disable compression of data output by the job. The default is false.
mapreduce.output.fileoutputformat.compress.codec : defines the codec to use for the compressed job output.
mapreduce.output.fileoutputformat.compress.type : if the job output is compressed SequenceFiles, this property determines how they are compressed. Valid values are RECORD, BLOCK, or NONE.
The following turns on compression for both the map output and the job output, using Snappy compression for each:

Configuration conf = job.getConfiguration();
conf.setBoolean(MRJobConfig.MAP_OUTPUT_COMPRESS, true);
conf.setClass(MRJobConfig.MAP_OUTPUT_COMPRESS_CODEC,
              SnappyCodec.class,
              CompressionCodec.class);
conf.setBoolean(FileOutputFormat.COMPRESS, true);
conf.setClass(FileOutputFormat.COMPRESS_CODEC,
              SnappyCodec.class,
              CompressionCodec.class);
Providing a RawComparator can greatly improve the performance of a large MapReduce application:
•When a Mapper writes out a <key, value> pair using the context.write method, the key and value are immediately serialized.
•During the shuffle/sort phase, these keys need to be sorted, and the ordering is determined by the compareTo method of the key class.
•Because the keys are serialized, they must first be deserialized before they can be compared to each other.
You can avoid the deserialization by writing a compare method that compares the keys in their serialized state.
You can also define a grouping RawComparator for controlling which keys are grouped together for a single call to the reduce method of a Reducer.
Defining a RawComparator
Write a class that implements the org.apache.hadoop.io.RawComparator interface.The easiest way to implement the RawComparator interface is to extend the WritableComparator class.
public class CustomerComparator extends WritableComparator {
    protected CustomerComparator() {
        super(CustomerKey.class);
    }
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // Read the customer IDs directly from the serialized bytes -- no deserialization.
        int customerId1 = readInt(b1, s1);
        int customerId2 = readInt(b2, s2);
        // Integer.compare avoids the overflow risk of subtracting one ID from the other.
        return Integer.compare(customerId1, customerId2);
    }
}
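The readInt call above pulls a big-endian int straight out of the serialized key bytes. The self-contained sketch below illustrates that idea without any Hadoop dependency; readIntBE is a hypothetical helper mirroring what WritableComparator.readInt does:

```java
import java.nio.ByteBuffer;

public class RawCompareDemo {
    // Hypothetical helper mirroring WritableComparator.readInt: interpret
    // the four bytes at offset s as a big-endian int, without deserializing
    // the whole key object.
    static int readIntBE(byte[] b, int s) {
        return ByteBuffer.wrap(b, s, 4).getInt();
    }

    // Compare two serialized keys whose first four bytes hold the customer ID,
    // the same way the raw compare method does.
    static int compareSerialized(byte[] b1, int s1, byte[] b2, int s2) {
        return Integer.compare(readIntBE(b1, s1), readIntBE(b2, s2));
    }

    public static void main(String[] args) {
        byte[] key1 = ByteBuffer.allocate(4).putInt(42).array();
        byte[] key2 = ByteBuffer.allocate(4).putInt(7).array();
        // 42 > 7, so the raw comparison is positive.
        System.out.println(compareSerialized(key1, 0, key2, 0) > 0);
    }
}
```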
Let the MapReduce job know that your particular data type is to use your RawComparator.
You have two options for configuring a sort RawComparator:
1.Define a static initializer in the class definition.
public class CustomerKey
        implements WritableComparable<CustomerKey> {
    static {
        WritableComparator.define(CustomerKey.class,
                                  new CustomerComparator());
    }
    private int customerId;
    private String zipCode;
    //remainder of class definition...
}
2.Use the setSortComparator method when configuring the Job.
job.setSortComparatorClass(CustomerComparator.class);