Hadoop MapReduce: to Sort or Not to Sort


Tuesday, Jan 22nd was a critical milestone for us at Syncsort, as our main contribution to the Apache Hadoop project was committed. This contribution, patch MAPREDUCE-2454, introduced a new feature in the Hadoop MapReduce framework that allows alternative implementations of the Sort phase. This work started more than a year ago, and Syncsort's Technology Architect Asokan worked closely with the Apache open source community on design iterations, code reviews, and commits. We sincerely thank the Apache Hadoop community and the MapReduce project committers for their collaboration and support throughout this work, and congratulate them on the release of Hadoop-2.0.3-alpha.

What is the big deal about Sort? Sort is fundamental to the MapReduce framework: the data is sorted between the Map and Reduce phases (see the figure below). Syncsort's contribution allows the native Hadoop sort to be replaced by an alternative sort implementation, on both the Map and Reduce sides, i.e. it makes the Sort phase pluggable.

[Figure: the MapReduce data flow, with the Sort phase between the Map and Reduce phases]
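For a concrete picture of how a job could opt into an alternative implementation, here is a minimal sketch based on the configuration properties documented under Hadoop's "Pluggable Shuffle and Pluggable Sort" feature. The com.example classes are hypothetical stand-ins for real plugins implementing the framework's collector and shuffle-consumer interfaces:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Minimal sketch: selecting alternative sort/shuffle implementations per job.
    // The property names follow Hadoop's "Pluggable Shuffle and Pluggable Sort"
    // documentation; com.example.FastSortCollector and com.example.FastShuffle
    // are hypothetical plugin classes.
    public class PluggableSortExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Map side: replace the default sort/spill collector (MapOutputBuffer).
        conf.set("mapreduce.job.map.output.collector.class",
                 "com.example.FastSortCollector");
        // Reduce side: replace the default fetch/merge implementation (Shuffle).
        conf.set("mapreduce.job.reduce.shuffle.consumer.plugin.class",
                 "com.example.FastShuffle");
        Job job = Job.getInstance(conf, "pluggable-sort-demo");
        // ... set mapper, reducer, input/output formats and paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }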

Opening up the Sort phase to alternative implementations will facilitate new use cases and data flows in the MapReduce framework. Let’s look at some of these use cases:

Optimized sort implementations. The performance of sort-intensive data flows, and of aggregate functions that require a sort, such as MEDIAN, will improve significantly when an optimized sort implementation is used. Such implementations can take advantage of hardware architectures, operating systems, and data characteristics. Improving the performance of sort within the MapReduce framework is already listed as one of the Hadoop research projects (see http://wiki.apache.org/hadoop/HadoopResearchProjects under 'Map reduce performance enhancements'), and sort benchmarks are often used for evaluating Hadoop.

Hash-based aggregations. Many aggregate functions whose output is small enough to fit in memory, e.g. COUNT, AVERAGE, and MIN/MAX, can be implemented as a hash-based aggregation that does not require a sort (see MAPREDUCE-3247). A special sort implementation can support this by eliminating the sort altogether. Hash-based aggregations will provide a significant performance benefit for applications such as log analysis and queries over large data volumes.
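To make the idea concrete, here is a minimal sketch (not part of the patch) of a hash-based COUNT aggregation written as a standard Hadoop reducer. Because it accumulates into an in-memory hash map and emits only in cleanup(), it never depends on its input being sorted or grouped, which is exactly what lets a sort-eliminating implementation feed it records in any order:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hash-based COUNT: accumulate per-key counts in a hash map and emit them
    // once at the end. The aggregate must fit in memory, as the post notes.
    public class HashCountReducer
        extends Reducer<Text, LongWritable, Text, LongWritable> {

      private final Map<String, Long> counts = new HashMap<>();

      @Override
      protected void reduce(Text key, Iterable<LongWritable> values, Context ctx) {
        long sum = 0;
        for (LongWritable v : values) {
          sum += v.get();
        }
        counts.merge(key.toString(), sum, Long::sum);
      }

      @Override
      protected void cleanup(Context ctx) throws IOException, InterruptedException {
        for (Map.Entry<String, Long> e : counts.entrySet()) {
          ctx.write(new Text(e.getKey()), new LongWritable(e.getValue()));
        }
      }
    }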

Ability to run a job with a subset of data. Many applications, such as data sampling, require processing only a subset of the data, e.g. first-N-matches/LIMIT N queries (see MAPREDUCE-1928). In Hadoop MapReduce, all Mappers need to finish before a Reducer can output any data. A special sort implementation using the patch can avoid the sort altogether, so that data reaches a single Reducer as soon as a few Mappers complete. The Reducer can then stop after N records are processed. This prevents launching a large number of Mappers and drastically reduces the amount of wasted work, benefiting applications like Hive.
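As an illustration of the reduce-side half of this pattern, here is a minimal sketch of a limit-N reducer; limit.n is a hypothetical job property. A plain reducer can only skip work once the limit is reached, not terminate the job, so the real win comes from the patch delivering data before all Mappers finish:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // "First N records" with a single reducer. Once N records have been
    // emitted, remaining input is skipped rather than processed.
    public class LimitNReducer extends Reducer<Text, Text, Text, Text> {

      private long limit;
      private long emitted = 0;

      @Override
      protected void setup(Context ctx) {
        limit = ctx.getConfiguration().getLong("limit.n", 100); // hypothetical property
      }

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context ctx)
          throws IOException, InterruptedException {
        for (Text v : values) {
          if (emitted >= limit) {
            return; // past the limit: skip the remaining work
          }
          ctx.write(key, v);
          emitted++;
        }
      }
    }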

Optimized full joins. Critical data warehouse processes such as change data capture require a full join. The basic Hadoop MapReduce framework supports full joins in the Reducer. In cases where both sides of the join are very large data sets, a Java implementation of a full join can easily turn into a memory hog. The patch allows resource-efficient implementations for handling large joins, with performance benefits.
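For reference, here is a minimal sketch of the plain reduce-side full outer join the paragraph describes. Mappers are assumed to tag each value with its source dataset ("L:" or "R:"); buffering both sides in memory is precisely what turns this approach into a memory hog when both inputs are large:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Naive reduce-side full outer join. Both sides of each key are buffered
    // in memory before the cross product is emitted; unmatched rows pair with
    // a "null" placeholder.
    public class FullJoinReducer extends Reducer<Text, Text, Text, Text> {

      @Override
      protected void reduce(Text key, Iterable<Text> values, Context ctx)
          throws IOException, InterruptedException {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (Text v : values) {
          String s = v.toString();
          if (s.startsWith("L:")) left.add(s.substring(2));
          else right.add(s.substring(2));
        }
        if (left.isEmpty()) left.add("null");
        if (right.isEmpty()) right.add("null");
        for (String l : left) {
          for (String r : right) {
            ctx.write(key, new Text(l + "\t" + r));
          }
        }
      }
    }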

As my colleague Jorge Lopez's blog post highlights, the Big Data skills gap is a key challenge: technical skills around Hadoop, MapReduce, and Big Data solutions are scarce and expensive. Involvement from development communities and software vendors will be critical for increased adoption of Hadoop as a data management platform. We at Syncsort are excited to be part of the community broadening the Hadoop platform and increasing business value and ROI for enterprise Big Data initiatives.

Stay tuned for our next blog post, where we will talk about how Syncsort's per-node scalability complements Hadoop's horizontal scalability for Big Data integration… In the meantime, we would like to hear from you about your data integration experience on Hadoop!

Ref:  http://blog.syncsort.com/2013/02/hadoop-mapreduce-to-sort-or-not-to-sort/

Inspired by Tenzing, Section 5.1, "MapReduce Enhancements":

Sort Avoidance. Certain operators such as hash join and hash aggregation require shuffling, but not sorting. The MapReduce API was enhanced to automatically turn off sorting for these operations. When sorting is turned off, the mapper feeds data to the reducer which directly passes the data to the Reduce() function bypassing the intermediate sorting step. This makes many SQL operators significantly more efficient.

Many applications need only aggregation, not sorting. Using a sort to achieve aggregation is costly and inefficient. Without sorting, an application can use a hash table or hash map to aggregate efficiently. But the application should bear in mind that reducer memory is limited, so it must manage the reducer's memory itself and guard against running out of memory. The map-side combiner is not supported; as a workaround, you can also do hash aggregation on the map side, as sketched below.
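Here is a minimal sketch of that map-side workaround: in-mapper hash aggregation (word counting, as an example) with a size cap that flushes partial aggregates so the task cannot run out of memory. MAX_ENTRIES is an arbitrary illustrative threshold, not a framework setting:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // In-mapper combining with a memory guard: when the hash map grows past
    // MAX_ENTRIES, partial counts are flushed and the map is cleared. A
    // hash-aggregating reducer downstream merges the partials.
    public class MapSideHashAggMapper
        extends Mapper<Object, Text, Text, LongWritable> {

      private static final int MAX_ENTRIES = 100_000; // illustrative cap
      private final Map<String, Long> partial = new HashMap<>();

      @Override
      protected void map(Object key, Text value, Context ctx)
          throws IOException, InterruptedException {
        for (String word : value.toString().split("\\s+")) {
          if (word.isEmpty()) continue;
          partial.merge(word, 1L, Long::sum);
        }
        if (partial.size() > MAX_ENTRIES) {
          flush(ctx); // memory guard: emit partial aggregates and reset
        }
      }

      @Override
      protected void cleanup(Context ctx) throws IOException, InterruptedException {
        flush(ctx);
      }

      private void flush(Context ctx) throws IOException, InterruptedException {
        for (Map.Entry<String, Long> e : partial.entrySet()) {
          ctx.write(new Text(e.getKey()), new LongWritable(e.getValue()));
        }
        partial.clear();
      }
    }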

The following are the main points of the sort avoidance implementation:

  1. Add a boolean configuration parameter, mapreduce.sort.avoidance, to turn the sort avoidance workflow on and off. The two workflows coexist.
  2. Key/value pairs emitted by the map function are sorted by partition only, using a more efficient sorting algorithm: counting sort (see the sketch after this list).
  3. The map-side merge uses a byte-level merge, which simply concatenates bytes from the generated spills, reading bytes in and writing bytes out, without the key/value serialization/deserialization and comparison overhead that the current version incurs.
  4. The reduce can start up as soon as any map output is available, in contrast to the sort workflow, which must wait until all map outputs are fetched and merged.
  5. Map output in memory can be consumed directly by the reduce. When the reduce cannot keep up with the rate of incoming map outputs, an in-memory merge thread kicks in, merging in-memory map outputs onto disk.
  6. On-disk files are read sequentially to feed the reduce, in contrast to the current implementation, which reads multiple files concurrently and results in many disk seeks. Map output in memory takes precedence over on-disk files in feeding the reduce function.
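To illustrate point 2, here is a minimal, self-contained sketch of a counting sort over partition numbers. Because partition ids fall in the small known range [0, numPartitions), the sort runs in O(n + p) with no key comparisons at all; Record and its fields are illustrative stand-ins for the spill buffer's actual bookkeeping, not the patch code itself:

    // Counting sort of buffered map-output records by partition id.
    public class PartitionCountingSort {

      public static class Record {
        final int partition;   // partition id assigned by the Partitioner
        final byte[] keyValue; // serialized key/value bytes
        Record(int partition, byte[] keyValue) {
          this.partition = partition;
          this.keyValue = keyValue;
        }
      }

      public static Record[] sortByPartition(Record[] records, int numPartitions) {
        // 1. Count records per partition.
        int[] counts = new int[numPartitions];
        for (Record r : records) counts[r.partition]++;
        // 2. Prefix-sum the counts into per-partition start offsets.
        int[] offsets = new int[numPartitions];
        for (int p = 1; p < numPartitions; p++) {
          offsets[p] = offsets[p - 1] + counts[p - 1];
        }
        // 3. Scatter each record to its final position (stable).
        Record[] sorted = new Record[records.length];
        for (Record r : records) {
          sorted[offsets[r.partition]++] = r;
        }
        return sorted;
      }
    }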

I have already implemented this feature based on Hadoop CDH3U3 and done some performance evaluation; see https://github.com/hanborq/hadoop for details. Now I'm willing to port it to YARN. Comments are welcome.


Ref:  https://issues.apache.org/jira/browse/MAPREDUCE-4039
