More about Hadoop combiners
Hadoop combiners are a very powerful tool to speed up our computations. We already saw what a combiner is in a previous post, and we also saw another form of optimization in this post. Let's put it all together to get the broader idea.
The combiners are optimizations that can be used with Hadoop to make a local-reduction: the idea is to reduce the key-value pairs directly on the mapper, to avoid transmitting all of them to the reducers.
Let's get back to the Top20 example from the previous post, which finds the top 20 words most used in a text. The Hadoop output of this job is shown below:
...
Map input records=4239
Map output records=37817
Map output bytes=359621
Input split bytes=118
Combine input records=0
Combine output records=0
Reduce input groups=4987
Reduce shuffle bytes=435261
Reduce input records=37817
Reduce output records=20
...
As we can see from the counters, without a combiner we have 4239 lines in input for the mappers and 37817 key-value pairs emitted (the number of occurrences of words in the text). Having defined no combiner, the input and output records of the combiners are 0, so the input records for the reducers are exactly those emitted by the mappers: 37817.
Let's define a simple combiner:
public static class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {

        // computes the number of occurrences of a single word
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
As we can see, the code has the same logic of the reducer, since its target is the same: reducing key/value pairs.
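Defining the combiner class is not enough: it also has to be registered on the job in the driver. A minimal sketch of that setup, assuming the mapper and reducer classes from the previous posts are named WordCountMapper and WordCountReducer (those names are not shown in this post):

```java
// Driver sketch: WordCountMapper and WordCountReducer are assumed
// to be the classes from the previous posts.
Job job = Job.getInstance(new Configuration(), "top20");
job.setMapperClass(WordCountMapper.class);
job.setCombinerClass(WordCountCombiner.class); // the combiner defined above
job.setReducerClass(WordCountReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
```

Note that setCombinerClass() takes a Reducer subclass: the combiner and the reducer share the same contract, which is why the code above could reuse the reducer's logic.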
Running the job having set the combiner gives us this result:
...
Map input records=4239
Map output records=37817
Map output bytes=359621
Input split bytes=116
Combine input records=37817
Combine output records=20
Reduce input groups=20
Reduce shuffle bytes=194
Reduce input records=20
Reduce output records=20
...
Looking at the output from Hadoop, we see that the combiner now has 37817 input records: this means that all the records emitted by the mappers were sent to the combiners; the combiners in turn emitted 20 records, which is the number of records received by the reducers.
Wow, that's a great result! We avoided the transmission of a lot of data: just 20 records instead of 37817 that we had without the combiner.
But there's a big disadvantage to using combiners: since a combiner is an optimization, Hadoop does not guarantee its execution. So, what can we do to ensure a reduction at the mapper level? Simple: we can put the logic of the reducer inside the mapper!
This is exactly what we've done in the mapper of this post. This pattern is called "in-mapper combiner". The reduce part is started at mapper level, so that the key-value pairs sent to the reducers are minimized.
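The core of the in-mapper combiner pattern is an in-memory map that accumulates partial counts inside the mapper and emits them only once at the end. A Hadoop-free sketch of that accumulation logic (the class and method names are illustrative, mirroring Mapper.map() and Mapper.cleanup()):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of in-mapper combining: counts are accumulated in a HashMap
// and emitted only once at the end, instead of one pair per word.
public class InMapperCombinerSketch {

    private final Map<String, Integer> counts = new HashMap<>();

    // analogous to Mapper.map(): aggregate locally instead of emitting
    public void map(String line) {
        for (String word : line.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    // analogous to Mapper.cleanup(): emit one pair per distinct word
    public Map<String, Integer> cleanup() {
        return counts;
    }
}
```

In a real Mapper the HashMap would be flushed to the Context in cleanup(), so the number of emitted pairs drops from one per word occurrence to one per distinct word.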
Let's see Hadoop output with this pattern (in-mapper combiner and without the stand-alone combiner):
...
Map input records=4239
Map output records=4987
Map output bytes=61522
Input split bytes=118
Combine input records=0
Combine output records=0
Reduce input groups=4987
Reduce shuffle bytes=71502
Reduce input records=4987
Reduce output records=20
...
Compared to the execution of the other mapper (without in-mapper combining), this mapper emits only 4987 records to the reducers instead of 37817. A big reduction, even if not as big as the one obtained with the stand-alone combiner.
And what happens if we decide to couple the in-mapper combiner pattern and the stand-alone combiner? Well, we've got the best of the two:
...
Map input records=4239
Map output records=4987
Map output bytes=61522
Input split bytes=116
Combine input records=4987
Combine output records=20
Reduce input groups=20
Reduce shuffle bytes=194
Reduce input records=20
Reduce output records=20
...
In this last case we get the best performance: the mapper emits a reduced number of records, and the combiner (if it is executed) reduces the size of the data sent to the reducers even further. The only downside of this approach I can think of is that it takes more time to code.
from: http://andreaiacono.blogspot.com/2014/05/more-about-hadoop-combiners.html