[ Hadoop | MapReduce ] Using CompositeInputSplit to Improve Join Efficiency


A map-side join is the most efficient way to join datasets. On Hadoop, for two large datasets, we can use a Composite Join to achieve this.
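To see why this is cheap, here is a minimal plain-Java sketch (illustration only, not Hadoop code; all names are invented) of the sorted-merge pass that a composite join performs within each partition. Because both sides arrive sorted by key and partitioned identically, one linear pass joins them with no shuffle:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sorted-merge inner join over two key-sorted inputs.
// Keys are assumed unique within each input in this sketch.
public class MergeJoinSketch {
    public static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int cmp = left.get(i)[0].compareTo(right.get(j)[0]);
            if (cmp < 0) i++;        // left key smaller: advance left
            else if (cmp > 0) j++;   // right key smaller: advance right
            else {                   // keys match: emit one joined row
                out.add(left.get(i)[0] + "\t" + left.get(i)[1] + "\t" + right.get(j)[1]);
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> users = Arrays.asList(
            new String[]{"1", "alice"}, new String[]{"2", "bob"}, new String[]{"4", "dave"});
        List<String[]> cities = Arrays.asList(
            new String[]{"1", "NY"}, new String[]{"3", "LA"}, new String[]{"4", "SF"});
        System.out.println(innerJoin(users, cities)); // joined rows for keys 1 and 4
    }
}
```

Each input is scanned exactly once, which is why the join itself is map-only and avoids the shuffle-and-sort cost of a reduce-side join.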


The Use Case

First, use an Identity Mapper and Identity Reducer to sort and partition the two inputs, so that both end up with the same number of partitions.

For example, run each preparation job with -Dmapred.reduce.tasks=2 so that both outputs have two partitions.
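A sketch of such a preparation job using the old mapred API (paths come from the command line; input is assumed to be tab-separated key/value text):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

// Pre-sort/partition one dataset. Run this once per input with the
// SAME reduce count, so both outputs are partitioned identically.
public class PreSortJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(PreSortJob.class);
        conf.setJobName("identity-sort");
        conf.setInputFormat(KeyValueTextInputFormat.class); // key \t value lines
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(IdentityMapper.class);   // pass records through
        conf.setReducerClass(IdentityReducer.class); // shuffle sorts by key
        conf.setNumReduceTasks(2); // same effect as -Dmapred.reduce.tasks=2
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}
```

The identity mapper and reducer do no work themselves; the job exists purely so that the shuffle sorts each dataset by key and the partitioner splits both datasets the same way.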


Second, run the composite join over the two sorted, identically partitioned outputs…
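A hedged sketch of the join job, again with the old mapred API. The join expression is built with CompositeInputFormat.compose; JoinMapper is a placeholder for your own mapper, which receives a Text key and a TupleWritable of matching values:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

// Map-only composite join over two pre-sorted, identically
// partitioned inputs (args[0], args[1]); output goes to args[2].
public class CompositeJoinJob {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CompositeJoinJob.class);
        conf.setJobName("composite-join");
        conf.setInputFormat(CompositeInputFormat.class);
        // "inner" joins only matching keys; "outer" and "override"
        // are the other supported operations.
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
            "inner", KeyValueTextInputFormat.class,
            new Path(args[0]), new Path(args[1])));
        conf.setMapperClass(JoinMapper.class); // placeholder mapper class
        conf.setNumReduceTasks(0);             // map-only: no shuffle needed
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));
        JobClient.runJob(conf);
    }
}
```

Setting the reduce count to zero is the point of the exercise: the join happens entirely in the map phase, reading the matching partition of each input side by side.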


Note: if the two inputs have different numbers of partitions (i.e. part* files), an exception is thrown: java.io.IOException: Inconsistent split cardinality from child 1 (1/2)

The simplest way to use a composite join is to set the reduce count of the preparation jobs to 1, so that each input has exactly one partition, provided the resulting performance is acceptable.

The Source Code for the application
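The original listing did not survive; as a hypothetical reconstruction of the missing mapper (values assumed to be tab-separated text), each map call receives one joined key and a TupleWritable holding the matching value from each input:

```java
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.join.TupleWritable;

// Hypothetical mapper for the composite join job: concatenates the
// value from each joined input into one tab-separated output record.
public class JoinMapper extends MapReduceBase
        implements Mapper<Text, TupleWritable, Text, Text> {
    @Override
    public void map(Text key, TupleWritable value,
                    OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.size(); i++) { // one entry per input
            if (i > 0) sb.append('\t');
            sb.append(value.get(i).toString());
        }
        out.collect(key, new Text(sb.toString()));
    }
}
```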


