Cloud Computing(6)_Processing Relational Data

Source: Internet. Published by: 公司网络服务器搭建. Edited by: 程序博客网. Time: 2024/05/21 21:42

Join Algorithms in MapReduce

Reduce-Side Join
Map-Side Join
Memory-Backed Join

Reduce-Side Join

We map over both datasets and emit the join key as the intermediate key and the tuple itself as the intermediate value. Since MapReduce guarantees that all values with the same key are brought together, all tuples will be grouped by the join key, which is exactly what we need to perform the join operation.

The approach isn’t particularly efficient since it requires shuffling both datasets across the network.
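The flow described above can be sketched in plain Python, without Hadoop. This is only an illustration under assumptions: the function names, the "S"/"T" dataset tags, and the in-memory `shuffle` stand-in are hypothetical and not part of any MapReduce API. Tagging each tuple with its source dataset is one common way to let the reducer tell the two sides apart.

```python
from collections import defaultdict
from itertools import product


def map_phase(dataset_tag, tuples, key_index=0):
    """Emit (join_key, (tag, tuple)) for every tuple of one dataset."""
    for t in tuples:
        yield t[key_index], (dataset_tag, t)


def shuffle(pairs):
    """Stand-in for the MapReduce shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Pair every S tuple with every T tuple that shares this join key."""
    left = [t for tag, t in values if tag == "S"]
    right = [t for tag, t in values if tag == "T"]
    for s, t in product(left, right):
        yield key, s, t


def reduce_side_join(S, T):
    """Run map, shuffle, and reduce over both datasets (hypothetical driver)."""
    pairs = list(map_phase("S", S)) + list(map_phase("T", T))
    out = []
    for key, values in sorted(shuffle(pairs).items()):
        out.extend(reduce_phase(key, values))
    return out
```

Note that every tuple of both datasets passes through `shuffle`, which is the step that corresponds to the network transfer in a real cluster.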

Map-Side Join

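This approach assumes that both datasets are already sorted by the join key and partitioned in the same way. We map over one of the datasets (the larger one) and, inside the mapper, read the corresponding part of the other dataset to perform a merge join. Nothing needs to be shuffled across the network for the join itself.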
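A minimal sketch of the merge join each mapper would perform, assuming both inputs are lists of tuples sorted by the join key (first field) and that partition i of one dataset covers the same key range as partition i of the other. The names `merge_join` and `map_side_join` are illustrative only.

```python
def merge_join(left, right, key=lambda t: t[0]):
    """Join two lists that are already sorted by the join key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Locate the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == kl:
                j_end += 1
            # Pair each left tuple in the matching run with that right run.
            while i < len(left) and key(left[i]) == kl:
                for r in right[j:j_end]:
                    out.append((kl, left[i], r))
                i += 1
            j = j_end
    return out


def map_side_join(large_partitions, small_partitions):
    """Each 'mapper' takes one partition of the larger dataset and reads
    the matching partition of the other dataset (same partition index)."""
    out = []
    for part_id, large_part in enumerate(large_partitions):
        out.extend(merge_join(large_part, small_partitions[part_id]))
    return out
```

Both inputs are scanned once and in order, which is why the sorted, identically partitioned layout is a precondition.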

Memory-Backed Join

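When the smaller dataset fits in memory, we can load it in every mapper, populating an associative array to facilitate random access to tuples based on the join key. The mappers then run over the larger dataset and look up each tuple's join key in that array.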
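As a rough sketch, the associative array can be an ordinary Python dictionary keyed by the join key. The helper names `build_lookup` and `memory_backed_map` are hypothetical, and the example assumes the join key is the first field of each tuple.

```python
from collections import defaultdict


def build_lookup(small, key_index=0):
    """Load the smaller dataset into a dict: join key -> list of tuples."""
    table = defaultdict(list)
    for t in small:
        table[t[key_index]].append(t)
    return table


def memory_backed_map(large_split, lookup, key_index=0):
    """Mapper body: probe the in-memory table for each tuple of the split."""
    for t in large_split:
        for match in lookup.get(t[key_index], []):
            yield t[key_index], t, match
```

For example, with `lookup = build_lookup([(1, "x"), (3, "z")])`, calling `list(memory_backed_map([(1, "a"), (2, "b")], lookup))` yields only the tuple for key 1, because key 2 has no match in the smaller dataset.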

Which Join to use?

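In terms of efficiency: Memory-Backed Join > Map-Side Join > Reduce-Side Join

The faster joins have stricter preconditions. A memory-backed join needs one dataset to fit in memory, a map-side join needs both datasets sorted and partitioned by the join key in the same way, and a reduce-side join works in the general case.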

0 0