Cloud Computing (6): Processing Relational Data

Source: Internet · Editor: 程序博客网 · Time: 2024/05/21 21:42

Join Algorithms in MapReduce

  • Reduce-Side Join
  • Map-Side Join
  • Memory-Backed Join

Reduce-Side Join

We map over both datasets and emit the join key as the intermediate key and the tuple itself as the intermediate value. Since MapReduce guarantees that all values with the same key are brought together, all tuples are grouped by the join key, which is exactly what we need to perform the join operation.

The approach isn’t particularly efficient since it requires shuffling both datasets across the network.
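To make the data flow concrete, here is a minimal single-process sketch of a reduce-side join in plain Python. It only simulates the map, shuffle, and reduce steps and is not Hadoop code; the dataset tags "R" and "S", the function names, and the sample tuples are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product

def map_phase(tuples, tag, key_index=0):
    """Emit (join key, (dataset tag, tuple)) for every input tuple."""
    for t in tuples:
        yield t[key_index], (tag, t)

def shuffle(pairs):
    """Group intermediate values by key, standing in for the MapReduce shuffle."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(values):
    """Pair every R tuple with every S tuple that arrived under the same key."""
    left = [t for tag, t in values if tag == "R"]
    right = [t for tag, t in values if tag == "S"]
    for r, s in product(left, right):
        yield r + s

def reduce_side_join(r_tuples, s_tuples):
    # Both datasets go through the mapper, so both are shuffled.
    pairs = list(map_phase(r_tuples, "R")) + list(map_phase(s_tuples, "S"))
    joined = []
    for _key, values in sorted(shuffle(pairs).items()):
        joined.extend(reduce_phase(values))
    return joined

# Hypothetical sample data: (user_id, name) and (user_id, item).
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
result = reduce_side_join(users, orders)
```

Tagging each value with its dataset of origin is what lets the reducer tell the two sides apart once they have been mixed under a single key.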

Map-Side Join

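We map over one of the datasets (the larger one) and, inside the mapper, read the corresponding part of the other dataset to perform a merge join. This works only when both datasets are sorted by the join key and partitioned in the same way, so that each mapper knows which part of the other dataset holds its matching keys.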
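The merge step can be sketched in plain Python as follows. This is an illustration only, assuming both datasets are already sorted by the join key and split into aligned partitions; the names merge_join and map_side_join and the sample partitions are hypothetical.

```python
def merge_join(left, right, key=lambda t: t[0]):
    """Join two lists that are already sorted by join key, in one pass over each."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the run of right-hand tuples that share this key.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == kl:
                j_end += 1
            # Pair every left-hand tuple with this key against that run.
            while i < len(left) and key(left[i]) == kl:
                for s in right[j:j_end]:
                    out.append(left[i] + s)
                i += 1
            j = j_end
    return out

def map_side_join(left_partitions, right_partitions):
    """Each 'mapper' handles one left partition and reads the matching right partition."""
    joined = []
    for left_part, right_part in zip(left_partitions, right_partitions):
        joined.extend(merge_join(left_part, right_part))
    return joined

# Hypothetical data, range-partitioned identically on user_id (1-2, then 3-4).
user_parts = [[(1, "alice"), (2, "bob")], [(3, "carol"), (4, "dave")]]
order_parts = [[(1, "book"), (1, "pen")], [(3, "lamp")]]
merged = map_side_join(user_parts, order_parts)
```

No shuffle is involved here: each partition pair is joined locally, which is why this approach avoids the network cost of the reduce-side join.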

Memory-Backed Join

We can load the smaller dataset into memory in every mapper, populating an associative array to facilitate random access to tuples based on the join key.
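A rough Python sketch of the idea, assuming the smaller dataset fits in memory: a dict plays the role of the associative array, and the larger dataset is streamed through it one tuple at a time. The function names and sample tuples are hypothetical.

```python
from collections import defaultdict

def build_lookup(small, key_index=0):
    """Load the smaller dataset into an in-memory table keyed by join key."""
    table = defaultdict(list)
    for t in small:
        table[t[key_index]].append(t)
    return table

def memory_backed_map(large_split, lookup, key_index=0):
    """Mapper body: probe the in-memory table for each tuple of the larger dataset."""
    for t in large_split:
        # .get avoids inserting empty entries for keys with no match.
        for match in lookup.get(t[key_index], []):
            yield match + t

# Hypothetical data: the small side is (user_id, name), the large side is (user_id, item).
small_users = [(1, "alice"), (2, "bob"), (3, "carol")]
large_orders = [(1, "book"), (1, "pen"), (3, "lamp"), (5, "mug")]

lookup = build_lookup(small_users)          # done once per mapper, at setup time
joined_in_memory = list(memory_backed_map(large_orders, lookup))
```

Unlike the merge join, this needs no sorting or aligned partitioning of the larger dataset; the price is that every mapper holds a full copy of the smaller one.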

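Which Join to Use?

In order of preference, most efficient first, whenever the preconditions hold:

Memory-Backed Join > Map-Side Join > Reduce-Side Join

The memory-backed join needs the smaller dataset to fit in memory, the map-side join needs both datasets sorted and partitioned by the join key, and the reduce-side join works in general but pays for shuffling both datasets.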
