MapReduce pattern (chapter 3)

Source: Internet. Editor: 程序博客网. Time: 2024/05/03 22:26

A single reducer that receives a lot of data is a problem for several reasons:

1 The sort can become an expensive operation when there are too many records and most of the sorting has to be done on local disk instead of in memory.

2 The host where the reducer is running will receive a lot of data over the network, which may create a network resource hot spot on that single host.

3 Naturally, scanning through the data in the reducer will take a long time if there are many records to look through.

4 Any sort of memory growth in the reducer can exhaust the Java Virtual Machine's memory. For example, if you are collecting all of the values into an ArrayList to compute a median, every value has to be held in memory and that ArrayList can grow very large. This is not a particular problem if you are looking for the top ten items, but if you want to extract a very large number of records, you may run into memory limits.

5 Writes to the output file are not parallelized. Writing to the locally attached disk can be one of the more expensive operations in the reduce phase when we are dealing with a lot of data. Since there is only one reducer, we are not taking advantage of the parallelism involved in writing data to several hosts, or even to several disks on the same host. Again, this is not an issue for the top ten, but it becomes a factor when the extracted data is very large.
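The memory point (item 4) can be illustrated with a minimal sketch in plain Java, without the Hadoop API. The class name ReducerMemorySketch and the methods topN and median are hypothetical names chosen for illustration; they are not from the book. The sketch assumes the reducer's values arrive as an Iterable<Long>. A top-N computed with a min-heap never holds more than N values, while a naive median has to buffer every value before it can sort.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration class; not part of Hadoop or the book.
class ReducerMemorySketch {

    // Top-N with a min-heap: at most n values are retained once each
    // iteration completes, no matter how many values are scanned.
    static List<Long> topN(Iterable<Long> values, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<>(); // smallest value on top
        for (long v : values) {
            heap.offer(v);
            if (heap.size() > n) {
                heap.poll(); // evict the smallest, keeping the n largest
            }
        }
        List<Long> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder()); // largest first
        return result;
    }

    // Naive median: every value is buffered in an ArrayList before sorting,
    // so memory use grows linearly with the number of input values.
    static double median(Iterable<Long> values) {
        List<Long> buffer = new ArrayList<>();
        for (long v : values) {
            buffer.add(v);
        }
        if (buffer.isEmpty()) {
            throw new IllegalArgumentException("no values");
        }
        Collections.sort(buffer);
        int size = buffer.size();
        if (size % 2 == 1) {
            return buffer.get(size / 2);
        }
        return (buffer.get(size / 2 - 1) + buffer.get(size / 2)) / 2.0;
    }
}
```

With a small N such as ten, the heap stays tiny even for a very large input, which is why the single-reducer design is tolerable for top ten but not for large extracts or for computations like the buffered median.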

0 0