On Hadoop's C++ Extension


Original post: http://blog.sina.com.cn/s/blog_6e273ebb0100pid0.html

 

Hadoop has long been criticized for the performance cost of its Java implementation, and many approaches have emerged to mitigate the problem.

Jeff Hammerbacher (Chief Scientist at Cloudera) once wrote the following on Quora:
-------------------------------------------------------------------------------------------------------------------------------------
Doug's newest project, Avro [1], will allow for cross-language serialization and RPC. If you think individual components of Hadoop could be implemented more efficiently in another language, you'll be welcome to try your hand once the migration to Avro for RPC [2] is complete.

In my experience, distributed systems should focus on reliable performance under stress, horizontal scalability, and ease of debugging before optimizing for efficiency. Matt Welsh does a great job of highlighting this issue in his retrospective on SEDA [3]. Sean Quinlan of Google mentions a similar policy at Google, noting that "it's atypical of Google to put a lot of work into tuning any one particular binary." [4] Java has advantages and disadvantages along these dimensions, but I'll leave that for others to discuss.

For HDFS in particular, libhdfs [5] implements a C API to HDFS by communicating with Java over JNI. Using libhdfs and FUSE, one can mount HDFS just like any other file system [6]. Once Avro is in place, the client could be implemented in C and placed in the kernel to make this process even smoother and more efficient. Currently it's not the most pressing issue in Hadoop development.

For Hadoop MapReduce, you can use Hadoop Streaming to write your MapReduce logic in any language, or Hadoop Pipes [7] if you want a C++-specific API. If you can't wait for Avro, there's also the "Hadoop C++ Extension" [8] from Baidu which implements the Task execution environment in Hadoop in C++, and appears to provide moderate performance gains.

[1] http://avro.apache.org
[2] https://issues.apache.org/jira/browse/HADOOP-6659
[3] http://matt-welsh.blogspot.com/2010/07/retrospective-on-seda.html
[4] http://queue.acm.org/detail.cfm?id=1594206
[5] http://hadoop.apache.org/common/docs/current/libhdfs.html
[6] https://wiki.cloudera.com/display/DOC/Mountable+HDFS
[7] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/pipes/package-summary.html
[8] https://issues.apache.org/jira/browse/MAPREDUCE-1270
-------------------------------------------------------------------------------------------------------------------------------------
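
As a side note on [5]: the sketch below shows roughly what using libhdfs looks like from C++; it writes one small file to HDFS and disconnects. The path /tmp/libhdfs-demo.txt is an arbitrary example, and hdfsConnect("default", 0) simply picks up the NameNode address from the local Hadoop configuration; neither comes from the original post.

    // Minimal libhdfs sketch: connect, write one file, clean up.
    // Build flags vary by install, roughly: g++ demo.cc -I$HADOOP_HOME/include -lhdfs
    #include "hdfs.h"     // libhdfs C API (a JNI bridge to the Java HDFS client)
    #include <fcntl.h>    // O_WRONLY, O_CREAT
    #include <cstdio>
    #include <cstring>

    int main() {
        // "default" with port 0 means "use the NameNode from the local Hadoop config".
        hdfsFS fs = hdfsConnect("default", 0);
        if (fs == NULL) {
            std::fprintf(stderr, "failed to connect to HDFS\n");
            return 1;
        }

        const char* path = "/tmp/libhdfs-demo.txt";  // illustrative path
        // Trailing zeros request default buffer size, replication, and block size.
        hdfsFile out = hdfsOpenFile(fs, path, O_WRONLY | O_CREAT, 0, 0, 0);
        if (out == NULL) {
            std::fprintf(stderr, "failed to open %s for writing\n", path);
            hdfsDisconnect(fs);
            return 1;
        }

        const char* msg = "hello from libhdfs\n";
        hdfsWrite(fs, out, (const void*)msg, std::strlen(msg));
        hdfsFlush(fs, out);      // push buffered bytes out to the cluster
        hdfsCloseFile(fs, out);
        hdfsDisconnect(fs);
        return 0;
    }
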


In its own use of Hadoop, Baidu ran into the same Java-induced inefficiency and set out to extend Hadoop itself.

Before that, Baidu had also tried Hadoop Pipes and Hadoop Streaming, but found the following problems (a Pipes sketch follows this list):
- Neither approach gives good control over the memory used by the child JVMs (the per-task JVMs that run the map and reduce tasks); memory management is left entirely to the JVM, and all you can do is cap the heap with -Xmx;
- Both approaches only reach the Mapper and Reducer callbacks, while the sort and shuffle phases, which actually dominate performance, still execute inside the Java-implemented TaskTracker;
- Data flow. In both approaches, data must flow from the TaskTracker out to the Mapper or Reducer and back again, and whether it travels over a pipe or a socket, that movement is hard to avoid; for large-scale data processing the cost is not negligible.
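
For context, this is roughly what a Hadoop Pipes program looks like; it is a condensed version of the word-count example from the Pipes documentation. Note that C++ supplies only the map() and reduce() callbacks, which is exactly the limitation described in the second point above.

    #include <string>
    #include <vector>

    #include "hadoop/Pipes.hh"
    #include "hadoop/TemplateFactory.hh"
    #include "hadoop/StringUtils.hh"

    // Only these two callbacks run in C++; input splitting, sort,
    // shuffle, and merge all remain on the Java side.
    class WordCountMapper : public HadoopPipes::Mapper {
    public:
      WordCountMapper(HadoopPipes::TaskContext& context) {}
      void map(HadoopPipes::MapContext& context) {
        std::vector<std::string> words =
            HadoopUtils::splitString(context.getInputValue(), " ");
        for (size_t i = 0; i < words.size(); ++i) {
          context.emit(words[i], "1");
        }
      }
    };

    class WordCountReducer : public HadoopPipes::Reducer {
    public:
      WordCountReducer(HadoopPipes::TaskContext& context) {}
      void reduce(HadoopPipes::ReduceContext& context) {
        int sum = 0;
        while (context.nextValue()) {
          sum += HadoopUtils::toInt(context.getInputValue());
        }
        context.emit(context.getInputKey(), HadoopUtils::toString(sum));
      }
    };

    int main(int argc, char* argv[]) {
      // Connects back to the Java TaskTracker over a socket and
      // drives the callbacks above.
      return HadoopPipes::runTask(
          HadoopPipes::TemplateFactory<WordCountMapper, WordCountReducer>());
    }
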

At root, the problem is that the C++ module shoulders too little of the logic. Baidu therefore proposed a more thorough solution, the "Hadoop C++ Extension" (HCE), in which C++ reaches much deeper into Hadoop: the data-processing work previously done inside the TaskTracker is handed to a C++ module, leaving the TaskTracker responsible only for protocol communication and control. With that, each of the problems above is resolved:
- The TaskTracker JVM does only a small amount of communication, so its memory needs are small and predictable, and therefore easy to control; setting -Xmx100m, for example, is enough;
- The sort and shuffle phases are implemented in the C++ module, improving performance;
- Data stays inside the C++ module for its entire lifetime, avoiding unnecessary movement.
This amounts to pushing the C++ front line forward. To many, admittedly, that may look like only a difference of degree, fifty paces versus a hundred, but those extra fifty paces are precisely where the performance bottleneck lies.
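
The real HCE code lives in the MAPREDUCE-1270 patch and its API is not reproduced here. Purely as an illustration of what moves into native code, the hypothetical sketch below buffers, partitions, and sorts map output entirely in C++ memory, the kind of work that under Pipes or Streaming would still happen in the JVM; every name in it is made up for this example.

    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical stand-in for an HCE-style map-output buffer: records are
    // collected, partitioned, and sorted in native memory, so nothing has to
    // round-trip through the TaskTracker JVM before the spill.
    class NativeMapOutputBuffer {
    public:
        explicit NativeMapOutputBuffer(size_t partitions) : buckets_(partitions) {}

        // Route each record to a reduce partition by key hash, as the
        // default Java HashPartitioner would.
        void collect(const std::string& key, const std::string& value) {
            size_t p = std::hash<std::string>{}(key) % buckets_.size();
            buckets_[p].emplace_back(key, value);
        }

        // Sort each partition by key and "spill" it (here: print it).
        void sortAndSpill() {
            for (size_t p = 0; p < buckets_.size(); ++p) {
                std::sort(buckets_[p].begin(), buckets_[p].end());
                for (const auto& kv : buckets_[p]) {
                    std::printf("partition %zu: %s\t%s\n",
                                p, kv.first.c_str(), kv.second.c_str());
                }
            }
        }

    private:
        std::vector<std::vector<std::pair<std::string, std::string>>> buckets_;
    };

    int main() {
        NativeMapOutputBuffer buf(2);  // two reduce partitions, for illustration
        buf.collect("hadoop", "1");
        buf.collect("c++", "1");
        buf.collect("hadoop", "1");
        buf.sortAndSpill();
        return 0;
    }
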