Repost: Hadoop should be implemented in C++, not Java


http://www.trendcaller.com/2009/05/hadoop-should-target-cllvm-not-java.html

 

Sunday, May 10, 2009

Hadoop should target C++/LLVM, not Java (because of watts)

Over the years, there have been many contentious arguments about the performance of C++ versus Java. Oddly, every one I found addressed only one kind of performance (work/time). I can't find any benchmarking of something at least as important in today's massive-scale-computing environments: work/watt. A dirty little secret about JIT technologies like Java is that they throw a lot more CPU resources at the problem, trying to get up to par with native C++ code. JITs use more memory, and periodically run background optimizer tasks. These overheads are somewhat offset in work/time performance by extra optimizations which can be performed with more dynamic information. But it results in a hungrier appetite for watts. Another dirty little secret about Java vs. C++ benchmarks is that they compare single workloads. Try running 100 VMs, each with a Java and a C++ benchmark in it, and Java's hungrier appetite for resources (MHz, cache, RAM) will show. But of course, Java folks don't mention that.

But let's say, for the sake of (non-)argument, that Java can achieve 1:1 work/time performance relative to C++ for a single program. If Java consumes 15% more power doing it, does it matter on a PC? Most people don't care. Does it matter for small-scale server environments? Maybe not. Does it matter when you deploy Hadoop on a 10,000-node cluster, and the holistic inefficiency (multiple things running concurrently) goes to 30%? Ask the people who sign the checks for the power bill. Unfortunately, inefficiency scales really well.
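To put rough, purely illustrative numbers on that (these are assumptions, not measurements): suppose each node draws about 200 W under load and the holistic overhead really is 30%. Then 10,000 nodes × 200 W × 0.30 = 600 kW of waste; over a year (× 8,760 hours) that is roughly 5.3 GWh, and at $0.10/kWh it is on the order of half a million dollars annually, before counting the extra cooling.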

Btw, Google's MapReduce framework is C++ based. So is Hypertable, the clone of Google's Bigtable distributed data storage system. The rationale for choosing C++ for Hypertable is explained here. I realize that Java's appeal is the write-once, run-anywhere philosophy, as well as all the class libraries that come with it. But there's another way to get at portability. And that's to compile from C/C++/Python/etc. to LLVM intermediate representation, which can then be optimized for whatever platform comprises each node in the cluster. A bonus in using LLVM as the representation to distribute to nodes is that OpenCL can also be compiled to LLVM. This retains a nice GPGPU abstraction across heterogeneous nodes (including those with GPGPU-like processing capabilities), without the Java overhead.
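A rough sketch of what that flow could look like with the standard clang/opt/llc tools (the kernel, file names, and flags here are illustrative assumptions, not anything from the post):

    // scale.cpp -- toy per-record kernel used to illustrate the
    // "compile once to LLVM IR, specialize per node" idea.
    //
    // Compile once, on the build machine, to portable LLVM bitcode:
    //   clang++ -O2 -emit-llvm -c scale.cpp -o scale.bc
    //
    // Then, on each (possibly heterogeneous) node, lower the same
    // bitcode to native code tuned for that node's CPU:
    //   opt -O3 scale.bc -o scale.opt.bc
    //   llc -O3 -mcpu=native scale.opt.bc -o scale.s
    //   clang++ -c scale.s -o scale.o
    #include <cstddef>

    // Scale every value in a buffer; trivial, but the per-node step
    // can tune it to that node's instruction set (e.g. SSE vs. AVX).
    void scale(double* values, std::size_t n, double factor) {
        for (std::size_t i = 0; i < n; ++i) {
            values[i] *= factor;
        }
    }

The point is only that the expensive, portable compilation happens once, while the machine-specific tuning happens where the hardware is actually known.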

Now I don't have a problem with Java being one of the workloads that can be run on each Hadoop node (even scripting languages have their time and place). But I believe Hadoop's Java infrastructure will prove to be a competitive disadvantage, and will provoke a massive amount of wasted watts. "Write once, waste everywhere..." In the way that Intel tends to retain a process advantage over other CPU vendors, I believe Google will retain a power advantage over others with their MapReduce (and well, their servers are well-tuned too).

Disclosure: no positions

2 comments:

srowen said...

(I thought this was intriguing enough, and was proud enough of my reply on this to the mahout-dev list, that I will favor you with a cross-post here!)

The difference in power consumption between a fully loaded machine and
idle isn't so large (the figure 50% sticks in my head?), but the
difference between a fully loaded and half-loaded machine is quite
small. That is, if the hard disk is up, processor is at full speed,
all memory is fully powered, then using all or most is not a big deal.
Power consumption drops only if you are really idle.

I don't have numbers to back this up at my fingertips, though they're
informed by figures I've seen in the past. I think that's what one
would need to evaluate this argument, and I have a different intuition
about how much this could matter.

The main argument here seems to be, basically, that Java competes well
in wall-time performance by better parallelism and more memory usage.
Maybe; that's an interesting question. Is LLVM going to be more
efficient than Java? Unclear; both have overhead, I suppose. But
again, an interesting question.


But, the topic really does matter. Wasting time means wasting energy,
and when we get to distributed cluster scale, it matters to the
environment. At Google they do a good job of keeping teams really
clear about how much their operations are costing -- it is staggering
sometimes. Developers who might run a big job, oops, see it fail,
start it up again, oops, wrong argument again... might think twice
when they realize how many pounds of CO2 their mistake just pumped into
the atmosphere.

(Mahout folks will now appreciate why I have been messing with the
code all over to try to micro-optimize for performance. I think there
is still not enough attention given to efficiency yet, but hey it's at
0.1.)


And, I think I agree with the conclusion of the blog post for a
different reason:

The Java/C++ performance gap for most apps is pretty negligible these
days. Why? I actually think given a fixed amount of *developer* time,
one can make a faster Java app than a C++ app. Why? I can develop
faster, against a larger and more stable collection of libraries,
spend less time debugging, leaving more time to optimize the result.

But that does hit a certain plateau. Given enough developer time, I
can get native code to run faster than even JITted Java. I myself am
hard-pressed to optimize my code (Mahout - Taste) further in Java
without drastic measures.

It may take a lot of time to actually beat Java performance in C++,
but, as the scale of your operations grows, the return on that 1%
improvement you eke out grows. And of course -- when we talk about
code headed for Hadoop, we are definitely talking about large-scale
operations.

For reference, of course, Google operates at such a scale that they
use a C++-based MapReduce framework. It is just almost always
worthwhile to spend the time to beat Java performance.

This isn't going to be true of all users of distributed computing
frameworks, so it's not inherently wrong that Hadoop is in Java, but,
I did find myself saying "hmm, Java?" the first time I heard of
Hadoop.


But isn't this what this whole Hadoop Streaming business is about?
Letting you farm out the computation itself to whatever native process
you like and just using Hadoop for the management? Because that, of
course, is fine.
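For what it's worth, a minimal native streaming worker is tiny. Here is a sketch of a word-count mapper in C++ (the job name, paths, and jar location are illustrative and vary by Hadoop version):

    // wc_map.cpp -- sketch of a native Hadoop Streaming mapper.
    // Streaming mappers read input records on stdin and write
    // tab-separated key/value pairs on stdout; Hadoop handles the
    // splitting, shuffle, and sort.
    //
    // Illustrative invocation (the streaming jar path differs
    // between Hadoop versions):
    //   hadoop jar hadoop-streaming.jar \
    //     -input /data/in -output /data/out \
    //     -mapper ./wc_map -reducer ./wc_reduce \
    //     -file wc_map -file wc_reduce
    #include <iostream>
    #include <sstream>
    #include <string>

    int main() {
        std::string line;
        while (std::getline(std::cin, line)) {
            std::istringstream tokens(line);
            std::string word;
            while (tokens >> word) {
                // Emit (word, 1); the reducer sums the counts per word.
                std::cout << word << "\t1\n";
            }
        }
        return 0;
    }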

rgomes1997 said...

Hi,

I won't comment on your concerns about power consumption, but I'd like to contribute some ideas.

1. If you consider RTSJ (the Real-Time Specification for Java) you could use ITC (Initialization Time Compilation) instead of JIT. RTSJ can speed up your Java application too if you use "Soft Real-Time Threads", which are not difficult to implement and can keep the GC from managing memory you can easily manage yourself (Scoped Memory).

These links may be of your interest:

* http://java.sun.com/javase/technologies/realtime/reference/doc_2.1/release/JavaRTSCompilation.html

* http://www.rtsj.org/specjavadoc/book_index.html

2. IBM has a very interesting research project called X10, which generates code in Java and/or C++ as output. The input language is something based on Scala (see release 1.7.x).
You could use it to write once, run everywhere, whether you have a JVM or just your native OS.

A very interesting improvement over Scala is that X10 does not use MPI but PGAS, which has benefits similar to STM but provides maximum performance for local data.

IBM X10 Language
* http://www.x10-lang.org/
* http://dist.codehaus.org/x10/documentation/languagespec/x10-173.pdf

STM (Software Transactional Memory)
* http://en.wikipedia.org/wiki/Software_transactional_memory

PGAS (Partitioned Global Address Space)
* http://en.wikipedia.org/wiki/Partitioned_global_address_space

Regards

Richard Gomes
http://www.jquantlib.org/index.php/User:RichardGomes