Loop, data, and MapReduce

Copyright (c) prototype, all rights reserved.
Free reposting is welcome, provided the original content (including author information) is not altered.

There is a quite old story about a job interview. I forgot who originally wrote or told it, but I still clearly remember the story itself even after many years. In the story, the interviewee, who was seeking a programmer job, was asked a "basic" interview question: "Tell me, what kind of programs are you good at writing?" The guy pondered for a few seconds and then answered: "I am good at writing loops..."

The loop (including recursion, in the general sense) is a fundamental structure in computer programs. No matter what problem you try to solve by writing a computer program in whatever general-purpose language, you can hardly avoid writing loops! Unless... well, let me take a small step back: unless you are one of the few lucky or poor guys who write code at a very high level or in a strange language (e.g., Makefile :-).

It is an interesting question to ask: why are loops so ubiquitous in computer programs? What I came up with is that, first of all, iterations are observed in all kinds of phenomena in the real world that programs try to model. Every day, the Sun rises and sets, and we do the same things again and again... The Earth rotates on its axis and revolves around the Sun, the solar system rotates, the Milky Way rotates on its axis (I don't know whether it revolves around something else)... Why does the whole universe have to rotate? That is a quite profound question that came to my mind. I don't have an answer; if you do, let me know. Anyway, it is hard to think of a better way to model all of this in programs than loop structures. The second reason is perhaps rooted in the gap between human beings and machines. Machines do not understand; they just carry out instructions. Therefore, to get a machine to work, the human's idea must be translated into machine instructions, and this translation can hardly be performed without the concept of iteration. If you have ever learned such a thing as "algorithms", you know what I am talking about.

What on earth can loops do, then? I think there are two fundamental things they typically do. One is to process a set of data (PASD). For example, in linear algebra, when we add two vectors, we can go through one vector element by element, adding each element to the corresponding one from the other vector. In code, it is simply like this:

for (int i = 0; i < vector_size; i++) {
    vector1[i] += vector2[i];
}

The other is to integrate functions (IF). For example, to calculate the integral of a given function numerically, we could write a loop like this:

float x = start;
float dx = step;
float integral = 0.0;
while (x <= end) {
    integral += func(x, integral) * dx;
    x += dx;
}

Note that I deliberately wrote func in such a way that its current value depends not only on x but also on integral. This is an important point, because it implies a fundamental difference between PASD and IF: in the former, each cycle is independent of all the others, whereas in the latter, each cycle depends on at least the previous one.

After realizing that in PASD each cycle is actually an independent calculation, we can answer this question: do we have to use a loop to do PASD? No, provided that we can have as many processors as data items, so that each processor deals with just one datum. Now you see where I am heading. Yes, PASD is parallelizable.

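To make this concrete, here is a minimal sketch of the vector addition above parallelized with OpenMP. This is just one way to exploit the independence; the function name add_vectors is my own, and you would compile with -fopenmp on GCC or Clang:

void add_vectors(float *vector1, const float *vector2, int vector_size) {
    /* Each iteration touches a distinct index i, so OpenMP can
       safely split the iterations across processors. */
    #pragma omp parallel for
    for (int i = 0; i < vector_size; i++) {
        vector1[i] += vector2[i];
    }
}
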
Google's MapReduce (MR) is a product of this kind of thinking. The computing model of MR is very simple yet quite general: first, transform the raw data into a useful form (mapping); then compute on the basis of the transformed data to generate the final result (reducing). Mapping is exactly PASD and can be implemented in parallel. MR's main selling point is basically a (good) implementation of this parallelism.

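To illustrate the model itself (this is only a toy sketch of the idea, not Google's actual API; map_square and reduce_sum are my own illustrative names), mapping here transforms each datum independently, and reducing folds the mapped data into the final result:

#include <stdio.h>

static int map_square(int x)          { return x * x; }
static int reduce_sum(int acc, int y) { return acc + y; }

int main(void) {
    int data[] = {1, 2, 3, 4};
    int n = sizeof data / sizeof data[0];
    /* Mapping: PASD, each iteration is independent, so it is
       parallelizable. */
    for (int i = 0; i < n; i++)
        data[i] = map_square(data[i]);
    /* Reducing: fold the mapped data into one value. */
    int result = 0;
    for (int i = 0; i < n; i++)
        result = reduce_sum(result, data[i]);
    printf("%d\n", result); /* 1+4+9+16 = 30 */
    return 0;
}
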
Let me examine the limitations of the MR computing model: (1) it is a simple computing model and does not fit every problem; (2) it may not pay off when the data set is small, because the overheads of parallelism become comparatively large; (3) it may not pay off when the calculation in PASD is trivial, so that the mapping step is not the rate-limiting one.

The most important lesson I learned from this thought journey is: weight shifting! Shift the weight from whatever other parts of your code to the data-processing part, since the latter is (more) parallelizable. One important way to do this shifting is to transform the data from a complicated form into a simpler one.

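As a sketch of what such shifting can look like, suppose the expensive part of func depends only on x; call that part g. (This decomposition is my own assumption for illustration, as are the names g, integrate_shifted, and n.) The heavy evaluations can then be pulled out into an independent PASD loop, leaving only a trivial sequential accumulation:

#include <stdlib.h>

extern float g(float x); /* assumed: the expensive, x-only part of func */

float integrate_shifted(float start, float dx, int n) {
    float *samples = malloc(n * sizeof *samples);
    if (samples == NULL)
        return 0.0f;
    /* The heavy work is now PASD: each iteration is independent
       and could be parallelized like the vector addition above. */
    for (int i = 0; i < n; i++)
        samples[i] = g(start + i * dx);
    /* What remains of the dependent cycle is trivial. */
    float integral = 0.0f;
    for (int i = 0; i < n; i++)
        integral += samples[i] * dx;
    free(samples);
    return integral;
}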