MapReduce: Simplified Data Processing on Large Clusters (Chinese Translation, Part 1)

Source: Internet  Editor: 程序博客网  Time: 2024/05/01 21:40
Abstract
               MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.
               Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

               Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.


Simplified Data Processing on Large Clusters

 
 
Jeffrey Dean and Sanjay Ghemawat  jeff@google.com, sanjay@google.com       Google, Inc.

Abstract (translation)


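MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function, which processes a key/value pair and produces a set of intermediate key/value pairs, and a reduce function, which merges all intermediate values that share the same intermediate key. As the paper shows, many real-world tasks can be expressed in this model.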


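Programs written in this style are automatically parallelized and executed on a cluster of commodity machines. The run-time system takes care of the details: partitioning the input data, scheduling the program's execution across the machines, handling machine failures, and managing the required inter-machine communication. This lets programmers with no experience in parallel and distributed systems easily use the resources of a large distributed system.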


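Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented, and more than one thousand MapReduce jobs run on Google's clusters every day.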
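
To make the programming model in the abstract concrete, here is a minimal single-process sketch in Python. It is not Google's implementation: the names run_mapreduce, map_word_count and reduce_word_count are hypothetical, word counting is chosen only as an illustrative task, and the parallelization, scheduling and failure handling that the run-time system provides are deliberately left out.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_word_count(key: str, value: str) -> Iterator[tuple[str, int]]:
    """Map: take one (document name, contents) pair and emit (word, 1) pairs."""
    for word in value.split():
        yield word.lower(), 1


def reduce_word_count(key: str, values: Iterable[int]) -> int:
    """Reduce: merge all intermediate values that share the same key."""
    return sum(values)


def run_mapreduce(
    inputs: Iterable[tuple[str, str]],
    map_fn: Callable[[str, str], Iterable[tuple[str, int]]],
    reduce_fn: Callable[[str, Iterable[int]], int],
) -> dict[str, int]:
    """Hypothetical sequential driver: map, group by intermediate key, reduce."""
    intermediate: dict[str, list[int]] = defaultdict(list)
    # Map phase: every input pair produces a set of intermediate pairs.
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    # Reduce phase: one call per distinct intermediate key.
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}


docs = [
    ("doc1", "the quick brown fox"),
    ("doc2", "the lazy dog and the fox"),
]
counts = run_mapreduce(docs, map_word_count, reduce_word_count)
print(counts)
```

In this sketch the user writes only the two small functions; the grouping by intermediate key is done by the driver. In a real cluster setting that driver is the part the run-time system would replace with partitioned, fault-tolerant execution across machines.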

