用Hadoop进行MapReduce计算

来源：互联网发布：北京软件人员工资水平编辑：程序博客网时间：2024/05/24 05:05

用Hadoop进行MapReduce计算

这学期选了《大数据科学导论》这门课，上课没怎么听。最后为了完成大作业，自己了解了些Hadoop的知识，觉得挺有意思。

Hadoop是一个著名的分布式计算框架，主体由Java实现，包括一个分布式文件系统HDFS和一个MapReduce计算模型。MapReduce在大数据领域是一个经典且通用的计算模型，许多领域的许多问题都能用它来解决，比如最经典的对海量文本的单词统计。

Hadoop的部署相对简单（我是在单机上安装），Arch Linux的AUR里有人打了包，直接安装即可。相关配置可以参考Hadoop快速入门。

部署完成后，就可以编写程序进行计算，Hadoop的local语言是Java，但也可以使用C++、Shell、Python、 Ruby、PHP、Perl等语言来编写Mapper和Reducer，这就为开发者提供了更多选择。Hadoop为使用其它语言的开发者提供了一个streaming工具包（hadoop-streaming-.jar），用其他语言编写Mapper和Reducer时，可以直接使用Unix标准输入输出作为数据接口。详细参见Hadoop Streaming原理及实践。为了快速完成作业，我选择了Python来进行开发。

MapReduce的过程主要分为Map-Shuffle-Reduce三个步骤（下面引自Wikipedia）：

“Map” step: Each worker node applies the “map()” function to the local data, and writes the output to a temporary storage. A master node orchestrates that for redundant copies of input data, only one is processed.

“Shuffle” step: Worker nodes redistribute data based on the output keys (produced by the “map()” function), such that all data belonging to one key is located on the same worker node.

“Reduce” step: Worker nodes now process each group of output data, per key, in parallel.

开发者通常需要编写Mapper和Reducer模块，Shuffle操作由Hadoop完成。Mapper接收输入的数据（通常为<key, value>键值对，也可以是其他形式），处理之后生成中间结果（<key, value>键值对，在标准输出里以tab分隔），Hadoop对中间结果进行sort（可能还有combine过程），然后送到各个Reducer，这里保证相同key的键值对被送到同一个Reducer。接下来Reducer对送来的键值对进行处理，最后输出结果。

下面来看一个例题。有许多文本文件，提取出其中的每个单词并生成倒排索引（inverted index）。
跟word count差不多的思路，Mapper扫描对应的文件提取出单词，生成<word, doc>键值对，然后Reducer合并键值对，对每一个word生成一个到排索引。代码如下：

import sysimport json# input comes from STDIN (standard input)for line in sys.stdin:    # remove leading and trailing whitespace    line = line.strip()    # parse the line with json method    record = json.loads(line)    key = record[0];    value = record[1];    # split the line into words    words = value.split()    for word in words:        # write the results to STDOUT (standard output);        # what we output here will be the input for the        # Reduce step, i.e. the input for reducer.py        #        # tab-delimited;        print('%s\t%s' % (word, key))

import sys# maps words to their documentsword2set = {}# input comes from STDINfor line in sys.stdin:    # remove leading and trailing whitespace    line = line.strip()    # parse the input we got from mapper.py    word, doc = line.split('\t', 1)    word2set.setdefault(word,set()).add(doc)# write the results to STDOUT (standard output)for word in word2set:    print('%-16s%s' % (word, word2set[word]))

MapReduce的本质是一种分治思想，将巨大规模的数据划分为若干块，各个块可以并行处理，通过Map生成若干键值对，Sort & Combine得到关于键的若干等价类，类与类之间没有关联，Reduce对每个类分别进行处理，每个类得到一个独立的结果，最后将每个类的结果合并成为最终结果。许多简单的问题，将它放到MapReduce模型里却充满挑战，而一旦将其成功套入模型，就能借助并行计算的力量解决数据规模巨大的难题。

0 0