Hadoop Basic Concepts and Programming Model

Source: Internet. Editor: 程序博客网. Date: 2024/06/06 01:20

1. Hadoop Architecture

Machine level: one master node and many cluster nodes.

Process level: a JobTracker runs on the master node, and a TaskTracker runs on each cluster node.

Unit of work: the MapReduce job.

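2. Usage

Configure the parameters of the MapReduce job, such as the input and output paths and the implementations of the map and reduce functions, then submit the job to the JobTracker on the master node. The JobTracker copies the software and configuration to the cluster nodes through their TaskTrackers, then schedules the tasks, monitors them, and reports status information.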

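3. Definition of a MapReduce Job

A MapReduce job consists mainly of three phases: Map, Combine, and Reduce. The parameters we need to configure revolve around these phases: the input path, the input data format (used to identify keys and values), the output path, the output data format (again used to identify keys and values), and the input and output key and value types of each of the Map, Combine, and Reduce phases.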

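Data flow: raw data (key, value) -> Map -> new (key, value) pairs -> sort and merge into (key, iterable values) -> Combine -> new (key, value) pairs -> sort and merge into (key, iterable values) -> Reduce -> new (key, value) pairs.

Combine is an optional, local phase: it runs on the same machine as the Map phase whose output it processes.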

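The file system underneath a MapReduce job is HDFS. When we set the job's input path, for example, that path may contain N files spread across the whole distributed system. Likewise, the files produced under the output path are also spread across the distributed system.

Viewed abstractly, however, we can treat the whole set of input files as the input and the whole set of output files as the output. Map-Combine-Reduce then behaves as if it ran entirely on "one" machine.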
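The data flow described in this section can be sketched in plain Python. This is only an in-memory simulation under simplifying assumptions, not Hadoop's API: the function names (`group_by_key`, `map_fn`, `sum_fn`, `run_map_task`, `run_job`) and the sample input are made up for illustration, and word counting is used because the same summing function can serve as both the Combine and the Reduce step.

```python
from collections import defaultdict
from itertools import chain


def group_by_key(pairs):
    """Sort (key, value) pairs and merge them into (key, [values]) groups."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def map_fn(offset, line):
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.split():
        yield word, 1


def sum_fn(word, counts):
    """Used as both Combine and Reduce: add up the counts for one word."""
    yield word, sum(counts)


def run_map_task(split):
    """One map task: Map, then a local Combine on the same 'machine'."""
    mapped = chain.from_iterable(map_fn(k, v) for k, v in split)
    combined = chain.from_iterable(
        sum_fn(k, vs) for k, vs in group_by_key(mapped)
    )
    return list(combined)


def run_job(splits):
    """Run every map task, then sort/merge all their outputs and Reduce."""
    intermediate = chain.from_iterable(run_map_task(s) for s in splits)
    reduced = chain.from_iterable(
        sum_fn(k, vs) for k, vs in group_by_key(intermediate)
    )
    return dict(reduced)


# Two input splits, as if the input path held two files on different nodes.
splits = [
    [(0, "hello hadoop"), (13, "hello world")],
    [(0, "hadoop map reduce")],
]
result = run_job(splits)
print(result)
```

Note that the caller of `run_job` only sees a collection of inputs going in and a collection of results coming out, which is the "one machine" view described above.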

4. Mapper

1) Mapper maps input key/value pairs to a set of intermediate key/value pairs.

2) Main method: map(WritableComparable, Writable, OutputCollector, Reporter)

3) Output collection method: OutputCollector.collect(WritableComparable, Writable)

4) How many Mappers? The job's InputFormat splits the input into a set of InputSplits based on the configured block size and the sizes of the input files, and one Mapper task is assigned to each InputSplit.

[The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job]
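As a rough illustration of point 4), the number of map tasks can be estimated by cutting each input file into block-sized pieces. The function name `count_map_tasks` is hypothetical, and the sketch assumes that the split size equals the block size and that a split never spans two files. A real InputFormat may apply further settings, so treat the result as an estimate only.

```python
def count_map_tasks(file_sizes, block_size):
    """Estimate map tasks: each file is cut into block-sized InputSplits."""
    total = 0
    for size in file_sizes:
        if size > 0:  # assumption: empty files produce no split
            # Integer ceiling division: a partial last block is still a split.
            total += (size + block_size - 1) // block_size
    return total


MB = 1024 ** 2
TB = 1024 ** 4

# One 10 TB input with 128 MB blocks.
print(count_map_tasks([10 * TB], 128 * MB))  # 81920

# Two files of 300 MB and 50 MB: 3 splits + 1 split.
print(count_map_tasks([300 * MB, 50 * MB], 128 * MB))  # 4
```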

5. Reducer

1) Reducer reduces a set of intermediate values which share a key to a smaller set of values.

2) Main method: reduce(WritableComparable, Iterator, OutputCollector, Reporter)

3) Output collection method: OutputCollector.collect(WritableComparable, Writable)

4) How many Reducers? The right number of reduces seems to be 0.95 or 1.75 multiplied by (<number of nodes> * mapred.tasktracker.reduce.tasks.maximum). It can also be set to 0, in which case the map outputs are written directly to the output path.
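The rule of thumb in point 4) is simple arithmetic, sketched below. The function name `suggested_reduces` and the cluster size are hypothetical. Integer arithmetic is used so that the factors 0.95 and 1.75 do not introduce floating-point rounding, and fractional results are rounded down, which is a choice made for this sketch.

```python
def suggested_reduces(num_nodes, reduce_slots_per_node):
    """Return the two suggested reduce counts for a cluster.

    reduce_slots_per_node plays the role of
    mapred.tasktracker.reduce.tasks.maximum.
    """
    total_slots = num_nodes * reduce_slots_per_node
    single_wave = total_slots * 95 // 100   # factor 0.95: all reduces start at once
    two_waves = total_slots * 175 // 100    # factor 1.75: fast nodes run a second wave
    return single_wave, two_waves


# Hypothetical cluster: 10 nodes with 2 reduce slots each.
print(suggested_reduces(10, 2))  # (19, 35)
```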

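6. Processes

The TaskTracker on each cluster node starts a separate child process (a JVM) for each Mapper or Reducer task assigned to it. If JVM reuse is enabled (mapred.job.reuse.jvm.num.tasks), a single JVM can run several tasks of the same job one after another instead of a new process being started for every task.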

[The TaskTracker executes the Mapper/ Reducer task as a child process in a separate jvm]

Multithreading inside a single task process can also be enabled through configuration; see reference [1] for details.

7. Additional Notes

[1] Type requirements for keys and values

The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
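To make the two requirements concrete, here is a plain-Python analogue. It does not use Hadoop's Java interfaces: the class `YearTempKey` and its fields are hypothetical. `write` and `read_fields` stand in for the serialization role of Writable, and `__lt__` stands in for the comparison that WritableComparable adds so that keys can be sorted.

```python
import struct
from functools import total_ordering


@total_ordering
class YearTempKey:
    """Hypothetical composite key: (year, temperature)."""

    _FORMAT = ">ii"  # two big-endian 32-bit signed integers

    def __init__(self, year=0, temp=0):
        self.year = year
        self.temp = temp

    def write(self):
        """Serialize the key to bytes (the serialization half of Writable)."""
        return struct.pack(self._FORMAT, self.year, self.temp)

    @classmethod
    def read_fields(cls, data):
        """Rebuild a key from bytes (the deserialization half of Writable)."""
        return cls(*struct.unpack(cls._FORMAT, data))

    def _as_tuple(self):
        return (self.year, self.temp)

    def __eq__(self, other):
        return self._as_tuple() == other._as_tuple()

    def __lt__(self, other):
        """Ordering the framework would rely on when sorting keys."""
        return self._as_tuple() < other._as_tuple()

    def __hash__(self):
        return hash(self._as_tuple())


keys = [YearTempKey(1950, 22), YearTempKey(1949, 78), YearTempKey(1949, 11)]
round_tripped = [YearTempKey.read_fields(k.write()) for k in keys]
sorted_keys = sorted(round_tripped)
print([(k.year, k.temp) for k in sorted_keys])
```

A value class would only need the `write` and `read_fields` pair, since values are serialized but not sorted.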

[2] The intermediate output of a Mapper task is stored as files; the records passed to OutputCollector's collect method end up in these files.

The output from the Map function is stored in Temporary Intermediate Files. These files are handled transparently by Hadoop, so in a normal scenario, the programmer doesn't have access to that. If you're curious about what's happening inside each mapper, you can review the logs for the respective job where you'll find a log file for each map task.

If you want to control where the temporary files are generated, and have access to them, you have to create your own OutputCollector class, and I don't know how easy that is.

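8. A Simplified View of the Programming Process

You define a Job instance and set its input path, Map class (containing the map function implementation), Reduce class (containing the reduce function implementation), and output path, then submit the instance to the JobTracker, which takes care of all the remaining details.
Seen from outside, the whole Hadoop MapReduce cluster behaves like a single host: once we define a Job instance and submit it, the cluster reads the files under the input path, runs them through the map and reduce phases in turn, and writes the results under the output path.
This logical abstraction greatly simplifies programming.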
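The "single host" view can be sketched as one function that takes an input directory, a map function, a reduce function, and an output directory. This is a local, single-process stand-in rather than a real job submission: `run_local_job`, `word_map`, and `word_reduce` are hypothetical names, and the sketch borrows the `part-00000` file name and tab-separated key/value lines that Hadoop's default text output uses.

```python
import os
import tempfile
from collections import defaultdict


def run_local_job(input_dir, output_dir, map_fn, reduce_fn):
    """Read every file under input_dir, run map then reduce, write output_dir."""
    groups = defaultdict(list)
    for name in sorted(os.listdir(input_dir)):
        with open(os.path.join(input_dir, name), encoding="utf-8") as f:
            for line_no, line in enumerate(f):
                for key, value in map_fn(line_no, line.rstrip("\n")):
                    groups[key].append(value)

    os.makedirs(output_dir, exist_ok=True)
    out_path = os.path.join(output_dir, "part-00000")
    with open(out_path, "w", encoding="utf-8") as out:
        for key in sorted(groups):  # keys reach reduce in sorted order
            for out_key, out_value in reduce_fn(key, groups[key]):
                out.write(f"{out_key}\t{out_value}\n")
    return out_path


def word_map(_line_no, line):
    for word in line.split():
        yield word, 1


def word_reduce(word, counts):
    yield word, sum(counts)


# Demo with two small input files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    in_dir = os.path.join(tmp, "input")
    os.makedirs(in_dir)
    for name, text in [("a.txt", "map reduce\n"), ("b.txt", "map map\n")]:
        with open(os.path.join(in_dir, name), "w", encoding="utf-8") as f:
            f.write(text)
    out_file = run_local_job(in_dir, os.path.join(tmp, "output"),
                             word_map, word_reduce)
    with open(out_file, encoding="utf-8") as f:
        output_text = f.read()

print(output_text)
```

The caller never sees how the work is divided: it only names an input location, the two functions, and an output location.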

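Here, "outside" refers to the external application that calls the JobTracker daemon. It usually runs on the same machine as the JobTracker daemon, that is, the master node, but it can also run on another node and call the JobTracker daemon on the master node remotely.

This external application can call not only the JobTracker daemon but also the local disk, as well as the NameNode daemon of HDFS (typically the HDFS NameNode and the Hadoop MapReduce master node are the same machine).

The figure below illustrates the relationships described above.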

References:

[1]http://kickstarthadoop.blogspot.com/2012/02/enable-multiple-threads-in-mapper-aka.html

Translated from: https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.pdf

Mirror link: http://download.csdn.net/detail/dslztx/8679695

0 0