HadoopMapReduce源码解析
来源:互联网 发布:如何关闭阿里云 编辑:程序博客网 时间:2024/06/06 01:20
Hadoop是一个大数据处理平台,目前在大数据领域应用也非常广泛,刚好最近我们BI组在进行把底层数据仓库迁移到Hadoop平台并且当前BI的数据平台已经深度依赖Hadoop平台,所以在工作之余开始去深入了解下Hadoop内部实现以更好地应用它,在遇到问题的时候有更好的解决思路。本文详细介绍了Hadoop领域中分布式离线计算框架MapReduce的原理及源码分析。
1. MapReduce概述
Map/Reduce是一个用于大规模数据处理的分布式计算模型,它最初是由Google工程师设计并实现的,Google已经将它完整的MapReduce论文公开发布了。其中对它的定义是,Map/Reduce是一个编程模型(programmingmodel),是一个用于处理和生成大规模数据集(processing and generating large data sets)的相关的实现。用户定义一个map函数来处理一个key/value对以生成一批中间的key/value对,再定义一个reduce函数将所有这些中间的有着相同key的values合并起来。很多现实世界中的任务都可用这个模型来表达。
2. MapReduce体系结构
图1 MapReduce体系结构
2.1 角色
(1)JobClient:每一个job都会在用户端通过JobClient类将应用程序以及配置参数打包成jar文件存储在HDFS,并把路径提交到JobTracker,然后由JobTracker创建每一个Task(即MapTask和ReduceTask)并将它们分发到各个TaskTracker服务中去执行。
(2)JobTracker:JobTracker是一个master服务, JobTracker负责调度job的每一个子任务task运行于TaskTracker上,并监控它们,如果发现有失败的task就重新运行它。一般情况应该把JobTracker部署在单独的机器上。
(3)TaskTracker:TaskTracker是运行于多个节点上的slaver服务。TaskTracker则负责直接执行每一个task。TaskTracker都需要运行在HDFS的DataNode上。
(4)Task:TaskTracker是运行于多个节点上的slaver服务。TaskTracker则负责直接执行每一个task。TaskTracker都需要运行在HDFS的DataNode上。
2.2 数据结构
(1)Mapper和Reducer:运行于Hadoop的MapReduce应用程序最基本的组成部分包括一个Mapper和一个Reducer类,以及一个创建JobConf的执行程序,在一些应用中还可以包括一个Combiner类,它实际也是Reducer的实现。
(2)JobInProgress:JobClient提交job后,JobTracker会创建一个JobInProgress来跟踪和调度这个job,并把它添加到job队列里。JobInProgress会根据提交的job jar中定义的输入数据集(已分解成FileSplit)创建对应的一批TaskInProgress用于监控和调度MapTask,同时在创建指定数目的TaskInProgress用于监控和调度ReduceTask,缺省为1个ReduceTask。
(3)TaskInProgress:JobTracker启动任务时通过每一个TaskInProgress来launchTask,这时会把Task对象(即MapTask和ReduceTask)序列化写入相应的TaskTracker服务中,TaskTracker收到后会创建对应的TaskInProgress(此TaskInProgress实现非JobTracker中使用的TaskInProgress,作用类似)用于监控和调度该Task。启动具体的Task进程是通过TaskInProgress管理的TaskRunner对象来运行的。TaskRunner会自动装载job jar,并设置好环境变量后启动一个独立的java child进程来执行Task,即MapTask或者ReduceTask,但它们不一定运行在同一个TaskTracker中。
(4)MapTask和ReduceTask:一个完整的job会自动依次执行Mapper、Combiner(在JobConf指定了Combiner时执行)和Reducer,其中Mapper和Combiner是由MapTask调用执行,Reducer则由ReduceTask调用,Combiner实际也是Reducer接口类的实现。Mapper会根据job jar中定义的输入数据集按<key1,value1>对读入,处理完成生成临时的<key2,value2>对,如果定义了Combiner,MapTask会在Mapper完成调用该Combiner将相同key的值做合并处理,以减少输出结果集。MapTask的任务全完成即交给ReduceTask进程调用Reducer处理,生成最终结果<key3,value3>对。这个过程在下一部分再详细介绍。
3. Map-Reduce原理及源码解析
下面就从用户使用Hadoop提交一个MapReduce作业进行MapReduce计算这一过程为线索,详细介绍Task执行的细节,并对Hadoop MapReduce的主要代码进行解析。
3.1 JobClient客户端作业提交
MapReduce的过程首先是由客户端提交一个作业开始的,作业提交过程比较简单,它主要为后续作业执行准备环境,主要涉及创建目录、上传文件等操作,一旦用户提交作业后,JobTracker端便会对作业进行初始化,作业初始化的主要工作是根据输入数据量和作业配置参数将作业分解成若干个MapTask以及ReduceTask,并添加到相关数据结构中,以等待后续被调度执行。如图2,客户端作业提交过程可分为如下几个过程:
(1)用户使用Hadoop提供的Shell命令提交作业;
(2)JobClient按照作业配置信息将作业运行需要的全部文件上传到HDFS上;
(3)JobClient调用RPC接口向JobTracker提交作业;
(4)JobTracker接收到作业后,将其告知TaskScheduler,由TaskScheduler对作业进行初始化。
public static RunningJob runJob(JobConf job) throws IOException { //首先生成一个JobClient对象 JobClient jc = new JobClient(job); …… //调用submitJob来提交一个作业 running = jc.submitJob(job); JobID jobId = running.getID(); while (true) { //while循环中不断得到此任务的状态,并打印到客户端console中 } return running;}其中JobClient的submitJob函数实现如下:
public RunningJob submitJob(JobConf job) throws FileNotFoundException, InvalidJobConfException, IOException { //从JobTracker得到当前任务的id JobID jobId = jobSubmitClient.getNewJobId(); //准备将任务运行所需要的要素写入HDFS: //任务运行程序所在的jar封装成job.jar //任务所要处理的input split信息写入job.split //任务运行的配置项汇总写入job.xml Path submitJobDir = new Path(getSystemDir(), jobId.toString()); Path submitJarFile = new Path(submitJobDir, "job.jar"); Path submitSplitFile = new Path(submitJobDir, "job.split"); //此处将-libjars命令行指定的jar上传至HDFS configureCommandLineOptions(job, submitJobDir, submitJarFile); Path submitJobFile = new Path(submitJobDir, "job.xml"); //通过input format的格式获得相应的input split,默认类型为FileSplit InputSplit[] splits = job.getInputFormat().getSplits(job, job.getNumMapTasks()); // 生成一个写入流,将input split得信息写入job.split文件 FSDataOutputStream out = FileSystem.create(fs, submitSplitFile, new FsPermission(JOB_FILE_PERMISSION)); try { //写入job.split文件的信息包括:split文件头,split文件版本号,split的个数,接着依次写入每一个input split的信息。 //对于每一个input split写入:split类型名(默认FileSplit),split的大小,split的内容(对于FileSplit,写入文件名,此split在文件中的起始位置),split的location信息(即在那个DataNode上)。 writeSplitsFile(splits, out); } finally { out.close(); } job.set("mapred.job.split.file", submitSplitFile.toString()); //根据split的个数设定map task的个数 job.setNumMapTasks(splits.length); // 写入job的配置信息入job.xml文件 out = FileSystem.create(fs, submitJobFile, new FsPermission(JOB_FILE_PERMISSION)); try { job.writeXml(out); } finally { out.close(); } //真正的调用JobTracker来提交任务 JobStatus status = jobSubmitClient.submitJob(jobId);}
3.2 JobTracker启动以及Job的初始化
JobTracker作为一个单独的JVM运行,其运行的main函数主要调用有下面两部分:
- 调用静态函数startTracker(new JobConf())创建一个JobTracker对象
- 调用JobTracker.offerService()函数提供服务
在JobTracker的构造函数中,会生成一个taskScheduler成员变量,来进行Job的调度,默认为JobQueueTaskScheduler,也即按照FIFO的方式调度任务。
在offerService函数中,则调用taskScheduler.start(),在这个函数中,为JobTracker(也即taskScheduler的taskTrackerManager)注册了两个Listener:
- JobQueueJobInProgressListener jobQueueJobInProgressListener用于监控job的运行状态
- EagerTaskInitializationListener eagerTaskInitializationListener用于对Job进行初始化
EagerTaskInitializationListener中有一个线程JobInitThread,不断得到jobInitQueue中的JobInProgress对象,调用JobInProgress对象的initTasks函数对任务进行初始化操作。
synchronized (jobs) { synchronized (taskScheduler) { jobs.put(job.getProfile().getJobID(), job); //对JobTracker的每一个listener都调用jobAdded函数 for (JobInProgressListener listener : jobInProgressListeners) { listener.jobAdded(job); } }}
public synchronized void initTasks() throws IOException { //从HDFS中读取job.split文件从而生成input splits String jobFile = profile.getJobFile(); Path sysDir = new Path(this.jobtracker.getSystemDir()); FileSystem fs = sysDir.getFileSystem(conf); DataInputStream splitFile = fs.open(new Path(conf.get("mapred.job.split.file"))); JobClient.RawSplit[] splits; try { splits = JobClient.readSplitFile(splitFile); } finally { splitFile.close(); } //map task的个数就是input split的个数 numMapTasks = splits.length; //为每个map task创建一个TaskInProgress来处理一个input split maps = new TaskInProgress[numMapTasks]; for(int i=0; i < numMapTasks; ++i) { inputLength += splits[i].getDataLength(); maps[i] = new TaskInProgress(jobId, jobFile, splits[i], jobtracker, conf, this, i); } //对于map task,将其放入nonRunningMapCache,是一个Map<Node, List<TaskInProgress>>,也即对于map task来讲,其将会被分配到其input split所在的Node上。nonRunningMapCache将在JobTracker向TaskTracker分配map task的时候使用。 if (numMapTasks > 0) { nonRunningMapCache = createCache(splits, maxLevel); } //创建reduce task this.reduces = new TaskInProgress[numReduceTasks]; for (int i = 0; i < numReduceTasks; i++) { reduces[i] = new TaskInProgress(jobId, jobFile, numMapTasks, i, jobtracker, conf, this); //reduce task放入nonRunningReduces,其将在JobTracker向TaskTracker分配reduce task的时候使用。 nonRunningReduces.add(reduces[i]); } //创建两个cleanup task,一个用来清理map,一个用来清理reduce. cleanup = new TaskInProgress[2]; cleanup[0] = new TaskInProgress(jobId, jobFile, splits[0], jobtracker, conf, this, numMapTasks); cleanup[0].setJobCleanupTask(); cleanup[1] = new TaskInProgress(jobId, jobFile, numMapTasks, numReduceTasks, jobtracker, conf, this); cleanup[1].setJobCleanupTask(); //创建两个初始化task,一个初始化map,一个初始化reduce. setup = new TaskInProgress[2]; setup[0] = new TaskInProgress(jobId, jobFile, splits[0], jobtracker, conf, this, numMapTasks + 1 ); setup[0].setJobSetupTask(); setup[1] = new TaskInProgress(jobId, jobFile, numMapTasks, numReduceTasks + 1, jobtracker, conf, this); setup[1].setJobSetupTask(); tasksInited.set(true);//初始化完毕}
JobTracker启动具体序列图如图5:
图5 JobTracker启动序列图
3.3 TaskTracker启动以及发送心跳Heartbeat
State offerService() throws Exception { long lastHeartbeat = 0; //TaskTracker进程是一直存在的 while (running && !shuttingDown) { long now = System.currentTimeMillis(); //每隔一段时间就向JobTracker发送heartbeat long waitTime = heartbeatInterval - (now - lastHeartbeat); if (waitTime > 0) { synchronized(finishedCount) { if (finishedCount[0] == 0) { finishedCount.wait(waitTime); } finishedCount[0] = 0; } } //发送Heartbeat到JobTracker,得到response HeartbeatResponse heartbeatResponse = transmitHeartBeat(now); //从Response中得到此TaskTracker需要做的事情 TaskTrackerAction[] actions = heartbeatResponse.getActions(); if (actions != null){ for(TaskTrackerAction action: actions) { if (action instanceof LaunchTaskAction) { //如果是运行一个新的Task,则将Action添加到任务队列中 addToTaskQueue((LaunchTaskAction)action); } else if (action instanceof CommitTaskAction) { CommitTaskAction commitAction = (CommitTaskAction)action; if (!commitResponses.contains(commitAction.getTaskID())) { commitResponses.add(commitAction.getTaskID()); } } else { tasksToCleanup.put(action); } } } } return State.NORMAL;}
private HeartbeatResponse transmitHeartBeat(long now) throws IOException { //每隔一段时间,在heartbeat中要返回给JobTracker一些统计信息 boolean sendCounters; if (now > (previousUpdate + COUNTER_UPDATE_INTERVAL)) { sendCounters = true; previousUpdate = now; } else { sendCounters = false; } //报告给JobTracker,此TaskTracker的当前状态 if (status == null) { synchronized (this) { status = new TaskTrackerStatus(taskTrackerName, localHostname, httpPort, cloneAndResetRunningTaskStatuses( sendCounters), failures, maxCurrentMapTasks, maxCurrentReduceTasks); } } //当满足下面的条件的时候,此TaskTracker请求JobTracker为其分配一个新的Task来运行: //当前TaskTracker正在运行的map task的个数小于可以运行的map task的最大个数 //当前TaskTracker正在运行的reduce task的个数小于可以运行的reduce task的最大个数 boolean askForNewTask; long localMinSpaceStart; synchronized (this) { askForNewTask = (status.countMapTasks() < maxCurrentMapTasks || status.countReduceTasks() < maxCurrentReduceTasks) && acceptNewTasks; localMinSpaceStart = minSpaceStart; } //向JobTracker发送heartbeat,这是一个RPC调用 HeartbeatResponse heartbeatResponse = jobClient.heartbeat(status, justStarted, askForNewTask, heartbeatResponseId); return heartbeatResponse;}
3.4 JobTracker接收Heartbeat并向TaskTracker分配任务
当JobTracker被RPC调用来发送heartbeat的时候,JobTracker的heartbeat(TaskTrackerStatus status,boolean initialContact, boolean acceptNewTasks, short responseId)函数被调用:
public synchronized HeartbeatResponse heartbeat(TaskTrackerStatus status, boolean initialContact, boolean acceptNewTasks, short responseId) throws IOException { String trackerName = status.getTrackerName(); short newResponseId = (short)(responseId + 1); HeartbeatResponse response = new HeartbeatResponse(newResponseId, null); List<TaskTrackerAction> actions = new ArrayList<TaskTrackerAction>(); //如果TaskTracker向JobTracker请求一个task运行 if (acceptNewTasks) { TaskTrackerStatus taskTrackerStatus = getTaskTracker(trackerName); if (taskTrackerStatus == null) { LOG.warn("Unknown task tracker polling; ignoring: " + trackerName); } else { //setup和cleanup的task优先级最高 List<Task> tasks = getSetupAndCleanupTasks(taskTrackerStatus); if (tasks == null ) { //任务调度器分配任务 tasks = taskScheduler.assignTasks(taskTrackerStatus); } if (tasks != null) { for (Task task : tasks) { //将任务放入actions列表,返回给TaskTracker expireLaunchingTasks.addNewTask(task.getTaskID()); actions.add(new LaunchTaskAction(task)); } } } } int nextInterval = getNextHeartbeatInterval(); response.setHeartbeatInterval(nextInterval); response.setActions( actions.toArray(new TaskTrackerAction[actions.size()])); return response;}
默认的任务调度器为JobQueueTaskScheduler,其assignTasks如下:
public synchronized List<Task> assignTasks(TaskTrackerStatus taskTracker) throws IOException { ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus(); int numTaskTrackers = clusterStatus.getTaskTrackers(); Collection<JobInProgress> jobQueue = jobQueueJobInProgressListener.getJobQueue(); int maxCurrentMapTasks = taskTracker.getMaxMapTasks(); int maxCurrentReduceTasks = taskTracker.getMaxReduceTasks(); int numMaps = taskTracker.countMapTasks(); int numReduces = taskTracker.countReduceTasks(); //计算剩余的map和reduce的工作量:remaining int remainingReduceLoad = 0; int remainingMapLoad = 0; synchronized (jobQueue) { for (JobInProgress job : jobQueue) { if (job.getStatus().getRunState() == JobStatus.RUNNING) { int totalMapTasks = job.desiredMaps(); int totalReduceTasks = job.desiredReduces(); remainingMapLoad += (totalMapTasks - job.finishedMaps()); remainingReduceLoad += (totalReduceTasks - job.finishedReduces()); } } } //计算平均每个TaskTracker应有的工作量,remaining/numTaskTrackers是剩余的工作量除以TaskTracker的个数。 int maxMapLoad = 0; int maxReduceLoad = 0; if (numTaskTrackers > 0) { maxMapLoad = Math.min(maxCurrentMapTasks, (int) Math.ceil((double) remainingMapLoad / numTaskTrackers)); maxReduceLoad = Math.min(maxCurrentReduceTasks, (int) Math.ceil((double) remainingReduceLoad / numTaskTrackers)); } //map优先于reduce,当TaskTracker上运行的map task数目小于平均的工作量,则向其分配map task if (numMaps < maxMapLoad) { int totalNeededMaps = 0; synchronized (jobQueue) { for (JobInProgress job : jobQueue) { if (job.getStatus().getRunState() != JobStatus.RUNNING) { continue; } Task t = job.obtainNewMapTask(taskTracker, numTaskTrackers, taskTrackerManager.getNumberOfUniqueHosts()); if (t != null) { return Collections.singletonList(t); } } } } //分配完map task,再分配reduce task if (numReduces < maxReduceLoad) { int totalNeededReduces = 0; synchronized (jobQueue) { for (JobInProgress job : jobQueue) { if (job.getStatus().getRunState() != JobStatus.RUNNING || job.numReduceTasks == 0) { continue; } Task t = job.obtainNewReduceTask(taskTracker, numTaskTrackers, taskTrackerManager.getNumberOfUniqueHosts()); if (t != null) { return Collections.singletonList(t); } } } } return null;}
从上面的代码中我们可以知道,JobInProgress的obtainNewMapTask是用来分配map task的,其主要调用findNewMapTask,根据TaskTracker所在的Node从nonRunningMapCache中查找TaskInProgress。JobInProgress的obtainNewReduceTask是用来分配reduce task的,其主要调用findNewReduceTask,从nonRunningReduces查找TaskInProgress。
3.5 TaskTracker接收HeartbeatResponse并启动Task子进程
在向JobTracker发送heartbeat后,返回的reponse中有分配好的任务LaunchTaskAction,将其加入队列,调用addToTaskQueue,如果是map task则放入mapLancher(类型为TaskLauncher),如果是reduce task则放入reduceLancher(类型为TaskLauncher):
private void addToTaskQueue(LaunchTaskAction action) { if (action.getTask().isMapTask()) { mapLauncher.addToTaskQueue(action); } else { reduceLauncher.addToTaskQueue(action); }}
TaskLauncher是一个线程,其run函数从上面放入的queue中取出一个TaskInProgress,然后调用startNewTask(TaskInProgress tip)来启动一个task,其又主要调用了localizeJob(TaskInProgress tip):
private void localizeJob(TaskInProgress tip) throws IOException { //首先要做的一件事情是有关Task的文件要从HDFS拷贝到TaskTracker的本地文件系统中:job.split,job.xml以及job.jar Path localJarFile = null; Task t = tip.getTask(); JobID jobId = t.getJobID(); Path jobFile = new Path(t.getJobFile()); Path localJobFile = lDirAlloc.getLocalPathForWrite( getLocalJobDir(jobId.toString()) + Path.SEPARATOR + "job.xml", jobFileSize, fConf); RunningJob rjob = addTaskToJob(jobId, tip); synchronized (rjob) { if (!rjob.localized) { FileSystem localFs = FileSystem.getLocal(fConf); Path jobDir = localJobFile.getParent(); //将job.split拷贝到本地 systemFS.copyToLocalFile(jobFile, localJobFile); JobConf localJobConf = new JobConf(localJobFile); Path workDir = lDirAlloc.getLocalPathForWrite( (getLocalJobDir(jobId.toString()) + Path.SEPARATOR + "work"), fConf); if (!localFs.mkdirs(workDir)) { throw new IOException("Mkdirs failed to create " + workDir.toString()); } System.setProperty("job.local.dir", workDir.toString()); localJobConf.set("job.local.dir", workDir.toString()); // copy Jar file to the local FS and unjar it. String jarFile = localJobConf.getJar(); long jarFileSize = -1; if (jarFile != null) { Path jarFilePath = new Path(jarFile); localJarFile = new Path(lDirAlloc.getLocalPathForWrite( getLocalJobDir(jobId.toString()) + Path.SEPARATOR + "jars", 5 * jarFileSize, fConf), "job.jar"); if (!localFs.mkdirs(localJarFile.getParent())) { throw new IOException("Mkdirs failed to create jars directory "); } //将job.jar拷贝到本地 systemFS.copyToLocalFile(jarFilePath, localJarFile); localJobConf.setJar(localJarFile.toString()); //将job得configuration写成job.xml OutputStream out = localFs.create(localJobFile); try { localJobConf.writeXml(out); } finally { out.close(); } // 解压缩job.jar RunJar.unJar(new File(localJarFile.toString()), new File(localJarFile.getParent().toString())); } rjob.localized = true; rjob.jobConf = localJobConf; } } //真正的启动此Task launchTaskForJob(tip, new JobConf(rjob.jobConf));}当所有的task运行所需要的资源都拷贝到本地后,则调用launchTaskForJob,其又调用TaskInProgress的launchTask函数:
public synchronized void launchTask() throws IOException { //创建task运行目录 localizeTask(task); if (this.taskStatus.getRunState() == TaskStatus.State.UNASSIGNED) { this.taskStatus.setRunState(TaskStatus.State.RUNNING); } //创建并启动TaskRunner,对于MapTask,创建的是MapTaskRunner,对于ReduceTask,创建的是ReduceTaskRunner this.runner = task.createRunner(TaskTracker.this, this); this.runner.start(); this.taskStatus.setStartTime(System.currentTimeMillis());}TaskRunner是一个线程,其run函数如下:
public final void run() { …… TaskAttemptID taskid = t.getTaskID(); LocalDirAllocator lDirAlloc = new LocalDirAllocator("mapred.local.dir"); File jobCacheDir = null; if (conf.getJar() != null) { jobCacheDir = new File( new Path(conf.getJar()).getParent().toString()); } File workDir = new File(lDirAlloc.getLocalPathToRead( TaskTracker.getLocalTaskDir( t.getJobID().toString(), t.getTaskID().toString(), t.isTaskCleanupTask()) + Path.SEPARATOR + MRConstants.WORKDIR, conf). toString()); FileSystem fileSystem; Path localPath; //拼写classpath String baseDir; String sep = System.getProperty("path.separator"); StringBuffer classPath = new StringBuffer(); // start with same classpath as parent process classPath.append(System.getProperty("java.class.path")); classPath.append(sep); if (!workDir.mkdirs()) { if (!workDir.isDirectory()) { LOG.fatal("Mkdirs failed to create " + workDir.toString()); } } String jar = conf.getJar(); if (jar != null) { // if jar exists, it into workDir File[] libs = new File(jobCacheDir, "lib").listFiles(); if (libs != null) { for (int i = 0; i < libs.length; i++) { classPath.append(sep); // add libs from jar to classpath classPath.append(libs[i]); } } classPath.append(sep); classPath.append(new File(jobCacheDir, "classes")); classPath.append(sep); classPath.append(jobCacheDir); } classPath.append(sep); classPath.append(workDir); //拼写完整的命令行(java命令及其参数) Vector<String> vargs = new Vector<String>(8); File jvm = new File(new File(System.getProperty("java.home"), "bin"), "java"); vargs.add(jvm.toString()); String javaOpts = conf.get("mapred.child.java.opts", "-Xmx200m"); javaOpts = javaOpts.replace("@taskid@", taskid.toString()); String [] javaOptsSplit = javaOpts.split(" "); String libraryPath = System.getProperty("java.library.path"); if (libraryPath == null) { libraryPath = workDir.getAbsolutePath(); } else { libraryPath += sep + workDir; } boolean hasUserLDPath = false; for(int i=0; i<javaOptsSplit.length ;i++) { if(javaOptsSplit[i].startsWith("-Djava.library.path=")) { javaOptsSplit[i] += sep + libraryPath; hasUserLDPath = true; break; } } if(!hasUserLDPath) { vargs.add("-Djava.library.path=" + libraryPath); } for (int i = 0; i < javaOptsSplit.length; i++) { vargs.add(javaOptsSplit[i]); } //添加Child进程的临时文件夹 String tmp = conf.get("mapred.child.tmp", "./tmp"); Path tmpDir = new Path(tmp); if (!tmpDir.isAbsolute()) { tmpDir = new Path(workDir.toString(), tmp); } FileSystem localFs = FileSystem.getLocal(conf); if (!localFs.mkdirs(tmpDir) && !localFs.getFileStatus(tmpDir).isDir()) { throw new IOException("Mkdirs failed to create " + tmpDir.toString()); } vargs.add("-Djava.io.tmpdir=" + tmpDir.toString()); // Add classpath. vargs.add("-classpath"); vargs.add(classPath.toString()); // 运行map task和reduce task的子进程的main class是Child vargs.add(Child.class.getName()); // main of Child //启动一个JVM子进程去执行Map任务或Task任务 jvmManager.launchJvm(this, jvmManager.constructJvmEnv(setup,vargs,stdout,stderr,logSize, workDir, env, pidFile, conf));}TaskTracker发送心跳及接收命令启动Task进程具体序列图如图8:
3.6 MapReduce任务的执行
while (true) { //从TaskTracker通过网络通信得到JvmTask对象 JvmTask myTask = umbilical.getTask(jvmId); idleLoopCount = 0; task = myTask.getTask(); taskid = task.getTaskID(); isCleanup = task.isTaskCleanupTask(); JobConf job = new JobConf(task.getJobFile()); TaskRunner.setupWorkDir(job); numTasksToExecute = job.getNumTasksToExecutePerJvm(); task.setConf(job); defaultConf.addResource(new Path(task.getJobFile())); //运行task task.run(job, umbilical); // run the task if (numTasksToExecute > 0 && ++numTasksExecuted == numTasksToExecute) { break; }}<span style="font-family: Arial, sans-serif; background-color: rgb(255, 255, 255);"> </span>
3.6.1 MapTask
(1)Read阶段:MapTask通过用户编写的RecordReader,从输入InputSplit中解析出一个个key/value。
(2)Map阶段:该阶段只要是将解析出的key/value交给用户编写的map()函数处理,并产生一系列新的key/value。
(3)Collect阶段:在用户编写的map()函数中,当数据处理完成后,一般会调用OutputCollector.collect()输出结果。在该函数内部,它会将生成的ley/value分片(通过调用Partition),并写入一个环形内存缓冲区中。
(4)Spill阶段:即“溢写”,当环形缓冲区满后,MapReduce会将数据写到本地磁盘上,生成一个临时文件。需要注意的是,将数据写入本地磁盘之前,先要对数据进行一次本地排序,并在必要时对数据进行合并操作。
(5)Combine阶段:当所有数据处理完成后,MapTask对所有临时文件进行一次合并,以确保最终只会生成一个数据文件。
public void run(final JobConf job, final TaskUmbilicalProtocol umbilical) throws IOException { //用于同TaskTracker进行通信,汇报运行状况 final Reporter reporter = getReporter(umbilical); startCommunicationThread(umbilical); initialize(job, reporter); //map task的输出 int numReduceTasks = conf.getNumReduceTasks(); MapOutputCollector collector = null; if (numReduceTasks > 0) { collector = new MapOutputBuffer(umbilical, job, reporter); } else { collector = new DirectMapOutputCollector(umbilical, job, reporter); } //读取input split,按照其中的信息,生成RecordReader来读取数据instantiatedSplit = (InputSplit) ReflectionUtils.newInstance(job.getClassByName(splitClass), job); DataInputBuffer splitBuffer = new DataInputBuffer(); splitBuffer.reset(split.getBytes(), 0, split.getLength()); instantiatedSplit.readFields(splitBuffer); if (instantiatedSplit instanceof FileSplit) { FileSplit fileSplit = (FileSplit) instantiatedSplit; job.set("map.input.file", fileSplit.getPath().toString()); job.setLong("map.input.start", fileSplit.getStart()); job.setLong("map.input.length", fileSplit.getLength()); } RecordReader rawIn = // open input job.getInputFormat().getRecordReader(instantiatedSplit, job, reporter); RecordReader in = isSkipping() ? new SkippingRecordReader(rawIn, getCounters(), umbilical) : new TrackedRecordReader(rawIn, getCounters()); job.setBoolean("mapred.skip.on", isSkipping()); //对于map task,生成一个MapRunnable,默认是MapRunner MapRunnable runner = ReflectionUtils.newInstance(job.getMapRunnerClass(), job); try { //MapRunner的run函数就是依次读取RecordReader中的数据,然后调用Mapper的map函数进行处理。 runner.run(in, collector, reporter); collector.flush(); } finally { in.close(); // close input collector.close(); } done(umbilical);}MapRunner的run函数就是依次读取RecordReader中的数据,然后调用Mapper的map函数进行处理:
public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output, Reporter reporter) throws IOException { try { K1 key = input.createKey(); V1 value = input.createValue(); while (input.next(key, value)) { mapper.map(key, value, output, reporter); if(incrProcCount) { reporter.incrCounter(SkipBadRecords.COUNTER_GROUP, SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1); } } } finally { mapper.close(); }}<span style="font-family: Arial, sans-serif; background-color: rgb(255, 255, 255);"> </span>
结果集全部收集到MapOutputBuffer中,其collect函数如下:
public synchronized void collect(K key, V value) throws IOException { reporter.progress(); //从此处看,此buffer是一个ring的数据结构 final int kvnext = (kvindex + 1) % kvoffsets.length; spillLock.lock(); try { boolean kvfull; do { //在ring中,如果下一个空闲位置接上起始位置的话,则表示满了 kvfull = kvnext == kvstart; //在ring中计算是否需要将buffer写入硬盘的阈值 final boolean kvsoftlimit = ((kvnext > kvend) ? kvnext - kvend > softRecordLimit : kvend - kvnext <= kvoffsets.length - softRecordLimit); //如果到达阈值,则开始将buffer写入硬盘,写成spill文件。 //startSpill主要是notify一个背后线程SpillThread的run()函数,开始调用sortAndSpill()开始排序,合并,写入硬盘 if (kvstart == kvend && kvsoftlimit) { startSpill(); } //如果buffer满了,则只能等待写入完毕 if (kvfull) { while (kvstart != kvend) { reporter.progress(); spillDone.await(); } } } while (kvfull); } finally { spillLock.unlock(); } try { //如果buffer不满,则将key, value写入buffer int keystart = bufindex; keySerializer.serialize(key); final int valstart = bufindex; valSerializer.serialize(value); int valend = bb.markRecord(); //调用设定的partitioner,根据key, value取得partition id final int partition = partitioner.getPartition(key, value, partitions); mapOutputRecordCounter.increment(1); mapOutputByteCounter.increment(valend >= keystart ? valend - keystart : (bufvoid - keystart) + valend); //将parition id以及key, value在buffer中的偏移量写入索引数组 int ind = kvindex * ACCTSIZE; kvoffsets[kvindex] = ind; kvindices[ind + PARTITION] = partition; kvindices[ind + KEYSTART] = keystart; kvindices[ind + VALSTART] = valstart; kvindex = kvnext; } catch (MapBufferTooSmallException e) { LOG.info("Record too large for in-memory buffer: " + e.getMessage()); spillSingleRecord(key, value); mapOutputRecordCounter.increment(1); return; }}内存buffer的格式如下:
图11 内存buffer格式
从上面可知,内存buffer写入硬盘spill文件的函数为sortAndSpill:
private void sortAndSpill() throws IOException { FSDataOutputStream out = null; FSDataOutputStream indexOut = null; IFileOutputStream indexChecksumOut = null; //创建硬盘上的spill文件 Path filename = mapOutputFile.getSpillFileForWrite(getTaskID(), numSpills, size); out = rfs.create(filename); final int endPosition = (kvend > kvstart) ? kvend : kvoffsets.length + kvend; //按照partition的顺序对buffer中的数据进行排序 sorter.sort(MapOutputBuffer.this, kvstart, endPosition, reporter); int spindex = kvstart; InMemValBytes value = new InMemValBytes(); //依次一个一个parition的写入文件 for (int i = 0; i < partitions; ++i) { IFile.Writer<K, V> writer = null; long segmentStart = out.getPos(); writer = new Writer<K, V>(job, out, keyClass, valClass, codec); //如果combiner为空,则直接写入文件 if (null == combinerClass) { writer.append(key, value); ++spindex; } else { //如果combiner不为空,则先combine,调用combiner.reduce(…)函数后再写入文件 combineAndSpill(kvIter, combineInputCounter); } }}当map阶段结束的时候,MapOutputBuffer的flush函数会被调用,其也会调用sortAndSpill将buffer中的写入文件,然后再调用mergeParts来合并写入在硬盘上的多个spill:
private void mergeParts() throws IOException { //对于每一个partition for (int parts = 0; parts < partitions; parts++){ //create the segments to be merged List<Segment<K, V>> segmentList = new ArrayList<Segment<K, V>>(numSpills); TaskAttemptID mapId = getTaskID(); //依次从各个spill文件中收集属于当前partition的段 for(int i = 0; i < numSpills; i++) { final IndexRecord indexRecord = getIndexInformation(mapId, i, parts); long segmentOffset = indexRecord.startOffset; long segmentLength = indexRecord.partLength; Segment<K, V> s = new Segment<K, V>(job, rfs, filename[i], segmentOffset, segmentLength, codec, true); segmentList.add(i, s); } //将属于同一个partition的段merge到一起 RawKeyValueIterator kvIter = Merger.merge(job, rfs, keyClass, valClass, segmentList, job.getInt("io.sort.factor", 100), new Path(getTaskID().toString()), job.getOutputKeyComparator(), reporter); //写入合并后的段到文件 long segmentStart = finalOut.getPos(); Writer<K, V> writer = new Writer<K, V>(job, finalOut, keyClass, valClass, codec); if (null == combinerClass || numSpills < minSpillsForCombine) { Merger.writeFile(kvIter, writer, reporter, job); } else { combineCollector.setWriter(writer); combineAndSpill(kvIter, combineInputCounter); } …… }}
3.6.2 ReduceTask
(1)Shuffle阶段:也称为Copy阶段。ReduceTask从各个MapTask所在的TaskTracker上远程拷贝一片数据,并针对某一片数据,如果其大小超过一定阈值,则写到磁盘上,否则直接放到内存中。
(2)Merge阶段:在远程拷贝数据的同时,ReduceTask启动了两个后台线程对内存和磁盘上的文件进行合并,以防止内存使用过多或磁盘上的文件过多,并且可以为后面整体的归并排序减负,提升排序效率。
(3)Sort阶段:按照MapReduce的语义,用户编写的reduce()函数输入数据是按key进行聚集的一组数据。为了将key相同的数据聚集在一起,Hadoop采用了基于排序的策略。由于各个MapTask已经实现了自己的处理结果进行了局部排序,因此,ReduceTask只需要对所有数据进行一次归并排序即可。
(4)Reduce阶段:在该阶段中,ReduceTask将每组数据依次交给用户编写的reduce()函数处理。
(5)Write阶段:reduce()函数将计算结果写到HDFS上
public void run(JobConf job, final TaskUmbilicalProtocol umbilical) throws IOException { job.setBoolean("mapred.skip.on", isSkipping()); //对于reduce,则包含三个步骤:拷贝,排序,Reduce if (isMapOrReduce()) { copyPhase = getProgress().addPhase("copy"); sortPhase = getProgress().addPhase("sort"); reducePhase = getProgress().addPhase("reduce"); } startCommunicationThread(umbilical); final Reporter reporter = getReporter(umbilical); initialize(job, reporter); //copy阶段,主要使用ReduceCopier的fetchOutputs函数获得map的输出。创建多个线程MapOutputCopier,其中copyOutput进行拷贝。 boolean isLocal = "local".equals(job.get("mapred.job.tracker", "local")); if (!isLocal) { reduceCopier = new ReduceCopier(umbilical, job); if (!reduceCopier.fetchOutputs()) { …… } } copyPhase.complete(); //sort阶段,将得到的map输出合并,直到文件数小于io.sort.factor时停止,返回一个Iterator用于访问key-value setPhase(TaskStatus.Phase.SORT); statusUpdate(umbilical); final FileSystem rfs = FileSystem.getLocal(job).getRaw(); RawKeyValueIterator rIter = isLocal ? Merger.merge(job, rfs, job.getMapOutputKeyClass(), job.getMapOutputValueClass(), codec, getMapFiles(rfs, true), !conf.getKeepFailedTaskFiles(), job.getInt("io.sort.factor", 100), new Path(getTaskID().toString()), job.getOutputKeyComparator(), reporter) : reduceCopier.createKVIterator(job, rfs, reporter); mapOutputFilesOnDisk.clear(); sortPhase.complete(); //reduce阶段 setPhase(TaskStatus.Phase.REDUCE); …… Reducer reducer = ReflectionUtils.newInstance(job.getReducerClass(), job); Class keyClass = job.getMapOutputKeyClass(); Class valClass = job.getMapOutputValueClass(); ReduceValuesIterator values = isSkipping() ? new SkippingReduceValuesIterator(rIter, job.getOutputValueGroupingComparator(), keyClass, valClass, job, reporter, umbilical) : new ReduceValuesIterator(rIter, job.getOutputValueGroupingComparator(), keyClass, valClass, job, reporter); //逐个读出key-value list,然后调用Reducer的reduce函数 while (values.more()) { reduceInputKeyCounter.increment(1); reducer.reduce(values.getKey(), values, collector, reporter); values.nextKey(); values.informReduceProgress(); } reducer.close(); out.close(reporter); done(umbilical);}整个MapReuce计算流程其实最重要并且也是最复杂的就是ReduceTask的Shuffle阶段了,如图14所示,这个阶段又可以细分为以下几个阶段:
- 准备运行完成的MapTask列表:GetMapEventsThread线程周期性通过RPC从TaskTracker获取已完成MapTask列表,最终保存在scheduledCopies中。
- 远程拷贝数据:ReduceTask同时启动多个MapOutputCopier线程,这些线程从scheduledCopies列表中获取MapTask输出位置,并通过HTTPGet远程拷贝数据。对于获取的分片数据,如果大小超过一定阈值,则存放到磁盘上,否则直接放到内存中。
- 合并内存文件和磁盘文件:为了防止内存或者磁盘上的文件数过多,ReduceTask启动了LocalFSMerger和InMemFSMergeThread两个线程的分别对磁盘和内存上的文件进行合并。
4. 总结
本文深入剖析了Hadoop的MapReduce离线计算框架的内部实现机制,特别是对于Shuffle阶段的深入理解,有助于我们更深刻的理解MapReduce程序,不断调优写出高性能的MapReduce(Hive)程序。并且在遇到Hadoop相关问题的时候能够根据原理及源码快速定位问题,解决问题,对于大数据领域相关的工程师是非常有必要看一看Hadoop源码。
- HadoopMapReduce源码解析
- HadoopMapReduce
- 第50课:HadoopMapReduce倒排索引解析与实战
- 第51课:HadoopMapReduce多维排序解析与实战
- 关于HadoopMapReduce的精彩介绍
- Hadoop2.6.0运行mapreduce之Uber模式验证 标签: hadoopmapreduce源码uberjava 2016-05-05 14:55 19815人阅读 评论(2) 收藏 举报
- 源码解析
- 源码解析
- HadoopMapReduce -Map-Reduce具体实现详解
- 【JDk源码解析之一】ArrayList源码解析
- 【源码解析】-- ArrayList的源码解析
- EventBus源码解析(史上最全的源码解析)
- 【源码】Vector、Stack源码解析
- Sping源码解析-源码下载
- <Android源码>IntentService源码解析
- JAVA源码解析-String源码
- JAVA源码解析-ArrayList源码
- JAVA源码解析-LinkedList源码
- 经纬度搜索(1)-Geohash算法原理
- 黑马程序员——JavaSE之多线程中关于锁的理解 二
- List与IList的区别
- 连载《一个程序猿的生命周期》- 37、《从0到1》中提到的4点创业信条! 【含】李彦宏的《开讲啦》
- CSDN成长史——用心写博客
- HadoopMapReduce源码解析
- VS2008/BCGControlBar常见问题解决
- [京东 + 华为面试 + 金山笔试]自己总结
- FZU 2150 Fire Game (暴搜/ BFS+DFS)
- 笔记:学习 Android -Handler,Thread,Looper
- hdoj I Hate It 1754 (线段树区间最值 模板)
- 本地通知和远程通知
- Linux--rhel7.0--忘记root用户密码解决办法
- MVC中处理Json和JS中处理Json对象