Hadoop MapReduce Process: A Source Code Walkthrough
Most of the Hadoop source code analyses available online lag somewhat behind the latest code. For the purpose of learning and summarizing, this article walks through the Hadoop 2.0.2 source code.
Overview
A complete Hadoop MapReduce run can be described as follows:
- The client submits a MapReduce job to the JobTracker;
- The JobTracker schedules the job and generates MapTasks and ReduceTasks;
- Each TaskTracker receives MapTasks and ReduceTasks;
- The TaskTracker starts a new Child Task JVM for each MapTask or ReduceTask;
- The Child Task JVM runs the MapTask or ReduceTask;
- The Child Task JVM reports progress and status to the JobTracker through the TaskTracker;
- When all tasks under the JobTracker have succeeded, the job is marked as successful.
JobClient Submits the MapReduce Job
Once JobClient.runJob() is called, the MapReduce elephant starts running.
Drilling into Job.submit(), we find the method that does the real work: JobSubmitter.submitJobInternal().
```java
JobStatus submitJobInternal(Job job, Cluster cluster)
    throws ClassNotFoundException, InterruptedException, IOException {
  // Check the job specification
  checkSpecs(job);

  Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster, job.getConfiguration());
  // configure the command line options correctly on the submitting dfs
  Configuration conf = job.getConfiguration();
  InetAddress ip = InetAddress.getLocalHost();
  if (ip != null) {
    submitHostAddress = ip.getHostAddress();
    submitHostName = ip.getHostName();
    conf.set(MRJobConfig.JOB_SUBMITHOST, submitHostName);
    conf.set(MRJobConfig.JOB_SUBMITHOSTADDR, submitHostAddress);
  }
  JobID jobId = submitClient.getNewJobID();
  job.setJobID(jobId);
  Path submitJobDir = new Path(jobStagingArea, jobId.toString());
  JobStatus status = null;
  try {
    conf.set("hadoop.http.filter.initializers",
        "org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer");
    conf.set(MRJobConfig.MAPREDUCE_JOB_DIR, submitJobDir.toString());
    LOG.debug("Configuring job " + jobId + " with " + submitJobDir + " as the submit dir");
    // get delegation token for the dir
    TokenCache.obtainTokensForNamenodes(job.getCredentials(),
        new Path[] { submitJobDir }, conf);
    populateTokenCache(conf, job.getCredentials());

    // Copy the job's jars and configuration files to HDFS
    copyAndConfigureFiles(job, submitJobDir);
    Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);

    // Create the job's InputSplits and persist them to job.split,
    // which contains the host location info of every split
    LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
    int maps = writeSplits(job, submitJobDir);
    conf.setInt(MRJobConfig.NUM_MAPS, maps);
    LOG.info("number of splits:" + maps);

    String queue = conf.get(MRJobConfig.QUEUE_NAME, JobConf.DEFAULT_QUEUE_NAME);
    AccessControlList acl = submitClient.getQueueAdmins(queue);
    conf.set(toFullPropertyName(queue, QueueACL.ADMINISTER_JOBS.getAclName()),
        acl.getAclString());
    TokenCache.cleanUpTokenReferral(conf);

    // Write job file to submit dir
    writeConf(conf, submitJobFile);

    // Actually submit the job
    printTokens(jobId, job.getCredentials());
    status = submitClient.submitJob(
        jobId, submitJobDir.toString(), job.getCredentials());
    if (status != null) {
      return status;
    } else {
      throw new IOException("Could not launch job");
    }
  } finally {
    if (status == null) {
      LOG.info("Cleaning up the staging area " + submitJobDir);
      if (jtFs != null && submitJobDir != null)
        jtFs.delete(submitJobDir, true);
    }
  }
}
```
After the job is submitted, the client normally calls Job.waitForCompletion(). Stepping into Job.monitorAndPrintJob(), we can see that this method mainly prints the map/reduce progress percentages while the job runs, and prints the final execution status (e.g. counter values) once it completes.
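For context, a typical driver that ends up in submitJobInternal() looks roughly like the sketch below. This is a minimal word-count-style driver using the new API; the class names WordCountDriver/WordCountMapper/WordCountReducer and the path arguments are made up for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");   // new-API entry point
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);        // hypothetical user Mapper
    job.setReducerClass(WordCountReducer.class);      // hypothetical user Reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // waitForCompletion() calls submit() -> submitJobInternal(), then
    // monitorAndPrintJob() to print progress and the final counters.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```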
From this analysis we can see:
- The map input splits, and therefore the number of map tasks, are already determined at this stage by calling the InputFormat's getSplits() method (a split-size sketch follows the code below);
```java
private <T extends InputSplit> int writeNewSplits(JobContext job, Path jobSubmitDir)
    throws IOException, InterruptedException, ClassNotFoundException {
  Configuration conf = job.getConfiguration();
  InputFormat<?, ?> input =
      ReflectionUtils.newInstance(job.getInputFormatClass(), conf);

  // Call the InputFormat's getSplits() method to produce the InputSplits
  List<InputSplit> splits = input.getSplits(job);
  T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);

  // sort the splits into order based on size, so that the biggest
  // go first
  Arrays.sort(array, new SplitComparator());
  JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
      jobSubmitDir.getFileSystem(conf), array);
  return array.length;
}
```
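For FileInputFormat-based jobs, the number of map tasks therefore falls out of the split size calculation. The following is only a sketch that mirrors the well-known split-size formula, with hypothetical class and variable names; the configuration property names given in the comments are the 2.x ones as I recall them.

```java
// Sketch of how FileInputFormat-style implementations size their splits.
// minSize/maxSize come from the job configuration
// (mapreduce.input.fileinputformat.split.minsize / .maxsize in the 2.x line),
// blockSize is the HDFS block size of the input file.
public class SplitSizeSketch {
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024;  // 128 MB HDFS block
    long splitSize = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
    // With default min/max, a 1 GB file yields 1 GB / 128 MB = 8 splits,
    // and hence 8 map tasks.
    System.out.println("split size = " + splitSize + " bytes");
  }
}
```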
JobTracker Generates Map Tasks and Reduce Tasks
The JobTracker is Hadoop's command center. Its responsibilities are:
- Communicating with the JobClient to receive new jobs; this is defined in ClientProtocol.
- Decomposing a job into MapTasks and ReduceTasks and storing them in its queues; this is defined in TaskTrackerManager.initJob().
- Communicating with the TaskTrackers and handing tasks down to them for execution; this is defined in InterTrackerProtocol.heartbeat().
- Aggregating the status of all tasks belonging to a job, which in turn determines the job's status.
The JobTracker provides these services through offerService().
```java
public void offerService() throws InterruptedException, IOException {
  // Prepare for recovery. This is done irrespective of the status of restart
  // flag.
  while (true) {
    try {
      recoveryManager.updateRestartCount();
      break;
    } catch (IOException ioe) {
      LOG.warn("Failed to initialize recovery manager. ", ioe);
      // wait for some time
      Thread.sleep(FS_ACCESS_RETRY_PERIOD);
      LOG.warn("Retrying...");
    }
  }

  taskScheduler.start();

  recoveryManager.recover();

  // refresh the node list as the recovery manager might have added
  // disallowed trackers
  refreshHosts();

  startExpireTrackersThread();

  expireLaunchingTaskThread.start();

  if (completedJobStatusStore.isActive()) {
    completedJobsStoreThread = new Thread(completedJobStatusStore,
        "completedjobsStore-housekeeper");
    completedJobsStoreThread.start();
  }

  // start the inter-tracker server once the jt is ready
  this.interTrackerServer.start();

  synchronized (this) {
    state = State.RUNNING;
  }
  LOG.info("Starting RUNNING");

  this.interTrackerServer.join();
  LOG.info("Stopped interTrackerServer");
}
```
JobTracker Accepts the Job Submitted by the JobClient
```java
private JobStatus submitJob(org.apache.hadoop.mapreduce.JobID jobID, int restartCount,
    UserGroupInformation ugi, String jobSubmitDir, boolean recovered, Credentials ts)
    throws IOException, InterruptedException {
  ...
  // Create the JobInProgress, temporarily unlock the JobTracker since
  // we are about to copy job.xml from HDFS
  JobInProgress job = new JobInProgress(this, this.conf, restartCount, jobInfo, ts);

  synchronized (this) {
    ...
    return addJob(jobId, job);
  }
}
```
JobInitManager Decomposes the Job into Tasks and Queues Them
```java
class JobInitManager implements Runnable {
  public void run() {
    JobInProgress job = null;
    while (true) {
      try {
        synchronized (jobInitQueue) {
          while (jobInitQueue.isEmpty()) {
            jobInitQueue.wait();
          }
          job = jobInitQueue.remove(0);
        }
        threadPool.execute(new InitJob(job));
      } catch (InterruptedException t) {
        LOG.info("JobInitManagerThread interrupted.");
        break;
      }
    }
    LOG.info("Shutting down thread pool");
    threadPool.shutdownNow();
  }
}
```
InitJob then calls JobInProgress.initTasks(), which creates the TaskInProgress objects for the MapTasks and ReduceTasks.
```java
public synchronized void initTasks()
    throws IOException, KillInterruptedException, UnknownHostException {
  ...
  createMapTasks(jobFile.toString(), taskSplitMetaInfo);
  ...
  // set the launch time
  this.launchTime = JobTracker.getClock().getTime();
  createReduceTasks(jobFile.toString());
  ...
}
```
JobTracker Assigns Tasks to TaskTrackers
In its heartbeat() method, the JobTracker calls JobQueueTaskScheduler.assignTasks(TaskTracker taskTracker) and returns the assigned tasks inside the HeartbeatResponse.
```java
public synchronized HeartbeatResponse heartbeat(TaskTrackerStatus status,
    boolean restarted, boolean initialContact, boolean acceptNewTasks, short responseId)
    throws IOException {
  ...
  // Process this heartbeat
  short newResponseId = (short) (responseId + 1);
  status.setLastSeen(now);
  if (!processHeartbeat(status, initialContact)) {
    if (prevHeartbeatResponse != null) {
      trackerToHeartbeatResponseMap.remove(trackerName);
    }
    return new HeartbeatResponse(newResponseId,
        new TaskTrackerAction[] { new ReinitTrackerAction() });
  }

  // Initialize the response to be sent for the heartbeat
  HeartbeatResponse response = new HeartbeatResponse(newResponseId, null);
  List<TaskTrackerAction> actions = new ArrayList<TaskTrackerAction>();
  isBlacklisted = faultyTrackers.isBlacklisted(status.getHost());
  // Check for new tasks to be executed on the tasktracker
  if (acceptNewTasks && !isBlacklisted) {
    TaskTrackerStatus taskTrackerStatus = getTaskTrackerStatus(trackerName);
    if (taskTrackerStatus == null) {
      LOG.warn("Unknown task tracker polling; ignoring: " + trackerName);
    } else {
      List<Task> tasks = getSetupAndCleanupTasks(taskTrackerStatus);
      if (tasks == null) {
        tasks = taskScheduler.assignTasks(taskTrackers.get(trackerName));
      }
      if (tasks != null) {
        for (Task task : tasks) {
          expireLaunchingTasks.addNewTask(task.getTaskID());
          if (LOG.isDebugEnabled()) {
            LOG.debug(trackerName + " -> LaunchTask: " + task.getTaskID());
          }
          actions.add(new LaunchTaskAction(task));
        }
      }
    }
  }
  ...
  int nextInterval = getNextHeartbeatInterval();
  response.setHeartbeatInterval(nextInterval);
  response.setActions(actions.toArray(new TaskTrackerAction[actions.size()]));

  // Update the trackerToHeartbeatResponseMap
  trackerToHeartbeatResponseMap.put(trackerName, response);
  ...
  return response;
}
```
TaskTracker Receives and Launches Tasks
The TaskTracker is Hadoop's task-execution node. Its responsibilities are:
- Communicating with the JobTracker to receive tasks;
- Launching a Child JVM to run a MapTask or ReduceTask;
- Reporting the child task's execution status back to the JobTracker; this reporting is defined in TaskUmbilicalProtocol.
```java
/**
 * The server retry loop.
 * This while-loop attempts to connect to the JobTracker. It only
 * loops when the old TaskTracker has gone bad (its state is
 * stale somehow) and we need to reinitialize everything.
 */
public void run() {
  try {
    startCleanupThreads();
    boolean denied = false;
    while (running && !shuttingDown && !denied) {
      boolean staleState = false;
      try {
        // This while-loop attempts reconnects if we get network errors
        while (running && !staleState && !shuttingDown && !denied) {
          try {
            State osState = offerService();
            if (osState == State.STALE) {
              staleState = true;
            } else if (osState == State.DENIED) {
              denied = true;
            }
          } catch (Exception ex) {
            if (!shuttingDown) {
              LOG.info("Lost connection to JobTracker ["
                  + jobTrackAddr + "]. Retrying...", ex);
              try {
                Thread.sleep(5000);
              } catch (InterruptedException ie) {
              }
            }
          }
        }
      } finally {
        close();
      }
      if (shuttingDown) {
        return;
      }
      LOG.warn("Reinitializing local state");
      initialize();
    }
    if (denied) {
      shutdown();
    }
  } catch (IOException iex) {
    LOG.error("Got fatal exception while reinitializing TaskTracker: "
        + StringUtils.stringifyException(iex));
    return;
  } catch (InterruptedException i) {
    LOG.error("Got interrupted while reinitializing TaskTracker: " + i.getMessage());
    return;
  }
}
```
The main operations are defined in the offerService() function.
TaskTracker Receives the Task
In TaskTracker.offerService(), when a LaunchTaskAction is received, the task is added to the tasksToLaunch queue.
```java
State offerService() throws Exception {
  long lastHeartbeat = 0;

  while (running && !shuttingDown) {
    try {
      ...
      // Send the heartbeat and process the jobtracker's directives
      HeartbeatResponse heartbeatResponse = transmitHeartBeat(now);

      TaskTrackerAction[] actions = heartbeatResponse.getActions();
      ...
      if (actions != null) {
        for (TaskTrackerAction action : actions) {
          if (action instanceof LaunchTaskAction) {
            addToTaskQueue((LaunchTaskAction) action);
          } else if (action instanceof CommitTaskAction) {
            CommitTaskAction commitAction = (CommitTaskAction) action;
            if (!commitResponses.contains(commitAction.getTaskID())) {
              LOG.info("Received commit task action for " + commitAction.getTaskID());
              commitResponses.add(commitAction.getTaskID());
            }
          } else {
            tasksToCleanup.put(action);
          }
        }
      }
      markUnresponsiveTasks();
      killOverflowingTasks();

      // we've cleaned up, resume normal operation
      if (!acceptNewTasks && isIdle()) {
        acceptNewTasks = true;
      }
      ...
    }
  }
  return State.NORMAL;
}
```
The TaskLauncher thread keeps polling the tasksToLaunch queue; once enough slots are free, it calls launchTask() and the task launch begins.
```java
public void run() {
  while (!Thread.interrupted()) {
    try {
      TaskInProgress tip;
      Task task;
      synchronized (tasksToLaunch) {
        while (tasksToLaunch.isEmpty()) {
          tasksToLaunch.wait();
        }
        // get the TIP
        tip = tasksToLaunch.remove(0);
        task = tip.getTask();
        LOG.info("Trying to launch : " + tip.getTask().getTaskID()
            + " which needs " + task.getNumSlotsRequired() + " slots");
      }
      // wait for free slots to run
      synchronized (numFreeSlots) {
        boolean canLaunch = true;
        while (numFreeSlots.get() < task.getNumSlotsRequired()) {
          // Make sure that there is no kill task action for this task!
          // We are not locking tip here, because it would reverse the
          // locking order!
          // Also, Lock for the tip is not required here! because :
          // 1. runState of TaskStatus is volatile
          // 2. Any notification is not missed because notification is
          //    synchronized on numFreeSlots. So, while we are doing the check,
          //    if the tip is half way through the kill(), we don't miss
          //    notification for the following wait().
          if (!tip.canBeLaunched()) {
            // got killed externally while still in the launcher queue
            LOG.info("Not blocking slots for " + task.getTaskID()
                + " as it got killed externally. Task's state is " + tip.getRunState());
            canLaunch = false;
            break;
          }
          LOG.info("TaskLauncher : Waiting for " + task.getNumSlotsRequired()
              + " to launch " + task.getTaskID()
              + ", currently we have " + numFreeSlots.get() + " free slots");
          numFreeSlots.wait();
        }
        if (!canLaunch) {
          continue;
        }
        LOG.info("In TaskLauncher, current free slots : " + numFreeSlots.get()
            + " and trying to launch " + tip.getTask().getTaskID()
            + " which needs " + task.getNumSlotsRequired() + " slots");
        numFreeSlots.set(numFreeSlots.get() - task.getNumSlotsRequired());
        assert (numFreeSlots.get() >= 0);
      }
      synchronized (tip) {
        // to make sure that there is no kill task action for this
        if (!tip.canBeLaunched()) {
          // got killed externally while still in the launcher queue
          LOG.info("Not launching task " + task.getTaskID() + " as it got"
              + " killed externally. Task's state is " + tip.getRunState());
          addFreeSlots(task.getNumSlotsRequired());
          continue;
        }
        tip.slotTaken = true;
      }
      // got a free slot. launch the task
      startNewTask(tip);
    } catch (InterruptedException e) {
      return; // ALL DONE
    } catch (Throwable th) {
      LOG.error("TaskLauncher error " + StringUtils.stringifyException(th));
    }
  }
}
```
TaskTracker Launches the Task JVM
- The TaskTracker calls launchTask(), which creates and starts a TaskRunner thread;
- The TaskRunner calls launchJvmAndWait(), which does its work through JvmManager.reapJvm();
- The JvmManager decides whether a new JVM is needed. If a JVM for the same JobID already exists and is idle, that JVM is reused;
- If no JVM can be reused, spawnNewJvm() is called to create a JvmRunner and invoke its runChild();
- The JvmRunner then calls DefaultTaskController.launchTaskJVM(), ShellCommandExecutor.runCommand(), and ProcessBuilder.start() in turn (sketched below).
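The end of that chain is plain java.lang.ProcessBuilder. The following is a heavily simplified sketch of what launching a child JVM boils down to; the command line, paths, task attempt id, and port are illustrative only, not the exact values Hadoop builds.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;

// Illustrative only: Hadoop assembles the real command (classpath, heap size,
// task attempt id, log dirs, ...) in TaskRunner/DefaultTaskController before
// handing it to ShellCommandExecutor, which ultimately spawns the process.
public class ChildJvmLaunchSketch {
  public static void main(String[] args) throws Exception {
    List<String> cmd = Arrays.asList(
        "java", "-Xmx200m",
        "-cp", "job.jar:hadoop-core.jar",              // hypothetical classpath
        "org.apache.hadoop.mapred.Child",              // the task JVM's main class
        "127.0.0.1", "50050",                          // TaskTracker host/port for the umbilical (illustrative)
        "attempt_201301010000_0001_m_000000_0", "0");  // task attempt id, jvm id (illustrative)
    ProcessBuilder builder = new ProcessBuilder(cmd);
    builder.directory(new File("/tmp/task-workdir"))   // hypothetical working dir
           .redirectErrorStream(true);
    Process child = builder.start();                   // spawn the child task JVM
    System.exit(child.waitFor());
  }
}
```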
The Task JVM Runs the MapTask/ReduceTask
The MAIN_CLASS started in the task JVM is org.apache.hadoop.mapred.Child.
Child talks to the TaskTracker on localhost through TaskUmbilicalProtocol, obtains a task via umbilical.getTask(), and runs its run() method. The task obtained is one of three kinds: a MapTask, a ReduceTask, or a cleanup task.
```java
try {
  while (true) {
    taskid = null;
    JvmTask myTask = umbilical.getTask(context);
    if (myTask.shouldDie()) {
      break;
    } else {
      if (myTask.getTask() == null) {
        taskid = null;
        if (++idleLoopCount >= SLEEP_LONGER_COUNT) {
          // we sleep for a bigger interval when we don't receive
          // tasks for a while
          Thread.sleep(1500);
        } else {
          Thread.sleep(500);
        }
        continue;
      }
    }
    idleLoopCount = 0;
    task = myTask.getTask();
    ...
    final Task taskFinal = task;
    childUGI.doAs(new PrivilegedExceptionAction<Object>() {
      @Override
      public Object run() throws Exception {
        try {
          // use job-specified working directory
          FileSystem.get(job).setWorkingDirectory(job.getWorkingDirectory());
          taskFinal.run(job, umbilical); // run the task
        } finally {
          TaskLog.syncLogs(logLocation, taskid, isCleanup);
        }
        return null;
      }
    });
    ...
  }
}
```
This article follows the MapReduce new API code path.
Running the MapTask
The map phase is relatively easy to understand.
In its run() method, MapTask calls runNewMapper():
```java
@SuppressWarnings("unchecked")
private <INKEY, INVALUE, OUTKEY, OUTVALUE>
void runNewMapper(final JobConf job,
                  final TaskSplitIndex splitIndex,
                  final TaskUmbilicalProtocol umbilical,
                  TaskReporter reporter)
    throws IOException, ClassNotFoundException, InterruptedException {
  // make a task context so we can get the classes
  org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job, getTaskID(), reporter);
  // make a mapper
  org.apache.hadoop.mapreduce.Mapper<INKEY, INVALUE, OUTKEY, OUTVALUE> mapper =
      (org.apache.hadoop.mapreduce.Mapper<INKEY, INVALUE, OUTKEY, OUTVALUE>)
          ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
  // make the input format
  org.apache.hadoop.mapreduce.InputFormat<INKEY, INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY, INVALUE>)
          ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
  // rebuild the input split
  org.apache.hadoop.mapreduce.InputSplit split = null;
  split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
      splitIndex.getStartOffset());

  org.apache.hadoop.mapreduce.RecordReader<INKEY, INVALUE> input =
      new NewTrackingRecordReader<INKEY, INVALUE>(split, inputFormat, reporter, taskContext);

  job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
  org.apache.hadoop.mapreduce.RecordWriter output = null;

  // get an output object
  if (job.getNumReduceTasks() == 0) {
    output = new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
  } else {
    output = new NewOutputCollector(taskContext, job, umbilical, reporter);
  }

  org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE> mapContext =
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(),
          input, output, committer, reporter, split);

  org.apache.hadoop.mapreduce.Mapper<INKEY, INVALUE, OUTKEY, OUTVALUE>.Context mapperContext =
      new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(mapContext);

  input.initialize(split, mapperContext);
  mapper.run(mapperContext);
  mapPhase.complete();
  setPhase(TaskStatus.Phase.SORT);
  statusUpdate(umbilical);
  input.close();
  output.close(mapperContext);
}
```
runNewMapper() then drives the user-supplied Mapper class:
```java
/**
 * Expert users can override this method for more complete control over the
 * execution of the Mapper.
 * @param context
 * @throws IOException
 */
public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}
```
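The loop above is what drives a user Mapper. A minimal word-count-style Mapper makes the contract concrete; this is the standard example pattern, not code from the Hadoop source, and the class name WordCountMapper is the hypothetical one referenced in the driver sketch earlier.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Called once per input record by Mapper.run(); context.write() ends up
    // in the map output collector (MapOutputBuffer by default).
    StringTokenizer tokens = new StringTokenizer(line.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE);
    }
  }
}
```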
Map Output Goes to the In-Memory MapOutputBuffer
The map output is collected by the context; the default implementation collects it into a MapOutputBuffer, which is implemented as a circular buffer.
```java
/**
 * Serialize the key, value to intermediate storage.
 * When this method returns, kvindex must refer to sufficient unused
 * storage to store one METADATA.
 */
public synchronized void collect(K key, V value, final int partition)
    throws IOException {
  reporter.progress();
  if (key.getClass() != keyClass) {
    throw new IOException("Type mismatch in key from map: expected "
        + keyClass.getName() + ", received " + key.getClass().getName());
  }
  if (value.getClass() != valClass) {
    throw new IOException("Type mismatch in value from map: expected "
        + valClass.getName() + ", received " + value.getClass().getName());
  }
  if (partition < 0 || partition >= partitions) {
    throw new IOException("Illegal partition for " + key + " (" + partition + ")");
  }
  checkSpillException();
  bufferRemaining -= METASIZE;
  if (bufferRemaining <= 0) {
    // start spill if the thread is not running and the soft limit has been
    // reached
    spillLock.lock();
    try {
      do {
        if (!spillInProgress) {
          final int kvbidx = 4 * kvindex;
          final int kvbend = 4 * kvend;
          // serialized, unspilled bytes always lie between kvindex and
          // bufindex, crossing the equator. Note that any void space
          // created by a reset must be included in "used" bytes
          final int bUsed = distanceTo(kvbidx, bufindex);
          final boolean bufsoftlimit = bUsed >= softLimit;
          if ((kvbend + METASIZE) % kvbuffer.length != equator - (equator % METASIZE)) {
            // spill finished, reclaim space
            resetSpill();
            bufferRemaining = Math.min(
                distanceTo(bufindex, kvbidx) - 2 * METASIZE,
                softLimit - bUsed) - METASIZE;
            continue;
          } else if (bufsoftlimit && kvindex != kvend) {
            // spill records, if any collected; check latter, as it may
            // be possible for metadata alignment to hit spill pcnt
            startSpill();
            final int avgRec = (int) (mapOutputByteCounter.getCounter() /
                mapOutputRecordCounter.getCounter());
            // leave at least half the split buffer for serialization data
            // ensure that kvindex >= bufindex
            final int distkvi = distanceTo(bufindex, kvbidx);
            final int newPos = (bufindex +
                Math.max(2 * METASIZE - 1,
                    Math.min(distkvi / 2,
                        distkvi / (METASIZE + avgRec) * METASIZE)))
                % kvbuffer.length;
            setEquator(newPos);
            bufmark = bufindex = newPos;
            final int serBound = 4 * kvend;
            // bytes remaining before the lock must be held and limits
            // checked is the minimum of three arcs: the metadata space, the
            // serialization space, and the soft limit
            bufferRemaining = Math.min(
                // metadata max
                distanceTo(bufend, newPos),
                Math.min(
                    // serialization max
                    distanceTo(newPos, serBound),
                    // soft limit
                    softLimit)) - 2 * METASIZE;
          }
        }
      } while (false);
    } finally {
      spillLock.unlock();
    }
  }

  try {
    // serialize key bytes into buffer
    int keystart = bufindex;
    keySerializer.serialize(key);
    if (bufindex < keystart) {
      // wrapped the key; must make contiguous
      bb.shiftBufferedKey();
      keystart = 0;
    }
    // serialize value bytes into buffer
    final int valstart = bufindex;
    valSerializer.serialize(value);
    // It's possible for records to have zero length, i.e. the serializer
    // will perform no writes. To ensure that the boundary conditions are
    // checked and that the kvindex invariant is maintained, perform a
    // zero-length write into the buffer. The logic monitoring this could be
    // moved into collect, but this is cleaner and inexpensive. For now, it
    // is acceptable.
    bb.write(b0, 0, 0);

    // the record must be marked after the preceding write, as the metadata
    // for this record are not yet written
    int valend = bb.markRecord();

    mapOutputRecordCounter.increment(1);
    mapOutputByteCounter.increment(distanceTo(keystart, valend, bufvoid));

    // write accounting info
    kvmeta.put(kvindex + INDEX, kvindex);
    kvmeta.put(kvindex + PARTITION, partition);
    kvmeta.put(kvindex + KEYSTART, keystart);
    kvmeta.put(kvindex + VALSTART, valstart);
    // advance kvindex
    kvindex = (kvindex - NMETA + kvmeta.capacity()) % kvmeta.capacity();
  } catch (MapBufferTooSmallException e) {
    LOG.info("Record too large for in-memory buffer: " + e.getMessage());
    spillSingleRecord(key, value, partition);
    mapOutputRecordCounter.increment(1);
    return;
  }
}
```
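The index arithmetic above (keystart, bufindex, the wrap-around handled by shiftBufferedKey()) is easier to follow with a plain ring buffer in mind. The toy sketch below shows only the modulo wrap-around idea; it is not related to the real kvbuffer/kvmeta layout.

```java
// Toy ring buffer: illustrates only the wrap-around that kvbuffer relies on;
// the real MapOutputBuffer also interleaves metadata and tracks an "equator".
class RingBuffer {
  private final byte[] buf;
  private int writePos = 0;   // analogous to bufindex
  private int used = 0;

  RingBuffer(int capacity) {
    buf = new byte[capacity];
  }

  /** Append bytes, wrapping to the start of the array when the end is reached. */
  void write(byte[] data) {
    if (used + data.length > buf.length) {
      throw new IllegalStateException("buffer full: a spill would be triggered here");
    }
    for (byte b : data) {
      buf[writePos] = b;
      writePos = (writePos + 1) % buf.length;  // wrap-around, like bufindex
      used++;
    }
  }
}
```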
Spilling Map Output to Disk
When the circular buffer fills up, sortAndSpill() is invoked to write the spill to disk.
```java
private void sortAndSpill()
    throws IOException, ClassNotFoundException, InterruptedException {
  // approximate the length of the output file to be the length of the
  // buffer + header lengths for the partitions
  final long size = (bufend >= bufstart
      ? bufend - bufstart
      : (bufvoid - bufend) + bufstart) + partitions * APPROX_HEADER_LENGTH;
  FSDataOutputStream out = null;
  try {
    // create spill file
    final SpillRecord spillRec = new SpillRecord(partitions);
    final Path filename = mapOutputFile.getSpillFileForWrite(numSpills, size);
    out = rfs.create(filename);

    final int mstart = kvend / NMETA;
    final int mend = 1 + // kvend is a valid record
        (kvstart >= kvend ? kvstart : kvmeta.capacity() + kvstart) / NMETA;
    sorter.sort(MapOutputBuffer.this, mstart, mend, reporter);
    int spindex = mstart;
    final IndexRecord rec = new IndexRecord();
    final InMemValBytes value = new InMemValBytes();
    for (int i = 0; i < partitions; ++i) {
      IFile.Writer<K, V> writer = null;
      try {
        long segmentStart = out.getPos();
        writer = new Writer<K, V>(job, out, keyClass, valClass, codec,
            spilledRecordsCounter);
        if (combinerRunner == null) {
          // spill directly
          DataInputBuffer key = new DataInputBuffer();
          while (spindex < mend &&
              kvmeta.get(offsetFor(spindex % maxRec) + PARTITION) == i) {
            final int kvoff = offsetFor(spindex % maxRec);
            key.reset(kvbuffer, kvmeta.get(kvoff + KEYSTART),
                (kvmeta.get(kvoff + VALSTART) - kvmeta.get(kvoff + KEYSTART)));
            getVBytesForOffset(kvoff, value);
            writer.append(key, value);
            ++spindex;
          }
        } else {
          int spstart = spindex;
          while (spindex < mend &&
              kvmeta.get(offsetFor(spindex % maxRec) + PARTITION) == i) {
            ++spindex;
          }
          // Note: we would like to avoid the combiner if we've fewer
          // than some threshold of records for a partition
          if (spstart != spindex) {
            combineCollector.setWriter(writer);
            RawKeyValueIterator kvIter = new MRResultIterator(spstart, spindex);
            combinerRunner.combine(kvIter, combineCollector);
          }
        }

        // close the writer
        writer.close();

        // record offsets
        rec.startOffset = segmentStart;
        rec.rawLength = writer.getRawLength();
        rec.partLength = writer.getCompressedLength();
        spillRec.putIndex(rec, i);

        writer = null;
      } finally {
        if (null != writer) writer.close();
      }
    }

    if (totalIndexCacheMemory >= indexCacheMemoryLimit) {
      // create spill index file
      Path indexFilename = mapOutputFile.getSpillIndexFileForWrite(numSpills,
          partitions * MAP_OUTPUT_INDEX_RECORD_LENGTH);
      spillRec.writeToFile(indexFilename, job);
    } else {
      indexCacheList.add(spillRec);
      totalIndexCacheMemory += spillRec.size() * MAP_OUTPUT_INDEX_RECORD_LENGTH;
    }
    LOG.info("Finished spill " + numSpills);
    ++numSpills;
  } finally {
    if (out != null) out.close();
  }
}
```
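How often sortAndSpill() runs is governed by the sort buffer size and the spill threshold. The sketch below shows how a job might tune them; the property names are the 2.x ones as I recall them (older releases use io.sort.mb / io.sort.spill.percent), so verify them against MRJobConfig for your version.

```java
import org.apache.hadoop.conf.Configuration;

public class SpillTuningSketch {
  // Assumed 2.x property names; verify against MRJobConfig for your exact version.
  static Configuration configure() {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.task.io.sort.mb", 200);            // size of the in-memory kvbuffer, in MB
    conf.setFloat("mapreduce.map.sort.spill.percent", 0.80f); // soft limit that triggers startSpill()
    conf.setInt("mapreduce.task.io.sort.factor", 10);         // number of spill streams merged at once
    return conf;
  }
}
```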
Running the ReduceTask
The reduce side goes through three phases: copy, sort, and reduce.
The reducer's input is a RawKeyValueIterator, which is produced by the run() method of the Shuffle class:
```java
public RawKeyValueIterator run() throws IOException, InterruptedException {
  // Start the map-completion events fetcher thread
  final EventFetcher<K, V> eventFetcher =
      new EventFetcher<K, V>(reduceId, umbilical, scheduler, this);
  eventFetcher.start();

  // Start the map-output fetcher threads
  final int numFetchers = jobConf.getInt(MRJobConfig.SHUFFLE_PARALLEL_COPIES, 5);
  Fetcher<K, V>[] fetchers = new Fetcher[numFetchers];
  for (int i = 0; i < numFetchers; ++i) {
    fetchers[i] = new Fetcher<K, V>(jobConf, reduceId, scheduler, merger,
        reporter, metrics, this, reduceTask.getJobTokenSecret());
    fetchers[i].start();
  }

  // Wait for shuffle to complete successfully
  while (!scheduler.waitUntilDone(PROGRESS_FREQUENCY)) {
    reporter.progress();
    synchronized (this) {
      if (throwable != null) {
        throw new ShuffleError("error in shuffle in " + throwingThreadName, throwable);
      }
    }
  }

  // Stop the event-fetcher thread
  eventFetcher.shutDown();

  // Stop the map-output fetcher threads
  for (Fetcher<K, V> fetcher : fetchers) {
    fetcher.shutDown();
  }
  fetchers = null;

  // stop the scheduler
  scheduler.close();

  copyPhase.complete(); // copy is already complete
  taskStatus.setPhase(TaskStatus.Phase.SORT);
  reduceTask.statusUpdate(umbilical);

  // Finish the on-going merges...
  RawKeyValueIterator kvIter = null;
  try {
    kvIter = merger.close();
  } catch (Throwable e) {
    throw new ShuffleError("Error while doing final merge ", e);
  }

  // Sanity check
  synchronized (this) {
    if (throwable != null) {
      throw new ShuffleError("error in shuffle in " + throwingThreadName, throwable);
    }
  }

  return kvIter;
}
```
Copying Output from the Map Hosts
The EventFetcher learns of completed MapTasks by calling umbilical.getMapCompletionEvents() and reports them to the scheduler via scheduler.addKnownMapOutput(). Each Fetcher then calls copyFromHost(MapHost host), opening an HTTP connection to the MapOutputServlet to retrieve the map output that the host holds for the job.
When all fetchers under the scheduler have succeeded, the copy phase is complete.
Sorting the Map Output and Returning a RawKeyValueIterator
In MergeManager's close() method, the map outputs are sorted and a sorted, grouped RawKeyValueIterator is returned. In fact, the task never merges the outputs of the different MapTasks into one big sorted file; instead it exposes a RawKeyValueIterator implemented by MergeQueue.
MergeQueue is essentially a heap whose elements are the key-sorted map outputs (called Segments), e.g. MapOutput1, MapOutput2, ..., MapOutputN. Each call to next() extracts the record with the smallest key from the segments and returns it to the reducer.
```java
public boolean next() throws IOException {
  if (size() == 0)
    return false;

  if (minSegment != null) {
    // minSegment is non-null for all invocations of next except the first
    // one. For the first invocation, the priority queue is ready for use
    // but for the subsequent invocations, first adjust the queue
    adjustPriorityQueue(minSegment);
    if (size() == 0) {
      minSegment = null;
      return false;
    }
  }
  minSegment = top();
  if (!minSegment.inMemory()) {
    // When we load the value from an inmemory segment, we reset
    // the "value" DIB in this class to the inmem segment's byte[].
    // When we load the value bytes from disk, we shouldn't use
    // the same byte[] since it would corrupt the data in the inmem
    // segment. So we maintain an explicit DIB for value bytes
    // obtained from disk, and if the current segment is a disk
    // segment, we reset the "value" DIB to the byte[] in that (so
    // we reuse the disk segment DIB whenever we consider
    // a disk segment).
    value.reset(diskIFileValue.getData(), diskIFileValue.getLength());
  }
  long startPos = minSegment.getPosition();
  key = minSegment.getKey();
  minSegment.getValue(value);
  long endPos = minSegment.getPosition();
  totalBytesProcessed += endPos - startPos;
  mergeProgress.set(totalBytesProcessed * progPerByte);
  return true;
}
```
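Conceptually, MergeQueue is performing a k-way merge over sorted segments. The sketch below shows the same idea with java.util.PriorityQueue and plain Comparable elements; the real code uses Hadoop's own priority queue over Segment objects and raw byte comparisons, so this is a simplified illustration only.

```java
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Toy k-way merge: each "segment" is an already-sorted iterator, and next()
// always yields the globally smallest remaining element, as MergeQueue.next() does.
class KWayMerge<T extends Comparable<T>> {
  private static class Head<T extends Comparable<T>> implements Comparable<Head<T>> {
    T value;
    final Iterator<T> rest;
    Head(T value, Iterator<T> rest) { this.value = value; this.rest = rest; }
    public int compareTo(Head<T> o) { return value.compareTo(o.value); }
  }

  private final PriorityQueue<Head<T>> heap = new PriorityQueue<Head<T>>();

  KWayMerge(List<Iterator<T>> segments) {
    for (Iterator<T> seg : segments) {
      if (seg.hasNext()) {
        heap.add(new Head<T>(seg.next(), seg));  // seed the heap with each segment's head
      }
    }
  }

  /** Returns the next smallest element across all segments, or null when exhausted. */
  T next() {
    Head<T> top = heap.poll();                   // segment whose head has the smallest key
    if (top == null) return null;
    T result = top.value;
    if (top.rest.hasNext()) {
      top.value = top.rest.next();               // advance that segment and re-insert it
      heap.add(top);
    }
    return result;
  }
}
```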