Hadoop MapReduce Process: Source Code Analysis


The Hadoop source-code analyses available online lag somewhat behind the latest code. For the purpose of learning and summarizing, the author analyzed the Hadoop 2.0.2 source code.

Overview

A complete Hadoop MapReduce process can be described as follows:

  1. The client submits a MapReduce job to the JobTracker;
  2. The JobTracker schedules the job and generates MapTasks and ReduceTasks;
  3. Each TaskTracker receives MapTasks and ReduceTasks;
  4. The TaskTracker starts a new Child Task JVM for each MapTask or ReduceTask;
  5. The Child Task JVM runs the MapTask or ReduceTask;
  6. The Child Task JVM reports progress and status to the JobTracker through the TaskTracker;
  7. When all the tasks under the JobTracker have succeeded, the job is marked as successful.

JobClient Submits the MapReduce Job

Once JobClient.runJob() is called, the MapReduce elephant starts running.

Drilling into Job.submit(), we find the method that does the real work: JobSubmitter.submitJobInternal().

  JobStatus submitJobInternal(Job job, Cluster cluster)
      throws ClassNotFoundException, InterruptedException, IOException {
    // check the job specification
    checkSpecs(job);

    Path jobStagingArea = JobSubmissionFiles.getStagingDir(cluster,
                                                     job.getConfiguration());
    //configure the command line options correctly on the submitting dfs
    Configuration conf = job.getConfiguration();
    InetAddress ip = InetAddress.getLocalHost();
    if (ip != null) {
      submitHostAddress = ip.getHostAddress();
      submitHostName = ip.getHostName();
      conf.set(MRJobConfig.JOB_SUBMITHOST, submitHostName);
      conf.set(MRJobConfig.JOB_SUBMITHOSTADDR, submitHostAddress);
    }
    JobID jobId = submitClient.getNewJobID();
    job.setJobID(jobId);
    Path submitJobDir = new Path(jobStagingArea, jobId.toString());
    JobStatus status = null;
    try {
      conf.set("hadoop.http.filter.initializers",
          "org.apache.hadoop.yarn.server.webproxy.amfilter.AmFilterInitializer");
      conf.set(MRJobConfig.MAPREDUCE_JOB_DIR, submitJobDir.toString());
      LOG.debug("Configuring job " + jobId + " with " + submitJobDir
          + " as the submit dir");
      // get delegation token for the dir
      TokenCache.obtainTokensForNamenodes(job.getCredentials(),
          new Path[] { submitJobDir }, conf);

      populateTokenCache(conf, job.getCredentials());

      // copy the job-related jar files and configuration files to HDFS
      copyAndConfigureFiles(job, submitJobDir);
      Path submitJobFile = JobSubmissionFiles.getJobConfPath(submitJobDir);

      // create the job's InputSplits and save them into job.split,
      // which contains the host location info for every split
      LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
      int maps = writeSplits(job, submitJobDir);
      conf.setInt(MRJobConfig.NUM_MAPS, maps);
      LOG.info("number of splits:" + maps);

      String queue = conf.get(MRJobConfig.QUEUE_NAME,
          JobConf.DEFAULT_QUEUE_NAME);
      AccessControlList acl = submitClient.getQueueAdmins(queue);
      conf.set(toFullPropertyName(queue,
          QueueACL.ADMINISTER_JOBS.getAclName()), acl.getAclString());

      TokenCache.cleanUpTokenReferral(conf);

      // Write job file to submit dir
      writeConf(conf, submitJobFile);

      // formally submit the job
      printTokens(jobId, job.getCredentials());
      status = submitClient.submitJob(
          jobId, submitJobDir.toString(), job.getCredentials());
      if (status != null) {
        return status;
      } else {
        throw new IOException("Could not launch job");
      }
    } finally {
      if (status == null) {
        LOG.info("Cleaning up the staging area " + submitJobDir);
        if (jtFs != null && submitJobDir != null)
          jtFs.delete(submitJobDir, true);
      }
    }
  }


After submitting the job, Job.waitForCompletion() is usually called. Following it into Job.monitorAndPrintJob(), we can see that this method mainly prints the map/reduce progress percentages while the job runs, and prints the final status (such as counter values) once it completes.
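
For orientation, this is roughly what a user-side driver on the new API looks like. It is a minimal, hedged sketch: WordCountDriver, WordCountMapper, and WordCountReducer are illustrative names that do not appear in the code analyzed here.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  // A minimal, hypothetical new-API driver. WordCountMapper and WordCountReducer
  // are assumed user classes; they are not part of the code analyzed in this article.
  public class WordCountDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = Job.getInstance(conf, "word count");
      job.setJarByClass(WordCountDriver.class);
      job.setMapperClass(WordCountMapper.class);
      job.setReducerClass(WordCountReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      // waitForCompletion() submits the job (ending in submitJobInternal() above)
      // and then monitors it, printing progress until it finishes.
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }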

Analyzing submitJobInternal() above, we can see:

  1. The map input splits and their number are already determined at this stage, by calling the InputFormat's getSplits() method;

  private <T extends InputSplit>
  int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
      InterruptedException, ClassNotFoundException {
    Configuration conf = job.getConfiguration();
    InputFormat<?, ?> input =
      ReflectionUtils.newInstance(job.getInputFormatClass(), conf);

    // call the InputFormat's getSplits() method to produce the InputSplits
    List<InputSplit> splits = input.getSplits(job);
    T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);

    // sort the splits into order based on size, so that the biggest
    // go first
    Arrays.sort(array, new SplitComparator());
    JobSplitWriter.createSplitFiles(jobSubmitDir, conf,
        jobSubmitDir.getFileSystem(conf), array);
    return array.length;
  }
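
For the common FileInputFormat case, the split size is essentially the HDFS block size clamped between a configured minimum and maximum. The following is a hedged sketch of that calculation (the constants and config key names in the comments are assumptions for illustration, not values taken from this article):

  // Sketch of how FileInputFormat-style implementations typically size splits.
  public class SplitSizeSketch {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
      // clamp the block size between the configured minimum and maximum
      return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
      long blockSize = 128L * 1024 * 1024; // assumed 128 MB HDFS block
      long minSize = 1L;                   // assumed split min-size setting
      long maxSize = Long.MAX_VALUE;       // assumed split max-size setting
      System.out.println("split size = " + computeSplitSize(blockSize, minSize, maxSize));
    }
  }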

JobTracker Generates Map Tasks and Reduce Tasks

The JobTracker is Hadoop's command center. Its responsibilities are:

  • Communicate with the JobClient and receive new jobs; this is defined in ClientProtocol.
  • Decompose a job into MapTasks and ReduceTasks and store them in its queue; this is defined in TaskTrackerManager.initJob().
  • Communicate with the TaskTrackers and hand tasks down to them for execution; this is defined in InterTrackerProtocol.heartbeat().
  • Aggregate the status of the different tasks belonging to the same job, and thereby determine the job's status.

The JobTracker provides these services through offerService().

  public void offerService() throws InterruptedException, IOException {
    // Prepare for recovery. This is done irrespective of the status of restart
    // flag.
    while (true) {
      try {
        recoveryManager.updateRestartCount();
        break;
      } catch (IOException ioe) {
        LOG.warn("Failed to initialize recovery manager. ", ioe);
        // wait for some time
        Thread.sleep(FS_ACCESS_RETRY_PERIOD);
        LOG.warn("Retrying...");
      }
    }

    taskScheduler.start();

    recoveryManager.recover();

    // refresh the node list as the recovery manager might have added
    // disallowed trackers
    refreshHosts();

    startExpireTrackersThread();

    expireLaunchingTaskThread.start();

    if (completedJobStatusStore.isActive()) {
      completedJobsStoreThread = new Thread(completedJobStatusStore,
                                            "completedjobsStore-housekeeper");
      completedJobsStoreThread.start();
    }

    // start the inter-tracker server once the jt is ready
    this.interTrackerServer.start();

    synchronized (this) {
      state = State.RUNNING;
    }
    LOG.info("Starting RUNNING");

    this.interTrackerServer.join();
    LOG.info("Stopped interTrackerServer");
  }

The JobTracker Accepts the Job Submitted by the JobClient

  private JobStatus submitJob(org.apache.hadoop.mapreduce.JobID jobID,
      int restartCount, UserGroupInformation ugi,
      String jobSubmitDir, boolean recovered, Credentials ts
      ) throws IOException, InterruptedException {
    ...
    // Create the JobInProgress, temporarily unlock the JobTracker since
    // we are about to copy job.xml from HDFS
    JobInProgress job =
        new JobInProgress(this, this.conf, restartCount, jobInfo, ts);
    synchronized (this) {
      ...
      return addJob(jobId, job);
    }
  }

JobInitManager Decomposes the Job into Tasks and Adds Them to the Queue

  class JobInitManager implements Runnable {

    public void run() {
      JobInProgress job = null;
      while (true) {
        try {
          synchronized (jobInitQueue) {
            while (jobInitQueue.isEmpty()) {
              jobInitQueue.wait();
            }
            job = jobInitQueue.remove(0);
          }
          threadPool.execute(new InitJob(job));
        } catch (InterruptedException t) {
          LOG.info("JobInitManagerThread interrupted.");
          break;
        }
      }
      LOG.info("Shutting down thread pool");
      threadPool.shutdownNow();
    }
  }
JobInProgress.initTasks() is then called to generate the TaskInProgress objects for the MapTasks and ReduceTasks.

  public synchronized void initTasks()
      throws IOException, KillInterruptedException, UnknownHostException {
    ...
    createMapTasks(jobFile.toString(), taskSplitMetaInfo);

    ...

    // set the launch time
    this.launchTime = JobTracker.getClock().getTime();

    createReduceTasks(jobFile.toString());

    ...
  }

The JobTracker Assigns Tasks to TaskTrackers

In its heartbeat() method, the JobTracker calls JobQueueTaskScheduler.assignTasks(TaskTracker taskTracker) and returns the assigned tasks wrapped in the HeartbeatResponse.

  public synchronized HeartbeatResponse heartbeat(TaskTrackerStatus status,
                                                  boolean restarted,
                                                  boolean initialContact,
                                                  boolean acceptNewTasks,
                                                  short responseId)
      throws IOException {
    ...
    // Process this heartbeat
    short newResponseId = (short)(responseId + 1);
    status.setLastSeen(now);
    if (!processHeartbeat(status, initialContact)) {
      if (prevHeartbeatResponse != null) {
        trackerToHeartbeatResponseMap.remove(trackerName);
      }
      return new HeartbeatResponse(newResponseId,
                   new TaskTrackerAction[] {new ReinitTrackerAction()});
    }

    // Initialize the response to be sent for the heartbeat
    HeartbeatResponse response = new HeartbeatResponse(newResponseId, null);
    List<TaskTrackerAction> actions = new ArrayList<TaskTrackerAction>();
    isBlacklisted = faultyTrackers.isBlacklisted(status.getHost());
    // Check for new tasks to be executed on the tasktracker
    if (acceptNewTasks && !isBlacklisted) {
      TaskTrackerStatus taskTrackerStatus = getTaskTrackerStatus(trackerName);
      if (taskTrackerStatus == null) {
        LOG.warn("Unknown task tracker polling; ignoring: " + trackerName);
      } else {
        List<Task> tasks = getSetupAndCleanupTasks(taskTrackerStatus);
        if (tasks == null) {
          tasks = taskScheduler.assignTasks(taskTrackers.get(trackerName));
        }
        if (tasks != null) {
          for (Task task : tasks) {
            expireLaunchingTasks.addNewTask(task.getTaskID());
            if (LOG.isDebugEnabled()) {
              LOG.debug(trackerName + " -> LaunchTask: " + task.getTaskID());
            }
            actions.add(new LaunchTaskAction(task));
          }
        }
      }
    }

    ...
    int nextInterval = getNextHeartbeatInterval();
    response.setHeartbeatInterval(nextInterval);
    response.setActions(
                        actions.toArray(new TaskTrackerAction[actions.size()]));

    // Update the trackerToHeartbeatResponseMap
    trackerToHeartbeatResponseMap.put(trackerName, response);
    ...

    return response;
  }

The TaskTracker Receives and Launches Tasks

The TaskTracker is Hadoop's task-processing node. Its responsibilities are:

  1. Communicate with the JobTracker and receive tasks;
  2. Start Child JVMs to run MapTasks or ReduceTasks;
  3. Report the execution status of the child tasks back to the JobTracker; this is defined in TaskUmbilicalProtocol.

  /**
   * The server retry loop.
   * This while-loop attempts to connect to the JobTracker.  It only
   * loops when the old TaskTracker has gone bad (its state is
   * stale somehow) and we need to reinitialize everything.
   */
  public void run() {
    try {
      startCleanupThreads();
      boolean denied = false;
      while (running && !shuttingDown && !denied) {
        boolean staleState = false;
        try {
          // This while-loop attempts reconnects if we get network errors
          while (running && !staleState && !shuttingDown && !denied) {
            try {
              State osState = offerService();
              if (osState == State.STALE) {
                staleState = true;
              } else if (osState == State.DENIED) {
                denied = true;
              }
            } catch (Exception ex) {
              if (!shuttingDown) {
                LOG.info("Lost connection to JobTracker [" +
                         jobTrackAddr + "].  Retrying...", ex);
                try {
                  Thread.sleep(5000);
                } catch (InterruptedException ie) {
                }
              }
            }
          }
        } finally {
          close();
        }
        if (shuttingDown) { return; }
        LOG.warn("Reinitializing local state");
        initialize();
      }
      if (denied) {
        shutdown();
      }
    } catch (IOException iex) {
      LOG.error("Got fatal exception while reinitializing TaskTracker: " +
                StringUtils.stringifyException(iex));
      return;
    } catch (InterruptedException i) {
      LOG.error("Got interrupted while reinitializing TaskTracker: " +
                i.getMessage());
      return;
    }
  }

  The main operations are defined in the offerService() function.

The TaskTracker Receives Tasks

In TaskTracker.offerService(), when a LaunchTaskAction is received, the task is added to the tasksToLaunch queue.

  State offerService() throws Exception {
    long lastHeartbeat = 0;
    while (running && !shuttingDown) {
      try {
        ...
        // Send the heartbeat and process the jobtracker's directives
        HeartbeatResponse heartbeatResponse = transmitHeartBeat(now);

        TaskTrackerAction[] actions = heartbeatResponse.getActions();
        ...
        if (actions != null) {
          for (TaskTrackerAction action : actions) {
            if (action instanceof LaunchTaskAction) {
              addToTaskQueue((LaunchTaskAction)action);
            } else if (action instanceof CommitTaskAction) {
              CommitTaskAction commitAction = (CommitTaskAction)action;
              if (!commitResponses.contains(commitAction.getTaskID())) {
                LOG.info("Received commit task action for " +
                          commitAction.getTaskID());
                commitResponses.add(commitAction.getTaskID());
              }
            } else {
              tasksToCleanup.put(action);
            }
          }
        }
        markUnresponsiveTasks();
        killOverflowingTasks();

        //we've cleaned up, resume normal operation
        if (!acceptNewTasks && isIdle()) {
          acceptNewTasks = true;
        }
        ...
      }
    }
    return State.NORMAL;
  }

The TaskLauncher thread keeps polling the tasksToLaunch queue; when enough slots become available, it calls launchTask() and proceeds to start the task.

    public void run() {
      while (!Thread.interrupted()) {
        try {
          TaskInProgress tip;
          Task task;
          synchronized (tasksToLaunch) {
            while (tasksToLaunch.isEmpty()) {
              tasksToLaunch.wait();
            }
            //get the TIP
            tip = tasksToLaunch.remove(0);
            task = tip.getTask();
            LOG.info("Trying to launch : " + tip.getTask().getTaskID() +
                     " which needs " + task.getNumSlotsRequired() + " slots");
          }
          //wait for free slots to run
          synchronized (numFreeSlots) {
            boolean canLaunch = true;
            while (numFreeSlots.get() < task.getNumSlotsRequired()) {
              //Make sure that there is no kill task action for this task!
              //We are not locking tip here, because it would reverse the
              //locking order!
              //Also, Lock for the tip is not required here! because :
              // 1. runState of TaskStatus is volatile
              // 2. Any notification is not missed because notification is
              // synchronized on numFreeSlots. So, while we are doing the check,
              // if the tip is half way through the kill(), we don't miss
              // notification for the following wait().
              if (!tip.canBeLaunched()) {
                //got killed externally while still in the launcher queue
                LOG.info("Not blocking slots for " + task.getTaskID()
                    + " as it got killed externally. Task's state is "
                    + tip.getRunState());
                canLaunch = false;
                break;
              }

              LOG.info("TaskLauncher : Waiting for " + task.getNumSlotsRequired() +
                       " to launch " + task.getTaskID() + ", currently we have " +
                       numFreeSlots.get() + " free slots");
              numFreeSlots.wait();
            }
            if (!canLaunch) {
              continue;
            }
            LOG.info("In TaskLauncher, current free slots : " + numFreeSlots.get() +
                     " and trying to launch " + tip.getTask().getTaskID() +
                     " which needs " + task.getNumSlotsRequired() + " slots");
            numFreeSlots.set(numFreeSlots.get() - task.getNumSlotsRequired());
            assert (numFreeSlots.get() >= 0);
          }
          synchronized (tip) {
            //to make sure that there is no kill task action for this
            if (!tip.canBeLaunched()) {
              //got killed externally while still in the launcher queue
              LOG.info("Not launching task " + task.getTaskID() + " as it got"
                  + " killed externally. Task's state is " + tip.getRunState());
              addFreeSlots(task.getNumSlotsRequired());
              continue;
            }
            tip.slotTaken = true;
          }
          //got a free slot. launch the task
          startNewTask(tip);
        } catch (InterruptedException e) {
          return; // ALL DONE
        } catch (Throwable th) {
          LOG.error("TaskLauncher error " +
              StringUtils.stringifyException(th));
        }
      }
    }

The TaskTracker Starts the Task JVM

  1. The TaskTracker calls launchTask(), which creates and starts a TaskRunner thread;
  2. The TaskRunner calls launchJvmAndWait(), which is carried out by JvmManager.reapJvm().
    1. The JvmManager decides whether a new JVM is needed: if an idle JVM for the same JobID already exists, that JVM is reused.
    2. If no JVM can be reused, spawnNewJvm() is called to create a JvmRunner and invoke its runChild().
  3. The JvmRunner then calls DefaultTaskController.launchTaskJVM(), ShellCommandExecutor.runCommand(), and finally ProcessBuilder.start() (see the sketch after this list).
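
To illustrate only that last step, here is a minimal, hypothetical sketch of launching a child JVM with ProcessBuilder. The heap size, host, port, and task-attempt arguments are placeholders; the real TaskTracker builds a much longer command line (classpath, working directories, log settings, and so on).

  import java.io.IOException;
  import java.util.Arrays;
  import java.util.List;

  // Minimal sketch of spawning a child JVM; all arguments below are assumed placeholders.
  public class ChildJvmLauncherSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
      List<String> cmd = Arrays.asList(
          System.getProperty("java.home") + "/bin/java",
          "-Xmx200m",                                    // illustrative heap size
          "org.apache.hadoop.mapred.Child",              // the child's MAIN_CLASS (see next section)
          "127.0.0.1", "0",                              // placeholder umbilical host and port
          "attempt_000000000000_0000_m_000000_0",        // placeholder task attempt id
          "log-location", "0");                          // placeholder log dir and jvm id
      ProcessBuilder builder = new ProcessBuilder(cmd);
      builder.redirectErrorStream(true);                 // merge the child's stdout and stderr
      Process child = builder.start();                   // ProcessBuilder.start() spawns the JVM
      System.out.println("child exited with " + child.waitFor());
    }
  }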

The Task JVM Runs the MapTask/ReduceTask

The MAIN_CLASS started in the task JVM is org.apache.hadoop.mapred.Child.

Child communicates with the TaskTracker on localhost through TaskUmbilicalProtocol. It obtains a task via umbilical.getTask() and runs the task's run() method. The task obtained is one of three kinds: a MapTask, a ReduceTask, or a cleanup task.

    try {
      while (true) {
        taskid = null;
        JvmTask myTask = umbilical.getTask(context);
        if (myTask.shouldDie()) {
          break;
        } else {
          if (myTask.getTask() == null) {
            taskid = null;
            if (++idleLoopCount >= SLEEP_LONGER_COUNT) {
              //we sleep for a bigger interval when we don't receive
              //tasks for a while
              Thread.sleep(1500);
            } else {
              Thread.sleep(500);
            }
            continue;
          }
        }
        idleLoopCount = 0;
        task = myTask.getTask();
        ...
        final Task taskFinal = task;
        childUGI.doAs(new PrivilegedExceptionAction<Object>() {
          @Override
          public Object run() throws Exception {
            try {
              // use job-specified working directory
              FileSystem.get(job).setWorkingDirectory(job.getWorkingDirectory());
              taskFinal.run(job, umbilical);             // run the task
            } finally {
              TaskLog.syncLogs(logLocation, taskid, isCleanup);
            }

            return null;
          }
        });
        ...
      }
    }


The rest of this article follows the new MapReduce API.

Running the MapTask

The map phase is relatively easy to understand.

In its run() method, MapTask calls runNewMapper():

  @SuppressWarnings("unchecked")
  private <INKEY,INVALUE,OUTKEY,OUTVALUE>
  void runNewMapper(final JobConf job,
                    final TaskSplitIndex splitIndex,
                    final TaskUmbilicalProtocol umbilical,
                    TaskReporter reporter
                    ) throws IOException, ClassNotFoundException,
                             InterruptedException {
    // make a task context so we can get the classes
    org.apache.hadoop.mapreduce.TaskAttemptContext taskContext =
      new org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl(job,
                                                                  getTaskID(),
                                                                  reporter);
    // make a mapper
    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE> mapper =
      (org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>)
        ReflectionUtils.newInstance(taskContext.getMapperClass(), job);
    // make the input format
    org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE> inputFormat =
      (org.apache.hadoop.mapreduce.InputFormat<INKEY,INVALUE>)
        ReflectionUtils.newInstance(taskContext.getInputFormatClass(), job);
    // rebuild the input split
    org.apache.hadoop.mapreduce.InputSplit split = null;
    split = getSplitDetails(new Path(splitIndex.getSplitLocation()),
        splitIndex.getStartOffset());

    org.apache.hadoop.mapreduce.RecordReader<INKEY,INVALUE> input =
      new NewTrackingRecordReader<INKEY,INVALUE>
        (split, inputFormat, reporter, taskContext);

    job.setBoolean(JobContext.SKIP_RECORDS, isSkipping());
    org.apache.hadoop.mapreduce.RecordWriter output = null;

    // get an output object
    if (job.getNumReduceTasks() == 0) {
      output =
        new NewDirectOutputCollector(taskContext, job, umbilical, reporter);
    } else {
      output = new NewOutputCollector(taskContext, job, umbilical, reporter);
    }

    org.apache.hadoop.mapreduce.MapContext<INKEY, INVALUE, OUTKEY, OUTVALUE>
    mapContext =
      new MapContextImpl<INKEY, INVALUE, OUTKEY, OUTVALUE>(job, getTaskID(),
          input, output,
          committer,
          reporter, split);

    org.apache.hadoop.mapreduce.Mapper<INKEY,INVALUE,OUTKEY,OUTVALUE>.Context
        mapperContext =
          new WrappedMapper<INKEY, INVALUE, OUTKEY, OUTVALUE>().getMapContext(
              mapContext);

    input.initialize(split, mapperContext);
    mapper.run(mapperContext);
    mapPhase.complete();
    setPhase(TaskStatus.Phase.SORT);
    statusUpdate(umbilical);
    input.close();
    output.close(mapperContext);
  }

runNewMapper() then invokes the user-supplied Mapper class:

  /**
   * Expert users can override this method for more complete control over the
   * execution of the Mapper.
   * @param context
   * @throws IOException
   */
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }
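
For reference, a typical user Mapper that this run() loop would drive might look like the following sketch (a word-count mapper; the class name is illustrative and not taken from the code analyzed here):

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // A minimal new-API Mapper sketch: map() is called once per input record
  // by the run() loop shown above.
  public class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {

    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // collected by the MapOutputBuffer described below
      }
    }
  }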

Map Results Go into the In-Memory MapOutputBuffer

The map output is collected by the context; the default implementation collects it into a MapOutputBuffer, which is implemented as a circular buffer.
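
To illustrate just the circular-buffer idea (not the actual MapOutputBuffer layout with its metadata records and equator), here is a minimal sketch of wraparound writes using modulo arithmetic:

  // Minimal circular (ring) buffer sketch: writes wrap around with modulo
  // arithmetic, which is the basic idea behind kvbuffer in MapOutputBuffer.
  public class RingBufferSketch {
    private final byte[] buffer;
    private int writePos = 0;   // analogous to bufindex
    private int used = 0;

    public RingBufferSketch(int capacity) {
      this.buffer = new byte[capacity];
    }

    /** Returns false if the record does not fit (a real buffer would spill instead). */
    public boolean write(byte[] record) {
      if (record.length > buffer.length - used) {
        return false;            // full: MapOutputBuffer would trigger sortAndSpill() here
      }
      for (byte b : record) {
        buffer[writePos] = b;
        writePos = (writePos + 1) % buffer.length;  // wrap around the end of the array
      }
      used += record.length;
      return true;
    }
  }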

    /**
     * Serialize the key, value to intermediate storage.
     * When this method returns, kvindex must refer to sufficient unused
     * storage to store one METADATA.
     */
    public synchronized void collect(K key, V value, final int partition
                                     ) throws IOException {
      reporter.progress();
      if (key.getClass() != keyClass) {
        throw new IOException("Type mismatch in key from map: expected "
                              + keyClass.getName() + ", received "
                              + key.getClass().getName());
      }
      if (value.getClass() != valClass) {
        throw new IOException("Type mismatch in value from map: expected "
                              + valClass.getName() + ", received "
                              + value.getClass().getName());
      }
      if (partition < 0 || partition >= partitions) {
        throw new IOException("Illegal partition for " + key + " (" +
            partition + ")");
      }
      checkSpillException();
      bufferRemaining -= METASIZE;
      if (bufferRemaining <= 0) {
        // start spill if the thread is not running and the soft limit has been
        // reached
        spillLock.lock();
        try {
          do {
            if (!spillInProgress) {
              final int kvbidx = 4 * kvindex;
              final int kvbend = 4 * kvend;
              // serialized, unspilled bytes always lie between kvindex and
              // bufindex, crossing the equator. Note that any void space
              // created by a reset must be included in "used" bytes
              final int bUsed = distanceTo(kvbidx, bufindex);
              final boolean bufsoftlimit = bUsed >= softLimit;
              if ((kvbend + METASIZE) % kvbuffer.length !=
                  equator - (equator % METASIZE)) {
                // spill finished, reclaim space
                resetSpill();
                bufferRemaining = Math.min(
                    distanceTo(bufindex, kvbidx) - 2 * METASIZE,
                    softLimit - bUsed) - METASIZE;
                continue;
              } else if (bufsoftlimit && kvindex != kvend) {
                // spill records, if any collected; check latter, as it may
                // be possible for metadata alignment to hit spill pcnt
                startSpill();
                final int avgRec = (int)
                  (mapOutputByteCounter.getCounter() /
                  mapOutputRecordCounter.getCounter());
                // leave at least half the split buffer for serialization data
                // ensure that kvindex >= bufindex
                final int distkvi = distanceTo(bufindex, kvbidx);
                final int newPos = (bufindex +
                  Math.max(2 * METASIZE - 1,
                          Math.min(distkvi / 2,
                                   distkvi / (METASIZE + avgRec) * METASIZE)))
                  % kvbuffer.length;
                setEquator(newPos);
                bufmark = bufindex = newPos;
                final int serBound = 4 * kvend;
                // bytes remaining before the lock must be held and limits
                // checked is the minimum of three arcs: the metadata space, the
                // serialization space, and the soft limit
                bufferRemaining = Math.min(
                    // metadata max
                    distanceTo(bufend, newPos),
                    Math.min(
                      // serialization max
                      distanceTo(newPos, serBound),
                      // soft limit
                      softLimit)) - 2 * METASIZE;
              }
            }
          } while (false);
        } finally {
          spillLock.unlock();
        }
      }

      try {
        // serialize key bytes into buffer
        int keystart = bufindex;
        keySerializer.serialize(key);
        if (bufindex < keystart) {
          // wrapped the key; must make contiguous
          bb.shiftBufferedKey();
          keystart = 0;
        }
        // serialize value bytes into buffer
        final int valstart = bufindex;
        valSerializer.serialize(value);
        // It's possible for records to have zero length, i.e. the serializer
        // will perform no writes. To ensure that the boundary conditions are
        // checked and that the kvindex invariant is maintained, perform a
        // zero-length write into the buffer. The logic monitoring this could be
        // moved into collect, but this is cleaner and inexpensive. For now, it
        // is acceptable.
        bb.write(b0, 0, 0);
        // the record must be marked after the preceding write, as the metadata
        // for this record are not yet written
        int valend = bb.markRecord();

        mapOutputRecordCounter.increment(1);
        mapOutputByteCounter.increment(
            distanceTo(keystart, valend, bufvoid));

        // write accounting info
        kvmeta.put(kvindex + INDEX, kvindex);
        kvmeta.put(kvindex + PARTITION, partition);
        kvmeta.put(kvindex + KEYSTART, keystart);
        kvmeta.put(kvindex + VALSTART, valstart);
        // advance kvindex
        kvindex = (kvindex - NMETA + kvmeta.capacity()) % kvmeta.capacity();
      } catch (MapBufferTooSmallException e) {
        LOG.info("Record too large for in-memory buffer: " + e.getMessage());
        spillSingleRecord(key, value, partition);
        mapOutputRecordCounter.increment(1);
        return;
      }
    }

Spilling Map Output to Disk

When the circular buffer fills up, sortAndSpill() is called to sort the buffered records and spill them to disk.

    private void sortAndSpill() throws IOException, ClassNotFoundException,
                                       InterruptedException {
      //approximate the length of the output file to be the length of the
      //buffer + header lengths for the partitions
      final long size = (bufend >= bufstart
          ? bufend - bufstart
          : (bufvoid - bufend) + bufstart) +
                  partitions * APPROX_HEADER_LENGTH;
      FSDataOutputStream out = null;
      try {
        // create spill file
        final SpillRecord spillRec = new SpillRecord(partitions);
        final Path filename =
            mapOutputFile.getSpillFileForWrite(numSpills, size);
        out = rfs.create(filename);

        final int mstart = kvend / NMETA;
        final int mend = 1 + // kvend is a valid record
          (kvstart >= kvend
          ? kvstart
          : kvmeta.capacity() + kvstart) / NMETA;
        sorter.sort(MapOutputBuffer.this, mstart, mend, reporter);
        int spindex = mstart;
        final IndexRecord rec = new IndexRecord();
        final InMemValBytes value = new InMemValBytes();
        for (int i = 0; i < partitions; ++i) {
          IFile.Writer<K, V> writer = null;
          try {
            long segmentStart = out.getPos();
            writer = new Writer<K, V>(job, out, keyClass, valClass, codec,
                                      spilledRecordsCounter);
            if (combinerRunner == null) {
              // spill directly
              DataInputBuffer key = new DataInputBuffer();
              while (spindex < mend &&
                  kvmeta.get(offsetFor(spindex % maxRec) + PARTITION) == i) {
                final int kvoff = offsetFor(spindex % maxRec);
                key.reset(kvbuffer, kvmeta.get(kvoff + KEYSTART),
                          (kvmeta.get(kvoff + VALSTART) -
                           kvmeta.get(kvoff + KEYSTART)));
                getVBytesForOffset(kvoff, value);
                writer.append(key, value);
                ++spindex;
              }
            } else {
              int spstart = spindex;
              while (spindex < mend &&
                  kvmeta.get(offsetFor(spindex % maxRec)
                            + PARTITION) == i) {
                ++spindex;
              }
              // Note: we would like to avoid the combiner if we've fewer
              // than some threshold of records for a partition
              if (spstart != spindex) {
                combineCollector.setWriter(writer);
                RawKeyValueIterator kvIter =
                  new MRResultIterator(spstart, spindex);
                combinerRunner.combine(kvIter, combineCollector);
              }
            }

            // close the writer
            writer.close();

            // record offsets
            rec.startOffset = segmentStart;
            rec.rawLength = writer.getRawLength();
            rec.partLength = writer.getCompressedLength();
            spillRec.putIndex(rec, i);

            writer = null;
          } finally {
            if (null != writer) writer.close();
          }
        }

        if (totalIndexCacheMemory >= indexCacheMemoryLimit) {
          // create spill index file
          Path indexFilename =
              mapOutputFile.getSpillIndexFileForWrite(numSpills, partitions
                  * MAP_OUTPUT_INDEX_RECORD_LENGTH);
          spillRec.writeToFile(indexFilename, job);
        } else {
          indexCacheList.add(spillRec);
          totalIndexCacheMemory +=
            spillRec.size() * MAP_OUTPUT_INDEX_RECORD_LENGTH;
        }
        LOG.info("Finished spill " + numSpills);
        ++numSpills;
      } finally {
        if (out != null) out.close();
      }
    }


Running the ReduceTask

The reduce side has three phases: copy, sort, and reduce.

The input to reduce is a RawKeyValueIterator, which is produced by the Shuffle class's run() method:

  public RawKeyValueIterator run() throws IOException, InterruptedException {
    // Start the map-completion events fetcher thread
    final EventFetcher<K,V> eventFetcher =
      new EventFetcher<K,V>(reduceId, umbilical, scheduler, this);
    eventFetcher.start();

    // Start the map-output fetcher threads
    final int numFetchers = jobConf.getInt(MRJobConfig.SHUFFLE_PARALLEL_COPIES, 5);
    Fetcher<K,V>[] fetchers = new Fetcher[numFetchers];
    for (int i = 0; i < numFetchers; ++i) {
      fetchers[i] = new Fetcher<K,V>(jobConf, reduceId, scheduler, merger,
                                     reporter, metrics, this,
                                     reduceTask.getJobTokenSecret());
      fetchers[i].start();
    }

    // Wait for shuffle to complete successfully
    while (!scheduler.waitUntilDone(PROGRESS_FREQUENCY)) {
      reporter.progress();

      synchronized (this) {
        if (throwable != null) {
          throw new ShuffleError("error in shuffle in " + throwingThreadName,
                                 throwable);
        }
      }
    }

    // Stop the event-fetcher thread
    eventFetcher.shutDown();

    // Stop the map-output fetcher threads
    for (Fetcher<K,V> fetcher : fetchers) {
      fetcher.shutDown();
    }
    fetchers = null;

    // stop the scheduler
    scheduler.close();

    copyPhase.complete(); // copy is already complete
    taskStatus.setPhase(TaskStatus.Phase.SORT);
    reduceTask.statusUpdate(umbilical);

    // Finish the on-going merges...
    RawKeyValueIterator kvIter = null;
    try {
      kvIter = merger.close();
    } catch (Throwable e) {
      throw new ShuffleError("Error while doing final merge ", e);
    }

    // Sanity check
    synchronized (this) {
      if (throwable != null) {
        throw new ShuffleError("error in shuffle in " + throwingThreadName,
                               throwable);
      }
    }

    return kvIter;
  }

Copying Map Output from the Map Hosts

The EventFetcher calls umbilical.getMapCompletionEvents() to learn which MapTasks have completed, and registers their outputs with the scheduler via scheduler.addKnownMapOutput(). Each Fetcher then calls copyFromHost(MapHost host), which opens an HTTP connection to the MapOutputServlet on that host and pulls the map outputs of the job from it.

When all fetchers under the scheduler have finished successfully, the copy phase is complete.
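
To show only the transport idea, here is a minimal, hypothetical sketch of fetching a map output over HTTP. The URL format, port, and parameters below are placeholders; the real Fetcher also handles shuffle security hashes, connection reuse, and the IFile headers before handing the bytes to the merger.

  import java.io.InputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  // Minimal sketch of pulling one map output over HTTP, the way a Fetcher
  // talks to the MapOutputServlet. The URL below is a placeholder, not the real format.
  public class MapOutputFetchSketch {
    public static void main(String[] args) throws Exception {
      URL url = new URL("http://map-host:50060/mapOutput?job=job_0001&map=attempt_0001_m_000000_0&reduce=0");
      HttpURLConnection conn = (HttpURLConnection) url.openConnection();
      conn.setConnectTimeout(60 * 1000);   // illustrative timeouts
      conn.setReadTimeout(60 * 1000);
      try (InputStream in = conn.getInputStream()) {
        byte[] buf = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buf)) > 0) {   // the real Fetcher feeds these bytes to the merger
          total += n;
        }
        System.out.println("fetched " + total + " bytes");
      } finally {
        conn.disconnect();
      }
    }
  }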

Sorting the Map Output and Returning a RawKeyValueIterator

In MergeManager's close() method, the map outputs are sorted and a sorted, grouped RawKeyValueIterator is returned. In fact, the task does not merge the outputs of the different MapTasks into one large sorted file; instead it implements a RawKeyValueIterator on top of a MergeQueue.

The MergeQueue is essentially a heap whose elements are the map outputs (called Segments), each already sorted by key: MapOutput1, MapOutput2, ..., MapOutputN. Each call to next() takes the record with the smallest key from the segments and returns it to the Reducer.
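
The same idea in isolation: a hedged sketch of a k-way merge over pre-sorted segments using a PriorityQueue from the standard library (not the actual Segment/MergeQueue classes):

  import java.util.Arrays;
  import java.util.Iterator;
  import java.util.List;
  import java.util.PriorityQueue;

  // K-way merge sketch: each "segment" is already sorted; the heap always
  // exposes the segment whose current head has the smallest key.
  public class KWayMergeSketch {
    static Iterable<Integer> merge(List<List<Integer>> segments) {
      PriorityQueue<PeekingIter> heap =
          new PriorityQueue<>((a, b) -> Integer.compare(a.head, b.head));
      for (List<Integer> seg : segments) {
        Iterator<Integer> it = seg.iterator();
        if (it.hasNext()) heap.add(new PeekingIter(it));
      }
      java.util.ArrayList<Integer> out = new java.util.ArrayList<>();
      while (!heap.isEmpty()) {
        PeekingIter min = heap.poll();     // segment with the smallest current key
        out.add(min.head);
        if (min.advance()) heap.add(min);  // re-insert if the segment has more records
      }
      return out;
    }

    static class PeekingIter {
      final Iterator<Integer> it;
      int head;
      PeekingIter(Iterator<Integer> it) { this.it = it; this.head = it.next(); }
      boolean advance() {
        if (!it.hasNext()) return false;
        head = it.next();
        return true;
      }
    }

    public static void main(String[] args) {
      System.out.println(merge(Arrays.asList(
          Arrays.asList(1, 4, 9),
          Arrays.asList(2, 3, 8),
          Arrays.asList(5, 6, 7))));       // prints [1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
  }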

    public boolean next() throws IOException {
      if (size() == 0)
        return false;

      if (minSegment != null) {
        //minSegment is non-null for all invocations of next except the first
        //one. For the first invocation, the priority queue is ready for use
        //but for the subsequent invocations, first adjust the queue
        adjustPriorityQueue(minSegment);
        if (size() == 0) {
          minSegment = null;
          return false;
        }
      }
      minSegment = top();
      if (!minSegment.inMemory()) {
        //When we load the value from an inmemory segment, we reset
        //the "value" DIB in this class to the inmem segment's byte[].
        //When we load the value bytes from disk, we shouldn't use
        //the same byte[] since it would corrupt the data in the inmem
        //segment. So we maintain an explicit DIB for value bytes
        //obtained from disk, and if the current segment is a disk
        //segment, we reset the "value" DIB to the byte[] in that (so
        //we reuse the disk segment DIB whenever we consider
        //a disk segment).
        value.reset(diskIFileValue.getData(), diskIFileValue.getLength());
      }
      long startPos = minSegment.getPosition();
      key = minSegment.getKey();
      minSegment.getValue(value);
      long endPos = minSegment.getPosition();
      totalBytesProcessed += endPos - startPos;
      mergeProgress.set(totalBytesProcessed * progPerByte);
      return true;
    }

References:

  1. http://blog.csdn.net/HEYUTAO007/article/details/5725379