Spark Shuffle Series ----- 1. The Relationship Between Spark Shuffle and Task Scheduling


Spark divides a job into Stages based on whether the dependency between RDDs is a shuffle dependency. To keep the discussion concrete, the Stage that runs first is called Stage1 and the Stage that runs after it is called Stage2. A shuffle consists of two operations.

The Map and Reduce operations can be illustrated by the figure below:



1. The Map operation. It runs at the end of Stage1. It writes the data of one Stage1 partition into a shuffle file, which is stored on the local disk of the node that executes the Map operation.

2. The Reduce operation. It runs at the beginning of Stage2. It reads the shuffle files produced by the Map operations and builds one Stage2 partition. (A minimal example of this two-stage split is sketched right below.)
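For orientation, here is a minimal, self-contained sketch (the application name, data and partition counts are all made up for illustration) in which reduceByKey introduces exactly this Stage1/Stage2 split:

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleStageDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ShuffleStageDemo").setMaster("local[2]"))
    // Stage1: three map-side partitions; each writes its shuffle output to its node's local disk.
    val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 3).map(w => (w, 1))
    // reduceByKey introduces a ShuffleDependency, so a new Stage (Stage2, with three
    // reduce-side partitions) starts by reading the shuffle files written by Stage1.
    val counts = pairs.reduceByKey(_ + _, 3)
    counts.collect().foreach(println)
    sc.stop()
  }
}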

The Map and Reduce operations are tied together by the ShuffledRDD class and the MapOutputTrackerMaster class.

Through the ShuffledRDD we can find the dependency between Stage1 and Stage2. This dependency carries the Partitioner (partitioning algorithm), the Aggregator (data combining) and the Serializer (data serialization) that the ShuffledRDD uses across the two Stages.
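A hedged sketch of how this dependency can be inspected from user code (the RDD is the one from the reduceByKey example above; field types follow Spark 1.x, where aggregator and serializer are Options):

import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}

object InspectShuffleDep {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("InspectShuffleDep").setMaster("local[2]"))
    val counts = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1), 3).reduceByKey(_ + _, 3)
    // The ShuffledRDD's only dependency is the ShuffleDependency that links Stage2 back to Stage1.
    val dep = counts.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
    println(dep.partitioner)          // decides which reduce partition each key is shuffled to
    println(dep.aggregator.isDefined) // combine functions, if map/reduce-side combining is enabled
    println(dep.serializer)           // serializer used for the shuffled data
    sc.stop()
  }
}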

MapOutputTrackerMaster records how Stage1 generated its shuffle files. Stage2 reads the shuffle files from disk according to the descriptions held by MapOutputTrackerMaster and produces the data of one Stage2 partition.

After Stage1 (the shuffle map tasks) finishes, the Spark Driver schedules and launches Stage2 following the sequence diagram below:



(The sequence diagram above is the key to understanding how Spark schedules shuffle tasks.)

After a ShuffleMapTask finishes, it returns the information about the shuffle file it produced to the TaskRunner. The TaskRunner calls CoarseGrainedExecutorBackend.statusUpdate to send this information to CoarseGrainedSchedulerBackend, and when CoarseGrainedSchedulerBackend receives the message it calls TaskSchedulerImpl.statusUpdate to handle the task state change.

TaskSchedulerImpl.statusUpdate does two main things:

1. It calls TaskSetManager.removeRunningTask to remove the successfully completed task from TaskSetManager.runningTasksSet.

2. It calls TaskResultGetter.enqueueSuccessfulTask to deserialize the data returned by the ShuffleMapTask and restore the returned object, which is then handed to TaskSchedulerImpl. By default, a result larger than 1 GB is dropped on the executor and only a notice of the drop reaches the Driver; a result larger than the Akka frame size minus the reserved bytes (roughly 10 MB - 200 KB by default) is serialized and stored in the BlockManager, with only a reference sent to the Driver; otherwise the serialized result is sent directly to the Driver. TaskResultGetter.enqueueSuccessfulTask handles each of these cases and then calls TaskSchedulerImpl.handleSuccessfulTask.

The relevant code:

def enqueueSuccessfulTask(
    taskSetManager: TaskSetManager, tid: Long, serializedData: ByteBuffer) {
  getTaskResultExecutor.execute(new Runnable {
    override def run(): Unit = Utils.logUncaughtExceptions {
      try {
        val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
          /*
           * Direct result: the result was not stored in the BlockManager.
           */
          case directResult: DirectTaskResult[_] =>
            if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
              return
            }
            // deserialize "value" without holding any lock so that it won't block other threads.
            // We should call it here, so that when it's called again in
            // "TaskSetManager.handleSuccessfulTask", it does not need to deserialize the value.
            directResult.value()
            (directResult, serializedData.limit())
          /*
           * Indirect result: the result was stored in the BlockManager. After it has been read
           * back from the BlockManager, the block is removed.
           */
          case IndirectTaskResult(blockId, size) =>
            if (!taskSetManager.canFetchMoreResults(size)) {
              // dropped by executor if size is larger than maxResultSize
              sparkEnv.blockManager.master.removeBlock(blockId)
              return
            }
            logDebug("Fetching indirect task result for TID %s".format(tid))
            scheduler.handleTaskGettingResult(taskSetManager, tid)
            val serializedTaskResult = sparkEnv.blockManager.getRemoteBytes(blockId)
            if (!serializedTaskResult.isDefined) {
              /* We won't be able to get the task result if the machine that ran the task failed
               * between when the task ended and when we tried to fetch the result, or if the
               * block manager had to flush the result. */
              scheduler.handleFailedTask(
                taskSetManager, tid, TaskState.FINISHED, TaskResultLost)
              return
            }
            val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
              serializedTaskResult.get)
            sparkEnv.blockManager.master.removeBlock(blockId)
            (deserializedResult, size)
        }

        result.metrics.setResultSize(size)
        /*
         * The task result has been restored; call TaskSchedulerImpl.handleSuccessfulTask
         * for further processing.
         */
        scheduler.handleSuccessfulTask(taskSetManager, tid, result)
      } catch {
        case cnf: ClassNotFoundException =>
          val loader = Thread.currentThread.getContextClassLoader
          taskSetManager.abort("ClassNotFound with classloader: " + loader)
        // Matching NonFatal so we don't catch the ControlThrowable from the "return" above.
        case NonFatal(ex) =>
          logError("Exception while getting task result", ex)
          taskSetManager.abort("Exception while getting task result: %s".format(ex))
      }
    }
  })
}
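The method above is the Driver-side half of result handling. The executor-side choice between the three cases described earlier can be summarized by the following simplified sketch (this is not the actual TaskRunner code; maxResultSize and maxDirectResultSize stand for spark.driver.maxResultSize and the frame-size limit, respectively):

// Simplified decision sketch, under the assumptions stated above.
def chooseResultChannel(resultSize: Long, maxResultSize: Long, maxDirectResultSize: Long): String = {
  if (maxResultSize > 0 && resultSize > maxResultSize) {
    "drop the result; the Driver is only told that it was dropped"
  } else if (resultSize > maxDirectResultSize) {
    "store the serialized result in the BlockManager and send an IndirectTaskResult(blockId, size)"
  } else {
    "send the serialized DirectTaskResult directly to the Driver"
  }
}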

TaskSchedulerImpl.handleSuccessfulTask mainly calls TaskSetManager.handleSuccessfulTask, entering the scheduling logic inside the task set.

TaskSetManager.handleSuccessfulTask calls DAGScheduler.taskEnded to post a CompletionEvent message to DAGSchedulerEventProcessLoop; when the message is received, DAGScheduler.handleTaskCompletion is invoked to do the bookkeeping for the finished task.

DAGScheduler.handleTaskCompletion first calls ShuffleMapStage.addOutputLoc to store the object deserialized by TaskResultGetter.enqueueSuccessfulTask into the ShuffleMapStage.outputLocs array; this object describes one shuffle file. Once all tasks of the Stage have completed, DAGScheduler calls MapOutputTrackerMaster.registerMapOutputs to save the shuffle file information of every Stage1 partition into MapOutputTrackerMaster. After Stage1 completes successfully, all waiting Stages that no longer have missing parents are located, and DAGScheduler.submitMissingTasks turns each of them into tasks, puts the tasks into the pending queue, requests resources and runs them. Among these is Stage2, which begins with the shuffle reduce operation. The code:

private[scheduler] def handleTaskCompletion(event: CompletionEvent) {
  val task = event.task
  val stageId = task.stageId
  val taskType = Utils.getFormattedClassName(task)

  outputCommitCoordinator.taskCompleted(stageId, task.partitionId,
    event.taskInfo.attempt, event.reason)

  // The success case is dealt with separately below, since we need to compute accumulator
  // updates before posting.
  if (event.reason != Success) {
    val attemptId = stageIdToStage.get(task.stageId).map(_.latestInfo.attemptId).getOrElse(-1)
    listenerBus.post(SparkListenerTaskEnd(stageId, attemptId, taskType, event.reason,
      event.taskInfo, event.taskMetrics))
  }

  if (!stageIdToStage.contains(task.stageId)) {
    // Skip all the actions if the stage has been cancelled.
    return
  }

  val stage = stageIdToStage(task.stageId)
  event.reason match {
    case Success =>
      listenerBus.post(SparkListenerTaskEnd(stageId, stage.latestInfo.attemptId, taskType,
        event.reason, event.taskInfo, event.taskMetrics))
      stage.pendingTasks -= task
      task match {
        case rt: ResultTask[_, _] =>
          // Cast to ResultStage here because it's part of the ResultTask
          // TODO Refactor this out to a function that accepts a ResultStage
          val resultStage = stage.asInstanceOf[ResultStage]
          resultStage.resultOfJob match {
            case Some(job) =>
              if (!job.finished(rt.outputId)) {
                updateAccumulators(event)
                job.finished(rt.outputId) = true
                job.numFinished += 1
                // If the whole job has finished, remove it
                if (job.numFinished == job.numPartitions) {
                  markStageAsFinished(resultStage)
                  cleanupStateForJobAndIndependentStages(job)
                  listenerBus.post(
                    SparkListenerJobEnd(job.jobId, clock.getTimeMillis(), JobSucceeded))
                }

                // taskSucceeded runs some user code that might throw an exception. Make sure
                // we are resilient against that.
                try {
                  job.listener.taskSucceeded(rt.outputId, event.result)
                } catch {
                  case e: Exception =>
                    // TODO: Perhaps we want to mark the resultStage as failed?
                    job.listener.jobFailed(new SparkDriverExecutionException(e))
                }
              }
            case None =>
              logInfo("Ignoring result from " + rt + " because its job has finished")
          }

        case smt: ShuffleMapTask =>
          val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
          updateAccumulators(event)
          val status = event.result.asInstanceOf[MapStatus]
          val execId = status.location.executorId
          logDebug("ShuffleMapTask finished on " + execId)
          if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
            logInfo("Ignoring possibly bogus ShuffleMapTask completion from " + execId)
          } else {
            /*
             * Record the result of one shuffle map partition in shuffleStage.outputLocs.
             * The shuffle map stage applies the Partitioner to each key and shuffles one
             * partition's data into the N partitions of the reduce stage; the data is written
             * to a single file on local disk. The result of a shuffle map task is the length
             * of the data destined for each reduce partition. Because the file stores the
             * reduce partitions in order, starting from partition index 0, knowing each
             * partition's length also gives each partition's start offset within the file.
             */
            shuffleStage.addOutputLoc(smt.partitionId, status)
          }

          /*
           * All tasks of this stage have finished; there are no pending tasks left.
           */
          if (runningStages.contains(shuffleStage) && shuffleStage.pendingTasks.isEmpty) {
            markStageAsFinished(shuffleStage)
            logInfo("looking for newly runnable stages")
            logInfo("running: " + runningStages)
            logInfo("waiting: " + waitingStages)
            logInfo("failed: " + failedStages)

            // We supply true to increment the epoch number here in case this is a
            // recomputation of the map outputs. In that case, some nodes may have cached
            // locations with holes (from when we detected the error) and will need the
            // epoch incremented to refetch them.
            // TODO: Only increment the epoch number if this is not the first time
            //       we registered these map outputs.
            /*
             * Once the shuffle stage has no pending tasks left, register the results of all
             * of its ShuffleMapTasks with MapOutputTrackerMaster.
             */
            mapOutputTracker.registerMapOutputs(
              shuffleStage.shuffleDep.shuffleId,
              shuffleStage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray,
              changeEpoch = true)

            clearCacheLocs()
            if (shuffleStage.outputLocs.contains(Nil)) {
              // Some tasks had failed; let's resubmit this shuffleStage
              // TODO: Lower-level scheduler should also deal with this
              logInfo("Resubmitting " + shuffleStage + " (" + shuffleStage.name +
                ") because some of its tasks had failed: " +
                shuffleStage.outputLocs.zipWithIndex.filter(_._1.isEmpty)
                    .map(_._2).mkString(", "))
              submitStage(shuffleStage)
            } else {
              val newlyRunnable = new ArrayBuffer[Stage]
              for (shuffleStage <- waitingStages) {
                logInfo("Missing parents for " + shuffleStage + ": " +
                  getMissingParentStages(shuffleStage))
              }
              /*
               * After a Stage completes successfully, find all waiting Stages that no longer
               * have missing parents and submit each of them via DAGScheduler.submitMissingTasks:
               * turn the Stage into tasks, put them into the pending queue, request resources
               * and run them. This includes the Stage that starts with the shuffle reduce
               * operation (Stage2 in our example).
               */
              for (shuffleStage <- waitingStages if getMissingParentStages(shuffleStage).isEmpty)
              {
                newlyRunnable += shuffleStage
              }
              waitingStages --= newlyRunnable
              runningStages ++= newlyRunnable
              for {
                shuffleStage <- newlyRunnable.sortBy(_.id)
                jobId <- activeJobForStage(shuffleStage)
              } {
                logInfo("Submitting " + shuffleStage + " (" +
                  shuffleStage.rdd + "), which is now runnable")
                submitMissingTasks(shuffleStage, jobId)
              }
            }
          }
      }

    case Resubmitted =>
      logInfo("Resubmitted " + task + ", so marking it as still running")
      stage.pendingTasks += task

    case FetchFailed(bmAddress, shuffleId, mapId, reduceId, failureMessage) =>
      val failedStage = stageIdToStage(task.stageId)
      val mapStage = shuffleToMapStage(shuffleId)

      // It is likely that we receive multiple FetchFailed for a single stage (because we have
      // multiple tasks running concurrently on different executors). In that case, it is possible
      // the fetch failure has already been handled by the scheduler.
      if (runningStages.contains(failedStage)) {
        logInfo(s"Marking $failedStage (${failedStage.name}) as failed " +
          s"due to a fetch failure from $mapStage (${mapStage.name})")
        markStageAsFinished(failedStage, Some(failureMessage))
      }

      if (disallowStageRetryForTest) {
        abortStage(failedStage, "Fetch failure will not retry stage due to testing config")
      } else if (failedStages.isEmpty) {
        // Don't schedule an event to resubmit failed stages if failed isn't empty, because
        // in that case the event will already have been scheduled.
        // TODO: Cancel running tasks in the stage
        logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " +
          s"$failedStage (${failedStage.name}) due to fetch failure")
        messageScheduler.schedule(new Runnable {
          override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
        }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)
      }
      failedStages += failedStage
      failedStages += mapStage

      // Mark the map whose fetch failed as broken in the map stage
      if (mapId != -1) {
        mapStage.removeOutputLoc(mapId, bmAddress)
        mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
      }

      // TODO: mark the executor as failed only if there were lots of fetch failures on it
      if (bmAddress != null) {
        handleExecutorLost(bmAddress.executorId, fetchFailed = true, Some(task.epoch))
      }

    case commitDenied: TaskCommitDenied =>
      // Do nothing here, left up to the TaskScheduler to decide how to handle denied commits

    case ExceptionFailure(className, description, stackTrace, fullStackTrace, metrics) =>
      // Do nothing here, left up to the TaskScheduler to decide how to handle user failures

    case TaskResultLost =>
      // Do nothing here; the TaskScheduler handles these failures and resubmits the task.

    case other =>
      // Unrecognized failure - also do nothing. If the task fails repeatedly, the TaskScheduler
      // will abort the job.
  }
  submitWaitingStages()
}

The shuffle map results stored in MapOutputTrackerMaster can be viewed as a two-dimensional array: the first dimension is the partition index of Stage1 (the Stage that runs the Map operation) and the second dimension is the partition index of Stage2 (the Stage that runs the Reduce operation).

Suppose the Stage1 Map side has 3 partitions and the Stage2 Reduce side also has 3 partitions; the information stored in MapOutputTrackerMaster can then be pictured as in the figure below:

In that figure, the shuffle spreads the data of each Stage1 partition across the 3 Stage2 partitions; "Map2 reduce1 length" denotes the amount of data from Stage1's second partition that is shuffled to Stage2's first partition.
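Since the figure is not reproduced here, the same information can be written down as a toy matrix (the byte counts are invented purely for illustration):

// sizes(m)(r): number of bytes that map partition m (Stage1) wrote for reduce partition r (Stage2).
val sizes: Array[Array[Long]] = Array(
  Array(100L, 400L, 300L), // Map1 -> reduce1, reduce2, reduce3 lengths
  Array(250L, 500L,  50L), // Map2 -> reduce1, reduce2, reduce3 lengths
  Array( 80L, 120L, 900L)  // Map3 -> reduce1, reduce2, reduce3 lengths
)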

In DAGScheduler.submitMissingTasks, DAGScheduler.getPreferredLocs is called for every partition of the Stage; it delegates to DAGScheduler.getPreferredLocsInternal to compute the preferred location of each task.

DAGScheduler.getPreferredLocsInternal calls MapOutputTrackerMaster.getLocationsWithLargestOutputs to determine the Task locations.

The data structure that represents a Task location is TaskLocation; its source:

private[spark] object TaskLocation {
  // We identify hosts on which the block is cached with this prefix.  Because this prefix contains
  // underscores, which are not legal characters in hostnames, there should be no potential for
  // confusion.  See RFC 952 and RFC 1123 for information about the format of hostnames.
  val inMemoryLocationTag = "hdfs_cache_"

  def apply(host: String, executorId: String): TaskLocation = {
    new ExecutorCacheTaskLocation(host, executorId)
  }

  /**
   * Create a TaskLocation from a string returned by getPreferredLocations.
   * These strings have the form [hostname] or hdfs_cache_[hostname], depending on whether the
   * location is cached.
   */
  def apply(str: String): TaskLocation = {
    val hstr = str.stripPrefix(inMemoryLocationTag)
    if (hstr.equals(str)) {
      new HostTaskLocation(str)
    } else {
      new HDFSCacheTaskLocation(hstr)
    }
  }
}
This data structure identifies a Task's location by the node's host address and the executor id.
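For example, the two factory methods above behave as follows (host names and executor ids are made up; TaskLocation is private[spark], so this only compiles inside the org.apache.spark package and is shown purely for illustration):

val execLoc = TaskLocation("host1", "42")      // ExecutorCacheTaskLocation: host plus executor id
val hostLoc = TaskLocation("host2")            // HostTaskLocation: plain hostname
val hdfsLoc = TaskLocation("hdfs_cache_host3") // HDFSCacheTaskLocation: HDFS-cached data on host3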

The rule is: if, for a given Stage2 partition, the share of its shuffled data that was produced at a particular Stage1 TaskLocation exceeds REDUCER_PREF_LOCS_FRACTION, that TaskLocation is used as a TaskLocation for the Stage2 task.

Taking "Map2 reduce2 length" from the figure above as an example: if (Map2 reduce2 length) / (Map1 reduce2 length + Map2 reduce2 length + Map3 reduce2 length) > REDUCER_PREF_LOCS_FRACTION, a TaskLocation is created from the host address and executor id of the node where Map2 ran in Stage1.

From the definition of TaskLocation it follows that the TaskLocation created in this case is an ExecutorCacheTaskLocation.
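Using the toy numbers from the matrix above, the check for reduce2 works out as follows (0.2 is the default value of REDUCER_PREF_LOCS_FRACTION; the lengths are the invented ones from the earlier sketch):

val lengthsForReduce2 = Seq(400L, 500L, 120L)              // Map1/Map2/Map3 -> reduce2 lengths
val fractionFromMap2 = lengthsForReduce2(1).toDouble / lengthsForReduce2.sum // 500 / 1020 ≈ 0.49
val preferMap2Executor = fractionFromMap2 > 0.2            // true: reuse Map2's host and executor id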

The code:

private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
  // If the partition has already been visited, no need to re-visit.
  // This avoids exponential path exploration.  SPARK-695
  if (!visited.add((rdd, partition))) {
    // Nil has already been returned for previously visited partitions.
    return Nil
  }
  // If the partition is cached, return the cache locations
  val cached = getCacheLocs(rdd)(partition)
  if (cached.nonEmpty) {
    return cached
  }
  // If the RDD has some placement preferences (as is the case for input RDDs), get those
  val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
  if (rddPrefs.nonEmpty) {
    return rddPrefs.map(TaskLocation(_))
  }

  rdd.dependencies.foreach {
    case n: NarrowDependency[_] =>
      // If the RDD has narrow dependencies, pick the first partition of the first narrow dep
      // that has any placement preferences. Ideally we would choose based on transfer sizes,
      // but this will do for now.
      for (inPart <- n.getParents(partition)) {
        val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
        if (locs != Nil) {
          return locs
        }
      }
    case s: ShuffleDependency[_, _, _] =>
      // For shuffle dependencies, pick locations which have at least REDUCER_PREF_LOCS_FRACTION
      // of data as preferred locations
      if (shuffleLocalityEnabled &&
          rdd.partitions.size < SHUFFLE_PREF_REDUCE_THRESHOLD &&
          s.rdd.partitions.size < SHUFFLE_PREF_MAP_THRESHOLD) {
        // Get the preferred map output locations for this reducer
        /*
         * Use the results returned by the Stage1 shuffle map tasks to decide the locality of
         * the Stage2 shuffle reduce tasks. If the data a Stage1 node contributes to a Stage2
         * partition exceeds REDUCER_PREF_LOCS_FRACTION (default 0.2) of that partition's
         * total, the node becomes a launch location for the Stage2 task.
         */
        val topLocsForReducer = mapOutputTracker.getLocationsWithLargestOutputs(s.shuffleId,
          partition, rdd.partitions.size, REDUCER_PREF_LOCS_FRACTION)
        if (topLocsForReducer.nonEmpty) {
          return topLocsForReducer.get.map(loc => TaskLocation(loc.host, loc.executorId))
        }
      }
    case _ =>
  }
  Nil
}

DAGScheduler.submitMissingTasks creates a ShuffleMapTask or ResultTask from these TaskLocations. If the shuffled data does not meet the condition for creating an ExecutorCacheTaskLocation, the locs argument passed when the ShuffleMapTask or ResultTask is created is Nil.

After the tasks of a Stage have been created, TaskSchedulerImpl.submitTasks is called to create a TaskSetManager. While the TaskSetManager is being built, each ShuffleMapTask or ResultTask is added to different pending-task collections according to its TaskLocality. The code:

private def addPendingTask(index: Int, readding: Boolean = false) {
  // Utility method that adds `index` to a list only if readding=false or it's not already there
  def addTo(list: ArrayBuffer[Int]) {
    if (!readding || !list.contains(index)) {
      list += index
    }
  }

  for (loc <- tasks(index).preferredLocations) {
    // preferredLocations returns the host and executor id of the partition's location
    loc match {
      case e: ExecutorCacheTaskLocation =>
        /*
         * If the locs argument passed when the ShuffleMapTask or ResultTask was created is of
         * type ExecutorCacheTaskLocation, the task index is added to pendingTasksForExecutor.
         * The key of this HashMap is an executor id; the value is the list of task indices
         * (possibly several) pending for that executor.
         */
        addTo(pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer))
      case e: HDFSCacheTaskLocation => {
        val exe = sched.getExecutorsAliveOnHost(loc.host)
        exe match {
          case Some(set) => {
            for (e <- set) {
              addTo(pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer))
            }
            logInfo(s"Pending task $index has a cached location at ${e.host} " +
              ", where there are executors " + set.mkString(","))
          }
          case None => logDebug(s"Pending task $index has a cached location at ${e.host} " +
              ", but there are no executors alive there.")
        }
      }
      case _ => Unit
    }
    // In the DirectDStream case loc.host belongs to neither the Spark cluster nor the HDFS
    // cluster, so such tasks end up in this HashMap.
    addTo(pendingTasksForHost.getOrElseUpdate(loc.host, new ArrayBuffer))
    for (rack <- sched.getRackForHost(loc.host)) {
      addTo(pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer))
    }
  }

  /*
   * If the locs argument passed when the ShuffleMapTask or ResultTask was created is Nil,
   * the task index is added to pendingTasksWithNoPrefs, an ArrayBuffer of task indices.
   */
  if (tasks(index).preferredLocations == Nil) {
    addTo(pendingTasksWithNoPrefs)
  }

  if (!readding) {
    // Every task, including those of the DirectDStream case, is added to allPendingTasks.
    allPendingTasks += index  // No point scanning this whole list to find the old task there
  }
}

As the code above shows, a ShuffleMapTask or ResultTask whose TaskLocation is an ExecutorCacheTaskLocation is added to pendingTasksForExecutor, pendingTasksForHost and allPendingTasks at the same time, while a ShuffleMapTask or ResultTask whose locs is Nil is added to pendingTasksWithNoPrefs and allPendingTasks.
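As a toy illustration (plain Scala collections, not the actual TaskSetManager fields), a task with index 7 whose preferred location is ExecutorCacheTaskLocation("host1", "42") and a task with index 8 that has no preferred locations end up queued like this:

import scala.collection.mutable.{ArrayBuffer, HashMap}

val pendingTasksForExecutor = HashMap("42"    -> ArrayBuffer(7)) // keyed by executor id
val pendingTasksForHost     = HashMap("host1" -> ArrayBuffer(7)) // keyed by hostname
val pendingTasksWithNoPrefs = ArrayBuffer(8)                     // tasks created with locs == Nil
val allPendingTasks         = ArrayBuffer(7, 8)                  // every task ends up here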

One thing to note: although the same task is added to several pending queues, it is never scheduled twice; the reason is explained below when task scheduling is discussed.

After TaskSchedulerImpl.submitTasks creates the TaskSetManager, it calls CoarseGrainedSchedulerBackend.reviveOffers to request resources for running Stage2.

CoarseGrainedSchedulerBackend.reviveOffers eventually calls TaskSchedulerImpl.resourceOffers to allocate execution resources. The code:

def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
  // Mark each slave as alive and remember its hostname
  // Also track if new executor is added
  var newExecAvail = false
  for (o <- offers) {
    executorIdToHost(o.executorId) = o.host
    activeExecutorIds += o.executorId
    if (!executorsByHost.contains(o.host)) {
      executorsByHost(o.host) = new HashSet[String]()
      executorAdded(o.executorId, o.host)
      newExecAvail = true
    }
    for (rack <- getRackForHost(o.host)) {
      hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
    }
  }

  // Randomly shuffle offers to avoid always placing tasks on the same set of workers.
  val shuffledOffers = Random.shuffle(offers)
  // Build a list of tasks to assign to each worker.
  // tasks is a sequence whose elements are of type ArrayBuffer[TaskDescription]; each buffer is
  // sized according to the number of cores of the corresponding executor.
  val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
  // availableCpus holds, per offer, the number of CPU cores available on that executor.
  val availableCpus = shuffledOffers.map(o => o.cores).toArray
  // Get the TaskSets from rootPool, ordered by the configured scheduling algorithm.
  val sortedTaskSets = rootPool.getSortedTaskSetQueue
  for (taskSet <- sortedTaskSets) {
    logDebug("parentName: %s, name: %s, runningTasks: %s".format(
      taskSet.parent.name, taskSet.name, taskSet.runningTasks))
    if (newExecAvail) {
      taskSet.executorAdded() // a new executor was added; recompute the tasks' locality levels
    }
  }

  // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
  // of locality levels so that it gets a chance to launch local tasks on all of them.
  // NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
  var launchedTask = false
  /*
   * Within a task set, PROCESS_LOCAL tasks are launched first and ANY tasks last.
   */
  for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
    do {
      launchedTask = resourceOfferSingleTaskSet(
          taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
    } while (launchedTask)
  }

  if (tasks.size > 0) {
    hasLaunchedTask = true
  }
  return tasks
}

Resources are offered to the tasks of a task set in this priority order: PROCESS_LOCAL > NODE_LOCAL > NO_PREF > RACK_LOCAL > ANY.
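How long the scheduler waits at each level before falling back to the next one is governed by the spark.locality.wait settings; a configuration sketch (the values shown are the usual defaults, adjust as needed):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.locality.wait", "3s")         // base wait before downgrading a locality level
  .set("spark.locality.wait.process", "3s") // wait at PROCESS_LOCAL before trying NODE_LOCAL
  .set("spark.locality.wait.node", "3s")    // wait at NODE_LOCAL before trying RACK_LOCAL
  .set("spark.locality.wait.rack", "3s")    // wait at RACK_LOCAL before falling back to ANY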

TaskSchedulerImpl.resourceOffers ultimately calls TaskSchedulerImpl.resourceOfferSingleTaskSet to allocate resources for one task set. resourceOfferSingleTaskSet loops over the executors being offered, takes a pending task id from the waiting queues and hands it to the executor currently being offered:

private def resourceOfferSingleTaskSet(
    taskSet: TaskSetManager,
    maxLocality: TaskLocality,
    shuffledOffers: Seq[WorkerOffer],
    availableCpus: Array[Int],
    tasks: Seq[ArrayBuffer[TaskDescription]]) : Boolean = {
  var launchedTask = false
  for (i <- 0 until shuffledOffers.size) {
    val execId = shuffledOffers(i).executorId
    val host = shuffledOffers(i).host
    if (availableCpus(i) >= CPUS_PER_TASK) { // assign tasks according to available CPU cores
      try {
        for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
          tasks(i) += task // place this task on the i-th worker (the offers were already shuffled)
          val tid = task.taskId
          taskIdToTaskSetId(tid) = taskSet.taskSet.id // record which task set the task belongs to
          taskIdToExecutorId(tid) = execId            // record which executor the task runs on
          executorsByHost(host) += execId
          availableCpus(i) -= CPUS_PER_TASK
          assert(availableCpus(i) >= 0)
          launchedTask = true
        }
      } catch {
        case e: TaskNotSerializableException =>
          logError(s"Resource offer failed, task set ${taskSet.name} was not serializable")
          // Do not offer resources for this task, but don't throw an error to allow other
          // task sets to be submitted.
          return launchedTask
      }
    }
  }
  return launchedTask
}
TaskSchedulerImpl.resourceOfferSingleTaskSet calls TaskSetManager.resourceOffer to allocate resources for a single task of the task set, and TaskSetManager.resourceOffer in turn calls TaskSetManager.dequeueTask to perform the actual selection. The code:

private def dequeueTask(execId: String, host: String, maxLocality: TaskLocality.Value)
  : Option[(Int, TaskLocality.Value, Boolean)] =
{
  /*
   * If the ShuffleMapTask or ResultTask was created with locs of type ExecutorCacheTaskLocation,
   * its index is fetched from the pendingTasksForExecutor HashMap.
   */
  for (index <- dequeueTaskFromList(execId, getPendingTasksForExecutor(execId))) {
    return Some((index, TaskLocality.PROCESS_LOCAL, false))
  }

  if (TaskLocality.isAllowed(maxLocality, TaskLocality.NODE_LOCAL)) {
    // A KafkaRDD partition's host differs from the executors' hosts, so such tasks
    // are not found in this HashMap.
    for (index <- dequeueTaskFromList(execId, getPendingTasksForHost(host))) {
      return Some((index, TaskLocality.NODE_LOCAL, false))
    }
  }

  /*
   * If the ShuffleMapTask or ResultTask was created with locs == Nil, its index is fetched
   * from pendingTasksWithNoPrefs.
   */
  if (TaskLocality.isAllowed(maxLocality, TaskLocality.NO_PREF)) {
    // Look for noPref tasks after NODE_LOCAL for minimize cross-rack traffic
    for (index <- dequeueTaskFromList(execId, pendingTasksWithNoPrefs)) {
      return Some((index, TaskLocality.PROCESS_LOCAL, false))
    }
  }

  if (TaskLocality.isAllowed(maxLocality, TaskLocality.RACK_LOCAL)) {
    for {
      rack <- sched.getRackForHost(host)
      index <- dequeueTaskFromList(execId, getPendingTasksForRack(rack))
    } {
      return Some((index, TaskLocality.RACK_LOCAL, false))
    }
  }

  if (TaskLocality.isAllowed(maxLocality, TaskLocality.ANY)) {
    // Tasks processing a KafkaRDD are fetched from allPendingTasks.
    for (index <- dequeueTaskFromList(execId, allPendingTasks)) {
      return Some((index, TaskLocality.ANY, false))
    }
  }

  // find a speculative task if all others tasks have been scheduled
  dequeueSpeculativeTask(execId, host, maxLocality).map {
    case (taskIndex, allowedLocality) => (taskIndex, allowedLocality, true)}
}

TaskSetManager.dequeueTask calls TaskSetManager.dequeueTaskFromList to take a task from a pending-task list:

private def dequeueTaskFromList(execId: String, list: ArrayBuffer[Int]): Option[Int] = {
  var indexOffset = list.size
  while (indexOffset > 0) {
    indexOffset -= 1
    val index = list(indexOffset)
    if (!executorIsBlacklisted(execId, index)) {
      // This should almost always be list.trimEnd(1) to remove tail
      list.remove(indexOffset)
      /*
       * If the task has already started running, copiesRunning(index) == 1;
       * if it has already completed successfully, successful(index) == true.
       * This check prevents the same task from being executed twice.
       */
      if (copiesRunning(index) == 0 && !successful(index)) {
        return Some(index)
      }
    }
  }
  None
}


A task with very strong locality, for example one whose locations are ExecutorCacheTaskLocations, is added to three data structures: pendingTasksForExecutor, pendingTasksForHost and allPendingTasks. Once the task has started running, copiesRunning(index) == 1; once it has completed successfully, successful(index) == true. The check copiesRunning(index) == 0 && !successful(index) therefore prevents the task from being executed more than once.




From the analysis above we can conclude that the executor on which a Stage2 task runs is decided in one of two ways.

Case 1: when the Stage2 task is created, the locs passed in are ExecutorCacheTaskLocations, i.e. (Map2 reduce2 length) / (Map1 reduce2 length + Map2 reduce2 length + Map3 reduce2 length) > REDUCER_PREF_LOCS_FRACTION. Such a task is launched on the executor where Map2 ran.

Case 2: when the Stage2 task is created, the locs passed in are Nil, i.e. no Stage1 location contributes more than REDUCER_PREF_LOCS_FRACTION of the partition's data (in the example, (Map2 reduce2 length) / (Map1 reduce2 length + Map2 reduce2 length + Map3 reduce2 length) < REDUCER_PREF_LOCS_FRACTION). Such a task is launched on whichever executor happens to be offered, effectively a randomly assigned executor.



