Spark Study Notes 49: Job Scheduling in Spark


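1. First, an overview, traced from the source code, of how Spark handles task scheduling and resource allocation.
[Figure: overview of task scheduling and resource allocation in Spark (image not preserved)]
A few points about the figure:
　　1.1 How are the seven circles at the top divided into stages?
　　　　Rule: RDDs connected by narrow dependencies are pipelined into the same stage, and a wide (shuffle) dependency marks a stage boundary. A stage whose output feeds a shuffle runs ShuffleMapTasks; only the final stage of the job runs ResultTasks. The parallelism of a stage equals the number of partitions of its last RDD.
Stage splitting in Spark: http://blog.csdn.net/qq_21383435/article/details/78700524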
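
The splitting rule can be illustrated with a toy model. This is only a sketch and not Spark code: `Node`, `Narrow`, `Wide` and `stagesOf` are hypothetical names, and the walk does not deduplicate parents shared by several children. It starts at the last node, pipelines everything reachable through narrow dependencies into one stage, and starts a new stage at each wide dependency.

```scala
object StageSplitSketch {
  // Hypothetical stand-ins for RDDs and their dependencies (not Spark classes).
  sealed trait Dep { def parent: Node }
  final case class Narrow(parent: Node) extends Dep
  final case class Wide(parent: Node) extends Dep
  final case class Node(name: String, deps: List[Dep] = Nil)

  // Returns the stages reachable from `last`, parent stages first.
  // Each stage is the list of node names that are pipelined together.
  def stagesOf(last: Node): List[List[String]] = {
    // Collects the node names of the current stage and the roots of its parent stages.
    def walk(n: Node): (List[String], List[Node]) =
      n.deps.foldLeft((List(n.name), List.empty[Node])) {
        case ((names, parents), Narrow(p)) =>
          // Narrow dependency: the parent joins the current stage.
          val (ns, ps) = walk(p)
          (ns ::: names, parents ::: ps)
        case ((names, parents), Wide(p)) =>
          // Wide dependency: the parent becomes the root of a new stage.
          (names, parents :+ p)
      }
    val (names, parents) = walk(last)
    parents.flatMap(stagesOf) :+ names
  }

  def main(args: Array[String]): Unit = {
    val source = Node("textFile")
    val mapped = Node("map", List(Narrow(source)))
    val reduced = Node("reduceByKey", List(Wide(mapped)))
    val filtered = Node("filter", List(Narrow(reduced)))
    stagesOf(filtered).foreach(stage => println(stage.mkString(" -> ")))
  }
}
```

With one wide dependency in the chain, the walk yields two stages: the shuffle's upstream side and its downstream side.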

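　　1.2 Why are there three TaskSets?
　　The tasks of one stage are grouped into one TaskSet. The figure has three stages in total.

　　1.3 Between TaskScheduler and SchedulerBackend sit the implementation class TaskSchedulerImpl and the TaskSetManager.

　　1.4 An Executor is the process that actually runs tasks. It owns a number of CPU cores and some memory, runs computations as threads, and is the smallest unit the resource manager can grant. SchedulerBackend is an interface provided by Spark. It defines the handling of executor-related events, including: when a new executor registers, record its information, increase the global resource total (cores), and run makeOffer once; when an executor reports a status update and the task has finished, reclaim the core and run makeOffer once; plus other events such as stopping or removing an executor.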
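
The core accounting described in 1.4 can be sketched as follows. This is a toy model under stated assumptions, not Spark's SchedulerBackend: `Backend`, `submit`, `registerExecutor`, `taskFinished` and `makeOffer` are hypothetical names, and every task is assumed to need exactly one core.

```scala
import scala.collection.mutable

object OfferSketch {
  // Hypothetical model: each pending task needs exactly one core.
  final class Backend {
    private val freeCores = mutable.LinkedHashMap.empty[String, Int]
    private val pending = mutable.Queue.empty[String]
    // (task, executor) pairs in launch order.
    val launched = mutable.ArrayBuffer.empty[(String, String)]

    def submit(task: String): Unit = {
      pending.enqueue(task)
      makeOffer()
    }

    // A new executor registers: record its cores, then offer resources.
    def registerExecutor(id: String, cores: Int): Unit = {
      freeCores(id) = cores
      makeOffer()
    }

    // A task finished: reclaim its core, then offer resources again.
    def taskFinished(executorId: String): Unit = {
      freeCores(executorId) += 1
      makeOffer()
    }

    // Hand pending tasks to executors that still have a free core.
    private def makeOffer(): Unit =
      for (id <- freeCores.keys.toList) {
        while (freeCores(id) > 0 && pending.nonEmpty) {
          freeCores(id) -= 1
          launched += ((pending.dequeue(), id))
        }
      }
  }
}
```

Every event that changes the amount of free resources ends with an offer round, which is why a task can start as soon as a core is returned.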

2. A job starts when an action is called on an RDD. The action eventually reaches org.apache.spark.SparkContext.runJob(). SparkContext has several overloaded runJob methods, and they all end up in this one:

/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 * partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  // Fail fast if the SparkContext has been stopped. `stopped` is an AtomicBoolean,
  // so this check is thread-safe without taking a lock.
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  // The location in user code that triggered this job.
  val callSite = getCallSite
  // clean() delegates to ClosureCleaner.clean, which strips unneeded references from the
  // closure and checks that it is serializable, so the task can be shipped to executors.
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  // The DAGScheduler splits the job into stages along its dependencies and schedules them.
  // This call hands control to the DAGScheduler instance created earlier.
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}

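3. Now the overall picture:
[Figure: overall job scheduling flow (image not preserved)]

This figure is important. It was drawn for Spark versions before 2.2, so it differs in places from the code discussed here, but most of it still applies. I will revise it once I have worked through the whole flow; for now, look at the code. (Readers who understand the flow are welcome to draw an updated figure.)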

4. The call dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get) above invokes this method:

/**
 * Run an action job on the given RDD and pass all the results to the resultHandler function as
 * they arrive.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like first()
 * @param callSite where in the user program this job was called
 * @param resultHandler callback to pass each result to
 * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
 *
 * @note Throws `Exception` when the job fails
 *
 * Called from SparkContext.
 */
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  // submitJob returns a JobWaiter that tracks the job. Success and failure are
  // handled separately below.
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  // submitJob is asynchronous, so block here until completionFuture has a value.
  // Duration.Inf means there is no timeout.
  ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
  // Log the outcome; on failure, rethrow the exception.
  waiter.completionFuture.value.get match {
    case scala.util.Success(_) =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case scala.util.Failure(exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}
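
The "submit asynchronously, then block on a future" pattern used here can be tried in isolation. The sketch below is a simplified stand-in, not Spark's JobWaiter: `ToyWaiter` and `runAndWait` are hypothetical names, and it uses the standard-library `Await.ready` where Spark calls its own `ThreadUtils.awaitReady`.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration.Duration
import scala.util.{Failure, Success}

object WaitSketch {
  // Hypothetical stand-in for JobWaiter: completes once every task has reported success,
  // or as soon as the job is reported failed.
  final class ToyWaiter(totalTasks: Int) {
    private val promise = Promise[Unit]()
    private var finished = 0
    // A job with zero tasks is complete immediately.
    if (totalTasks == 0) promise.success(())

    def completionFuture: Future[Unit] = promise.future

    def taskSucceeded(): Unit = synchronized {
      finished += 1
      if (finished == totalTasks) { promise.trySuccess(()); () }
    }

    def jobFailed(e: Exception): Unit = { promise.tryFailure(e); () }
  }

  // Blocks until the waiter's future has a value, then reports the outcome.
  def runAndWait(waiter: ToyWaiter): String = {
    Await.ready(waiter.completionFuture, Duration.Inf) // no timeout
    waiter.completionFuture.value.get match {
      case Success(_) => "finished"
      case Failure(e) => "failed: " + e.getMessage
    }
  }
}
```

`Await.ready` only waits for completion and never throws the job's exception itself, which is why the caller inspects `value.get` afterwards.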

The key line here is val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties). What does it call?

/**
 * Submit an action job to the scheduler.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 *   partitions of the target RDD, e.g. for operations like first()
 * @param callSite where in the user program this job was called
 * @param resultHandler callback to pass each result to
 * @param properties scheduler properties to attach to this job, e.g. fair scheduler pool name
 *
 * @return a JobWaiter object that can be used to block until the job finishes executing
 *         or can be used to cancel the job.
 *
 * @throws IllegalArgumentException when partitions ids are illegal
 *
 * submitJob proceeds as follows:
 *   1. Read rdd.partitions.length as maxPartitions and check that no requested partition
 *      id is out of range.
 *   2. Generate the jobId for this job.
 *   3. Create a JobWaiter, the object that waits for the job to finish.
 *   4. Post a JobSubmitted event to eventProcessLoop (a DAGSchedulerEventProcessLoop).
 *   5. Return the JobWaiter.
 *
 * nextJobId starts at 0 and keeps increasing within one SparkContext. The posted event is
 * eventually handled by DAGScheduler.handleJobSubmitted, covered in the stage-splitting part below.
 */
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }
  // 2. Generate the jobId for this job.
  val jobId = nextJobId.getAndIncrement()
  // The check below is on the number of requested partitions, not on the jobId:
  // a job with no partitions has no tasks to run.
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  // 3. Create the JobWaiter.
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // 4. Post a JobSubmitted event to eventProcessLoop. JobSubmitted is a DAGSchedulerEvent
  //    that carries the job's parameters; eventProcessLoop processes it later.
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  // 5. Return the JobWaiter.
  waiter
}

The eventProcessLoop.post() call hands the job off for further processing. post() is defined in EventLoop:

private[spark] abstract class EventLoop[E](name: String) extends Logging {
  /**
   * Put the event into the event queue. The event thread will process it later.
   */
  def post(event: E): Unit = {
    // Enqueue the event (here, the JobSubmitted event).
    eventQueue.put(event)
  }
}

eventProcessLoop is an instance of DAGSchedulerEventProcessLoop, which extends EventLoop:

// Create the DAGSchedulerEvent processing loop. It acts like a router: whatever event is
// posted to it, it forwards the event to the matching handler.
// DAGSchedulerEventProcessLoop is defined in the same file.
private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)

So events posted through eventProcessLoop.post() are picked up in a loop and handled by the onReceive(event: DAGSchedulerEvent) method of DAGSchedulerEventProcessLoop:

/**
 * The main event loop of the DAG scheduler.
 *
 * Called repeatedly from the EventLoop abstract class.
 */
override def onReceive(event: DAGSchedulerEvent): Unit = {
  val timerContext = timer.time()
  try {
    doOnReceive(event)
  } finally {
    timerContext.stop()
  }
}

Where is the loop? It is not visible here.
Opening the parent class, private[spark] abstract class EventLoop[E](name: String) extends Logging {}, shows:

private val eventThread = new Thread(name) {
  // Run as a daemon thread.
  setDaemon(true)

  override def run(): Unit = {
    try {
      while (!stopped.get) {
        // Take and remove the head of the queue, blocking until an element is available.
        val event = eventQueue.take()
        try {
          // EventLoop is an abstract class, so this calls the subclass's implementation,
          // here DAGSchedulerEventProcessLoop.onReceive(), over and over on this one thread.
          onReceive(event)
        } catch {
          case NonFatal(e) =>
            try {
              onError(e)
            } catch {
              case NonFatal(e) => logError("Unexpected error in " + name, e)
            }
        }
      }
    } catch {
      case ie: InterruptedException => // exit even if eventQueue is not empty
      case NonFatal(e) => logError("Unexpected error in " + name, e)
    }
  }
}

So a single dedicated thread calls the subclass's onReceive(event: DAGSchedulerEvent) in a loop.
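
To see the post/take/onReceive cycle run on its own, here is a minimal sketch in the spirit of EventLoop. It is not Spark's class: `ToyEventLoop` and `handledInOrder` are hypothetical names, and logging and error reporting are reduced to an empty `onError`.

```scala
import java.util.concurrent.{ConcurrentLinkedQueue, CountDownLatch, LinkedBlockingQueue}
import java.util.concurrent.atomic.AtomicBoolean
import scala.util.control.NonFatal

object EventLoopSketch {
  // A simplified event loop: one daemon thread drains a blocking queue.
  abstract class ToyEventLoop[E](name: String) {
    private val queue = new LinkedBlockingQueue[E]()
    private val stopped = new AtomicBoolean(false)
    private val thread = new Thread(name) {
      setDaemon(true)
      override def run(): Unit =
        try {
          while (!stopped.get) {
            val event = queue.take() // blocks until an event is available
            try onReceive(event)
            catch { case NonFatal(e) => onError(e) }
          }
        } catch {
          case _: InterruptedException => // stop() interrupted the blocking take()
        }
    }

    def start(): Unit = thread.start()
    def stop(): Unit = if (stopped.compareAndSet(false, true)) thread.interrupt()
    def post(event: E): Unit = queue.put(event)

    protected def onReceive(event: E): Unit
    protected def onError(e: Throwable): Unit = ()
  }

  // Posts the events and returns them in the order the loop thread handled them.
  def handledInOrder(events: List[String]): List[String] = {
    val seen = new ConcurrentLinkedQueue[String]()
    val done = new CountDownLatch(events.size)
    val loop = new ToyEventLoop[String]("toy-event-loop") {
      override protected def onReceive(event: String): Unit = {
        seen.add(event)
        done.countDown()
      }
    }
    loop.start()
    events.foreach(loop.post)
    done.await() // wait until every event has been handled
    loop.stop()
    seen.toArray(new Array[String](0)).toList
  }
}
```

Because a single thread consumes a FIFO queue, events are handled one at a time in posting order, so the handlers need no locking among themselves.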

Next, what does DAGSchedulerEventProcessLoop do when it receives a JobSubmitted event?

/**
 * Dispatches each event to its handler according to the event type. For the JobSubmitted
 * event posted above, the handler is DAGScheduler.handleJobSubmitted. The other events that
 * can be handled are listed in the code below.
 *
 * @param event
 */
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  // Job submitted
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    // Start handling the job and split it into stages.
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  // Map stage submitted
  case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
    dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)
  // Stage cancelled
  case StageCancelled(stageId, reason) =>
    dagScheduler.handleStageCancellation(stageId, reason)
  // Job cancelled
  case JobCancelled(jobId, reason) =>
    dagScheduler.handleJobCancellation(jobId, reason)
  // Job group cancelled
  case JobGroupCancelled(groupId) =>
    dagScheduler.handleJobGroupCancelled(groupId)
  // All jobs cancelled
  case AllJobsCancelled =>
    dagScheduler.doCancelAllJobs()
  // Executor added
  case ExecutorAdded(execId, host) =>
    dagScheduler.handleExecutorAdded(execId, host)
  // Executor lost
  case ExecutorLost(execId, reason) =>
    val filesLost = reason match {
      case SlaveLost(_, true) => true
      case _ => false
    }
    dagScheduler.handleExecutorLost(execId, filesLost)
  case BeginEvent(task, taskInfo) =>
    dagScheduler.handleBeginEvent(task, taskInfo)
  case GettingResultEvent(taskInfo) =>
    dagScheduler.handleGetTaskResult(taskInfo)
  // Task completed
  case completion: CompletionEvent =>
    dagScheduler.handleTaskCompletion(completion)
  // Task set failed
  case TaskSetFailed(taskSet, reason, exception) =>
    dagScheduler.handleTaskSetFailed(taskSet, reason, exception)
  // Resubmit failed stages
  case ResubmitFailedStages =>
    dagScheduler.resubmitFailedStages()
}

The job submission event is handled by this case:

case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
  // Start handling the job and split it into stages.
  dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)

So let's look at dagScheduler.handleJobSubmitted():

/**
 * Job submission: handling the job.
 *
 * When DAGSchedulerEventProcessLoop receives a JobSubmitted event, it calls handleJobSubmitted
 * on the dagScheduler. The steps are:
 *   1. Create finalStage, which also splits the job into stages. Stage creation can throw,
 *      for example when the HDFS files underlying a HadoopRDD have been deleted, so on an
 *      exception the listener's (the JobWaiter's) jobFailed method is called.
 *   2. Create an ActiveJob, record it in jobIdToActiveJob and activeJobs, and attach it to
 *      finalStage.
 *   3. Post a SparkListenerJobStart event to listenerBus.
 *   4. Submit finalStage.
 */
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    // Stage splitting works backwards from the last stage, which is a ResultStage.
    // The calls are recursive:
    //   createResultStage() -> getOrCreateParentStages() -> getOrCreateShuffleMapStage()
    //     -> createShuffleMapStage() -> getOrCreateParentStages() -> ...
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  // Create the ActiveJob for this job and get ready to compute finalStage.
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  // Clear the cached partition locations (cacheLocs) so that the new job does not
  // rely on stale location information.
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))
  val jobSubmissionTime = clock.getTimeMillis()
  // The job becomes active: jobIdToActiveJob maps jobId to the ActiveJob.
  jobIdToActiveJob(jobId) = job
  // The set of active jobs.
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  // Post the job start event to the LiveListenerBus.
  listenerBus.post(SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  // Submit the final stage.
  submitStage(finalStage)
}

Finally it calls submitStage(finalStage) to submit the final stage:

/** Submits stage, but first recursively submits any missing parents.
 *
 * Submission starts from the final stage. Before a stage is submitted, any ancestor stage that
 * has not been computed yet is submitted first, and the child stage is put into waitingStages.
 * Only when a stage has no missing parents are its tasks submitted.
 *
 * If the stage is already waiting, running or failed, nothing is done. Otherwise its
 * dependencies must be satisfied first: a stage runs only after all of its parent stages have
 * finished. The parents that still need to run are found with
 * DAGScheduler.getMissingParentStages.
 *
 * When both conditions hold, DAGScheduler.submitMissingTasks submits the stage.
 */
private def submitStage(stage: Stage) {
  // Find the job this stage belongs to.
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    // Submit only if the stage is not already waiting, running or failed.
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // getMissingParentStages finds the parent stages whose output is not available yet.
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        // All parent stages are done (or there are none): submit this stage's tasks.
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        // Otherwise recursively submit the missing parent stages first.
        for (parent <- missing) {
          submitStage(parent)
        }
        // The current stage waits until its parents finish.
        waitingStages += stage
      }
    }
  } else {
    // No active job owns this stage, so abort it.
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
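
The recursion is easier to follow on a toy model. The sketch below makes several assumptions and is not Spark's DAGScheduler: `ToyStage`, `ToyScheduler`, `submittedOrder` and `stageFinished` are hypothetical names, a stage counts as "missing" until it is marked finished, and failed stages and job lookup are left out.

```scala
import scala.collection.mutable

object SubmitStageSketch {
  // Hypothetical stage: an id plus its parent stages (not Spark's Stage class).
  final case class ToyStage(id: Int, parents: List[ToyStage] = Nil)

  final class ToyScheduler {
    val waiting = mutable.LinkedHashSet.empty[ToyStage]
    val running = mutable.LinkedHashSet.empty[ToyStage]
    val finished = mutable.Set.empty[ToyStage]
    // Stage ids in the order their tasks were actually submitted.
    val submittedOrder = mutable.ArrayBuffer.empty[Int]

    // Parents whose output is not available yet, lowest id first.
    private def missingParents(s: ToyStage): List[ToyStage] =
      s.parents.filterNot(finished).sortBy(_.id)

    def submitStage(stage: ToyStage): Unit =
      if (!waiting(stage) && !running(stage)) {
        val missing = missingParents(stage)
        if (missing.isEmpty) {
          // No missing parents: this stage's tasks can be submitted now.
          running += stage
          submittedOrder += stage.id
        } else {
          // Submit the parents first and park the child.
          missing.foreach(submitStage)
          waiting += stage
        }
      }

    // When a stage finishes, waiting children with no missing parents are submitted.
    def stageFinished(stage: ToyStage): Unit = {
      running -= stage
      finished += stage
      val ready = waiting.filter(missingParents(_).isEmpty).toList.sortBy(_.id)
      waiting --= ready
      ready.foreach(submitStage)
    }
  }
}
```

Submitting the final stage therefore runs the root stages first, and the final stage itself is submitted only after its last parent completes.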

When a stage has no missing parents, submitStage calls submitMissingTasks(stage, jobId.get); the recursive calls for parent stages end in the same method:

/** Called when stage's parents are available and we can now do its task.
 *
 * Submitting tasks: submitMissingTasks is the entry point. It is called once a stage has no
 * missing parent stages and submits the tasks of that stage that have not been computed yet.
 *
 * MapStatus: holds the BlockManager address of the task that ran and the estimated sizes of
 * the blocks to be passed to the reduce tasks.
 *
 * Steps, following the code below:
 *   (1) Find the partitions that still need to be computed (stage.findMissingPartitions()).
 *   (2) Add the stage to runningStages.
 *   (3) Tell outputCommitCoordinator that the stage is starting.
 *   (4) Compute the preferred locations of each partition to compute.
 *   (5) Create a new stage attempt (latestInfo, a StageInfo) and post
 *       SparkListenerStageSubmitted to listenerBus.
 *   (6) Serialize (rdd, shuffleDep) for a ShuffleMapStage or (rdd, func) for a ResultStage,
 *       and broadcast the bytes with sc.broadcast.
 *   (7) Create one ShuffleMapTask or one ResultTask per partition to compute. For a
 *       ShuffleMapStage, the partition ids are also recorded in pendingPartitions.
 *   (8) Wrap the tasks, stage id, attempt id and jobId in a TaskSet and submit it through
 *       taskScheduler.submitTasks. If there are no tasks, mark the stage as finished and
 *       submit its waiting child stages.
 */
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // First figure out the indexes of partition ids to compute.
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
  // with this Stage
  val properties = jobIdToActiveJob(jobId).properties
  // Add the stage to the set of running stages.
  runningStages += stage
  // SparkListenerStageSubmitted should be posted before testing whether tasks are
  // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
  // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
  // event.
  // Tell the OutputCommitCoordinator that the stage is starting; the highest partition id
  // depends on the stage type.
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  // Derive each task's preferred locations from the locality of its RDD partition, so that
  // tasks can be placed close to their data. getPreferredLocs returns the preferred
  // locations of a partition, and these become the preferred locations of the task that
  // computes it.
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    // NonFatal matches ordinary exceptions. Fatal errors such as VirtualMachineError
    // (which includes OutOfMemoryError) are not caught here.
    case NonFatal(e) =>
      // Create a new attempt (a new StageInfo with a new attempt id) for this stage.
      stage.makeNewStageAttempt(partitionsToCompute.size)
      // Post the submitted event before aborting. SparkListenerBus.doPostEvent() dispatches
      // it to the registered listeners.
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  // Create a new attempt (a new StageInfo with a new attempt id) for this stage.
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  // Post the StageSubmitted event. SparkListenerBus.doPostEvent() dispatches it to the
  // registered SparkListeners (for example those behind the web UI and the event log).
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
  // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
  // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
  // the serialized copy of the RDD and for each task we will deserialize it, which means each
  // task gets a different copy of the RDD. This provides stronger isolation between tasks that
  // might modify state of objects referenced in their closures. This is necessary in Hadoop
  // where the JobConf/Configuration object is not thread-safe.
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        JavaUtils.bufferToArray(
          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
      case stage: ResultStage =>
        JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
    }
    // Broadcast the serialized task binary.
    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // In the case of a failure during serialization, abort the stage.
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString, Some(e))
      runningStages -= stage
      // Abort execution
      return
    case NonFatal(e) =>
      abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  // Build the tasks for this stage. taskIdToLocations computed above supplies the
  // preferred locations of each task.
  val tasks: Seq[Task[_]] = try {
    val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
    stage match {
      // A ShuffleMapStage produces ShuffleMapTasks.
      case stage: ShuffleMapStage =>
        // Reset the stage's pending partitions.
        stage.pendingPartitions.clear()
        // One ShuffleMapTask per partition: one partition, one task, one set of locations.
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          stage.pendingPartitions += id
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
            Option(sc.applicationId), sc.applicationAttemptId)
        }
      // A ResultStage produces ResultTasks.
      case stage: ResultStage =>
        // One ResultTask per partition.
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, properties, serializedTaskMetrics,
            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }
  // In short: for every task in the TaskSet that DAGScheduler hands to TaskScheduler, the
  // task's preferred locations are those of the partition it computes.
  if (tasks.size > 0) {
    logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
      s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
    // Whether the tasks are ShuffleMapTasks or ResultTasks, the TaskSet's priority is the
    // jobId, which increases monotonically. submitTasks creates a TaskSetManager and adds it
    // to the scheduling pool. submitTasks is declared on the TaskScheduler interface, so the
    // code to read is its implementation, TaskSchedulerImpl.submitTasks.
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run
    markStageAsFinished(stage, None)
    // Build the message that is written to the debug log.
    val debugString = stage match {
      case stage: ShuffleMapStage =>
        s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})"
      case stage : ResultStage =>
        s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
    }
    logDebug(debugString)
    submitWaitingChildStages(stage)
  }
}
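
The core of submitMissingTasks, "one task per missing partition, each carrying that partition's preferred locations", can be reduced to a few lines. The sketch below is a simplified illustration, not Spark code: `ToyTask`, `ToyTaskSet`, `buildTaskSet` and the `locationsOf` function are hypothetical, and serialization, broadcasting and listener events are omitted.

```scala
object MissingTasksSketch {
  // Hypothetical, simplified shapes; Spark's Task and TaskSet carry far more state.
  final case class ToyTask(stageId: Int, partitionId: Int, preferredLocs: Seq[String])
  final case class ToyTaskSet(tasks: Seq[ToyTask], stageId: Int, priority: Int)

  // Builds one task per partition that has not been computed yet.
  // Returns None when nothing is missing, i.e. the stage is already done.
  def buildTaskSet(
      stageId: Int,
      jobId: Int,
      numPartitions: Int,
      computed: Set[Int],
      locationsOf: Int => Seq[String]): Option[ToyTaskSet] = {
    // Partitions whose output is still missing.
    val partitionsToCompute = (0 until numPartitions).filterNot(computed)
    // Each task inherits the preferred locations of its partition.
    val tasks = partitionsToCompute.map(p => ToyTask(stageId, p, locationsOf(p)))
    // As in the listing above, the job id serves as the task set's priority.
    if (tasks.nonEmpty) Some(ToyTaskSet(tasks, stageId, priority = jobId)) else None
  }

  def main(args: Array[String]): Unit = {
    val taskSet = buildTaskSet(3, 7, 4, Set(1), p => Seq("host" + p))
    taskSet.foreach(ts => ts.tasks.foreach(println))
  }
}
```

Returning `None` for an empty task list mirrors the branch above in which the stage is marked as finished without anything being sent to the task scheduler.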

The following call then enters task scheduling:

// Every stage ends up as a TaskSet that is submitted for execution.
// submitTasks() is declared on the TaskScheduler interface, so the code to read next is
// its implementation, TaskSchedulerImpl.submitTasks.
taskScheduler.submitTasks(new TaskSet(
  tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))

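So far we have only traced the part of the flow shown below:
[Figure: the part of the scheduling flow covered so far (image not preserved)]

Task execution is covered in a later post.
Running Spark tasks: http://blog.csdn.net/qq_21383435/article/details/78701330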
