Spark 2.0.x Source Code Deep Dive: How the DAGScheduler Splits Stages


WeChat: 519292115

Email: taosiyuan163@163.com


Please respect the original work; do not repost without permission!!


Spark is currently one of the most popular frameworks in the big data space. It efficiently supports offline batch processing, real-time computation, machine learning and more, and reading its source code will deepen your understanding of the framework.

In this series I will walk through the core components of Spark 2.0.x one by one, including the BlockManager, TaskScheduler and others in later chapters.


I have read a number of online articles that introduce the DAGScheduler source code. Some of them are quite good and cover the key points, but they tend to be one-sided: they rarely explain the finer implementation details, why things are implemented that way, or how the other components are involved. They know the what but not the why.

This DAGScheduler walkthrough covers the underlying data structures, how each detail is implemented, the algorithms and optimizations, the interactions between components, and it corrects some mistaken explanations that circulate online.


The main components that interact with the DAGScheduler and come up in this walkthrough are listed below; they are introduced in the order the code reaches them:

1. The most commonly used RDD operators:

① MapPartitionsRDD: transformation-style RDDs that produce a OneToOneDependency; typical operators are map, filter, etc.

② ShuffledRDD: depends on a single parent RDD and may produce a shuffle, the operation that hurts cluster performance the most; it creates a ShuffleDependency, which splits the current job into stages. Typical operators are groupByKey, reduceByKey, etc.

③ CoGroupedRDD: depends on multiple parent RDDs and may likewise produce a shuffle; it creates ShuffleDependency instances and splits the current job into stages. The typical operator is join.

④ Action operators: the entry point of a job; running one triggers creation of the ResultStage. Typical examples are collect, count, etc.

2. The two kinds of Stage:

① ShuffleMapStage: extends Stage and ends right before a shuffle operation. It may contain several transformations; when it runs, it saves the map-side output files that the reduce tasks will fetch later.

② ResultStage: the last stage of a job, triggered when an action is executed.

3. MapOutputTracker: stores the metadata of shuffle-stage outputs; its master (driver) and worker (executor) subclasses are implemented differently.

4. Partitioner, which is HashPartitioner by default.

5. BlockManager

...and so on. (A small end-to-end sketch of how these pieces fit together follows.)
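As a rough illustration of how these pieces line up in one concrete job, here is a minimal sketch (it assumes an existing SparkContext named sc and a placeholder input path; it is not code from this article):

val lines  = sc.textFile("...")                 // HadoopRDD wrapped in a MapPartitionsRDD
val pairs  = lines.map(word => (word, 1))       // MapPartitionsRDD, OneToOneDependency
val counts = pairs.reduceByKey(_ + _)           // ShuffledRDD, ShuffleDependency: stage boundary
val total  = counts.count()                     // action: calls runJob and creates the ResultStage

// One ShuffleMapStage (textFile + map, writing map output) and one ResultStage (reading the
// shuffled data and counting) are produced; HashPartitioner decides which reduce partition each
// key goes to, and MapOutputTracker records where the map output lives.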



DAGScheduler is one of Spark's most central components. In short, it is responsible for splitting a job into an optimal set of stages and for handing all of the resulting tasks to the TaskScheduler.

Here is its own description from the source code:

The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a minimal schedule to run the job. It then submits stages as TaskSets to an underlying TaskScheduler implementation that runs them on the cluster. A TaskSet contains fully independent tasks that can run right away based on the data that's already on the cluster (e.g. map output files from previous stages), though it may fail if this data becomes unavailable.

The DAGScheduler is created inside SparkContext on the driver (see my earlier SparkContext article for the steps), and what triggers it to build the DAG of stages is an action operator.

OK, let's start from count:


count is an action; when it executes, what gets called internally is runJob.

/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
Step into it:

def runJob[T, U: ClassTag](    rdd: RDD[T],    func: (TaskContext, Iterator[T]) => U,    partitions: Seq[Int],    resultHandler: (Int, U) => Unit): Unit = {  if (stopped.get()) {    // 如果是停止状态就跑出异常    throw new IllegalStateException("SparkContext has been shutdown")  }  val callSite = getCallSite  val cleanedFunc = clean(func)  logInfo("Starting job: " + callSite.shortForm)  if (conf.getBoolean("spark.logLineage", false)) {    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)  }  // 进入runJob核心方法  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)  // ConsoleProgressBar 控制台输出的job进度条  progressBar.foreach(_.finishAll())  // 最终递归调用doCheckpoint来检查每个父RDD是否需要checkpoint  // checkpoint一般是存储数据到HDFS上,并切掉之前的RDDlineage  // 以后的RDD若要重用的话都会先检查是否有checkpoint  rdd.doCheckpoint()}

This calls dagScheduler.runJob, which returns a JobWaiter that blocks until the job completes, and submits the job to the DAGScheduler.

def runJob[T, U](    rdd: RDD[T],    func: (TaskContext, Iterator[T]) => U,    partitions: Seq[Int],    callSite: CallSite,    resultHandler: (Int, U) => Unit,    properties: Properties): Unit = {  val start = System.nanoTime  // 提交job 里面会返回一个阻塞线程JobWaiter等待此Job的完成  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)  ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)  // 根据job完成情况匹配不同的Log  waiter.completionFuture.value.get match {    case scala.util.Success(_) =>      logInfo("Job %d finished: %s, took %f s".format        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))    case scala.util.Failure(exception) =>      logInfo("Job %d failed: %s, took %f s".format        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.      val callerStackTrace = Thread.currentThread().getStackTrace.tail      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)      throw exception  }}
submitJob returns the JobWaiter and posts a job-submission event message onto the event loop's queue:

def submitJob[T, U](    rdd: RDD[T],    func: (TaskContext, Iterator[T]) => U,    partitions: Seq[Int],    callSite: CallSite,    resultHandler: (Int, U) => Unit,    properties: Properties): JobWaiter[U] = {  // Check to make sure we are not launching a task on a partition that does not exist.  // 检查分区是否存在,保证task正常运行  val maxPartitions = rdd.partitions.length  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>    throw new IllegalArgumentException(      "Attempting to access a non-existent partition: " + p + ". " +        "Total number of partitions: " + maxPartitions)  }  // nextJobId增加一个JobId作当前Job的标识(+1  val jobId = nextJobId.getAndIncrement()  if (partitions.size == 0) {    // Return immediately if the job is running 0 tasks    // 如果没有task就立即返回JobWaiter    return new JobWaiter[U](this, jobId, 0, resultHandler)  }  // partitions做断言,确保下分区是否大于0  assert(partitions.size > 0)  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]  // 首先构造一个JobWaiter阻塞线程 等待job完成 然后把完成结果提交给resultHandler  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)  // DAGScheduler的事件队列,结构为LinkedBlockingDeque  // 因为可能集群同时运行着多个Job,而DAGSchduler默认是FIFO先进先出的资源调度  // 这里传入的事件类型为JobSubmitted,而在eventProcessLoop会调用doOnReceive  // 来匹配事件类型并执行对应的操作,最终会匹配到dagScheduler.handleJobSubmitted(....)  eventProcessLoop.post(JobSubmitted(    jobId, rdd, func2, partitions.toArray, callSite, waiter,    SerializationUtils.clone(properties)))  waiter}
eventProcessLoop extends EventLoop and is dedicated to receiving and handling all event messages that callers send during the job and stage phases.

When eventProcessLoop posts an event, it actually puts it onto a message queue; a dedicated Java thread keeps blocking on take() against that queue, pulls events off it, and pattern-matches on the event type to handle each one.

// Dedicated to receiving and handling all event messages sent during the job/stage phases
private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)

private[scheduler] class DAGSchedulerEventProcessLoop(dagScheduler: DAGScheduler)
  extends EventLoop[DAGSchedulerEvent]("dag-scheduler-event-loop") with Logging {

  private[this] val timer = dagScheduler.metricsSource.messageProcessingTimer

  /**
   * The main event loop of the DAG scheduler.
   */
  // The loop's dedicated Java thread keeps calling this method
  override def onReceive(event: DAGSchedulerEvent): Unit = {
    val timerContext = timer.time()
    try {
      doOnReceive(event)
    } finally {
      timerContext.stop()
    }
  }

  private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
    // Job submission
    case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
DAGSchedulerEventProcessLoop extends the base class EventLoop; below is how its thread processes events.

private[spark] abstract class EventLoop[E](name: String) extends Logging {  private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()  private val stopped = new AtomicBoolean(false)  // 生成的java.lang.Thread线程  // 这个线程会不停的去eventQueue取出event事件消息然后onReceive做对应的  private val eventThread = new Thread(name) {    setDaemon(true)    override def run(): Unit = {      try {        while (!stopped.get) {          // 提取事件队列里的事件信息          val event = eventQueue.take()          try {            // 调用onReceive模式匹配做事件驱动            onReceive(event)          } catch {            case NonFatal(e) =>              try {                onError(e)              } catch {                case NonFatal(e) => logError("Unexpected error in " + name, e)              }          }        }      } catch {        case ie: InterruptedException => // exit even if eventQueue is not empty        case NonFatal(e) => logError("Unexpected error in " + name, e)      }    }  }
Next comes the creation of the ResultStage.

// eventProcessLoop接受到提交job的事件任务后就会触发,开始划分stageprivate[scheduler] def handleJobSubmitted(jobId: Int,    finalRDD: RDD[_],    func: (TaskContext, Iterator[_]) => _,    partitions: Array[Int],    callSite: CallSite,    listener: JobListener,    properties: Properties) {  var finalStage: ResultStage = null  try {    // New stage creation may throw an exception if, for example, jobs are run on a    // HadoopRDD whose underlying HDFS files have been deleted.    // 创建ResultStage,这里才是真正开始处理提交的job划分stage的时候    // 它会从后往前找递归遍历它的每一个父RDD,从持久化中抽取反之重新计算    // 补充下:stage分为shuffleMapStageResultStage两种    // 每个job都是由1ResultStage0+ShuffleMapStage组成    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)  } catch {    case e: Exception =>      logWarning("Creating new stage failed due to exception - job: " + jobId, e)      listener.jobFailed(e)      return  }  // createResultStage封装在ActiveJob,你可以把它看做成Job的代表  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)  // 清除每个被持久化的RDD分区的位置  clearCacheLocs()  logInfo("Got job %s (%s) with %d output partitions".format(    job.jobId, callSite.shortForm, partitions.length))  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")  logInfo("Parents of final stage: " + finalStage.parents)  logInfo("Missing parents: " + getMissingParentStages(finalStage))  val jobSubmissionTime = clock.getTimeMillis()  // HashMap结构,维护着jobIdjobIdToActiveJob的映射关系  jobIdToActiveJob(jobId) = job  // HashSet结构,维护着所有ActiveJob  activeJobs += job  // finalStage一旦生成就会把封装自己的ActiveJob注册到自己的_activeJob  // 而整个Job结束后就会清除掉  finalStage.setActiveJob(job)  // 提取出jobId对应的所有StageIds并转换才数组  val stageIds = jobIdToStageIds(jobId).toArray  // 提取出每个stage的最新尝试信息,当job启动时会告知SparkListenersJob  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))  // 封装一个SparkListenerEvent,通知SparkListenersJob启动了,并传递Job相关信息  // 底层会把这个event事件posteventQueue中,一个单独的Java的线程池会不停的poll出来并做对应的处理  listenerBus.post(    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))  // 开始提交Stage  submitStage(finalStage)}

When building a job, you can see that the ResultStage is created starting from the last RDD. The scheduler keeps walking backwards through the parent RDDs' dependencies, checking whether each one was persisted before (cached, materialized, or checkpointed). If not, it pulls out that RDD's own parents and keeps checking, all the way until it finds a persisted RDD or reaches the very first RDD. With that in hand, computation then proceeds from front to back until the ResultStage is produced, as sketched below.
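A small sketch of that backward walk with two shuffles (assuming an existing SparkContext sc; the variable names are made up):

val rddA = sc.parallelize(1 to 100).map(i => (i % 10, i))
val rddB = rddA.reduceByKey(_ + _)            // shuffle #1
val rddC = rddB.map { case (k, v) => (v, k) }
val rddD = rddC.sortByKey()                   // shuffle #2
rddD.collect()                                // action

// createResultStage for the final RDD first needs the ShuffleMapStage for shuffle #2; creating
// that stage in turn creates the ShuffleMapStage for shuffle #1. Stages are therefore created
// parents-first while walking backwards, and later executed from front to back.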

/** * Create a ResultStage associated with the provided jobId. */private def createResultStage(    rdd: RDD[_],    func: (TaskContext, Iterator[_]) => _,    partitions: Array[Int],    jobId: Int,    callSite: CallSite): ResultStage = {  // 开始创建ResultStage的父stage  // 里面有多个嵌套获取shuffle依赖和循环创建shuffleMapStage,若没有shuffle操作返回为空List  val parents = getOrCreateParentStages(rdd, jobId)  // 当前的stageId标识+1  val id = nextStageId.getAndIncrement()  // 放入刚刚生成的父stage等核心参数,生成ResultStage  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)  // ResultStage和它的ID加入stageIdToStage  stageIdToStage(id) = stage  // 更新jobIdsjobIdToStageIds  updateJobIdStageIdMaps(jobId, stage)  // 返回这个ResultStage  stage}


OK... let's start from getOrCreateParentStages. Pay close attention here: once you step into this function you will run into a lot of nested iteration and interaction between several components. It confused me a bit the first time I read it too.


/** * Get or create the list of parent stages for a given RDD.  The new Stages will be created with * the provided firstJobId. */// 创建每个父stage,而只有shuffle操作才会产生stage// 所以这里返回的Stage可能为null,也就是只有一个resultStageprivate def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {  // 遍历当前父RDD的依赖关系,直到找到它包含的第一个ShuffleDependency  // (可能多个,也可能没有)然后放入HashSet并返回  // 然后用map依次对所有ShuffleDependency创建所有的父shuffleMapStage  // 补充:在后面的代码里面会无限循环调用这段代码来创建父stage  // 如果里面匹配不到ShuffleDependency 那么代码就会在此终止,也就是创建父stage循环终止  getShuffleDependencies(rdd).map { shuffleDep =>    // 里面会创建当前拿到的ShuffleDependency的所有父ShuffleMapStage    getOrCreateShuffleMapStage(shuffleDep, firstJobId)  }.toList}
Start from getShuffleDependencies. This only extracts the shuffle dependencies of the current RDD (a job's stages are split on shuffles: one job produces zero or more ShuffleMapStages and exactly one ResultStage). If a dependency is not a ShuffleDependency, it keeps extracting from the parent RDDs, iterating until shuffle dependencies are found, or there are none at all.

/** * Returns shuffle dependencies that are immediate parents of the given RDD. * * This function will not return more distant ancestors.  For example, if C has a shuffle * dependency on B which has a shuffle dependency on A: * * A <-- B <-- C * * calling this function with rdd C will only return the B <-- C dependency. * * This function is scheduler-visible for the purpose of unit testing. */// 只会抽取出第一个包含ShuffleDependencyRDDShuffleDependencyprivate[scheduler] def getShuffleDependencies(    rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {  // 用来存放ShuffleDependencyHashSet  val parents = new HashSet[ShuffleDependency[_, _, _]]  // 临时存放后面遍历过的RDD  val visited = new HashSet[RDD[_]]  // Stack是一个last-in-first-out (LIFO)后进先出的数据结构  val waitingForVisit = new Stack[RDD[_]]  // rdd pushwaitingForVisit  waitingForVisit.push(rdd)  // 只要waitingForVisit不为空就循环下去  while (waitingForVisit.nonEmpty) {    // 取出顶部的第一个元素 RDD    val toVisit = waitingForVisit.pop()    // 如果刚刚拿出的RDD是否包含在visited    if (!visited(toVisit)) {      // 就把这个RDD加入visited      // 这个临时visited使用来鉴别RDD之前是否有没被这里面的代码使用过      visited += toVisit      // 遍历这个RDD的所有依赖并做匹配,返回的是Seq[Dependency[_]]序列类型      // 依次遍历出来的RDD会做匹配,非ShuffleDependencyRDD会放回waitingForVisit      // 然后把后来进来的RDD第一个pop出来继续匹配,一直匹配到有ShuffleDependency为止,当然也可能没有      // 补充:返回的ShuffleDependency可能没有,可能是一个也可能是多个      // 比如像CoGroupedRDD就是多个RDD产生的结果依赖,而ShuffledRDD只有一个父RDD      toVisit.dependencies.foreach {        case shuffleDep: ShuffleDependency[_, _, _] =>          // 如果匹配到ShuffleDependency就放进parents          parents += shuffleDep          // 如果匹配到的是其他任何依赖就把这个RDD的父RDD pushwaitingForVisit        case dependency =>          waitingForVisit.push(dependency.rdd)      }    }  }  // 遍历完后把存放ShuffleDependencyparents返回  parents}
Inside the while loop it iterates over all of the current dependencies of each RDD it pops. Note: do not let the method name and return type mislead you into thinking it collects the dependencies of the RDD and all of its ancestors; at each step it only looks at the dependencies of the RDD being visited. The return type is a collection because operators like CoGroupedRDD (e.g. join) depend on multiple parent RDDs; every RDD subclass overrides the base class's getDependencies, just with different implementations.

/** * Get the list of dependencies of this RDD, taking into account whether the * RDD is checkpointed or not. */final def dependencies: Seq[Dependency[_]] = {  // 查看RDD之前是否被checkpoint  // 补充下:checkpoint了的RDD之前的父RDDlineage会被切断清除  // OneToOneDependency的依赖关系是子RDD每个Partition只依赖父RDD的一个Partition  // 如果有被checkpoint过的RDD就返回都是OneToOneDependency依赖的数组  checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {    // 如果没有被checkpoint过 就判断当前RDDdependencies_是否存在    // dependencies_ 结构是Seq[Dependency[_]] 里面维护着这个RDD的所有依赖    if (dependencies_ == null) {      // 如果dependencies_为空,就调用getDependencies获取Dependencies      // 不同的RDD子类会复写getDependencies方法,比如ShuffledRDDCoGroupedRDD      // 他们都会根据父RDD或者分区数等参数来生成Dependencies      // 最后赋值给dependencies_      dependencies_ = getDependencies    }    // 返回dependencies_    dependencies_  }}

/**
 * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 */
protected def getDependencies: Seq[Dependency[_]] = deps

A plain RDD simply returns the dependencies it was constructed with:

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {
whereas ShuffledRDD, CoGroupedRDD, MapPartitionsRDD and the like override getDependencies with their own logic.

Since we are here anyway, let's take a quick look at how a few other operators build their dependencies.


When we call reduceByKey without specifying a partitioner, the default is HashPartitioner and a shuffle may occur. It has several overloads, all of which end up calling combineByKeyWithClassTag.

Because shuffle is widely regarded as the most performance-critical part of a cluster workload, Spark was designed from the start to avoid it wherever possible, so before a ShuffleDependency is finally created there is always a partitioner check.

def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

def combineByKeyWithClassTag[C](    createCombiner: V => C,    mergeValue: (C, V) => C,    mergeCombiners: (C, C) => C,    partitioner: Partitioner,    mapSideCombine: Boolean = true,    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0  if (keyClass.isArray) {    if (mapSideCombine) {      throw new SparkException("Cannot use map-side combining with array keys.")    }    if (partitioner.isInstanceOf[HashPartitioner]) {      throw new SparkException("HashPartitioner cannot partition array keys.")    }  }  // 用作map端和reduce端的聚合操作  val aggregator = new Aggregator[K, V, C](    self.context.clean(createCombiner),    self.context.clean(mergeValue),    self.context.clean(mergeCombiners))  // 判断下当前RDDpartitioner和父RDDpartitioner的属性是否相等  // 包括:partitioner中维护着不同的分区器(Hash/RangePartitioner)以及每个Key对应的分区  if (self.partitioner == Some(partitioner)) {    // 如果都一样的话就调用mapPartitions算子(Transformation算子)    // 避免了shuffle操作    self.mapPartitions(iter => {      val context = TaskContext.get()      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))    }, preservesPartitioning = true)  } else {    // 如果partitioner属性不相等的话就会引发shuffle,参数为当前RDDshuffled后的父RDD)和partitioner    new ShuffledRDD[K, V, C](self, partitioner)      .setSerializer(serializer)      .setAggregator(aggregator)      .setMapSideCombine(mapSideCombine)  }}

The HashPartitioner produced by default:

def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {    // 如果有多的othersRDD传入就加入到rddSeq里(++是两个list组合成一起)    val rdds = (Seq(rdd) ++ others)    // filter过滤掉每个rdd是否有partitioner并且每个partitionernumPartitions是否大于0    // 就是判断下之前的RDD有没有partitioner而且分区个数不为0    val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))    // 判断刚过滤出来的hasPartitioner是否存在    if (hasPartitioner.nonEmpty) {      // 如果rddPartitioner则用maxBy拿到刚刚过滤出来的rdd数组中分区数量最大的那个分区器      hasPartitioner.maxBy(_.partitions.length).partitioner.get    } else {      // 如果走到这里就代表之前所有的RDD都没有设置过Partitioner      // 如果之前我们通过参数设置过 就调用参数的并行度来设置分区 并生成HashPartitioner      if (rdd.context.conf.contains("spark.default.parallelism")) {        new HashPartitioner(rdd.context.defaultParallelism)      } else {        // 同样的 默认使用HashPartitioner,分区数为上游的所有RDD中最大分区数        new HashPartitioner(rdds.map(_.partitions.length).max)      }    }  }}
class HashPartitioner(partitions: Int) extends Partitioner {  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")  // 上游map端的分区个数  def numPartitions: Int = partitions  // reduce端划分分区的算法  def getPartition(key: Any): Int = key match {    case null => 0    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)  }  // 在所有RDD生成ShuffleDependency之前都会判断下两个分区数是否相等  override def equals(other: Any): Boolean = other match {    case h: HashPartitioner =>      // 比较的仅仅是分区个数      h.numPartitions == numPartitions    case _ =>      false  }  override def hashCode: Int = numPartitions}

As a side note, here is the algorithm HashPartitioner uses to assign a key to a downstream reduce partition when a shuffle happens:

def nonNegativeMod(x: Int, mod: Int): Int = {
  // Take the key's hashCode modulo the partitioner's numPartitions
  val rawMod = x % mod
  // If the remainder is negative, add the partition count; otherwise return the remainder as-is
  rawMod + (if (rawMod < 0) mod else 0)
}
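A quick check of that arithmetic (the partition count is arbitrary):

val p = new org.apache.spark.HashPartitioner(4)
p.getPartition(9)      // 9 % 4 = 1, non-negative, so partition 1
p.getPartition(-7)     // -7 % 4 = -3 on the JVM, then -3 + 4 = 1, so partition 1
// Keys are typed Any, so for non-numeric keys the same formula is applied to key.hashCode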

If the partitioners are equal, the result is simply a MapPartitionsRDD (a one-to-one dependency operator, more on that shortly), so no shuffle is produced. A sketch of that short-circuit follows.
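A small sketch of the short-circuit (assuming an existing SparkContext sc):

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(1 to 100).map(i => (i % 10, i))
val prePartitioned = pairs.partitionBy(new HashPartitioner(4))       // one shuffle here
val sums = prePartitioned.reduceByKey(new HashPartitioner(4), _ + _)
// prePartitioned.partitioner == Some(HashPartitioner(4)), and HashPartitioner.equals only
// compares numPartitions, so combineByKeyWithClassTag takes the mapPartitions branch and
// this reduceByKey adds no second shuffle.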

Otherwise a ShuffledRDD is created. Now back to where dependencies get extracted:

protected def getDependencies: Seq[Dependency[_]] = deps
ShuffledRDD overrides the dependency-building implementation:

You can see that in the end it news up a ShuffleDependency.

// 拿到RDD依赖。override def getDependencies: Seq[Dependency[_]] = {  // 首先拿到生成ShuffleDependency的成员参数serializer,有的话就直接get  val serializer = userSpecifiedSerializer.getOrElse {    // get不到就从sparkEnv执行环境中的serializerManager中拿取    val serializerManager = SparkEnv.get.serializerManager    // 根据是否map端是否聚合触发不同的提取方法    if (mapSideCombine) {      serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[C]])    } else {      serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[V]])    }  }  // 生成的ShuffleDependency会被放进list返回  // 补充下:这里只放回一个父DD的依赖  // 因为和CoGroupedRDD都是复写的RDDprotected def getDependencies: Seq[Dependency[_]] = deps  // 所以返回的时候得满足Seq[Dependency[_]]类型 就用list封装了  // 所以大家别被这个方法和返回类型的字面意思给蒙骗了  // 包括像getCacheLocs用来做task最佳位置的判断机制,它判断的也不仅仅是MEMORY级别  List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator, mapSideCombine))}

At this point the ShuffleDependency registers its information with the shuffleManager and the ContextCleaner, and, most importantly, it wraps its parent RDD.

All of the later recursive traversal of parent RDDs pulls them out of this dependency.

class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](    @transient private val _rdd: RDD[_ <: Product2[K, V]],    val partitioner: Partitioner,    val serializer: Serializer = SparkEnv.get.serializer,    val keyOrdering: Option[Ordering[K]] = None,    val aggregator: Option[Aggregator[K, V, C]] = None,    val mapSideCombine: Boolean = false)  extends Dependency[Product2[K, V]] {  // RDD  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName  // Note: It's possible that the combiner class tag is null, if the combineByKey  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.  private[spark] val combinerClassName: Option[String] =    Option(reflect.classTag[C]).map(_.runtimeClass.getName)  // 生成shuffleId,也就是通过nextShuffleId1  val shuffleId: Int = _rdd.context.newShuffleId()  // shuffleManager注册一个shuffle并且获得一个指定类型的ShuffleHandle  // 比如:在之前章节讲到的SprakEnv中默认使用的SortShuffleManager它会复写registerShuffle  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(    shuffleId, _rdd.partitions.length, this)  // ShuffleDependency注册到ContextCleaner对象中  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))}

Of course, every RDD has dependencies; different RDD types just wire up Dependency with different logic.

Here let's look at the MapPartitionsRDD mentioned earlier, using map as the example:

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

// 继承extends RDD[U](prev) 会产生OneToOneDependency依赖// 这里的参数:var prev: RDD[T] 是父RDDprivate[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](    var prev: RDD[T],    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)    preservesPartitioning: Boolean = false)  extends RDD[U](prev) {  // 默认:MapPartitionsRDD不会生成shuffle,也就不会产生ShuffleDependency,所以也就不会生成partitioner  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None  // 分区数用的第一个父RDD的分区数  override def getPartitions: Array[Partition] = firstParent[T].partitions  // 计算逻辑是根据最初RDD算子的func来决定的,如下  // runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U)  override def compute(split: Partition, context: TaskContext): Iterator[U] =    f(context, split.index, firstParent[T].iterator(split, context))  // 清除依赖,比如在checkpoint的时候 就会执行此方法  override def clearDependencies() {    super.clearDependencies()    prev = null  }}

It is not obvious where its dependency comes from; you might overlook the RDD it extends, because RDD itself does not create dependencies by default. The trick is that it goes through the overloaded RDD constructor that builds a one-to-one dependency:

/** Construct an RDD with just a one-to-one dependency on one parent */
// RDD constructor that wraps the single parent in a OneToOneDependency
def this(@transient oneParent: RDD[_]) =
  this(oneParent.context, List(new OneToOneDependency(oneParent)))
Operators with a OneToOneDependency, such as map and filter, have a one-to-one mapping between the child RDD's partitions and the parent RDD's partitions, so naturally no shuffle occurs. Like RangeDependency, it extends the narrow dependency class NarrowDependency.
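A quick illustration of what that mapping means (assuming sc; OneToOneDependency is a DeveloperApi class, so this is for inspection only):

val parent = sc.parallelize(1 to 8, numSlices = 4)
val child  = parent.map(_ * 2)                        // MapPartitionsRDD
val dep    = child.dependencies.head.asInstanceOf[org.apache.spark.OneToOneDependency[_]]
dep.getParents(2)                                     // List(2): child partition 2 reads only parent partition 2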

/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between partitions of the parent and child RDDs.
 */
@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}

ShuffleDependency and NarrowDependency are siblings, both extending Dependency:

/**
 * :: DeveloperApi ::
 * Base class for dependencies.
 */
// Two direct subclasses, plus two subclasses of NarrowDependency:
// 1: ShuffleDependency
// 2: NarrowDependency ---> RangeDependency
//                     ---> OneToOneDependency
@DeveloperApi
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}

Finally, one more: CoGroupedRDD. (This feels like a long detour, but I think understanding how the DAGScheduler splits stages requires knowing at least how these core operators are implemented underneath and how their dependencies work, because the details of stage splitting differ by operator. I originally planned a separate chapter on RDD operators, but I got a bit lazy and decided to cover these pieces together with the DAGScheduler, which also makes the connections easier to see.)


Take join as an example:

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {    throw new SparkException("HashPartitioner cannot partition array keys.")  }  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)  cg.mapValues { case Array(vs, w1s) =>    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])  }}

Go straight to its core dependency method. It is similar to reduceByKey: if an upstream RDD already has the same partitioner (and in practice the comparison is the partition count), a one-to-one dependency is produced directly.

The difference is that join operates on multiple RDDs, so more than one dependency is produced, returned as a Seq[Dependency[_]].

override def getDependencies: Seq[Dependency[_]] = {  // 这里跟shuffledRDDgetDependencies不一样的是它是多个RDD聚合产生  // 所以这里会拿到多个RDDShuffleDependency,而shuffledRDD仅仅是拿到父RDD的依赖  rdds.map { rdd: RDD[_] =>    // 对比的其实是分区数是否相等    if (rdd.partitioner == Some(part)) {      logDebug("Adding one-to-one dependency with " + rdd)      // 相等的话 就生产OneToOneDependency依赖      new OneToOneDependency(rdd)    } else {      logDebug("Adding shuffle dependency with " + rdd)      // 不相等 就生成ShuffleDependency依赖      new ShuffleDependency[K, Any, CoGroupCombiner](        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)    }  }}

OK, that covers how the core operators build their dependencies. Now let's pick up where the DAGScheduler left off, extracting dependencies.

If you have forgotten, go back and take a look:

dependencies_ = getDependencies

The traversal repeats until the nearest ShuffleDependency of this RDD is found (or there may be none), and then ShuffleMapStage creation begins.

Back to the earlier getOrCreateParentStages method:

/** * Get or create the list of parent stages for a given RDD.  The new Stages will be created with * the provided firstJobId. */// 创建每个父stage,而只有shuffle操作才会产生stage// 所以这里返回的Stage可能为null,也就是只有一个resultStageprivate def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {  // 遍历当前父RDD的依赖关系,直到找到它包含的第一个ShuffleDependency  // (可能多个,也可能没有)然后放入HashSet并返回  // 然后用map依次对所有ShuffleDependency创建所有的父shuffleMapStage  // 补充:在后面的代码里面会无限循环调用这段代码来创建父stage  // 如果里面匹配不到ShuffleDependency 那么代码就会在此终止,也就是创建父stage循环终止  getShuffleDependencies(rdd).map { shuffleDep =>    // 里面会创建当前拿到的ShuffleDependency的所有父ShuffleMapStage    getOrCreateShuffleMapStage(shuffleDep, firstJobId)  }.toList}
Before creating a ShuffleMapStage, it first tries to look up an existing one in shuffleIdToMapStage by shuffleId (if one was created before, it was added to shuffleIdToMapStage so the same shuffle can be reused).

Only if none is found does it call getMissingAncestorShuffleDependencies.

The whole method contains several layers of nested iteration; read the inline comments carefully.

/** * Gets a shuffle map stage if one exists in shuffleIdToMapStage. Otherwise, if the * shuffle map stage doesn't already exist, this method will create the shuffle map stage in * addition to any missing ancestor shuffle map stages. */private def getOrCreateShuffleMapStage(    shuffleDep: ShuffleDependency[_, _, _],    firstJobId: Int): ShuffleMapStage = {  // 通过从ShuffleDependency提取到的shuffleId来提取shuffleIdToMapStage中的ShuffleMapStage  shuffleIdToMapStage.get(shuffleDep.shuffleId) match {      //  如果能提取到 就直接返回    case Some(stage) =>      stage      // 如果提取不到就会依次找到所有父ShuffleDependencies并且构建所有父ShuffleMapStage    case None =>      // Create stages for all missing ancestor shuffle dependencies.      // 找到之前还未注册到shuffleIdToMapStage的父RDDshuffle dependencies      // 这个方法会拿到rdd的所有ShuffleDependency      // 里面还有个逻辑相似的迭代嵌套提取ShuffleDependency方法,所以这段代码很消耗性能      getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>        // Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies        // that were not already in shuffleIdToMapStage, it's possible that by the time we        // get to a particular dependency in the foreach loop, it's been added to        // shuffleIdToMapStage by the stage creation process for an earlier dependency. See        // SPARK-13902 for more information.        // 根据遍历出来的所有ShuffleDependencies依次创建所有父ShuffleMapStage        // 因为返回出来的ShuffleDependency存储结构是Stack,所以是从最第一个ShuffleDependency开始创建        if (!shuffleIdToMapStage.contains(dep.shuffleId)) {          createShuffleMapStage(dep, firstJobId)        }      }      // Finally, create a stage for the given shuffle dependency.      // 最后会创建当前ShuffleDependencyShuffleMapStage      createShuffleMapStage(shuffleDep, firstJobId)  }}
The structure and role of shuffleIdToMapStage:

// Mapping from shuffle dependency ID to its ShuffleMapStage; it only covers running jobs and is cleared when they finish
// When a ShuffleMapStage is created, its shuffleId -> stage entry is added here so that later identical shuffles can reuse it
private[scheduler] val shuffleIdToMapStage = new HashMap[Int, ShuffleMapStage]
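This reuse is what shows up as a "skipped" stage in the web UI when two jobs share a shuffle. A rough sketch (assuming sc; whether the second job actually skips recomputation depends on the shuffle's map output still being registered):

val base = sc.parallelize(1 to 1000).map(i => (i % 10, 1)).reduceByKey(_ + _)
base.count()     // job 1 runs the ShuffleMapStage for the reduceByKey shuffle
base.collect()   // job 2 needs the same shuffleId; if its map output is still available,
                 // the ShuffleMapStage is not rerun and the UI marks it as skipped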

If there is no reusable ShuffleMapStage, getMissingAncestorShuffleDependencies is called:

/** Find ancestor shuffle dependencies that are not registered in shuffleToMapStage yet */private def getMissingAncestorShuffleDependencies(    rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {  // Stack是一个last-in-first-out (LIFO)后进先出的数据结构  // 这里之所以用stack是用来待会生成ShuffleMapStage是从最后一个ShuffleDependency开始  val ancestors = new Stack[ShuffleDependency[_, _, _]]  // 临时存放RDD  val visited = new HashSet[RDD[_]]  // We are manually maintaining a stack here to prevent StackOverflowError  // caused by recursively visiting  val waitingForVisit = new Stack[RDD[_]]  // 把父RDDpushwaitingForVisit  waitingForVisit.push(rdd)  while (waitingForVisit.nonEmpty) {    val toVisit = waitingForVisit.pop()    // 判断visited是否包含刚从waitingForVisit.pop出来的RDD    if (!visited(toVisit)) {      // 如果不包含就加入      visited += toVisit      // 这里会拿到父RDDShuffleDependency,可能没有,也可能是一个或者多个      // 简单的说里面的实现其实就是一直遍历到之前有可复用的RDD为止,然后把这个阶段遍历的所有RDD的依赖      // 都加入到ancestors中,用来待会创建ShuffleMapStage      getShuffleDependencies(toVisit).foreach { shuffleDep =>        if (!shuffleIdToMapStage.contains(shuffleDep.shuffleId)) {          // 如果shuffleIdToMapStage不包含ShuffleDependencyshuffleId,就pushancestors          ancestors.push(shuffleDep)          // ShuffleDependency的父RDD pushwaitingForVisit          // 继续while循环取出父RDD的父RDD依赖..直到遍历完所有ShuffleDependency或者被提取到          waitingForVisit.push(shuffleDep.rdd)        } // Otherwise, the dependency and its ancestors have already been registered.      }    }  }  // 返回的包含所有未注册或者已经注册进shuffleIdToMapStage的所有父RDD依赖,也可能返回为空  ancestors}

Now the ShuffleMapStage is created. A ShuffleMapStage ends right before a shuffle operation and may contain several transformations; when it runs, it saves the map-side output files that the reduce tasks will fetch later.

Inside, it calls back into getOrCreateParentStages again and again; it is worth rereading that method, since it is effectively the entry point for building everything behind the ResultStage.

The recursion keeps going until it reaches a reusable stage or the very first stage:

/** * Creates a ShuffleMapStage that generates the given shuffle dependency's partitions. If a * previously run stage generated the same shuffle data, this function will copy the output * locations that are still available from the previous shuffle to avoid unnecessarily * regenerating data. */def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {  // ShuffleDependency的父RDD  val rdd = shuffleDep.rdd  // 多少个分区  val numTasks = rdd.partitions.length  // 用父RDD循环调用,每次调用都是前一个父RDD  // 在这里其实就会一直递归循环直到拿到首个stage才退出来  // 最后把生成的ShuffleMapStage加入shuffleIdToMapStage以便后面直接从中拿取  val parents = getOrCreateParentStages(rdd, jobId)  // 标记当前StageId nextStageId+1  val id = nextStageId.getAndIncrement()  // 拿到之前的stages等核心参数后就可以构建ShuffleMapStage  val stage = new ShuffleMapStage(    id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)  // 把刚创建的ShuffleMapStage赋值给stageIdToStage  stageIdToStage(id) = stage  // 赋值给shuffleIdToMapStage  // 若后面的代码再次生成对应的ShuffleMapStage就可以从shuffleIdToMapStage中直接拿取了  shuffleIdToMapStage(shuffleDep.shuffleId) = stage  // 更新jobIdsjobIdToStageIds  updateJobIdStageIdMaps(jobId, stage)  // 这里会把shuffle信息注册到Driver上的MapOutputTrackerMastershuffleStatuses  if (!mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {    // Kind of ugly: need to register RDDs with the cache and map output tracker here    // since we can't do it in the RDD constructor because # of partitions is unknown    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")    // Shuffle信息注册到自己DriverMapOutputTrackerMaster    // 生成的是shuffleIdShuffleStatus的映射关系    // 在后面提交Job的时候还会根据它来的验证map stage是否已经准备好    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)  }  // 最后返回生成的ShuffleMapStage  stage}

Once the ShuffleMapStage is created, you can see the shuffle being registered into shuffleStatuses on the driver's MapOutputTrackerMaster, which is used later both for validation and for the reduce side to fetch the map output.

def registerShuffle(shuffleId: Int, numMaps: Int) {
  if (shuffleStatuses.put(shuffleId, new ShuffleStatus(numMaps)).isDefined) {
    throw new IllegalArgumentException("Shuffle ID " + shuffleId + " registered twice")
  }
}

// On the driver: mapping from shuffleId to its ShuffleStatus
private val shuffleStatuses = new ConcurrentHashMap[Int, ShuffleStatus]().asScala

ShuffleStatus mainly maintains, for a single ShuffleMapStage of a job, the mapping from map partition ids to MapStatus. A MapStatus holds the map output information of one partition: the BlockManagerId where it ran (i.e. its address) and the sizes of the blocks it produced. Both come into play later, for example when computing the best locations for tasks.

// Array indexed by partition number, holding each partition's MapStatus
// A MapStatus carries the BlockManagerId (where the task ran) and the block size each reduce task will fetch
private[this] val mapStatuses = new Array[MapStatus](numPartitions)
private[spark] sealed trait MapStatus {
  /** Location where this task was run. */
  def location: BlockManagerId

  /**
   * Estimated size for the reduce block, in bytes.
   *
   * If a block is non-empty, then this method MUST return a non-zero size.  This invariant is
   * necessary for correctness, since block fetchers are allowed to skip zero-size blocks.
   */
  def getSizeForBlock(reduceId: Int): Long
}
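To make the role of MapStatus concrete, here is a conceptual sketch of what a reduce-side consumer derives from it (MapStatus is private[spark], so this is illustration only; statuses and myReduceId are hypothetical):

val statuses: Array[MapStatus] = ???        // one entry per map partition, from the ShuffleStatus
val myReduceId = 3
val fetchPlan = statuses.zipWithIndex.collect {
  case (s, mapId) if s != null && s.getSizeForBlock(myReduceId) > 0 =>
    (s.location, mapId, s.getSizeForBlock(myReduceId))   // where to fetch, which map output, how many bytes
}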

OK. After repeatedly iterating over the previous RDDs/stages and finding the nearest point at which a stage can be created, stages are then built from front to back until the final ResultStage is produced (of course, the job may consist of just a single ResultStage). Now let's return to the original handleJobSubmitted (look back at it if you have forgotten):

// eventProcessLoop接受到提交job的事件任务后就会触发,开始划分stageprivate[scheduler] def handleJobSubmitted(jobId: Int,    finalRDD: RDD[_],    func: (TaskContext, Iterator[_]) => _,    partitions: Array[Int],    callSite: CallSite,    listener: JobListener,    properties: Properties) {  var finalStage: ResultStage = null  try {    // New stage creation may throw an exception if, for example, jobs are run on a    // HadoopRDD whose underlying HDFS files have been deleted.    // 创建ResultStage,这里才是真正开始处理提交的job划分stage的时候    // 它会从后往前找递归遍历它的每一个父RDD,从持久化中抽取反之重新计算    // 补充下:stage分为shuffleMapStageResultStage两种    // 每个job都是由1ResultStage0+ShuffleMapStage组成    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)  } catch {    case e: Exception =>      logWarning("Creating new stage failed due to exception - job: " + jobId, e)      listener.jobFailed(e)      return  }  // createResultStage封装在ActiveJob,你可以把它看做成Job的代表  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)  // 清除每个被持久化的RDD分区的位置  clearCacheLocs()  logInfo("Got job %s (%s) with %d output partitions".format(    job.jobId, callSite.shortForm, partitions.length))  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")  logInfo("Parents of final stage: " + finalStage.parents)  logInfo("Missing parents: " + getMissingParentStages(finalStage))  val jobSubmissionTime = clock.getTimeMillis()  // HashMap结构,维护着jobIdjobIdToActiveJob的映射关系  jobIdToActiveJob(jobId) = job  // HashSet结构,维护着所有ActiveJob  activeJobs += job  // finalStage一旦生成就会把封装自己的ActiveJob注册到自己的_activeJob  // 而整个Job结束后就会清除掉  finalStage.setActiveJob(job)  // 提取出jobId对应的所有StageIds并转换才数组  val stageIds = jobIdToStageIds(jobId).toArray  // 提取出每个stage的最新尝试信息,当job启动时会告知SparkListenersJob  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))  // 封装一个SparkListenerEvent,通知SparkListenersJob启动了,并传递Job相关信息  // 底层会把这个event事件posteventQueue中,一个单独的Java的线程池会不停的poll出来并做对应的处理  listenerBus.post(    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))  // 开始提交Stage  submitStage(finalStage)}

We now have the finalStage. After updating and wrapping a few attributes, we enter submitStage, the entry point for submitting the job:

/** Submits stage, but first recursively submits any missing parents. */private def submitStage(stage: Stage) {  // 拿到第一个activeJob对应的jobId  val jobId = activeJobForStage(stage)  if (jobId.isDefined) {    logDebug("submitStage(" + stage + ")")    // waitingStages->等待运行的stages    // runningStages->正在运行的stages    // failedStages->由于获取失败需要重新提交的stages    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {      // 依次判断当前RDD及父RDD有没有被持久化过,若没有就判断之前代码构建的shuffleMapStage有没有准备好      val missing = getMissingParentStages(stage).sortBy(_.id)      logDebug("missing: " + missing)      if (missing.isEmpty) {        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")        // 如果持久化过返回的就会为空,或者持久化不会空并且mapStage已经准备好那么返回也是为空        // 如果返回为空        // 开始提交Tasks        submitMissingTasks(stage, jobId.get)      } else {        // 若代码走到这里的话 就是之前的mapStage没准备好        for (parent <- missing) {          // 再次提交Stage          submitStage(parent)        }        // 然后放入等待waitingStages        waitingStages += stage      }    }  } else {    // 否则终止stage    abortStage(stage, "No active job for stage " + stage.id, None)  }}


Before submitting tasks, it first checks whether the RDDs have been persisted and whether the map stages are ready.

private def getMissingParentStages(stage: Stage): List[Stage] = {  // 存放没准备好的mapStage  val missing = new HashSet[Stage]  // 存放被访问过的RDD的临时变量  val visited = new HashSet[RDD[_]]  // We are manually maintaining a stack here to prevent StackOverflowError  // caused by recursively visiting  // 又是后进先出的Stack结构  val waitingForVisit = new Stack[RDD[_]]  def visit(rdd: RDD[_]) {    if (!visited(rdd)) {      // 如果这个RDD没被访问过就加入visited,下次循环就不会访问这个RDD      visited += rdd      // 这里的getCacheLocs并不是根据字面意思的缓存来理解只是检查之前有没有仅仅缓存过RDD      // 而是做的双重检查:      // ①检查cacheLocs.contains(rdd.id) ②检查rdd.getStorageLevel == StorageLevel.NONE      // getCacheLocs返回的是executor_host_executorId标识的task位置,最后判断下是否为空      // 补充:包括在后面的task最佳位置划分算法也是会用到getCacheLocs(rdd: RDD[_])      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)      // Nil表示空list,当没有被持久化过那么就是为true,需要继续遍历上一个RDD的依赖      if (rddHasUncachedPartitions) {        // 如果之前没持久化过 就遍历当前rdd的所有依赖        // 只有到下次while循环才会遍历父RDD的依赖,可能一个或者多个        // 其实这里主要是在检测之前的createResultStage有没有成功构建好ShuffleMapStage        for (dep <- rdd.dependencies) {          dep match {            case shufDep: ShuffleDependency[_, _, _] =>              // 在之前的代码若成功创建了ShuffleMapStage              // 那么就可以直接从shuffleIdToMapStage拿取              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)              // 判断map阶段是否准备好,也就是所有partitions是否都有shuffle输出              // 在直接创建shuffleMapStage的时候 会把shuffle信息注册到Driver上的MapOutputTrackerMaster              // 最终会用rdd.partitions.length == ShuffleStatus._numAvailableOutputs作判断比较              if (!mapStage.isAvailable) {                // 不相等则加入missing                missing += mapStage              }              // 窄依赖就push回去,继续遍历            case narrowDep: NarrowDependency[_] =>              waitingForVisit.push(narrowDep.rdd)          }        }      }    }  }  // 把当前stageRDDpushwaitingForVisit  waitingForVisit.push(stage.rdd)  // 一直循环到pop出所有RDD  while (waitingForVisit.nonEmpty) {    visit(waitingForVisit.pop())  }  missing.toList}

A word about getCacheLocs: some source-code write-ups online claim it is only about the in-memory cache, which is wrong. Do not be misled by the name; it does not just check the cache but every persistence mode, including disk and off-heap, and it is also used later in the best-location computation for tasks. A side note: do not read source code by method names alone; even though modern code reads almost like natural language, grasping a framework's essence still requires digging into the low-level details of the implementation.
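A small sketch of why disk-only persistence also counts (assuming sc; the test getCacheLocs actually performs is rdd.getStorageLevel == StorageLevel.NONE, not "is it in memory"):

import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 100).map(_ * 2)
rdd.persist(StorageLevel.DISK_ONLY)   // not cached in memory, but getStorageLevel != NONE
rdd.count()                           // after this runs, block locations exist for every partition,
                                      // so getCacheLocs can return task locations for this RDD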

private[scheduler]def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized {  // Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times  // cacheLocs是一个mutable类型的HashMap,里面存储的是各个RDDId和它对应的被持久化的task位置  // rdd.id底层调用的是nextRddId.getAndIncrement()这里会把自己注册到自己的SparkContext中并返回它的rddId  // 判断传进来的rddId是否存在cacheLocsmap  if (!cacheLocs.contains(rdd.id)) {    // Note: if the storage level is NONE, we don't need to get locations from block manager.    // 如果这个rdd不包含在cacheLocs就判断下是否它的存储级别为NONE,如果是就不需要从blockmanager里面获取    val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) {      // 补充 Nil是空的List extends List[Nothing]      IndexedSeq.fill(rdd.partitions.length)(Nil)    } else {      // 如果这个rddStorageLevel不为NONE但却在cacheLocs中没被找到      // 说明这个rdd它是有持久化级别设置的      // 找到这个rdd的所有task持久化的位置最后赋值给cacheLocs 包括这次以后都可以从cacheLocs拿取了      // 像这种情况:如果是这个RDD有持久化级别 但是是第一次调用 就会走到这段代码里,      // 而它的持久化信息会存储到cacheLocs中 方便下次复用直接拿取task地址      val blockIds =        // 拿到rdd的分区Array[Partition]中的每个Partition对应的索引,然后用map遍历操作        // 把拿到的每个indexrddId生成RDDBlockId并把它们转换成BlockId类型的数组        // 这里RDDBlockId继承于BlockId,只是复写了父类的name方法        // 而这个被复写的name就是BlockId作为全局的标识符        // 看见网上很多在问blockpartition的关系,而这就是他们的关系之一(一个block对应一个partition        rdd.partitions.indices.map(index => RDDBlockId(rdd.id, index)).toArray[BlockId]      // 获取到每个blockId的存放地址      // 底层是通过blockManagerMaster调用Driver端的EndpointreceiveAndReply来做相应的处理      // 最后从Driver端的blockLocations中获取每个blockId对应的多个BlockManagerId      // BlockManagerIdBlockManager的唯一标识符,里面维护了hostexecutorId等核心成员      blockManagerMaster.getLocations(blockIds).map { bms =>        // 提取出blockManagerId对应的hostexecutorId(一个host可能会有多个executor        // 再通过提取出的2个参数传入调用TaskLocation,返回的是ExecutorCacheTaskLocation对象        // 返回对象里唯一成员toString最终会格式化成executor_host_executorId        // 也就是每个task运行的位置标记!!!        bms.map(bm => TaskLocation(bm.host, bm.executorId))      }    }    // 把拿到的locs地址信息赋值给cacheLocs里的rdd    // 下面的代码cacheLocs(rdd.id)会直接从中拿取    cacheLocs(rdd.id) = locs  }  // 最后根据rddcacheLocs拿去task的持久化地址  // 补充:这里只有一种情况 拿到的为空,就是cacheLocs不包含rdd并且StorageLevelNONE  cacheLocs(rdd.id)}
First, the structure of cacheLocs:

// Locations of each persisted RDD partition; the key is the RDD id, the value is the per-partition location sequences
// You can read [Int, IndexedSeq[Seq[TaskLocation]]] as [RDDId, per-partition Seq of TaskLocations derived from BlockManagerIds]
private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]
When looking up an RDD's earlier persistence, this code talks to blockManagerMaster and the BlockManagerMasterEndpoint to obtain the location of every previously persisted partition.

blockManagerMaster is created when SparkContext builds the SparkEnv. On the driver it maintains the metadata of every node's BlockManager in the cluster, while BlockManagerMasterEndpoint is the message endpoint returned when the driver creates blockManagerMaster and registers it into SparkEnv; it handles each incoming event message according to its type. See my earlier SparkEnv chapter for the details.

First look at getLocations; this involves Netty-based RPC (covered in an earlier chapter):

/** Get locations of multiple blockIds from the driver */
def getLocations(blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
  // askSync triggers receiveAndReply on the driver endpoint, which matches GetLocationsMultipleBlockIds
  // and answers with context.reply(getLocationsMultipleBlockIds(blockIds))
  driverEndpoint.askSync[IndexedSeq[Seq[BlockManagerId]]](
    GetLocationsMultipleBlockIds(blockIds))
}
driverEndpoint.askSync triggers BlockManagerMasterEndpoint's two-way handler receiveAndReply, which then matches GetLocationsMultipleBlockIds:

case GetLocationsMultipleBlockIds(blockIds) =>
  // Reply to the sender with the information for the requested block ids
  context.reply(getLocationsMultipleBlockIds(blockIds))

private def getLocationsMultipleBlockIds(
    blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
  // For each blockId, look up its BlockManagerIds
  blockIds.map(blockId => getLocations(blockId))
}
Finally we get the BlockManagerIds for each BlockId (each containing host, executorId, port and other member fields):

private def getLocations(blockId: BlockId): Seq[BlockManagerId] = {
  // If blockLocations contains the blockId, return its set; otherwise return an empty Seq
  if (blockLocations.containsKey(blockId)) blockLocations.get(blockId).toSeq else Seq.empty
}
The structure of blockLocations:

// Mapping from block id to the set of block managers that have the block.
// One BlockId can map to several BlockManagerIds, because replication via StorageLevel or
// checkpointing may leave the block in the BlockManagers of multiple executors
// Note: JHashMap is java.util.HashMap
private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]

Finally the host and executorId are extracted and wrapped via TaskLocation into the location identifier for each partition:

blockManagerMaster.getLocations(blockIds).map { bms =>
  // Pull the host and executorId out of each BlockManagerId (one host may run several executors)
  // and pass them to TaskLocation, which returns an ExecutorCacheTaskLocation whose toString
  // formats to executor_host_executorId: the location marker for where each task should run
  bms.map(bm => TaskLocation(bm.host, bm.executorId))
}
This calls apply on TaskLocation's companion object:

def apply(host: String, executorId: String): TaskLocation = {
  new ExecutorCacheTaskLocation(host, executorId)
}

/**
 * A location that includes both a host and an executor id on that host.
 */
private [spark]
case class ExecutorCacheTaskLocation(override val host: String, executorId: String)
  extends TaskLocation {
  // executor_host_executorId
  override def toString: String = s"${TaskLocation.executorLocationTag}${host}_$executorId"
}

With all of the above checks done (was anything persisted, are the map stages ready), we move into the task pre-submission phase. This involves the best-location algorithm for tasks, packaging closures, broadcast variables, creating ShuffleMapTask and ResultTask (next chapter), and submitting the tasks (next chapter).
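The taskBinary broadcast in the code below follows the same pattern as broadcasting any large read-only value: ship it to each node once instead of copying it into every task closure. A generic sketch of that pattern (assuming sc; lookupTable is a made-up example, not the real taskBinary):

val lookupTable = Map("a" -> 1, "b" -> 2)     // imagine this is large
val bc = sc.broadcast(lookupTable)            // sent to each node once, read-only

sc.parallelize(Seq("a", "b", "a")).map { k =>
  bc.value.getOrElse(k, 0)                    // each task reads the node-local copy
}.collect()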

/** Called when stage's parents are available and we can now do its task. */private def submitMissingTasks(stage: Stage, jobId: Int) {  logDebug("submitMissingTasks(" + stage + ")")  // First figure out the indexes of partition ids to compute.  // 返回的是一个Seq[Int],索引长度是需要计算的partitionId  // 补充:shuffleStageresultStage的实现都不一样  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()  // Use the scheduling pool, job group, description, etc. from an ActiveJob associated  // with this Stage  // 拿到该jobproperties  val properties = jobIdToActiveJob(jobId).properties  // stage加入正在运行状态  runningStages += stage  // SparkListenerStageSubmitted should be posted before testing whether tasks are  // serializable. If tasks are not serializable, a SparkListenerStageCompleted event  // will be posted, which should always come after a corresponding SparkListenerStageSubmitted  // event.  stage match {    case s: ShuffleMapStage =>      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)    case s: ResultStage =>      outputCommitCoordinator.stageStart(        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)  }  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {    // map每个partitionId,根据Id和这个stageRDD调用Task最佳位置划分算法    // 补充不同类型的RDD所调用的最优位置算法逻辑都不一样    // 假如是ShuffledRDD实现核心思想是:    // 首先会查询BlockManager是否持久化过,若有就去Driver端找BlockManagerMaster获取地址    // 否则就会去查找是否checkpoint过,若有就可能会去hdfs直接获取    // 若都没持久化过,就会去找MapOutputTracker查找之前在map端写入的shuffle文件的地址    stage match {      case s: ShuffleMapStage =>        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap      case s: ResultStage =>        partitionsToCompute.map { id =>          val p = s.partitions(id)          (id, getPreferredLocs(stage.rdd, p))        }.toMap    }  } catch {    case NonFatal(e) =>      stage.makeNewStageAttempt(partitionsToCompute.size)      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))      runningStages -= stage      return  }  // 这里会把刚刚执行过的最新stage信息更新进_latestInfo  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)  // If there are tasks to execute, record the submission time of the stage. Otherwise,  // post the even without the submission time, which indicates that this stage was  // skipped.  if (partitionsToCompute.nonEmpty) {    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())  }  // 告诉listenerBus已经提交stage  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))  // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.  // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast  // the serialized copy of the RDD and for each task we will deserialize it, which means each  // task gets a different copy of the RDD. This provides stronger isolation between tasks that  // might modify state of objects referenced in their closures. This is necessary in Hadoop  // where the JobConf/Configuration object is not thread-safe.  // 下面会把task封装成闭包然后通过Broadcast分发到各个节点  var taskBinary: Broadcast[Array[Byte]] = null  try {    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).    // For ResultTask, serialize and broadcast (rdd, func).    
// 不管是ShuffleMapStagetask或者ResultStagetask都得序列化并且广播    // 这里返回的是task字节数组的闭包    val taskBinaryBytes: Array[Byte] = stage match {      case stage: ShuffleMapStage =>        // 转换成字节数组        JavaUtils.bufferToArray(          // 底层用的是java.nio.ByteBuffer缓冲区          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))      case stage: ResultStage =>        JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))    }    // broadcast 可以把指定的对象转换成只读的广播变量发送到每个节点上    // 然后同个节点的每个executorpartition都会找worker拉取自己的闭包    // 如果这里不用broadcast 那么就会把给每个task拷贝一份闭包,这样就会产生大量IO    // 所以这里会用广播去优化,就像平时读取大的配置文件 或者避免join操作的Shuffle时候 都可以用到广播来优化    // 这里顺便提下 sparkRDD都是封装成闭包分布到各个节点的    // 闭包的特性是延迟加载和不能修改闭包外的变量(只能用累加器Accumulator实现修改变量)    taskBinary = sc.broadcast(taskBinaryBytes)  } catch {    // In the case of a failure during serialization, abort the stage.    case e: NotSerializableException =>      abortStage(stage, "Task not serializable: " + e.toString, Some(e))      runningStages -= stage      // Abort execution      return    case NonFatal(e) =>      abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))      runningStages -= stage      return  }  val tasks: Seq[Task[_]] = try {    // 这里也会把task的指标检测对象taskMetrics封装成序列化闭包    val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()    stage match {        // 当匹配到生成的是ShuffleMapStage      case stage: ShuffleMapStage =>        // 首先保证pendingPartitions为空        // pendingPartitions中放的是还没完成的partition,还没完成的task        // 如果完成了就会从中清除        // DAGScheduler会用它来确定此state是否已完成        stage.pendingPartitions.clear()        // 开始遍历操作每个需要计算的分区        partitionsToCompute.map { id =>          // 拿到分区地址          val locs = taskIdToLocations(id)          // 拿到此stage对应的rdd的分区          val part = stage.rdd.partitions(id)          // 加入运行状态          stage.pendingPartitions += id          // 开始构建ShuffleMapTask对象,之后会通过这个对象调用runTask,具体详情会在下个章节          // 补充:Task分为两种:一种是ShuffleMapTask,一种是ResultTask          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,            taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),            Option(sc.applicationId), sc.applicationAttemptId)        }      // 当匹配到ResultStage时生成的是ResultTask      case stage: ResultStage =>      partitionsToCompute.map { id =>          val locs = taskIdToLocations(id)          val p: Int = stage.partitions(id)          val part = stage.rdd.partitions(p)          new ResultTask(stage.id, stage.latestInfo.attemptId,            taskBinary, part, locs, id, properties, serializedTaskMetrics,            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)        }    }  } catch {    case NonFatal(e) =>      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))      runningStages -= stage      return  }  if (tasks.size > 0) {    logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +      s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")    // 开始提交task    // 这里调用的是 实现taskScheduler特质的TaskSchedulerImpl    // 它会提交被taskSet封装的tasks    // 具体详细放在下个章节    taskScheduler.submitTasks(new TaskSet(      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))  } else {    // Because we posted SparkListenerStageSubmitted earlier, we should mark    // the stage as completed here in case there are no tasks to run    // 由于某些原因 可能拿到任何task,但是得向SparkListenerStageSubmitted标记下这个stage完成了    // 
因为之前我们向SparkListenerStageSubmitted提交过任务,这里得清除它。    markStageAsFinished(stage, None)    val debugString = stage match {      case stage: ShuffleMapStage =>        s"Stage ${stage} is actually done; " +          s"(available: ${stage.isAvailable}," +          s"available outputs: ${stage.numAvailableOutputs}," +          s"partitions: ${stage.numPartitions})"      case stage : ResultStage =>        s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"    }    logDebug(debugString)    // Stage完成后,继续依次提交子Stage    submitWaitingChildStages(stage)  }}

First, find the partitions that need to be computed, taking ShuffleMapStage as the example:

/** Returns the sequence of partition ids that are missing (i.e. needs to be computed). */
override def findMissingPartitions(): Seq[Int] = {
  mapOutputTrackerMaster
    .findMissingPartitions(shuffleDep.shuffleId)
    // If nothing is known about this shuffle yet, treat every partition as missing
    .getOrElse(0 until numPartitions)
}

/**
 * Returns the sequence of partition ids that are missing (i.e. needs to be computed), or None
 * if the MapOutputTrackerMaster doesn't know about this shuffle.
 */
def findMissingPartitions(shuffleId: Int): Option[Seq[Int]] = {
  shuffleStatuses.get(shuffleId).map(_.findMissingPartitions())
}

/**
 * Returns the sequence of partition ids that are missing (i.e. needs to be computed).
 */
def findMissingPartitions(): Seq[Int] = synchronized {
  // Keep every partitionId whose entry in mapStatuses is still null.
  // mapStatuses is filled in as tasks finish, so on a first run it is entirely null
  val missing = (0 until numPartitions).filter(id => mapStatuses(id) == null)
  assert(missing.size == numPartitions - _numAvailableOutputs,
    s"${missing.size} missing, expected ${numPartitions - _numAvailableOutputs}")
  missing
}

Then, for each partition id that needs computing, the best locations are worked out, again taking ShuffleMapStage as the example:

val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {  // map每个partitionId,根据Id和这个stageRDD调用Task最佳位置划分算法  // 补充不同类型的RDD所调用的最优位置算法逻辑都不一样  // 假如是ShuffledRDD实现核心思想是:  // 首先会查询BlockManager是否持久化过,若有就去Driver端找BlockManagerMaster获取地址  // 否则就会去查找是否checkpoint过,若有就可能会去hdfs直接获取  // 若都没持久化过,就会去找MapOutputTracker查找之前在map端写入的shuffle文件的地址  stage match {    case s: ShuffleMapStage =>      partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap    case s: ResultStage =>      partitionsToCompute.map { id =>        val p = s.partitions(id)        (id, getPreferredLocs(stage.rdd, p))      }.toMap  }
private[spark]
def getPreferredLocs(rdd: RDD[_], partition: Int): Seq[TaskLocation] = {
  getPreferredLocsInternal(rdd, partition, new HashSet)
}


It first calls the getCacheLocs we saw earlier to check for persistence in memory, on disk, or off-heap. If there is none, it calls preferredLocations to see whether the RDD was checkpointed. Failing that, for a shuffle it checks whether any BlockManager already holds a large enough share of the partition's map output (at least 0.2 of the total block size by default) to be worth preferring.

// 这里会根据不同的依赖调用不同的逻辑划分算法private def getPreferredLocsInternal(    rdd: RDD[_],    partition: Int,    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {  // If the partition has already been visited, no need to re-visit.  // This avoids exponential path exploration.  SPARK-695  // 如果之前访问过这个rdd的分区就直接返回空list  if (!visited.add((rdd, partition))) {    // Nil has already been returned for previously visited partitions.    return Nil  }  // If the partition is cached, return the cache locations  // 调用getCacheLocs,之前有介绍  // 这里并不是柯理化,只是在返回值后面继续提取对应的[Seq[TaskLocation]]  // 所以最初返回类型是IndexedSeq[Seq[TaskLocation]]],可以看做是BlockId[BlockManagerId[TaskLocation]  // 然后根据partition返回[Seq[TaskLocation]]  val cached = getCacheLocs(rdd)(partition)  if (cached.nonEmpty) {    // 若有持久化的task就直接返回    return cached  }  // If the RDD has some placement preferences (as is the case for input RDDs), get those  // 这里其实是根据设定阈值筛选清洗出满足Blocks计算后的规定大小的BlockManager的地址。  // 补充:返回的地址格式也不同,这根是否之前被checkpoint有关  val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList  if (rddPrefs.nonEmpty) {    // 这里返回的三个对象里面封装就是task的地址    return rddPrefs.map(TaskLocation(_))  }  // If the RDD has narrow dependencies, pick the first partition of the first narrow dependency  // that has any placement preferences. Ideally we would choose based on transfer sizes,  // but this will do for now.  // 如果过来的RDD的依赖是窄依赖,就会迭代遍历所有父RDD的所有分区 直到任一一个有优先位置为止  rdd.dependencies.foreach {    case n: NarrowDependency[_] =>      // 遍历父RDD的所有分区      for (inPart <- n.getParents(partition)) {        // 回调getPreferredLocsInternal        val locs = getPreferredLocsInternal(n.rdd, inPart, visited)        if (locs != Nil) {          // 一直到任一一个有优先位置为止          return locs        }      }    case _ =>  }  Nil}

getCacheLocs was covered earlier (look back if you have forgotten), so let's start from preferredLocations:

/**
 * Get the preferred locations of a partition, taking into account whether the
 * RDD is checkpointed.
 */
final def preferredLocations(split: Partition): Seq[String] = {
  // First try to get locations from the checkpointed RDD; if there is none,
  // fall back to getPreferredLocations (so the returned address format differs accordingly)
  checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
    getPreferredLocations(split)
  }
}

Taking ShuffledRDD as the example, it first grabs the driver-side MapOutputTrackerMaster (which holds the shuffle-stage metadata of every node's BlockManager in the cluster):

override protected def getPreferredLocations(partition: Partition): Seq[String] = {
  // Grab the driver-side MapOutputTrackerMaster
  val tracker = SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]
  // dependencies was covered earlier; take this RDD's first dependency and cast it to
  // ShuffleDependency (this is a ShuffledRDD, so the cast is just a safety net)
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  tracker.getPreferredLocationsForShuffle(dep, partition.index)
}

def getPreferredLocationsForShuffle(dep: ShuffleDependency[_, _, _], partitionId: Int)    : Seq[String] = {  // shuffleLocalityEnabled默认true  // SHUFFLE_PREF_MAP_THRESHOLD默认=1000  // SHUFFLE_PREF_REDUCE_THRESHOLD默认=1000  // REDUCER_PREF_LOCS_FRACTION=0.2  if (shuffleLocalityEnabled && dep.rdd.partitions.length < SHUFFLE_PREF_MAP_THRESHOLD &&      dep.partitioner.numPartitions < SHUFFLE_PREF_REDUCE_THRESHOLD) {    // 这里会过滤清洗出满足要求的所有BlockManagerId    // 补充:BlockManager在每个ExecutorDrvier中都存在唯一一个负责数据的传输,接收和持久化,在之后的章节会介绍    val blockManagerIds = getLocationsWithLargestOutputs(dep.shuffleId, partitionId,      dep.partitioner.numPartitions, REDUCER_PREF_LOCS_FRACTION)    if (blockManagerIds.nonEmpty) {      // 拿到所有BlockManagerhost地址      blockManagerIds.get.map(_.host)    } else {      Nil    }  } else {    Nil  }}

Here the MapOutputTrackerMaster looks up the ShuffleStatus for this shuffleId and walks the MapStatus of every map partition. There are two kinds of MapStatus: CompressedMapStatus (the default) and HighlyCompressedMapStatus. The block sizes extracted from each MapStatus are then summed per BlockManager and filtered against the threshold:

def getLocationsWithLargestOutputs(
    shuffleId: Int,
    reducerId: Int,
    numReducers: Int,
    fractionThreshold: Double)
  : Option[Array[BlockManagerId]] = {
  // Look up the ShuffleStatus registered for this shuffleId
  val shuffleStatus = shuffleStatuses.get(shuffleId).orNull
  if (shuffleStatus != null) {
    // withMapStatuses wraps access to this shuffle's mapStatuses array in a synchronized block
    // for thread safety.
    // Note: when a ShuffleMapStage is created it registers itself with the MapOutputTrackerMaster
    // on the driver, which creates the corresponding ShuffleStatus with one MapStatus slot per
    // map partition. By default a MapStatus is produced when the SortShuffleManager builds a
    // SortShuffleWriter, i.e. when a ShuffleMapTask runs runTask.
    // There are two kinds: (1) CompressedMapStatus, (2) HighlyCompressedMapStatus.
    shuffleStatus.withMapStatuses { statuses =>
      if (statuses.nonEmpty) {
        // HashMap to add up sizes of all blocks at the same location
        // Key: BlockManagerId, value: total size of the blocks stored at that location
        val locs = new HashMap[BlockManagerId, Long]
        var totalOutputSize = 0L
        var mapIdx = 0
        // Iterate over every MapStatus
        while (mapIdx < statuses.length) {
          // Start from the first MapStatus
          val status = statuses(mapIdx)
          // status may be null here if we are called between registerShuffle, which creates an
          // array with null entries for each output, and registerMapOutputs, which populates it
          // with valid status entries. This is possible if one thread schedules a job which
          // depends on an RDD which is currently being computed by another thread.
          // Right after registerShuffle the entries can still be null, hence the check
          if (status != null) {
            // Extract the block size for this reducer (sizes are stored compressed by default)
            val blockSize = status.getSizeForBlock(reducerId)
            if (blockSize > 0) {
              // Add the decompressed block size to the running total for this BlockManagerId
              locs(status.location) = locs.getOrElse(status.location, 0L) + blockSize
              // Accumulate into the overall output size
              totalOutputSize += blockSize
            }
          }
          // Move on to the next MapStatus
          mapIdx = mapIdx + 1
        }
        val topLocs = locs.filter { case (loc, size) =>
          // Keep a location when: block size on this BlockManager / total output size >= 0.2 (default).
          // If true, this BlockManager already holds a large share of the map output for this
          // reduce partition, so running the reduce task there avoids most of the network transfer.
          size.toDouble / totalOutputSize >= fractionThreshold
        }
        // Return if we have any locations which satisfy the required threshold
        if (topLocs.nonEmpty) {
          // Return the array of qualifying BlockManagerIds
          return Some(topLocs.keys.toArray)
        }
      }
    }
  }
  None
}
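To make the 0.2 threshold concrete, here is a minimal, self-contained sketch that reproduces the filter above. It is not Spark code; the host names and block sizes are made up for illustration:

object ReducerLocalitySketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical per-location totals of map output destined for one reduce partition (bytes)
    val sizesByHost = Map(
      "host-a" -> 600L,   // 60% of the output
      "host-b" -> 250L,   // 25% of the output
      "host-c" -> 150L)   // 15% of the output
    val totalOutputSize = sizesByHost.values.sum
    val fractionThreshold = 0.2 // default of REDUCER_PREF_LOCS_FRACTION

    // Same predicate as the source above: keep hosts holding at least 20% of the total output
    val preferred = sizesByHost.filter { case (_, size) =>
      size.toDouble / totalOutputSize >= fractionThreshold
    }.keys.toSeq

    // host-a and host-b qualify, host-c does not, so the reduce task prefers host-a or host-b
    println(preferred.sorted.mkString(", "))
  }
}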

If any BlockManagers pass the filter, their hosts are returned as the preferred locations. Back in getPreferredLocsInternal, these strings are then turned into TaskLocation objects:

val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
if (rddPrefs.nonEmpty) {
  // The returned objects (one of the three TaskLocation subclasses) wrap the task addresses
  return rddPrefs.map(TaskLocation(_))
}
/**
 * Create a TaskLocation from a string returned by getPreferredLocations.
 * These strings have the form executor_[hostname]_[executorid], [hostname], or
 * hdfs_cache_[hostname], depending on whether the location is cached.
 */
// If the data was cached (e.g. checkpointed to HDFS or cached on an executor), the incoming
// string may be hdfs_cache_[hostname] or executor_[hostname]_[executorid]; otherwise it is
// just [hostname]
def apply(str: String): TaskLocation = {
  // inMemoryLocationTag = "hdfs_cache_"
  // Strip the hdfs_cache_ prefix; if the string does not carry that prefix it is returned unchanged
  val hstr = str.stripPrefix(inMemoryLocationTag)
  // Check whether the block is cached in HDFS
  if (hstr.equals(str)) {
    // Not an HDFS-cached location, so check whether the prefix is executor_
    if (str.startsWith(executorLocationTag)) {
      // Convert to [hostname]_[executorid]
      val hostAndExecutorId = str.stripPrefix(executorLocationTag)
      // Split into Array[String](hostname, executorid)
      val splits = hostAndExecutorId.split("_", 2)
      require(splits.length == 2, "Illegal executor location format: " + str)
      val Array(host, executorId) = splits
      // The resulting object is identified as executor_host_executorId
      new ExecutorCacheTaskLocation(host, executorId)
    } else {
      // Reaching here means the string is a plain hostname with no cache prefix
      // The resulting object is identified simply by host
      new HostTaskLocation(str)
    }
  } else {
    // Reaching here means the data was cached in HDFS (e.g. via checkpoint)
    // The resulting object is identified as hdfs_cache_host
    new HDFSCacheTaskLocation(hstr)
  }
}
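A quick sketch of the three string formats and what they resolve to. This is a simplified re-implementation for illustration only, with the prefixes inlined; the real parsing is the TaskLocation companion object shown above:

object TaskLocationSketch {
  // Simplified stand-ins for the three TaskLocation flavours
  sealed trait Loc
  case class ExecutorCacheLoc(host: String, executorId: String) extends Loc
  case class HostLoc(host: String) extends Loc
  case class HdfsCacheLoc(host: String) extends Loc

  // Same branching as TaskLocation.apply above
  def parse(str: String): Loc = {
    val hstr = str.stripPrefix("hdfs_cache_")
    if (hstr == str) {
      if (str.startsWith("executor_")) {
        val Array(host, executorId) = str.stripPrefix("executor_").split("_", 2)
        ExecutorCacheLoc(host, executorId)
      } else HostLoc(str)
    } else HdfsCacheLoc(hstr)
  }

  def main(args: Array[String]): Unit = {
    // The three possible formats returned by preferredLocations
    println(parse("executor_host1_3"))  // ExecutorCacheLoc(host1,3)
    println(parse("host1"))             // HostLoc(host1)
    println(parse("hdfs_cache_host1"))  // HdfsCacheLoc(host1)
  }
}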

Once the best locations for the tasks are known, Spark serializes the stage's RDD together with its shuffle dependency (or, for a ResultStage, its result function) into a task binary closure and broadcasts it:


// Below, the task is wrapped into a closure and distributed to the nodes via a Broadcast
var taskBinary: Broadcast[Array[Byte]] = null
try {
  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
  // For ResultTask, serialize and broadcast (rdd, func).
  // Whether the tasks belong to a ShuffleMapStage or a ResultStage they must be serialized
  // and broadcast; the result is the task closure as a byte array.
  val taskBinaryBytes: Array[Byte] = stage match {
    case stage: ShuffleMapStage =>
      // Convert to a byte array
      JavaUtils.bufferToArray(
        // Backed by a java.nio.ByteBuffer
        closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
    case stage: ResultStage =>
      JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
  }
  // broadcast turns the given object into a read-only broadcast variable shipped to every node;
  // each executor on a node then fetches the closure for its partitions locally.
  // Without broadcast, a copy of the closure would travel with every single task, producing
  // a lot of IO, so broadcasting is an optimization -- the same trick you would use for a
  // large configuration file, or for a map-side join that avoids a shuffle.
  // Incidentally, Spark RDD operations are always shipped to the nodes as closures.
  // Closures are lazily loaded and cannot modify variables defined outside of them
  // (use an Accumulator if you need to update a shared variable).
  taskBinary = sc.broadcast(taskBinaryBytes)
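The broadcast trick mentioned in those comments applies to user code as well. A minimal sketch, assuming a local SparkContext and a made-up lookup table, that ships a read-only map to each executor once instead of copying it into every task closure:

import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("broadcast-sketch").setMaster("local[2]"))

    // A read-only lookup table; broadcasting it ships one copy per executor
    // rather than one copy per task closure.
    val countryByCode = sc.broadcast(Map("CN" -> "China", "US" -> "United States"))

    val result = sc.parallelize(Seq("CN", "US", "CN"))
      .map(code => countryByCode.value.getOrElse(code, "unknown")) // read the broadcast value on the executor
      .collect()

    println(result.mkString(", ")) // China, United States, China
    sc.stop()
  }
}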

The packaged taskBinary, together with a set of other parameters, is then used to build either ShuffleMapTasks or ResultTasks (covered in the next chapter):

val tasks: Seq[Task[_]] = try {
  // The task metrics object is also serialized into the closure
  val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
  stage match {
    // When the stage is a ShuffleMapStage
    case stage: ShuffleMapStage =>
      // Make sure pendingPartitions is empty first.
      // pendingPartitions holds the partitions whose tasks have not finished yet;
      // entries are removed as tasks complete, and the DAGScheduler uses it to decide
      // whether this stage is finished.
      stage.pendingPartitions.clear()
      // Iterate over every partition that needs to be computed
      partitionsToCompute.map { id =>
        // Get the preferred locations of this partition
        val locs = taskIdToLocations(id)
        // Get the partition of the RDD this stage is built on
        val part = stage.rdd.partitions(id)
        // Mark the partition as pending
        stage.pendingPartitions += id
        // Build a ShuffleMapTask; runTask will later be called on it (details in the next chapter).
        // Note: there are two kinds of Task: ShuffleMapTask and ResultTask.
        new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
          taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
          Option(sc.applicationId), sc.applicationAttemptId)
      }
    // When the stage is a ResultStage, build ResultTasks
    case stage: ResultStage =>
      partitionsToCompute.map { id =>
        val locs = taskIdToLocations(id)
        val p: Int = stage.partitions(id)
        val part = stage.rdd.partitions(p)
        new ResultTask(stage.id, stage.latestInfo.attemptId,
          taskBinary, part, locs, id, properties, serializedTaskMetrics,
          Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
      }
  }

Finally, the tasks are wrapped into a TaskSet and handed to the taskScheduler, which submits them to the executors (covered in the next chapter):

if (tasks.size > 0) {
  logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
    s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
  // Submit the tasks.
  // This calls TaskSchedulerImpl, the implementation of the TaskScheduler trait,
  // which submits the tasks wrapped in a TaskSet.
  // Details in the next chapter.
  taskScheduler.submitTasks(new TaskSet(
    tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
} else {
  // Because we posted SparkListenerStageSubmitted earlier, we should mark
  // the stage as completed here in case there are no tasks to run
  // For various reasons there may be no tasks to run at all; since a
  // SparkListenerStageSubmitted event was already posted, the stage must be marked
  // as finished here.
  markStageAsFinished(stage, None)

Once this stage has been handled, any waiting child stages that are now eligible get submitted:

    // After this stage has been submitted, submit its child stages that are now eligible
    submitWaitingChildStages(stage)
  }
}
/**
 * Check for waiting stages which are now eligible for resubmission.
 * Submits stages that depend on the given parent stage. Called when the parent stage completes
 * successfully.
 */
private def submitWaitingChildStages(parent: Stage) {
  logTrace(s"Checking if any dependencies of $parent are now runnable")
  logTrace("running: " + runningStages)
  logTrace("waiting: " + waitingStages)
  logTrace("failed: " + failedStages)
  // waitingStages is a HashSet holding the stages waiting to be submitted; submitStage adds
  // a stage to it when the stage's parents are not ready yet.
  // Keep only the waiting stages that list the just-completed stage among their parents.
  val childStages = waitingStages.filter(_.parents.contains(parent)).toArray
  waitingStages --= childStages
  for (stage <- childStages.sortBy(_.firstJobId)) {
    // Resubmit the child stages, earliest job first
    submitStage(stage)
  }
}
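As a toy model of the selection above (plain Scala, no Spark types; the stage ids and parent links are made up), this shows which waiting stages become candidates once a parent finishes and in what order they are attempted:

import scala.collection.mutable

object WaitingStagesSketch {
  case class Stage(id: Int, firstJobId: Int, parents: Seq[Stage])

  def main(args: Array[String]): Unit = {
    val parent = Stage(0, firstJobId = 0, parents = Nil)
    val other  = Stage(1, firstJobId = 0, parents = Nil)
    val childA = Stage(2, firstJobId = 1, parents = Seq(parent))
    val childB = Stage(3, firstJobId = 0, parents = Seq(parent, other))

    val waitingStages = mutable.HashSet(childA, childB)

    // Same selection as submitWaitingChildStages: waiting stages that list the finished stage as a parent
    val childStages = waitingStages.filter(_.parents.contains(parent)).toArray
    waitingStages --= childStages

    // Attempted in firstJobId order: stage 3 (job 0) before stage 2 (job 1).
    // In the real scheduler, submitStage would put stage 3 back into waitingStages
    // because its other parent (stage 1) has not finished yet.
    childStages.sortBy(_.firstJobId).foreach(s => println(s"submitStage(stage ${s.id})"))
  }
}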



