Spark 2.0.x Source Code Deep Dive: DAGScheduler and Stage Division — the most complete and in-depth walkthrough you'll find!
WeChat: 519292115
Email: taosiyuan163@163.com
Please respect the original work — reproduction prohibited!!
Spark is currently one of the hottest frameworks in big data: it efficiently handles offline batch processing, real-time computation, machine learning and more. Reading its source code will deepen your understanding of the framework.
I will dissect the core components of Spark 2.0.x one by one, including the BlockManager, TaskScheduler and others in later chapters.
I've read a number of online articles on the DAGScheduler source. Some are quite good and hit a few of the key points, but they stay one-sided: they skip many implementation details, the reasons behind them, and the interplay with other components — knowing the what but not the why.
This DAGScheduler walkthrough will cover the lowest-level data structures, the implementation of every detail, the algorithms and optimizations, the interactions between components, and it will correct some mistaken explanations circulating online.
The main components interacting with the DAGScheduler are listed below; we'll follow the code and discuss each as we reach it:
1. The most commonly used RDD operator types:
① MapPartitionsRDD — transformation-style RDD producing a OneToOneDependency; examples: map, filter
② ShuffledRDD — depends on a single parent RDD and may trigger a shuffle (the single biggest drag on cluster performance), producing a ShuffleDependency that divides the current job into stages; examples: groupByKey, reduceByKey
③ CoGroupedRDD — depends on multiple parent RDDs and may likewise trigger a shuffle, producing ShuffleDependencies that divide the current job into stages; example: join
2. The two kinds of Stage:
① ShuffleMapStage — extends Stage; it sits immediately before a shuffle and may contain several transformations. When it executes, it saves the map-side output files, which reduce tasks fetch later
② ResultStage — the last stage of a job, triggered when an action runs on an RDD
3. MapOutputTracker — holds the output metadata of the shuffle phase; its Master and Worker subclasses have different implementations
4. Partitioner — defaults to HashPartitioner
5. BlockManager
and more...
DAGScheduler — one of Spark's most central components. In short, it is responsible for dividing a job into an optimal set of stages and handing all the resulting tasks to the TaskScheduler.
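To make the walkthrough concrete, here is a minimal, hypothetical driver program (the app name, master URL and data are placeholders) whose lineage the DAGScheduler would split into one ShuffleMapStage plus one final ResultStage:

import org.apache.spark.{SparkConf, SparkContext}

object StageDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("stage-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val counts = sc.parallelize(Seq("a", "b", "a"))  // ParallelCollectionRDD
      .map(w => (w, 1))                              // MapPartitionsRDD, OneToOneDependency
      .reduceByKey(_ + _)                            // ShuffledRDD, ShuffleDependency -> stage boundary

    // count() is an action: it calls sc.runJob, which hands the final RDD to
    // the DAGScheduler; the ShuffleDependency above splits the job into
    // ShuffleMapStage 0 and ResultStage 1.
    println(counts.count())
    sc.stop()
  }
}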
Here is its own scaladoc from the source:

The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a minimal schedule to run the job. It then submits stages as TaskSets to an underlying TaskScheduler implementation that runs them on the cluster. A TaskSet contains fully independent tasks that can run right away based on the data that's already on the cluster (e.g. map output files from previous stages), though it may fail if this data becomes unavailable.
The DAGScheduler is created by the SparkContext on the driver (see my earlier SparkContext article for the exact steps); what triggers it to build the DAG of stages is an action operator.
OK, let's start from count:
count is an action; executing it calls runJob internally.
/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

Stepping into runJob:
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    // Throw if the context has already been stopped
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  // Enter the core runJob path
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler,
    localProperties.get)
  // ConsoleProgressBar: the job progress bar printed on the console
  progressBar.foreach(_.finishAll())
  // Finally, recursively call doCheckpoint to check whether any ancestor RDD
  // needs checkpointing. Checkpointing typically stores the data on HDFS and
  // truncates the lineage above the checkpointed RDD; later reuses of the RDD
  // first check whether a checkpoint exists.
  rdd.doCheckpoint()
}
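As a side note on rdd.doCheckpoint(): checkpointing only kicks in for RDDs that were explicitly marked before the action ran. A minimal, hypothetical usage sketch (it assumes an existing SparkContext `sc`; the HDFS directory is a placeholder):

// the directory path below is a placeholder
sc.setCheckpointDir("hdfs://namenode:8020/tmp/checkpoints")
val cleaned = sc.parallelize(1 to 1000).map(_ * 2).filter(_ % 3 == 0)
cleaned.checkpoint() // only marks the RDD; nothing is written yet
cleaned.count()      // the action runs the job; afterwards doCheckpoint()
                     // materializes the data and truncates the lineage above it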
This calls dagScheduler.runJob, which submits the job to the DAGScheduler and gets back a JobWaiter — a handle whose completion future the calling thread blocks on until the job finishes.
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  val start = System.nanoTime
  // Submit the job; this returns a JobWaiter the caller blocks on until completion
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  ThreadUtils.awaitReady(waiter.completionFuture, Duration.Inf)
  // Log differently depending on how the job finished
  waiter.completionFuture.value.get match {
    case scala.util.Success(_) =>
      logInfo("Job %d finished: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
    case scala.util.Failure(exception) =>
      logInfo("Job %d failed: %s, took %f s".format
        (waiter.jobId, callSite.shortForm, (System.nanoTime - start) / 1e9))
      // SPARK-8644: Include user stack trace in exceptions coming from DAGScheduler.
      val callerStackTrace = Thread.currentThread().getStackTrace.tail
      exception.setStackTrace(exception.getStackTrace ++ callerStackTrace)
      throw exception
  }
}

submitJob returns the JobWaiter and posts a job-submission event message onto the event loop queue:
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check to make sure we are not launching a task on a partition that does not exist.
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }
  // Increment nextJobId to obtain an id for the current job
  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  // Assert that there really are partitions to run on
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  // Build the JobWaiter that blocks until the job completes, then hands each
  // completed result to resultHandler
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // DAGScheduler's event queue, backed by a LinkedBlockingDeque.
  // Several jobs may be running on the cluster at once; the DAGScheduler
  // schedules them FIFO by default. The event posted here is a JobSubmitted;
  // eventProcessLoop's doOnReceive pattern-matches the event type and ends up
  // in dagScheduler.handleJobSubmitted(...)
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}

eventProcessLoop extends EventLoop and is dedicated to receiving and handling every event message sent by callers during the job and stage phases.
When eventProcessLoop posts an event it actually puts it onto a message queue; a dedicated Java thread keeps taking from that queue (blocking safely when it is empty), pattern-matches the event type, and runs the corresponding handler.
// Dedicated to receiving and handling all event messages sent by callers
// during the job and stage phases
private[scheduler] val eventProcessLoop = new DAGSchedulerEventProcessLoop(this)
private[scheduler] class DAGSchedulerEventProcessLoop(dagScheduler: DAGScheduler)
  extends EventLoop[DAGSchedulerEvent]("dag-scheduler-event-loop") with Logging {

  private[this] val timer = dagScheduler.metricsSource.messageProcessingTimer

  /**
   * The main event loop of the DAG scheduler.
   */
  // The loop's dedicated Java thread keeps calling this method
  override def onReceive(event: DAGSchedulerEvent): Unit = {
    val timerContext = timer.time()
    try {
      doOnReceive(event)
    } finally {
      timerContext.stop()
    }
  }

  private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
    // Job submission
    case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
      dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite,
        listener, properties)

DAGSchedulerEventProcessLoop extends the base class EventLoop; here is how its thread processes events:

private[spark] abstract class EventLoop[E](name: String) extends Logging {

  private val eventQueue: BlockingQueue[E] = new LinkedBlockingDeque[E]()

  private val stopped = new AtomicBoolean(false)

  // The java.lang.Thread that keeps taking event messages off eventQueue
  // and dispatching them through onReceive
  private val eventThread = new Thread(name) {
    setDaemon(true)

    override def run(): Unit = {
      try {
        while (!stopped.get) {
          // Take the next event message off the queue
          val event = eventQueue.take()
          try {
            // onReceive pattern-matches the event and drives the handling
            onReceive(event)
          } catch {
            case NonFatal(e) =>
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }
  }

Next comes the creation of the ResultStage:
// Triggered once eventProcessLoop receives the job-submission event;
// this is where stage division starts
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  try {
    // New stage creation may throw an exception if, for example, jobs are run on a
    // HadoopRDD whose underlying HDFS files have been deleted.
    // Creating the ResultStage is where the submitted job really starts being
    // divided into stages. It walks backwards, recursively visiting every
    // parent RDD, reusing anything persisted and recomputing the rest.
    // Note: stages come in two kinds, ShuffleMapStage and ResultStage; every
    // job consists of exactly one ResultStage plus zero or more ShuffleMapStages.
    finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  } catch {
    case e: Exception =>
      logWarning("Creating new stage failed due to exception - job: " + jobId, e)
      listener.jobFailed(e)
      return
  }
  // Wrap the ResultStage in an ActiveJob -- think of it as the job's
  // in-flight representative
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  // Clear the cached locations of every persisted RDD partition
  clearCacheLocs()
  logInfo("Got job %s (%s) with %d output partitions".format(
    job.jobId, callSite.shortForm, partitions.length))
  logInfo("Final stage: " + finalStage + " (" + finalStage.name + ")")
  logInfo("Parents of final stage: " + finalStage.parents)
  logInfo("Missing parents: " + getMissingParentStages(finalStage))
  val jobSubmissionTime = clock.getTimeMillis()
  // HashMap from jobId to its ActiveJob
  jobIdToActiveJob(jobId) = job
  // HashSet holding all ActiveJobs
  activeJobs += job
  // As soon as the finalStage exists, the ActiveJob wrapping it is registered
  // on the stage's own _activeJob field; it is cleared once the job finishes
  finalStage.setActiveJob(job)
  // Pull out all stage ids belonging to this jobId, as an array
  val stageIds = jobIdToStageIds(jobId).toArray
  // Pull out each stage's latest attempt info to report when the job starts
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  // Wrap a SparkListenerEvent announcing that the job has started, along with
  // its details. Under the hood the event is posted to an eventQueue that a
  // dedicated Java thread keeps polling and handling
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  // Submit the stage
  submitStage(finalStage)
}
When a job is built, stage construction visibly starts from the last RDD, where the ResultStage is created. From there the scheduler keeps walking the parent dependencies backwards, checking at each step whether the RDD was persisted before (cached, materialized, or checkpointed). If not, it pulls out that RDD's parents and keeps checking, until it reaches a persisted RDD or the very first RDD of the lineage. Execution then runs the other way: the stages are computed from front to back until the ResultStage produces the final result.
/**
 * Create a ResultStage associated with the provided jobId.
 */
private def createResultStage(
    rdd: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    jobId: Int,
    callSite: CallSite): ResultStage = {
  // Create the ResultStage's parent stages. Inside there are several nested
  // passes that collect shuffle dependencies and create ShuffleMapStages in a
  // loop; if there is no shuffle at all this returns an empty List
  val parents = getOrCreateParentStages(rdd, jobId)
  // Increment the stage id counter for the current stage
  val id = nextStageId.getAndIncrement()
  // Build the ResultStage from the freshly created parent stages and the
  // other core parameters
  val stage = new ResultStage(id, rdd, func, partitions, parents, jobId, callSite)
  // Record the ResultStage under its id in stageIdToStage
  stageIdToStage(id) = stage
  // Update jobIds and jobIdToStageIds
  updateJobIdStageIdMaps(jobId, stage)
  // Return the ResultStage
  stage
}
/**
 * Get or create the list of parent stages for a given RDD. The new Stages will be created with
 * the provided firstJobId.
 */
// Creates each parent stage; only shuffle operations produce new stages, so
// the returned list may be empty, in which case the job has just the one ResultStage
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
  // Walk the given RDD's dependencies until the first ShuffleDependency is
  // found (there may be several, or none), collect them in a HashSet, then
  // map over them, creating a parent ShuffleMapStage for each.
  // Note: later code calls this method over and over to create ancestor
  // stages; once no ShuffleDependency matches, the recursion bottoms out and
  // parent-stage creation stops
  getShuffleDependencies(rdd).map { shuffleDep =>
    // Creates every parent ShuffleMapStage of the ShuffleDependency at hand
    getOrCreateShuffleMapStage(shuffleDep, firstJobId)
  }.toList
}

Starting with getShuffleDependencies: it only extracts the given RDD's nearest shuffle dependencies (a job's stages are divided by shuffles; one job produces zero or more ShuffleMapStages and exactly one ResultStage). If a dependency is not a ShuffleDependency it keeps digging into the parent RDDs, iterating until shuffle dependencies are found — or there are none at all.
/**
 * Returns shuffle dependencies that are immediate parents of the given RDD.
 *
 * This function will not return more distant ancestors. For example, if C has a shuffle
 * dependency on B which has a shuffle dependency on A:
 *
 * A <-- B <-- C
 *
 * calling this function with rdd C will only return the B <-- C dependency.
 *
 * This function is scheduler-visible for the purpose of unit testing.
 */
// Only extracts the ShuffleDependencies of the first shuffle-dependent RDDs reached
private[scheduler] def getShuffleDependencies(
    rdd: RDD[_]): HashSet[ShuffleDependency[_, _, _]] = {
  // HashSet collecting the ShuffleDependencies found
  val parents = new HashSet[ShuffleDependency[_, _, _]]
  // Temporary set of RDDs already traversed below
  val visited = new HashSet[RDD[_]]
  // Stack is a last-in-first-out (LIFO) data structure
  val waitingForVisit = new Stack[RDD[_]]
  // Push the starting rdd onto waitingForVisit
  waitingForVisit.push(rdd)
  // Loop as long as waitingForVisit is non-empty
  while (waitingForVisit.nonEmpty) {
    // Pop the RDD on top of the stack
    val toVisit = waitingForVisit.pop()
    // Skip the RDD if it is already in visited
    if (!visited(toVisit)) {
      // Record it in visited so this loop never processes it twice
      visited += toVisit
      // Iterate over all of this RDD's dependencies (a Seq[Dependency[_]])
      // and match on each one. Parents reached through non-shuffle
      // dependencies are pushed back onto waitingForVisit, and the most
      // recently pushed RDD is popped first, until ShuffleDependencies are
      // found -- there may be none.
      // Note: zero, one, or many ShuffleDependencies may come back; a
      // CoGroupedRDD for instance depends on several parent RDDs, while a
      // ShuffledRDD has exactly one
      toVisit.dependencies.foreach {
        case shuffleDep: ShuffleDependency[_, _, _] =>
          // A ShuffleDependency goes into parents
          parents += shuffleDep
        case dependency =>
          // Any other dependency: push its parent RDD onto waitingForVisit
          waitingForVisit.push(dependency.rdd)
      }
    }
  }
  // Return the collected ShuffleDependencies
  parents
}

Inside the while loop it walks all the dependencies of the RDD at hand. Careful: don't let the method name and return type mislead you into thinking it gathers every dependency of the RDD and all its ancestors — it only collects the nearest shuffle dependencies. The reason several can come back is that operators like CoGroupedRDD (e.g. join) depend on multiple parent RDDs; every operator overrides the base RDD's getDependencies, each with its own implementation.
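As a concrete trace of that traversal (a hypothetical lineage, in the spirit of the A <-- B <-- C example in the scaladoc above):

val a = sc.parallelize(1 to 10).map(i => (i % 3, i)) // narrow deps only
val b = a.reduceByKey(_ + _)                         // ShuffleDependency #1
val c = b.map(identity).reduceByKey(_ + _)           // ShuffleDependency #2
// getShuffleDependencies(c) pops c, matches ShuffleDependency #2 and stops
// there: it returns only #2, never walking past it to #1. The stage for #1
// is created later, when getOrCreateShuffleMapStage recurses upwards.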
/**
 * Get the list of dependencies of this RDD, taking into account whether the
 * RDD is checkpointed or not.
 */
final def dependencies: Seq[Dependency[_]] = {
  // First check whether this RDD was checkpointed.
  // Note: once an RDD is checkpointed, the lineage of its ancestors is
  // truncated and cleared. OneToOneDependency means each child partition
  // depends on exactly one parent partition, so a checkpointed RDD comes back
  // as a list containing a single OneToOneDependency
  checkpointRDD.map(r => List(new OneToOneDependency(r))).getOrElse {
    // Not checkpointed: check whether dependencies_ has been computed yet.
    // dependencies_ is a Seq[Dependency[_]] holding all of this RDD's dependencies
    if (dependencies_ == null) {
      // If dependencies_ is null, call getDependencies to compute it.
      // RDD subclasses such as ShuffledRDD and CoGroupedRDD override
      // getDependencies, building the dependencies from their parent RDDs,
      // partition counts and so on; the result is cached in dependencies_
      dependencies_ = getDependencies
    }
    // Return dependencies_
    dependencies_
  }
}

/**
 * Implemented by subclasses to return how this RDD depends on parent RDDs. This method will only
 * be called once, so it is safe to implement a time-consuming computation in it.
 */
protected def getDependencies: Seq[Dependency[_]] = deps
A plain RDD simply returns the dependencies it was constructed with:

abstract class RDD[T: ClassTag](
    @transient private var _sc: SparkContext,
    @transient private var deps: Seq[Dependency[_]]
  ) extends Serializable with Logging {

RDDs like ShuffledRDD, CoGroupedRDD and MapPartitionsRDD all override getDependencies with their own logic.
Well, since we're here, let's take a detour through how a few other operators derive their dependencies.
When you call reduceByKey without specifying a partitioner, the default is a HashPartitioner, and a shuffle may be produced. It has several overloads, but they all bottom out in combineByKeyWithClassTag.
Because shuffle is widely considered the most expensive part of a cluster workload, Spark was designed from the start to avoid it where possible, so before any ShuffleDependency is created there is always a partitioner check.
def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}

def combineByKeyWithClassTag[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null)(implicit ct: ClassTag[C]): RDD[(K, C)] = self.withScope {
  require(mergeCombiners != null, "mergeCombiners must be defined") // required as of Spark 0.9.0
  if (keyClass.isArray) {
    if (mapSideCombine) {
      throw new SparkException("Cannot use map-side combining with array keys.")
    }
    if (partitioner.isInstanceOf[HashPartitioner]) {
      throw new SparkException("HashPartitioner cannot partition array keys.")
    }
  }
  // Used for both the map-side and reduce-side aggregation
  val aggregator = new Aggregator[K, V, C](
    self.context.clean(createCombiner),
    self.context.clean(mergeValue),
    self.context.clean(mergeCombiners))
  // Compare this RDD's partitioner with the one passed in. A partitioner
  // carries both the partitioning scheme (Hash/RangePartitioner) and how each
  // key maps to a partition
  if (self.partitioner == Some(partitioner)) {
    // If they are equal, fall back to the mapPartitions transformation and
    // avoid the shuffle entirely
    self.mapPartitions(iter => {
      val context = TaskContext.get()
      new InterruptibleIterator(context, aggregator.combineValuesByKey(iter, context))
    }, preservesPartitioning = true)
  } else {
    // Otherwise a shuffle is required: build a ShuffledRDD from the current
    // RDD (which becomes the shuffle's parent) and the partitioner
    new ShuffledRDD[K, V, C](self, partitioner)
      .setSerializer(serializer)
      .setAggregator(aggregator)
      .setMapSideCombine(mapSideCombine)
  }
}
The HashPartitioner produced by default:

def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  // Prepend rdd to any other RDDs passed in (++ concatenates the two lists)
  val rdds = (Seq(rdd) ++ others)
  // Keep only the RDDs that carry a partitioner whose numPartitions is
  // greater than 0 -- i.e. upstream RDDs that were actually partitioned
  val hasPartitioner = rdds.filter(_.partitioner.exists(_.numPartitions > 0))
  // Did any survive the filter?
  if (hasPartitioner.nonEmpty) {
    // Yes: take the partitioner of the filtered RDD with the most partitions
    hasPartitioner.maxBy(_.partitions.length).partitioner.get
  } else {
    // Reaching here means none of the upstream RDDs ever had a partitioner.
    // If a default parallelism was configured, use it as the partition count
    // of a new HashPartitioner
    if (rdd.context.conf.contains("spark.default.parallelism")) {
      new HashPartitioner(rdd.context.defaultParallelism)
    } else {
      // Likewise default to a HashPartitioner, sized to the largest partition
      // count among all upstream RDDs
      new HashPartitioner(rdds.map(_.partitions.length).max)
    }
  }
}

class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  // The number of reduce-side (output) partitions
  def numPartitions: Int = partitions

  // How the reduce side assigns a key to a partition
  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  // Before any RDD produces a ShuffleDependency, the two partitioners are
  // compared for equality
  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner =>
      // Only the partition counts are compared
      h.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}
While we're at it, here is the algorithm HashPartitioner uses during a shuffle to assign keys to the downstream reduce partitions:

def nonNegativeMod(x: Int, mod: Int): Int = {
  // Take the key's hashCode modulo the number of output partitions
  val rawMod = x % mod
  // If the remainder is negative, add the partition count; otherwise return it as-is
  rawMod + (if (rawMod < 0) mod else 0)
}
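A quick worked example of that arithmetic (the key values are made up, and Utils is Spark-internal, so read this as tracing the math rather than a user-facing API):

// numPartitions = 4
Utils.nonNegativeMod("spark".hashCode, 4)
// "spark".hashCode = 109638365 -> 109638365 % 4 = 1 -> partition 1
Utils.nonNegativeMod(-7, 4)
// -7 % 4 == -3 on the JVM (the sign follows the dividend), so rawMod < 0
// and the result is -3 + 4 = 1, keeping the partition id non-negative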
If the partitioners are equal, the call simply turns into a MapPartitionsRDD (a one-to-one-dependency operator; more on those shortly) and no shuffle happens; otherwise a ShuffledRDD is produced. A sketch of that partitioner-reuse optimization follows below; after it, back to where dependencies get extracted.
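A small, hypothetical sketch of the optimization (it assumes an existing SparkContext `sc`; data is made up): if the parent RDD already carries the same partitioner, combineByKeyWithClassTag degenerates into mapPartitions and the second shuffle disappears:

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val part = new HashPartitioner(4)
val grouped = pairs.reduceByKey(part, _ + _)   // shuffle #1; result keeps `part`
val again   = grouped.reduceByKey(part, _ + _) // self.partitioner == Some(part)
// -> the mapPartitions branch is taken: OneToOneDependency, same stage, no shuffle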
protected def getDependencies: Seq[Dependency[_]] = deps

ShuffledRDD overrides this dependency derivation; as you can see, it ends up newing a ShuffleDependency:

// Compute this RDD's dependencies
override def getDependencies: Seq[Dependency[_]] = {
  // First obtain the serializer member needed to build the ShuffleDependency;
  // if the user specified one, take it directly
  val serializer = userSpecifiedSerializer.getOrElse {
    // Otherwise get it from the serializerManager of the SparkEnv
    val serializerManager = SparkEnv.get.serializerManager
    // Pick the serializer depending on whether map-side combine is enabled
    if (mapSideCombine) {
      serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[C]])
    } else {
      serializerManager.getSerializer(implicitly[ClassTag[K]], implicitly[ClassTag[V]])
    }
  }
  // The ShuffleDependency is returned wrapped in a List.
  // Note: only the single parent RDD's dependency comes back here. Both this
  // and CoGroupedRDD override RDD's
  //   protected def getDependencies: Seq[Dependency[_]] = deps
  // so the result has to satisfy the Seq[Dependency[_]] type -- hence the
  // List wrapper. Don't be misled by the method name and return type; the
  // same goes for getCacheLocs, used for best-task-location decisions, which
  // checks more than just the MEMORY level
  List(new ShuffleDependency(prev, part, serializer, keyOrdering, aggregator,
    mapSideCombine))
}
At this point the ShuffleDependency registers itself with the shuffleManager and the ContextCleaner; most importantly, it wraps its parent RDD, and every later backwards traversal of the lineage pulls the parent from here.

class ShuffleDependency[K: ClassTag, V: ClassTag, C: ClassTag](
    @transient private val _rdd: RDD[_ <: Product2[K, V]],
    val partitioner: Partitioner,
    val serializer: Serializer = SparkEnv.get.serializer,
    val keyOrdering: Option[Ordering[K]] = None,
    val aggregator: Option[Aggregator[K, V, C]] = None,
    val mapSideCombine: Boolean = false)
  extends Dependency[Product2[K, V]] {

  // The parent RDD
  override def rdd: RDD[Product2[K, V]] = _rdd.asInstanceOf[RDD[Product2[K, V]]]

  private[spark] val keyClassName: String = reflect.classTag[K].runtimeClass.getName
  private[spark] val valueClassName: String = reflect.classTag[V].runtimeClass.getName
  // Note: It's possible that the combiner class tag is null, if the combineByKey
  // methods in PairRDDFunctions are used instead of combineByKeyWithClassTag.
  private[spark] val combinerClassName: Option[String] =
    Option(reflect.classTag[C]).map(_.runtimeClass.getName)

  // Allocate a shuffleId by incrementing nextShuffleId
  val shuffleId: Int = _rdd.context.newShuffleId()

  // Register the shuffle with the shuffleManager and get back a typed
  // ShuffleHandle. For example, the SortShuffleManager that SparkEnv uses by
  // default (covered in an earlier chapter) overrides registerShuffle
  val shuffleHandle: ShuffleHandle = _rdd.context.env.shuffleManager.registerShuffle(
    shuffleId, _rdd.partitions.length, this)

  // Register this ShuffleDependency with the ContextCleaner
  _rdd.sparkContext.cleaner.foreach(_.registerShuffleForCleanup(this))
}
Of course, every RDD has dependencies; different RDD types just inherit and implement them differently.
Let's look at the MapPartitionsRDD mentioned earlier, via the map function:

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

// Extending RDD[U](prev) produces a OneToOneDependency;
// the parameter `var prev: RDD[T]` is the parent RDD
private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  // By default a MapPartitionsRDD never shuffles, so it produces no
  // ShuffleDependency and carries no partitioner of its own
  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  // The partition count is that of the first parent RDD
  override def getPartitions: Array[Partition] = firstParent[T].partitions

  // The compute logic comes from the func of the original operator, as in
  // runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U)
  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))

  // Clears the dependencies, e.g. when the RDD gets checkpointed
  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}
It's hard to spot where its dependency comes from — you might overlook the extended RDD class, since RDD produces no dependency by default. But MapPartitionsRDD extends through the overloaded RDD constructor that wires up a one-to-one dependency:

/** Construct an RDD with just a one-to-one dependency on one parent */
// The RDD constructor that builds a OneToOneDependency
def this(@transient oneParent: RDD[_]) =
  this(oneParent.context, List(new OneToOneDependency(oneParent)))

Operators with OneToOneDependency, such as map and filter, map each child partition to exactly one parent partition, so no shuffle ever occurs; like RangeDependency, it extends the narrow-dependency base class NarrowDependency.

/**
 * :: DeveloperApi ::
 * Represents a one-to-one dependency between partitions of the parent and child RDDs.
 */
@DeveloperApi
class OneToOneDependency[T](rdd: RDD[T]) extends NarrowDependency[T](rdd) {
  override def getParents(partitionId: Int): List[Int] = List(partitionId)
}
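Because OneToOneDependency maps partition i of the child straight onto partition i of the parent, a chain of such operators collapses into one stage and is pipelined inside a single task. A hypothetical chain (the input path is a placeholder):

val out = sc.textFile("hdfs://namenode/path/to/input") // placeholder path
  .map(_.toLowerCase)   // MapPartitionsRDD, OneToOneDependency
  .filter(_.nonEmpty)   // MapPartitionsRDD, OneToOneDependency
  .map(_.length)        // MapPartitionsRDD, OneToOneDependency
// All three live in the same stage: for each partition, one task runs
// map -> filter -> map as nested iterators, with no shuffle in between.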
ShuffleDependency sits alongside the narrow dependencies (both extend Dependency):

/**
 * :: DeveloperApi ::
 * Base class for dependencies.
 */
// Two direct subclasses, plus two subclasses of NarrowDependency:
// 1: ShuffleDependency
// 2: NarrowDependency ---> RangeDependency
//                     ---> OneToOneDependency
@DeveloperApi
abstract class Dependency[T] extends Serializable {
  def rdd: RDD[T]
}
Finally, one more operator: CoGroupedRDD. (This may feel like a long detour, but to really understand DAGScheduler's stage division you need to know at least these core operators' underlying implementations and dependency relationships, because the operator type changes some of the details of how stages are divided. I had planned a separate chapter on operator RDDs, but decided to weave the relevant ones into the DAGScheduler walkthrough — it reads better with the context anyway.)
Take join as the example:

def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}

def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    throw new SparkException("HashPartitioner cannot partition array keys.")
  }
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}
Jump straight to its core dependency-derivation method. As with reduceByKey, if an input RDD was already partitioned with the same partitioner, a one-to-one dependency is produced directly. The difference is that join operates on multiple RDDs, so more than one dependency comes back, returned as a Seq[Dependency[_]]:

override def getDependencies: Seq[Dependency[_]] = {
  // Unlike ShuffledRDD's getDependencies, a CoGroupedRDD is produced from
  // several RDDs, so a dependency is derived per parent, whereas ShuffledRDD
  // only derives its single parent's dependency
  rdds.map { rdd: RDD[_] =>
    // What is effectively compared is the partitioning (incl. partition count)
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      // Equal: produce a OneToOneDependency
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      // Not equal: produce a ShuffleDependency
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}
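A hypothetical trace of that branch (assuming an existing SparkContext `sc`; the data is made up): co-grouping one pre-partitioned RDD with an unpartitioned one yields one dependency of each kind:

import org.apache.spark.HashPartitioner

val pairsA = sc.parallelize(Seq((1, "x"), (2, "y")))
val pairsB = sc.parallelize(Seq((1, 9.0), (3, 7.0)))
val part = new HashPartitioner(4)
val left = pairsA.reduceByKey(part, _ + _) // left.partitioner == Some(part)
val joined = left.join(pairsB, part)       // CoGroupedRDD(Seq(left, pairsB), part)
// getDependencies maps over the two parents:
//   left.partitioner   == Some(part) -> OneToOneDependency(left)    (no shuffle)
//   pairsB.partitioner != Some(part) -> ShuffleDependency(pairsB)   (shuffle)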
OK, that covers how the core operators derive their dependencies. Now back to where the DAGScheduler extracts them.
If you've lost the thread, look back to:

dependencies_ = getDependencies

The traversal repeats until the RDD's nearest ShuffleDependencies are obtained (there may be none at all), and then ShuffleMapStage creation begins.
Back to the getOrCreateParentStages method from before:
/**
 * Get or create the list of parent stages for a given RDD. The new Stages will be created with
 * the provided firstJobId.
 */
// Creates each parent stage; only shuffle operations produce new stages, so
// the returned list may be empty, in which case the job has just the one ResultStage
private def getOrCreateParentStages(rdd: RDD[_], firstJobId: Int): List[Stage] = {
  // Walk the given RDD's dependencies until the first ShuffleDependency is
  // found (there may be several, or none), collect them in a HashSet, then
  // map over them, creating a parent ShuffleMapStage for each.
  // Note: later code calls this method over and over to create ancestor
  // stages; once no ShuffleDependency matches, the recursion bottoms out and
  // parent-stage creation stops
  getShuffleDependencies(rdd).map { shuffleDep =>
    // Creates every parent ShuffleMapStage of the ShuffleDependency at hand
    getOrCreateShuffleMapStage(shuffleDep, firstJobId)
  }.toList
}

Before creating a ShuffleMapStage, it first tries to fetch one from shuffleIdToMapStage by shuffleId (anything created before is guaranteed to have been registered there so the same shuffle can be reused).
Only if nothing is found there does it call getMissingAncestorShuffleDependencies.
The whole method nests several layers of iteration; read the comments carefully.

/**
 * Gets a shuffle map stage if one exists in shuffleIdToMapStage. Otherwise, if the
 * shuffle map stage doesn't already exist, this method will create the shuffle map stage in
 * addition to any missing ancestor shuffle map stages.
 */
private def getOrCreateShuffleMapStage(
    shuffleDep: ShuffleDependency[_, _, _],
    firstJobId: Int): ShuffleMapStage = {
  // Look up shuffleIdToMapStage with the shuffleId taken from the ShuffleDependency
  shuffleIdToMapStage.get(shuffleDep.shuffleId) match {
    // Found: return it directly
    case Some(stage) => stage
    // Not found: locate all ancestor shuffle dependencies and build every
    // ancestor ShuffleMapStage
    case None =>
      // Create stages for all missing ancestor shuffle dependencies.
      // Finds the ancestor shuffle dependencies not yet registered in
      // shuffleIdToMapStage. This collects all of the rdd's ancestor
      // ShuffleDependencies and contains a similarly shaped nested traversal
      // inside, so it is an expensive piece of code
      getMissingAncestorShuffleDependencies(shuffleDep.rdd).foreach { dep =>
        // Even though getMissingAncestorShuffleDependencies only returns shuffle dependencies
        // that were not already in shuffleIdToMapStage, it's possible that by the time we
        // get to a particular dependency in the foreach loop, it's been added to
        // shuffleIdToMapStage by the stage creation process for an earlier dependency. See
        // SPARK-13902 for more information.
        // Create a ShuffleMapStage for each ancestor dependency found. Because
        // they come back in a Stack, creation starts from the earliest
        // ShuffleDependency
        if (!shuffleIdToMapStage.contains(dep.shuffleId)) {
          createShuffleMapStage(dep, firstJobId)
        }
      }
      // Finally, create a stage for the given shuffle dependency.
      createShuffleMapStage(shuffleDep, firstJobId)
  }
}

The structure and purpose of shuffleIdToMapStage:
// Mapping from shuffle dependency id to its ShuffleMapStage; it only contains
// stages of running jobs and is cleared once a job finishes. When a
// ShuffleMapStage is created, it registers its shuffleId here so the same
// operator/shuffle can be reused later
private[scheduler] val shuffleIdToMapStage = new HashMap[Int, ShuffleMapStage]

If there is no reusable ShuffleMapStage, getMissingAncestorShuffleDependencies is called:

/** Find ancestor shuffle dependencies that are not registered in shuffleToMapStage yet */
private def getMissingAncestorShuffleDependencies(
    rdd: RDD[_]): Stack[ShuffleDependency[_, _, _]] = {
  // Stack is a last-in-first-out (LIFO) structure; it is used here so that
  // ShuffleMapStage creation later starts from the earliest ShuffleDependency
  val ancestors = new Stack[ShuffleDependency[_, _, _]]
  // Temporary set of visited RDDs
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  val waitingForVisit = new Stack[RDD[_]]
  // Push the parent RDD onto waitingForVisit
  waitingForVisit.push(rdd)
  while (waitingForVisit.nonEmpty) {
    val toVisit = waitingForVisit.pop()
    // Has the RDD just popped off waitingForVisit been seen before?
    if (!visited(toVisit)) {
      // No: record it
      visited += toVisit
      // Collect this RDD's nearest ShuffleDependencies -- none, one, or many.
      // Effectively this keeps walking back until it reaches something
      // reusable, pushing every shuffle dependency traversed along the way
      // into ancestors, ready for ShuffleMapStage creation
      getShuffleDependencies(toVisit).foreach { shuffleDep =>
        if (!shuffleIdToMapStage.contains(shuffleDep.shuffleId)) {
          // Not registered in shuffleIdToMapStage yet: push it onto ancestors
          ancestors.push(shuffleDep)
          // Push the ShuffleDependency's parent RDD onto waitingForVisit and
          // keep the while loop pulling out the ancestors' dependencies,
          // until every ShuffleDependency has been visited or fetched
          waitingForVisit.push(shuffleDep.rdd)
        }
        // Otherwise, the dependency and its ancestors have already been registered.
      }
    }
  }
  // Returns every ancestor shuffle dependency not yet registered in
  // shuffleIdToMapStage; the stack may also come back empty
  ancestors
}
Now the ShuffleMapStages get created. A ShuffleMapStage sits immediately before a shuffle and may contain several transformations; when it executes, it saves the map-side output files, which reduce tasks fetch later.
Inside, getOrCreateParentStages is called back repeatedly — it's worth re-reading that method above, since it is effectively the entry point of ResultStage creation.
The recursion keeps going until a reusable stage, or the first stage of the lineage, is reached:

/**
 * Creates a ShuffleMapStage that generates the given shuffle dependency's partitions. If a
 * previously run stage generated the same shuffle data, this function will copy the output
 * locations that are still available from the previous shuffle to avoid unnecessarily
 * regenerating data.
 */
def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
  // The ShuffleDependency's parent RDD
  val rdd = shuffleDep.rdd
  // One task per partition
  val numTasks = rdd.partitions.length
  // Recurse on the parent RDD; this keeps looping until the first stage of
  // the lineage is obtained, and each created ShuffleMapStage is registered
  // in shuffleIdToMapStage for later reuse
  val parents = getOrCreateParentStages(rdd, jobId)
  // Increment nextStageId to label the current stage
  val id = nextStageId.getAndIncrement()
  // With the parent stages and the other core parameters in hand, build the
  // ShuffleMapStage
  val stage = new ShuffleMapStage(
    id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep, mapOutputTracker)

  // Record the new ShuffleMapStage in stageIdToStage
  stageIdToStage(id) = stage
  // And in shuffleIdToMapStage, so later code asking for the corresponding
  // ShuffleMapStage can fetch it directly
  shuffleIdToMapStage(shuffleDep.shuffleId) = stage
  // Update jobIds and jobIdToStageIds
  updateJobIdStageIdMaps(jobId, stage)

  // Register the shuffle info with the shuffleStatuses of the driver's
  // MapOutputTrackerMaster
  if (!mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    // Kind of ugly: need to register RDDs with the cache and map output tracker here
    // since we can't do it in the RDD constructor because # of partitions is unknown
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    // This records a shuffleId -> ShuffleStatus mapping on the driver's
    // MapOutputTrackerMaster; later, when the job is submitted, it is used to
    // verify whether the map stage is ready
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  // Finally return the new ShuffleMapStage
  stage
}
Once the ShuffleMapStage is created, you can see the shuffle being registered in the shuffleStatuses of the driver's own MapOutputTrackerMaster, used later both for validation and for the reduce side fetching map output:

def registerShuffle(shuffleId: Int, numMaps: Int) {
  if (shuffleStatuses.put(shuffleId, new ShuffleStatus(numMaps)).isDefined) {
    throw new IllegalArgumentException("Shuffle ID " + shuffleId + " registered twice")
  }
}

// Mapping on the driver from shuffleId to ShuffleStatus
private val shuffleStatuses = new ConcurrentHashMap[Int, ShuffleStatus]().asScala
ShuffleStatus mainly maintains, for a single ShuffleMapStage of a job, the mapping from map ids to MapStatus. A MapStatus describes one partition's map output: the BlockManagerId where the task ran (i.e. its address) and the sizes of the blocks it produced. Both come into play later, e.g. when computing the best task locations.

// The array is indexed by partition; each slot holds that partition's MapStatus.
// A MapStatus carries the BlockManagerId -- where the task ran -- plus the
// size of the block each reduce task will fetch
private[this] val mapStatuses = new Array[MapStatus](numPartitions)

private[spark] sealed trait MapStatus {
  /** Location where this task was run. */
  def location: BlockManagerId

  /**
   * Estimated size for the reduce block, in bytes.
   *
   * If a block is non-empty, then this method MUST return a non-zero size. This invariant is
   * necessary for correctness, since block fetchers are allowed to skip zero-size blocks.
   */
  def getSizeForBlock(reduceId: Int): Long
}
OK — after iterating backwards RDD by RDD (stage by stage) until the nearest point where a stage can be created or reused, the stages are built from front to back until the final ResultStage exists (of course, a job may consist of just that one ResultStage). We are now back in the handleJobSubmitted method shown at the beginning; scroll back up if you need to re-read it.
With finalStage in hand, and after the attribute bookkeeping above, we enter submitStage, the entry point for submitting the job:

/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  // Get the jobId of the first active job this stage belongs to
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    logDebug("submitStage(" + stage + ")")
    // waitingStages -> stages waiting to run
    // runningStages -> stages currently running
    // failedStages  -> stages that must be resubmitted after a fetch failure
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // Check whether this RDD and its ancestors were persisted; if not,
      // check whether the ShuffleMapStages built earlier are ready
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        // Empty means everything upstream is either persisted, or persisted
        // with its map stages ready -- time to submit the tasks!
        submitMissingTasks(stage, jobId.get)
      } else {
        // Reaching here means some parent map stage isn't ready yet
        for (parent <- missing) {
          // Submit the parent stage first
          submitStage(parent)
        }
        // and park this stage in waitingStages
        waitingStages += stage
      }
    }
  } else {
    // Otherwise abort the stage
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
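As a worked trace of the recursion, suppose a lineage with two shuffles, giving ShuffleMapStage 0 -> ShuffleMapStage 1 -> ResultStage 2. submitStage(2) finds stage 1 missing, so it parks stage 2 in waitingStages and calls submitStage(1); that in turn finds stage 0 missing, parks stage 1 and submits stage 0's tasks. As each map stage completes, submitWaitingChildStages (shown at the end of this chapter) pulls its child out of waitingStages and resubmits it, so the stages run strictly in the order 0, 1, 2.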
Before tasks are submitted, it first checks whether the RDDs were persisted and whether the map stages are ready:

private def getMissingParentStages(stage: Stage): List[Stage] = {
  // Map stages that are not ready yet
  val missing = new HashSet[Stage]
  // Temporary set of visited RDDs
  val visited = new HashSet[RDD[_]]
  // We are manually maintaining a stack here to prevent StackOverflowError
  // caused by recursively visiting
  // Again a last-in-first-out Stack
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      // Record unvisited RDDs so the next loop iteration skips them
      visited += rdd
      // Despite its name, getCacheLocs does not only check whether the RDD
      // was cached in memory; it performs two checks:
      //   1. cacheLocs.contains(rdd.id)
      //   2. rdd.getStorageLevel == StorageLevel.NONE
      // It returns task locations tagged executor_host_executorId; here we
      // test whether any partition came back with no locations.
      // Note: the best-task-location algorithm later also uses getCacheLocs(rdd)
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      // Nil is the empty list: true means some partition was never persisted,
      // so we must keep traversing the parent RDD's dependencies
      if (rddHasUncachedPartitions) {
        // Walk all of this rdd's dependencies (parents are only visited on
        // the next while iteration; there may be one or many). Effectively
        // this verifies that createResultStage managed to build the
        // ShuffleMapStages properly
        for (dep <- rdd.dependencies) {
          dep match {
            case shufDep: ShuffleDependency[_, _, _] =>
              // If the ShuffleMapStage was created earlier, it can be fetched
              // straight out of shuffleIdToMapStage
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              // Is the map phase ready, i.e. do all partitions have shuffle
              // output? When the ShuffleMapStage was created, the shuffle was
              // registered on the driver's MapOutputTrackerMaster; ultimately
              // this compares rdd.partitions.length with
              // ShuffleStatus._numAvailableOutputs
              if (!mapStage.isAvailable) {
                // Not ready: add it to missing
                missing += mapStage
              }
            // Narrow dependencies are pushed back for further traversal
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  // Push the stage's own RDD onto waitingForVisit first
  waitingForVisit.push(stage.rdd)
  // Loop until every RDD has been popped
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  missing.toList
}
A word on getCacheLocs: some source walkthroughs online claim it is only about caching, which is wrong — don't be fooled by the name. It checks every persistence flavor, not just the in-memory cache, including disk and off-heap, and it is used again later in the best-task-location algorithm. More generally: don't read source code by names alone. Even though code nowadays reads almost like natural language, grasping a framework's essence still requires digging into the low-level details and implementations.

private[scheduler]
def getCacheLocs(rdd: RDD[_]): IndexedSeq[Seq[TaskLocation]] = cacheLocs.synchronized {
  // Note: this doesn't use `getOrElse()` because this method is called O(num tasks) times
  // cacheLocs is a mutable HashMap from each RDD id to the persisted task
  // locations of its partitions. rdd.id comes from nextRddId.getAndIncrement(),
  // which registers the RDD with its SparkContext and returns its rddId.
  // First check whether the incoming rddId is already in cacheLocs
  if (!cacheLocs.contains(rdd.id)) {
    // Note: if the storage level is NONE, we don't need to get locations from block manager.
    val locs: IndexedSeq[Seq[TaskLocation]] = if (rdd.getStorageLevel == StorageLevel.NONE) {
      // Nil is the empty List (extends List[Nothing])
      IndexedSeq.fill(rdd.partitions.length)(Nil)
    } else {
      // The rdd has a storage level set but wasn't found in cacheLocs -- e.g.
      // a persisted RDD being looked up for the first time. Find the
      // locations of all its persisted tasks and record them in cacheLocs, so
      // this and all later calls can fetch the task addresses directly
      val blockIds =
        // Map each partition index of Array[Partition], together with the
        // rddId, into an RDDBlockId, and collect them as an Array[BlockId].
        // RDDBlockId extends BlockId and just overrides its name method; that
        // name is the block's global identifier.
        // (Many people ask how blocks relate to partitions -- this is one of
        // the links: one block per partition.)
        rdd.partitions.indices.map(index => RDDBlockId(rdd.id, index)).toArray[BlockId]
      // Fetch each blockId's locations: blockManagerMaster asks the
      // driver-side endpoint (receiveAndReply) to do the work, and ultimately
      // reads, from the driver's blockLocations, the BlockManagerIds holding
      // each blockId. A BlockManagerId uniquely identifies a BlockManager and
      // carries the host, executorId and other core members
      blockManagerMaster.getLocations(blockIds).map { bms =>
        // Extract each BlockManagerId's host and executorId (one host may run
        // several executors) and build a TaskLocation from the two; the
        // result is an ExecutorCacheTaskLocation whose toString formats as
        // executor_host_executorId -- the placement tag of each task!
        bms.map(bm => TaskLocation(bm.host, bm.executorId))
      }
    }
    // Store the location info in cacheLocs under the rdd's id; the line below
    // reads it straight back out
    cacheLocs(rdd.id) = locs
  }
  // Finally return the task locations for this rdd from cacheLocs.
  // Note: the result is empty in exactly one case -- the rdd isn't in
  // cacheLocs and its StorageLevel is NONE
  cacheLocs(rdd.id)
}

First, the structure of cacheLocs:

// The locations of every persisted RDD partition: the key is the RDD id, the
// value the location sequence of its partitions.
// You can read [Int, IndexedSeq[Seq[TaskLocation]]] roughly as
// [RDDId, BlockId[BlockManagerId[TaskLocation]]]
private val cacheLocs = new HashMap[Int, IndexedSeq[Seq[TaskLocation]]]

When looking up an RDD's earlier persistence, this interacts with blockManagerMaster and BlockManagerMasterEndpoint to obtain the location of every previously persisted RDD partition.
blockManagerMaster is created while SparkContext builds SparkEnv. On the driver it maintains the metadata of every node's BlockManager across the cluster, while BlockManagerMasterEndpoint is the message endpoint registered into SparkEnv when blockManagerMaster is created on the driver; it dispatches on the type of each incoming event message. See my earlier SparkEnv chapter for the details.
First, getLocations — this goes through the Netty RPC layer (covered in an earlier chapter):

/** Get locations of multiple blockIds from the driver */
def getLocations(blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
  // askSync triggers the driver endpoint's receiveAndReply, which matches
  // GetLocationsMultipleBlockIds and replies with
  // context.reply(getLocationsMultipleBlockIds(blockIds))
  driverEndpoint.askSync[IndexedSeq[Seq[BlockManagerId]]](
    GetLocationsMultipleBlockIds(blockIds))
}

driverEndpoint.askSync triggers the BlockManagerMasterEndpoint's two-way receiveAndReply, which matches GetLocationsMultipleBlockIds:
case GetLocationsMultipleBlockIds(blockIds) =>
  // Reply to the sender with the information for the requested block ids
  context.reply(getLocationsMultipleBlockIds(blockIds))

private def getLocationsMultipleBlockIds(
    blockIds: Array[BlockId]): IndexedSeq[Seq[BlockManagerId]] = {
  // For each blockId, collect the BlockManagerIds holding it
  blockIds.map(blockId => getLocations(blockId))
}

Finally, each BlockId's BlockManagerIds (carrying host, executorId, port and so on) are fetched:

private def getLocations(blockId: BlockId): Seq[BlockManagerId] = {
  // If blockLocations contains the blockId, return its set; otherwise empty
  if (blockLocations.containsKey(blockId)) blockLocations.get(blockId).toSeq else Seq.empty
}

The structure of blockLocations:

// Mapping from block id to the set of block managers that have the block.
// One BlockId can map to several BlockManagerIds: because of the
// StorageLevel (replication) or checkpointing, a block may live in the
// BlockManagers of several executors.
// Note: JHashMap is java.util.HashMap
private val blockLocations = new JHashMap[BlockId, mutable.HashSet[BlockManagerId]]
Finally the host and executorId inside are extracted and wrapped by TaskLocation into each partition's location tag:

blockManagerMaster.getLocations(blockIds).map { bms =>
  // Extract each BlockManagerId's host and executorId (one host may run
  // several executors) and pass both to TaskLocation, yielding an
  // ExecutorCacheTaskLocation whose only member, toString, formats as
  // executor_host_executorId -- each task's placement tag!
  bms.map(bm => TaskLocation(bm.host, bm.executorId))
}

What gets called here is the apply of TaskLocation's companion object:

def apply(host: String, executorId: String): TaskLocation = {
  new ExecutorCacheTaskLocation(host, executorId)
}

/**
 * A location that includes both a host and an executor id on that host.
 */
private [spark]
case class ExecutorCacheTaskLocation(override val host: String, executorId: String)
  extends TaskLocation {
  // executor_host_executorId
  override def toString: String = s"${TaskLocation.executorLocationTag}${host}_$executorId"
}
With all the checks above done (was it persisted, are the map stages ready), we enter the task-submission prelude. This involves the best-task-location algorithm, closure packaging, broadcast variables, the construction of ShuffleMapTasks and ResultTasks (next chapter), and the task submission itself (next chapter).
/** Called when stage's parents are available and we can now do its task. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")

  // First figure out the indexes of partition ids to compute.
  // Returns a Seq[Int] of the partition ids that still need computing.
  // Note: ShuffleMapStage and ResultStage implement this differently
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

  // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
  // with this Stage
  val properties = jobIdToActiveJob(jobId).properties

  // Mark the stage as running
  runningStages += stage
  // SparkListenerStageSubmitted should be posted before testing whether tasks are
  // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
  // will be posted, which should always come after a corresponding SparkListenerStageSubmitted
  // event.
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }

  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    // For each partition id, run the best-location algorithm against this
    // stage's RDD. Note that different RDD types implement different
    // preferred-location logic. For a ShuffledRDD the core idea is:
    //   1. ask the BlockManager whether the RDD was persisted; if so, get the
    //      addresses from the driver-side BlockManagerMaster
    //   2. otherwise check whether it was checkpointed, possibly fetching
    //      straight from HDFS
    //   3. failing both, ask the MapOutputTracker where the map-side shuffle
    //      files were written
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    case NonFatal(e) =>
      stage.makeNewStageAttempt(partitionsToCompute.size)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }

  // Record the freshly attempted stage info in _latestInfo
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)

  // If there are tasks to execute, record the submission time of the stage. Otherwise,
  // post the even without the submission time, which indicates that this stage was
  // skipped.
  if (partitionsToCompute.nonEmpty) {
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  }
  // Tell the listenerBus that the stage has been submitted
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))

  // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
  // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we broadcast
  // the serialized copy of the RDD and for each task we will deserialize it, which means each
  // task gets a different copy of the RDD. This provides stronger isolation between tasks that
  // might modify state of objects referenced in their closures. This is necessary in Hadoop
  // where the JobConf/Configuration object is not thread-safe.
  // Below, the task is packaged into a closure and distributed via Broadcast
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    // Either way the task must be serialized and broadcast; what comes back
    // is the task's closure as a byte array
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        // Convert to a byte array (a java.nio.ByteBuffer underneath)
        JavaUtils.bufferToArray(
          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
      case stage: ResultStage =>
        JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
    }

    // broadcast turns the given object into a read-only broadcast variable
    // shipped to each node; every executor partition on a node then pulls its
    // closure locally. Without broadcast, a copy of the closure would be sent
    // per task, causing heavy IO -- the same optimization you would use for
    // large config files, or to avoid the shuffle of a join.
    // As an aside: Spark RDDs are always shipped to the nodes as closures;
    // closures load lazily and cannot modify variables outside themselves
    // (use an Accumulator for that)
    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // In the case of a failure during serialization, abort the stage.
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString, Some(e))
      runningStages -= stage
      // Abort execution
      return
    case NonFatal(e) =>
      abortStage(stage, s"Task serialization failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }

  val tasks: Seq[Task[_]] = try {
    // The task metrics object is also serialized into the closure
    val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
    stage match {
      // A ShuffleMapStage yields ShuffleMapTasks
      case stage: ShuffleMapStage =>
        // pendingPartitions holds the partitions/tasks not finished yet; they
        // are removed as they complete, and the DAGScheduler uses the set to
        // decide whether the stage is done. Start from a clean set
        stage.pendingPartitions.clear()
        // Walk every partition that needs computing
        partitionsToCompute.map { id =>
          // The partition's locations
          val locs = taskIdToLocations(id)
          // The partition of this stage's rdd
          val part = stage.rdd.partitions(id)
          // Mark it as pending
          stage.pendingPartitions += id
          // Build the ShuffleMapTask; runTask is later invoked through it
          // (details next chapter). Note: tasks come in two kinds,
          // ShuffleMapTask and ResultTask
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
            Option(sc.applicationId), sc.applicationAttemptId)
        }

      // A ResultStage yields ResultTasks
      case stage: ResultStage =>
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, properties, serializedTaskMetrics,
            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${Utils.exceptionString(e)}", Some(e))
      runningStages -= stage
      return
  }

  if (tasks.size > 0) {
    logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
      s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
    // Submit the tasks. This calls TaskSchedulerImpl, the implementation of
    // the taskScheduler trait, which submits the tasks wrapped in a TaskSet
    // (details next chapter)
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run.
    // For various reasons we may end up with no tasks at all, but since
    // SparkListenerStageSubmitted was already posted, the stage has to be
    // marked finished here
    markStageAsFinished(stage, None)

    val debugString = stage match {
      case stage: ShuffleMapStage =>
        s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})"
      case stage : ResultStage =>
        s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
    }
    logDebug(debugString)

    // Once a parent stage finishes, its waiting child stages are submitted in turn
    submitWaitingChildStages(stage)
  }
}
First, the partitions that need computing are found; take ShuffleMapStage as the example:

/** Returns the sequence of partition ids that are missing (i.e. needs to be computed). */
override def findMissingPartitions(): Seq[Int] = {
  mapOutputTrackerMaster
    .findMissingPartitions(shuffleDep.shuffleId)
    // If the tracker doesn't know the shuffle, fall back to all partition ids
    .getOrElse(0 until numPartitions)
}

/**
 * Returns the sequence of partition ids that are missing (i.e. needs to be computed), or None
 * if the MapOutputTrackerMaster doesn't know about this shuffle.
 */
def findMissingPartitions(shuffleId: Int): Option[Seq[Int]] = {
  shuffleStatuses.get(shuffleId).map(_.findMissingPartitions())
}

/**
 * Returns the sequence of partition ids that are missing (i.e. needs to be computed).
 */
def findMissingPartitions(): Seq[Int] = synchronized {
  // Keep the partition ids whose slot in mapStatuses is still null.
  // mapStatuses is filled in as tasks complete and report their partition
  // info, so on the first computation it is entirely empty
  val missing = (0 until numPartitions).filter(id => mapStatuses(id) == null)
  assert(missing.size == numPartitions - _numAvailableOutputs,
    s"${missing.size} missing, expected ${numPartitions - _numAvailableOutputs}")
  missing
}
Then, from the partition ids that need computing, the best locations are derived — again with ShuffleMapStage as the example:

val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
  // For each partition id, run the best-location algorithm against this
  // stage's RDD (see the fuller comments inside submitMissingTasks above)
  stage match {
    case s: ShuffleMapStage =>
      partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
    case s: ResultStage =>
      partitionsToCompute.map { id =>
        val p = s.partitions(id)
        (id, getPreferredLocs(stage.rdd, p))
      }.toMap
  }

private[spark]
def getPreferredLocs(rdd: RDD[_], partition: Int): Seq[TaskLocation] = {
  getPreferredLocsInternal(rdd, partition, new HashSet)
}
It first calls the getCacheLocs we used earlier to look for persisted copies in memory, on disk, or off-heap. Failing that, preferredLocations checks whether the RDD was checkpointed. Failing that too, it checks which BlockManagers hold a large enough share of this partition's blocks (by default at least 0.2 of the total):

// Dispatches to different placement logic depending on the dependency type
private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
  // If the partition has already been visited, no need to re-visit.
  // This avoids exponential path exploration. SPARK-695
  if (!visited.add((rdd, partition))) {
    // Nil has already been returned for previously visited partitions.
    return Nil
  }
  // If the partition is cached, return the cache locations
  // getCacheLocs was covered earlier. This is not currying -- the trailing
  // (partition) simply indexes into the returned
  // IndexedSeq[Seq[TaskLocation]] (think BlockId[BlockManagerId[TaskLocation]])
  // to get that partition's Seq[TaskLocation]
  val cached = getCacheLocs(rdd)(partition)
  if (cached.nonEmpty) {
    // Persisted task locations exist: return them directly
    return cached
  }
  // If the RDD has some placement preferences (as is the case for input RDDs), get those
  // This effectively filters out the BlockManagers whose share of the
  // computed block sizes meets the configured threshold. Note the returned
  // address format differs depending on whether the RDD was checkpointed
  val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
  if (rddPrefs.nonEmpty) {
    // The returned objects wrap the task addresses
    return rddPrefs.map(TaskLocation(_))
  }
  // If the RDD has narrow dependencies, pick the first partition of the first narrow dependency
  // that has any placement preferences. Ideally we would choose based on transfer sizes,
  // but this will do for now.
  // For narrow dependencies, recurse through all the parents' partitions
  // until any of them yields a preferred location
  rdd.dependencies.foreach {
    case n: NarrowDependency[_] =>
      // Walk the parent RDD's partitions
      for (inPart <- n.getParents(partition)) {
        // Recurse into getPreferredLocsInternal
        val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
        if (locs != Nil) {
          // Stop at the first preferred location found
          return locs
        }
      }
    case _ =>
  }
  Nil
}
getCacheLocs was covered above (look back if you've forgotten), so let's pick up from preferredLocations:

/**
 * Get the preferred locations of a partition, taking into account whether the
 * RDD is checkpointed.
 */
final def preferredLocations(split: Partition): Seq[String] = {
  // First try to get the locations from the checkpointed RDD; if there is
  // none, fall back to getPreferredLocations -- which is why the returned
  // address formats can differ
  checkpointRDD.map(_.getPreferredLocations(split)).getOrElse {
    getPreferredLocations(split)
  }
}
Taking ShuffledRDD as the example, it first grabs the driver-side MapOutputTrackerMaster (which holds the shuffle-phase metadata of every BlockManager in the cluster):

override protected def getPreferredLocations(partition: Partition): Seq[String] = {
  // First grab the MapOutputTrackerMaster on the driver
  val tracker = SparkEnv.get.mapOutputTracker.asInstanceOf[MapOutputTrackerMaster]
  // dependencies (covered earlier) returns this RDD's dependencies; take the
  // head and cast it to ShuffleDependency (this is a ShuffledRDD, so the cast
  // is belt-and-braces)
  val dep = dependencies.head.asInstanceOf[ShuffleDependency[K, V, C]]
  tracker.getPreferredLocationsForShuffle(dep, partition.index)
}

def getPreferredLocationsForShuffle(dep: ShuffleDependency[_, _, _], partitionId: Int)
    : Seq[String] = {
  // shuffleLocalityEnabled defaults to true
  // SHUFFLE_PREF_MAP_THRESHOLD defaults to 1000
  // SHUFFLE_PREF_REDUCE_THRESHOLD defaults to 1000
  // REDUCER_PREF_LOCS_FRACTION = 0.2
  if (shuffleLocalityEnabled && dep.rdd.partitions.length < SHUFFLE_PREF_MAP_THRESHOLD &&
      dep.partitioner.numPartitions < SHUFFLE_PREF_REDUCE_THRESHOLD) {
    // Filter out the BlockManagerIds that satisfy the threshold.
    // Note: exactly one BlockManager lives in every executor and in the
    // driver, handling data transfer, receipt and persistence -- covered in a
    // later chapter
    val blockManagerIds = getLocationsWithLargestOutputs(dep.shuffleId, partitionId,
      dep.partitioner.numPartitions, REDUCER_PREF_LOCS_FRACTION)
    if (blockManagerIds.nonEmpty) {
      // Take the host addresses of those BlockManagers
      blockManagerIds.get.map(_.host)
    } else {
      Nil
    }
  } else {
    Nil
  }
}
This fetches, from the MapOutputTrackerMaster, the MapStatus of every partition of the shuffleId's ShuffleStatus. MapStatus comes in two flavors: CompressedMapStatus (the default) and HighlyCompressedMapStatus. The block sizes extracted from each BlockManager's MapStatuses then drive the filtering:

def getLocationsWithLargestOutputs(
    shuffleId: Int,
    reducerId: Int,
    numReducers: Int,
    fractionThreshold: Double)
  : Option[Array[BlockManagerId]] = {
  // The ShuffleStatus for this shuffleId
  val shuffleStatus = shuffleStatuses.get(shuffleId).orNull
  if (shuffleStatus != null) {
    // withMapStatuses mainly wraps access to this shuffle's mapStatuses array
    // in synchronized for thread safety.
    // Recall: when a ShuffleMapStage is created it registers itself with the
    // driver's MapOutputTrackerMaster, which creates the corresponding
    // ShuffleStatus with one MapStatus slot per partition. MapStatuses are
    // normally produced when SortShuffleManager creates a SortShuffleWriter,
    // i.e. while ShuffleMapTask.runTask runs. There are two kinds:
    //   1. CompressedMapStatus  2. HighlyCompressedMapStatus
    shuffleStatus.withMapStatuses { statuses =>
      if (statuses.nonEmpty) {
        // HashMap to add up sizes of all blocks at the same location
        val locs = new HashMap[BlockManagerId, Long]
        var totalOutputSize = 0L
        var mapIdx = 0
        // Walk every mapStatus
        while (mapIdx < statuses.length) {
          // Starting from the first mapStatus
          val status = statuses(mapIdx)
          // status may be null here if we are called between registerShuffle, which creates an
          // array with null entries for each output, and registerMapOutputs, which populates it
          // with valid status entries. This is possible if one thread schedules a job which
          // depends on an RDD which is currently being computed by another thread.
          if (status != null) {
            // Extract (and decompress -- sizes are stored compressed by
            // default) the size of the block this reducer would fetch
            val blockSize = status.getSizeForBlock(reducerId)
            if (blockSize > 0) {
              // Accumulate the decompressed size under the output's BlockManagerId
              locs(status.location) = locs.getOrElse(status.location, 0L) + blockSize
              // and into the running total
              totalOutputSize += blockSize
            }
          }
          // Move on to the next mapStatus
          mapIdx = mapIdx + 1
        }
        val topLocs = locs.filter { case (loc, size) =>
          // Keep a location when (its blocks for this reducer) / (total size)
          // >= 0.2 (the default). A BlockManager holding at least a fifth of
          // the reducer's input makes its host a good place to run the reduce
          // task: most of the data can be read locally rather than fetched
          // over the network
          size.toDouble / totalOutputSize >= fractionThreshold
        }
        // Return if we have any locations which satisfy the required threshold
        if (topLocs.nonEmpty) {
          // Return the array of qualifying BlockManagerIds
          return Some(topLocs.keys.toArray)
        }
      }
    }
  }
  None
}
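A quick numeric sketch of that filter (all sizes are made up): suppose this reducer's input totals 100 MB, of which bm-A holds 30 MB, bm-B 25 MB, and bm-C 10 MB (the rest scattered elsewhere):

// made-up byte counts of this reduce partition's input per BlockManager
val totalOutputSize = 100L << 20  // 100 MB in total
val locs = Map("bm-A" -> (30L << 20), "bm-B" -> (25L << 20), "bm-C" -> (10L << 20))
val topLocs = locs.filter { case (_, size) => size.toDouble / totalOutputSize >= 0.2 }
// -> only bm-A (0.30) and bm-B (0.25) survive; bm-C (0.10) is dropped.
// The reduce task for this partition is preferred on bm-A's and bm-B's hosts.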
If some BlockManagers pass the filter, their formatted addresses are returned. Back in getPreferredLocsInternal:

val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
if (rddPrefs.nonEmpty) {
  // The returned objects wrap the task addresses
  return rddPrefs.map(TaskLocation(_))
}

/**
 * Create a TaskLocation from a string returned by getPreferredLocations.
 * These strings have the form executor_[hostname]_[executorid], [hostname], or
 * hdfs_cache_[hostname], depending on whether the location is cached.
 */
// If the RDD was checkpointed, the incoming str may be hdfs_cache_[hostname]
// or executor_[hostname]_[executorid]; otherwise it is just [hostname]
def apply(str: String): TaskLocation = {
  // inMemoryLocationTag = "hdfs_cache_"
  // Strip a leading hdfs_cache_ prefix; if there is none, str is returned unchanged
  val hstr = str.stripPrefix(inMemoryLocationTag)
  // Did the string carry the hdfs-cache prefix?
  if (hstr.equals(str)) {
    // No: check whether it starts with executor_
    if (str.startsWith(executorLocationTag)) {
      // Convert to [hostname]_[executorid]
      val hostAndExecutorId = str.stripPrefix(executorLocationTag)
      // Split into Array[String](hostname, executorid)
      val splits = hostAndExecutorId.split("_", 2)
      require(splits.length == 2, "Illegal executor location format: " + str)
      val Array(host, executorId) = splits
      // The resulting object carries the tag executor_host_executorId
      new ExecutorCacheTaskLocation(host, executorId)
    } else {
      // Reaching here means no checkpoint; the object carries just the host tag
      new HostTaskLocation(str)
    }
  } else {
    // Reaching here means the data was checkpoint-cached on HDFS;
    // the object carries the tag hdfs_cache_host
    new HDFSCacheTaskLocation(hstr)
  }
}
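Following the apply shown above, the three address formats parse like this (the hostnames are made up):

TaskLocation("executor_host1_3") // -> ExecutorCacheTaskLocation("host1", "3")
TaskLocation("hdfs_cache_host2") // -> HDFSCacheTaskLocation("host2")
TaskLocation("host3")            // -> HostTaskLocation("host3")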
Once the tasks' best locations are known, Spark serializes them into closures and broadcasts them out:

// Below, the task is packaged into a closure and distributed via Broadcast
var taskBinary: Broadcast[Array[Byte]] = null
try {
  // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
  // For ResultTask, serialize and broadcast (rdd, func).
  // Either way the task must be serialized and broadcast; what comes back is
  // the task's closure as a byte array
  val taskBinaryBytes: Array[Byte] = stage match {
    case stage: ShuffleMapStage =>
      // Convert to a byte array (a java.nio.ByteBuffer underneath)
      JavaUtils.bufferToArray(
        closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
    case stage: ResultStage =>
      JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
  }

  // broadcast turns the object into a read-only broadcast variable shipped to
  // each node; executors then pull their closure locally. Without broadcast,
  // every task would carry its own copy of the closure, causing heavy IO --
  // the same trick used for large config files or to avoid the shuffle of a
  // join. As an aside, Spark RDDs are always shipped as closures, which load
  // lazily and cannot modify outside variables (use an Accumulator for that)
  taskBinary = sc.broadcast(taskBinaryBytes)
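The same broadcast mechanism is available in user code. A small, hypothetical sketch (it assumes an existing SparkContext `sc`; the lookup table is made up) of shipping a read-only table once per executor instead of once per task:

// the lookup table below is made up
val lookup = Map("US" -> "United States", "CN" -> "China")
val bc = sc.broadcast(lookup) // shipped once per executor, not once per task
val expanded = sc.parallelize(Seq("US", "CN", "US"))
  .map(code => bc.value.getOrElse(code, "unknown")) // read-only access on executors
expanded.collect()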
The packaged taskBinary then goes, together with a series of parameters, into a ShuffleMapTask or ResultTask (covered next chapter):

val tasks: Seq[Task[_]] = try {
  // The task metrics object is also serialized into the closure
  val serializedTaskMetrics = closureSerializer.serialize(stage.latestInfo.taskMetrics).array()
  stage match {
    // A ShuffleMapStage yields ShuffleMapTasks
    case stage: ShuffleMapStage =>
      // pendingPartitions holds the partitions/tasks not finished yet; they
      // are removed as they complete, and the DAGScheduler uses the set to
      // decide whether the stage is done. Start from a clean set
      stage.pendingPartitions.clear()
      // Walk every partition that needs computing
      partitionsToCompute.map { id =>
        // The partition's locations
        val locs = taskIdToLocations(id)
        // The partition of this stage's rdd
        val part = stage.rdd.partitions(id)
        // Mark it as pending
        stage.pendingPartitions += id
        // Build the ShuffleMapTask; runTask is later invoked through it
        // (details next chapter). Note: tasks come in two kinds,
        // ShuffleMapTask and ResultTask
        new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
          taskBinary, part, locs, properties, serializedTaskMetrics, Option(jobId),
          Option(sc.applicationId), sc.applicationAttemptId)
      }

    // A ResultStage yields ResultTasks
    case stage: ResultStage =>
      partitionsToCompute.map { id =>
        val locs = taskIdToLocations(id)
        val p: Int = stage.partitions(id)
        val part = stage.rdd.partitions(p)
        new ResultTask(stage.id, stage.latestInfo.attemptId,
          taskBinary, part, locs, id, properties, serializedTaskMetrics,
          Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
      }
  }
Finally, the tasks are wrapped in a TaskSet and handed to the taskScheduler for submission to the executors (next chapter):

if (tasks.size > 0) {
  logInfo(s"Submitting ${tasks.size} missing tasks from $stage (${stage.rdd}) (first 15 " +
    s"tasks are for partitions ${tasks.take(15).map(_.partitionId)})")
  // Submit the tasks. This calls TaskSchedulerImpl, the implementation of the
  // taskScheduler trait, which submits the tasks wrapped in a TaskSet
  // (details next chapter)
  taskScheduler.submitTasks(new TaskSet(
    tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
} else {
  // Because we posted SparkListenerStageSubmitted earlier, we should mark
  // the stage as completed here in case there are no tasks to run.
  // We may end up with no tasks at all, but since SparkListenerStageSubmitted
  // was already posted, the stage has to be marked finished here
  markStageAsFinished(stage, None)
When this stage completes, any stages still waiting to be submitted are submitted next:

  // Once the parent stage finishes, its waiting child stages are submitted in turn
  submitWaitingChildStages(stage)
  }
}

/**
 * Check for waiting stages which are now eligible for resubmission.
 * Submits stages that depend on the given parent stage. Called when the parent stage completes
 * successfully.
 */
private def submitWaitingChildStages(parent: Stage) {
  logTrace(s"Checking if any dependencies of $parent are now runnable")
  logTrace("running: " + runningStages)
  logTrace("waiting: " + waitingStages)
  logTrace("failed: " + failedStages)
  // waitingStages is a HashSet of stages waiting to be submitted, populated
  // earlier by submitStage; keep only the children of the stage that just finished
  val childStages = waitingStages.filter(_.parents.contains(parent)).toArray
  waitingStages --= childStages
  for (stage <- childStages.sortBy(_.firstJobId)) {
    // Resubmit them, earliest job first
    submitStage(stage)
  }
}