Spark Shuffle Series, Part 1: The Relationship Between Spark Shuffle and Task Scheduling
Spark splits a job into Stages according to whether the dependency between two RDDs is a shuffle dependency. For clarity, the Stage that runs first is labeled Stage1 and the Stage that runs afterwards is labeled Stage2. A shuffle consists of two steps.
The Map operation and the Reduce operation can be pictured as follows:
1. The Map operation. It runs at the end of Stage1; its job is to write the data of one Stage1 partition into a shuffle file, which is stored on the local disk of the node that executed the Map operation.
2. The Reduce operation. It runs at the beginning of Stage2; its job is to read the shuffle files produced by the Map operation and build one Stage2 partition from them.
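For a concrete picture of where this boundary falls in user code, here is a minimal sketch (the input path and app name are made up) of a job in which reduceByKey introduces the shuffle, so everything before it becomes Stage1 and everything after it becomes Stage2:

import org.apache.spark.{SparkConf, SparkContext}

object ShuffleStageExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("shuffle-stage-example").setMaster("local[2]"))
    // Stage1: narrow dependencies only (textFile -> flatMap -> map)
    val pairs = sc.textFile("input.txt").flatMap(_.split(" ")).map(word => (word, 1))
    // reduceByKey creates a ShuffleDependency, i.e. a stage boundary:
    //   Stage1's tasks end by writing their partition to a local shuffle file (the Map step),
    //   Stage2's tasks start by reading those files to build their own partition (the Reduce step).
    val counts = pairs.reduceByKey(_ + _)
    counts.collect().foreach(println)
    sc.stop()
  }
}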
The Map operation and the Reduce operation are connected by the ShuffledRDD class and the MapOutputTrackerMaster class.
Through the ShuffledRDD we can find the dependency between Stage1 and Stage2; this dependency carries the objects that the two stages share across the shuffle boundary: the partitioning class Partitioner, the data-combining class Aggregator, the serialization class Serializer, and so on.
MapOutputTrackerMaster records how Stage1 produced its shuffle files. Stage2 reads the shuffle files on disk according to the descriptions kept in MapOutputTrackerMaster and produces one Stage2 partition from them.
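To make these two links concrete, the sketch below (a local toy job, not code taken from this article's flow) inspects the ShuffleDependency that sits between the two stages; it is this dependency object that carries the Partitioner, Aggregator and Serializer, while the shuffle file locations themselves are registered under dep.shuffleId in MapOutputTrackerMaster. The snippet is meant to be run as the body of a small driver program:

import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("dep-inspect").setMaster("local[2]"))
val shuffled = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3))).reduceByKey(_ + _)

// The ShuffledRDD produced by reduceByKey has exactly one dependency: a ShuffleDependency.
val dep = shuffled.dependencies.head.asInstanceOf[ShuffleDependency[String, Int, Int]]
println(dep.partitioner)   // decides which Stage2 partition each key is shuffled to
println(dep.aggregator)    // combines values for the same key across the shuffle
println(dep.serializer)    // used when writing and reading the shuffle files
println(dep.shuffleId)     // the id under which the map outputs are registered in MapOutputTrackerMaster
sc.stop()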
After Stage1 (the shuffle map tasks) finishes, the Spark Driver schedules and launches Stage2 through the sequence of calls illustrated in the sequence diagram below; this call sequence is central to understanding how Spark schedules shuffle tasks.
After a ShuffleMapTask finishes, it hands the information describing the shuffle file it produced back to the TaskRunner. TaskRunner calls CoarseGrainedExecutorBackend.statusUpdate to send this information to CoarseGrainedSchedulerBackend; when CoarseGrainedSchedulerBackend receives the message, it calls TaskSchedulerImpl.statusUpdate to process the task state change.
TaskSchedulerImpl.statusUpdate does two main things:
1. It calls TaskSetManager.removeRunningTask to remove the successfully completed task from TaskSetManager.runningTasksSet.
2. It calls TaskResultGetter.enqueueSuccessfulTask to deserialize the data returned by the ShuffleMapTask, reconstructing the returned object, which is then handed to TaskSchedulerImpl. By default, a result larger than 1 GB is dropped and the driver is told that it was dropped; a result larger than roughly 10 MB minus 200 KB is serialized and stored in the BlockManager, and only a reference is sent to the driver; otherwise the serialized result is sent to the driver directly. TaskResultGetter.enqueueSuccessfulTask therefore handles each of these cases separately, and once it is done it calls TaskSchedulerImpl.handleSuccessfulTask.
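The executor decides, before calling statusUpdate, which of these three forms the result takes. A simplified sketch of that decision (the helper name and its String return values are illustrative only, not Spark's actual code):

import java.nio.ByteBuffer

def packageTaskResult(serializedDirectResult: ByteBuffer,
                      maxResultSize: Long,       // spark.driver.maxResultSize, 1g by default
                      maxDirectResultSize: Long  // roughly the 10 MB RPC frame size minus ~200 KB reserved
                     ): String = {
  val resultSize = serializedDirectResult.limit()
  if (maxResultSize > 0 && resultSize > maxResultSize) {
    // Too big: only an IndirectTaskResult carrying the size is sent, and the driver treats the result as dropped.
    "dropped (IndirectTaskResult with size only)"
  } else if (resultSize > maxDirectResultSize) {
    // Big but acceptable: the serialized bytes go into the BlockManager, and an IndirectTaskResult with the
    // block id is sent; the driver later fetches the block in enqueueSuccessfulTask.
    "stored in BlockManager (IndirectTaskResult with blockId)"
  } else {
    // Small: the serialized DirectTaskResult is sent straight back to the driver.
    "sent directly (DirectTaskResult)"
  }
}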
The driver-side code that then processes the result, TaskResultGetter.enqueueSuccessfulTask, is:
def enqueueSuccessfulTask(
    taskSetManager: TaskSetManager, tid: Long, serializedData: ByteBuffer) {
  getTaskResultExecutor.execute(new Runnable {
    override def run(): Unit = Utils.logUncaughtExceptions {
      try {
        val (result, size) = serializer.get().deserialize[TaskResult[_]](serializedData) match {
          /*
           * Handling of a directly returned result; in this case the result was not stored in the BlockManager
           */
          case directResult: DirectTaskResult[_] =>
            if (!taskSetManager.canFetchMoreResults(serializedData.limit())) {
              return
            }
            // deserialize "value" without holding any lock so that it won't block other threads.
            // We should call it here, so that when it's called again in
            // "TaskSetManager.handleSuccessfulTask", it does not need to deserialize the value.
            directResult.value()
            (directResult, serializedData.limit())
          /*
           * Handling of an indirectly returned result; in this case the result is stored in the BlockManager.
           * After the result has been read from the BlockManager, the block must be removed from the BlockManager.
           */
          case IndirectTaskResult(blockId, size) =>
            if (!taskSetManager.canFetchMoreResults(size)) {
              // dropped by executor if size is larger than maxResultSize
              sparkEnv.blockManager.master.removeBlock(blockId)
              return
            }
            logDebug("Fetching indirect task result for TID %s".format(tid))
            scheduler.handleTaskGettingResult(taskSetManager, tid)
            val serializedTaskResult = sparkEnv.blockManager.getRemoteBytes(blockId)
            if (!serializedTaskResult.isDefined) {
              /* We won't be able to get the task result if the machine that ran the task failed
               * between when the task ended and when we tried to fetch the result, or if the
               * block manager had to flush the result. */
              scheduler.handleFailedTask(
                taskSetManager, tid, TaskState.FINISHED, TaskResultLost)
              return
            }
            val deserializedResult = serializer.get().deserialize[DirectTaskResult[_]](
              serializedTaskResult.get)
            sparkEnv.blockManager.master.removeBlock(blockId)
            (deserializedResult, size)
        }

        result.metrics.setResultSize(size)
        /*
         * The task result has been processed; call TaskSchedulerImpl.handleSuccessfulTask for the next step
         */
        scheduler.handleSuccessfulTask(taskSetManager, tid, result)
      } catch {
        case cnf: ClassNotFoundException =>
          val loader = Thread.currentThread.getContextClassLoader
          taskSetManager.abort("ClassNotFound with classloader: " + loader)
        // Matching NonFatal so we don't catch the ControlThrowable from the "return" above.
        case NonFatal(ex) =>
          logError("Exception while getting task result", ex)
          taskSetManager.abort("Exception while getting task result: %s".format(ex))
      }
    }
  })
}
TaskSchedulerImpl.handleSuccessfulTask mainly calls TaskSetManager.handleSuccessfulTask, entering the scheduling logic inside the task set.
TaskSetManager.handleSuccessfulTask calls DAGScheduler.taskEnded, which posts a CompletionEvent message to DAGSchedulerEventProcessLoop; when the message is received, DAGScheduler.handleTaskCompletion is called to finish the bookkeeping for the completed task.
DAGScheduler.handleTaskCompletion first calls ShuffleMapStage.addOutputLoc to store the object deserialized by TaskResultGetter.enqueueSuccessfulTask into the ShuffleMapStage.outputLocs array; this object is the description of one shuffle file. Once every task of the stage has completed, the DAGScheduler calls MapOutputTrackerMaster.registerMapOutputs to record, for each Stage1 partition, the corresponding shuffle file information in MapOutputTrackerMaster. After Stage1 finishes successfully, all Stages in the waiting queue that no longer have missing parents are found, and DAGScheduler.submitMissingTasks turns each of them into tasks on the pending queues, requests resources and runs them. This includes Stage2, which begins with the shuffle reduce operation. The code is as follows:
private[scheduler] def handleTaskCompletion(event: CompletionEvent) {
  val task = event.task
  val stageId = task.stageId
  val taskType = Utils.getFormattedClassName(task)

  outputCommitCoordinator.taskCompleted(stageId, task.partitionId,
    event.taskInfo.attempt, event.reason)

  // The success case is dealt with separately below, since we need to compute accumulator
  // updates before posting.
  if (event.reason != Success) {
    val attemptId = stageIdToStage.get(task.stageId).map(_.latestInfo.attemptId).getOrElse(-1)
    listenerBus.post(SparkListenerTaskEnd(stageId, attemptId, taskType, event.reason,
      event.taskInfo, event.taskMetrics))
  }

  if (!stageIdToStage.contains(task.stageId)) {
    // Skip all the actions if the stage has been cancelled.
    return
  }

  val stage = stageIdToStage(task.stageId)
  event.reason match {
    case Success =>
      listenerBus.post(SparkListenerTaskEnd(stageId, stage.latestInfo.attemptId, taskType,
        event.reason, event.taskInfo, event.taskMetrics))
      stage.pendingTasks -= task
      task match {
        case rt: ResultTask[_, _] =>
          // Cast to ResultStage here because it's part of the ResultTask
          // TODO Refactor this out to a function that accepts a ResultStage
          val resultStage = stage.asInstanceOf[ResultStage]
          resultStage.resultOfJob match {
            case Some(job) =>
              if (!job.finished(rt.outputId)) {
                updateAccumulators(event)
                job.finished(rt.outputId) = true
                job.numFinished += 1
                // If the whole job has finished, remove it
                if (job.numFinished == job.numPartitions) {
                  markStageAsFinished(resultStage)
                  cleanupStateForJobAndIndependentStages(job)
                  listenerBus.post(
                    SparkListenerJobEnd(job.jobId, clock.getTimeMillis(), JobSucceeded))
                }

                // taskSucceeded runs some user code that might throw an exception. Make sure
                // we are resilient against that.
                try {
                  job.listener.taskSucceeded(rt.outputId, event.result)
                } catch {
                  case e: Exception =>
                    // TODO: Perhaps we want to mark the resultStage as failed?
                    job.listener.jobFailed(new SparkDriverExecutionException(e))
                }
              }
            case None =>
              logInfo("Ignoring result from " + rt + " because its job has finished")
          }

        case smt: ShuffleMapTask =>
          val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
          updateAccumulators(event)
          val status = event.result.asInstanceOf[MapStatus]
          val execId = status.location.executorId
          logDebug("ShuffleMapTask finished on " + execId)
          if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
            logInfo("Ignoring possibly bogus ShuffleMapTask completion from " + execId)
          } else {
            /*
             * Record the result of one shuffle map partition in stage.outputLocs.
             * The shuffle map phase applies the Partitioner to the keys of one partition and shuffles the data
             * into the N partitions of the reduce phase; this data is written to a single file on local disk.
             * The return value of the shuffle map phase is the data length of each reduce partition.
             * Because the file stores the data in order of partition index starting from 0, knowing the length
             * of every partition also gives the start offset of each partition within the file.
             */
            shuffleStage.addOutputLoc(smt.partitionId, status)
          }

          /*
           * All tasks of this stage have finished; there are no pending tasks left
           */
          if (runningStages.contains(shuffleStage) && shuffleStage.pendingTasks.isEmpty) {
            markStageAsFinished(shuffleStage)
            logInfo("looking for newly runnable stages")
            logInfo("running: " + runningStages)
            logInfo("waiting: " + waitingStages)
            logInfo("failed: " + failedStages)

            // We supply true to increment the epoch number here in case this is a
            // recomputation of the map outputs. In that case, some nodes may have cached
            // locations with holes (from when we detected the error) and will need the
            // epoch incremented to refetch them.
            // TODO: Only increment the epoch number if this is not the first time
            // we registered these map outputs.
            /*
             * Once the shuffle stage has no pending tasks left, register the results of all of this stage's
             * ShuffleMapTasks with MapOutputTrackerMaster
             */
            mapOutputTracker.registerMapOutputs(
              shuffleStage.shuffleDep.shuffleId,
              shuffleStage.outputLocs.map(list => if (list.isEmpty) null else list.head).toArray,
              changeEpoch = true)

            clearCacheLocs()
            if (shuffleStage.outputLocs.contains(Nil)) {
              // Some tasks had failed; let's resubmit this shuffleStage
              // TODO: Lower-level scheduler should also deal with this
              logInfo("Resubmitting " + shuffleStage + " (" + shuffleStage.name +
                ") because some of its tasks had failed: " +
                shuffleStage.outputLocs.zipWithIndex.filter(_._1.isEmpty)
                  .map(_._2).mkString(", "))
              submitStage(shuffleStage)
            } else {
              val newlyRunnable = new ArrayBuffer[Stage]
              for (shuffleStage <- waitingStages) {
                logInfo("Missing parents for " + shuffleStage + ": " +
                  getMissingParentStages(shuffleStage))
              }
              /*
               * After a Stage finishes successfully, find all Stages in the waiting queue that have no missing
               * parents, and call DAGScheduler.submitMissingTasks to turn each of them into tasks on the pending
               * queues, then request resources and run the tasks. This includes the Stage that starts with the
               * shuffle reduce operation.
               */
              for (shuffleStage <- waitingStages if getMissingParentStages(shuffleStage).isEmpty) {
                newlyRunnable += shuffleStage
              }
              waitingStages --= newlyRunnable
              runningStages ++= newlyRunnable
              for {
                shuffleStage <- newlyRunnable.sortBy(_.id)
                jobId <- activeJobForStage(shuffleStage)
              } {
                logInfo("Submitting " + shuffleStage + " (" + shuffleStage.rdd + "), which is now runnable")
                submitMissingTasks(shuffleStage, jobId)
              }
            }
          }
      }

    case Resubmitted =>
      logInfo("Resubmitted " + task + ", so marking it as still running")
      stage.pendingTasks += task

    case FetchFailed(bmAddress, shuffleId, mapId, reduceId, failureMessage) =>
      val failedStage = stageIdToStage(task.stageId)
      val mapStage = shuffleToMapStage(shuffleId)

      // It is likely that we receive multiple FetchFailed for a single stage (because we have
      // multiple tasks running concurrently on different executors). In that case, it is possible
      // the fetch failure has already been handled by the scheduler.
      if (runningStages.contains(failedStage)) {
        logInfo(s"Marking $failedStage (${failedStage.name}) as failed " +
          s"due to a fetch failure from $mapStage (${mapStage.name})")
        markStageAsFinished(failedStage, Some(failureMessage))
      }

      if (disallowStageRetryForTest) {
        abortStage(failedStage, "Fetch failure will not retry stage due to testing config")
      } else if (failedStages.isEmpty) {
        // Don't schedule an event to resubmit failed stages if failed isn't empty, because
        // in that case the event will already have been scheduled.
        // TODO: Cancel running tasks in the stage
        logInfo(s"Resubmitting $mapStage (${mapStage.name}) and " +
          s"$failedStage (${failedStage.name}) due to fetch failure")
        messageScheduler.schedule(new Runnable {
          override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages)
        }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS)
      }
      failedStages += failedStage
      failedStages += mapStage
      // Mark the map whose fetch failed as broken in the map stage
      if (mapId != -1) {
        mapStage.removeOutputLoc(mapId, bmAddress)
        mapOutputTracker.unregisterMapOutput(shuffleId, mapId, bmAddress)
      }

      // TODO: mark the executor as failed only if there were lots of fetch failures on it
      if (bmAddress != null) {
        handleExecutorLost(bmAddress.executorId, fetchFailed = true, Some(task.epoch))
      }

    case commitDenied: TaskCommitDenied =>
      // Do nothing here, left up to the TaskScheduler to decide how to handle denied commits

    case ExceptionFailure(className, description, stackTrace, fullStackTrace, metrics) =>
      // Do nothing here, left up to the TaskScheduler to decide how to handle user failures

    case TaskResultLost =>
      // Do nothing here; the TaskScheduler handles these failures and resubmits the task.

    case other =>
      // Unrecognized failure - also do nothing. If the task fails repeatedly, the TaskScheduler
      // will abort the job.
  }
  submitWaitingStages()
}
The shuffle map results kept in MapOutputTrackerMaster form a two-dimensional structure: the first dimension is indexed by the partitions of Stage1 (the Stage where the map operation runs), the second dimension by the partitions of Stage2 (the Stage where the reduce operation runs).
Assume Stage1's map side has 3 partitions and Stage2's reduce side also has 3 partitions; the shuffle map information kept in MapOutputTrackerMaster can then be pictured as follows:
In that picture, the shuffle spreads each Stage1 partition across the 3 Stage2 partitions; "Map2 reduce1 length" denotes the amount of data shuffled from Stage1's second partition into Stage2's first partition.
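A rough sketch of that two-dimensional structure for the 3x3 example (the byte counts are made up, and in the real code each row is held as a MapStatus object rather than a plain array):

// sizes(map partition)(reduce partition): bytes Stage1's map partition wrote for Stage2's reduce partition
val sizes: Array[Array[Long]] = Array(
  Array(120L, 10L,  50L),   // Map1 -> reduce1, reduce2, reduce3
  Array( 30L, 70L,  20L),   // Map2 -> reduce1, reduce2, reduce3
  Array( 40L, 20L, 200L)    // Map3 -> reduce1, reduce2, reduce3
)
// Row i is everything Stage1's partition i wrote; column j is everything Stage2's partition j will read.
val bytesReadByReduce2 = sizes.map(row => row(1)).sum   // 10 + 70 + 20 = 100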
Inside DAGScheduler.submitMissingTasks, DAGScheduler.getPreferredLocs is called to compute the location of each task of the Stage; getPreferredLocs in turn delegates to DAGScheduler.getPreferredLocsInternal.
DAGScheduler.getPreferredLocsInternal calls MapOutputTrackerMaster.getLocationsWithLargestOutputs to determine the task's location.
Task locations are represented by the TaskLocation data structure, whose source is:
private[spark] object TaskLocation {
  // We identify hosts on which the block is cached with this prefix. Because this prefix contains
  // underscores, which are not legal characters in hostnames, there should be no potential for
  // confusion. See RFC 952 and RFC 1123 for information about the format of hostnames.
  val inMemoryLocationTag = "hdfs_cache_"

  def apply(host: String, executorId: String): TaskLocation = {
    new ExecutorCacheTaskLocation(host, executorId)
  }

  /**
   * Create a TaskLocation from a string returned by getPreferredLocations.
   * These strings have the form [hostname] or hdfs_cache_[hostname], depending on whether the
   * location is cached.
   */
  def apply(str: String): TaskLocation = {
    val hstr = str.stripPrefix(inMemoryLocationTag)
    if (hstr.equals(str)) {
      new HostTaskLocation(str)
    } else {
      new HostTaskLocation(hstr)
    }
  }
}

This data structure identifies a task's location by the node's host (IP address) and the executor id.
The rule is: if the data shuffled from a given Stage1 TaskLocation accounts for more than REDUCER_PREF_LOCS_FRACTION of all the data of a Stage2 partition, that TaskLocation becomes the TaskLocation of the corresponding Stage2 task.
Taking "Map2 reduce2 length" in the figure above as an example: if (Map2 reduce2 length) / (Map1 reduce2 length + Map2 reduce2 length + Map3 reduce2 length) > REDUCER_PREF_LOCS_FRACTION, then a TaskLocation is created from the host and executor id of the node where Map2 ran in Stage1.
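Continuing with the made-up sizes from the sketch above (reduce2 receives 10, 70 and 20 bytes from Map1, Map2 and Map3) and the default REDUCER_PREF_LOCS_FRACTION of 0.2:

val bytesForReduce2 = Map("Map1" -> 10L, "Map2" -> 70L, "Map3" -> 20L)
val total = bytesForReduce2.values.sum.toDouble                       // 100 bytes in total
val REDUCER_PREF_LOCS_FRACTION = 0.2                                  // Spark's default threshold
val preferredMaps = bytesForReduce2.filter { case (_, bytes) => bytes / total > REDUCER_PREF_LOCS_FRACTION }
// Only Map2 qualifies (0.7 > 0.2); Map1 (0.1) and Map3 (0.2, not strictly greater) do not, so the task for
// Stage2's partition 2 gets a preferred location pointing at the host and executor where Map2 ran.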
From the definition of TaskLocation we can see that the TaskLocation actually created in this case is an ExecutorCacheTaskLocation.
The specific code is:
private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
  // If the partition has already been visited, no need to re-visit.
  // This avoids exponential path exploration. SPARK-695
  if (!visited.add((rdd, partition))) {
    // Nil has already been returned for previously visited partitions.
    return Nil
  }
  // If the partition is cached, return the cache locations
  val cached = getCacheLocs(rdd)(partition)
  if (cached.nonEmpty) {
    return cached
  }
  // If the RDD has some placement preferences (as is the case for input RDDs), get those
  val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
  if (rddPrefs.nonEmpty) {
    return rddPrefs.map(TaskLocation(_))
  }

  rdd.dependencies.foreach {
    case n: NarrowDependency[_] =>
      // If the RDD has narrow dependencies, pick the first partition of the first narrow dep
      // that has any placement preferences. Ideally we would choose based on transfer sizes,
      // but this will do for now.
      for (inPart <- n.getParents(partition)) {
        val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
        if (locs != Nil) {
          return locs
        }
      }
    case s: ShuffleDependency[_, _, _] =>
      // For shuffle dependencies, pick locations which have at least REDUCER_PREF_LOCS_FRACTION
      // of data as preferred locations
      if (shuffleLocalityEnabled &&
          rdd.partitions.size < SHUFFLE_PREF_REDUCE_THRESHOLD &&
          s.rdd.partitions.size < SHUFFLE_PREF_MAP_THRESHOLD) {
        // Get the preferred map output locations for this reducer
        /*
         * Use the results returned by Stage1's shuffle map tasks to decide the locality of Stage2's shuffle
         * reduce tasks. If the data shuffled from one node's map partition accounts for more than
         * REDUCER_PREF_LOCS_FRACTION (default 0.2) of a Stage2 partition's data, that node becomes a launch
         * node for the Stage2 task.
         */
        val topLocsForReducer = mapOutputTracker.getLocationsWithLargestOutputs(s.shuffleId,
          partition, rdd.partitions.size, REDUCER_PREF_LOCS_FRACTION)
        if (topLocsForReducer.nonEmpty) {
          return topLocsForReducer.get.map(loc => TaskLocation(loc.host, loc.executorId))
        }
      }
    case _ =>
  }
  Nil
}
DAGScheduler.submitMissingTasks then creates ShuffleMapTasks or ResultTasks from these TaskLocations. If the shuffled data does not satisfy the condition for creating an ExecutorCacheTaskLocation, the locs parameter passed when the ShuffleMapTask or ResultTask is created is Nil.
After the tasks of a Stage have been created, TaskSchedulerImpl.submitTasks is called to create a TaskSetManager; while the TaskSetManager is being built, each ShuffleMapTask or ResultTask is added to different pending HashMaps according to its TaskLocality. The code is as follows:
private def addPendingTask(index: Int, readding: Boolean = false) {
  // Utility method that adds `index` to a list only if readding=false or it's not already there
  def addTo(list: ArrayBuffer[Int]) {
    if (!readding || !list.contains(index)) {
      list += index
    }
  }

  for (loc <- tasks(index).preferredLocations) { // preferredLocations returns the partition's host and executor id
    loc match {
      case e: ExecutorCacheTaskLocation =>
        /*
         * If the locs passed in when the ShuffleMapTask or ResultTask was created are of type
         * ExecutorCacheTaskLocation, add the task index to pendingTasksForExecutor.
         * The key of the pendingTasksForExecutor HashMap is an executor id; the value is the indexes of the
         * tasks bound to that executor (there may be more than one).
         */
        addTo(pendingTasksForExecutor.getOrElseUpdate(e.executorId, new ArrayBuffer))
      case e: HDFSCacheTaskLocation => {
        val exe = sched.getExecutorsAliveOnHost(loc.host)
        exe match {
          case Some(set) => {
            for (e <- set) {
              addTo(pendingTasksForExecutor.getOrElseUpdate(e, new ArrayBuffer))
            }
            logInfo(s"Pending task $index has a cached location at ${e.host} " +
              ", where there are executors " + set.mkString(","))
          }
          case None => logDebug(s"Pending task $index has a cached location at ${e.host} " +
            ", but there are no executors alive there.")
        }
      }
      case _ => Unit
    }
    // With DirectDStream, loc.host does not belong to the Spark cluster or the HDFS cluster, so such tasks
    // end up only in this HashMap
    addTo(pendingTasksForHost.getOrElseUpdate(loc.host, new ArrayBuffer))
    for (rack <- sched.getRackForHost(loc.host)) {
      addTo(pendingTasksForRack.getOrElseUpdate(rack, new ArrayBuffer))
    }
  }

  /*
   * If the locs passed in when the ShuffleMapTask or ResultTask was created are Nil, add the task index to
   * pendingTasksWithNoPrefs. pendingTasksWithNoPrefs is an ArrayBuffer whose elements are task indexes.
   */
  if (tasks(index).preferredLocations == Nil) {
    addTo(pendingTasksWithNoPrefs)
  }

  if (!readding) {
    /*
     * Every task is added to allPendingTasks, including the tasks of the DirectDStream case
     */
    allPendingTasks += index  // No point scanning this whole list to find the old task there
  }
}
From the code above, a ShuffleMapTask or ResultTask whose TaskLocation type is ExecutorCacheTaskLocation is added to pendingTasksForExecutor, pendingTasksForHost and allPendingTasks, whereas a task created with locs == Nil is added to pendingTasksWithNoPrefs and allPendingTasks.
One point to be careful about here: although the same task is added to several pending queues, it will not be scheduled more than once; the reason is explained below when task scheduling is discussed.
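For reference, a sketch of the pending-task queues involved (the field names follow TaskSetManager; the values are indexes into TaskSetManager.tasks, which is why one task may legitimately sit in several queues at once):

import scala.collection.mutable.{ArrayBuffer, HashMap}

val pendingTasksForExecutor = new HashMap[String, ArrayBuffer[Int]]  // key: executor id
val pendingTasksForHost     = new HashMap[String, ArrayBuffer[Int]]  // key: host name
val pendingTasksForRack     = new HashMap[String, ArrayBuffer[Int]]  // key: rack name
val pendingTasksWithNoPrefs = new ArrayBuffer[Int]                   // tasks created with locs == Nil
val allPendingTasks         = new ArrayBuffer[Int]                   // every task index ends up here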
After TaskSchedulerImpl.submitTasks has created the TaskSetManager, CoarseGrainedSchedulerBackend.reviveOffers is called to request resources for executing Stage2.
CoarseGrainedSchedulerBackend.reviveOffers eventually calls TaskSchedulerImpl.resourceOffers to allocate execution resources. The code is as follows:
def resourceOffers(offers: Seq[WorkerOffer]): Seq[Seq[TaskDescription]] = synchronized {
  // Mark each slave as alive and remember its hostname
  // Also track if new executor is added
  var newExecAvail = false
  for (o <- offers) {
    executorIdToHost(o.executorId) = o.host
    activeExecutorIds += o.executorId
    if (!executorsByHost.contains(o.host)) {
      executorsByHost(o.host) = new HashSet[String]()
      executorAdded(o.executorId, o.host)
      newExecAvail = true
    }
    for (rack <- getRackForHost(o.host)) {
      hostsByRack.getOrElseUpdate(rack, new HashSet[String]()) += o.host
    }
  }

  // Randomly shuffle offers to avoid always placing tasks on the same set of workers.
  val shuffledOffers = Random.shuffle(offers)
  // Build a list of tasks to assign to each worker.
  // tasks is a sequence; each element is an ArrayBuffer[TaskDescription] sized to the number of cores of the
  // corresponding executor
  val tasks = shuffledOffers.map(o => new ArrayBuffer[TaskDescription](o.cores))
  // availableCpus holds, per offer, the number of CPUs available on that executor
  val availableCpus = shuffledOffers.map(o => o.cores).toArray
  // Get the TaskSets from rootPool; the configured scheduling algorithm determines their order
  val sortedTaskSets = rootPool.getSortedTaskSetQueue
  for (taskSet <- sortedTaskSets) {
    logDebug("parentName: %s, name: %s, runningTasks: %s".format(
      taskSet.parent.name, taskSet.name, taskSet.runningTasks))
    if (newExecAvail) {
      taskSet.executorAdded() // a new executor was added; recompute the tasks' locality levels
    }
  }

  // Take each TaskSet in our scheduling order, and then offer it each node in increasing order
  // of locality levels so that it gets a chance to launch local tasks on all of them.
  // NOTE: the preferredLocality order: PROCESS_LOCAL, NODE_LOCAL, NO_PREF, RACK_LOCAL, ANY
  var launchedTask = false
  /*
   * Within a task set, PROCESS_LOCAL tasks are launched first and ANY tasks last
   */
  for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
    do {
      launchedTask = resourceOfferSingleTaskSet(
        taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
    } while (launchedTask)
  }

  if (tasks.size > 0) {
    hasLaunchedTask = true
  }
  return tasks
}
Within a task set, resources are allocated to tasks in the locality order PROCESS_LOCAL > NODE_LOCAL > NO_PREF > RACK_LOCAL > ANY.
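TaskLocality is an Enumeration, so "more local" levels compare as smaller; a small sketch of the ordering (assuming TaskLocality.isAllowed keeps the semantics shown in the dequeueTask code further down):

import org.apache.spark.scheduler.TaskLocality

// PROCESS_LOCAL < NODE_LOCAL < NO_PREF < RACK_LOCAL < ANY: smaller means "more local".
assert(TaskLocality.PROCESS_LOCAL < TaskLocality.NODE_LOCAL)
// isAllowed(maxAllowed, candidate): a task may be launched at `candidate` locality only if that level is
// no worse than the currently allowed maximum.
assert(TaskLocality.isAllowed(TaskLocality.NODE_LOCAL, TaskLocality.PROCESS_LOCAL))
assert(!TaskLocality.isAllowed(TaskLocality.NODE_LOCAL, TaskLocality.ANY))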
TaskSchedulerImpl.resourceOffers ultimately calls TaskSchedulerImpl.resourceOfferSingleTaskSet to allocate resources for one task set. resourceOfferSingleTaskSet polls each executor offer, takes the id of a runnable task from the pending queues, and hands it to the executor being polled:
private def resourceOfferSingleTaskSet(
    taskSet: TaskSetManager,
    maxLocality: TaskLocality,
    shuffledOffers: Seq[WorkerOffer],
    availableCpus: Array[Int],
    tasks: Seq[ArrayBuffer[TaskDescription]]) : Boolean = {
  var launchedTask = false
  for (i <- 0 until shuffledOffers.size) {
    val execId = shuffledOffers(i).executorId
    val host = shuffledOffers(i).host
    if (availableCpus(i) >= CPUS_PER_TASK) { // assign tasks according to the number of available CPU cores
      try {
        for (task <- taskSet.resourceOffer(execId, host, maxLocality)) {
          tasks(i) += task // place this task on the i-th worker (the offer order has already been shuffled)
          val tid = task.taskId
          taskIdToTaskSetId(tid) = taskSet.taskSet.id // record which task set the task belongs to
          taskIdToExecutorId(tid) = execId            // record which executor the task runs on
          executorsByHost(host) += execId
          availableCpus(i) -= CPUS_PER_TASK
          assert(availableCpus(i) >= 0)
          launchedTask = true
        }
      } catch {
        case e: TaskNotSerializableException =>
          logError(s"Resource offer failed, task set ${taskSet.name} was not serializable")
          // Do not offer resources for this task, but don't throw an error to allow other
          // task sets to be submitted.
          return launchedTask
      }
    }
  }
  return launchedTask
}

TaskSchedulerImpl.resourceOfferSingleTaskSet calls TaskSetManager.resourceOffer to allocate resources to a single task in the task set; TaskSetManager.resourceOffer in turn calls TaskSetManager.dequeueTask to perform the actual assignment. The code is as follows:
private def dequeueTask(execId: String, host: String, maxLocality: TaskLocality.Value)
  : Option[(Int, TaskLocality.Value, Boolean)] = {
  /*
   * If the ShuffleMapTask or ResultTask was created with locs of type ExecutorCacheTaskLocation, its index is
   * taken from the pendingTasksForExecutor HashMap
   */
  for (index <- dequeueTaskFromList(execId, getPendingTasksForExecutor(execId))) {
    return Some((index, TaskLocality.PROCESS_LOCAL, false))
  }

  if (TaskLocality.isAllowed(maxLocality, TaskLocality.NODE_LOCAL)) {
    // Because the IP address of a KafkaRDD partition differs from the executors' IP addresses, such tasks
    // cannot be taken from this HashMap
    for (index <- dequeueTaskFromList(execId, getPendingTasksForHost(host))) {
      return Some((index, TaskLocality.NODE_LOCAL, false))
    }
  }

  /*
   * If the ShuffleMapTask or ResultTask was created with locs == Nil, its index is taken from
   * pendingTasksWithNoPrefs
   */
  if (TaskLocality.isAllowed(maxLocality, TaskLocality.NO_PREF)) {
    // Look for noPref tasks after NODE_LOCAL for minimize cross-rack traffic
    for (index <- dequeueTaskFromList(execId, pendingTasksWithNoPrefs)) {
      return Some((index, TaskLocality.PROCESS_LOCAL, false))
    }
  }

  if (TaskLocality.isAllowed(maxLocality, TaskLocality.RACK_LOCAL)) {
    for {
      rack <- sched.getRackForHost(host)
      index <- dequeueTaskFromList(execId, getPendingTasksForRack(rack))
    } {
      return Some((index, TaskLocality.RACK_LOCAL, false))
    }
  }

  if (TaskLocality.isAllowed(maxLocality, TaskLocality.ANY)) {
    // KafkaRDD processing tasks are taken from the allPendingTasks queue
    for (index <- dequeueTaskFromList(execId, allPendingTasks)) {
      return Some((index, TaskLocality.ANY, false))
    }
  }

  // find a speculative task if all others tasks have been scheduled
  dequeueSpeculativeTask(execId, host, maxLocality).map {
    case (taskIndex, allowedLocality) => (taskIndex, allowedLocality, true)
  }
}
TaskSetManager.dequeueTask calls TaskSetManager.dequeueTaskFromList to take a task out of a pending list:
private def dequeueTaskFromList(execId: String, list: ArrayBuffer[Int]): Option[Int] = {
  var indexOffset = list.size
  while (indexOffset > 0) {
    indexOffset -= 1
    val index = list(indexOffset)
    if (!executorIsBlacklisted(execId, index)) {
      // This should almost always be list.trimEnd(1) to remove tail
      list.remove(indexOffset)
      /*
       * If the task is already running, copiesRunning(index) == 1.
       * If the task has already finished successfully, successful(index) == true.
       * This check prevents a task from being executed more than once.
       */
      if (copiesRunning(index) == 0 && !successful(index)) {
        return Some(index)
      }
    }
  }
  None
}
For tasks with strong locality, such as ExecutorCacheTaskLocation tasks, the task index is added to three data structures: pendingTasksForExecutor, pendingTasksForHost and allPendingTasks. If the task is already running, copiesRunning(index) == 1; if it has already finished successfully, successful(index) == true. The check copiesRunning(index) == 0 && !successful(index) therefore prevents the task from being executed twice.
From the analysis above we can conclude that which executor a Stage2 task runs on falls into two cases:
Case 1: the locs passed in when the Stage2 task is created are of type ExecutorCacheTaskLocation, i.e. (Map2 reduce2 length) / (Map1 reduce2 length + Map2 reduce2 length + Map3 reduce2 length) > REDUCER_PREF_LOCS_FRACTION. Such a task runs on the executor where Map2 ran.
Case 2: the locs passed in when the Stage2 task is created are Nil, i.e. no map output's share of the partition exceeds REDUCER_PREF_LOCS_FRACTION. Such a task has no preferred location and is handed to whichever executor the scheduler offers, effectively an arbitrary executor.
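The two cases can be summed up in a small self-contained sketch (the types and helper below are illustrative, not Spark's API): given the bytes each map task contributed to one reduce partition, either some executors qualify as preferred locations (case 1) or the list is empty and the task is placed with no preference (case 2).

case class MapOutput(host: String, executorId: String, bytesForThisReduce: Long)

// Returns (host, executorId) pairs for case 1, or an empty Seq for case 2.
def preferredExecutors(outputs: Seq[MapOutput], fraction: Double = 0.2): Seq[(String, String)] = {
  val total = outputs.map(_.bytesForThisReduce).sum.toDouble
  outputs.filter(_.bytesForThisReduce / total > fraction)   // case 1: these maps dominate the reduce partition
         .map(o => (o.host, o.executorId))                  // -> ExecutorCacheTaskLocation-style placement
  // If nothing dominates, the result is empty: the task goes to pendingTasksWithNoPrefs and can run anywhere.
}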