Spark Source Code Study (6) --- Task Division in DAGScheduler
In the previous article, we traced execution into the submitStage method; once a stage has no missing parent stages, submitStage calls submitMissingTasks(stage, jobId.get).
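For reference, the call site looks roughly like this in the same Spark 1.x-era DAGScheduler source (abridged; minor details vary across versions). A stage is only submitted once all of its parent stages have been computed; otherwise the missing parents are submitted first and the stage goes back onto waitingStages:

```scala
/** Submits stage, but first recursively submits any missing parents. */
private def submitStage(stage: Stage) {
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      val missing = getMissingParentStages(stage).sortBy(_.id)
      if (missing.isEmpty) {
        logInfo("Submitting " + stage + " (" + stage.rdd + "), which has no missing parents")
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
```

The body of submitMissingTasks breaks into two halves. The first half figures out which partitions still need computing and where each task would prefer to run: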
```scala
private def submitMissingTasks(stage: Stage, jobId: Int) {
  logDebug("submitMissingTasks(" + stage + ")")
  // Get our pending tasks and remember them in our pendingTasks entry
  stage.pendingPartitions.clear()

  // First figure out the indexes of partition ids to compute.
  // The number of tasks equals the number of missing partitions.
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()

  // Create internal accumulators if the stage has no accumulators initialized.
  // Reset internal accumulators only if this stage is not partially submitted.
  // Otherwise, we may override existing accumulator values from some tasks.
  if (stage.internalAccumulators.isEmpty || stage.numPartitions == partitionsToCompute.size) {
    stage.resetInternalAccumulators()
  }

  // Use the scheduling pool, job group, description, etc. from an ActiveJob associated
  // with this Stage
  val properties = jobIdToActiveJob(jobId).properties

  // Add the stage to the set of running stages.
  runningStages += stage
  // SparkListenerStageSubmitted should be posted before testing whether tasks are
  // serializable. If tasks are not serializable, a SparkListenerStageCompleted event
  // will be posted, which should always come after a corresponding
  // SparkListenerStageSubmitted event.
  // Match on the stage type: ShuffleMapStage or ResultStage. Only the final
  // stage of a job is a ResultStage.
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }

  // Compute the preferred locations for every partition that needs a task.
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
      case s: ResultStage =>
        val job = s.activeJob.get
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    case NonFatal(e) =>
      stage.makeNewStageAttempt(partitionsToCompute.size)
      listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
      abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
      runningStages -= stage
      return
  }

  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
```
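So far nothing has actually run; the method has only worked out which partitions need tasks and where those tasks would prefer to execute, and has registered a new stage attempt with the listener bus. The second half serializes the task payload, broadcasts it to the executors, builds one task per missing partition, and hands the resulting TaskSet to the TaskScheduler: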
```scala
  // TODO: Maybe we can keep the taskBinary in Stage to avoid serializing it multiple times.
  // Broadcasted binary for the task, used to dispatch tasks to executors. Note that we
  // broadcast the serialized copy of the RDD and for each task we will deserialize it,
  // which means each task gets a different copy of the RDD. This provides stronger
  // isolation between tasks that might modify state of objects referenced in their
  // closures. This is necessary in Hadoop where the JobConf/Configuration object is not
  // thread-safe.
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef).array()
      case stage: ResultStage =>
        closureSerializer.serialize((stage.rdd, stage.func): AnyRef).array()
    }

    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // In the case of a failure during serialization, abort the stage.
    case e: NotSerializableException =>
      abortStage(stage, "Task not serializable: " + e.toString, Some(e))
      runningStages -= stage

      // Abort execution
      return
    case NonFatal(e) =>
      abortStage(stage, s"Task serialization failed: $e\n${e.getStackTraceString}", Some(e))
      runningStages -= stage
      return
  }

  // Create one task per missing partition for this stage.
  val tasks: Seq[Task[_]] = try {
    stage match {
      case stage: ShuffleMapStage =>
        partitionsToCompute.map { id =>
          // One ShuffleMapTask per partition, each with its preferred locations.
          // Preferred-location algorithm: walk backwards from the final RDD; if an RDD
          // is cached or checkpointed, prefer running the task where that data lives.
          // Otherwise placement is left to the TaskScheduler.
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, stage.internalAccumulators)
        }

      case stage: ResultStage =>
        val job = stage.activeJob.get
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, stage.internalAccumulators)
        }
    }
  } catch {
    case NonFatal(e) =>
      abortStage(stage, s"Task creation failed: $e\n${e.getStackTraceString}", Some(e))
      runningStages -= stage
      return
  }

  if (tasks.size > 0) {
    logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
    stage.pendingPartitions ++= tasks.map(_.partitionId)
    logDebug("New pending partitions: " + stage.pendingPartitions)
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    // Because we posted SparkListenerStageSubmitted earlier, we should mark
    // the stage as completed here in case there are no tasks to run.
    markStageAsFinished(stage, None)

    val debugString = stage match {
      case stage: ShuffleMapStage =>
        s"Stage ${stage} is actually done; " +
          s"(available: ${stage.isAvailable}," +
          s"available outputs: ${stage.numAvailableOutputs}," +
          s"partitions: ${stage.numPartitions})"
      case stage: ResultStage =>
        s"Stage ${stage} is actually done; (partitions: ${stage.numPartitions})"
    }
    logDebug(debugString)
  }
}
```
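The preferred-location logic mentioned in the comments above is worth unpacking. getPreferredLocs delegates to a recursive helper: it first checks whether the partition is cached, then asks the RDD itself for placement preferences (this is how checkpointed data and HDFS input splits report their locations), and finally recurses backwards through narrow dependencies looking for any ancestor with a preference. The sketch below is abridged from the same Spark 1.x-era DAGScheduler source; details vary slightly across versions:

```scala
private[spark] def getPreferredLocs(rdd: RDD[_], partition: Int): Seq[TaskLocation] = {
  getPreferredLocsInternal(rdd, partition, new HashSet)
}

private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
  // If the partition has already been visited, no need to re-compute it
  if (!visited.add((rdd, partition))) {
    return Nil
  }
  // If the partition is cached, return the cache locations
  val cached = getCacheLocs(rdd)(partition)
  if (cached.nonEmpty) {
    return cached
  }
  // If the RDD has placement preferences of its own (as input RDDs and
  // checkpointed RDDs do), return those
  val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
  if (rddPrefs.nonEmpty) {
    return rddPrefs.map(TaskLocation(_))
  }
  // Otherwise, walk backwards through narrow dependencies and return the first
  // placement preference found among the ancestors
  rdd.dependencies.foreach {
    case n: NarrowDependency[_] =>
      for (inPart <- n.getParents(partition)) {
        val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
        if (locs != Nil) {
          return locs
        }
      }
    case _ =>
  }
  Nil
}
```

If every branch comes up empty, Nil is returned and the TaskScheduler is free to place the task on any executor. Once the TaskSet reaches taskScheduler.submitTasks, the DAGScheduler's work for this stage is done; the next article picks up with how tasks are actually submitted and scheduled.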