A Walkthrough of DAGScheduler's Core Steps

Source: Internet | Editor: 程序博客网 | Time: 2024/05/16 06:17

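DAGScheduler overview

1. Builds the DAG (the directed acyclic graph of RDDs) and splits it into stages.
2. Maintains the mapping between jobs and stages.
3. Tracks where RDD data is stored, so that tasks can be placed according to data locality.
4. Generates a TaskSet from each stage and submits it to the TaskScheduler.
5. Handles internal events, using a queue to preserve the processing order.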

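DAGScheduler is the high-level, stage-oriented scheduling layer. For each job it builds a DAG (directed acyclic graph) of stages, keeps track of which RDDs and stage outputs have been persisted, and finds a minimal schedule to run the job. It then submits each stage as a TaskSet to the underlying TaskScheduler implementation, which runs it on the cluster. A TaskSet consists of fully independent tasks that can run right away on data already present on the cluster; a task may still fail if that data becomes unavailable.
Spark splits the DAG into stages at wide (shuffle) dependencies. DAGScheduler then determines the preferred locations of each task from the current cache status and passes them to the TaskScheduler. Note that DAGScheduler also handles failures caused by lost shuffle output files, in which case earlier stages may need to be resubmitted. Failures inside a stage that are not caused by shuffle file loss are handled by the TaskScheduler, which retries the task.
A job is the top-level unit of work: it is submitted when an action such as count() is called. Each job consists of one or more stages.
A stage is a set of tasks that compute intermediate results of a job; each task applies the same function to a different partition of the same RDD. Stages are separated at wide dependencies, and there are two kinds: ResultStage (the final stage of a job) and ShuffleMapStage (which writes shuffle output). Jobs that use the same RDD may end up sharing the same stages.
A task is the smallest unit of execution; each task is sent to one executor.
Cache tracking: DAGScheduler works out which RDDs are cached to avoid recomputing them, and likewise remembers which shuffle map stages have already been computed to avoid redoing the shuffle.
Preferred locations: DAGScheduler assigns the tasks of a stage based on the locations of cached RDDs and shuffle data.
Cleanup: data is cleared once no job depends on it any more, to prevent memory leaks in long-running applications.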
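
To make the stage-splitting rule concrete, here is a minimal sketch in plain Scala. It does not use Spark: Node, Narrow, Wide and splitStages are hypothetical names invented for this illustration. It only shows the idea that narrow dependencies stay inside a stage while each wide dependency starts a new one.

```scala
object StageSplitSketch {
  // Hypothetical toy lineage model; these are not Spark's own classes.
  sealed trait Dep
  final case class Narrow(parent: Node) extends Dep
  final case class Wide(parent: Node) extends Dep
  final case class Node(name: String, deps: List[Dep] = Nil)

  /** Groups RDD names into stages, final stage first.
    * Narrow dependencies are followed inside a stage; a wide dependency ends it. */
  def splitStages(finalNode: Node): List[Set[String]] = {
    var stages = List.empty[Set[String]]
    var pendingStageEnds = List(finalNode) // last RDD of each stage still to build
    var built = Set.empty[String]
    while (pendingStageEnds.nonEmpty) {
      val end = pendingStageEnds.head
      pendingStageEnds = pendingStageEnds.tail
      if (!built(end.name)) {
        built += end.name
        var members = Set.empty[String]
        var stack = List(end) // explicit stack instead of recursion
        while (stack.nonEmpty) {
          val node = stack.head
          stack = stack.tail
          if (!members(node.name)) {
            members += node.name
            node.deps.foreach {
              case Narrow(p) => stack = p :: stack                      // same stage
              case Wide(p)   => pendingStageEnds = pendingStageEnds :+ p // parent stage
            }
          }
        }
        stages = stages :+ members
      }
    }
    stages
  }
}
```

For a lineage textFile -> map -> reduceByKey -> filter, where only reduceByKey depends on its parent through a wide dependency, the sketch yields two stages: {filter, reduceByKey} and {map, textFile}.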

The call chain for a Spark job is as follows:

/**
 * Runs an action job on the given RDD and passes the results to resultHandler.
 * @param rdd the RDD the job computes on
 * @param func the function to run on each partition
 * @param partitions the RDD partitions to compute
 * @param callSite where in the user program the job was called
 * @param resultHandler callback that receives each result
 * @param properties properties attached to the job
 * @throws Exception when the job fails
 */
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  // Delegate to submitJob
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  // ... (remaining code omitted)
}
/**
 * Submits a job to the scheduler.
 *
 * @param rdd the target RDD
 * @param func the function to run on each partition
 * @param partitions the target partitions
 * @param callSite where in the user program the job was called
 * @param resultHandler callback that receives each result
 * @param properties properties attached to the job
 * @return a JobWaiter that can be used to block until the job finishes, or to cancel it
 * @throws IllegalArgumentException when a requested partition does not exist
 */
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // Check that every requested partition exists
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }
  // Allocate a new job id
  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // Post the job to the event queue
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
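
The two ideas in submitJob that are easy to miss, the partition check and the waiter that completes once every task has reported, can be sketched in isolation. This is a simplified illustration in plain Scala; SimpleJobWaiter and validatePartitions are hypothetical names, and the real JobWaiter has more responsibilities (cancellation, failure handling) that are left out here.

```scala
import scala.concurrent.{Future, Promise}

object JobWaiterSketch {
  /** Hypothetical stand-in for JobWaiter: completes once every task has reported. */
  final class SimpleJobWaiter[U](totalTasks: Int, resultHandler: (Int, U) => Unit) {
    private var finished = 0
    // With zero tasks there is nothing to wait for, so start out completed.
    private val done: Promise[Unit] =
      if (totalTasks == 0) Promise.successful(()) else Promise[Unit]()

    def completionFuture: Future[Unit] = done.future

    def taskSucceeded(index: Int, result: U): Unit = synchronized {
      resultHandler(index, result) // hand the partial result to the caller
      finished += 1
      if (finished == totalTasks) done.trySuccess(())
    }
  }

  /** Same check as in submitJob: every partition index must be in [0, maxPartitions). */
  def validatePartitions(partitions: Seq[Int], maxPartitions: Int): Unit =
    partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
      throw new IllegalArgumentException(
        s"Attempting to access a non-existent partition: $p. " +
          s"Total number of partitions: $maxPartitions")
    }
}
```

A caller would block on completionFuture (or register a callback on it) while the scheduler calls taskSucceeded once per finished partition.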
DAGScheduler then takes events off the queue and handles them:
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    // Runs when a job is submitted
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
    dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)
  case StageCancelled(stageId) =>
    dagScheduler.handleStageCancellation(stageId)
  case JobCancelled(jobId) =>
    dagScheduler.handleJobCancellation(jobId)
  case JobGroupCancelled(groupId) =>
    dagScheduler.handleJobGroupCancelled(groupId)
  case AllJobsCancelled =>
    dagScheduler.doCancelAllJobs()
  case ExecutorAdded(execId, host) =>
    dagScheduler.handleExecutorAdded(execId, host)
  case ExecutorLost(execId, reason) =>
    val filesLost = reason match {
      case SlaveLost(_, true) => true
      case _ => false
    }
    dagScheduler.handleExecutorLost(execId, filesLost)
  case BeginEvent(task, taskInfo) =>
    dagScheduler.handleBeginEvent(task, taskInfo)
  case GettingResultEvent(taskInfo) =>
    dagScheduler.handleGetTaskResult(taskInfo)
  case completion: CompletionEvent =>
    dagScheduler.handleTaskCompletion(completion)
  case TaskSetFailed(taskSet, reason, exception) =>
    dagScheduler.handleTaskSetFailed(taskSet, reason, exception)
  case ResubmitFailedStages =>
    dagScheduler.resubmitFailedStages()
}
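
The pattern behind doOnReceive is a single consumer thread that takes events off a blocking queue and dispatches them with a pattern match, so events are handled one at a time in posting order. The following is a rough sketch of that pattern in plain Scala; SimpleEventLoop and its event types are hypothetical and much smaller than Spark's event loop.

```scala
import java.util.concurrent.LinkedBlockingDeque
import scala.collection.mutable.ArrayBuffer

object EventLoopSketch {
  // Hypothetical event types, loosely modelled on the names in the text.
  sealed trait SchedulerEvent
  final case class JobSubmitted(jobId: Int) extends SchedulerEvent
  final case class JobCancelled(jobId: Int) extends SchedulerEvent
  case object Stop extends SchedulerEvent

  final class SimpleEventLoop {
    private val queue = new LinkedBlockingDeque[SchedulerEvent]()
    // Only the worker thread writes here; read it after stopAndWait().
    val handled = ArrayBuffer.empty[String]

    private val worker = new Thread("sketch-event-loop") {
      override def run(): Unit = {
        var running = true
        while (running) {
          queue.take() match { // blocks until an event is available
            case JobSubmitted(id) => handled += s"submitted $id"
            case JobCancelled(id) => handled += s"cancelled $id"
            case Stop             => running = false
          }
        }
      }
    }
    worker.setDaemon(true)

    def start(): Unit = worker.start()
    def post(event: SchedulerEvent): Unit = queue.put(event) // returns immediately
    def stopAndWait(): Unit = { post(Stop); worker.join() }
  }
}
```

Because post only enqueues, the caller never runs scheduler logic on its own thread; that is the same reason submitJob can return a waiter right after posting JobSubmitted.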
// Runs for the JobSubmitted branch
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  // Create the final stage of the job, which is always a ResultStage
  finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  // Create the new job object
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  // Clear the previously cached location information
  clearCacheLocs()
  // Register the job and link it to its final stage
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  // Collect the latest StageInfo of every stage that belongs to the job
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  // Post the job-start event to the listener bus
  // (jobSubmissionTime is defined in code omitted from this excerpt)
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  // Submit the job by submitting its final stage
  submitStage(finalStage)
}

/** Submits a stage, but first recursively submits any missing parent stages. */
private def submitStage(stage: Stage) {
  // Look up the job this stage belongs to
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    // Skip stages that are already waiting, running, or failed
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // Find the parent stages that have not been computed yet (a new stage is created at
      // each wide dependency) and sort them by stage id, so the earliest stages come first.
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        // All direct parent stages are available, so submit this stage's tasks
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          // Recurse into each missing parent stage
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
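
The effect of this recursion is that only stages without missing parents actually start, while everything downstream is parked in the waiting set. Here is a small sketch of that behaviour in plain Scala; ToyStage and ToyScheduler are hypothetical, and the available flag stands in for "the stage's output already exists".

```scala
import scala.collection.mutable

object SubmitStageSketch {
  // Hypothetical toy stage; `available` stands in for "output already computed".
  final case class ToyStage(id: Int, parents: List[ToyStage], available: Boolean = false)

  final class ToyScheduler {
    val waiting = mutable.LinkedHashSet.empty[Int] // stage ids parked until parents finish
    val running = mutable.ArrayBuffer.empty[Int]   // stage ids whose tasks were submitted

    def submitStage(stage: ToyStage): Unit =
      if (!waiting(stage.id) && !running.contains(stage.id)) {
        val missing = stage.parents.filterNot(_.available).sortBy(_.id)
        if (missing.isEmpty) {
          running += stage.id          // corresponds to submitMissingTasks
        } else {
          missing.foreach(submitStage) // parents first
          waiting += stage.id          // retried once the parents complete
        }
      }
  }
}
```

With stages 0 <- 1 <- 3 and an already available stage 2 that is also a parent of 3, submitting stage 3 starts only stage 0 and leaves stages 1 and 3 waiting. In the real scheduler, waiting stages are submitted later, when their parents finish.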
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // Use an explicit stack to avoid a StackOverflowError from deep recursion
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      // Look up the cache locations of the RDD's partitions
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        // Walk over the RDD's dependencies
        for (dep <- rdd.dependencies) {
          dep match {
            // Wide dependency: get or create the ShuffleMapStage; if its output is not
            // available yet, add it to the missing set
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            // Narrow dependency: push the parent RDD onto the stack
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  // Start from the stage's final RDD
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  // Return the stages that still need to be computed
  missing.toList
}
/** Called once the stage's direct parent stages are available, to run the stage itself. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
  // Clear the pending partitions left over from earlier attempts
  stage.pendingPartitions.clear()
  // Find the partitions of the stage that still have to be computed
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  // Get the properties of the job
  val properties = jobIdToActiveJob(jobId).properties
  runningStages += stage
  // Tell the OutputCommitCoordinator that the stage is starting
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  // Compute the preferred locations of each task
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id))}.toMap
      case s: ResultStage =>
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    // exception handling (omitted)
  }
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
  // Serialize the task
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        JavaUtils.bufferToArray(
          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
      case stage: ResultStage =>
        JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
    }
    // Broadcast the serialized task binary
    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // exception handling (omitted)
  }
  // Create the tasks; the task type depends on the stage type
  val tasks: Seq[Task[_]] = try {
    stage match {
      case stage: ShuffleMapStage =>
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, stage.latestInfo.taskMetrics, properties, Option(jobId),
            Option(sc.applicationId), sc.applicationAttemptId)
        }
      case stage: ResultStage =>
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics,
            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
        }
    }
  } catch {
    // exception handling (omitted)
  }
  if (tasks.size > 0) {
    logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
    stage.pendingPartitions ++= tasks.map(_.partitionId)
    logDebug("New pending partitions: " + stage.pendingPartitions)
    // Submit the TaskSet to the TaskScheduler
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    // Nothing left to compute, so the stage is already finished
    markStageAsFinished(stage, None)
    // debugString is built in code omitted from this excerpt
    logDebug(debugString)
    // Submit the child stages that were waiting on this stage
    submitWaitingChildStages(stage)
  }
}

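As shown above, when DAGScheduler assigns tasks it first looks up each task's preferred locations with getPreferredLocs, which delegates to getPreferredLocsInternal. Cached locations take priority; next comes the RDD's own preferredLocations; finally the lookup follows narrow dependencies to the parent RDDs. A wide dependency always requires a shuffle, so looking up preferred locations across it would be pointless.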
private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
  // If the partition has already been visited, no need to re-visit.
  // This avoids exponential path exploration. SPARK-695
  if (!visited.add((rdd, partition))) {
    // Nil has already been returned for previously visited partitions.
    return Nil
  }
  // If the partition is cached, return the cache locations
  val cached = getCacheLocs(rdd)(partition)
  if (cached.nonEmpty) {
    return cached
  }
  // If the RDD defines its own preferredLocations, use them
  val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
  if (rddPrefs.nonEmpty) {
    return rddPrefs.map(TaskLocation(_))
  }
  // If the RDD has narrow dependencies, pick the first partition of the first narrow
  // dependency that has any placement preferences. Ideally we would choose based on
  // transfer sizes, but this will do for now.
  rdd.dependencies.foreach {
    case n: NarrowDependency[_] =>
      for (inPart <- n.getParents(partition)) {
        val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
        if (locs != Nil) {
          return locs
        }
      }
    case _ =>
  }
  Nil
}
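
The lookup order in getPreferredLocsInternal (cache first, then the RDD's own preferences, then narrow parents, never across a shuffle) can be illustrated with a stripped-down model. This is only a sketch: ToyRdd and preferredLocs are hypothetical, each toy RDD has a single partition, and the visited set used by the real code to avoid repeated exploration is left out.

```scala
object PreferredLocsSketch {
  // Hypothetical single-partition RDD model; host names are plain strings.
  final case class ToyRdd(
      name: String,
      cachedOn: Seq[String] = Nil,        // hosts holding a cached copy
      ownPrefs: Seq[String] = Nil,        // what the RDD's own preferredLocations would return
      narrowParents: Seq[ToyRdd] = Nil,
      shuffleParents: Seq[ToyRdd] = Nil)  // never consulted for locality

  def preferredLocs(rdd: ToyRdd): Seq[String] =
    if (rdd.cachedOn.nonEmpty) rdd.cachedOn       // 1. cached copy wins
    else if (rdd.ownPrefs.nonEmpty) rdd.ownPrefs  // 2. the RDD's own preference
    else                                          // 3. first narrow parent with a preference
      rdd.narrowParents.iterator.map(preferredLocs).find(_.nonEmpty).getOrElse(Nil)
}
```

An RDD that reads from storage reports its own hosts, a map over it inherits them through the narrow dependency, a cached copy overrides both, and an RDD reached only through a shuffle gets no preference at all.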
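Every wide dependency gets a corresponding ShuffleMapStage. If the shuffle data was already computed earlier, the parts that are still available are reused when the stage is created, which reduces recomputation.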
/**
 * Creates a ShuffleMapStage that generates the partitions of the given wide dependency. If an
 * earlier stage already produced the same shuffle data, this function picks up the outputs that
 * are still available, so the data does not have to be regenerated.
 */
def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd
  val numTasks = rdd.partitions.length
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ShuffleMapStage(id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep)
  stageIdToStage(id) = stage
  shuffleIdToMapStage(shuffleDep.shuffleId) = stage
  updateJobIdStageIdMaps(jobId, stage)
  if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    // The shuffle may have been computed before; copy the output locations that are
    // still available into the new stage
    val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
    val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
    (0 until locs.length).foreach { i =>
      if (locs(i) ne null) {
        // locs(i) will be null if missing
        stage.addOutputLoc(i, locs(i))
      }
    }
  } else {
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}
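
The reuse logic boils down to a registry keyed by shuffle id: if outputs for a shuffle are already registered, only the partitions without an output need to be computed again. A minimal sketch of that bookkeeping in plain Scala follows; ToyOutputTracker and missingPartitions are hypothetical and only loosely mirror the role that mapOutputTracker plays in the code above.

```scala
import scala.collection.mutable

object ShuffleReuseSketch {
  /** Hypothetical registry: shuffleId -> location of each map output (None if missing). */
  final class ToyOutputTracker {
    private val outputs = mutable.Map.empty[Int, Array[Option[String]]]

    def containsShuffle(id: Int): Boolean = outputs.contains(id)

    def registerShuffle(id: Int, numPartitions: Int): Unit =
      outputs(id) = Array.fill[Option[String]](numPartitions)(None)

    def registerOutput(id: Int, partition: Int, host: String): Unit =
      outputs(id)(partition) = Some(host)

    def locations(id: Int): Array[Option[String]] = outputs(id)
  }

  /** Partitions that a new stage for this shuffle still has to compute. */
  def missingPartitions(tracker: ToyOutputTracker, shuffleId: Int, numPartitions: Int): Seq[Int] = {
    // First sight of this shuffle: register it with no outputs at all.
    if (!tracker.containsShuffle(shuffleId)) tracker.registerShuffle(shuffleId, numPartitions)
    val locs = tracker.locations(shuffleId)
    (0 until numPartitions).filter(i => locs(i).isEmpty)
  }
}
```

On the first call for a shuffle with three partitions, all three are missing; after outputs for partitions 0 and 2 are registered, a later stage for the same shuffle only has to compute partition 1.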

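Throughout this process, DAGScheduler identifies jobs from the actions in the submitted application and posts each job's information to the event queue. Every job has one ResultStage. DAGScheduler traces back through the ancestors of that ResultStage, creates a new stage whenever it meets a wide dependency, and recurses until the DAG is built. It then creates the TaskSet that matches each stage's type, serializes it, and dispatches it according to data locality.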

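A few points to note:

1 Job IDs are allocated from an AtomicInteger; under the default FIFO scheduling mode, a job with a smaller ID is scheduled first;

2 Stage IDs are also allocated from an AtomicInteger. When DAGScheduler splits stages recursively, ancestor stages are created first, so a smaller ID means the stage sits earlier in the DAG;

3 DAGScheduler has an internal EventLoop that processes internal events asynchronously; the events include JobSubmitted, MapStageSubmitted, StageCancelled, JobCancelled, and so on;

4 DAGScheduler posts event information asynchronously to SparkListener instances through the LiveListenerBus;

5 Once all dependencies of the current stage have been handled, DAGScheduler submits the stage to the TaskScheduler as a TaskSet, with location information bound to each task ID according to data locality;

6 For each stage, the data and dependency the tasks need (the RDD together with the shuffle dependency or the result function) are serialized and broadcast to the executors.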
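
Point 2 follows directly from the order of operations when a stage is created: the parents are resolved before the stage takes its own ID from the counter. A small sketch in plain Scala makes this visible; Lineage and assignStageIds are hypothetical names used only for this illustration.

```scala
import java.util.concurrent.atomic.AtomicInteger

object StageIdSketch {
  // Hypothetical stage description: a name plus the stages behind its wide dependencies.
  final case class Lineage(name: String, shuffleParents: List[Lineage] = Nil)

  /** Returns stage name -> id, creating parent stages before taking an id for the child. */
  def assignStageIds(finalStage: Lineage): Map[String, Int] = {
    val nextStageId = new AtomicInteger(0)
    var ids = Map.empty[String, Int]
    def create(stage: Lineage): Unit =
      if (!ids.contains(stage.name)) {
        stage.shuffleParents.foreach(create)               // parents first
        ids += stage.name -> nextStageId.getAndIncrement() // then this stage
      }
    create(finalStage)
    ids
  }
}
```

For a result stage with two parent map stages, the parents receive IDs 0 and 1 and the result stage receives ID 2, which is why sorting missing stages by ID processes the earliest part of the DAG first.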
