A Walkthrough of DAGScheduler's Core Steps
DAGScheduler Overview
1. Builds the DAG (directed acyclic graph of RDDs) and splits it into stages
2. Maintains the mapping between jobs and stages
3. Tracks RDD storage locations according to data-locality principles
4. Generates a TaskSet from each stage and submits it to the TaskScheduler
5. Listens to internal events, using a queue to keep them in order
The DAGScheduler is the high-level, stage-oriented scheduling layer. For each job it builds a DAG of stages, tracks which RDD and stage outputs have been materialized, and computes a minimal schedule for the job. It then submits stages as TaskSets to the TaskScheduler implementation, which runs them on the cluster. A TaskSet consists of fully independent tasks that can run immediately on data already present on the cluster (tasks may fail if that data becomes unavailable).
Spark splits the DAG into stages at wide (shuffle) dependencies; the DAGScheduler then decides each task's preferred location based on where data is cached, and passes this information on to the TaskScheduler. Note that failures caused by lost shuffle output files make the affected stages get resubmitted, while failures inside a stage that do not involve shuffle-file loss are retried by the TaskScheduler.
A job is the top-level unit of work; one is submitted whenever an action such as count() is called. Each job consists of one or more stages.
A stage is a set of tasks that computes intermediate results within a job, where every task performs the same computation on a different partition of the RDD. Stages are delimited by wide dependencies and come in two kinds: ResultStage (the final stage of each job) and ShuffleMapStage (which writes shuffle output). Different jobs that operate on the same RDDs may end up sharing stages.
A Task is the smallest unit of execution; each task is dispatched to a single executor.
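To make these three concepts concrete, here is a minimal sketch (assuming a local SparkContext named sc; the data is made up). reduceByKey introduces a wide dependency, so the collect() action submits one job that the DAGScheduler splits into a ShuffleMapStage and a ResultStage, with one task per partition in each stage:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("dag-demo"))

    // flatMap and map are narrow dependencies, so they stay within one stage
    val words = sc.parallelize(Seq("a b a", "b c"), 2).flatMap(_.split(" "))
    val pairs = words.map(w => (w, 1))
    // reduceByKey is a wide dependency: everything before it becomes a
    // ShuffleMapStage, everything after it the final ResultStage
    val counts = pairs.reduceByKey(_ + _)
    // the action submits one job containing both stages; each stage runs
    // one task per partition (two here)
    counts.collect()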
Cache tracking: the DAGScheduler tracks which RDDs are cached, and where, to avoid recomputing them, and likewise remembers which shuffle map stages have already produced their output so the shuffle is not redone (illustrated below).
Preferred locations: the DAGScheduler assigns tasks within a stage to locations based on where RDD partitions are cached and where shuffle data lives.
Cleanup: data that no running job depends on is cleaned up, preventing memory leaks in long-running applications.
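A user-level illustration of the cache tracking above (a sketch, continuing the same hypothetical local sc): after the first job materializes the cached RDD, the DAGScheduler sees non-empty cache locations for its partitions, so the second job's tasks are scheduled next to the cached blocks instead of recomputing the lineage.

    // an expensive lineage, cached after its first computation
    val expensive = sc.parallelize(1 to 1000, 4)
      .map { i => Thread.sleep(1); i * i }  // stands in for real work
      .cache()

    expensive.count()    // job 1 computes all partitions and fills the cache
    expensive.collect()  // job 2's tasks run where the cached blocks live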
The Spark job submission call chain is as follows:
/**
 * Run an action job on the given RDD and pass all the results to resultHandler.
 * @param rdd the RDD the job computes on
 * @param func the function to run on each partition
 * @param partitions the set of RDD partitions to compute
 * @param callSite where in the user program this job was called
 * @param resultHandler callback invoked with each partition's result
 * @param properties scheduler properties attached to this job
 * @throws Exception when the job fails
 */
def runJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): Unit = {
  // delegate to submitJob, which returns a JobWaiter
  val waiter = submitJob(rdd, func, partitions, callSite, resultHandler, properties)
  // ... (rest omitted)
}
/**
 * Submit an action job to the scheduler.
 *
 * @param rdd the target RDD to run the job on
 * @param func the function to run on each partition
 * @param partitions the target partitions
 * @param callSite where in the user program this job was called
 * @param resultHandler callback invoked with each partition's result
 * @param properties scheduler properties attached to this job
 * @return a JobWaiter that can be used to block until the job finishes, or to cancel it
 * @throws IllegalArgumentException when a partition index is out of range
 */
def submitJob[T, U](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    callSite: CallSite,
    resultHandler: (Int, U) => Unit,
    properties: Properties): JobWaiter[U] = {
  // validate that all requested partitions exist
  val maxPartitions = rdd.partitions.length
  partitions.find(p => p >= maxPartitions || p < 0).foreach { p =>
    throw new IllegalArgumentException(
      "Attempting to access a non-existent partition: " + p + ". " +
        "Total number of partitions: " + maxPartitions)
  }
  // allocate a new job id
  val jobId = nextJobId.getAndIncrement()
  if (partitions.size == 0) {
    // Return immediately if the job is running 0 tasks
    return new JobWaiter[U](this, jobId, 0, resultHandler)
  }
  assert(partitions.size > 0)
  val func2 = func.asInstanceOf[(TaskContext, Iterator[_]) => _]
  val waiter = new JobWaiter(this, jobId, partitions.size, resultHandler)
  // post the job to the scheduler's event queue
  eventProcessLoop.post(JobSubmitted(
    jobId, rdd, func2, partitions.toArray, callSite, waiter,
    SerializationUtils.clone(properties)))
  waiter
}
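The JobWaiter returned above is what lets the synchronous runJob block until the asynchronous event loop finishes the job. A simplified, hypothetical stand-in (not Spark's actual class) shows the pattern: a counter of finished tasks plus a Promise completed when the last task reports in.

    import java.util.concurrent.atomic.AtomicInteger
    import scala.concurrent.{Await, Promise}
    import scala.concurrent.duration.Duration

    // hypothetical, simplified analogue of Spark's JobWaiter
    class SimpleJobWaiter[U](totalTasks: Int, resultHandler: (Int, U) => Unit) {
      private val finished = new AtomicInteger(0)
      private val promise = Promise[Unit]()
      if (totalTasks == 0) promise.success(())  // a zero-task job is already done

      // called by the scheduler as each task's result arrives
      def taskSucceeded(index: Int, result: U): Unit = {
        resultHandler(index, result)
        if (finished.incrementAndGet() == totalTasks) promise.success(())
      }

      def jobFailed(e: Exception): Unit = promise.tryFailure(e)

      // what a runJob-style caller does with the waiter: block until done or failed
      def awaitResult(): Unit = Await.result(promise.future, Duration.Inf)
    }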
The DAGScheduler's event loop then takes events off this queue and dispatches them:
private def doOnReceive(event: DAGSchedulerEvent): Unit = event match {
  // runs when a job is submitted
  case JobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties) =>
    dagScheduler.handleJobSubmitted(jobId, rdd, func, partitions, callSite, listener, properties)
  case MapStageSubmitted(jobId, dependency, callSite, listener, properties) =>
    dagScheduler.handleMapStageSubmitted(jobId, dependency, callSite, listener, properties)
  case StageCancelled(stageId) =>
    dagScheduler.handleStageCancellation(stageId)
  case JobCancelled(jobId) =>
    dagScheduler.handleJobCancellation(jobId)
  case JobGroupCancelled(groupId) =>
    dagScheduler.handleJobGroupCancelled(groupId)
  case AllJobsCancelled =>
    dagScheduler.doCancelAllJobs()
  case ExecutorAdded(execId, host) =>
    dagScheduler.handleExecutorAdded(execId, host)
  case ExecutorLost(execId, reason) =>
    val filesLost = reason match {
      case SlaveLost(_, true) => true
      case _ => false
    }
    dagScheduler.handleExecutorLost(execId, filesLost)
  case BeginEvent(task, taskInfo) =>
    dagScheduler.handleBeginEvent(task, taskInfo)
  case GettingResultEvent(taskInfo) =>
    dagScheduler.handleGetTaskResult(taskInfo)
  case completion: CompletionEvent =>
    dagScheduler.handleTaskCompletion(completion)
  case TaskSetFailed(taskSet, reason, exception) =>
    dagScheduler.handleTaskSetFailed(taskSet, reason, exception)
  case ResubmitFailedStages =>
    dagScheduler.resubmitFailedStages()
}
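doOnReceive is driven by DAGSchedulerEventProcessLoop, a single-threaded event loop. The sketch below (my own minimal version of the pattern, not Spark's EventLoop class verbatim) shows the idea: producers post() events onto a blocking queue and a single daemon thread consumes them in order, which is how the DAGScheduler serializes all changes to its internal state.

    import java.util.concurrent.LinkedBlockingDeque

    // minimal sketch of the event-loop pattern behind DAGSchedulerEventProcessLoop
    abstract class MiniEventLoop[E](name: String) {
      private val queue = new LinkedBlockingDeque[E]()
      @volatile private var stopped = false

      private val thread = new Thread(name) {
        override def run(): Unit =
          try {
            while (!stopped) onReceive(queue.take())  // blocks until an event arrives
          } catch {
            case _: InterruptedException =>  // stop() interrupts a blocked take()
          }
      }
      thread.setDaemon(true)

      def start(): Unit = thread.start()
      def stop(): Unit = { stopped = true; thread.interrupt() }
      def post(event: E): Unit = queue.put(event)

      protected def onReceive(event: E): Unit  // the doOnReceive analogue
    }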
// the JobSubmitted branch
private[scheduler] def handleJobSubmitted(jobId: Int,
    finalRDD: RDD[_],
    func: (TaskContext, Iterator[_]) => _,
    partitions: Array[Int],
    callSite: CallSite,
    listener: JobListener,
    properties: Properties) {
  var finalStage: ResultStage = null
  // create one finalStage (of type ResultStage) for the job
  finalStage = createResultStage(finalRDD, func, partitions, jobId, callSite)
  // create the new job object
  val job = new ActiveJob(jobId, finalStage, callSite, listener, properties)
  // clear the cached RDD-location information
  clearCacheLocs()
  // record the job-to-stage bindings
  jobIdToActiveJob(jobId) = job
  activeJobs += job
  finalStage.setActiveJob(job)
  val stageIds = jobIdToStageIds(jobId).toArray
  // collect the info of the stages this job depends on
  val stageInfos = stageIds.flatMap(id => stageIdToStage.get(id).map(_.latestInfo))
  // post a job-start event to the listener bus
  listenerBus.post(
    SparkListenerJobStart(job.jobId, jobSubmissionTime, stageInfos, properties))
  // submit the job by submitting its finalStage
  submitStage(finalStage)
}
/** Submits a stage, but first recursively submits any missing ancestor stages. */
private def submitStage(stage: Stage) {
  // find an active job that needs this stage
  val jobId = activeJobForStage(stage)
  if (jobId.isDefined) {
    // only proceed if the stage is not already waiting, running, or failed
    if (!waitingStages(stage) && !runningStages(stage) && !failedStages(stage)) {
      // Find the stage's uncomputed ancestor stages (a new stage starts at every
      // wide dependency) and sort them by stage id, so computation starts from the
      // earliest stage; this traversal is what yields the DAG.
      val missing = getMissingParentStages(stage).sortBy(_.id)
      logDebug("missing: " + missing)
      if (missing.isEmpty) {
        // all direct ancestor stages are computed, so this stage's tasks can be submitted
        submitMissingTasks(stage, jobId.get)
      } else {
        for (parent <- missing) {
          // recursively submit the ancestor stages first
          submitStage(parent)
        }
        waitingStages += stage
      }
    }
  } else {
    abortStage(stage, "No active job for stage " + stage.id, None)
  }
}
private def getMissingParentStages(stage: Stage): List[Stage] = {
  val missing = new HashSet[Stage]
  val visited = new HashSet[RDD[_]]
  // use an explicit stack to avoid StackOverflowError on deep lineage chains
  val waitingForVisit = new Stack[RDD[_]]
  def visit(rdd: RDD[_]) {
    if (!visited(rdd)) {
      visited += rdd
      // check whether the RDD has any uncached partitions
      val rddHasUncachedPartitions = getCacheLocs(rdd).contains(Nil)
      if (rddHasUncachedPartitions) {
        // walk the RDD's dependencies
        for (dep <- rdd.dependencies) {
          dep match {
            // a wide dependency gets (or creates) a ShuffleMapStage; if that
            // stage's output is not yet available, add it to the missing list
            case shufDep: ShuffleDependency[_, _, _] =>
              val mapStage = getOrCreateShuffleMapStage(shufDep, stage.firstJobId)
              if (!mapStage.isAvailable) {
                missing += mapStage
              }
            // a narrow dependency's RDD is pushed onto the stack and stays in this stage
            case narrowDep: NarrowDependency[_] =>
              waitingForVisit.push(narrowDep.rdd)
          }
        }
      }
    }
  }
  // start the traversal from the stage's final RDD
  waitingForVisit.push(stage.rdd)
  while (waitingForVisit.nonEmpty) {
    visit(waitingForVisit.pop())
  }
  // return the list of stages that still need to be computed
  missing.toList
}
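The explicit Stack above deserves a note: recursing directly through an RDD lineage with thousands of narrow dependencies would overflow the JVM call stack, so the traversal keeps its frontier on the heap instead. The same technique in isolation (a generic sketch, not Spark code):

    import scala.collection.mutable

    // iterative DFS with an explicit stack, safe for arbitrarily deep graphs
    def reachable[A](root: A, neighbors: A => Seq[A]): Set[A] = {
      val visited = mutable.HashSet[A]()
      val waitingForVisit = mutable.Stack[A](root)
      while (waitingForVisit.nonEmpty) {
        val node = waitingForVisit.pop()
        if (visited.add(node)) {  // add returns false if already seen
          neighbors(node).foreach(waitingForVisit.push)
        }
      }
      visited.toSet
    }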
/** Called when the stage's direct ancestor stages are done; generates and submits its tasks. */
private def submitMissingTasks(stage: Stage, jobId: Int) {
  // clear state left over from earlier attempts
  stage.pendingPartitions.clear()
  // figure out which partitions of the stage still need to be computed
  val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
  // fetch the job's configured properties
  val properties = jobIdToActiveJob(jobId).properties
  runningStages += stage
  // notify the output commit coordinator that the stage is starting
  stage match {
    case s: ShuffleMapStage =>
      outputCommitCoordinator.stageStart(stage = s.id, maxPartitionId = s.numPartitions - 1)
    case s: ResultStage =>
      outputCommitCoordinator.stageStart(
        stage = s.id, maxPartitionId = s.rdd.partitions.length - 1)
  }
  // compute the preferred locations of each task
  val taskIdToLocations: Map[Int, Seq[TaskLocation]] = try {
    stage match {
      case s: ShuffleMapStage =>
        partitionsToCompute.map { id => (id, getPreferredLocs(stage.rdd, id)) }.toMap
      case s: ResultStage =>
        partitionsToCompute.map { id =>
          val p = s.partitions(id)
          (id, getPreferredLocs(stage.rdd, p))
        }.toMap
    }
  } catch {
    // exception handling omitted
  }
  stage.makeNewStageAttempt(partitionsToCompute.size, taskIdToLocations.values.toSeq)
  listenerBus.post(SparkListenerStageSubmitted(stage.latestInfo, properties))
  // serialize the task payload
  var taskBinary: Broadcast[Array[Byte]] = null
  try {
    // For ShuffleMapTask, serialize and broadcast (rdd, shuffleDep).
    // For ResultTask, serialize and broadcast (rdd, func).
    val taskBinaryBytes: Array[Byte] = stage match {
      case stage: ShuffleMapStage =>
        JavaUtils.bufferToArray(
          closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
      case stage: ResultStage =>
        JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
    }
    // broadcast the serialized task payload to the executors
    taskBinary = sc.broadcast(taskBinaryBytes)
  } catch {
    // exception handling omitted
  }
  // build a different kind of task depending on the stage type
  val tasks: Seq[Task[_]] = try {
    stage match {
      case stage: ShuffleMapStage =>
        partitionsToCompute.map { id =>
          val locs = taskIdToLocations(id)
          val part = stage.rdd.partitions(id)
          new ShuffleMapTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, stage.latestInfo.taskMetrics, properties, Option(jobId),
            Option(sc.applicationId), sc.applicationAttemptId)
        }
      case stage: ResultStage =>
        partitionsToCompute.map { id =>
          val p: Int = stage.partitions(id)
          val part = stage.rdd.partitions(p)
          val locs = taskIdToLocations(id)
          new ResultTask(stage.id, stage.latestInfo.attemptId,
            taskBinary, part, locs, id, properties, stage.latestInfo.taskMetrics,
            Option(jobId), Option(sc.applicationId), sc.applicationAttemptId)
        }
    }
  } catch {
    // exception handling omitted
  }
  if (tasks.size > 0) {
    logInfo("Submitting " + tasks.size + " missing tasks from " + stage + " (" + stage.rdd + ")")
    stage.pendingPartitions ++= tasks.map(_.partitionId)
    logDebug("New pending partitions: " + stage.pendingPartitions)
    // hand the TaskSet over to the TaskScheduler
    taskScheduler.submitTasks(new TaskSet(
      tasks.toArray, stage.id, stage.latestInfo.attemptId, jobId, properties))
    stage.latestInfo.submissionTime = Some(clock.getTimeMillis())
  } else {
    markStageAsFinished(stage, None)
    logDebug(debugString)
    // submit any child stages that were waiting on this one
    submitWaitingChildStages(stage)
  }
}
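The taskBinary step is worth isolating: the (rdd, shuffleDep) or (rdd, func) pair is serialized once on the driver and shipped via a broadcast variable, so every task deserializes a local copy instead of each Task object carrying its own serialized payload. A runnable sketch of that idea using plain Java serialization (Spark itself uses its internal closureSerializer; the Map payload is made up):

    import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

    def serialize(obj: AnyRef): Array[Byte] = {
      val bos = new ByteArrayOutputStream()
      val oos = new ObjectOutputStream(bos)
      oos.writeObject(obj); oos.close()
      bos.toByteArray
    }

    def deserialize[T](bytes: Array[Byte]): T =
      new ObjectInputStream(new ByteArrayInputStream(bytes)).readObject().asInstanceOf[T]

    // serialize the shared "task body" once on the driver ...
    val taskBinary = sc.broadcast(serialize(Map("factor" -> 3)))
    // ... and let each task deserialize its own copy from the broadcast
    val scaled = sc.parallelize(1 to 4, 2).map { i =>
      val conf = deserialize[Map[String, Int]](taskBinary.value)
      i * conf("factor")
    }.collect()  // Array(3, 6, 9, 12)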
As shown above, when the DAGScheduler assigns tasks it first computes each task's preferred locations via getPreferredLocs, which delegates to getPreferredLocsInternal. Cached locations take priority, then the RDD's own preferredLocations policy, and finally locations propagated through narrow dependencies. A wide dependency always shuffles, so computing preferred locations across it would be pointless.
private def getPreferredLocsInternal(
    rdd: RDD[_],
    partition: Int,
    visited: HashSet[(RDD[_], Int)]): Seq[TaskLocation] = {
  // If the partition has already been visited, no need to re-visit.
  // This avoids exponential path exploration. SPARK-695
  if (!visited.add((rdd, partition))) {
    // Nil has already been returned for previously visited partitions.
    return Nil
  }
  // 1. if the partition is cached, return the executors holding the cache
  val cached = getCacheLocs(rdd)(partition)
  if (cached.nonEmpty) {
    return cached
  }
  // 2. if the RDD defines custom preferredLocations, use those
  val rddPrefs = rdd.preferredLocations(rdd.partitions(partition)).toList
  if (rddPrefs.nonEmpty) {
    return rddPrefs.map(TaskLocation(_))
  }
  // 3. for narrow dependencies, take the first parent partition that has a location
  // preference. Ideally this would be chosen by transfer size; the source notes
  // that as future work.
  rdd.dependencies.foreach {
    case n: NarrowDependency[_] =>
      for (inPart <- n.getParents(partition)) {
        val locs = getPreferredLocsInternal(n.rdd, inPart, visited)
        if (locs != Nil) {
          return locs
        }
      }
    case _ =>
  }
  Nil
}
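The second rule (an RDD's own preferredLocations) is also what custom data sources hook into. A hypothetical custom RDD illustrating it (the host names are made up; in local mode the hints are simply ignored):

    import org.apache.spark.{Partition, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // a custom RDD that pins each partition to a preferred host
    class PinnedRDD(sc: SparkContext, hosts: Seq[String]) extends RDD[Int](sc, Nil) {

      override def getPartitions: Array[Partition] =
        hosts.indices.map(i => new Partition { override def index: Int = i }).toArray

      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        Iterator(split.index)

      // getPreferredLocsInternal picks this up in step 2 above
      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq(hosts(split.index))
    }

    // tasks for partition 0 prefer "host-a", partition 1 prefers "host-b"
    new PinnedRDD(sc, Seq("host-a", "host-b")).collect()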
Each wide dependency produces a corresponding ShuffleMapStage. While creating one, if the shuffle data was already computed earlier, the still-available map outputs are reused to avoid recomputation.
/**
 * Creates a ShuffleMapStage that produces the partitions of the given wide dependency.
 * If a previously run stage already generated the same shuffle data, this function
 * copies the output locations that are still available, to avoid regenerating the data.
 */
def createShuffleMapStage(shuffleDep: ShuffleDependency[_, _, _], jobId: Int): ShuffleMapStage = {
  val rdd = shuffleDep.rdd
  val numTasks = rdd.partitions.length
  val parents = getOrCreateParentStages(rdd, jobId)
  val id = nextStageId.getAndIncrement()
  val stage = new ShuffleMapStage(id, rdd, numTasks, parents, jobId, rdd.creationSite, shuffleDep)
  stageIdToStage(id) = stage
  shuffleIdToMapStage(shuffleDep.shuffleId) = stage
  updateJobIdStageIdMaps(jobId, stage)
  if (mapOutputTracker.containsShuffle(shuffleDep.shuffleId)) {
    // this shuffle may have been computed before; copy any still-available
    // output locations over to the new stage
    val serLocs = mapOutputTracker.getSerializedMapOutputStatuses(shuffleDep.shuffleId)
    val locs = MapOutputTracker.deserializeMapStatuses(serLocs)
    (0 until locs.length).foreach { i =>
      if (locs(i) ne null) {
        // locs(i) will be null if missing
        stage.addOutputLoc(i, locs(i))
      }
    }
  } else {
    logInfo("Registering RDD " + rdd.id + " (" + rdd.getCreationSite + ")")
    mapOutputTracker.registerShuffle(shuffleDep.shuffleId, rdd.partitions.length)
  }
  stage
}
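The user-visible effect of this reuse (a sketch, same hypothetical local sc): when a second job runs over an already-shuffled RDD, the map output registered with the MapOutputTracker is still available, so the shuffle stage shows up as skipped in the Spark UI instead of being recomputed.

    val byKey = sc.parallelize(1 to 1000, 4)
      .map(i => (i % 10, 1))
      .reduceByKey(_ + _)

    byKey.count()    // job 1: runs the ShuffleMapStage, then a ResultStage
    byKey.collect()  // job 2: the map output already exists, so only a
                     // ResultStage runs (the shuffle stage is "skipped" in the UI)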
In summary: the DAGScheduler recognizes a job from the action in the submitted application and puts the job's information on its event queue. Every job gets one ResultStage, from which the DAGScheduler walks back through the ancestor stages, creating a new stage at every wide dependency; this recursion yields the DAG. It then builds a TaskSet matching each stage's type, serializes the task payload, and distributes the tasks according to data locality.
A few points worth noting:
1. Job ids come from an AtomicInteger; a smaller id means the job runs earlier.
2. Stage ids also come from an AtomicInteger; because ancestor stages are created first during the recursive split, a smaller id means the stage sits earlier in the DAG.
3. The DAGScheduler runs an internal EventLoop that processes its events asynchronously; the events include JobSubmitted, MapStageSubmitted, StageCancelled, JobCancelled, and so on.
4. The DAGScheduler posts events asynchronously to SparkListeners through the LiveListenerBus (see the listener sketch after this list).
5. Once all of a stage's dependencies have been handled, the DAGScheduler submits the stage to the TaskScheduler, with each task's location bound according to data locality.
6. The data and dependencies each stage needs are serialized following the stage's dependency relationships and broadcast to all executors.
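For point 4, these bus events are observable from user code. A small sketch (real SparkListener API; the println bodies are just for illustration):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart, SparkListenerStageSubmitted}

    sc.addSparkListener(new SparkListener {
      // fired when handleJobSubmitted posts SparkListenerJobStart
      override def onJobStart(jobStart: SparkListenerJobStart): Unit =
        println(s"job ${jobStart.jobId} started with ${jobStart.stageInfos.size} stages")
      // fired when submitMissingTasks posts SparkListenerStageSubmitted
      override def onStageSubmitted(stageSubmitted: SparkListenerStageSubmitted): Unit =
        println(s"stage ${stageSubmitted.stageInfo.stageId} submitted")
    })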