Flink Runtime: Client Submission of the Job Graph (Part 2)
Source: Internet. Editor: 程序博客网. Time: 2024/05/17 06:15
Analysis of the submitJob method
JobClientActor submits a job by sending a SubmitJob message to the JobManager actor. On receiving the message, JobManager builds a JobInfo object that wraps the job's basic information, then passes the JobGraph and the JobInfo to the submitJob method:
case SubmitJob(jobGraph, listeningBehaviour) =>
  val client = sender()
  val jobInfo = new JobInfo(client, listeningBehaviour,
    System.currentTimeMillis(), jobGraph.getSessionTimeout)
  submitJob(jobGraph, jobInfo)
We walk through the main logic of submitJob by following its key method calls. It first checks the jobGraph argument; if it is null, it replies with a JobResultFailure message right away:
if (jobGraph == null) {
  jobInfo.client ! decorateMessage(JobResultFailure(
    new SerializedThrowable(
      new JobSubmissionException(null, "JobGraph must not be null.")
    )
  ))
}
Next, it registers the job's library files and classpaths with the library cache manager:
libraryCacheManager.registerJob(
  jobGraph.getJobID, jobGraph.getUserJarBlobKeys, jobGraph.getClasspaths)
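This step must succeed before anything else: only then can the uploaded libraries and JARs be reliably removed from the library cache manager if a later step throws. From this point on, the whole code segment is wrapped in a try block. If any exception is caught, the related JAR files are removed through the libraryCacheManager's unregisterJob method:
catch {
  case t: Throwable =>
    libraryCacheManager.unregisterJob(jobId)
    //...
}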
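The register-then-clean-up pattern described above can be illustrated in isolation. The following is only a minimal sketch: `LibraryCache` and `submit` are hypothetical stand-ins that track job IDs in a set, not Flink's actual library cache manager.

```scala
import scala.collection.mutable

object LibraryCleanupSketch {
  // Hypothetical stand-in for the library cache manager: it only tracks job IDs.
  final class LibraryCache {
    private val registered = mutable.Set.empty[String]
    def registerJob(jobId: String): Unit = registered += jobId
    def unregisterJob(jobId: String): Unit = registered -= jobId
    def isRegistered(jobId: String): Boolean = registered.contains(jobId)
  }

  // Registration is the first statement of the try block, so any failure
  // in the remaining setup undoes it. Returns true when setup succeeded.
  def submit(cache: LibraryCache, jobId: String)(setup: () => Unit): Boolean =
    try {
      cache.registerJob(jobId)
      setup()
      true
    } catch {
      case _: Throwable =>
        cache.unregisterJob(jobId)
        false
    }
}
```

Unregistering a job that was never registered is harmless in this sketch, which is why the cleanup can run unconditionally in the catch branch.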
Next, it obtains the user-code class loader (userCodeLoader) and the restart strategy (restartStrategy) to apply on failure:
val userCodeLoader = libraryCacheManager.getClassLoader(jobGraph.getJobID)
val restartStrategy = Option(jobGraph.getRestartStrategyConfiguration())
  .map(RestartStrategyFactory.createRestartStrategy(_)) match {
    case Some(strategy) => strategy
    case None => defaultRestartStrategy
  }
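The fallback to a default strategy relies on `Option(x)` turning a null reference into `None`. A minimal sketch of that idiom, using hypothetical simplified types (`RestartConfig`, `FixedDelay`, `NoRestart`) rather than Flink's real classes:

```scala
object RestartStrategySketch {
  // Hypothetical, simplified types; the real Flink classes carry more state.
  final case class RestartConfig(attempts: Int)
  sealed trait RestartStrategy
  final case class FixedDelay(attempts: Int) extends RestartStrategy
  case object NoRestart extends RestartStrategy

  // Assumed cluster-wide default, used when the job configures nothing.
  val defaultRestartStrategy: RestartStrategy = NoRestart

  // Option(null) is None, so a missing job-level configuration
  // falls through to the default strategy.
  def resolve(jobConfig: RestartConfig): RestartStrategy =
    Option(jobConfig)
      .map(c => FixedDelay(c.attempts))
      .getOrElse(defaultRestartStrategy)
}
```

Here `getOrElse` replaces the explicit `match` on `Some`/`None`; the two forms behave the same.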
Next, it obtains the ExecutionGraph instance. It first looks in the currentJobs cache; on a hit it returns the cached graph, otherwise it creates a new graph and adds it to the cache:
executionGraph = currentJobs.get(jobGraph.getJobID) match {
  case Some((graph, currentJobInfo)) =>
    currentJobInfo.setLastActive()
    graph
  case None =>
    val graph = new ExecutionGraph(
      executionContext,
      jobGraph.getJobID,
      jobGraph.getName,
      jobGraph.getJobConfiguration,
      timeout,
      restartStrategy,
      jobGraph.getUserJarBlobKeys,
      jobGraph.getClasspaths,
      userCodeLoader)
    currentJobs.put(jobGraph.getJobID, (graph, jobInfo))
    graph
}
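The lookup-or-create step can be sketched with a plain mutable map. `Graph` and `Info` below are hypothetical stand-ins for ExecutionGraph and JobInfo, and the explicit `now` parameter is an assumption made so the behaviour is easy to check.

```scala
import scala.collection.mutable

object GraphCacheSketch {
  // Hypothetical stand-ins for ExecutionGraph and JobInfo.
  final class Graph(val jobId: String)
  final class Info { var lastActive: Long = 0L }

  val currentJobs = mutable.Map.empty[String, (Graph, Info)]

  def graphFor(jobId: String, info: Info, now: Long): Graph =
    currentJobs.get(jobId) match {
      case Some((graph, existingInfo)) =>
        existingInfo.lastActive = now // refresh the cached entry
        graph
      case None =>
        val graph = new Graph(jobId)
        currentJobs.put(jobId, (graph, info)) // cache graph and info together
        graph
    }
}
```

A second submission with the same job ID returns the cached graph and only refreshes the activity timestamp of the info stored with it.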
Once it has the executionGraph, it sets several properties on it: the schedule mode, whether queued scheduling is allowed, and the JSON representation of the plan.
executionGraph.setScheduleMode(jobGraph.getScheduleMode())
executionGraph.setQueuedSchedulingAllowed(jobGraph.getAllowQueuedScheduling())
executionGraph.setJsonPlan(JsonPlanGenerator.generatePlan(jobGraph))
Next, it initializes some properties of each JobVertex. A vertex whose parallelism is set to PARALLELISM_AUTO_MAX gets the total number of slots as its parallelism:
val numSlots = scheduler.getTotalNumberOfSlots()
for (vertex <- jobGraph.getVertices.asScala) {
  val executableClass = vertex.getInvokableClassName
  if (vertex.getParallelism() == ExecutionConfig.PARALLELISM_AUTO_MAX) {
    vertex.setParallelism(numSlots)
  }
  vertex.initializeOnMaster(userCodeLoader)
}
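The parallelism substitution can be shown on its own. In this sketch `AutoMax` is a hypothetical sentinel standing in for ExecutionConfig.PARALLELISM_AUTO_MAX, and `Vertex` is an immutable stand-in for JobVertex, so the function returns updated copies instead of mutating in place.

```scala
object ParallelismSketch {
  // Hypothetical sentinel meaning "use every available slot".
  val AutoMax: Int = Int.MaxValue

  final case class Vertex(name: String, parallelism: Int)

  // Replace the sentinel with the slot count; leave explicit values alone.
  def resolve(vertices: Seq[Vertex], totalSlots: Int): Seq[Vertex] =
    vertices.map { v =>
      if (v.parallelism == AutoMax) v.copy(parallelism = totalSlots) else v
    }
}
```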
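It then obtains the JobGraph's vertices sorted topologically, starting from the sources, and attaches that collection to the ExecutionGraph. The attach step does a lot of work, which we analyze later:
val sortedTopology = jobGraph.getVerticesSortedTopologicallyFromSources()
executionGraph.attachJobGraph(sortedTopology)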
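To make "sorted topologically from the sources" concrete, here is a small sketch using Kahn's algorithm. It is not Flink's implementation of getVerticesSortedTopologicallyFromSources; vertices are plain strings, and `edges` is an assumed adjacency map from each vertex to its direct successors.

```scala
import scala.collection.mutable

object TopoSortSketch {
  // edges maps each vertex to its direct successors.
  def sortFromSources(vertices: Seq[String],
                      edges: Map[String, Seq[String]]): Seq[String] = {
    // Count incoming edges for every vertex.
    val inDegree = mutable.Map(vertices.map(_ -> 0): _*)
    for (targets <- edges.values; t <- targets) inDegree(t) += 1

    // Sources are the vertices without incoming edges.
    val ready = mutable.Queue(vertices.filter(inDegree(_) == 0): _*)
    val sorted = mutable.ArrayBuffer.empty[String]
    while (ready.nonEmpty) {
      val v = ready.dequeue()
      sorted += v
      // A successor becomes ready once all of its inputs are emitted.
      for (t <- edges.getOrElse(v, Seq.empty)) {
        inDegree(t) -= 1
        if (inDegree(t) == 0) ready.enqueue(t)
      }
    }
    require(sorted.size == vertices.size, "the graph contains a cycle")
    sorted.toSeq
  }
}
```

With this ordering, every vertex appears after all of its inputs, which is what lets a later step process vertices one by one and always find the upstream ones already handled.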
Next, it writes the snapshot and checkpoint settings into the ExecutionGraph:
val snapshotSettings = jobGraph.getSnapshotSettings
if (snapshotSettings != null) {
  val jobId = jobGraph.getJobID()
  val idToVertex: JobVertexID => ExecutionJobVertex = id => {
    val vertex = executionGraph.getJobVertex(id)
    vertex
  }
  val triggerVertices: java.util.List[ExecutionJobVertex] =
    snapshotSettings.getVerticesToTrigger().asScala.map(idToVertex).asJava
  val ackVertices: java.util.List[ExecutionJobVertex] =
    snapshotSettings.getVerticesToAcknowledge().asScala.map(idToVertex).asJava
  val confirmVertices: java.util.List[ExecutionJobVertex] =
    snapshotSettings.getVerticesToConfirm().asScala.map(idToVertex).asJava
  val completedCheckpoints = checkpointRecoveryFactory
    .createCompletedCheckpoints(jobId, userCodeLoader)
  val checkpointIdCounter = checkpointRecoveryFactory.createCheckpointIDCounter(jobId)
  executionGraph.enableSnapshotCheckpointing(
    snapshotSettings.getCheckpointInterval,
    snapshotSettings.getCheckpointTimeout,
    snapshotSettings.getMinPauseBetweenCheckpoints,
    snapshotSettings.getMaxConcurrentCheckpoints,
    triggerVertices,
    ackVertices,
    confirmVertices,
    context.system,
    leaderSessionID.orNull,
    checkpointIdCounter,
    completedCheckpoints,
    recoveryMode,
    savepointStore)
}
JobManager registers itself as a listener for job status changes:
executionGraph.registerJobStatusListener(
  new AkkaActorGateway(self, leaderSessionID.orNull))
If the client also needs to observe the execution result and job status changes, listeners are registered for the client as well:
if (jobInfo.listeningBehaviour == ListeningBehaviour.EXECUTION_RESULT_AND_STATE_CHANGES) {
  val gateway = new AkkaActorGateway(jobInfo.client, leaderSessionID.orNull)
  executionGraph.registerExecutionListener(gateway)
  executionGraph.registerJobStatusListener(gateway)
}
All of the code above, starting from the registration of the job's JARs with the library cache manager, is wrapped in a try block. If an exception occurs, control enters the catch block for error handling:
catch {
  case t: Throwable =>
    log.error(s"Failed to submit job $jobId ($jobName)", t)
    libraryCacheManager.unregisterJob(jobId)
    currentJobs.remove(jobId)
    if (executionGraph != null) {
      executionGraph.fail(t)
    }
    val rt: Throwable = if (t.isInstanceOf[JobExecutionException]) {
      t
    } else {
      new JobExecutionException(jobId, s"Failed to submit job $jobId ($jobName)", t)
    }
    jobInfo.client ! decorateMessage(JobResultFailure(new SerializedThrowable(rt)))
    return
}
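After logging the error, exception handling removes the libraries associated with the job from the library cache by job ID, then removes the job's (ExecutionGraph, JobInfo) tuple from the currentJobs map. If an ExecutionGraph was already created, its fail method is called to force it into the failed state. Finally, the exception is wrapped in a JobExecutionException if it is not one already, reported to the client in a JobResultFailure message, and the method returns.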
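The wrapping of the failure before it is sent to the client can be sketched with a pattern match in place of isInstanceOf. The `JobExecutionException` class below is a hypothetical stand-in defined only for this example, not Flink's class.

```scala
object FailureWrappingSketch {
  // Hypothetical stand-in for Flink's JobExecutionException.
  final class JobExecutionException(val jobId: String,
                                    message: String,
                                    cause: Throwable)
    extends Exception(message, cause)

  // Pass job-level failures through unchanged; wrap everything else
  // so the client always receives the same exception type.
  def toClientFailure(jobId: String, t: Throwable): Throwable = t match {
    case e: JobExecutionException => e
    case other =>
      new JobExecutionException(jobId, s"Failed to submit job $jobId", other)
  }
}
```

The original cause stays reachable through `getCause`, so no diagnostic information is lost by the wrapping.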
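The code from this point to the end of the method may block, so it is wrapped in a future block and executed asynchronously. It first checks whether the submission is in recovery mode; if so, state is restored from the latest checkpoint:
if (isRecovery) {
  executionGraph.restoreLatestCheckpointedState()
}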
If it is not in recovery mode but the snapshot settings contain a savepoint path, state is reset from that savepoint instead:
executionGraph.restoreSavepoint(savepointPath)
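The choice between the two restore paths can be summarized as a small decision function. This is a sketch under assumptions: the `Restore` cases are hypothetical names, and the savepoint path is modelled as an `Option[String]` rather than a nullable field.

```scala
object RestoreSketch {
  sealed trait Restore
  case object FromLatestCheckpoint extends Restore
  final case class FromSavepoint(path: String) extends Restore
  case object Fresh extends Restore

  // Recovery mode takes precedence; a savepoint path is only
  // considered for a submission that is not a recovery.
  def choose(isRecovery: Boolean, savepointPath: Option[String]): Restore =
    if (isRecovery) FromLatestCheckpoint
    else savepointPath.map(FromSavepoint(_)).getOrElse(Fresh)
}
```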
It then writes the current JobGraph into the SubmittedJobGraphStore, which exists mainly for recovery purposes:
submittedJobGraphs.putJobGraph(new SubmittedJobGraph(jobGraph, jobInfo))
At this point, the JobSubmitSuccess message can be sent back to the client:
jobInfo.client ! decorateMessage(JobSubmitSuccess(jobGraph.getJobID))
Next, job scheduling is triggered on the ExecutionGraph, which is the precondition for any Task to run:
if (leaderElectionService.hasLeadership) {
  executionGraph.scheduleForExecution(scheduler)
} else {
  self ! decorateMessage(RemoveJob(jobId, removeJobFromStateBackend = false))
}
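To prevent several JobManagers from scheduling the same job at the same time, the code first checks whether the current node is the leader. Only the leader schedules the job; otherwise the JobManager sends itself a RemoveJob message, which is handled by separate logic.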
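The leadership gate reduces to a two-way decision, sketched below. `Action`, `Schedule`, and `afterSubmit` are hypothetical names for illustration; only the RemoveJob message shape, including removeJobFromStateBackend = false, follows the code shown in the article.

```scala
object LeaderGateSketch {
  sealed trait Action
  case object Schedule extends Action
  final case class RemoveJob(jobId: String,
                             removeJobFromStateBackend: Boolean) extends Action

  // Only the leader schedules. A non-leader drops the job locally but
  // leaves it in the state backend (removeJobFromStateBackend = false).
  def afterSubmit(jobId: String, hasLeadership: Boolean): Action =
    if (hasLeadership) Schedule
    else RemoveJob(jobId, removeJobFromStateBackend = false)
}
```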
That completes the walkthrough of submitJob. Because it is the main method JobManager uses to process a job submitted by the client, it contains a fair amount of logic.