Flink Runtime: Submitting the Job Graph from the Client (Part 2)

Source: Internet | Editor: 程序博客网 | Time: 2024/05/17 06:15

Analysis of the submitJob method

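JobClientActor submits a job by sending a SubmitJob message to the JobManager actor. On receiving the message, the JobManager builds a JobInfo object that wraps the job's basic information, then passes the JobGraph and the JobInfo to the submitJob method: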

case SubmitJob(jobGraph, listeningBehaviour) =>
  val client = sender()
  val jobInfo = new JobInfo(client, listeningBehaviour, System.currentTimeMillis(),
    jobGraph.getSessionTimeout)
  submitJob(jobGraph, jobInfo)

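We walk through the main logic of submitJob by following its key method calls. It first checks the jobGraph argument; if it is null, it immediately replies with a JobResultFailure message: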

if (jobGraph == null) {
  jobInfo.client ! decorateMessage(JobResultFailure(
    new SerializedThrowable(
      new JobSubmissionException(null, "JobGraph must not be null.")
    )
  ))
}
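The message handling and the null check can be illustrated without Akka. The following is only a minimal sketch: every type in it (JobGraph, JobInfo, the Reply hierarchy) is a simplified, hypothetical stand-in rather than the real Flink class, and the actor reply is modeled as a plain return value.

```scala
// Hypothetical stand-ins for the Flink/Akka types; none of these are the real classes.
object SubmitSketch {
  final case class JobGraph(jobId: String, sessionTimeout: Long)
  final case class JobInfo(client: String, listening: String, submittedAt: Long, sessionTimeout: Long)

  sealed trait Message
  final case class SubmitJob(jobGraph: JobGraph, listening: String) extends Message

  sealed trait Reply
  final case class JobResultFailure(reason: String) extends Reply
  final case class Accepted(info: JobInfo) extends Reply

  // Rejects a null graph, otherwise wraps the submission metadata in a JobInfo.
  // Unlike the excerpt, this sketch checks for null before reading from the graph.
  def handle(msg: Message, sender: String, now: Long): Reply = msg match {
    case SubmitJob(graph, listening) =>
      if (graph == null) JobResultFailure("JobGraph must not be null.")
      else Accepted(JobInfo(sender, listening, now, graph.sessionTimeout))
  }
}
```

Returning the reply instead of sending it keeps the sketch testable; in an actor, the same value would be sent back to the sender.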

Next, the job's library files and classpaths are registered with the library cache manager:

libraryCacheManager.registerJob(jobGraph.getJobID, jobGraph.getUserJarBlobKeys,
  jobGraph.getClasspaths)

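This step must succeed before anything else, because only then can the uploaded libraries and JARs be reliably removed from the library cache manager if a later step throws. From this point on, the entire code segment is wrapped in a try block; if any exception is caught, the related JAR files are removed through the libraryCacheManager's unregisterJob method: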

catch {
  case t: Throwable =>
    libraryCacheManager.unregisterJob(jobId)
    //...
}
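The register-first, clean-up-on-failure pattern described above might look like this in isolation. This is a sketch under stated assumptions: LibraryCache is a hypothetical in-memory stand-in for the library cache manager, and the remaining setup work is passed in as a by-name block.

```scala
import scala.collection.mutable

object CleanupSketch {
  // Hypothetical in-memory stand-in for the library cache manager.
  final class LibraryCache {
    private val jobs = mutable.Set.empty[String]
    def registerJob(jobId: String): Unit = jobs += jobId
    def unregisterJob(jobId: String): Unit = jobs -= jobId
    def isRegistered(jobId: String): Boolean = jobs.contains(jobId)
  }

  // Registers the job, runs the remaining setup, and undoes the registration on any failure.
  def submit(cache: LibraryCache, jobId: String)(setup: => Unit): Either[Throwable, Unit] =
    try {
      cache.registerJob(jobId)
      setup
      Right(())
    } catch {
      case t: Throwable =>
        cache.unregisterJob(jobId)
        Left(t)
    }
}
```

If the setup block throws, the job is no longer registered afterwards; if it completes, the registration stays in place.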

Next come the user-code class loader (userCodeLoader) and the restart strategy (restartStrategy) applied on failure:

val userCodeLoader = libraryCacheManager.getClassLoader(jobGraph.getJobID)
val restartStrategy = Option(jobGraph.getRestartStrategyConfiguration())
  .map(RestartStrategyFactory.createRestartStrategy(_)) match {
    case Some(strategy) => strategy
    case None => defaultRestartStrategy
  }
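The restart strategy lookup relies on Option(...) turning a null configuration into None. A minimal sketch of the same fallback, using hypothetical simplified types (RestartConfig, FixedDelay, NoRestart) in place of the real Flink classes, shows that the map/match pair can also be written with getOrElse:

```scala
object RestartSketch {
  // Hypothetical, simplified types; the real Flink configuration classes are richer.
  final case class RestartConfig(maxAttempts: Int)
  sealed trait RestartStrategy
  final case class FixedDelay(maxAttempts: Int) extends RestartStrategy
  case object NoRestart extends RestartStrategy

  def create(cfg: RestartConfig): RestartStrategy = FixedDelay(cfg.maxAttempts)

  // Option(null) is None, so a missing configuration falls through to the default.
  def resolve(cfg: RestartConfig, default: RestartStrategy): RestartStrategy =
    Option(cfg).map(create).getOrElse(default)
}
```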

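Then the ExecutionGraph instance is obtained. The JobManager first looks it up in the currentJobs cache; if it is there, it is returned directly, otherwise a new one is created and added to the cache: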

executionGraph = currentJobs.get(jobGraph.getJobID) match {
  case Some((graph, currentJobInfo)) =>
    currentJobInfo.setLastActive()
    graph
  case None =>
    val graph = new ExecutionGraph(
      executionContext,
      jobGraph.getJobID,
      jobGraph.getName,
      jobGraph.getJobConfiguration,
      timeout,
      restartStrategy,
      jobGraph.getUserJarBlobKeys,
      jobGraph.getClasspaths,
      userCodeLoader)
    currentJobs.put(jobGraph.getJobID, (graph, jobInfo))
    graph
}
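The get-or-create lookup can be tried out on its own. In this sketch the ExecutionGraph and JobInfo classes are hypothetical placeholders with almost no state, and the timestamp is passed in explicitly so that the behavior is deterministic:

```scala
import scala.collection.mutable

object GraphCacheSketch {
  // Hypothetical stand-ins: the real ExecutionGraph and JobInfo carry far more state.
  final class ExecutionGraph(val jobId: String)
  final class JobInfo { var lastActive: Long = 0L }

  private val currentJobs = mutable.HashMap.empty[String, (ExecutionGraph, JobInfo)]

  // Reuses the cached graph (refreshing the cached JobInfo) or creates and caches a new one.
  def graphFor(jobId: String, info: JobInfo, now: Long): ExecutionGraph =
    currentJobs.get(jobId) match {
      case Some((graph, cachedInfo)) =>
        cachedInfo.lastActive = now
        graph
      case None =>
        val graph = new ExecutionGraph(jobId)
        currentJobs.put(jobId, (graph, info))
        graph
    }
}
```

Note that on a cache hit it is the cached JobInfo, not the newly passed one, that gets refreshed, which matches the excerpt.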

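Once the executionGraph is available, several of its properties are set: the schedule mode, whether queued scheduling is allowed, and the JSON representation of the plan.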

executionGraph.setScheduleMode(jobGraph.getScheduleMode())
executionGraph.setQueuedSchedulingAllowed(jobGraph.getAllowQueuedScheduling())
executionGraph.setJsonPlan(JsonPlanGenerator.generatePlan(jobGraph))

Next, some properties of each JobVertex are initialized:

val numSlots = scheduler.getTotalNumberOfSlots()
for (vertex <- jobGraph.getVertices.asScala) {
  val executableClass = vertex.getInvokableClassName
  if (vertex.getParallelism() == ExecutionConfig.PARALLELISM_AUTO_MAX) {
    vertex.setParallelism(numSlots)
  }
  vertex.initializeOnMaster(userCodeLoader)
}
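The parallelism rule in the loop reduces to a one-line decision. The sketch below assumes a sentinel value for "use every available slot"; the concrete value of AutoMax here is an assumption made for illustration, so check ExecutionConfig in your Flink version for the real constant.

```scala
object ParallelismSketch {
  // Assumed sentinel meaning "use every available slot" (hypothetical value for this sketch).
  val AutoMax: Int = Int.MaxValue

  // A vertex that asks for the sentinel gets the cluster's total slot count;
  // any explicit parallelism is left untouched.
  def resolve(requested: Int, totalSlots: Int): Int =
    if (requested == AutoMax) totalSlots else requested
}
```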

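The JobGraph's vertices are then fetched in topological order starting from the sources, and the sorted collection is attached to the ExecutionGraph. The attach step does a lot of work, which we will analyze later: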

val sortedTopology = jobGraph.getVerticesSortedTopologicallyFromSources()
executionGraph.attachJobGraph(sortedTopology)
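To make "topological order starting from the sources" concrete, here is a small sketch using Kahn's algorithm over a plain adjacency map. It is not the JobGraph's own implementation, which the excerpt does not show; the vertex names and the edges representation are assumptions chosen for illustration.

```scala
import scala.collection.mutable

object TopoSketch {
  // edges maps each vertex to its downstream vertices; every vertex must appear as a key.
  def sortFromSources(edges: Map[String, List[String]]): List[String] = {
    // Count incoming edges per vertex.
    val inDegree = mutable.Map(edges.keys.map(_ -> 0).toSeq: _*)
    for (targets <- edges.values; t <- targets) inDegree(t) += 1

    // Sources are the vertices without inputs; sorting them keeps the result deterministic.
    val ready = mutable.Queue(edges.keys.filter(v => inDegree(v) == 0).toSeq.sorted: _*)
    val sorted = mutable.ListBuffer.empty[String]
    while (ready.nonEmpty) {
      val v = ready.dequeue()
      sorted += v
      for (t <- edges(v)) {
        inDegree(t) -= 1
        if (inDegree(t) == 0) ready.enqueue(t)
      }
    }
    require(sorted.size == edges.size, "graph contains a cycle")
    sorted.toList
  }
}
```

A vertex is emitted only after all of its inputs have been emitted, which is the property the ExecutionGraph needs when it wires each vertex to its predecessors.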

Next, the snapshot and checkpoint settings are written into the ExecutionGraph:

val snapshotSettings = jobGraph.getSnapshotSettings
if (snapshotSettings != null) {
  val jobId = jobGraph.getJobID()
  val idToVertex: JobVertexID => ExecutionJobVertex = id => {
    val vertex = executionGraph.getJobVertex(id)
    vertex
  }
  val triggerVertices: java.util.List[ExecutionJobVertex] =
    snapshotSettings.getVerticesToTrigger().asScala.map(idToVertex).asJava
  val ackVertices: java.util.List[ExecutionJobVertex] =
    snapshotSettings.getVerticesToAcknowledge().asScala.map(idToVertex).asJava
  val confirmVertices: java.util.List[ExecutionJobVertex] =
    snapshotSettings.getVerticesToConfirm().asScala.map(idToVertex).asJava
  val completedCheckpoints = checkpointRecoveryFactory
    .createCompletedCheckpoints(jobId, userCodeLoader)
  val checkpointIdCounter = checkpointRecoveryFactory.createCheckpointIDCounter(jobId)
  executionGraph.enableSnapshotCheckpointing(
    snapshotSettings.getCheckpointInterval,
    snapshotSettings.getCheckpointTimeout,
    snapshotSettings.getMinPauseBetweenCheckpoints,
    snapshotSettings.getMaxConcurrentCheckpoints,
    triggerVertices,
    ackVertices,
    confirmVertices,
    context.system,
    leaderSessionID.orNull,
    checkpointIdCounter,
    completedCheckpoints,
    recoveryMode,
    savepointStore)
}
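The three vertex lists are all built the same way: a list of vertex IDs is resolved against the execution graph. The sketch below shows that step in isolation and adds a check for unknown IDs. The check is an assumption of this sketch, since the idToVertex function in the excerpt returns the lookup result as-is, and Vertex is a hypothetical placeholder type.

```scala
object CheckpointVerticesSketch {
  // Hypothetical placeholder for an execution vertex.
  final case class Vertex(id: String)

  // Resolves each ID against the graph; Left carries the first ID that is not in the graph.
  def resolve(ids: List[String], graph: Map[String, Vertex]): Either[String, List[Vertex]] =
    ids.find(id => !graph.contains(id)) match {
      case Some(missing) => Left(missing)
      case None => Right(ids.map(graph))
    }
}
```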

The JobManager registers itself as a listener for job status changes:

executionGraph.registerJobStatusListener(new AkkaActorGateway(self, leaderSessionID.orNull))

If the client also needs to be notified of the execution result and job status changes, listeners are registered for the client as well:

if (jobInfo.listeningBehaviour == ListeningBehaviour.EXECUTION_RESULT_AND_STATE_CHANGES) {
  val gateway = new AkkaActorGateway(jobInfo.client, leaderSessionID.orNull)
  executionGraph.registerExecutionListener(gateway)
  executionGraph.registerJobStatusListener(gateway)
}
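The two registration steps can be summarized in a small model: the JobManager always listens for status changes, while the client is subscribed only for one listening behaviour. Everything below is a hypothetical simplification, with listeners represented as plain strings and the behaviours modeled as case objects; the two behaviours other than ExecutionResultAndStateChanges are assumed names that do not appear in the excerpt.

```scala
import scala.collection.mutable

object ListenerSketch {
  // Hypothetical mirror of the listening behaviours; only the last one subscribes the client.
  sealed trait ListeningBehaviour
  case object Detached extends ListeningBehaviour
  case object ExecutionResult extends ListeningBehaviour
  case object ExecutionResultAndStateChanges extends ListeningBehaviour

  // Hypothetical graph that only records who is listening.
  final class Graph {
    val statusListeners = mutable.ListBuffer.empty[String]
    val executionListeners = mutable.ListBuffer.empty[String]
  }

  def register(graph: Graph, jobManager: String, client: String,
               behaviour: ListeningBehaviour): Unit = {
    // The JobManager always listens for job status changes.
    graph.statusListeners += jobManager
    // The client is subscribed to both event kinds only when it asked for state changes.
    if (behaviour == ExecutionResultAndStateChanges) {
      graph.executionListeners += client
      graph.statusListeners += client
    }
  }
}
```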

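All of the code above, starting from registering the job's JARs with the library cache manager, is wrapped in the try block. Any exception lands in the catch block for handling: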

catch {
  case t: Throwable =>
    log.error(s"Failed to submit job $jobId ($jobName)", t)
    libraryCacheManager.unregisterJob(jobId)
    currentJobs.remove(jobId)
    if (executionGraph != null) {
      executionGraph.fail(t)
    }
    val rt: Throwable = if (t.isInstanceOf[JobExecutionException]) {
      t
    } else {
      new JobExecutionException(jobId, s"Failed to submit job $jobId ($jobName)", t)
    }
    jobInfo.client ! decorateMessage(JobResultFailure(new SerializedThrowable(rt)))
    return
}

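The handler first unregisters the job's libraries from the library cache by jobId, then removes the job's (ExecutionGraph, JobInfo) tuple from the currentJobs map. If an ExecutionGraph has already been created, it calls the graph's fail method to force it to fail. Finally, it reports the exception to the client in a JobResultFailure message and returns from the method.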
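The exception normalization at the end of the handler can be expressed with a pattern match instead of isInstanceOf. A minimal sketch, in which JobExecutionException is a hypothetical local class rather than Flink's own:

```scala
object FailureSketch {
  // Hypothetical stand-in for Flink's JobExecutionException.
  final class JobExecutionException(val jobId: String, msg: String, cause: Throwable)
    extends Exception(msg, cause)

  // Passes a JobExecutionException through unchanged and wraps anything else,
  // so the client always receives the same exception type.
  def toClientError(jobId: String, t: Throwable): Throwable = t match {
    case e: JobExecutionException => e
    case other => new JobExecutionException(jobId, s"Failed to submit job $jobId", other)
  }
}
```

The original cause stays reachable through getCause, so no diagnostic information is lost by wrapping.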

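From here to the end of the method, the code may block, so it is wrapped in a future block and executed asynchronously. It first checks whether the job is being submitted in recovery mode; if so, state is restored from the latest checkpoint: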

if (isRecovery) {
  executionGraph.restoreLatestCheckpointedState()
}

If this is not a recovery but the snapshot settings contain a savepoint path, state is reset from that savepoint instead:

executionGraph.restoreSavepoint(savepointPath)
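The two restore paths are mutually exclusive, with recovery taking precedence. The decision can be sketched as a pure function; the RestoreAction type and its cases are hypothetical names introduced for this illustration, and the optional savepoint path is modeled as an Option:

```scala
object RestoreSketch {
  sealed trait RestoreAction
  case object FromLatestCheckpoint extends RestoreAction
  final case class FromSavepoint(path: String) extends RestoreAction
  case object StartFresh extends RestoreAction

  // Recovery wins; a savepoint path is only consulted for a non-recovery submission.
  def choose(isRecovery: Boolean, savepointPath: Option[String]): RestoreAction =
    if (isRecovery) FromLatestCheckpoint
    else savepointPath match {
      case Some(path) => FromSavepoint(path)
      case None => StartFresh
    }
}
```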

The current JobGraph is then written to the SubmittedJobGraphStore, which exists mainly for recovery purposes:

submittedJobGraphs.putJobGraph(new SubmittedJobGraph(jobGraph, jobInfo))

At this point, a JobSubmitSuccess message can be sent back to the client:

jobInfo.client ! decorateMessage(JobSubmitSuccess(jobGraph.getJobID))
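The ordering matters here: the client is told about success only after the job graph has been persisted, and both steps run off the actor's thread. A rough sketch of that shape with scala.concurrent.Future follows; the persist and ack parameters are hypothetical hooks standing in for the job graph store and the reply to the client.

```scala
import scala.concurrent.{ExecutionContext, Future}

object AsyncTailSketch {
  // Runs the potentially blocking steps off the caller's thread and acknowledges
  // only after the graph has been persisted. Both function parameters are hypothetical hooks.
  def persistThenAck(jobId: String, persist: String => Unit, ack: String => Unit)
                    (implicit ec: ExecutionContext): Future[Unit] =
    Future {
      persist(jobId) // blocking write to the job graph store
      ack(jobId)     // acknowledge only once the write has returned
    }
}
```

If persist throws, the returned future fails and ack is never called, so the client is not told about a submission that could not be recovered later.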

Next, job scheduling is triggered on the ExecutionGraph, which is the precondition for tasks to run:

if (leaderElectionService.hasLeadership) {
  executionGraph.scheduleForExecution(scheduler)
} else {
  self ! decorateMessage(RemoveJob(jobId, removeJobFromStateBackend = false))
}

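To prevent multiple JobManagers from scheduling the same job at the same time, the code first checks whether the current node is the leader. Only the leader schedules the job; otherwise the JobManager sends itself a RemoveJob message, which leads into a different handling path.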
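The leadership guard boils down to choosing between two outcomes. In the sketch below, Outcome and Scheduled are hypothetical names, and RemoveJob is a local case class that only mirrors the shape of the message in the excerpt:

```scala
object LeaderSketch {
  sealed trait Outcome
  case object Scheduled extends Outcome
  final case class RemoveJob(jobId: String, removeJobFromStateBackend: Boolean) extends Outcome

  // Only the leader schedules; a non-leader drops its local copy of the job
  // but leaves the job in the state backend.
  def afterSubmit(jobId: String, hasLeadership: Boolean): Outcome =
    if (hasLeadership) Scheduled
    else RemoveJob(jobId, removeJobFromStateBackend = false)
}
```

Passing removeJobFromStateBackend = false fits the goal stated above: a non-leader should stop tracking the job without deleting state that the actual leader may still need.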

That completes the walkthrough of the submitJob method. Because it is the JobManager's main handler for a job submitted by the client, it contains a fair amount of logic.
