When Exactly Is the Executor Started?


6.4.1 When Exactly Is the Executor Started?

After the SparkContext starts, StandaloneSchedulerBackend creates a StandaloneAppClient. StandaloneAppClient contains an inner class named ClientEndpoint; when the ClientEndpoint is created, it is passed a Command that specifies org.apache.spark.executor.CoarseGrainedExecutorBackend as the entry class of the Executors to be launched for the current application. ClientEndpoint extends ThreadSafeRpcEndpoint and communicates with the Master through the RPC mechanism. In ClientEndpoint's onStart method, registerWithMaster sends a RegisterApplication request to the Master. On receiving this message, the Master first records the application via its registerApplication method, and then calls the schedule method to launch Executors on the Workers. The Master's handling of the RegisterApplication request is shown below.
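
For reference, the entry class is wired in on the Driver side when StandaloneSchedulerBackend starts. The snippet below is lightly abridged from StandaloneSchedulerBackend.start in Spark 2.x (argument lists vary slightly across versions):

val args = Seq(
  "--driver-url", driverUrl,
  "--executor-id", "{{EXECUTOR_ID}}",
  "--hostname", "{{HOSTNAME}}",
  "--cores", "{{CORES}}",
  "--app-id", "{{APP_ID}}",
  "--worker-url", "{{WORKER_URL}}")
……
// The entry class every Worker will launch for this application
val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
  args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
val appDesc = ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
  ……)
client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
client.start()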

Master.scala source code:

case RegisterApplication(description, driver) =>
  // TODO Prevent repeated registrations from some driver
  // If the Master is in STANDBY state, do nothing
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else {
    logInfo("Registering app " + description.name)
    // Build an ApplicationInfo from the description
    val app = createApplication(description, driver)
    registerApplication(app)
    logInfo("Registered app " + description.name + " with ID " + app.id)
    // Add the application to the persistence engine
    persistenceEngine.addApplication(app)
    driver.send(RegisteredApplication(app.id, self))
    // Call schedule() to launch Executors on the Worker nodes
    schedule()
  }

In the code above, the Master matches the RegisterApplication request and first checks whether it is in STANDBY state. If not, the Master is ALIVE; in that state it calls createApplication(description, driver) to create an ApplicationInfo, and then calls persistenceEngine.addApplication(app) to persist the newly created ApplicationInfo for failure recovery. With these two steps done, driver.send(RegisteredApplication(app.id, self)) returns the registration result to StandaloneAppClient, carrying the new application's ID and a reference to the Master itself.
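
On the client side, StandaloneAppClient's ClientEndpoint records the assigned application ID and the Master reference when this reply arrives. The following is abridged from ClientEndpoint's receive method in Spark 2.x:

override def receive: PartialFunction[Any, Unit] = {
  case RegisteredApplication(appId_, masterRef) =>
    // Remember the application id and the Master that acknowledged us, then
    // notify the listener (StandaloneSchedulerBackend) that we are connected
    appId.set(appId_)
    registered.set(true)
    master = Some(masterRef)
    listener.connected(appId.get)
  ……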

The ApplicationInfo object describes the application. Let's first look at the source code of the createApplication method, shown below.

Master.scala source code:

private def createApplication(desc: ApplicationDescription, driver: RpcEndpointRef):
    ApplicationInfo = {
  // Creation time of the ApplicationInfo
  val now = System.currentTimeMillis()
  val date = new Date(now)
  // Generate the application id from the date
  val appId = newApplicationId(date)
  // Create the ApplicationInfo
  new ApplicationInfo(now, appId, desc, date, driver, defaultCores)
}

In the code above, createApplication takes two parameters, an ApplicationDescription and an RpcEndpointRef, and calls the newApplicationId method to generate the appId. The key line is shown below.

val appId = "app-%s-%04d".format(createDateFormat.format(submitDate), nextAppNumber)

This format yields appIds of the form app-20160429101010-0001. The desc object carries the application's basic configuration, including settings passed in from the system such as appName, maxCores, and memoryPerExecutorMB. Finally, an ApplicationInfo object is constructed from desc, date, driver, defaultCores, and so on, and returned. After the function returns, registerApplication is called to register the application. How does that method complete the registration? Its code is shown below.
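
For completeness, the surrounding method looks roughly like this in Spark 2.x's Master.scala (the date format is what produces the app-20160429101010-0001 shape):

// Date format used in the application id; yields e.g. "20160429101010"
private val createDateFormat = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US)

private def newApplicationId(submitDate: Date): String = {
  // "app-" + timestamp + zero-padded sequence number, e.g. app-20160429101010-0001
  val appId = "app-%s-%04d".format(createDateFormat.format(submitDate), nextAppNumber)
  nextAppNumber += 1
  appId
}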

Master.scala source code:

private def registerApplication(app: ApplicationInfo): Unit = {
  // The Driver's address, used for communication between Master and Driver
  val appAddress = app.driver.address
  // If addressToApp already contains this Driver address, the Driver has
  // already registered, so return immediately
  if (addressToApp.contains(appAddress)) {
    logInfo("Attempted to re-register application at same address: " + appAddress)
    return
  }

  // Register with the metrics system
  applicationMetricsSystem.registerSource(app.appSource)
  // apps is a HashSet, so no duplicates are stored; add the app to it
  apps += app
  // idToApp is a HashMap recording the id -> app mapping
  idToApp(app.id) = app
  // endpointToApp is a HashMap recording the driver -> app mapping
  endpointToApp(app.driver) = app
  // addressToApp is a HashMap recording the app Driver's address -> app mapping
  addressToApp(appAddress) = app
  // waitingApps records the apps waiting to be scheduled
  waitingApps += app
  if (reverseProxy) {
    webUi.addProxyTargets(app.id, app.desc.appUiUrl)
  }
}

 

In the code above, the Driver's address is first obtained from app.driver.address, and the addressToApp map is checked for that address. If it is present, the application has already registered and the method returns immediately; if not, the application is added to the waitingApps buffer, and the corresponding entries are added to the idToApp, endpointToApp, and addressToApp maps. Applications placed in waitingApps wait to be scheduled by the schedule method.
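
The bookkeeping structures referenced here are fields of Master. Their declarations, abridged from Spark 2.x's Master.scala, are:

val apps = new HashSet[ApplicationInfo]
private val idToApp = new HashMap[String, ApplicationInfo]
private val endpointToApp = new HashMap[RpcEndpointRef, ApplicationInfo]
private val addressToApp = new HashMap[RpcAddress, ApplicationInfo]
private val waitingApps = new ArrayBuffer[ApplicationInfo]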

The schedule method does two things. First, it schedules Drivers: Drivers in the waitingDrivers buffer are dispatched to Workers that satisfy their resource requirements. Second, it launches Executors for applications on qualifying Worker nodes. The schedule method's source code is shown below.

Master.scala's schedule method source code:

private def schedule(): Unit = {
  ……
  launchDriver(worker, driver)
  ……
  startExecutorsOnWorkers()
}

schedule is a pivotal method in the Master: it is called every time a new Driver registers, a new application registers, or the available resources change. It allocates available resources to the applications currently waiting to be scheduled, launching Executors on qualifying Worker nodes. Its other role is to place a newly submitted Driver on a Worker whose free resources satisfy the Driver's requirements; launchDriver(worker, driver) performs that task.
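
A slightly fuller (still abridged) view of schedule() as it appears in Spark 2.x shows both roles: the Master bails out unless it is ALIVE, places waiting Drivers on a shuffled list of alive Workers first, and only then launches Executors:

private def schedule(): Unit = {
  if (state != RecoveryState.ALIVE) {
    return
  }
  // Drivers take strict precedence over executors
  val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
  val numWorkersAlive = shuffledAliveWorkers.size
  var curPos = 0
  for (driver <- waitingDrivers.toList) {
    // Walk the alive workers from a random starting position until one with
    // enough memory and cores is found, then launch the driver there
    var launched = false
    var numWorkersVisited = 0
    while (numWorkersVisited < numWorkersAlive && !launched) {
      val worker = shuffledAliveWorkers(curPos)
      numWorkersVisited += 1
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        launchDriver(worker, driver)
        waitingDrivers -= driver
        launched = true
      }
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  startExecutorsOnWorkers()
}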

Once an application has been scheduled successfully, the Master launches Executors for it on the Worker nodes by calling startExecutorsOnWorkers, whose source code is shown below.

Master.scala source code:

private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  for (app <- waitingApps if app.coresLeft > 0) {
    val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
    // Filter out workers that don't have enough resources to launch an executor
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      allocateWorkerResourceToExecutors(
        app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
    }
  }
}

scheduleExecutorsOnWorkers supports two strategies for placing Executors. The first is round-robin: cores are handed out across the usable Workers in turn until the resource demand is met. Spreading Executors out this way usually gives better data locality, so it is the default strategy. The second is to fill Workers one by one: all of each Worker's free resources in usableWorkers are taken in sequence until the demand is met.
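
The choice between the two strategies is controlled by the spark.deploy.spreadOut configuration (true by default), which the Master reads into its spreadOutApps flag. The following is a toy sketch, not Spark source, contrasting the two assignments for three Workers with 8 free cores each and an application needing 12 cores:

object SpreadOutDemo {
  // spreadOut = true: hand out one core at a time, round-robin across workers
  def spreadOut(needed: Int, free: Array[Int]): Array[Int] = {
    val assigned = Array.fill(free.length)(0)
    var left = math.min(needed, free.sum)  // never assign more than is available
    var pos = 0
    while (left > 0) {
      if (assigned(pos) < free(pos)) { assigned(pos) += 1; left -= 1 }
      pos = (pos + 1) % free.length
    }
    assigned
  }

  // spreadOut = false: drain each worker completely before moving to the next
  def consolidate(needed: Int, free: Array[Int]): Array[Int] = {
    val assigned = Array.fill(free.length)(0)
    var left = math.min(needed, free.sum)
    for (pos <- free.indices if left > 0) {
      val take = math.min(left, free(pos))
      assigned(pos) = take
      left -= take
    }
    assigned
  }

  def main(args: Array[String]): Unit = {
    val free = Array(8, 8, 8)
    println(spreadOut(12, free).mkString(","))    // 4,4,4 -- spread across workers
    println(consolidate(12, free).mkString(","))  // 8,4,0 -- workers filled in order
  }
}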

After scheduleExecutorsOnWorkers has worked out this logical assignment for the application, the resources have not yet actually been allocated on the Worker nodes. That requires calling allocateWorkerResourceToExecutors, which performs the actual allocation on a Worker. Its source code is shown below.

Master.scala source code:

private def allocateWorkerResourceToExecutors(
  ……
  launchExecutor(worker, exec)
  ……
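
Filling in the elided parts, the method looks roughly like this in Spark 2.x: the cores assigned on a Worker are divided into one or more Executors depending on whether coresPerExecutor was specified:

private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // If the number of cores per executor is specified, divide the cores assigned
  // to this worker evenly among executors, with no remainder.
  // Otherwise, launch a single executor that grabs all the assignedCores.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    val exec = app.addExecutor(worker, coresToAssign)
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
}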

The code above calls launchExecutor(worker, exec), which takes two parameters: the WorkerInfo of a qualifying Worker, and an ExecutorDesc object describing the Executor. The method sends a LaunchExecutor request to the Worker node, which, on receiving it, is responsible for starting the Executor. The launchExecutor method is listed below.

Master.scala source code:

private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  // Add exec, the ExecutorDesc describing the Executor, to the WorkerInfo
  worker.addExecutor(exec)
  // Send a LaunchExecutor message to the worker. The message carries the masterUrl,
  // the application id, the Executor id, the Executor description desc, the number
  // of cores for the Executor, and the amount of memory allocated to it
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  // Send an ExecutorAdded message back to the Driver. The message carries the
  // worker's id, the worker's host and port, and the allocated cores and memory
  exec.application.driver.send(
    ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}

launchExecutor takes two parameters. The first, worker: WorkerInfo, holds the Worker's basic information; the second, exec: ExecutorDesc, holds the Executor's basic configuration, such as memory and cores. In this method, worker.endpoint.send(LaunchExecutor(...)) sends the LaunchExecutor request to the Worker, which, on receiving it, calls a method to start the Executor.

At the same time as the LaunchExecutor message is sent to the Worker, exec.application.driver.send(ExecutorAdded(...)) sends an ExecutorAdded message to the Driver. This message tells the Driver on which Workers the Master has launched Executors, what each Executor's id is, how many cores and how much memory were allocated to each, and the Worker's contact hostPort.
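
Both messages are simple case classes. Their definitions, abridged from org.apache.spark.deploy.DeployMessages in Spark 2.x, show exactly what travels over the wire:

// Master -> Worker: ask the Worker to start an Executor for the given application
case class LaunchExecutor(
    masterUrl: String,
    appId: String,
    execId: Int,
    appDesc: ApplicationDescription,
    cores: Int,
    memory: Int)
  extends DeployMessage

// Master -> AppClient (Driver side): report the newly added Executor
case class ExecutorAdded(id: Int, workerId: String, hostPort: String, cores: Int, memory: Int)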

The Worker handles the LaunchExecutor message when it arrives. The LaunchExecutor handling logic on the Worker node is shown below.

Worker.scala source code:

case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
  // If masterUrl differs from activeMasterUrl, an invalid Master is trying to
  // launch the Executor, so log a warning
  if (masterUrl != activeMasterUrl) {
    logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
  } else {
    try {
      logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))

      // Create the Executor's working directory, named after execId, under workDir/appId/
      val executorDir = new File(workDir, appId + "/" + execId)
      // Call mkdirs to create the directory
      if (!executorDir.mkdirs()) {
        throw new IOException("Failed to create directory " + executorDir)
      }

      // Create the Executor's local directories, passed on via the environment
      // variable SPARK_EXECUTOR_DIRS; the Worker deletes them when the
      // application finishes
      val appLocalDirs = appDirectories.getOrElse(appId,
        Utils.getOrCreateLocalRootDirs(conf).map { dir =>
          val appDir = Utils.createDirectory(dir, namePrefix = "executor")
          Utils.chmod700(appDir)
          appDir.getAbsolutePath()
        }.toSeq)
      // Record the appId -> appLocalDirs mapping in the appDirectories hash map
      appDirectories(appId) = appLocalDirs
      // Create the ExecutorRunner
      val manager = new ExecutorRunner(
        appId,
        execId,
        appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
        cores_,
        memory_,
        self,
        workerId,
        host,
        webUi.boundPort,
        publicAddress,
        sparkHome,
        executorDir,
        workerUri,
        conf,
        appLocalDirs, ExecutorState.RUNNING)
      // Record the (appId + "/" + execId) -> ExecutorRunner mapping in the executors hash map
      executors(appId + "/" + execId) = manager
      // Start the ExecutorRunner
      manager.start()
      // Increase the Worker's used cores by cores_, the number of cores given to this Executor
      coresUsed += cores_
      memoryUsed += memory_
      // Send an ExecutorStateChanged message to the Master, carrying the appId,
      // the execId, and the ExecutorRunner's state
      sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
    } catch {
      case e: Exception =>
        logError(s"Failed to launch executor $appId/$execId for ${appDesc.name}.", e)
        if (executors.contains(appId + "/" + execId)) {
          executors(appId + "/" + execId).kill()
          executors -= appId + "/" + execId
        }
        sendToMaster(ExecutorStateChanged(appId, execId, ExecutorState.FAILED,
          Some(e.toString), None))
    }
  }

  

In the code above, the Worker first checks whether the incoming masterUrl matches activeMasterUrl. If it does not, the request did not come from the ALIVE Master, and only a warning is printed. If it does, the request came from the ALIVE Master, so the Worker creates a working directory for the Executor and then uses appId, execId, appDesc, and the other parameters to create an ExecutorRunner. As the name suggests, the ExecutorRunner is where the Executor runs. It contains a worker thread that downloads the required dependencies and launches the CoarseGrainedExecutorBackend process, which runs in a JVM of its own. The source code that starts the ExecutorRunner's thread is shown below.

ExecutorRunner.scala source code:

private[worker] def start() {
  // Create the thread
  workerThread = new Thread("ExecutorRunner for " + fullId) {
    // The thread's run method calls fetchAndRunExecutor
    override def run() { fetchAndRunExecutor() }
  }
  // Start the thread
  workerThread.start()

  // Shutdown hook, used to kill the process on termination
  shutdownHook = ShutdownHookManager.addShutdownHook { () =>
    // It's possible that we arrive here before calling `fetchAndRunExecutor`, then `state` will
    // be `ExecutorState.RUNNING`. In this case, we should set `state` to `FAILED`.
    if (state == ExecutorState.RUNNING) {
      state = ExecutorState.FAILED
    }
    killProcess(Some("Worker shutting down")) }
}

The code above defines a Thread whose run method calls fetchAndRunExecutor. fetchAndRunExecutor is responsible for launching, as a separate process, the org.apache.spark.executor.CoarseGrainedExecutorBackend class carried in the ApplicationDescription. The fetchAndRunExecutor method's source code is shown below.

ExecutorRunner.scala source code:

private def fetchAndRunExecutor() {
  try {
    // Launch the process
    val builder = CommandUtils.buildProcessBuilder(appDesc.command, new SecurityManager(conf),
      memory, sparkHome.getAbsolutePath, substituteVariables)
    val command = builder.command()
    val formattedCommand = command.asScala.mkString("\"", "\" \"", "\"")
    logInfo(s"Launch command: $formattedCommand")

    builder.directory(executorDir)
    builder.environment.put("SPARK_EXECUTOR_DIRS", appLocalDirs.mkString(File.pathSeparator))
    // In case we are running this from within the Spark Shell, avoid creating a "scala"
    // parent process for the executor command
    builder.environment.put("SPARK_LAUNCH_WITH_SCALA", "0")

    // Add webUI log urls
    val baseUrl =
      if (conf.getBoolean("spark.ui.reverseProxy", false)) {
        s"/proxy/$workerId/logPage/?appId=$appId&executorId=$execId&logType="
      } else {
        s"http://$publicAddress:$webUiPort/logPage/?appId=$appId&executorId=$execId&logType="
      }
    builder.environment.put("SPARK_LOG_URL_STDERR", s"${baseUrl}stderr")
    builder.environment.put("SPARK_LOG_URL_STDOUT", s"${baseUrl}stdout")

    process = builder.start()
    val header = "Spark Executor Command: %s\n%s\n\n".format(
      formattedCommand, "=" * 40)

    // Redirect its stdout and stderr to files
    val stdout = new File(executorDir, "stdout")
    stdoutAppender = FileAppender(process.getInputStream, stdout, conf)

    val stderr = new File(executorDir, "stderr")
    Files.write(header, stderr, StandardCharsets.UTF_8)
    stderrAppender = FileAppender(process.getErrorStream, stderr, conf)

    // Wait for it to exit; executor may exit with code 0 (when driver instructs it to shutdown)
    // or with nonzero exit code
    val exitCode = process.waitFor()
    state = ExecutorState.EXITED
    val message = "Command exited with code " + exitCode
    worker.send(ExecutorStateChanged(appId, execId, state, Some(message), Some(exitCode)))
  } catch {
    case interrupted: InterruptedException =>
      logInfo("Runner thread for executor " + fullId + " interrupted")
      state = ExecutorState.KILLED
      killProcess(None)
    case e: Exception =>
      logError("Error running executor", e)
      state = ExecutorState.FAILED
      killProcess(Some(e.toString))
  }
}

In fetchAndRunExecutor, the command built by CommandUtils.buildProcessBuilder(appDesc.command, ...) has "org.apache.spark.executor.CoarseGrainedExecutorBackend" as its entry class. So when the Worker node starts an ExecutorRunner, the ExecutorRunner launches a CoarseGrainedExecutorBackend process, and in CoarseGrainedExecutorBackend's onStart method a RegisterExecutor registration request is sent to the Driver.

CoarseGrainedExecutorBackend's onStart method source code:

override def onStart() {
  ……
  driver = Some(ref)
  // Send an ask request to the driver and wait for the driver's reply
  ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
  ……
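
Filling in the elisions, onStart in Spark 2.x looks roughly like this (lightly abridged): the backend first resolves an endpoint reference to the Driver from driverUrl, and exits the process if registration fails:

override def onStart() {
  logInfo("Connecting to driver: " + driverUrl)
  rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
    // This is a very fast action so we can use "ThreadUtils.sameThread"
    driver = Some(ref)
    ref.ask[Boolean](RegisterExecutor(executorId, self, hostname, cores, extractLogUrls))
  }(ThreadUtils.sameThread).onComplete {
    case Success(msg) =>
      // Always receives `true`; the real acknowledgement is the
      // RegisteredExecutor message handled in receive
    case Failure(e) =>
      exitExecutor(1, s"Cannot register with driver: $driverUrl", e, notifyDriver = false)
  }(ThreadUtils.sameThread)
}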

When the Driver side receives the registration request, it registers the Executor and replies, as shown below.

CoarseGrainedSchedulerBackend.scala's receiveAndReply method source code:

override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {

  case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls) =>
    ……
    executorRef.send(RegisteredExecutor)
    ……
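
A fuller (still abridged) view of this case in Spark 2.x shows the bookkeeping around the reply: duplicate registrations are rejected, the new Executor's resources are recorded in executorDataMap, and makeOffers() is triggered so tasks can be offered to it. Details vary across versions:

case RegisterExecutor(executorId, executorRef, hostname, cores, logUrls) =>
  if (executorDataMap.contains(executorId)) {
    executorRef.send(RegisterExecutorFailed("Duplicate executor ID: " + executorId))
    context.reply(true)
  } else {
    ……
    val data = new ExecutorData(executorRef, executorRef.address, hostname,
      cores, cores, logUrls)
    CoarseGrainedSchedulerBackend.this.synchronized {
      executorDataMap.put(executorId, data)
      ……
    }
    executorRef.send(RegisteredExecutor)
    context.reply(true)
    listenerBus.post(
      SparkListenerExecutorAdded(System.currentTimeMillis(), executorId, data))
    makeOffers()
  }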

As the code above shows, the Driver sends a RegisteredExecutor message back to CoarseGrainedExecutorBackend. On receiving it, CoarseGrainedExecutorBackend creates a new Executor and from then on acts as that Executor's messenger, communicating with the Driver on its behalf. The code that handles the RegisteredExecutor message is shown below.

CoarseGrainedExecutorBackend.scala's receive method source code:

override def receive: PartialFunction[Any, Unit] = {
  case RegisteredExecutor =>
    logInfo("Successfully registered with driver")
    try {
      // On receiving the RegisteredExecutor message, create the Executor immediately
      executor = new Executor(executorId, hostname, env, userClassPath, isLocal = false)
    } catch {
      case NonFatal(e) =>
        exitExecutor(1, "Unable to create executor due to " + e.getMessage, e)
    }

As the code above shows, on receiving the RegisteredExecutor message, CoarseGrainedExecutorBackend creates a new org.apache.spark.executor.Executor object. At this point, the Executor has been created.
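
For a sense of what that construction entails, the Executor's constructor (heavily abridged from org.apache.spark.executor.Executor in Spark 2.x) immediately starts the thread pool that will run tasks and the heartbeater that reports to the Driver:

private[spark] class Executor(
    executorId: String,
    executorHostname: String,
    env: SparkEnv,
    userClassPath: Seq[URL] = Nil,
    isLocal: Boolean = false)
  extends Logging {
  ……
  // Start worker thread pool
  private val threadPool = ThreadUtils.newDaemonCachedThreadPool("Executor task launch worker")
  ……
  // Periodically report heartbeats and task metrics to the driver
  startDriverHeartbeater()
}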

 
