When Exactly Is the Driver Created?


6.3   Revisiting the Driver from the Perspective of Application Submission

6.3.1 When Exactly Is the Driver Created?

When SparkContext is instantiated, it calls createTaskScheduler to create the TaskSchedulerImpl and the StandaloneSchedulerBackend.

SparkContext.scala source:

class SparkContext(config: SparkConf) extends Logging {
  ...
  val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
  _schedulerBackend = sched
  _taskScheduler = ts

  _dagScheduler = new DAGScheduler(this)
  _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
  ...

  private def createTaskScheduler(
    ...
    case SPARK_REGEX(sparkUrl) =>
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
      scheduler.initialize(backend)
      (backend, scheduler)
  ...

 

Inside createTaskScheduler, scheduler.initialize(backend) is called, passing the StandaloneSchedulerBackend in as the parameter of initialize.

TaskSchedulerImpl's initialize source:

def initialize(backend: SchedulerBackend) {
  this.backend = backend
  ...

 

initialize receives the StandaloneSchedulerBackend but does not start it yet; it merely assigns the StandaloneSchedulerBackend to TaskSchedulerImpl's backend field.

When TaskSchedulerImpl's start method is called, it invokes backend.start, and it is inside that start method that the application gets registered.
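To see where this whole chain begins from the user's side, here is a minimal sketch (the master URL and app name are placeholders): constructing the SparkContext is all it takes to create and start the scheduler and backend described above.

import org.apache.spark.{SparkConf, SparkContext}

object MinimalApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("MinimalApp")                  // placeholder app name
      .setMaster("spark://master-host:7077")     // placeholder standalone master URL
    // Instantiating SparkContext runs createTaskScheduler, initialize and
    // _taskScheduler.start(), which registers the application with the Master.
    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 10).sum())       // tasks run on the launched Executors
    sc.stop()
  }
}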

Starting the taskScheduler in SparkContext.scala:

val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
...
_taskScheduler.start()
_applicationId = _taskScheduler.applicationId()
_applicationAttemptId = taskScheduler.applicationAttemptId()
_conf.set("spark.app.id", _applicationId)
...

 

This calls the start method of _taskScheduler:

private[spark] trait TaskScheduler {
  ...

  def start(): Unit
  ...

TaskScheduler's start() has no concrete implementation; the start() of TaskSchedulerImpl, which implements the TaskScheduler trait, looks like this:

override def start() {
  backend.start()
  ...

 

TaskSchedulerImpl's start() thus calls backend.start(), which runs StandaloneSchedulerBackend's start method:

override def start() {
  super.start()
  launcherBackend.connect()
  ...
  val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
    args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
  ...
  val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
    appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
  client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
  client.start()
  ...
}

 

In StandaloneSchedulerBackend's start method, a Command is wrapped up and registered with the Master; the Master in turn tells a Worker to launch the concrete Executor. The Command carries the launch instruction: the entry class of the Executor process is CoarseGrainedExecutorBackend. A StandaloneAppClient is then created and started via client.start().
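For reference, the Command being wrapped here is a plain case class in org.apache.spark.deploy; its shape (as in the Spark 2.x sources this chapter follows) is roughly:

private[spark] case class Command(
    mainClass: String,
    arguments: Seq[String],
    environment: Map[String, String],
    classPathEntries: Seq[String],
    libraryPathEntries: Seq[String],
    javaOpts: Seq[String])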

StandaloneAppClient's start method creates a ClientEndpoint:

def start() {
  // Just launch an rpcEndpoint; it will call back into the listener.
  endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
}

ClientEndpoint source:

private class ClientEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint
  with Logging {
  ...
  override def onStart(): Unit = {
    try {
      registerWithMaster(1)
    } catch {
      case e: Exception =>
        logWarning("Failed to connect to master", e)
        markDisconnected()
        stop()
    }
  }

 

ClientEndpoint is a ThreadSafeRpcEndpoint. Its onStart() method calls registerWithMaster(1) to register the application with the Master. The registerWithMaster method:

StandaloneAppClient.scala source:

private def registerWithMaster(nthRetry: Int) {
  registerMasterFutures.set(tryRegisterAllMasters())
  ...

registerWithMaster calls tryRegisterAllMasters, in which the ClientEndpoint sends the Master a RegisterApplication message to register the application.

StandaloneAppClient.scala source:

private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  ...
  masterRef.send(RegisterApplication(appDescription, self))
  ...

Once the application is registered, the Master allocates resources for us through schedule() and tells the Workers to launch Executors; the Executor process launched is CoarseGrainedExecutorBackend. After starting, the Executor registers back with the Driver. The Driver here is in fact the DriverEndpoint, a message loop inside CoarseGrainedSchedulerBackend, the parent class of StandaloneSchedulerBackend.

Master.scala's receive method:

override def receive: PartialFunction[Any, Unit] = {
  case RegisterApplication(description, driver) =>
    ...
    registerApplication(app)
    logInfo("Registered app " + description.name + " with ID " + app.id)
    persistenceEngine.addApplication(app)
    driver.send(RegisteredApplication(app.id, self))
    schedule()
  }

 

Master's receive calls the schedule method, which schedules the currently available resources among the waiting applications. This method is called every time a new application connects or resource availability changes.

Master.scala's schedule method:

private def schedule(): Unit = {
  ...
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        launchDriver(worker, driver)
        waitingDrivers -= driver
        launched = true
      }
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  startExecutorsOnWorkers()
}

 

Master.scala's schedule calls launchDriver, which sends the Worker a LaunchDriver message. Master.scala's launchDriver source:

private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  worker.addDriver(driver)
  driver.worker = Some(worker)
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  driver.state = DriverState.RUNNING
}

 

LaunchDriver itself is a case class, carrying the driverId and the driverDesc.

case class LaunchDriver(driverId: String, driverDesc: DriverDescription) extends DeployMessage

 

DriverDescription contains the jarUrl, memory, cores, supervise flag, and command.

private[deploy] case class DriverDescription(
    jarUrl: String,
    mem: Int,
    cores: Int,
    supervise: Boolean,
    command: Command) {

  override def toString: String = s"DriverDescription (${command.mainClass})"
}

 

launchDriver in Master.scala starts the Driver; next, launchExecutor starts the Executors. Master.scala's launchExecutor source:

private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  exec.application.driver.send(
    ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}

So the Master sends the Worker a LaunchDriver message to start the Driver, and then launchExecutor starts the Executors. launchExecutor has its own scheduling; once resources are scheduled, it likewise sends the Worker a LaunchExecutor message.

The Worker thus receives the LaunchDriver and LaunchExecutor messages sent by the Master.

The following shows the internals and flow of the Worker:

Figure 5-6: Worker internals and flow

Master and Worker are deployed on different machines, each existing as a process. The Master sends Workers two kinds of instructions: LaunchDriver and LaunchExecutor.

- When the Worker receives the Master's LaunchDriver message, it creates a DriverRunner and then calls driver.start().

Worker.scala source:

case LaunchDriver(driverId, driverDesc) =>
  ...
  val driver = new DriverRunner(
  ...
  driver.start()

 

- When the Worker receives the Master's LaunchExecutor message, it creates an ExecutorRunner and then calls manager.start().

Worker.scala source:

case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
  ...
  val manager = new ExecutorRunner(
  ...
  manager.start()

 

Whether it is the Worker's DriverRunner or its ExecutorRunner, calling start launches a thread internally: a Thread handles the startup of the Driver or the Executor. Taking the Worker receiving LaunchDriver and creating a DriverRunner as the example, DriverRunner.scala's start source is:

/** Starts a thread to run and manage the driver. */
private[worker] def start() = {
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      var shutdownHook: AnyRef = null
      try {
        shutdownHook = ShutdownHookManager.addShutdownHook { () =>
          logInfo(s"Worker shutting down, killing driver $driverId")
          kill()
        }

        // prepare driver jars and run driver
        val exitCode = prepareAndRunDriver()

        // set final state depending on if forcibly killed and process exit code
        finalState = if (exitCode == 0) {
          Some(DriverState.FINISHED)
        } else if (killed) {
          Some(DriverState.KILLED)
        } else {
          Some(DriverState.FAILED)
        }
      } catch {
        case e: Exception =>
          kill()
          finalState = Some(DriverState.ERROR)
          finalException = Some(e)
      } finally {
        if (shutdownHook != null) {
          ShutdownHookManager.removeShutdownHook(shutdownHook)
        }
      }

      // notify worker of final driver state, possible exception
      worker.send(DriverStateChanged(driverId, finalState.get, finalException))
    }
  }.start()
}

 

DriverRunner.scala's start calls prepareAndRunDriver, which prepares the Driver's jar and launches the Driver. prepareAndRunDriver source:

private[worker] def prepareAndRunDriver(): Int = {
  val driverDir = createWorkingDirectory()
  val localJarFilename = downloadUserJar(driverDir)

  def substituteVariables(argument: String): String = argument match {
    case "{{WORKER_URL}}" => workerUrl
    case "{{USER_JAR}}" => localJarFilename
    case other => other
  }

  // TODO: If we add ability to submit multiple jars they should also be added here
  val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
    driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)

  runDriver(builder, driverDir, driverDesc.supervise)
}

 

The LaunchDriver startup process:

- Worker process: the Worker's DriverRunner calls start, using a Thread internally to handle the Driver's launch. DriverRunner creates the Driver's working directory on the local file system (a Linux directory; each run gets its own directory), packages the Driver's launch Command, and starts the Driver through a ProcessBuilder (see the sketch after this list). All of this happens inside the Worker process.

- Driver process: the launched Driver itself belongs to the Driver process.

The LaunchExecutor startup process:

- Worker process: the Worker's ExecutorRunner calls start, using a Thread internally to handle the Executor's launch. ExecutorRunner creates the Executor's working directory on the local file system (each run gets its own directory), packages the Executor's launch Command, and starts the Executor through a ProcessBuilder. All of this happens inside the Worker process.

- Executor process: the launched Executor belongs to the Executor process. The Executor lives inside an ExecutorBackend; in Spark standalone mode this is CoarseGrainedExecutorBackend, which mixes in the ExecutorBackend trait. Executor and ExecutorBackend are one-to-one: each ExecutorBackend holds one Executor, and inside the Executor a thread pool concurrently processes the Tasks that Spark submits (a toy sketch follows the CoarseGrainedExecutorBackend listing below).

- After starting, the Executor registers with the Driver, that is, with the SchedulerBackend.
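Both paths boil down to the same pattern, sketched below: the Worker builds a command line and spawns the Driver or Executor as a separate JVM process through Java's ProcessBuilder. Everything in this sketch (entry class, options, directories) is a placeholder, not Spark code.

import java.io.File

object SpawnProcessSketch {
  def main(args: Array[String]): Unit = {
    val javaBin = new File(new File(System.getProperty("java.home"), "bin"), "java").getPath
    val workDir = new File("/tmp/driver-work")   // per-run working directory
    workDir.mkdirs()
    val builder = new ProcessBuilder(
      javaBin, "-Xmx1g",
      "-cp", System.getProperty("java.class.path"),
      "com.example.HypotheticalMain")            // placeholder entry class
    builder.directory(workDir)
    builder.redirectOutput(new File(workDir, "stdout"))
    builder.redirectError(new File(workDir, "stderr"))
    val process = builder.start()                // the child is a separate OS process
    val exitCode = process.waitFor()             // the spawning thread waits, like DriverRunner
    println(s"child exited with code $exitCode")
  }
}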

 

The CoarseGrainedExecutorBackend source; CoarseGrainedExecutorBackend holds the Executor itself:

private[spark] class CoarseGrainedExecutorBackend(
    override val rpcEnv: RpcEnv,
    driverUrl: String,
    executorId: String,
    hostname: String,
    cores: Int,
    userClassPath: Seq[URL],
    env: SparkEnv)
  extends ThreadSafeRpcEndpoint with ExecutorBackend with Logging {

  private[this] val stopping = new AtomicBoolean(false)
  var executor: Executor = null
  @volatile var driver: Option[RpcEndpointRef] = None
  ...
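As noted above, inside the Executor a thread pool runs the submitted tasks concurrently (Spark names its pool threads "Executor task launch worker"). A toy model of that pattern, not Spark code:

import java.util.concurrent.Executors

object TaskPoolSketch {
  def main(args: Array[String]): Unit = {
    val pool = Executors.newCachedThreadPool()   // Spark uses a daemon cached pool
    for (taskId <- 1 to 4) {
      pool.submit(new Runnable {
        override def run(): Unit =
          println(s"task $taskId on ${Thread.currentThread().getName}")
      })
    }
    pool.shutdown()
  }
}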

 

Let's look again at the Master's schedule() method:

private def schedule(): Unit = {
  ...
      if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
        launchDriver(worker, driver)
        waitingDrivers -= driver
        launched = true
      }
      curPos = (curPos + 1) % numWorkersAlive
    }
  }
  startExecutorsOnWorkers()
}

 

In Master's schedule(), if the Driver is to run inside the cluster, launchDriver starts it. launchDriver hands a message to the Worker's endpoint; this is the RPC communication mechanism.

private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
  logInfo("Launching driver " + driver.id + " on worker " + worker.id)
  worker.addDriver(driver)
  driver.worker = Some(worker)
  worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
  driver.state = DriverState.RUNNING
}

 

The Executor-launching part of Master's schedule() goes through startExecutorsOnWorkers(), which likewise communicates over RPC:

private def startExecutorsOnWorkers(): Unit = {
  // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
  // in the queue, then the second app, etc.
  for (app <- waitingApps if app.coresLeft > 0) {
    val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
    // Filter out workers that don't have enough resources to launch an executor
    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
    val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

    // Now that we've decided how many cores to allocate on each worker, let's allocate them
    for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
      allocateWorkerResourceToExecutors(
        app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
    }
  }
}

 

Master.scala then calls allocateWorkerResourceToExecutors to carry out the actual allocation:

private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // If the number of cores per executor is specified, we divide the cores assigned
  // to this worker evenly among the executors with no remainder.
  // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    val exec = app.addExecutor(worker, coresToAssign)
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
}

 

When allocateWorkerResourceToExecutors performs the actual allocation, it launches each Executor through the launchExecutor method:

private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  exec.application.driver.send(
    ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}

 

The Master sends the Worker two messages, LaunchDriver and LaunchExecutor, and the Worker receives them. Let's look at the Worker:

private[deploy] class Worker(
    override val rpcEnv: RpcEnv,
    webUiPort: Int,
    cores: Int,
    memory: Int,
    masterRpcAddresses: Array[RpcAddress],
    endpointName: String,
    workDirPath: String = null,
    val conf: SparkConf,
    val securityMgr: SecurityManager)
  extends ThreadSafeRpcEndpoint with Logging {

 

Worker implements RPC communication by extending ThreadSafeRpcEndpoint, a trait; other RPC endpoints can send it messages:

private[spark] trait ThreadSafeRpcEndpoint extends RpcEndpoint

 

The Worker receives messages in its receive method. It works like a mailbox, looping over the inbox to pick up mail; each message can be thought of as a letter.

override def receive: PartialFunction[Any, Unit] = synchronized {
  case SendHeartbeat =>
  ...
  case WorkDirCleanup =>
  ...
  case MasterChanged(masterRef, masterWebUiUrl) =>
  ...
  case ReconnectWorker(masterUrl) =>
  ...
  case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
  ...
  case executorStateChanged @ ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
  ...
  case KillExecutor(masterUrl, appId, execId) =>
  ...
  case LaunchDriver(driverId, driverDesc) =>
  ...
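As a toy model of this mailbox pattern (the messages and queue below are illustrative, not Spark's RpcEnv): an endpoint exposes a PartialFunction[Any, Unit], and a loop feeds each incoming message to it.

import java.util.concurrent.LinkedBlockingQueue

object MailboxSketch {
  case object SendHeartbeat                       // illustrative messages
  case class LaunchDriver(driverId: String)

  val receive: PartialFunction[Any, Unit] = {
    case SendHeartbeat    => println("heartbeat")
    case LaunchDriver(id) => println(s"launching driver $id")
    case other            => println(s"unhandled: $other")
  }

  def main(args: Array[String]): Unit = {
    val mailbox = new LinkedBlockingQueue[Any]()
    mailbox.put(SendHeartbeat)
    mailbox.put(LaunchDriver("driver-20240502-0001"))
    while (!mailbox.isEmpty) receive(mailbox.take()) // the mailbox loop
  }
}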

 

Worker.scala's receive handling of LaunchDriver, which starts the Driver:

case LaunchDriver(driverId, driverDesc) =>
  logInfo(s"Asked to launch driver $driverId")
  val driver = new DriverRunner(
    conf,
    driverId,
    workDir,
    sparkHome,
    driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
    self,
    workerUri,
    securityMgr)
  drivers(driverId) = driver
  driver.start()

  coresUsed += driverDesc.cores
  memoryUsed += driverDesc.mem

 

The LaunchDriver handler first logs; the incoming message always carries the driverId. When launching a Driver or an Executor, the hosting process is guaranteed to satisfy the memory requirement, but not necessarily the cores requirement: the actual cores may be more or fewer than requested.

The logInfo call is a thin wrapper around the logger:

protected def logInfo(msg: => String) {
  if (log.isInfoEnabled) log.info(msg)
}

 

Back in the LaunchDriver handler, a DriverRunner is created, carrying the driverId, the working directory (workDir), the Spark home path (sparkHome), the driverDesc, the workerUri, the securityMgr, and so on. The line drivers(driverId) = driver hands the driver to the drivers data structure, a HashMap keyed by the Driver's ID with the DriverRunner as the value. A Worker may launch many Drivers and Executors, so it manages each DriverRunner by its specific ID. DriverRunner starts another process, the Driver, via an internal thread; DriverRunner is the in-Worker proxy for the Driver's process.

val drivers = new HashMap[String, DriverRunner]
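A tiny illustration of this key-value bookkeeping (FakeRunner stands in for DriverRunner; nothing here is Spark code):

import scala.collection.mutable.HashMap

object BookkeepingSketch {
  class FakeRunner(val driverId: String) {       // stand-in for DriverRunner
    def start(): Unit = println(s"starting $driverId")
  }

  def main(args: Array[String]): Unit = {
    val drivers = new HashMap[String, FakeRunner]
    val runner = new FakeRunner("driver-0001")
    drivers(runner.driverId) = runner            // drivers(driverId) = driver
    drivers(runner.driverId).start()             // look up the runner by its ID
  }
}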

 

Back in Worker.scala's LaunchDriver handler: before starting the driver, the Worker saves the DriverRunner into its in-memory data structures, then calls driver.start(). After start, the consumed cores and memory are added to coresUsed and memoryUsed.

Now into the DriverRunner.scala source. DriverRunner manages the execution of one Driver, including automatically restarting it on failure. If the Driver runs in cluster mode, adding the supervise flag enables automatic restart:

private[deploy] class DriverRunner(
    conf: SparkConf,
    val driverId: String,
    val workDir: File,
    val sparkHome: File,
    val driverDesc: DriverDescription,
    val worker: RpcEndpointRef,
    val workerUrl: String,
    val securityManager: SecurityManager)
  extends Logging {

 

Its DriverDescription source is shown below, including the supervise member, a Boolean: if set to true, the Worker restarts the Driver whenever it fails in cluster mode:

private[deploy] case class DriverDescription(
    jarUrl: String,
    mem: Int,
    cores: Int,
    supervise: Boolean,
    command: Command) {

  override def toString: String = s"DriverDescription (${command.mainClass})"
}

 

Back in Worker.scala's LaunchDriver handler: once the DriverRunner is constructed, its start method is called, managing the Driver through a thread, including starting and shutting it down. In Thread("DriverRunner for " + driverId), "DriverRunner for " + driverId is the thread's name; Thread is Java code, and Scala interoperates seamlessly with Java.

The start source:

private[worker] def start() = {
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      var shutdownHook: AnyRef = null
      try {
        shutdownHook = ShutdownHookManager.addShutdownHook { () =>
          logInfo(s"Worker shutting down, killing driver $driverId")
          kill()
        }

        // prepare driver jars and run driver
        val exitCode = prepareAndRunDriver()

        // set final state depending on if forcibly killed and process exit code
        finalState = if (exitCode == 0) {
          Some(DriverState.FINISHED)
        } else if (killed) {
          Some(DriverState.KILLED)
        } else {
          Some(DriverState.FAILED)
        }
      } catch {
        case e: Exception =>
          kill()
          finalState = Some(DriverState.ERROR)
          finalException = Some(e)
      } finally {
        if (shutdownHook != null) {
          ShutdownHookManager.removeShutdownHook(shutdownHook)
        }
      }

      // notify worker of final driver state, possible exception
      worker.send(DriverStateChanged(driverId, finalState.get, finalException))
    }
  }.start()
}

 

DriverRunner's start calls prepareAndRunDriver to prepare the driver's jar and launch the driver. prepareAndRunDriver source:

private[worker] def prepareAndRunDriver(): Int = {
  val driverDir = createWorkingDirectory()
  val localJarFilename = downloadUserJar(driverDir)

  def substituteVariables(argument: String): String = argument match {
    case "{{WORKER_URL}}" => workerUrl
    case "{{USER_JAR}}" => localJarFilename
    case other => other
  }

  // TODO: If we add ability to submit multiple jars they should also be added here
  val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
    driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)

  runDriver(builder, driverDir, driverDesc.supervise)
}

 

prepareAndRunDriver calls createWorkingDirectory to create the Driver's working directory via Java's new File; if the directory does not exist and cannot be created, it fails with an exception. Creating a directory on the local file system rarely fails unless the disk is full. createWorkingDirectory source:

private def createWorkingDirectory(): File = {
  val driverDir = new File(workDir, driverId)
  if (!driverDir.exists() && !driverDir.mkdirs()) {
    throw new IOException("Failed to create directory " + driverDir)
  }
  driverDir
}

 

Back in DriverRunner.scala's prepareAndRunDriver, downloadUserJar downloads the jar: the code we write is packaged as a jar, and here the user's jar is downloaded to the local machine. The jar sits in HDFS, and we fetch it from HDFS to the local file system.

downloadUserJar source:

private def downloadUserJar(driverDir: File): String = {
  val jarFileName = new URI(driverDesc.jarUrl).getPath.split("/").last
  val localJarFile = new File(driverDir, jarFileName)
  if (!localJarFile.exists()) { // May already exist if running multiple workers on one node
    logInfo(s"Copying user jar ${driverDesc.jarUrl} to $localJarFile")
    Utils.fetchFile(
      driverDesc.jarUrl,
      driverDir,
      conf,
      securityManager,
      SparkHadoopUtil.get.newConfiguration(conf),
      System.currentTimeMillis(),
      useCache = false)
    if (!localJarFile.exists()) { // Verify copy succeeded
      throw new IOException(
        s"Can not find expected jar $jarFileName which should have been loaded in $driverDir")
    }
  }
  localJarFile.getAbsolutePath
}

 

downloadUserJar calls fetchFile, which relies on Hadoop to download the file from HDFS. When we submit the application, the jar is uploaded to HDFS once, and every node can download it from there. The Utils.fetchFile source:

def fetchFile(
    url: String,
    targetDir: File,
    conf: SparkConf,
    securityMgr: SecurityManager,
    hadoopConf: Configuration,
    timestamp: Long,
    useCache: Boolean) {
  val fileName = decodeFileNameInURI(new URI(url))
  val targetFile = new File(targetDir, fileName)
  val fetchCacheEnabled = conf.getBoolean("spark.files.useFetchCache", defaultValue = true)
  if (useCache && fetchCacheEnabled) {
    val cachedFileName = s"${url.hashCode}${timestamp}_cache"
    val lockFileName = s"${url.hashCode}${timestamp}_lock"
    val localDir = new File(getLocalDir(conf))
    val lockFile = new File(localDir, lockFileName)
    val lockFileChannel = new RandomAccessFile(lockFile, "rw").getChannel()
    // Only one executor entry.
    // The FileLock is only used to control synchronization for executors download file,
    // it's always safe regardless of lock type (mandatory or advisory).
    val lock = lockFileChannel.lock()
    val cachedFile = new File(localDir, cachedFileName)
    try {
      if (!cachedFile.exists()) {
        doFetchFile(url, localDir, cachedFileName, conf, securityMgr, hadoopConf)
      }
    } finally {
      lock.release()
      lockFileChannel.close()
    }
    copyFile(
      url,
      cachedFile,
      targetFile,
      conf.getBoolean("spark.files.overwrite", false)
    )
  } else {
    doFetchFile(url, targetDir, fileName, conf, securityMgr, hadoopConf)
  }
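A hedged sketch of the underlying idea: for an hdfs:// URL, fetching the jar ultimately means copying it to the local file system with the Hadoop FileSystem API. The paths below are placeholders.

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object FetchJarSketch {
  def main(args: Array[String]): Unit = {
    val jarUrl = "hdfs://namenode:9000/user/app/app.jar"   // placeholder jar location
    val fs = FileSystem.get(new URI(jarUrl), new Configuration())
    // Copy the jar from HDFS into the driver's local working directory
    fs.copyToLocalFile(new Path(jarUrl), new Path("/tmp/driver-dir/app.jar"))
  }
}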

 

Back in DriverRunner.scala's prepareAndRunDriver: driverDesc.command names the class to run; the process builder is constructed around that entry point, and then runDriver launches the Driver.

private[worker] def prepareAndRunDriver(): Int = {
  ...
  val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
    driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)

  runDriver(builder, driverDir, driverDesc.supervise)
}

 

DriverRunner.scala's runDriver is shown below. runDriver redirects the process's stdout and stderr to files, so execution can be inspected through the log files. It finally calls runCommandWithRetry:

private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
  builder.directory(baseDir)
  def initialize(process: Process): Unit = {
    // Redirect stdout and stderr to files
    val stdout = new File(baseDir, "stdout")
    CommandUtils.redirectStream(process.getInputStream, stdout)

    val stderr = new File(baseDir, "stderr")
    val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
    val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
    Files.append(header, stderr, StandardCharsets.UTF_8)
    CommandUtils.redirectStream(process.getErrorStream, stderr)
  }
  runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
}
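What redirectStream boils down to is a background thread copying the child process's output into a log file; a hedged sketch of that idea (not Spark's CommandUtils):

import java.io.{File, FileOutputStream, InputStream}

object RedirectSketch {
  def redirectStream(in: InputStream, file: File): Unit = {
    val out = new FileOutputStream(file, true)
    new Thread("stream-redirect") {
      override def run(): Unit = {
        val buf = new Array[Byte](1024)
        var n = in.read(buf)
        while (n != -1) { out.write(buf, 0, n); n = in.read(buf) } // copy until EOF
        out.close()
      }
    }.start()
  }
}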

 

runCommandWithRetry is passed ProcessBuilderLike(builder), which creates a ProcessBuilderLike whose overridden start() executes processBuilder.start(). ProcessBuilderLike source:

private[deploy] object ProcessBuilderLike {
  def apply(processBuilder: ProcessBuilder): ProcessBuilderLike = new ProcessBuilderLike {
    override def start(): Process = processBuilder.start()
    override def command: Seq[String] = processBuilder.command().asScala
  }
}

 

The runCommandWithRetry source:

private[worker] def runCommandWithRetry(
    command: ProcessBuilderLike, initialize: Process => Unit, supervise: Boolean): Int = {
  var exitCode = -1
  // Time to wait between submission retries.
  var waitSeconds = 1
  // A run of this many seconds resets the exponential back-off.
  val successfulRunDuration = 5
  var keepTrying = !killed

  while (keepTrying) {
    logInfo("Launch Command: " + command.command.mkString("\"", "\" \"", "\""))

    synchronized {
      if (killed) { return exitCode }
      process = Some(command.start())
      initialize(process.get)
    }

    val processStart = clock.getTimeMillis()
    exitCode = process.get.waitFor()

    // check if attempting another run
    keepTrying = supervise && exitCode != 0 && !killed
    if (keepTrying) {
      if (clock.getTimeMillis() - processStart > successfulRunDuration * 1000) {
        waitSeconds = 1
      }
      logInfo(s"Command exited with status $exitCode, re-launching after $waitSeconds s.")
      sleeper.sleep(waitSeconds)
      waitSeconds = waitSeconds * 2 // exponential back-off
    }
  }

  exitCode
}

 

The first attempt in runCommandWithRetry may not succeed, so it loops and retries. DriverRunner launches the process and waits for it via process.get.waitFor on the ProcessBuilder's process; if supervise is true, the exitCode is nonzero, and the driver was not killed, keepTrying stays true and the process is relaunched.
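A stripped-down sketch of this back-off loop (runOnce is a placeholder that fails twice and then succeeds; the real condition is supervise && exitCode != 0 && !killed):

object BackoffSketch {
  private var attempts = 0
  def runOnce(): Int = { attempts += 1; if (attempts < 3) 1 else 0 } // placeholder process run

  def main(args: Array[String]): Unit = {
    var waitSeconds = 1
    val successfulRunDuration = 5
    var keepTrying = true
    while (keepTrying) {
      val start = System.currentTimeMillis()
      val exitCode = runOnce()
      keepTrying = exitCode != 0                 // supervise && !killed in the real code
      if (keepTrying) {
        // a run longer than successfulRunDuration seconds resets the back-off
        if (System.currentTimeMillis() - start > successfulRunDuration * 1000) waitSeconds = 1
        println(s"exit $exitCode, re-launching after $waitSeconds s")
        Thread.sleep(waitSeconds * 1000L)
        waitSeconds *= 2                         // exponential back-off: 1, 2, 4, 8, ...
      }
    }
  }
}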

Back to Worker.scala's LaunchDriver handler:

case LaunchDriver(driverId, driverDesc) =>
  ...
  drivers(driverId) = driver
  driver.start()

 

driver.start() launches the Driver; into the start source:

private[worker] def start() = {
  new Thread("DriverRunner for " + driverId) {
    override def run() {
      ...
      } catch {
        case e: Exception =>
          kill()
          finalState = Some(DriverState.ERROR)
          finalException = Some(e)
      } finally {
        if (shutdownHook != null) {
          ShutdownHookManager.removeShutdownHook(shutdownHook)
        }
      }

      // notify worker of final driver state, possible exception
      worker.send(DriverStateChanged(driverId, finalState.get, finalException))
    }
  }.start()
}

 

By the time start reaches finalState, something may have gone wrong in Spark, for example the Driver was KILLED or FAILED at runtime. In that case worker.send sends the Worker itself a message announcing the DriverStateChanged state change. In Worker.scala, the DriverStateChanged handler:

case driverStateChanged @ DriverStateChanged(driverId, state, exception) =>
  handleDriverStateChanged(driverStateChanged)

 

This calls the handleDriverStateChanged method, whose source is:

private[worker] def handleDriverStateChanged(driverStateChanged: DriverStateChanged): Unit = {
  val driverId = driverStateChanged.driverId
  val exception = driverStateChanged.exception
  val state = driverStateChanged.state
  state match {
    case DriverState.ERROR =>
      logWarning(s"Driver $driverId failed with unrecoverable exception: ${exception.get}")
    case DriverState.FAILED =>
      logWarning(s"Driver $driverId exited with failure")
    case DriverState.FINISHED =>
      logInfo(s"Driver $driverId exited successfully")
    case DriverState.KILLED =>
      logInfo(s"Driver $driverId was killed by user")
    case _ =>
      logDebug(s"Driver $driverId changed state to $state")
  }
  sendToMaster(driverStateChanged)
  val driver = drivers.remove(driverId).get
  finishedDrivers(driverId) = driver
  trimFinishedDriversIfNecessary()
  memoryUsed -= driver.driverDesc.mem
  coresUsed -= driver.driverDesc.cores
}

 

Worker.scala's handleDriverStateChanged logs according to the different states. The key line is sendToMaster(driverStateChanged), which sends the Master a message telling it the Driver process has terminated; the message content is driverStateChanged. sendToMaster source:

private def sendToMaster(message: Any): Unit = {
  master match {
    case Some(masterRef) => masterRef.send(message)
    case None =>
      logWarning(
        s"Dropping $message because the connection to master has not yet been established")
  }
}

 

Now the Master source. When the Master receives the DriverStateChanged message, whether the Driver's state is DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED, it removes the Driver from its in-memory structures and cleans up the Driver's data in the persistence engine.

case DriverStateChanged(driverId, state, exception) =>
  state match {
    case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
      removeDriver(driverId, state, exception)
    case _ =>
      throw new Exception(s"Received unexpected state update for driver $driverId: $state")
  }

 

Into the removeDriver source: after cleaning up the related data, it calls schedule() again:

private def removeDriver(
    driverId: String,
    finalState: DriverState,
    exception: Option[Exception]) {
  drivers.find(d => d.id == driverId) match {
    case Some(driver) =>
      logInfo(s"Removing driver: $driverId")
      drivers -= driver
      if (completedDrivers.size >= RETAINED_DRIVERS) {
        val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
        completedDrivers.trimStart(toRemove)
      }
      completedDrivers += driver
      persistenceEngine.removeDriver(driver)
      driver.state = finalState
      driver.exception = exception
      driver.worker.foreach(w => w.removeDriver(driver))
      schedule()
    case None =>
      logWarning(s"Asked to remove unknown driver: $driverId")
  }
}

 

Next, launching the Executor. Worker.scala's LaunchExecutor handler:

case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
  if (masterUrl != activeMasterUrl) {
    logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
  } else {
    try {
      logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))

      // Create the executor's working directory
      val executorDir = new File(workDir, appId + "/" + execId)
      if (!executorDir.mkdirs()) {
        throw new IOException("Failed to create directory " + executorDir)
      }

      // Create local dirs for the executor. These are passed to the executor via the
      // SPARK_EXECUTOR_DIRS environment variable, and deleted by the Worker when the
      // application finishes.
      val appLocalDirs = appDirectories.getOrElse(appId,
        Utils.getOrCreateLocalRootDirs(conf).map { dir =>
          val appDir = Utils.createDirectory(dir, namePrefix = "executor")
          Utils.chmod700(appDir)
          appDir.getAbsolutePath()
        }.toSeq)
      appDirectories(appId) = appLocalDirs
      val manager = new ExecutorRunner(
        appId,
        execId,
        appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
        cores_,
        memory_,
        self,
        workerId,
        host,
        webUi.boundPort,
        publicAddress,
        sparkHome,
        executorDir,
        workerUri,
        conf,
        appLocalDirs, ExecutorState.RUNNING)
      executors(appId + "/" + execId) = manager
      manager.start()
      coresUsed += cores_
      memoryUsed += memory_
      sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
    } catch {
      case e: Exception =>
        logError(s"Failed to launch executor $appId/$execId for ${appDesc.name}.", e)
        if (executors.contains(appId + "/" + execId)) {
          executors(appId + "/" + execId).kill()
          executors -= appId + "/" + execId
        }
        sendToMaster(ExecutorStateChanged(appId, execId, ExecutorState.FAILED,
          Some(e.toString), None))
    }
  }

 

Look directly at manager.start(): it starts a Thread whose run method calls fetchAndRunExecutor:

private[worker] def start() {
  workerThread = new Thread("ExecutorRunner for " + fullId) {
    override def run() { fetchAndRunExecutor() }
  }
  workerThread.start()
  // Shutdown hook that kills actors on shutdown.
  shutdownHook = ShutdownHookManager.addShutdownHook { () =>
    // It's possible that we arrive here before calling `fetchAndRunExecutor`, then `state` will
    // be `ExecutorState.RUNNING`. In this case, we should set `state` to `FAILED`.
    if (state == ExecutorState.RUNNING) {
      state = ExecutorState.FAILED
    }
    killProcess(Some("Worker shutting down")) }
}
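The shutdown-hook pattern used here can be sketched with the plain JVM API (Spark's ShutdownHookManager wraps this idea with priorities; everything below is illustrative):

object ShutdownHookSketch {
  def main(args: Array[String]): Unit = {
    val hook = new Thread(new Runnable {
      override def run(): Unit = println("JVM going down: kill the child process here")
    })
    Runtime.getRuntime.addShutdownHook(hook)     // runs if the JVM exits while we work
    // ... launch and wait for the child process ...
    Runtime.getRuntime.removeShutdownHook(hook)  // finished normally: unregister the hook
  }
}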

 

The fetchAndRunExecutor source:

private def fetchAndRunExecutor() {
  try {
    // Launch the process
    val builder = CommandUtils.buildProcessBuilder(appDesc.command, new SecurityManager(conf),
      memory, sparkHome.getAbsolutePath, substituteVariables)
    val command = builder.command()
    val formattedCommand = command.asScala.mkString("\"", "\" \"", "\"")
    logInfo(s"Launch command: $formattedCommand")

    builder.directory(executorDir)
    builder.environment.put("SPARK_EXECUTOR_DIRS", appLocalDirs.mkString(File.pathSeparator))
    // In case we are running this from within the Spark Shell, avoid creating a "scala"
    // parent process for the executor command
    builder.environment.put("SPARK_LAUNCH_WITH_SCALA", "0")

    // Add webUI log urls
    val baseUrl =
      if (conf.getBoolean("spark.ui.reverseProxy", false)) {
        s"/proxy/$workerId/logPage/?appId=$appId&executorId=$execId&logType="
      } else {
        s"http://$publicAddress:$webUiPort/logPage/?appId=$appId&executorId=$execId&logType="
      }
    builder.environment.put("SPARK_LOG_URL_STDERR", s"${baseUrl}stderr")
    builder.environment.put("SPARK_LOG_URL_STDOUT", s"${baseUrl}stdout")

    process = builder.start()
    val header = "Spark Executor Command: %s\n%s\n\n".format(
      formattedCommand, "=" * 40)

    // Redirect its stdout and stderr to files
    val stdout = new File(executorDir, "stdout")
    stdoutAppender = FileAppender(process.getInputStream, stdout, conf)

    val stderr = new File(executorDir, "stderr")
    Files.write(header, stderr, StandardCharsets.UTF_8)
    stderrAppender = FileAppender(process.getErrorStream, stderr, conf)

    // Wait for it to exit; executor may exit with code 0 (when driver instructs it to shutdown)
    // or with nonzero exit code
    val exitCode = process.waitFor()
    state = ExecutorState.EXITED
    val message = "Command exited with code " + exitCode
    worker.send(ExecutorStateChanged(appId, execId, state, Some(message), Some(exitCode)))
  } catch {
    case interrupted: InterruptedException =>
      logInfo("Runner thread for executor " + fullId + " interrupted")
      state = ExecutorState.KILLED
      killProcess(None)
    case e: Exception =>
      logError("Error running executor", e)
      state = ExecutorState.FAILED
      killProcess(Some(e.toString))
  }
}

 

fetchAndRunExecutor resembles the Driver launch: to start the Executor it first builds the command with CommandUtils.buildProcessBuilder, then calls builder.start(); on exit it sends an ExecutorStateChanged message to the Worker.

In the Worker.scala source, the ExecutorStateChanged handler:

case executorStateChanged @ ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
  handleExecutorStateChanged(executorStateChanged)

 

Into the handleExecutorStateChanged source: sendToMaster(executorStateChanged) forwards the executorStateChanged message to the Master:

private[worker] def handleExecutorStateChanged(executorStateChanged: ExecutorStateChanged):
  Unit = {
  sendToMaster(executorStateChanged)
  val state = executorStateChanged.state
  if (ExecutorState.isFinished(state)) {
    val appId = executorStateChanged.appId
    val fullId = appId + "/" + executorStateChanged.execId
    val message = executorStateChanged.message
    val exitStatus = executorStateChanged.exitStatus
    executors.get(fullId) match {
      case Some(executor) =>
        logInfo("Executor " + fullId + " finished with state " + state +
          message.map(" message " + _).getOrElse("") +
          exitStatus.map(" exitStatus " + _).getOrElse(""))
        executors -= fullId
        finishedExecutors(fullId) = executor
        trimFinishedExecutorsIfNecessary()
        coresUsed -= executor.cores
        memoryUsed -= executor.memory
      case None =>
        logInfo("Unknown Executor " + fullId + " finished with state " + state +
          message.map(" message " + _).getOrElse("") +
          exitStatus.map(" exitStatus " + _).getOrElse(""))
    }
    maybeCleanupApplication(appId)
  }
}

 

Finally Master.scala: the Master receives the ExecutorStateChanged message. When the state changes, it also sends the Driver an ExecutorUpdated message via exec.application.driver.send; the flow is essentially the same as for the Driver. The ExecutorStateChanged handling:

case ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
  val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
  execOption match {
    case Some(exec) =>
      val appInfo = idToApp(appId)
      val oldState = exec.state
      exec.state = state

      if (state == ExecutorState.RUNNING) {
        assert(oldState == ExecutorState.LAUNCHING,
          s"executor $execId state transfer from $oldState to RUNNING is illegal")
        appInfo.resetRetryCount()
      }

      exec.application.driver.send(ExecutorUpdated(execId, state, message, exitStatus, false))

      if (ExecutorState.isFinished(state)) {
        // Remove this executor from the worker and app
        logInfo(s"Removing executor ${exec.fullId} because it is $state")
        // If an application has already finished, preserve its
        // state to display its information properly on the UI
        if (!appInfo.isFinished) {
          appInfo.removeExecutor(exec)
        }
        exec.worker.removeExecutor(exec)

        val normalExit = exitStatus == Some(0)
        // Only retry certain number of times so we don't go into an infinite loop.
        // Important note: this code path is not exercised by tests, so be very careful when
        // changing this `if` condition.
        if (!normalExit
            && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
            && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
          val execs = appInfo.executors.values
          if (!execs.exists(_.state == ExecutorState.RUNNING)) {
            logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
              s"${appInfo.retryCount} times; removing it")
            removeApplication(appInfo, ApplicationState.FAILED)
          }
        }
      }
      schedule()
    case None =>
      logWarning(s"Got status update for unknown executor $appId/$execId")
  }

 
