When Exactly Is the Driver Created?
Source: Internet  Editor: 程序博客网  Time: 2024/05/02 01:09
6.3 Re-examining the Driver from the Perspective of Application Submission
6.3.1 When Exactly Is the Driver Created?
When SparkContext is instantiated, createTaskScheduler creates the TaskSchedulerImpl and the StandaloneSchedulerBackend.
SparkContext.scala source:
    class SparkContext(config: SparkConf) extends Logging {
      ......
      val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
      _schedulerBackend = sched
      _taskScheduler = ts

      _dagScheduler = new DAGScheduler(this)
      _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
      ......
      private def createTaskScheduler(
      ......
        case SPARK_REGEX(sparkUrl) =>
          val scheduler = new TaskSchedulerImpl(sc)
          val masterUrls = sparkUrl.split(",").map("spark://" + _)
          val backend = new StandaloneSchedulerBackend(scheduler, sc, masterUrls)
          scheduler.initialize(backend)
          (backend, scheduler)
      ......
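The SPARK_REGEX branch above turns one comma-separated master string into one URL per master. A minimal sketch of that step, assuming the pattern has the shape spark://(.*) (the excerpt does not show its definition) and using the hypothetical names MasterUrlSketch and masterUrls:

```scala
object MasterUrlSketch {
  // Assumed pattern: the excerpt only shows SPARK_REGEX being used, not defined.
  private val SparkRegex = """spark://(.*)""".r

  /** Splits "spark://h1:7077,h2:7077" into one spark:// URL per master. */
  def masterUrls(master: String): Seq[String] = master match {
    case SparkRegex(sparkUrl) =>
      // Same expression as in the excerpt: re-prefix every host:port entry.
      sparkUrl.split(",").map("spark://" + _).toSeq
    case other =>
      throw new IllegalArgumentException(s"Not a standalone master URL: $other")
  }
}
```

Listing several masters this way is what lets the client later try to register with each of them.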
createTaskScheduler calls scheduler.initialize(backend), passing the StandaloneSchedulerBackend in as the argument.
The initialize method of TaskSchedulerImpl:
    def initialize(backend: SchedulerBackend) {
      this.backend = backend
      ......
initialize only stores the StandaloneSchedulerBackend in the backend field of TaskSchedulerImpl; it does not start it.
The backend is started later: when TaskSchedulerImpl.start() is called, it calls backend.start(), and that start method registers the application.
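The two phases described above (remember the backend in initialize, start it only in start) can be sketched in isolation. This is an illustration with hypothetical types (Backend, RecordingBackend, Scheduler), not Spark's classes:

```scala
object TwoPhaseStartSketch {
  trait Backend { def start(): Unit }

  // Hypothetical backend that only records whether it has been started.
  final class RecordingBackend extends Backend {
    var started = false
    override def start(): Unit = { started = true }
  }

  final class Scheduler {
    private var backend: Backend = null

    // Phase 1: remember the backend; nothing is started here.
    def initialize(b: Backend): Unit = { backend = b }

    // Phase 2: starting the scheduler is what starts the backend.
    def start(): Unit = backend.start()
  }
}
```

Separating the two steps lets the caller finish wiring up other components between construction and start.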
How SparkContext.scala starts the taskScheduler:
    val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
    _schedulerBackend = sched
    _taskScheduler = ts
    _dagScheduler = new DAGScheduler(this)
    ......
    _taskScheduler.start()
    _applicationId = _taskScheduler.applicationId()
    _applicationAttemptId = taskScheduler.applicationAttemptId()
    _conf.set("spark.app.id", _applicationId)
    ......
This calls the start method of _taskScheduler:
    private[spark] trait TaskScheduler {
      ......

      def start(): Unit
      ......
TaskScheduler declares start() without an implementation. Its subclass TaskSchedulerImpl implements it as follows:
    override def start() {
      backend.start()
      ......
Through backend.start(), TaskSchedulerImpl.start() invokes the start method of StandaloneSchedulerBackend:
    override def start() {
      super.start()
      launcherBackend.connect()
      ......
      val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
        args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
      ......
      val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command,
        appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)
      client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
      client.start()
      ......
    }
In StandaloneSchedulerBackend.start, the command is wrapped in an ApplicationDescription and registered with the Master, and the Master in turn asks Workers to launch the actual Executors. The command already names the entry class of the Executor process, CoarseGrainedExecutorBackend. A StandaloneAppClient is then created and started with client.start().
The start method of StandaloneAppClient creates a ClientEndpoint:
    def start() {
      // Just launch an rpcEndpoint; it will call back into the listener.
      endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
    }
The ClientEndpoint source:
    private class ClientEndpoint(override val rpcEnv: RpcEnv) extends ThreadSafeRpcEndpoint
      with Logging {
      ......
      override def onStart(): Unit = {
        try {
          registerWithMaster(1)
        } catch {
          case e: Exception =>
            logWarning("Failed to connect to master", e)
            markDisconnected()
            stop()
        }
      }
ClientEndpoint is a ThreadSafeRpcEndpoint. Its onStart() method calls registerWithMaster(1) to register the application with the Master. The registerWithMaster method:
StandaloneAppClient.scala source:
    private def registerWithMaster(nthRetry: Int) {
      registerMasterFutures.set(tryRegisterAllMasters())
      ......
registerWithMaster calls tryRegisterAllMasters, in which the ClientEndpoint sends a RegisterApplication message to the Master to register the application.
StandaloneAppClient.scala source:
    private def tryRegisterAllMasters(): Array[JFuture[_]] = {
      ......
      masterRef.send(RegisterApplication(appDescription, self))
      ......
Once the application is registered, the Master allocates resources through schedule() and tells Workers to launch Executors. Each Executor runs in a CoarseGrainedExecutorBackend process and, once started, registers back with the Driver. The Driver here is actually DriverEndpoint, a message loop inside CoarseGrainedSchedulerBackend, the parent class of StandaloneSchedulerBackend.
The receive method in Master.scala:
    override def receive: PartialFunction[Any, Unit] = {
      case RegisterApplication(description, driver) =>
        ......
        registerApplication(app)
        logInfo("Registered app " + description.name + " with ID " + app.id)
        persistenceEngine.addApplication(app)
        driver.send(RegisteredApplication(app.id, self))
        schedule()
    }
The Master's receive method calls schedule(), which schedules the currently available resources among the waiting applications. It is called every time a new application joins or resource availability changes.
The schedule method in Master.scala:
    private def schedule(): Unit = {
      ......
          if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
            launchDriver(worker, driver)
            waitingDrivers -= driver
            launched = true
          }
          curPos = (curPos + 1) % numWorkersAlive
        }
      }
      startExecutorsOnWorkers()
    }
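The visible part of schedule() checks a worker's free memory and cores and then advances curPos modulo the number of alive workers, which suggests a round-robin walk over the workers for each waiting driver. The sketch below illustrates that idea under stated assumptions: the elided loop structure is guessed, the types Worker and Driver are hypothetical stand-ins, and no shuffling of workers is done so the result stays deterministic.

```scala
object DriverPlacementSketch {
  // Hypothetical stand-ins for WorkerInfo and DriverInfo.
  final case class Worker(id: String, var memoryFree: Int, var coresFree: Int)
  final case class Driver(id: String, mem: Int, cores: Int)

  /** Returns driverId -> workerId for every driver that could be placed. */
  def place(workers: Vector[Worker], drivers: Seq[Driver]): Map[String, String] = {
    var placed = Map.empty[String, String]
    var curPos = 0
    for (driver <- drivers if workers.nonEmpty) {
      var launched = false
      var visited = 0
      // Visit each worker at most once per driver, continuing from curPos.
      while (visited < workers.size && !launched) {
        val worker = workers(curPos)
        visited += 1
        if (worker.memoryFree >= driver.mem && worker.coresFree >= driver.cores) {
          worker.memoryFree -= driver.mem // stand-in for launchDriver's bookkeeping
          worker.coresFree -= driver.cores
          placed += (driver.id -> worker.id)
          launched = true
        }
        curPos = (curPos + 1) % workers.size
      }
    }
    placed
  }
}
```

A driver that fits on no worker is simply left out of the result, mirroring a driver that stays in the waiting queue.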
In Master.scala, schedule calls launchDriver, which sends a LaunchDriver message to the Worker. The launchDriver source:
    private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
      logInfo("Launching driver " + driver.id + " on worker " + worker.id)
      worker.addDriver(driver)
      driver.worker = Some(worker)
      worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
      driver.state = DriverState.RUNNING
    }
The LaunchDriver message itself is a case class carrying driverId and driverDesc:
    case class LaunchDriver(driverId: String, driverDesc: DriverDescription) extends DeployMessage
DriverDescription contains jarUrl, mem, cores, supervise and command:
    private[deploy] case class DriverDescription(
        jarUrl: String,
        mem: Int,
        cores: Int,
        supervise: Boolean,
        command: Command) {

      override def toString: String = s"DriverDescription (${command.mainClass})"
    }
In Master.scala, launchDriver starts the Driver and launchExecutor then starts the Executors. The launchExecutor source:
    private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
      logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
      worker.addExecutor(exec)
      worker.endpoint.send(LaunchExecutor(masterUrl,
        exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
      exec.application.driver.send(
        ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
    }
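The Master sends the Worker a LaunchDriver message to start the Driver, and uses launchExecutor to start Executors. launchExecutor has its own scheduling logic; once resources are scheduled, it likewise sends the Worker a LaunchExecutor message.
The Worker therefore receives both LaunchDriver and LaunchExecutor messages from the Master.
The Worker's internals and workflow are as follows:
Figure 5-6 Worker internals and workflow
The Master and the Workers are deployed on different machines and each runs as its own process. The Master sends two kinds of commands to a Worker: LaunchDriver and LaunchExecutor.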
• When the Worker receives LaunchDriver from the Master, it creates a DriverRunner and calls driver.start().
Worker.scala source:
    case LaunchDriver(driverId, driverDesc) =>
      ......
      val driver = new DriverRunner(
      ......
      driver.start()
• When the Worker receives LaunchExecutor from the Master, it creates an ExecutorRunner and calls manager.start().
Worker.scala source:
    case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
      ......
      val manager = new ExecutorRunner(
      ......
      manager.start()
For both the Worker's DriverRunner and its ExecutorRunner, start launches a thread internally, and that Thread handles launching the Driver or Executor. Taking the DriverRunner created when the Worker receives LaunchDriver as the example, the start method in DriverRunner.scala is:
    /** Starts a thread to run and manage the driver. */
    private[worker] def start() = {
      new Thread("DriverRunner for " + driverId) {
        override def run() {
          var shutdownHook: AnyRef = null
          try {
            shutdownHook = ShutdownHookManager.addShutdownHook { () =>
              logInfo(s"Worker shutting down, killing driver $driverId")
              kill()
            }

            // prepare driver jars and run driver
            val exitCode = prepareAndRunDriver()

            // set final state depending on if forcibly killed and process exit code
            finalState = if (exitCode == 0) {
              Some(DriverState.FINISHED)
            } else if (killed) {
              Some(DriverState.KILLED)
            } else {
              Some(DriverState.FAILED)
            }
          } catch {
            case e: Exception =>
              kill()
              finalState = Some(DriverState.ERROR)
              finalException = Some(e)
          } finally {
            if (shutdownHook != null) {
              ShutdownHookManager.removeShutdownHook(shutdownHook)
            }
          }

          // notify worker of final driver state, possible exception
          worker.send(DriverStateChanged(driverId, finalState.get, finalException))
        }
      }.start()
    }
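The order of the checks in start matters: the exit code is tested before the killed flag. A small sketch of just that decision, using hypothetical state objects (Finished, Killed, Failed) rather than Spark's DriverState enumeration:

```scala
object FinalStateSketch {
  sealed trait DriverState
  case object Finished extends DriverState
  case object Killed extends DriverState
  case object Failed extends DriverState

  /** Mirrors the if / else-if chain in the listing above. */
  def finalState(exitCode: Int, killed: Boolean): DriverState =
    if (exitCode == 0) Finished // a clean exit wins even if kill() was requested
    else if (killed) Killed
    else Failed
}
```

In this sketch a process that exits with code 0 is reported as finished regardless of the killed flag, because the exit code is checked first.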
The start method in DriverRunner.scala calls prepareAndRunDriver, which prepares the Driver's jar and launches the Driver. The prepareAndRunDriver source:
    private[worker] def prepareAndRunDriver(): Int = {
      val driverDir = createWorkingDirectory()
      val localJarFilename = downloadUserJar(driverDir)

      def substituteVariables(argument: String): String = argument match {
        case "{{WORKER_URL}}" => workerUrl
        case "{{USER_JAR}}" => localJarFilename
        case other => other
      }

      // TODO: If we add ability to submit multiple jars they should also be added here
      val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
        driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)

      runDriver(builder, driverDir, driverDesc.supervise)
    }
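substituteVariables replaces placeholder arguments with values known only on the Worker. The sketch below applies the same match to a list of arguments; the name substitutor and the example values are hypothetical, and how CommandUtils applies the function internally is not shown in the excerpt.

```scala
object SubstituteSketch {
  /** Builds a function that swaps the two placeholders and passes everything else through. */
  def substitutor(workerUrl: String, localJar: String): String => String = {
    case "{{WORKER_URL}}" => workerUrl
    case "{{USER_JAR}}" => localJar
    case other => other
  }
}
```

For example, mapping substitutor("spark://worker:1", "/tmp/app.jar") over Seq("org.example.Main", "{{USER_JAR}}") leaves the class name untouched and replaces only the jar placeholder.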
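How LaunchDriver proceeds:
• Worker process: the Worker's DriverRunner calls start, which uses a Thread to handle launching the Driver. DriverRunner creates the Driver's working directory on the local file system (a Linux directory; each run gets its own), assembles the Driver's launch Command, and starts the Driver through ProcessBuilder. All of this happens inside the Worker process.
• Driver process: the launched Driver runs in its own Driver process.
How LaunchExecutor proceeds:
• Worker process: the Worker's ExecutorRunner calls start, which uses a Thread to handle launching the Executor. ExecutorRunner creates the Executor's working directory on the local file system (a Linux directory; each run gets its own), assembles the Executor's launch Command, and starts the Executor through ProcessBuilder. All of this happens inside the Worker process.
• Executor process: the launched Executor runs in its own Executor process. The Executor lives inside an ExecutorBackend, which in Spark standalone mode is CoarseGrainedExecutorBackend (it extends ExecutorBackend). Executor and ExecutorBackend are one-to-one: each ExecutorBackend holds one Executor, and the Executor uses a thread pool to run the Tasks that Spark submits to it concurrently.
• After the Executor starts, it registers with the Driver, that is, with the SchedulerBackend.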
The CoarseGrainedExecutorBackend source shows that it holds the Executor itself:
    private[spark] class CoarseGrainedExecutorBackend(
        override val rpcEnv: RpcEnv,
        driverUrl: String,
        executorId: String,
        hostname: String,
        cores: Int,
        userClassPath: Seq[URL],
        env: SparkEnv)
      extends ThreadSafeRpcEndpoint with ExecutorBackend with Logging {

      private[this] val stopping = new AtomicBoolean(false)
      var executor: Executor = null
      @volatile var driver: Option[RpcEndpointRef] = None
      ......
Look again at the Master's schedule() method (its listing is the same one shown earlier): it first tries to place each waiting driver with launchDriver and then calls startExecutorsOnWorkers().
In the Master's schedule() method, if the Driver runs inside the cluster, it is started through launchDriver (listed earlier). launchDriver sends a message to the Worker's endpoint, which is the RPC communication mechanism.
The part of schedule() that starts Executors goes through startExecutorsOnWorkers(), which also communicates over RPC:
    private def startExecutorsOnWorkers(): Unit = {
      // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
      // in the queue, then the second app, etc.
      for (app <- waitingApps if app.coresLeft > 0) {
        val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
        // Filter out workers that don't have enough resources to launch an executor
        val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
          .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
            worker.coresFree >= coresPerExecutor.getOrElse(1))
          .sortBy(_.coresFree).reverse
        val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)

        // Now that we've decided how many cores to allocate on each worker, let's allocate them
        for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
          allocateWorkerResourceToExecutors(
            app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
        }
      }
    }
startExecutorsOnWorkers in Master.scala calls allocateWorkerResourceToExecutors to perform the actual allocation:
    private def allocateWorkerResourceToExecutors(
        app: ApplicationInfo,
        assignedCores: Int,
        coresPerExecutor: Option[Int],
        worker: WorkerInfo): Unit = {
      // If the number of cores per executor is specified, we divide the cores assigned
      // to this worker evenly among the executors with no remainder.
      // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
      val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
      val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
      for (i <- 1 to numExecutors) {
        val exec = app.addExecutor(worker, coresToAssign)
        launchExecutor(worker, exec)
        app.state = ApplicationState.RUNNING
      }
    }
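The two val lines in allocateWorkerResourceToExecutors decide how many executors a worker gets and how many cores each one receives. The arithmetic can be tried on its own; the object and method names below are hypothetical, while the two expressions are taken from the listing:

```scala
object ExecutorSplitSketch {
  /** Returns (numExecutors, coresPerLaunchedExecutor) for one worker. */
  def split(assignedCores: Int, coresPerExecutor: Option[Int]): (Int, Int) = {
    // Integer division: with 8 assigned cores and 2 cores per executor, 4 executors.
    val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1)
    // Without a per-executor setting, one executor takes all assigned cores.
    val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
    (numExecutors, coresToAssign)
  }
}
```

With coresPerExecutor unset, a worker with 8 assigned cores launches a single 8-core executor; with Some(2) it launches four 2-core executors.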
During this allocation, allocateWorkerResourceToExecutors starts each Executor through the launchExecutor method listed earlier, which sends LaunchExecutor to the Worker and ExecutorAdded to the application's driver.
The Master thus sends the Worker two messages, LaunchDriver and LaunchExecutor, and the Worker receives both. Now look at the Worker:
    private[deploy] class Worker(
        override val rpcEnv: RpcEnv,
        webUiPort: Int,
        cores: Int,
        memory: Int,
        masterRpcAddresses: Array[RpcAddress],
        endpointName: String,
        workDirPath: String = null,
        val conf: SparkConf,
        val securityMgr: SecurityManager)
      extends ThreadSafeRpcEndpoint with Logging {
The Worker takes part in RPC communication by extending ThreadSafeRpcEndpoint, a trait to which other RPC objects can send messages:
    private[spark] trait ThreadSafeRpcEndpoint extends RpcEndpoint
The Worker receives messages in its receive method. It works like a mailbox that is polled in a loop for new mail, with each message playing the role of a letter.
    override def receive: PartialFunction[Any, Unit] = synchronized {
      case SendHeartbeat =>
        ......
      case WorkDirCleanup =>
        ......
      case MasterChanged(masterRef, masterWebUiUrl) =>
        ......
      case ReconnectWorker(masterUrl) =>
        ......
      case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
        ......
      case executorStateChanged @ ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
        ......
      case KillExecutor(masterUrl, appId, execId) =>
        ......
      case LaunchDriver(driverId, driverDesc) =>
        ......
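The mailbox analogy can be made concrete with a plain PartialFunction[Any, Unit], the same type that receive has in the listing. This is a toy illustration with hypothetical message classes and a hypothetical WorkerLike class; Spark's real dispatch happens inside its RPC layer, which is not shown here.

```scala
object MailboxSketch {
  // Hypothetical, simplified messages (the real ones carry more fields).
  final case class LaunchDriver(driverId: String)
  final case class LaunchExecutor(appId: String, execId: Int)

  final class WorkerLike {
    val log = scala.collection.mutable.ArrayBuffer.empty[String]

    // Same shape as an endpoint's receive: one case per message type.
    val receive: PartialFunction[Any, Unit] = {
      case LaunchDriver(id) => log += s"driver:$id"
      case LaunchExecutor(app, exec) => log += s"executor:$app/$exec"
    }

    /** Delivers queued messages one at a time; messages with no matching case are skipped. */
    def drain(inbox: Seq[Any]): Unit =
      inbox.foreach(m => if (receive.isDefinedAt(m)) receive(m))
  }
}
```

Because receive is a partial function, adding support for a new message type only means adding another case.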
The LaunchDriver case in Worker.scala's receive method, which starts the Driver:
    case LaunchDriver(driverId, driverDesc) =>
      logInfo(s"Asked to launch driver $driverId")
      val driver = new DriverRunner(
        conf,
        driverId,
        workDir,
        sparkHome,
        driverDesc.copy(command = Worker.maybeUpdateSSLSettings(driverDesc.command, conf)),
        self,
        workerUri,
        securityMgr)
      drivers(driverId) = driver
      driver.start()

      coresUsed += driverDesc.cores
      memoryUsed += driverDesc.mem
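The LaunchDriver handler first writes a log entry; the incoming message always carries the driverId. When a Driver or Executor is launched, the process hosting it is guaranteed to satisfy the memory requirement but not necessarily the cores requirement: the actual number of cores may be higher or lower than requested.
logInfo is a thin wrapper that logs only when the INFO level is enabled:
    protected def logInfo(msg: => String) {
      if (log.isInfoEnabled) log.info(msg)
    }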
Back in the LaunchDriver handler, a DriverRunner is created from the driverId, the working directory (workDir), the Spark installation path (sparkHome), driverDesc, workerUri, securityMgr and so on. The statement drivers(driverId) = driver stores the runner in drivers, a HashMap whose key is the Driver's ID and whose value is the DriverRunner. A Worker may launch many Drivers, so each DriverRunner is managed by its ID. Internally, DriverRunner uses a thread to launch a separate Driver process; the DriverRunner acts as the Worker-side proxy for that process.
    val drivers = new HashMap[String, DriverRunner]
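Back in the LaunchDriver handler in Worker.scala: before starting the driver, the Worker saves the DriverRunner in its in-memory data structure and then calls driver.start(). After the start, the cores and memory consumed are added to coresUsed and memoryUsed.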
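The bookkeeping described here (register the runner under its ID, then add its resources to the usage counters) can be sketched as follows. Runner and Bookkeeping are hypothetical simplifications, and the onFinished method is an assumption about how the counters would be decremented symmetrically when a driver ends.

```scala
import scala.collection.mutable.HashMap

object WorkerBookkeepingSketch {
  // Hypothetical stand-in for DriverRunner: only the fields needed for accounting.
  final case class Runner(driverId: String, cores: Int, mem: Int)

  final class Bookkeeping {
    val drivers = new HashMap[String, Runner]
    var coresUsed = 0
    var memoryUsed = 0

    def onLaunch(r: Runner): Unit = {
      drivers(r.driverId) = r // register first, keyed by driver ID
      coresUsed += r.cores
      memoryUsed += r.mem
    }

    // Assumed counterpart: forget the runner and give its resources back.
    def onFinished(driverId: String): Unit =
      drivers.remove(driverId).foreach { r =>
        coresUsed -= r.cores
        memoryUsed -= r.mem
      }
  }
}
```

Keying the map by ID is what allows a later state-change message, which carries only the driver ID, to find the right runner.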
Next, the DriverRunner.scala source. DriverRunner manages the execution of one Driver, including automatically restarting it on failure. When the Driver runs in cluster mode, adding the supervise option enables the automatic restart:
    private[deploy] class DriverRunner(
        conf: SparkConf,
        val driverId: String,
        val workDir: File,
        val sparkHome: File,
        val driverDesc: DriverDescription,
        val worker: RpcEndpointRef,
        val workerUrl: String,
        val securityManager: SecurityManager)
      extends Logging {
The DriverDescription passed in (its listing appears earlier) has a member supervise, a Boolean. If it is set to true, the Worker takes responsibility for restarting the Driver when it fails in cluster mode.
Back in the LaunchDriver handler in Worker.scala: once the DriverRunner is constructed, its start method is called, and a dedicated thread manages the Driver, including launching it and shutting it down. In Thread("DriverRunner for " + driverId), the string "DriverRunner for <driverId>" is the thread's name. Thread is a Java class; Scala interoperates with Java seamlessly.
The start source is the DriverRunner.scala listing shown earlier: the thread registers a shutdown hook, calls prepareAndRunDriver(), derives finalState from the exit code and the killed flag, and finally sends DriverStateChanged to the Worker.
DriverRunner's start method calls prepareAndRunDriver (listed earlier) to prepare the driver jar and launch the driver. It creates the working directory, downloads the user jar, builds the ProcessBuilder and then calls runDriver.
prepareAndRunDriver calls createWorkingDirectory to create the Driver's working directory. The path is built with Java's new File and created with mkdirs; if the directory does not exist and cannot be created, an IOException is thrown. Creating a directory on the local file system rarely fails unless, for example, the disk is full. The createWorkingDirectory source:
    private def createWorkingDirectory(): File = {
      val driverDir = new File(workDir, driverId)
      if (!driverDir.exists() && !driverDir.mkdirs()) {
        throw new IOException("Failed to create directory " + driverDir)
      }
      driverDir
    }
Back in prepareAndRunDriver in DriverRunner.scala, downloadUserJar downloads the jar. The code we write is packaged as a jar, and this step copies the user's jar to the local machine. The jar typically sits in shared storage such as HDFS and is fetched from there.
The downloadUserJar source:
    private def downloadUserJar(driverDir: File): String = {
      val jarFileName = new URI(driverDesc.jarUrl).getPath.split("/").last
      val localJarFile = new File(driverDir, jarFileName)
      if (!localJarFile.exists()) { // May already exist if running multiple workers on one node
        logInfo(s"Copying user jar ${driverDesc.jarUrl} to $localJarFile")
        Utils.fetchFile(
          driverDesc.jarUrl,
          driverDir,
          conf,
          securityManager,
          SparkHadoopUtil.get.newConfiguration(conf),
          System.currentTimeMillis(),
          useCache = false)
        if (!localJarFile.exists()) { // Verify copy succeeded
          throw new IOException(
            s"Can not find expected jar $jarFileName which should have been loaded in $driverDir")
        }
      }
      localJarFile.getAbsolutePath
    }
downloadUserJar calls fetchFile, which here relies on Hadoop to download the file from HDFS. At submission time the jar is uploaded to HDFS once, and every node can then download it from there. The Utils.fetchFile source:
    def fetchFile(
        url: String,
        targetDir: File,
        conf: SparkConf,
        securityMgr: SecurityManager,
        hadoopConf: Configuration,
        timestamp: Long,
        useCache: Boolean) {
      val fileName = decodeFileNameInURI(new URI(url))
      val targetFile = new File(targetDir, fileName)
      val fetchCacheEnabled = conf.getBoolean("spark.files.useFetchCache", defaultValue = true)
      if (useCache && fetchCacheEnabled) {
        val cachedFileName = s"${url.hashCode}${timestamp}_cache"
        val lockFileName = s"${url.hashCode}${timestamp}_lock"
        val localDir = new File(getLocalDir(conf))
        val lockFile = new File(localDir, lockFileName)
        val lockFileChannel = new RandomAccessFile(lockFile, "rw").getChannel()
        // Only one executor entry.
        // The FileLock is only used to control synchronization for executors download file,
        // it's always safe regardless of lock type (mandatory or advisory).
        val lock = lockFileChannel.lock()
        val cachedFile = new File(localDir, cachedFileName)
        try {
          if (!cachedFile.exists()) {
            doFetchFile(url, localDir, cachedFileName, conf, securityMgr, hadoopConf)
          }
        } finally {
          lock.release()
          lockFileChannel.close()
        }
        copyFile(
          url,
          cachedFile,
          targetFile,
          conf.getBoolean("spark.files.overwrite", false)
        )
      } else {
        doFetchFile(url, targetDir, fileName, conf, securityMgr, hadoopConf)
      }
      ......
Back in prepareAndRunDriver in DriverRunner.scala: driverDesc.command states which class to run, so buildProcessBuilder constructs the process around that entry class, and runDriver then launches the Driver.
    private[worker] def prepareAndRunDriver(): Int = {
      ......
      val builder = CommandUtils.buildProcessBuilder(driverDesc.command, securityManager,
        driverDesc.mem, sparkHome.getAbsolutePath, substituteVariables)

      runDriver(builder, driverDir, driverDesc.supervise)
    }
The runDriver method in DriverRunner.scala redirects the process's stdout and stderr to files, so the run can be inspected through those log files. It ends by calling runCommandWithRetry:
    private def runDriver(builder: ProcessBuilder, baseDir: File, supervise: Boolean): Int = {
      builder.directory(baseDir)
      def initialize(process: Process): Unit = {
        // Redirect stdout and stderr to files
        val stdout = new File(baseDir, "stdout")
        CommandUtils.redirectStream(process.getInputStream, stdout)

        val stderr = new File(baseDir, "stderr")
        val formattedCommand = builder.command.asScala.mkString("\"", "\" \"", "\"")
        val header = "Launch Command: %s\n%s\n\n".format(formattedCommand, "=" * 40)
        Files.append(header, stderr, StandardCharsets.UTF_8)
        CommandUtils.redirectStream(process.getErrorStream, stderr)
      }
      runCommandWithRetry(ProcessBuilderLike(builder), initialize, supervise)
    }
The argument passed to runCommandWithRetry is ProcessBuilderLike(builder). This creates a ProcessBuilderLike whose overridden start() method executes processBuilder.start(). The ProcessBuilderLike source:
    private[deploy] object ProcessBuilderLike {
      def apply(processBuilder: ProcessBuilder): ProcessBuilderLike = new ProcessBuilderLike {
        override def start(): Process = processBuilder.start()
        override def command: Seq[String] = processBuilder.command().asScala
      }
    }
The runCommandWithRetry source:
    private[worker] def runCommandWithRetry(
        command: ProcessBuilderLike, initialize: Process => Unit, supervise: Boolean): Int = {
      var exitCode = -1
      // Time to wait between submission retries.
      var waitSeconds = 1
      // A run of this many seconds resets the exponential back-off.
      val successfulRunDuration = 5
      var keepTrying = !killed

      while (keepTrying) {
        logInfo("Launch Command: " + command.command.mkString("\"", "\" \"", "\""))

        synchronized {
          if (killed) { return exitCode }
          process = Some(command.start())
          initialize(process.get)
        }

        val processStart = clock.getTimeMillis()
        exitCode = process.get.waitFor()

        // check if attempting another run
        keepTrying = supervise && exitCode != 0 && !killed
        if (keepTrying) {
          if (clock.getTimeMillis() - processStart > successfulRunDuration * 1000) {
            waitSeconds = 1
          }
          logInfo(s"Command exited with status $exitCode, re-launching after $waitSeconds s.")
          sleeper.sleep(waitSeconds)
          waitSeconds = waitSeconds * 2 // exponential back-off
        }
      }

      exitCode
    }
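The first launch may not succeed, so runCommandWithRetry runs in a loop. command.start() launches the process and process.get.waitFor() blocks until it exits. If supervise is true, exitCode is non-zero and the driver was not killed, keepTrying is set to true and the loop launches the process again. The wait between attempts doubles each time (1 s, 2 s, 4 s, ...), and a run that lasted longer than successfulRunDuration (5 s) resets it to 1 s.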
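The back-off rule in runCommandWithRetry can be checked without launching any process. The sketch below replays only the waitSeconds arithmetic for a list of failed runs; the names BackoffSketch and waits are hypothetical, and run durations are given in whole seconds rather than measured with a clock.

```scala
object BackoffSketch {
  /** For each failed run (duration in seconds), the wait before the next launch. */
  def waits(runDurationsSec: List[Int], successfulRunDuration: Int = 5): List[Int] = {
    var waitSeconds = 1
    runDurationsSec.map { ran =>
      if (ran > successfulRunDuration) waitSeconds = 1 // a long run resets the back-off
      val sleepFor = waitSeconds
      waitSeconds *= 2 // exponential back-off for the next failure
      sleepFor
    }
  }
}
```

Three quick failures followed by a 10-second run and another quick failure give waits of 1, 2, 4, 1 and 2 seconds.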
Back in the LaunchDriver handler in Worker.scala:
    case LaunchDriver(driverId, driverDesc) =>
      ......
      drivers(driverId) = driver
      driver.start()
driver.start() launches the Driver. The relevant part of start:
    private[worker] def start() = {
      new Thread("DriverRunner for " + driverId) {
        override def run() {
          ......
          } catch {
            case e: Exception =>
              kill()
              finalState = Some(DriverState.ERROR)
              finalException = Some(e)
          } finally {
            if (shutdownHook != null) {
              ShutdownHookManager.removeShutdownHook(shutdownHook)
            }
          }

          // notify worker of final driver state, possible exception
          worker.send(DriverStateChanged(driverId, finalState.get, finalException))
        }
      }.start()
    }
When start reaches finalState, the Driver has ended, possibly because something went wrong (for example, it was KILLED or it FAILED). The DriverRunner then calls worker.send to notify its own Worker with a DriverStateChanged message. The matching case in Worker.scala:
    case driverStateChanged @ DriverStateChanged(driverId, state, exception) =>
      handleDriverStateChanged(driverStateChanged)
It calls handleDriverStateChanged, whose source is:
    private[worker] def handleDriverStateChanged(driverStateChanged: DriverStateChanged): Unit = {
      val driverId = driverStateChanged.driverId
      val exception = driverStateChanged.exception
      val state = driverStateChanged.state
      state match {
        case DriverState.ERROR =>
          logWarning(s"Driver $driverId failed with unrecoverable exception: ${exception.get}")
        case DriverState.FAILED =>
          logWarning(s"Driver $driverId exited with failure")
        case DriverState.FINISHED =>
          logInfo(s"Driver $driverId exited successfully")
        case DriverState.KILLED =>
          logInfo(s"Driver $driverId was killed by user")
        case _ =>
          logDebug(s"Driver $driverId changed state to $state")
      }
      sendToMaster(driverStateChanged)
      val driver = drivers.remove(driverId).get
      finishedDrivers(driverId) = driver
      trimFinishedDriversIfNecessary()
      memoryUsed -= driver.driverDesc.mem
      coresUsed -= driver.driverDesc.cores
    }
handleDriverStateChanged in Worker.scala logs a message appropriate to each state. The key line is sendToMaster(driverStateChanged), which forwards the driverStateChanged message to the Master to report that the Driver process has ended. The sendToMaster source:
    private def sendToMaster(message: Any): Unit = {
      master match {
        case Some(masterRef) => masterRef.send(message)
        case None =>
          logWarning(
            s"Dropping $message because the connection to master has not yet been established")
      }
    }
On the Master side, when DriverStateChanged arrives and the state is any of DriverState.ERROR, DriverState.FINISHED, DriverState.KILLED or DriverState.FAILED, the Master removes the Driver from its in-memory data structures and clears its record from the persistence engine.
    case DriverStateChanged(driverId, state, exception) =>
      state match {
        case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>
          removeDriver(driverId, state, exception)
        case _ =>
          throw new Exception(s"Received unexpected state update for driver $driverId: $state")
      }
removeDriver cleans up the related data and then calls schedule() again:
    private def removeDriver(
        driverId: String,
        finalState: DriverState,
        exception: Option[Exception]) {
      drivers.find(d => d.id == driverId) match {
        case Some(driver) =>
          logInfo(s"Removing driver: $driverId")
          drivers -= driver
          if (completedDrivers.size >= RETAINED_DRIVERS) {
            val toRemove = math.max(RETAINED_DRIVERS / 10, 1)
            completedDrivers.trimStart(toRemove)
          }
          completedDrivers += driver
          persistenceEngine.removeDriver(driver)
          driver.state = finalState
          driver.exception = exception
          driver.worker.foreach(w => w.removeDriver(driver))
          schedule()
        case None =>
          logWarning(s"Asked to remove unknown driver: $driverId")
      }
    }
Next comes launching an Executor. The LaunchExecutor case in Worker.scala:
    case LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) =>
      if (masterUrl != activeMasterUrl) {
        logWarning("Invalid Master (" + masterUrl + ") attempted to launch executor.")
      } else {
        try {
          logInfo("Asked to launch executor %s/%d for %s".format(appId, execId, appDesc.name))

          // Create the executor's working directory
          val executorDir = new File(workDir, appId + "/" + execId)
          if (!executorDir.mkdirs()) {
            throw new IOException("Failed to create directory " + executorDir)
          }

          // Create local dirs for the executor. These are passed to the executor via the
          // SPARK_EXECUTOR_DIRS environment variable, and deleted by the Worker when the
          // application finishes.
          val appLocalDirs = appDirectories.getOrElse(appId,
            Utils.getOrCreateLocalRootDirs(conf).map { dir =>
              val appDir = Utils.createDirectory(dir, namePrefix = "executor")
              Utils.chmod700(appDir)
              appDir.getAbsolutePath()
            }.toSeq)
          appDirectories(appId) = appLocalDirs
          val manager = new ExecutorRunner(
            appId,
            execId,
            appDesc.copy(command = Worker.maybeUpdateSSLSettings(appDesc.command, conf)),
            cores_,
            memory_,
            self,
            workerId,
            host,
            webUi.boundPort,
            publicAddress,
            sparkHome,
            executorDir,
            workerUri,
            conf,
            appLocalDirs, ExecutorState.RUNNING)
          executors(appId + "/" + execId) = manager
          manager.start()
          coresUsed += cores_
          memoryUsed += memory_
          sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None))
        } catch {
          case e: Exception =>
            logError(s"Failed to launch executor $appId/$execId for ${appDesc.name}.", e)
            if (executors.contains(appId + "/" + execId)) {
              executors(appId + "/" + execId).kill()
              executors -= appId + "/" + execId
            }
            sendToMaster(ExecutorStateChanged(appId, execId, ExecutorState.FAILED,
              Some(e.toString), None))
        }
      }
Go straight to manager.start(): it starts a Thread whose run method calls fetchAndRunExecutor:
    private[worker] def start() {
      workerThread = new Thread("ExecutorRunner for " + fullId) {
        override def run() { fetchAndRunExecutor() }
      }
      workerThread.start()
      // Shutdown hook that kills actors on shutdown.
      shutdownHook = ShutdownHookManager.addShutdownHook { () =>
        // It's possible that we arrive here before calling `fetchAndRunExecutor`, then `state` will
        // be `ExecutorState.RUNNING`. In this case, we should set `state` to `FAILED`.
        if (state == ExecutorState.RUNNING) {
          state = ExecutorState.FAILED
        }
        killProcess(Some("Worker shutting down"))
      }
    }
The fetchAndRunExecutor source:
    private def fetchAndRunExecutor() {
      try {
        // Launch the process
        val builder = CommandUtils.buildProcessBuilder(appDesc.command, new SecurityManager(conf),
          memory, sparkHome.getAbsolutePath, substituteVariables)
        val command = builder.command()
        val formattedCommand = command.asScala.mkString("\"", "\" \"", "\"")
        logInfo(s"Launch command: $formattedCommand")

        builder.directory(executorDir)
        builder.environment.put("SPARK_EXECUTOR_DIRS", appLocalDirs.mkString(File.pathSeparator))
        // In case we are running this from within the Spark Shell, avoid creating a "scala"
        // parent process for the executor command
        builder.environment.put("SPARK_LAUNCH_WITH_SCALA", "0")

        // Add webUI log urls
        val baseUrl =
          if (conf.getBoolean("spark.ui.reverseProxy", false)) {
            s"/proxy/$workerId/logPage/?appId=$appId&executorId=$execId&logType="
          } else {
            s"http://$publicAddress:$webUiPort/logPage/?appId=$appId&executorId=$execId&logType="
          }
        builder.environment.put("SPARK_LOG_URL_STDERR", s"${baseUrl}stderr")
        builder.environment.put("SPARK_LOG_URL_STDOUT", s"${baseUrl}stdout")

        process = builder.start()
        val header = "Spark Executor Command: %s\n%s\n\n".format(
          formattedCommand, "=" * 40)

        // Redirect its stdout and stderr to files
        val stdout = new File(executorDir, "stdout")
        stdoutAppender = FileAppender(process.getInputStream, stdout, conf)

        val stderr = new File(executorDir, "stderr")
        Files.write(header, stderr, StandardCharsets.UTF_8)
        stderrAppender = FileAppender(process.getErrorStream, stderr, conf)

        // Wait for it to exit; executor may exit with code 0 (when driver instructs it to shutdown)
        // or with nonzero exit code
        val exitCode = process.waitFor()
        state = ExecutorState.EXITED
        val message = "Command exited with code " + exitCode
        worker.send(ExecutorStateChanged(appId, execId, state, Some(message), Some(exitCode)))
      } catch {
        case interrupted: InterruptedException =>
          logInfo("Runner thread for executor " + fullId + " interrupted")
          state = ExecutorState.KILLED
          killProcess(None)
        case e: Exception =>
          logError("Error running executor", e)
          state = ExecutorState.FAILED
          killProcess(Some(e.toString))
      }
    }
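fetchAndRunExecutor closely resembles the Driver launch process. To start an Executor, it first builds a ProcessBuilder with CommandUtils.buildProcessBuilder and then calls builder.start(). When the process exits, it sends an ExecutorStateChanged message to the Worker.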
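The same sequence (build a ProcessBuilder, start it, wait for the child to exit, report the outcome) can be sketched with the plain JDK API. This is a minimal sketch under stated assumptions rather than Spark's implementation: LaunchSketch, launchAndWait, the string states, and the SKETCH_LAUNCHED_BY variable are hypothetical, and the returned tuple stands in for the ExecutorStateChanged message.

```scala
import java.io.File

object LaunchSketch {
  // Quote every argument, mirroring the mkString call in fetchAndRunExecutor.
  def formatCommand(command: Seq[String]): String =
    command.mkString("\"", "\" \"", "\"")

  // Build, start, and wait for a child process.
  // Returns (state, message) in place of sending a message to a Worker.
  def launchAndWait(command: Seq[String], workDir: File): (String, String) = {
    try {
      val builder = new ProcessBuilder(command: _*)
      builder.directory(workDir)
      // Hypothetical variable, shown only to illustrate environment setup.
      builder.environment.put("SKETCH_LAUNCHED_BY", "LaunchSketch")
      // Send the child's output to a file so it cannot block on a full pipe.
      builder.redirectErrorStream(true)
      builder.redirectOutput(new File(workDir, "stdout"))

      val process = builder.start()
      val exitCode = process.waitFor() // blocks until the child exits
      ("EXITED", "Command exited with code " + exitCode)
    } catch {
      case _: InterruptedException => ("KILLED", "interrupted while waiting")
      case e: Exception => ("FAILED", e.toString)
    }
  }
}
```

Because launchAndWait blocks in waitFor(), a caller would normally run it on a dedicated thread, which is the role the thread created in start() plays for fetchAndRunExecutor.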
In Worker.scala, the ExecutorStateChanged message is handled as follows:
case executorStateChanged @ ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
  handleExecutorStateChanged(executorStateChanged)
Stepping into handleExecutorStateChanged, the call sendToMaster(executorStateChanged) forwards the executorStateChanged message to the Master:
private[worker] def handleExecutorStateChanged(executorStateChanged: ExecutorStateChanged):
  Unit = {
  sendToMaster(executorStateChanged)
  val state = executorStateChanged.state
  if (ExecutorState.isFinished(state)) {
    val appId = executorStateChanged.appId
    val fullId = appId + "/" + executorStateChanged.execId
    val message = executorStateChanged.message
    val exitStatus = executorStateChanged.exitStatus
    executors.get(fullId) match {
      case Some(executor) =>
        logInfo("Executor " + fullId + " finished with state " + state +
          message.map(" message " + _).getOrElse("") +
          exitStatus.map(" exitStatus " + _).getOrElse(""))
        executors -= fullId
        finishedExecutors(fullId) = executor
        trimFinishedExecutorsIfNecessary()
        coresUsed -= executor.cores
        memoryUsed -= executor.memory
      case None =>
        logInfo("Unknown Executor " + fullId + " finished with state " + state +
          message.map(" message " + _).getOrElse("") +
          exitStatus.map(" exitStatus " + _).getOrElse(""))
    }
    maybeCleanupApplication(appId)
  }
}
} // closes the enclosing Worker class
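The bookkeeping that handleExecutorStateChanged performs for a finished executor (move it from executors to finishedExecutors and release its cores and memory) can be isolated in a small sketch. Everything here is a simplified stand-in: Ledger, Exec, and the string states are hypothetical, and the set of terminal states is an assumption about what ExecutorState.isFinished accepts.

```scala
import scala.collection.mutable

object WorkerBookkeepingSketch {
  // Simplified stand-in for the executor record kept by the Worker.
  case class Exec(cores: Int, memory: Int)

  class Ledger(var coresUsed: Int, var memoryUsed: Int) {
    val executors = mutable.HashMap[String, Exec]()
    val finishedExecutors = mutable.LinkedHashMap[String, Exec]()

    // Assumed terminal states for this sketch.
    private val finished = Set("KILLED", "FAILED", "LOST", "EXITED")

    // Returns true when a known executor was retired and its resources released.
    def onStateChanged(fullId: String, state: String): Boolean = {
      if (!finished.contains(state)) {
        false // non-terminal states leave the ledger unchanged
      } else {
        executors.remove(fullId) match {
          case Some(e) =>
            finishedExecutors(fullId) = e
            coresUsed -= e.cores
            memoryUsed -= e.memory
            true
          case None =>
            false // unknown executor: nothing to release
        }
      }
    }
  }
}
```

Removing the entry before releasing its resources keeps a repeated notification for the same executor from subtracting the cores and memory twice.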
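Next, look at Master.scala, where the Master receives the ExecutorStateChanged message. When the state changes, the Master also sends an ExecutorUpdated message to the Driver through exec.application.driver.send. The flow is essentially the same as the one for launching the Driver. The ExecutorStateChanged handler is as follows: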
case ExecutorStateChanged(appId, execId, state, message, exitStatus) =>
  val execOption = idToApp.get(appId).flatMap(app => app.executors.get(execId))
  execOption match {
    case Some(exec) =>
      val appInfo = idToApp(appId)
      val oldState = exec.state
      exec.state = state

      if (state == ExecutorState.RUNNING) {
        assert(oldState == ExecutorState.LAUNCHING,
          s"executor $execId state transfer from $oldState to RUNNING is illegal")
        appInfo.resetRetryCount()
      }

      exec.application.driver.send(ExecutorUpdated(execId, state, message, exitStatus, false))

      if (ExecutorState.isFinished(state)) {
        // Remove this executor from the worker and app
        logInfo(s"Removing executor ${exec.fullId} because it is $state")
        // If an application has already finished, preserve its
        // state to display its information properly on the UI
        if (!appInfo.isFinished) {
          appInfo.removeExecutor(exec)
        }
        exec.worker.removeExecutor(exec)

        val normalExit = exitStatus == Some(0)
        // Only retry certain number of times so we don't go into an infinite loop.
        // Important note: this code path is not exercised by tests, so be very careful when
        // changing this `if` condition.
        if (!normalExit
            && appInfo.incrementRetryCount() >= MAX_EXECUTOR_RETRIES
            && MAX_EXECUTOR_RETRIES >= 0) { // < 0 disables this application-killing path
          val execs = appInfo.executors.values
          if (!execs.exists(_.state == ExecutorState.RUNNING)) {
            logError(s"Application ${appInfo.desc.name} with ID ${appInfo.id} failed " +
              s"${appInfo.retryCount} times; removing it")
            removeApplication(appInfo, ApplicationState.FAILED)
          }
        }
      }
      schedule()
    case None =>
      logWarning(s"Got status update for unknown executor $appId/$execId")
  }
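The condition that decides whether the Master gives up on an application is easier to read as a pure function. The sketch below restates it under stated assumptions: RetrySketch and shouldRemoveApp are hypothetical names, the retry count is passed in as the value after incrementing (in the handler, incrementRetryCount() runs only for abnormal exits), and anyExecutorRunning summarises the execs.exists check.

```scala
object RetrySketch {
  // Pure restatement of the guard around removeApplication in the handler above.
  // A negative maxRetries disables the application-killing path.
  def shouldRemoveApp(
      exitStatus: Option[Int],
      retryCountAfterIncrement: Int,
      maxRetries: Int,
      anyExecutorRunning: Boolean): Boolean = {
    val normalExit = exitStatus.contains(0)
    !normalExit &&
      retryCountAfterIncrement >= maxRetries &&
      maxRetries >= 0 &&
      !anyExecutorRunning
  }
}
```

With a limit of 10, for example, shouldRemoveApp(Some(1), 10, 10, false) is true, while the same call with anyExecutorRunning set to true is false: an application is only removed once none of its executors is still running.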