Why the driverUrl that CoarseGrainedExecutorBackend communicates with is the DriverEndpoint, not the ClientEndpoint


Background: when the cluster starts up, the Master and the Workers are started.

When a user submits an application:

1. First a SparkContext is created, which in turn creates the DAGScheduler, TaskSchedulerImpl and SparkDeploySchedulerBackend, and then starts the TaskSchedulerImpl.

2. When TaskSchedulerImpl starts, it starts its SchedulerBackend; in standalone mode that is the SparkDeploySchedulerBackend.

3. When SparkDeploySchedulerBackend starts, it first calls start() on its parent class CoarseGrainedSchedulerBackend. That start() instantiates the inner class DriverEndpoint, which is an RpcEndpoint; when its onStart() method runs it schedules Option(self).foreach(_.send(ReviveOffers)) to execute periodically, so the endpoint keeps sending the ReviveOffers message to itself. ReviveOffers is an empty object; receiving it triggers makeOffers(), which "Make(s) fake resource offers on all executors".
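
To make the self-messaging concrete, here is a minimal, self-contained sketch of the pattern (a toy model for illustration only, not the actual Spark source; ToyDriverEndpoint and its method signatures are invented): onStart() schedules a periodic ReviveOffers message to itself, and receive() turns that message into a makeOffers() call.

import java.util.concurrent.{Executors, TimeUnit}

// Toy stand-in for Spark's ReviveOffers case object.
case object ReviveOffers

class ToyDriverEndpoint {
  private val reviveThread = Executors.newSingleThreadScheduledExecutor()

  // Mirrors the idea of DriverEndpoint.onStart: periodically send ReviveOffers to ourselves.
  def onStart(reviveIntervalMs: Long = 1000L): Unit = {
    reviveThread.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = receive(ReviveOffers)
    }, 0L, reviveIntervalMs, TimeUnit.MILLISECONDS)
  }

  // Mirrors the idea of DriverEndpoint.receive: ReviveOffers triggers makeOffers().
  def receive(msg: Any): Unit = msg match {
    case ReviveOffers => makeOffers()
    case other        => println(s"unhandled message: $other")
  }

  // In Spark this is where the "fake resource offers" are made on all executors.
  private def makeOffers(): Unit = println("offering resources to executors")
}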

4. After SparkDeploySchedulerBackend.start() has started the DriverEndpoint by calling its parent CoarseGrainedSchedulerBackend's start(), it creates an AppClient and starts it. AppClient.start() instantiates the inner class ClientEndpoint(rpcEnv), which is itself an RPC endpoint; once instantiated, its onStart() method runs automatically and registers the application with the Master (in reality it sends the message to every Master and, as soon as it connects successfully to one of them, cancels the attempts against the others): masterRef.send(RegisterApplication(appDescription, self)). Note that appDescription here carries the application's concrete details, including the command, and that self is the ClientEndpoint itself.
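
As a rough, self-contained sketch of the client side of this handshake (a toy model, not the AppClient source; the types and the send callback below are invented for illustration), the essential point is that the ClientEndpoint puts a reference to itself into RegisterApplication, so whoever answers will answer the ClientEndpoint:

// Toy stand-ins; the real types live in org.apache.spark.deploy.
case class ToyAppDescription(name: String, command: String)
case class ToyRegisterApplication(appDescription: ToyAppDescription, driver: String)

class ToyClientEndpoint(masterUrls: Seq[String], appDescription: ToyAppDescription) {
  @volatile private var registered = false

  // Mirrors the idea of ClientEndpoint.onStart -> registerWithMaster:
  // send RegisterApplication (carrying `self`) to every configured Master.
  def onStart(send: (String, ToyRegisterApplication) => Unit): Unit = {
    for (masterUrl <- masterUrls if !registered) {
      send(masterUrl, ToyRegisterApplication(appDescription, driver = "ClientEndpoint (self)"))
    }
  }

  // In the real code, once one Master acknowledges, the attempts against the others are cancelled.
  def markRegistered(): Unit = { registered = true }
}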

5. The Master is itself a ThreadSafeRpcEndpoint. After receiving the RegisterApplication(description, driver) message from the ClientEndpoint, it calls createApplication(description, driver) and registerApplication(app) to create and register the Application, sends the registration-succeeded message back to the driver via driver.send(RegisteredApplication(app.id, self)) (note: this driver is actually the ClientEndpoint!), and then calls schedule().
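
A matching toy sketch of the Master side (again invented for illustration, not the real Master code; the driverRef send-handle stands in for the RpcEndpointRef carried in the message) shows why the acknowledgement lands on the ClientEndpoint:

// Toy messages; the real ones are defined in org.apache.spark.deploy.DeployMessages.
case class RegisterApp(appName: String, driverRef: Any => Unit) // driverRef: the ClientEndpoint's ref
case class RegisteredApp(appId: String)

class ToyMaster {
  private var nextAppNumber = 0

  // Mirrors the idea of Master.receive for RegisterApplication: create and register the app,
  // acknowledge to the endpoint ref carried in the message, then call schedule().
  def receive(msg: Any): Unit = msg match {
    case RegisterApp(appName, driverRef) =>
      val appId = s"app-$nextAppNumber-$appName"
      nextAppNumber += 1
      driverRef(RegisteredApp(appId)) // this "driver" ref is the ClientEndpoint, not the DriverEndpoint
      schedule()
    case other =>
      println(s"unhandled message: $other")
  }

  private def schedule(): Unit = println("scheduling executors on workers")
}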

6. When the ClientEndpoint receives RegisteredApplication(appId_, masterRef), it sets master = Some(masterRef) and calls listener.connected(appId.get) (the latter in effect invokes the AppClientListener implementation SparkDeploySchedulerBackend.connected(appId.get)). At this point the ClientEndpoint knows both the ID of the successfully registered Application and the Master's address.
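
Paraphrased from the ClientEndpoint.receive handler (abridged and possibly not verbatim; check the Spark source for the exact version), the case clause looks roughly like this:

case RegisteredApplication(appId_, masterRef) =>
  appId.set(appId_)
  registered.set(true)
  master = Some(masterRef)
  listener.connected(appId.get)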

7. After registering the Application, the Master goes on to call schedule(). schedule() calls startExecutorsOnWorkers(), which calls scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps) and allocateWorkerResourceToExecutors(app, assignedCores(pos), coresPerExecutor, usableWorkers(pos)); allocateWorkerResourceToExecutors() then calls launchExecutor(worker, exec). Look carefully at launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): it sends the following two messages (note that exec.application.driver below is the ClientEndpoint reference the Master captured in step 5):

 worker.endpoint.send(LaunchExecutor(masterUrl, exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))

exec.application.driver.send(ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))

8. The first message sent by the Master goes to the Worker and tells it to launch an executor. The Worker is itself an RPC endpoint; after receiving LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) it creates and starts an ExecutorRunner ("Manages the execution of one executor process."), and then notifies the Master of the executor's state change by calling sendToMaster(ExecutorStateChanged(appId, execId, manager.state, None, None)).
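
A compact, self-contained sketch of the Worker side (a toy model only; ToyWorker, the simplified message types, and the plain Thread standing in for ExecutorRunner are all invented for illustration):

// Toy stand-ins for the deploy messages; the real ones carry more fields.
case class ToyLaunchExecutor(appId: String, execId: Int, command: Seq[String], cores: Int, memory: Int)
case class ToyExecutorStateChanged(appId: String, execId: Int, state: String)

class ToyWorker(sendToMaster: Any => Unit) {
  def receive(msg: Any): Unit = msg match {
    case ToyLaunchExecutor(appId, execId, command, cores, memory) =>
      // The real Worker builds an ExecutorRunner here; its start() eventually calls
      // fetchAndRunExecutor() and forks the executor JVM (see step 10 below).
      val runner = new Thread(new Runnable {
        override def run(): Unit = println(s"would exec: ${command.mkString(" ")}")
      })
      runner.start()
      sendToMaster(ToyExecutorStateChanged(appId, execId, state = "RUNNING"))
    case other =>
      println(s"unhandled message: $other")
  }
}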

9. The second message sent by the Master goes to the ClientEndpoint and tells it that an executor has been obtained. After receiving ExecutorAdded(id: Int, workerId: String, hostPort: String, cores: Int, memory: Int), the ClientEndpoint calls listener.executorAdded(fullId, workerId, hostPort, cores, memory), which in effect invokes the AppClientListener implementation SparkDeploySchedulerBackend.executorAdded(fullId: String, workerId: String, hostPort: String, cores: Int, memory: Int). At this point the executor has been registered successfully.
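
The corresponding ClientEndpoint.receive case, paraphrased (abridged, may differ slightly from the exact source), is roughly:

case ExecutorAdded(id, workerId, hostPort, cores, memory) =>
  val fullId = appId + "/" + id
  listener.executorAdded(fullId, workerId, hostPort, cores, memory)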

 

10. Let's look more closely at how the Worker launches the executor. The Worker creates an ExecutorRunner and calls its start() method; start() calls fetchAndRunExecutor(), which contains the following code:

val builder = CommandUtils.buildProcessBuilder(appDesc.command, new SecurityManager(conf), memory, sparkHome.getAbsolutePath, substituteVariables)

process = builder.start()

This is exactly where the new process gets built and started: everything about the process to be launched is carried inside this builder. Let's look at what information it holds and where it comes from.

The appDesc.command used by this ExecutorRunner comes from the appDesc field of the LaunchExecutor(masterUrl, appId, execId, appDesc, cores_, memory_) case class message the Worker received from the Master. The Master's appDesc in turn comes from the appDescription field of the RegisterApplication(appDescription: ApplicationDescription, driver: RpcEndpointRef) case class message it received from the ClientEndpoint, and the ClientEndpoint's appDescription comes from the appDesc that SparkDeploySchedulerBackend passed in when the AppClient was instantiated. That appDesc contains the command:

val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory, command, appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor, initialExecutorLimit)

val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend", args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)

We can see that command contains both the name of the class to launch, CoarseGrainedExecutorBackend, and the args parameters; args looks like this:

   val args = Seq(
      "--driver-url", driverUrl,
      "--executor-id", "{{EXECUTOR_ID}}",
      "--hostname", "{{HOSTNAME}}",
      "--cores", "{{CORES}}",
      "--app-id", "{{APP_ID}}",
      "--worker-url", "{{WORKER_URL}}")

The driverUrl used here is built as follows:

   // The endpoint for executors to talk to us
    val driverUrl = RpcEndpointAddress(
      sc.conf.get("spark.driver.host"),
      sc.conf.get("spark.driver.port").toInt,
      CoarseGrainedSchedulerBackend.ENDPOINT_NAME).toString

We can see that the endpoint name embedded in driverUrl is CoarseGrainedSchedulerBackend.ENDPOINT_NAME, whose value is "CoarseGrainedScheduler". Now everything falls into place: the driverUrl that the CoarseGrainedExecutorBackend process receives at startup, and will later communicate with, is set right here by SparkDeploySchedulerBackend, and it points at the endpoint named "CoarseGrainedScheduler". Because the DriverEndpoint is registered in the rpcEnv under the name "CoarseGrainedScheduler" while the ClientEndpoint is registered under the name "AppClient", the object that CoarseGrainedExecutorBackend communicates with is the DriverEndpoint, not the ClientEndpoint.
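
To see what that driverUrl actually looks like, here is a minimal sketch assuming the "spark://<name>@<host>:<port>" format produced by RpcEndpointAddress.toString (ToyRpcEndpointAddress, the host name and the port below are made up for illustration):

case class ToyRpcEndpointAddress(host: String, port: Int, name: String) {
  // Assumed to mirror RpcEndpointAddress.toString: spark://<endpoint-name>@<host>:<port>
  override def toString: String = s"spark://$name@$host:$port"
}

object DriverUrlDemo extends App {
  // "driver-host" and 50001 stand in for spark.driver.host / spark.driver.port.
  val driverUrl = ToyRpcEndpointAddress("driver-host", 50001, "CoarseGrainedScheduler").toString
  println(driverUrl) // spark://CoarseGrainedScheduler@driver-host:50001
}

When CoarseGrainedExecutorBackend starts, it resolves this URL to an endpoint reference and registers itself with it; since the name inside the URL is "CoarseGrainedScheduler", the endpoint it reaches is the DriverEndpoint.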

Note: the ClientEndpoint is registered in the rpcEnv under the name 'AppClient', as the following source shows:

  def start() {
    // Just launch an rpcEndpoint; it will call back into the listener.
    endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
  }

Note: the DriverEndpoint is registered in the rpcEnv under the name 'CoarseGrainedScheduler', as the following source shows:

driverEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint(properties))

The ENDPOINT_NAME here comes from:

private[spark] object CoarseGrainedSchedulerBackend {
  val ENDPOINT_NAME = "CoarseGrainedScheduler"
}

 

 
