Spark Source Code Analysis: The Master Registration Mechanism
1 Worker Registers with the Master
1.1 The Worker starts and calls registerWithMaster to register with the Master
When a worker starts, it calls the registerWithMaster method, which does the following:
# Set the registration flag to false
# Try to register with all masters
# Schedule a background task that periodically sends a ReregisterWithMaster message to itself; if a previous attempt has already succeeded, the next attempt is a no-op
private def registerWithMaster() {
  registrationRetryTimer match {
    // No retry timer yet, so no registration attempt is in progress: start one
    case None =>
      // Initial registration state is false
      registered = false
      // Try to register with all masters
      registerMasterFutures = tryRegisterAllMasters()
      // Reset the connection attempt counter
      connectionAttemptCount = 0
      // Schedule a background task that periodically sends ReregisterWithMaster;
      // if registration has already succeeded, the next attempt does nothing
      registrationRetryTimer = Some(forwordMessageScheduler.scheduleAtFixedRate(
        new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            Option(self).foreach(_.send(ReregisterWithMaster))
          }
        },
        INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
        INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
        TimeUnit.SECONDS))
    // A registrationRetryTimer already exists, so do nothing
    case Some(_) =>
      logInfo("Not spawning another attempt to register with the master, since there is an" +
        " attempt scheduled already.")
  }
}
private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  masterRpcAddresses.map { masterAddress =>
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = {
        try {
          logInfo("Connecting to master " + masterAddress + "...")
          // Build an RpcEndpointRef to the master, used to send it messages and requests
          val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
          // Register with this specific master
          registerWithMaster(masterEndpoint)
        } catch {
          case ie: InterruptedException => // Cancelled
          case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
        }
      }
    })
  }
}
private def registerWithMaster(masterEndpoint: RpcEndpointRef): Unit = {
  // Send a RegisterWorker request to the master
  masterEndpoint.ask[RegisterWorkerResponse](RegisterWorker(
    workerId, host, port, self, cores, memory, workerWebUiUrl))
    .onComplete {
      // On success, handle the response with handleRegisterResponse
      case Success(msg) =>
        Utils.tryLogNonFatalError {
          handleRegisterResponse(msg)
        }
      // On failure, log and exit
      case Failure(e) =>
        logError(s"Cannot register with master: ${masterEndpoint.address}", e)
        System.exit(1)
    }(ThreadUtils.sameThread)
}
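For reference, the RegisterWorker request sent above is a plain case class in DeployMessages. The sketch below matches the fields used in the call and reflects the Spark 2.x layout; the validation in the body is an assumption based on that era of the codebase and may differ in other versions.

// Sketch of the RegisterWorker message (Spark 2.x DeployMessages)
case class RegisterWorker(
    id: String,
    host: String,
    port: Int,
    worker: RpcEndpointRef,
    cores: Int,
    memory: Int,
    workerWebUiUrl: String)
  extends DeployMessage {
  // Sanity checks on the advertised address (assumed from Spark 2.x source)
  Utils.checkHost(host, "Required hostname")
  assert(port > 0)
}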
1.2 The Master receives the Worker's RegisterWorker request and registers the worker
# Check whether the worker has already registered; if so, reply with a RegisterWorkerFailed message
# Remove any workers in the DEAD state that the master still tracks for the same host and port
# Check whether the RpcAddress -> WorkerInfo map already contains this RpcAddress; if it does and the old worker is in the UNKNOWN state, remove the old worker (these two steps live in registerWorker; see the sketch after the handler below)
# Add the worker to the worker-related lists and maps the Master maintains
# Reply to the Worker with a RegisteredWorker message to indicate that registration succeeded
# Call schedule() to run resource scheduling, so the new worker can be put to work
// If this node is in STANDBY state, reply MasterInStandby
if (state == RecoveryState.STANDBY) {
  context.reply(MasterInStandby)
} else if (idToWorker.contains(id)) {
  // The worker id -> WorkerInfo map already contains this worker id,
  // so reply with RegisterWorkerFailed for the duplicate id
  context.reply(RegisterWorkerFailed("Duplicate worker ID"))
} else {
  // This node is the active master and the worker id is new:
  // create the WorkerInfo and register it; on success persist it,
  // reply RegisteredWorker and schedule, otherwise reply RegisterWorkerFailed
  val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
    workerRef, workerWebUiUrl)
  if (registerWorker(worker)) {
    persistenceEngine.addWorker(worker)
    context.reply(RegisteredWorker(self, masterWebUiUrl))
    schedule()
  } else {
    val workerAddress = worker.endpoint.address
    logWarning("Worker registration failed. Attempted to re-register worker at same " +
      "address: " + workerAddress)
    context.reply(RegisterWorkerFailed("Attempted to re-register worker at same address: "
      + workerAddress))
  }
}
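The DEAD/UNKNOWN handling described in the steps above lives in the private registerWorker helper that the handler calls. Below is a simplified sketch consistent with the Spark 2.x source; details such as the reverse-proxy handling are omitted, and minor signatures may differ across versions.

// Simplified sketch of Master.registerWorker (Spark 2.x era)
private def registerWorker(worker: WorkerInfo): Boolean = {
  // Drop any DEAD workers previously registered from the same host and port
  workers.filter { w =>
    (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
  }.foreach { w =>
    workers -= w
  }

  val workerAddress = worker.endpoint.address
  if (addressToWorker.contains(workerAddress)) {
    val oldWorker = addressToWorker(workerAddress)
    if (oldWorker.state == WorkerState.UNKNOWN) {
      // An UNKNOWN worker re-registering means it was restarted during master
      // recovery: remove the stale entry and accept the new registration
      removeWorker(oldWorker)
    } else {
      logInfo("Attempted to re-register worker at same address: " + workerAddress)
      return false
    }
  }

  // Add the worker to the collections the master maintains
  workers += worker
  idToWorker(worker.id) = worker
  addressToWorker(workerAddress) = worker
  true
}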
1.3 The Worker receives the registration result from the Master and calls handleRegisterResponse
# On a RegisteredWorker message: update the registration flag, start a background task that periodically sends heartbeats to the master, and send a WorkerLatestState message reporting the worker's latest state (its executors and drivers) to the master
# On a RegisterWorkerFailed message: exit the process
private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = synchronized {
  msg match {
    // A RegisteredWorker message means registration succeeded
    case RegisteredWorker(masterRef, masterWebUiUrl) =>
      logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
      // Update the registration flag
      registered = true
      changeMaster(masterRef, masterWebUiUrl)
      // Schedule a background task that periodically sends heartbeats to the master
      forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          self.send(SendHeartbeat)
        }
      }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
      // If cleanup is enabled, schedule a background task that periodically
      // sends WorkDirCleanup to clean old application directories
      if (CLEANUP_ENABLED) {
        logInfo(
          s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
        forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            self.send(WorkDirCleanup)
          }
        }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
      }
      // Build an ExecutorDescription for every executor this worker holds
      val execs = executors.values.map { e =>
        new ExecutorDescription(e.appId, e.execId, e.cores, e.state)
      }
      // Report the worker's latest state (executors and drivers) to the master
      masterRef.send(WorkerLatestState(workerId, execs.toList, drivers.keys.toSeq))

    // A RegisterWorkerFailed message means registration failed
    case RegisterWorkerFailed(message) =>
      // Exit if we never managed to register
      if (!registered) {
        logError("Worker registration failed: " + message)
        System.exit(1)
      }

    // The master is in standby, so do nothing
    case MasterInStandby =>
      // Ignore. Master not yet ready.
  }
}
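A note on the heartbeat interval used above: in the Spark 2.x Worker, HEARTBEAT_MILLIS is derived from the master's worker timeout so that several heartbeats fit inside one timeout window. The expression below is taken from that era of the codebase and may differ in other versions.

// Heartbeat interval: a quarter of spark.worker.timeout (default 60s),
// so roughly four heartbeats are sent per master-side timeout window
private val HEARTBEAT_MILLIS = conf.getLong("spark.worker.timeout", 60) * 1000 / 4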
2 Driver Registers with the Master
When an application is submitted with spark-submit, the SparkSubmit class is invoked. SparkSubmit calls prepareSubmitEnvironment to prepare the submission environment, which is where the cluster manager is determined; if the deploy mode is standalone cluster mode and the REST submission gateway is not used, it selects the org.apache.spark.deploy.Client class and passes it a launch argument, as the sketch below shows.
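A simplified sketch of that branch in SparkSubmit.prepareSubmitEnvironment, based on the Spark 2.x source; the surrounding option handling (driver memory, cores, supervise, etc.) is omitted, and details vary by version.

// In standalone cluster mode, pick the submission gateway:
// the REST client if enabled, otherwise the legacy deploy.Client
if (args.isStandaloneCluster) {
  if (args.useRest) {
    childMainClass = "org.apache.spark.deploy.rest.RestSubmissionClient"
  } else {
    // Legacy gateway: run deploy.Client with the "launch" command
    childMainClass = "org.apache.spark.deploy.Client"
    childArgs += "launch"
    childArgs += (args.master, args.primaryResource, args.mainClass)
  }
}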
2.1 The client sends a driver-submission request to the Master
When the Client starts, its onStart method runs and, depending on whether the given command is launch or kill, sends the corresponding message.
For launch:
it ultimately calls ayncSendToMasterAndForwardReply to send a RequestSubmitDriver message to the master.
For kill:
it ultimately calls ayncSendToMasterAndForwardReply to send a RequestKillDriver message to the master.
driverArgs.cmd match {
  case "launch" =>
    val mainClass = "org.apache.spark.deploy.worker.DriverWrapper"

    val classPathConf = "spark.driver.extraClassPath"
    val classPathEntries = sys.props.get(classPathConf).toSeq.flatMap { cp =>
      cp.split(java.io.File.pathSeparator)
    }

    val libraryPathConf = "spark.driver.extraLibraryPath"
    val libraryPathEntries = sys.props.get(libraryPathConf).toSeq.flatMap { cp =>
      cp.split(java.io.File.pathSeparator)
    }

    val extraJavaOptsConf = "spark.driver.extraJavaOptions"
    val extraJavaOpts = sys.props.get(extraJavaOptsConf)
      .map(Utils.splitCommandString).getOrElse(Seq.empty)

    val sparkJavaOpts = Utils.sparkJavaOpts(conf)
    val javaOpts = sparkJavaOpts ++ extraJavaOpts
    // The command that will launch the driver: DriverWrapper wraps the user's main class
    val command = new Command(mainClass,
      Seq("{{WORKER_URL}}", "{{USER_JAR}}", driverArgs.mainClass) ++ driverArgs.driverOptions,
      sys.env, classPathEntries, libraryPathEntries, javaOpts)

    // Describe the driver and ask the master to submit it
    val driverDescription = new DriverDescription(
      driverArgs.jarUrl,
      driverArgs.memory,
      driverArgs.cores,
      driverArgs.supervise,
      command)
    ayncSendToMasterAndForwardReply[SubmitDriverResponse](
      RequestSubmitDriver(driverDescription))

  case "kill" =>
    val driverId = driverArgs.driverId
    ayncSendToMasterAndForwardReply[KillDriverResponse](RequestKillDriver(driverId))
}
2.2 The Master receives the client's RequestSubmitDriver message and registers the driver
# Create the driver (see the createDriver sketch after the handler below)
# Add the driver to the persistence engine
# Add the driver to the driver-related collections the master maintains
# Call schedule to start scheduling resources
# Reply to the Client with a SubmitDriverResponse message
case RequestSubmitDriver(description) =>
  // If this master is not active, reply with an error
  if (state != RecoveryState.ALIVE) {
    val msg = s"${Utils.BACKUP_STANDALONE_MASTER_PREFIX}: $state. " +
      "Can only accept driver submissions in ALIVE state."
    context.reply(SubmitDriverResponse(self, false, None, msg))
  } else {
    logInfo("Driver submitted " + description.command.mainClass)
    // Create the driver
    val driver = createDriver(description)
    // Add the driver to the persistence engine
    persistenceEngine.addDriver(driver)
    // Add the driver to the drivers and waitingDrivers collections
    waitingDrivers += driver
    drivers.add(driver)
    // Start scheduling
    schedule()
    // Reply with a success message
    context.reply(SubmitDriverResponse(self, true, Some(driver.id),
      s"Driver successfully submitted as ${driver.id}"))
  }
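createDriver itself is a small factory. A sketch consistent with the Spark 2.x source follows; newDriverId generates an id from the submission timestamp, and the exact signatures should be treated as version-specific.

// Sketch of Master.createDriver (Spark 2.x era): wrap the description in a
// DriverInfo stamped with the submission time and a freshly generated driver id
private def createDriver(desc: DriverDescription): DriverInfo = {
  val now = System.currentTimeMillis()
  val date = new Date(now)
  new DriverInfo(now, newDriverId(date), desc, date)
}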
3 Application Registers with the Master
3.1 Build the StandaloneAppClient and register the application with the Master
In standalone mode, the Driver negotiates resources with the Master through the StandaloneSchedulerBackend.
# When SparkContext initializes, it calls createTaskScheduler to create the TaskSchedulerImpl and the StandaloneSchedulerBackend
# It then calls TaskSchedulerImpl's start method to start the TaskScheduler
// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)
_taskScheduler.start()
# Starting the TaskScheduler first starts the StandaloneSchedulerBackend
override def start() {
  backend.start()
  if (!isLocal && conf.getBoolean("spark.speculation", false)) {
    logInfo("Starting speculative execution thread")
    speculationScheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = Utils.tryOrStopSparkContext(sc) {
        checkSpeculatableTasks()
      }
    }, SPECULATION_INTERVAL_MS, SPECULATION_INTERVAL_MS, TimeUnit.MILLISECONDS)
  }
}
# Starting the StandaloneSchedulerBackend creates a StandaloneAppClient and starts it
override def start() {
  // ... omitted
  val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory,
    command, appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor,
    initialExecutorLimit)
  client = new StandaloneAppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
  client.start()
  // ... omitted
}
# Starting the StandaloneAppClient sets up the RPC environment and registers a ClientEndpoint for communication; the ClientEndpoint's onStart method is then invoked
def start() {
  // Just launch an rpcEndpoint; it will call back into the listener.
  endpoint.set(rpcEnv.setupEndpoint("AppClient", new ClientEndpoint(rpcEnv)))
}
# onStart calls registerWithMaster, which calls tryRegisterAllMasters to send a RegisterApplication message to every master, registering the application
override def onStart(): Unit = {
  try {
    registerWithMaster(1)
  } catch {
    case e: Exception =>
      logWarning("Failed to connect to master", e)
      markDisconnected()
      stop()
  }
}
private def registerWithMaster(nthRetry: Int) {
  // Fire registration attempts at all masters in parallel
  registerMasterFutures.set(tryRegisterAllMasters())
  registrationRetryTimer.set(registrationRetryThread.schedule(new Runnable {
    override def run(): Unit = {
      if (registered.get) {
        // Already registered: cancel the outstanding attempts and shut down the pool
        registerMasterFutures.get.foreach(_.cancel(true))
        registerMasterThreadPool.shutdownNow()
      } else if (nthRetry >= REGISTRATION_RETRIES) {
        // Out of retries: give up and mark the client dead
        markDead("All masters are unresponsive! Giving up.")
      } else {
        // Not registered yet: cancel the old attempts and retry
        registerMasterFutures.get.foreach(_.cancel(true))
        registerWithMaster(nthRetry + 1)
      }
    }
  }, REGISTRATION_TIMEOUT_SECONDS, TimeUnit.SECONDS))
}
private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  for (masterAddress <- masterRpcAddresses) yield {
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = try {
        if (registered.get) {
          return
        }
        logInfo("Connecting to master " + masterAddress.toSparkURL + "...")
        val masterRef = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
        masterRef.send(RegisterApplication(appDescription, self))
      } catch {
        case ie: InterruptedException => // Cancelled
        case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
      }
    })
  }
}
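The retry loop in registerWithMaster is bounded by two constants defined in StandaloneAppClient. The values below are from the Spark 2.x source and should be treated as version-specific.

// How long to wait for a registration round before retrying, and how many rounds to try
private val REGISTRATION_TIMEOUT_SECONDS = 20
private val REGISTRATION_RETRIES = 3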
3.2 The Master registers the application
# Create the application (see the createApplication sketch after the handler below)
# If the application has already registered, return immediately
# Register the application with the master, i.e. add it to the application-related collections the master maintains and put it in the waiting queue
# Add the application to the persistence engine
# Send a RegisteredApplication message back to the driver (the StandaloneAppClient) to indicate registration is complete
# Call schedule to start scheduling
case RegisterApplication(description, driver) =>
  // A non-leader (standby) master must not create or register applications
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else {
    logInfo("Registering app " + description.name)
    // Create the application
    val app = createApplication(description, driver)
    // Register the application
    registerApplication(app)
    logInfo("Registered app " + description.name + " with ID " + app.id)
    // Add the application to the persistence engine
    persistenceEngine.addApplication(app)
    // Send a RegisteredApplication message back to the driver to indicate
    // registration is complete
    driver.send(RegisteredApplication(app.id, self))
    schedule()
  }
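As with drivers, createApplication is a small factory. The sketch below is consistent with the Spark 2.x source; newApplicationId and defaultCores are names from that codebase and may differ across versions.

// Sketch of Master.createApplication (Spark 2.x era): wrap the description in
// an ApplicationInfo stamped with the submission time and a fresh application id
private def createApplication(desc: ApplicationDescription, driver: RpcEndpointRef)
  : ApplicationInfo = {
  val now = System.currentTimeMillis()
  val date = new Date(now)
  val appId = newApplicationId(date)
  new ApplicationInfo(now, appId, desc, date, driver, defaultCores)
}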
private def registerApplication(app: ApplicationInfo): Unit = {
  // Get the RpcAddress of the app's driver
  val appAddress = app.driver.address
  // If the app has already registered, return immediately
  if (addressToApp.contains(appAddress)) {
    logInfo("Attempted to re-register application at same address: " + appAddress)
    return
  }

  applicationMetricsSystem.registerSource(app.appSource)
  // Add the app to the application collections the master maintains,
  // recording it in each of the lookup maps
  apps += app
  idToApp(app.id) = app
  endpointToApp(app.driver) = app
  addressToApp(appAddress) = app
  // Put the app in the waiting queue
  waitingApps += app
  if (reverseProxy) {
    webUi.addProxyTargets(app.id, app.desc.appUiUrl)
  }
}
3.3 The StandaloneAppClient receives the RegisteredApplication message
# Set the application's id
# Set the registration flag to true
# Record the active master
# Notify the StandaloneAppClientListener that the application has connected
case RegisteredApplication(appId_, masterRef) =>
  // Record the application id assigned by the master
  appId.set(appId_)
  // Mark registration as successful
  registered.set(true)
  // Record the active master
  master = Some(masterRef)
  // Notify the listener that the application is connected
  listener.connected(appId.get)
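For context, the listener notified above is the StandaloneSchedulerBackend, which implements the StandaloneAppClientListener trait. The sketch below shows the shape of that trait, simplified from the Spark 2.x source; the exact signatures (in particular executorRemoved) vary by version.

// Sketch of StandaloneAppClientListener (simplified, Spark 2.x era)
private[spark] trait StandaloneAppClientListener {
  def connected(appId: String): Unit
  // Disconnection may be temporary, e.g. while failing over to a new master
  def disconnected(): Unit
  // An application death is an unrecoverable failure
  def dead(reason: String): Unit
  def executorAdded(
      fullId: String, workerId: String, hostPort: String, cores: Int, memory: Int): Unit
  def executorRemoved(fullId: String, message: String, exitStatus: Option[Int]): Unit
}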