Spark 1.6 Source Code: The Master Failover (Active/Standby Switch) Mechanism
**Note: in standalone mode, Spark's master can be configured for HA, so that when the active master node goes down a standby master is switched to active. There are two failover mechanisms:**

**(1) Filesystem-based: after the active master dies, you switch over to the standby master node manually.**

**(2) ZooKeeper-based: the master fails over automatically.**
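Both mechanisms come down to which persistence engine and leader-election agent the master starts with, controlled by `spark.deploy.recoveryMode` (plus `spark.deploy.zookeeper.url`/`spark.deploy.zookeeper.dir` or `spark.deploy.recoveryDirectory`), typically set through `SPARK_DAEMON_JAVA_OPTS`. The following is an abridged sketch of how `Master.onStart` wires this up in Spark 1.6; it is paraphrased (the `CUSTOM` recovery mode is omitted), so check the real Master source for the exact details:

```scala
// Abridged from Master.onStart (Spark 1.6): the recovery mode decides which
// PersistenceEngine and LeaderElectionAgent this master uses.
val (persistenceEngine_, leaderElectionAgent_) = RECOVERY_MODE match {
  case "ZOOKEEPER" =>
    // ZooKeeper stores the state and elects the leader automatically
    logInfo("Persisting recovery state to ZooKeeper")
    val zkFactory = new ZooKeeperRecoveryModeFactory(conf, serializer)
    (zkFactory.createPersistenceEngine(), zkFactory.createLeaderElectionAgent(this))
  case "FILESYSTEM" =>
    // State goes to spark.deploy.recoveryDirectory; the switch-over is manual
    val fsFactory = new FileSystemRecoveryModeFactory(conf, serializer)
    (fsFactory.createPersistenceEngine(), fsFactory.createLeaderElectionAgent(this))
  case _ =>
    // No HA: nothing is persisted and this master is always the leader
    (new BlackHolePersistenceEngine(), new MonarchyLeaderAgent(this))
}
persistenceEngine = persistenceEngine_
leaderElectionAgent = leaderElectionAgent_
```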
The rough flow of the recovery:

1. After the Master receives the ElectedLeader message, it uses the persistence engine to read storedApps, storedDrivers, and storedWorkers (the engine is sketched after this list). If any of the three collections is non-empty, recovery is needed and the logic below runs.
2. beginRecovery is called to start the recovery: the stored data is re-registered, each app and worker is reset to the UNKNOWN state, and a MasterChanged message is sent to each app's driver and to each worker.
3. As the Master receives the replies, it marks the corresponding apps and workers as usable again.
4. Finally, completeRecovery is called to finish the data-recovery part of the failover.
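Step 1's `readPersistedData` comes from the `PersistenceEngine` abstraction that both the ZooKeeper and the filesystem engines implement. Here is a trimmed-down sketch of that interface (Spark 1.6, `org.apache.spark.deploy.master.PersistenceEngine`; abridged, so treat it as approximate):

```scala
abstract class PersistenceEngine {
  // The three primitives a concrete engine (ZooKeeper, filesystem) must provide
  def persist(name: String, obj: Object): Unit
  def unpersist(name: String): Unit
  def read[T: ClassTag](prefix: String): Seq[T]

  // Entries are keyed by prefix, e.g. "app_<id>", "driver_<id>", "worker_<id>"
  final def addApplication(app: ApplicationInfo): Unit = persist("app_" + app.id, app)

  // What the newly elected leader calls in step 1
  final def readPersistedData(
      rpcEnv: RpcEnv): (Seq[ApplicationInfo], Seq[DriverInfo], Seq[WorkerInfo]) = {
    rpcEnv.deserialize { () =>
      (read[ApplicationInfo]("app_"), read[DriverInfo]("driver_"), read[WorkerInfo]("worker_"))
    }
  }
}
```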
- 1. `case ElectedLeader` (classPath: org.apache.spark.deploy.master.Master)
- The entry point of the recovery
```scala
// Elected as the leader
case ElectedLeader =>
  // Use the persistence engine to read the three collections; if any of them
  // has data, recovery is needed, otherwise this master is immediately ALIVE
  val (storedApps, storedDrivers, storedWorkers) = persistenceEngine.readPersistedData(rpcEnv)
  state = if (storedApps.isEmpty && storedDrivers.isEmpty && storedWorkers.isEmpty) {
    RecoveryState.ALIVE
  } else {
    RecoveryState.RECOVERING
  }
  logInfo("I have been elected leader! New state: " + state)
  if (state == RecoveryState.RECOVERING) {
    // Begin the recovery
    beginRecovery(storedApps, storedDrivers, storedWorkers)
    recoveryCompletionTask = forwardMessageThread.schedule(new Runnable {
      override def run(): Unit = Utils.tryLogNonFatalError {
        // Send CompleteRecovery to ourselves; it is handled by the case below
        self.send(CompleteRecovery)
      }
    }, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)
  }

// Finish the recovery
case CompleteRecovery => completeRecovery()
```

Note that the scheduled `CompleteRecovery` message is a fallback: it fires after `WORKER_TIMEOUT_MS` even if some components never reply, while step 3 shows how recovery can finish earlier once every reply has arrived.
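The states this handler moves between come from the small `RecoveryState` enumeration (`org.apache.spark.deploy.master.RecoveryState` in Spark 1.6):

```scala
// A master is STANDBY until elected, then either ALIVE right away (nothing to
// recover) or RECOVERING -> COMPLETING_RECOVERY -> ALIVE
private[master] object RecoveryState extends Enumeration {
  type MasterState = Value

  val STANDBY, ALIVE, RECOVERING, COMPLETING_RECOVERY = Value
}
```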
- 2. `beginRecovery` (classPath: org.apache.spark.deploy.master.Master)
```scala
private def beginRecovery(storedApps: Seq[ApplicationInfo], storedDrivers: Seq[DriverInfo],
    storedWorkers: Seq[WorkerInfo]) {
  // Re-register each app in memory, set its state to UNKNOWN, and notify its
  // driver of the master change; the driver-side client (AppClient) replies
  // with MasterChangeAcknowledged
  for (app <- storedApps) {
    logInfo("Trying to recover app: " + app.id)
    try {
      registerApplication(app)
      app.state = ApplicationState.UNKNOWN
      app.driver.send(MasterChanged(self, masterWebUiUrl))
    } catch {
      case e: Exception => logInfo("App " + app.id + " had exception on reconnect")
    }
  }

  // Add each driver to the in-memory cache
  for (driver <- storedDrivers) {
    // Here we just read in the list of drivers. Any drivers associated with now-lost workers
    // will be re-launched when we detect that the worker is missing.
    drivers += driver
  }

  // Re-register each worker in memory, set its state to UNKNOWN, and notify the
  // worker of the master change; the worker replies with
  // WorkerSchedulerStateResponse, carrying the executors and drivers it runs
  for (worker <- storedWorkers) {
    logInfo("Trying to recover worker: " + worker.id)
    try {
      registerWorker(worker)
      worker.state = WorkerState.UNKNOWN
      worker.endpoint.send(MasterChanged(self, masterWebUiUrl))
    } catch {
      case e: Exception => logInfo("Worker " + worker.id + " had exception on reconnect")
    }
  }
}
```
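For reference, this is roughly how the other endpoints react to `MasterChanged` in Spark 1.6; the snippets below are paraphrased from `AppClient` and `Worker` from memory, not copied verbatim:

```scala
// Driver side, in the client endpoint of o.a.s.deploy.client.AppClient:
case MasterChanged(masterRef, masterWebUiUrl) =>
  logInfo("Master has changed, new master is at " + masterRef.address.toSparkURL)
  master = Some(masterRef)
  masterRef.send(MasterChangeAcknowledged(appId))

// Worker side, in o.a.s.deploy.worker.Worker:
case MasterChanged(masterRef, masterWebUiUrl) =>
  logInfo("Master has changed, new master is at " + masterRef.address.toSparkURL)
  changeMaster(masterRef, masterWebUiUrl)
  // Report every executor and driver currently running on this worker
  val execs = executors.values.map(e =>
    new ExecutorDescription(e.appId, e.execId, e.cores, e.state))
  masterRef.send(WorkerSchedulerStateResponse(workerId, execs.toList, drivers.keys.toSeq))
```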
- 3. `case MasterChangeAcknowledged` / `case WorkerSchedulerStateResponse` (classPath: org.apache.spark.deploy.master.Master)
- Handling the replies from drivers and workers
```scala
// Once the driver has acknowledged, move the app back to WAITING
case MasterChangeAcknowledged(appId) =>
  idToApp.get(appId) match {
    case Some(app) =>
      logInfo("Application has been re-registered: " + appId)
      app.state = ApplicationState.WAITING
    case None =>
      logWarning("Master change ack from unknown app: " + appId)
  }
  // If no cached worker or app is still in the UNKNOWN state, recovery can finish
  if (canCompleteRecovery) { completeRecovery() }

// Once the worker has replied, mark it ALIVE again
case WorkerSchedulerStateResponse(workerId, executors, driverIds) =>
  idToWorker.get(workerId) match {
    case Some(worker) =>
      logInfo("Worker has been re-registered: " + workerId)
      worker.state = WorkerState.ALIVE
      // Out of the executors the worker sent over, keep only those that belong
      // to an app being recovered
      val validExecutors = executors.filter(exec => idToApp.get(exec.appId).isDefined)
      // Re-link app, executor, and worker in memory
      for (exec <- validExecutors) {
        val app = idToApp.get(exec.appId).get
        val execInfo = app.addExecutor(worker, exec.cores, Some(exec.execId))
        worker.addExecutor(execInfo)
        execInfo.copyState(exec)
      }
      // driverIds holds the drivers from storedDrivers; find the ones that need
      // recovery and bind them to this worker
      for (driverId <- driverIds) {
        drivers.find(_.id == driverId).foreach { driver =>
          driver.worker = Some(worker)
          driver.state = DriverState.RUNNING
          worker.drivers(driverId) = driver
        }
      }
    case None =>
      logWarning("Scheduler state from unknown worker: " + workerId)
  }
  // Same check as above
  if (canCompleteRecovery) { completeRecovery() }
```
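The `canCompleteRecovery` guard used in both branches is essentially a one-line count over the caches (paraphrased from the Spark 1.6 Master):

```scala
// Recovery can finish early once no worker or app is still UNKNOWN
private def canCompleteRecovery =
  workers.count(_.state == WorkerState.UNKNOWN) == 0 &&
  apps.count(_.state == ApplicationState.UNKNOWN) == 0
```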
- 4. `completeRecovery()` (classPath: org.apache.spark.deploy.master.Master)
- Finishes the failover: removes the components still in the UNKNOWN state and re-dispatches drivers that no worker claimed
```scala
private def completeRecovery() {
  // Ensure "only-once" recovery semantics using a short synchronization period.
  // If we are not in the RECOVERING state, do nothing.
  if (state != RecoveryState.RECOVERING) { return }
  state = RecoveryState.COMPLETING_RECOVERY

  // Kill off any workers and apps that didn't respond to us in the previous
  // step (they are still UNKNOWN)
  workers.filter(_.state == WorkerState.UNKNOWN).foreach(removeWorker)
  apps.filter(_.state == ApplicationState.UNKNOWN).foreach(finishApplication)

  // Reschedule drivers which were not claimed by any workers
  drivers.filter(_.worker.isEmpty).foreach { d =>
    logWarning(s"Driver ${d.id} was not found after master recovery")
    if (d.desc.supervise) {
      logWarning(s"Re-launching ${d.id}")
      relaunchDriver(d)
    } else {
      removeDriver(d.id, DriverState.ERROR, None)
      logWarning(s"Did not re-launch ${d.id} because it was not supervised")
    }
  }

  // This master is now fully ALIVE; resume scheduling
  state = RecoveryState.ALIVE
  schedule()
  logInfo("Recovery complete - resuming operations!")
}
```
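For supervised drivers, `relaunchDriver` just detaches the driver from its dead worker and puts it back into the waiting queue so that `schedule()` can place it on a live worker (paraphrased from the Spark 1.6 Master):

```scala
private def relaunchDriver(driver: DriverInfo) {
  // Detach from the lost worker and let schedule() assign a new one
  driver.worker = None
  driver.state = DriverState.RELAUNCHING
  waitingDrivers += driver
  schedule()
}
```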