Master的注册机制和状态管理详解

来源:互联网 发布:erlang python 编辑:程序博客网 时间:2024/06/12 00:23

一 、master对其他组件注册的处理

1, master接受注册的对象主要就是:driver,application,worker;需要补充说明executor不会注册给master,executor是注册给driver中的schedulerbackbend的;

2, worker是再启动后主动向master注册的,所以如果在生产环境下加入新的worker到已经正在运行的Spark集群上,此时不需要重新启动spark集群就能够使用新加入的worker以提升处理能力;

case RegisterWorker(    id, workerHost, workerPort, workerRef, cores, memory, workerUiPort, publicAddress)=> {  logInfo("Registeringworker %s:%d with %d cores, %s RAM".format(    workerHost, workerPort, cores, Utils.megabytesToString(memory)))  if (state == RecoveryState.STANDBY) {    context.reply(MasterInStandby)  } else if (idToWorker.contains(id)) {    context.reply(RegisterWorkerFailed("Duplicate worker ID"))  } else {    val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,      workerRef, workerUiPort, publicAddress)    if (registerWorker(worker)) {      persistenceEngine.addWorker(worker)      context.reply(RegisteredWorker(self, masterWebUiUrl))      schedule()    } else {      val workerAddress = worker.endpoint.address      logWarning("Workerregistration failed. Attempted to re-register worker at same " +        "address:" + workerAddress)      context.reply(RegisterWorkerFailed("Attempted to re-register worker at sameaddress: "        + workerAddress))    }  }

3, master在接收到worker注册的请求后,首先会判断一下当前的master是否是standby的模式,如果是的话就不处理;然后会判断当前master的内存数据结构idToWorker中是否已经有该worker的注册信息,如果有的话此时不会重复注册;

4, master如果决定接收注册的worker,首先会创建workerInfo对象来保存注册的worker信息;

private[spark] class WorkerInfo(    val id: String,    val host: String,    val port: Int,    val cores: Int,    val memory: Int,    val endpoint: RpcEndpointRef,    val webUiPort: Int,    val publicAddress: String)  extends Serializable {  }

然后调用registerWorker来执行具体的注册过程,如果worker的状态是否是dead的状态则直接过滤掉,对于unknown装的内容调用removeWorker进行清理(包括清理worker下的executors和driver)

5, 注册时候是先注册driver然后在注册application;

二 master对driver和executor状态变化的处理

1, 对driver状态变化的处理

case DriverState.ERROR | DriverState.FINISHED | DriverState.KILLED | DriverState.FAILED =>  removeDriver(driverId, state, exception)

2, Executor挂掉的时候系统会尝试一定次数的重启(最多重试10次重启)
这里写图片描述

阅读全文
0 0