Spark 1.6 Source Code: The Worker Registration Mechanism
Source: Internet · Edited by: 程序博客网 · Date: 2024/05/23 21:24
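Note: three components (worker, application, driver) register with the master. Of the three, worker registration is the slightly more involved one.
Overall flow: the analysis follows the registration request from the Worker and focuses mainly on what the Master does once it receives that request.
- 1. Worker registration (package: org.apache.spark.deploy.worker)
- The Worker class extends ThreadSafeRpcEndpoint, which in turn extends RpcEndpoint, so it has an onStart method. That method is called when the RPC endpoint starts, and it is where the worker registers itself with the Master.
- After the worker starts, it creates a thread pool and sends a registration message to every Master in the HA setup. Once the active Master has finished the registration and replied, the worker matches on the reply type and handles it accordingly.
- If registration succeeds, the worker updates its cached information about the Master, starts sending periodic heartbeats, and asks the Master to check its components (executors and drivers).
- If registration fails, the worker retries. After the maximum of 16 attempts it gives up.
1.1 The worker starts registering as soon as it starts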
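The retry rule in the list above can be sketched as a small pure function. This is only an illustrative model, not the Worker's actual code: the names `onTick`, `Action` and friends are hypothetical, and the split into 6 short-interval attempts followed by 10 long-interval ones is an assumption based on the Worker's constants as I recall them. The text above only states the total of 16.

```scala
object RetryBudgetSketch {
  // Assumed split; only the total of 16 is stated in the article.
  val InitialRetries = 6
  val ProlongedRetries = 10
  val TotalRetries: Int = InitialRetries + ProlongedRetries

  sealed trait Action
  case object StopRetrying extends Action       // already registered: cancel the timer
  case object RetryShortInterval extends Action // resend, keep the short interval
  case object RetryLongInterval extends Action  // resend, use the long interval from now on
  case object GiveUp extends Action             // all masters unresponsive

  /** Decide what to do on the `attempt`-th timer tick (1-based). */
  def onTick(attempt: Int, registered: Boolean): Action =
    if (registered) StopRetrying
    else if (attempt > TotalRetries) GiveUp
    else if (attempt < InitialRetries) RetryShortInterval
    else RetryLongInterval

  def main(args: Array[String]): Unit =
    (1 to TotalRetries + 1).foreach { n =>
      println(s"attempt $n -> ${onTick(n, registered = false)}")
    }
}
```

Keeping the decision separate from the timer makes the budget easy to check: the 17th tick is the first one that gives up.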

    override def onStart() {
      // Register with the Master
      registerWithMaster()
    }

    private def registerWithMaster() {
      registrationRetryTimer match {
        case None =>
          registered = false
          // Try to register with every Master address once
          registerMasterFutures = tryRegisterAllMasters()
          // Reset the connection attempt counter to 0. It counts re-registration attempts
          // after a failure; past the limit of 16 the worker stops retrying.
          connectionAttemptCount = 0
          // Start a timer. If tryRegisterAllMasters() above did not succeed, `registered`
          // is still false. Each tick checks `registered` and retries until the maximum
          // number of attempts is reached, then gives up with:
          // "All masters are unresponsive! Giving up"
          registrationRetryTimer = Some(forwordMessageScheduler.scheduleAtFixedRate(
            new Runnable {
              override def run(): Unit = Utils.tryLogNonFatalError {
                Option(self).foreach(_.send(ReregisterWithMaster))
              }
            },
            INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
            INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
            TimeUnit.SECONDS))
        case Some(_) =>
          logInfo("Not spawning another attempt to register with the master, since there is an" +
            " attempt scheduled already.")
      }
    }

    private def tryRegisterAllMasters(): Array[JFuture[_]] = {
      // Iterate over masterRpcAddresses, a constructor parameter of this worker that holds
      // the RpcAddress of every Master, and register with each one on a thread pool whose
      // size equals the number of addresses.
      // At first this raised a question: the code run by these threads ends up touching
      // the member variable `registered`, which matters a lot. Could sharing it across
      // threads lead to wrong behavior?
      // Reading the Master code answered it. Sending to several masters only happens in
      // an HA setup, where only the live Master replies with a registration result, and
      // `registered` is only evaluated after a reply arrives. So the problem I was worried
      // about does not occur.
      masterRpcAddresses.map { masterAddress =>
        registerMasterThreadPool.submit(new Runnable {
          override def run(): Unit = {
            try {
              logInfo("Connecting to master " + masterAddress + "...")
              // Connect to the Master and create a reference for RPC calls to it
              val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
              // Use that reference to register
              registerWithMaster(masterEndpoint)
            } catch {
              case ie: InterruptedException => // Cancelled
              case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
            }
          }
        })
      }
    }

    // The actual registration happens only after the connection succeeded and we hold a
    // reference that can send messages
    private def registerWithMaster(masterEndpoint: RpcEndpointRef): Unit = {
      // Register with ask, which delivers a reply
      masterEndpoint.ask[RegisterWorkerResponse](RegisterWorker(
        workerId, host, port, self, cores, memory, workerWebUiUrl))
        .onComplete {
          // This is a very fast action so we can use "ThreadUtils.sameThread"
          // The call went through: handle the reply (registration may still have failed)
          case Success(msg) =>
            Utils.tryLogNonFatalError {
              // Handle the returned message
              handleRegisterResponse(msg)
            }
          case Failure(e) =>
            logError(s"Cannot register with master: ${masterEndpoint.address}", e)
            System.exit(1)
        }(ThreadUtils.sameThread)
    }

    private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = synchronized {
      msg match {
        // Only one thread can ever match this case!
        // Registration succeeded
        case RegisteredWorker(masterRef, masterWebUiUrl) =>
          registered = true // Mark this worker as registered
          // The registration was broadcast without knowing which Master is active. The
          // success reply carries that information, so store it in the local cache.
          changeMaster(masterRef, masterWebUiUrl)
          // Send a heartbeat every 15s; the master checks every 60s
          forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
            override def run(): Unit = Utils.tryLogNonFatalError {
              self.send(SendHeartbeat)
            }
          }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
          if (CLEANUP_ENABLED) {
            logInfo(
              s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
            forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
              override def run(): Unit = Utils.tryLogNonFatalError {
                self.send(WorkDirCleanup)
              }
            }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
          }
          val execs = executors.values.map { e =>
            new ExecutorDescription(e.appId, e.execId, e.cores, e.state)
          }
          // Ask the Master to check this worker's executors and drivers. For any it does
          // not recognize, the Master replies telling the worker to remove them.
          masterRef.send(WorkerLatestState(workerId, execs.toList, drivers.keys.toSeq))

        // Registration failed
        case RegisterWorkerFailed(message) =>
          if (!registered) {
            logError("Worker registration failed: " + message)
            System.exit(1)
          }

        // The message went to a standby Master: do nothing
        case MasterInStandby =>
          // Ignore. Master not yet ready.
      }
    }

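To see why broadcasting the registration to every Master is safe, here is a minimal self-contained model of the reply handling. It is a sketch under assumptions, not the real Worker: `WorkerState`, `activeMaster` and `fatalError` are hypothetical names, the Master reference is reduced to a plain URL string, and `fatalError` stands in for the `System.exit(1)` call.

```scala
object RegisterResponseSketch {
  sealed trait RegisterWorkerResponse
  final case class RegisteredWorker(masterUrl: String) extends RegisterWorkerResponse
  final case class RegisterWorkerFailed(message: String) extends RegisterWorkerResponse
  case object MasterInStandby extends RegisterWorkerResponse

  final class WorkerState {
    var registered: Boolean = false
    var activeMaster: Option[String] = None
    var fatalError: Option[String] = None // hypothetical stand-in for System.exit(1)

    // Replies may arrive on different threads, so handling is serialized.
    def handle(msg: RegisterWorkerResponse): Unit = synchronized {
      msg match {
        case RegisteredWorker(url) =>
          registered = true
          activeMaster = Some(url) // remember which Master turned out to be active
        case RegisterWorkerFailed(message) =>
          if (!registered) fatalError = Some(message)
        case MasterInStandby =>
          () // standby masters are ignored
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val worker = new WorkerState
    // Three masters answer; only the second one is active.
    Seq(MasterInStandby, RegisteredWorker("spark://m2:7077"), MasterInStandby)
      .foreach(worker.handle)
    println(s"registered=${worker.registered}, activeMaster=${worker.activeMaster}")
  }
}
```

The order of replies does not matter in this model: standby replies never touch `registered`, and a late failure is ignored once a success has been recorded.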
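- 2. Master registration (package: org.apache.spark.deploy.master)
- On receiving the registration message, the Master first checks whether its own cached state still holds references to this worker from an earlier registration.
- If a worker with the same host and port registered before and the Master's records now mark it as DEAD, that old entry is removed.
- If the worker's RPC address was registered before and the cached entry is in state UNKNOWN, the old entry is removed too, along with the executors and drivers it contained. Otherwise, if the state is not UNKNOWN, the Master returns immediately and the registration is rejected.
- The new worker's information is added to the Master's caches and persisted.
2.1 When the Master receives the registration request the Worker sent via ask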

    override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
      case RegisterWorker(
          id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl) =>
        if (state == RecoveryState.STANDBY) {
          // The request reached a standby Master: reply that nothing will be done
          context.reply(MasterInStandby)
        } else if (idToWorker.contains(id)) {
          // This worker ID is already registered; no second registration
          context.reply(RegisterWorkerFailed("Duplicate worker ID"))
        } else {
          // Wrap the worker's details in a WorkerInfo object
          val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
            workerRef, workerWebUiUrl)
          // Register. On success, persist the worker and reply that registration is done
          if (registerWorker(worker)) {
            persistenceEngine.addWorker(worker)
            context.reply(RegisteredWorker(self, masterWebUiUrl))
            schedule() // New resources are available, so schedule right away
          } else {
            val workerAddress = worker.endpoint.address
            logWarning("Worker registration failed. Attempted to re-register worker at same " +
              "address: " + workerAddress)
            context.reply(RegisterWorkerFailed("Attempted to re-register worker at same address: "
              + workerAddress))
          }
        }
      // (handlers for other message types are not shown)
    }

    private def registerWorker(worker: WorkerInfo): Boolean = {
      // A worker with this host and port may have registered before and been stored in
      // `workers`, then marked DEAD by the Master at some point. Remove that stale entry
      // from the Master first.
      workers.filter { w =>
        (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
      }.foreach { w =>
        workers -= w
      }

      val workerAddress = worker.endpoint.address
      // The worker's RPC address is already known to the Master
      if (addressToWorker.contains(workerAddress)) {
        val oldWorker = addressToWorker(workerAddress)
        // The old worker is in state UNKNOWN
        if (oldWorker.state == WorkerState.UNKNOWN) {
          // A worker registering from UNKNOWN implies that the worker was restarted during recovery.
          // The old worker must thus be dead, so we will remove it and accept the new worker.
          // Remove the old worker together with its executors and drivers
          removeWorker(oldWorker)
        } else {
          logInfo("Attempted to re-register worker at same address: " + workerAddress)
          return false
        }
      }

      workers += worker // Add the worker again
      idToWorker(worker.id) = worker // Index by ID; this map is not kept in sync with `workers`
      addressToWorker(workerAddress) = worker // Index by RPC address
      if (reverseProxy) {
        webUi.addProxyTargets(worker.id, worker.webUiAddress)
      }
      true // Registration succeeded
    }

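The Master-side decision can be tried out in isolation with a small model of the three collections. This is a simplified sketch, not the Master's code: `Registry`, `register` and `remove` are hypothetical names, addresses are plain `host:port` strings, and executors, drivers, persistence and the duplicate-ID check done in receiveAndReply are all left out.

```scala
import scala.collection.mutable

object MasterRegistrySketch {
  object WorkerState extends Enumeration { val ALIVE, DEAD, UNKNOWN = Value }

  final class WorkerInfo(val id: String, val host: String, val port: Int) {
    var state: WorkerState.Value = WorkerState.ALIVE
    def address: String = s"$host:$port" // stands in for the RPC address
  }

  final class Registry {
    val workers = mutable.HashSet.empty[WorkerInfo]
    val idToWorker = mutable.HashMap.empty[String, WorkerInfo]
    val addressToWorker = mutable.HashMap.empty[String, WorkerInfo]

    // Mark the worker DEAD and drop it from both indexes. It stays in `workers`,
    // which is why `register` filters out DEAD entries first.
    private def remove(w: WorkerInfo): Unit = {
      w.state = WorkerState.DEAD
      idToWorker -= w.id
      addressToWorker -= w.address
    }

    def register(worker: WorkerInfo): Boolean = {
      // Step 1: forget DEAD entries that used the same host and port.
      workers
        .filter(w => w.host == worker.host && w.port == worker.port &&
          w.state == WorkerState.DEAD)
        .foreach(workers -= _)

      addressToWorker.get(worker.address) match {
        // Step 2a: the address is held by a worker that is not UNKNOWN, so reject.
        case Some(old) if old.state != WorkerState.UNKNOWN => false
        // Step 2b: the address is free or held by an UNKNOWN worker; replace and accept.
        case previous =>
          previous.foreach(remove)
          workers += worker
          idToWorker(worker.id) = worker
          addressToWorker(worker.address) = worker
          true
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val registry = new Registry
    val first = new WorkerInfo("w1", "host-a", 7078)
    val second = new WorkerInfo("w2", "host-a", 7078)
    println(registry.register(first))  // fresh address
    println(registry.register(second)) // same address, old worker still ALIVE
    first.state = WorkerState.UNKNOWN  // e.g. the Master is recovering
    println(registry.register(second)) // old worker replaced
  }
}
```

Note how the replaced worker lingers in `workers` as DEAD until the next registration from the same host and port sweeps it out, which matches the comment in the listing that `idToWorker` is not kept in sync with `workers`.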