Spark 1.6 Source Code: The Worker Registration Mechanism

Source: Internet  Editor: 程序博客网  Time: 2024/05/23 21:24


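Note: three components (worker, application, driver) register with the Master. Of the three, worker registration is the slightly more involved one.

Overall flow: the registration mechanism analysed here focuses mainly on what the Master does after it receives a registration request.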

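1. Worker registration (package: org.apache.spark.deploy.worker)
The Worker class extends ThreadSafeRpcEndpoint, a sub-trait of RpcEndpoint, so it has an onStart method. That method is called when the RPC endpoint starts, and it is where the worker registers itself with the Master.
After the worker starts, it uses a thread pool to send a registration message to every Master in the HA set. Once the active Master has handled the registration and replied, the worker matches on the type of the reply and handles it accordingly.
If registration succeeds, the worker updates its cached information about the Master, starts sending periodic heartbeats, and sends follow-up requests such as asking the Master to check its components.
If registration fails, the worker retries. After the maximum of 16 attempts it gives up.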
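
The retry rule described above can be modelled in a few lines. This is a minimal sketch, not Spark's implementation: the object, the `Decision` type and both method names are hypothetical, and the only value taken from the text is the limit of 16 attempts. It also ignores the retry intervals entirely.

```scala
// Sketch of the retry decision a worker makes each time its retry timer fires.
// Every name in this object is hypothetical; only the limit of 16 comes from the text.
object RegistrationRetrySketch {
  sealed trait Decision
  case object AlreadyRegistered extends Decision // stop the timer, nothing left to do
  case object Retry extends Decision             // send the registration again
  case object GiveUp extends Decision            // all masters are unresponsive

  val MaxAttempts = 16

  /** Decide what to do when the retry timer fires.
    *
    * @param registered whether a master has already confirmed the registration
    * @param attempts   number of retries already performed
    */
  def onTimer(registered: Boolean, attempts: Int): Decision =
    if (registered) AlreadyRegistered
    else if (attempts < MaxAttempts) Retry
    else GiveUp

  /** Count how many retries happen when no master ever answers. */
  def attemptsBeforeGivingUp(): Int = {
    var attempts = 0
    while (onTimer(registered = false, attempts) == Retry) attempts += 1
    attempts
  }
}
```

With no master answering, `RegistrationRetrySketch.attemptsBeforeGivingUp()` counts 16 retries before the decision becomes `GiveUp`.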



1.1 The worker begins registering as soon as it starts

override def onStart() {
  // Register with the Master
  registerWithMaster()
}

private def registerWithMaster() {
  registrationRetryTimer match {
    case None =>
      registered = false
      // Send a registration to every Master address
      registerMasterFutures = tryRegisterAllMasters()
      // Reset the attempt counter to 0. It tracks how many times the worker has
      // re-registered after a failure; past the limit of 16 it stops retrying.
      connectionAttemptCount = 0
      // Start a timer. If tryRegisterAllMasters above fails, `registered` never becomes true.
      // The timer checks `registered` and retries until the maximum number of attempts,
      // then gives up with: "All masters are unresponsive! Giving up"
      registrationRetryTimer = Some(forwordMessageScheduler.scheduleAtFixedRate(
        new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            Option(self).foreach(_.send(ReregisterWithMaster))
          }
        },
        INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
        INITIAL_REGISTRATION_RETRY_INTERVAL_SECONDS,
        TimeUnit.SECONDS))
    case Some(_) =>
      logInfo("Not spawning another attempt to register with the master, since there is an" +
        " attempt scheduled already.")
  }
}

private def tryRegisterAllMasters(): Array[JFuture[_]] = {
  // Iterate over masterRpcAddresses, a constructor parameter of the worker that holds
  // the RpcAddress of each Master, and use a thread pool (one thread per address)
  // to register with the Masters in parallel.
  // At first this raised a question: the code inside run() ends up touching the member
  // variable `registered`, which matters a lot. Could several threads sharing it
  // produce wrong behaviour?
  // Reading the Master code answers it. Registrations go to several Masters only in an
  // HA setup, and there only the active Master replies with a registration result
  // (a standby Master replies MasterInStandby, which the worker ignores). `registered`
  // is only touched after a reply arrives, so the situation I worried about cannot occur.
  masterRpcAddresses.map { masterAddress =>
    registerMasterThreadPool.submit(new Runnable {
      override def run(): Unit = {
        try {
          logInfo("Connecting to master " + masterAddress + "...")
          // Connect to the Master and create a reference for RPC communication with it
          val masterEndpoint = rpcEnv.setupEndpointRef(masterAddress, Master.ENDPOINT_NAME)
          // Use that reference to register
          registerWithMaster(masterEndpoint)
        } catch {
          case ie: InterruptedException => // Cancelled
          case NonFatal(e) => logWarning(s"Failed to connect to master $masterAddress", e)
        }
      }
    })
  }
}

// The actual registration happens only once the connection succeeded and
// there is a reference that can send messages.
private def registerWithMaster(masterEndpoint: RpcEndpointRef): Unit = {
  // Register with ask, which yields a reply
  masterEndpoint.ask[RegisterWorkerResponse](RegisterWorker(
    workerId, host, port, self, cores, memory, workerWebUiUrl))
    .onComplete {
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      // The call went through: handle the reply (the registration itself may still have failed)
      case Success(msg) =>
        Utils.tryLogNonFatalError {
          // Handle the reply
          handleRegisterResponse(msg)
        }
      case Failure(e) =>
        logError(s"Cannot register with master: ${masterEndpoint.address}", e)
        System.exit(1)
    }(ThreadUtils.sameThread)
}

private def handleRegisterResponse(msg: RegisterWorkerResponse): Unit = synchronized {
  msg match {
    // Only one thread can be in this match at a time!
    // Registration succeeded
    case RegisteredWorker(masterRef, masterWebUiUrl) =>
      registered = true // Mark this worker as registered
      // The registration was broadcast without knowing which Master is active. The success
      // reply carries that information, so store it in the worker's own cache.
      changeMaster(masterRef, masterWebUiUrl)
      // Send a heartbeat every 15 s; the Master checks every 60 s
      forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          self.send(SendHeartbeat)
        }
      }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
      if (CLEANUP_ENABLED) {
        logInfo(
          s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
        forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            self.send(WorkDirCleanup)
          }
        }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
      }
      val execs = executors.values.map { e =>
        new ExecutorDescription(e.appId, e.execId, e.cores, e.state)
      }
      // Ask the Master to check this worker's executors and drivers. For any it does not
      // recognise, the Master replies telling the worker to remove them.
      masterRef.send(WorkerLatestState(workerId, execs.toList, drivers.keys.toSeq))
    // Registration failed
    case RegisterWorkerFailed(message) =>
      if (!registered) {
        logError("Worker registration failed: " + message)
        System.exit(1)
      }
    // The message reached a standby Master: do nothing
    case MasterInStandby =>
      // Ignore. Master not yet ready.
  }
}

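
To see why broadcasting to every Master is safe, the exchange can be simulated without any RPC. The sketch below is an illustration under stated assumptions: `ToyMaster` and `ToyWorker` are hypothetical stand-ins, the message types are simplified copies of the ones in the code above, and a failed registration raises an error here instead of calling `System.exit(1)`.

```scala
// Toy simulation of a worker broadcasting its registration to an HA set of masters.
// ToyMaster and ToyWorker are hypothetical; the messages are simplified stand-ins.
object BroadcastRegistrationSketch {
  sealed trait RegisterWorkerResponse
  final case class RegisteredWorker(masterName: String) extends RegisterWorkerResponse
  final case class RegisterWorkerFailed(message: String) extends RegisterWorkerResponse
  case object MasterInStandby extends RegisterWorkerResponse

  // A standby master never accepts a registration; it only says it is in standby.
  final case class ToyMaster(name: String, active: Boolean) {
    def register(workerId: String): RegisterWorkerResponse =
      if (active) RegisteredWorker(name) else MasterInStandby
  }

  final class ToyWorker(val id: String) {
    var registered: Boolean = false
    var activeMaster: Option[String] = None

    // Same shape as handleRegisterResponse: synchronized, one case per message type.
    def handle(msg: RegisterWorkerResponse): Unit = synchronized {
      msg match {
        case RegisteredWorker(master) =>
          registered = true
          activeMaster = Some(master) // remember which master turned out to be active
        case RegisterWorkerFailed(reason) =>
          // The real worker exits the process; the sketch raises an error instead.
          if (!registered) sys.error(s"Worker registration failed: $reason")
        case MasterInStandby =>
          () // ignore, this master is not ready
      }
    }

    // Send the registration to every known master and handle each reply.
    def registerWithAll(masters: Seq[ToyMaster]): Unit =
      masters.foreach(m => handle(m.register(id)))
  }
}
```

Registering a `ToyWorker` against one active and two standby `ToyMaster` instances leaves `registered` set to true and `activeMaster` pointing at the active one; against standby masters only, the worker stays unregistered and would rely on the retry timer.
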
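2. Master registration (package: org.apache.spark.deploy.master)
On receiving a registration message, the Master first checks whether its own cached state still holds references to earlier data for this worker.
If the worker registered before and is now marked DEAD in the Master's data, that entry is removed.
If the worker's RPC address was also registered before and the Master's cached state for it is UNKNOWN, the old entry is removed as well, together with the executors and drivers it held. If the state is anything other than UNKNOWN, the Master returns immediately without registering the worker again, and the worker receives RegisterWorkerFailed.
Otherwise the Master adds the worker's information to its caches and persists it.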



2.1 When the Master receives the registration request that the Worker sent with ask

override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
  case RegisterWorker(
      id, workerHost, workerPort, workerRef, cores, memory, workerWebUiUrl) =>
    if (state == RecoveryState.STANDBY) {
      // The request reached a standby Master: reply that it will not be processed
      context.reply(MasterInStandby)
    } else if (idToWorker.contains(id)) {
      // Already registered, no second registration
      context.reply(RegisterWorkerFailed("Duplicate worker ID"))
    } else {
      // Wrap the worker's details in an info object
      val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
        workerRef, workerWebUiUrl)
      // Register. On success, persist the information and reply to the worker
      // that registration is complete.
      if (registerWorker(worker)) {
        persistenceEngine.addWorker(worker)
        context.reply(RegisteredWorker(self, masterWebUiUrl))
        schedule() // New resources are available, so schedule onto them right away
      } else {
        val workerAddress = worker.endpoint.address
        logWarning("Worker registration failed. Attempted to re-register worker at same " +
          "address: " + workerAddress)
        context.reply(RegisterWorkerFailed("Attempted to re-register worker at same address: "
          + workerAddress))
      }
    }
} // closes the excerpt; only the RegisterWorker case is shown here

private def registerWorker(worker: WorkerInfo): Boolean = {
  // The host and port this worker runs on may already have been registered and stored in
  // `workers`, and later marked DEAD by the Master. Remove those stale entries first.
  workers.filter { w =>
    (w.host == worker.host && w.port == worker.port) && (w.state == WorkerState.DEAD)
  }.foreach { w =>
    workers -= w
  }

  val workerAddress = worker.endpoint.address
  // The worker's RPC address is also already known to the Master
  if (addressToWorker.contains(workerAddress)) {
    val oldWorker = addressToWorker(workerAddress)
    // The old worker's state is UNKNOWN
    if (oldWorker.state == WorkerState.UNKNOWN) {
      // A worker registering from UNKNOWN implies that the worker was restarted during recovery.
      // The old worker must thus be dead, so we will remove it and accept the new worker.
      // Remove the old worker along with the executors and drivers it held
      removeWorker(oldWorker)
    } else {
      logInfo("Attempted to re-register worker at same address: " + workerAddress)
      return false
    }
  }

  workers += worker // Add the worker again
  idToWorker(worker.id) = worker // Map the id to the worker; this map is not kept in sync with `workers`
  addressToWorker(workerAddress) = worker // Map the RPC address to the worker
  if (reverseProxy) {
    webUi.addProxyTargets(worker.id, worker.webUiAddress)
  }
  true // Registration succeeded
}
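
The bookkeeping in registerWorker is easier to follow when it can be run in isolation. The following is a minimal sketch with plain collections, not the Master's code: `ToyWorkerInfo`, `Registry` and the string address are hypothetical, and `remove` rests on an assumption about removeWorker, namely that it marks the worker DEAD and drops it from the two lookup maps while leaving it in `workers`. Executors, drivers and persistence are left out.

```scala
import scala.collection.mutable

// In-memory model of the three collections the Master updates during registration.
// ToyWorkerInfo and Registry are hypothetical, trimmed-down stand-ins.
object MasterRegistrySketch {
  sealed trait State
  case object Alive extends State
  case object Dead extends State
  case object Unknown extends State

  final class ToyWorkerInfo(val id: String, val host: String, val port: Int,
                            var state: State = Alive) {
    def address: String = s"$host:$port" // stands in for the RPC address
  }

  final class Registry {
    val workers = mutable.HashSet.empty[ToyWorkerInfo]
    val idToWorker = mutable.HashMap.empty[String, ToyWorkerInfo]
    val addressToWorker = mutable.HashMap.empty[String, ToyWorkerInfo]

    def register(worker: ToyWorkerInfo): Boolean = {
      // Step 1: drop DEAD entries that used the same host and port.
      val stale = workers.filter(w =>
        w.host == worker.host && w.port == worker.port && w.state == Dead)
      workers --= stale

      // Step 2: a known address is replaced only if the old entry is UNKNOWN.
      val accepted = addressToWorker.get(worker.address) match {
        case Some(old) if old.state == Unknown =>
          remove(old)
          true
        case Some(_) => false // same address, still tracked: reject
        case None => true
      }

      // Step 3: record the new worker in all three collections.
      if (accepted) {
        workers += worker
        idToWorker(worker.id) = worker
        addressToWorker(worker.address) = worker
      }
      accepted
    }

    // Assumption: removal marks the worker DEAD and forgets it in both maps only.
    def remove(worker: ToyWorkerInfo): Unit = {
      worker.state = Dead
      idToWorker -= worker.id
      addressToWorker -= worker.address
    }
  }
}
```

In this model a second registration from the same address is rejected while the first worker is still alive, accepted once the first worker is UNKNOWN, and any DEAD leftovers in `workers` are cleared by the next registration from that host and port. That also shows why `idToWorker` and `workers` can briefly disagree.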