Spark Source Code Walkthrough (3): From Cluster Startup to Job Submission

1. Master Startup

Master startup mainly does two things:

1) It starts a daemon thread that periodically checks whether any Worker has timed out; the default timeout is 60s.

    checkForWorkerTimeOutTask = forwardMessageThread.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = Utils.tryLogNonFatalError {
        self.send(CheckForWorkerTimeOut)
      }
    }, 0, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)

    case CheckForWorkerTimeOut => {
      timeOutDeadWorkers()
    }

  /** Check for, and remove, any timed-out workers */
  private def timeOutDeadWorkers() {
    // Copy the workers into an array so we don't modify the hashset while iterating through it
    val currentTime = System.currentTimeMillis()
    val toRemove = workers.filter(_.lastHeartbeat < currentTime - WORKER_TIMEOUT_MS).toArray
    for (worker <- toRemove) {
      if (worker.state != WorkerState.DEAD) {
        logWarning("Removing %s because we got no heartbeat in %d seconds".format(
          worker.id, WORKER_TIMEOUT_MS / 1000))
        removeWorker(worker)
      } else {
        if (worker.lastHeartbeat < currentTime - ((REAPER_ITERATIONS + 1) * WORKER_TIMEOUT_MS)) {
          workers -= worker // we've seen this DEAD worker in the UI, etc. for long enough; cull it
        }
      }
    }
  }

  private def removeWorker(worker: WorkerInfo) {
    logInfo("Removing worker " + worker.id + " on " + worker.host + ":" + worker.port)
    worker.setState(WorkerState.DEAD)
    idToWorker -= worker.id
    addressToWorker -= worker.endpoint.address
    for (exec <- worker.executors.values) {
      logInfo("Telling app of lost executor: " + exec.id)
      exec.application.driver.send(ExecutorUpdated(
        exec.id, ExecutorState.LOST, Some("worker lost"), None))
      exec.application.removeExecutor(exec)
    }
    for (driver <- worker.drivers.values) {
      if (driver.desc.supervise) {
        logInfo(s"Re-launching ${driver.id}")
        relaunchDriver(driver)
      } else {
        logInfo(s"Not re-launching ${driver.id} because it was not supervised")
        removeDriver(driver.id, DriverState.ERROR, None)
      }
    }
    persistenceEngine.removeWorker(worker)
  }
When the Master finds that a Worker's heartbeat has timed out, it removes that Worker. If Executors are running on that Worker, it removes them and notifies the Driver each Executor belongs to. If a Driver is running on that Worker, the Master checks whether the Driver is supervised to decide whether to relaunch it.

2) Leader election

    val (persistenceEngine_, leaderElectionAgent_) = RECOVERY_MODE match {
      case "ZOOKEEPER" =>
        logInfo("Persisting recovery state to ZooKeeper")
        val zkFactory =
          new ZooKeeperRecoveryModeFactory(conf, serializer)
        (zkFactory.createPersistenceEngine(), zkFactory.createLeaderElectionAgent(this))
      case "FILESYSTEM" =>
        val fsFactory =
          new FileSystemRecoveryModeFactory(conf, serializer)
        (fsFactory.createPersistenceEngine(), fsFactory.createLeaderElectionAgent(this))
      case "CUSTOM" =>
        val clazz = Utils.classForName(conf.get("spark.deploy.recoveryMode.factory"))
        val factory = clazz.getConstructor(classOf[SparkConf], classOf[Serializer])
          .newInstance(conf, serializer)
          .asInstanceOf[StandaloneRecoveryModeFactory]
        (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this))
      case _ =>
        (new BlackHolePersistenceEngine(), new MonarchyLeaderAgent(this))
    }
ZooKeeper mode is the one most commonly used.
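
For context, a minimal sketch of the configuration that selects ZooKeeper recovery mode. The spark.deploy.* keys are the ones read by the code above; the ZooKeeper quorum and directory below are placeholder values, and in practice these properties are usually passed to the Master via SPARK_DAEMON_JAVA_OPTS rather than set in application code:

    import org.apache.spark.SparkConf

    // Illustrative only: selects the "ZOOKEEPER" branch of the RECOVERY_MODE match above.
    val haConf = new SparkConf()
      .set("spark.deploy.recoveryMode", "ZOOKEEPER")                   // default is "NONE"
      .set("spark.deploy.zookeeper.url", "zk1:2181,zk2:2181,zk3:2181") // placeholder quorum
      .set("spark.deploy.zookeeper.dir", "/spark")                     // placeholder znode path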

Leader election involves the following cases:

1) On the first startup, one of the Masters is elected leader.

2) The current Active Master is detected to be down, and a Standby Master is elected leader.

3) A Master in the Active state finds that it is no longer Active.

The first case is the simplest. The second requires reading the persisted state from the persistence engine and restoring it into memory. In the third case the Master simply shuts itself down.

    case ElectedLeader => {
      val (storedApps, storedDrivers, storedWorkers) = persistenceEngine.readPersistedData(rpcEnv)
      state = if (storedApps.isEmpty && storedDrivers.isEmpty && storedWorkers.isEmpty) {
        RecoveryState.ALIVE
      } else {
        RecoveryState.RECOVERING
      }
      logInfo("I have been elected leader! New state: " + state)
      if (state == RecoveryState.RECOVERING) {
        beginRecovery(storedApps, storedDrivers, storedWorkers)
        recoveryCompletionTask = forwardMessageThread.schedule(new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            self.send(CompleteRecovery)
          }
        }, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)
      }
    }

    case RevokedLeadership => {
      logError("Leadership has been revoked -- master shutting down.")
      System.exit(0)
    }
2. Worker Startup

Worker startup is even simpler: the Worker mainly registers itself with the Master.

    registerWithMaster()
After receiving the Worker's registration request, the Master saves the Worker's information to the persistence engine, replies that registration succeeded, and triggers a round of scheduling.

    case RegisterWorker(
        id, workerHost, workerPort, workerRef, cores, memory, workerUiPort, publicAddress) => {
      logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
        workerHost, workerPort, cores, Utils.megabytesToString(memory)))
      if (state == RecoveryState.STANDBY) {
        context.reply(MasterInStandby)
      } else if (idToWorker.contains(id)) {
        context.reply(RegisterWorkerFailed("Duplicate worker ID"))
      } else {
        val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
          workerRef, workerUiPort, publicAddress)
        if (registerWorker(worker)) {
          persistenceEngine.addWorker(worker)
          context.reply(RegisteredWorker(self, masterWebUiUrl))
          schedule()
        } else {
          val workerAddress = worker.endpoint.address
          logWarning("Worker registration failed. Attempted to re-register worker at same " +
            "address: " + workerAddress)
          context.reply(RegisterWorkerFailed("Attempted to re-register worker at same address: "
            + workerAddress))
        }
      }
    }
After receiving the Master's confirmation that registration succeeded, the Worker starts sending periodic heartbeats to the Master.

    case RegisteredWorker(masterRef, masterWebUiUrl) =>
      logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
      registered = true
      changeMaster(masterRef, masterWebUiUrl)
      forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          self.send(SendHeartbeat)
        }
      }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
      if (CLEANUP_ENABLED) {
        logInfo(
          s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
        forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
          override def run(): Unit = Utils.tryLogNonFatalError {
            self.send(WorkDirCleanup)
          }
        }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
      }
The heartbeat interval defaults to 1/4 of the timeout, i.e., a heartbeat every 15s.
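
For reference, a minimal sketch of how the Worker derives this interval (assuming the 1.x codebase this series reads): spark.worker.timeout is given in seconds and defaults to 60, and the heartbeat period is a quarter of it.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
    // 60s timeout by default => heartbeat every 15000 ms
    val HEARTBEAT_MILLIS = conf.getLong("spark.worker.timeout", 60) * 1000 / 4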


3. Job Submission

Job submission is more involved and touches several components: the Master, Workers, Client, Driver, and Executors.

First, let's start with the initialization of SparkContext.

During initialization, SparkContext does two important things: 1) it starts the HeartbeatReceiver; 2) it starts the TaskScheduler.

    // We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
    // retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
    _heartbeatReceiver = env.rpcEnv.setupEndpoint(
      HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

    // Create and start the scheduler
    val (sched, ts) = SparkContext.createTaskScheduler(this, master)
    _schedulerBackend = sched
    _taskScheduler = ts
    _dagScheduler = new DAGScheduler(this)
    _heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    _taskScheduler.start()
Starting the TaskScheduler in turn does two main things: it starts the Driver endpoint and the Client (AppClient).

The code that starts the Driver endpoint:

    // TODO (prashant) send conf instead of properties
    driverEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint(properties))
Once started, the Driver endpoint mainly sends ReviveOffers messages to itself on a schedule.

    override def onStart() {
      // Periodically revive offers to allow delay scheduling to work
      val reviveIntervalMs = conf.getTimeAsMs("spark.scheduler.revive.interval", "1s")

      reviveThread.scheduleAtFixedRate(new Runnable {
        override def run(): Unit = Utils.tryLogNonFatalError {
          Option(self).foreach(_.send(ReviveOffers))
        }
      }, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)
    }
The code that starts the Client:

    client = new AppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
    client.start()
Once started, the Client registers the Application with the Master.

    override def onStart(): Unit = {
      try {
        registerWithMaster(1)
      } catch {
        case e: Exception =>
          logWarning("Failed to connect to master", e)
          markDisconnected()
          stop()
      }
    }
After receiving the Client's registration request, the Master replies that registration succeeded, performs a round of resource scheduling, and tells the Workers to launch Executors (LaunchExecutor).

    case RegisterApplication(description, driver) => {
      // TODO Prevent repeated registrations from some driver
      if (state == RecoveryState.STANDBY) {
        // ignore, don't send response
      } else {
        logInfo("Registering app " + description.name)
        val app = createApplication(description, driver)
        registerApplication(app)
        logInfo("Registered app " + description.name + " with ID " + app.id)
        persistenceEngine.addApplication(app)
        driver.send(RegisteredApplication(app.id, self))
        schedule()
      }
    }
It is worth looking at the resource scheduling process here:

    /** Return whether the specified worker can launch an executor for this app. */
    def canLaunchExecutor(pos: Int): Boolean = {
      val keepScheduling = coresToAssign >= minCoresPerExecutor
      val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor

      // If we allow multiple executors per worker, then we can always launch new executors.
      // Otherwise, if there is already an executor on this worker, just give it more cores.
      val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
      if (launchingNewExecutor) {
        val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
        val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor
        val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
        keepScheduling && enoughCores && enoughMemory && underLimit
      } else {
        // We're adding cores to an existing executor, so no need
        // to check memory and executor limits
        keepScheduling && enoughCores
      }
    }

    // Keep launching executors until no more workers can accommodate any
    // more executors, or if we have reached this application's limits
    var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
    while (freeWorkers.nonEmpty) {
      freeWorkers.foreach { pos =>
        var keepScheduling = true
        while (keepScheduling && canLaunchExecutor(pos)) {
          coresToAssign -= minCoresPerExecutor
          assignedCores(pos) += minCoresPerExecutor

          // If we are launching one executor per worker, then every iteration assigns 1 core
          // to the executor. Otherwise, every iteration assigns cores to a new executor.
          if (oneExecutorPerWorker) {
            assignedExecutors(pos) = 1
          } else {
            assignedExecutors(pos) += 1
          }

          // Spreading out an application means spreading out its executors across as
          // many workers as possible. If we are not spreading out, then we should keep
          // scheduling executors on this worker until we use all of its resources.
          // Otherwise, just move on to the next worker.
          if (spreadOutApps) {
            keepScheduling = false
          }
        }
      }
      freeWorkers = freeWorkers.filter(canLaunchExecutor)
    }
    assignedCores
  }
First, a few important parameters: coresToAssign, coresPerExecutor, and memoryPerExecutor.

1) coresToAssign

    var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
  private[master] def coresLeft: Int = requestedCores - coresGranted
  private val requestedCores = desc.maxCores.getOrElse(defaultCores)
  private val maxCores = conf.getOption("spark.cores.max").map(_.toInt)
      OptionAssigner(args.totalExecutorCores, STANDALONE | MESOS, ALL_DEPLOY_MODES,
        sysProp = "spark.cores.max"),

      case TOTAL_EXECUTOR_CORES =>
        totalExecutorCores = value
  protected final String TOTAL_EXECUTOR_CORES = "--total-executor-cores";
From the parameter parsing and propagation above, coresToAssign here is simply the --total-executor-cores value passed to spark-submit.

2) coresPerExecutor

    val coresPerExecutor = app.desc.coresPerExecutor
    val coresPerExecutor = conf.getOption("spark.executor.cores").map(_.toInt)
      OptionAssigner(args.executorCores, STANDALONE | YARN, ALL_DEPLOY_MODES,
        sysProp = "spark.executor.cores"),

      case EXECUTOR_CORES =>
        executorCores = value

  protected final String EXECUTOR_CORES = "--executor-cores";
So coresPerExecutor comes from --executor-cores.

3) memoryPerExecutor

    val memoryPerExecutor = app.desc.memoryPerExecutorMB
    _executorMemory = _conf.getOption("spark.executor.memory")
      .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
      .orElse(Option(System.getenv("SPARK_MEM"))
      .map(warnSparkMem))
      .map(Utils.memoryStringToMb)
      .getOrElse(1024)

      OptionAssigner(args.executorMemory, STANDALONE | MESOS | YARN, ALL_DEPLOY_MODES,
        sysProp = "spark.executor.memory"),

      case EXECUTOR_MEMORY =>
        executorMemory = value
  protected final String EXECUTOR_MEMORY = "--executor-memory";
So memoryPerExecutor comes from --executor-memory.
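
For illustration, the same three settings expressed as SparkConf keys instead of spark-submit flags (the values are arbitrary examples):

    import org.apache.spark.SparkConf

    val appConf = new SparkConf()
      .set("spark.cores.max", "6")        // equivalent to --total-executor-cores 6
      .set("spark.executor.cores", "2")   // equivalent to --executor-cores 2
      .set("spark.executor.memory", "2g") // equivalent to --executor-memory 2g
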
Besides these three parameters, one more parameter matters: spreadOutApps.

  private val spreadOutApps = conf.getBoolean("spark.deploy.spreadOut", true)
By default spreadOutApps is true, meaning Executors are spread as evenly as possible across the Workers. If not set, memoryPerExecutor defaults to 1g. If coresPerExecutor is not set, each Worker launches at most one Executor for the app, so you can end up with an Executor that holds many cores but is allocated only memoryPerExecutor worth of memory.
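
To see the difference concretely, here is a small standalone toy model of the assignment loop above (not Spark code; the worker capacities and core counts are made up):

    object SpreadOutDemo {
      // Hand out cores in minCoresPerExecutor steps, either spreading across
      // workers (spreadOut = true) or filling one worker before moving on.
      def assign(freeCores: Array[Int], totalCores: Int,
                 minCoresPerExecutor: Int, spreadOut: Boolean): Array[Int] = {
        val assigned = Array.fill(freeCores.length)(0)
        var remaining = totalCores
        def canLaunch(i: Int): Boolean =
          remaining >= minCoresPerExecutor &&
            freeCores(i) - assigned(i) >= minCoresPerExecutor
        var candidates = freeCores.indices.filter(canLaunch)
        while (candidates.nonEmpty) {
          candidates.foreach { i =>
            var keepScheduling = true
            while (keepScheduling && canLaunch(i)) {
              remaining -= minCoresPerExecutor
              assigned(i) += minCoresPerExecutor
              if (spreadOut) keepScheduling = false // move on to the next worker
            }
          }
          candidates = candidates.filter(canLaunch)
        }
        assigned
      }

      def main(args: Array[String]): Unit = {
        val workers = Array(8, 8, 8) // free cores on three hypothetical workers
        println(assign(workers, 6, 1, spreadOut = true).mkString(","))  // 2,2,2 -> spread out
        println(assign(workers, 6, 1, spreadOut = false).mkString(",")) // 6,0,0 -> packed
      }
    }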

Once the Master has finished scheduling, it tells the Worker to launch the Executor (LaunchExecutor).

  /**
   * Allocate a worker's resources to one or more executors.
   * @param app the info of the application which the executors belong to
   * @param assignedCores number of cores on this worker for this application
   * @param coresPerExecutor number of cores per executor
   * @param worker the worker info
   */
  private def allocateWorkerResourceToExecutors(
      app: ApplicationInfo,
      assignedCores: Int,
      coresPerExecutor: Option[Int],
      worker: WorkerInfo): Unit = {
    // If the number of cores per executor is specified, we divide the cores assigned
    // to this worker evenly among the executors with no remainder.
    // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
    val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
    val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
    for (i <- 1 to numExecutors) {
      val exec = app.addExecutor(worker, coresToAssign)
      launchExecutor(worker, exec)
      app.state = ApplicationState.RUNNING
    }
  }
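
A tiny worked example of the division above, with made-up numbers:

    // Hypothetical values: this worker was assigned 7 cores for the app.
    val assignedCores = 7
    val coresPerExecutor: Option[Int] = Some(2)

    val numExecutors = coresPerExecutor.map(assignedCores / _).getOrElse(1) // 3 executors
    val coresToAssign = coresPerExecutor.getOrElse(assignedCores)           // 2 cores each
    // => three 2-core executors; the seventh core stays unused for this app on this worker.
    // With coresPerExecutor = None, a single executor would grab all 7 cores.
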
When it tells the Worker to launch the Executor, the Master also notifies the Client with ExecutorAdded.

  private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
    logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
    worker.addExecutor(exec)
    worker.endpoint.send(LaunchExecutor(masterUrl,
      exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
    exec.application.driver.send(
      ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
  }
After receiving the Master's command, the Worker starts a thread that launches the Executor as a child process.

    workerThread = new Thread("ExecutorRunner for " + fullId) {
      override def run() { fetchAndRunExecutor() }
    }
    workerThread.start()
The code that launches the Executor process:

      // Launch the process
      val builder = CommandUtils.buildProcessBuilder(appDesc.command, new SecurityManager(conf),
        memory, sparkHome.getAbsolutePath, substituteVariables)
      val command = builder.command()
      val formattedCommand = command.asScala.mkString("\"", "\" \"", "\"")
      logInfo(s"Launch command: $formattedCommand")

      builder.directory(executorDir)
      builder.environment.put("SPARK_EXECUTOR_DIRS", appLocalDirs.mkString(File.pathSeparator))
      // In case we are running this from within the Spark Shell, avoid creating a "scala"
      // parent process for the executor command
      builder.environment.put("SPARK_LAUNCH_WITH_SCALA", "0")

      // Add webUI log urls
      val baseUrl =
        s"http://$publicAddress:$webUiPort/logPage/?appId=$appId&executorId=$execId&logType="
      builder.environment.put("SPARK_LOG_URL_STDERR", s"${baseUrl}stderr")
      builder.environment.put("SPARK_LOG_URL_STDOUT", s"${baseUrl}stdout")

      process = builder.start()
Here you can see that the command is passed in through appDesc, i.e., appDesc.command:

    val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
      args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)
Now it's clear: what gets launched is org.apache.spark.executor.CoarseGrainedExecutorBackend.

Its main method registers the Executor endpoint:

      env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(        env.rpcEnv, driverUrl, executorId, sparkHostPort, cores, userClassPath, env))
From the earlier analysis of the Master startup process, we know that once an RpcEndpoint is registered, its onStart() method is invoked.

  override def onStart() {
    logInfo("Connecting to driver: " + driverUrl)
    rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      driver = Some(ref)
      ref.ask[RegisterExecutorResponse](
        RegisterExecutor(executorId, self, hostPort, cores, extractLogUrls))
    }(ThreadUtils.sameThread).onComplete {
      // This is a very fast action so we can use "ThreadUtils.sameThread"
      case Success(msg) => Utils.tryLogNonFatalError {
        Option(self).foreach(_.send(msg)) // msg must be RegisterExecutorResponse
      }
      case Failure(e) => {
        logError(s"Cannot register with driver: $driverUrl", e)
        System.exit(1)
      }
    }(ThreadUtils.sameThread)
  }
So the first thing the Executor does after starting is register itself with the Driver.

As mentioned above, during SparkContext initialization the Driver endpoint is started when the TaskScheduler starts. When the Driver receives the Executor's registration, it sends back a reply:

          context.reply(RegisteredExecutor(executorAddress.host))
After receiving the Driver's reply, the Executor starts sending periodic heartbeats; the default heartbeat interval is 10s.

  private def startDriverHeartbeater(): Unit = {
    val intervalMs = conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s")

    // Wait a random interval so the heartbeats don't end up in sync
    val initialDelay = intervalMs + (math.random * intervalMs).asInstanceOf[Int]

    val heartbeatTask = new Runnable() {
      override def run(): Unit = Utils.logUncaughtExceptions(reportHeartBeat())
    }
    heartbeater.scheduleAtFixedRate(heartbeatTask, initialDelay, intervalMs, TimeUnit.MILLISECONDS)
  }

The overall flow is roughly: SparkContext init --> TaskScheduler start --> Driver endpoint and AppClient start --> AppClient registers the Application with the Master --> Master schedules resources and sends LaunchExecutor --> Worker launches CoarseGrainedExecutorBackend --> Executor registers with the Driver and starts heartbeating.


The above covered SparkContext initialization; next we continue with how a Job is submitted.

RDD operations fall into two categories: Transformations and Actions.

A Transformation does not trigger a Job submission; it simply creates a new RDD and passes the current RDD to the new one as a dependency, as in the map() operation below:

  /**
   * Return a new RDD by applying a function to all elements of this RDD.
   */
  def map[U: ClassTag](f: T => U): RDD[U] = withScope {
    val cleanF = sc.clean(f)
    new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
  }
An Action triggers a Job submission, as in the collect() operation:

  /**
   * Return an array that contains all of the elements in this RDD.
   */
  def collect(): Array[T] = withScope {
    val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
    Array.concat(results: _*)
  }
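
To make the distinction concrete, a minimal user-level example (assuming an existing SparkContext named sc):

    val doubled = sc.parallelize(1 to 10).map(_ * 2) // Transformation: builds a new RDD, no job yet
    val result  = doubled.collect()                  // Action: calls sc.runJob() and submits a job
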
The collect() implementation above calls SparkContext's runJob() method, which in turn calls dagScheduler.runJob() ...

The full call chain is:

Action --> SparkContext.runJob() --> dagScheduler.runJob() --> dagScheduler.submitJob() --> eventProcessLoop.post(JobSubmitted) --> dagScheduler.handleJobSubmitted() --> taskScheduler.submitTasks() --> CoarseGrainedSchedulerBackend.launchTasks() --> executor.launchTask() --> threadPool.execute(TaskRunner)

The whole call flow involves the RDD --> DAG --> Stage --> TaskSet conversion as well as resource scheduling; that part is fairly complex and will be covered in a later post.
