Spark Resource Scheduling and Allocation

Source: Internet. Editor: 程序博客网. Time: 2024/05/29 12:39

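I. The difference between task scheduling and resource scheduling
1. Task scheduling is the job scheduling carried out by DAGScheduler, TaskScheduler, SchedulerBackend, and related components.
2. Resource scheduling is how an application obtains its resources.
3. Task scheduling is built on top of resource scheduling; without resource scheduling, task scheduling has nothing to work with.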
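II. Resource scheduling internals
1. Because the Master is responsible for resource management and scheduling, the resource scheduling method schedule() lives in the Master class (Master.scala). Registering an application, or any change in available resources, triggers a call to schedule(). For example, when an application registers: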

    case RegisterApplication(description, driver) => {
      // TODO Prevent repeated registrations from some driver
      if (state == RecoveryState.STANDBY) {
        // ignore, don't send response
      } else {
        logInfo("Registering app " + description.name)
        val app = createApplication(description, driver)
        registerApplication(app)
        logInfo("Registered app " + description.name + " with ID " + app.id)
        persistenceEngine.addApplication(app)
        driver.send(RegisteredApplication(app.id, self))
        schedule()
      }
    }

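2. When schedule() is called: every time a new application is submitted or the cluster's resource state changes (including Executors being added or removed, and Workers being added or removed).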

    /**
     * Schedule the currently available resources among waiting apps. This method will be called
     * every time a new app joins or resource availability changes.
     */
    private def schedule(): Unit = {
      if (state != RecoveryState.ALIVE) {
        // The Master must be ALIVE to schedule resources; otherwise return immediately
        return
      }
      // Drivers take strict precedence over executors
      val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
      val numWorkersAlive = shuffledAliveWorkers.size
      var curPos = 0
      for (driver <- waitingDrivers.toList) { // iterate over a copy of waitingDrivers
        // We assign workers to each waiting driver in a round-robin fashion. For each driver, we
        // start from the last worker that was assigned a driver, and continue onwards until we have
        // explored all alive workers.
        var launched = false
        var numWorkersVisited = 0
        while (numWorkersVisited < numWorkersAlive && !launched) {
          val worker = shuffledAliveWorkers(curPos)
          numWorkersVisited += 1
          if (worker.memoryFree >= driver.desc.mem && worker.coresFree >= driver.desc.cores) {
            launchDriver(worker, driver)
            waitingDrivers -= driver
            launched = true
          }
          curPos = (curPos + 1) % numWorkersAlive
        }
      }
      startExecutorsOnWorkers() // Launch Executors on the Worker nodes; see item 8
    }

3. Resource scheduling only proceeds when the current Master is in the ALIVE state; in any other state schedule() returns immediately:

    if (state != RecoveryState.ALIVE) {
      return
    }
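
To make the trigger points and the ALIVE guard concrete, here is a minimal sketch. MiniMaster, its three event methods and the scheduleRuns counter are all invented for illustration and are not part of Spark; the only behavior taken from the excerpts above is that events funnel into schedule() and that schedule() does nothing unless the Master is ALIVE.

```scala
object ScheduleTriggers {
  // Hypothetical stand-ins for Spark's RecoveryState values.
  sealed trait RecoveryState
  case object Alive extends RecoveryState
  case object Standby extends RecoveryState

  final class MiniMaster(var state: RecoveryState) {
    // Counts how many times scheduling actually ran (illustration only).
    var scheduleRuns = 0

    private def schedule(): Unit = {
      if (state != Alive) return // same guard as in Master.schedule()
      scheduleRuns += 1
    }

    // A standby Master ignores registrations, as in the RegisterApplication handler.
    def registerApplication(name: String): Unit = if (state != Standby) schedule()

    // Any change in cluster resources also leads to a scheduling pass.
    def workerAdded(id: String): Unit = schedule()
    def executorRemoved(id: String): Unit = schedule()
  }
}
```

A standby MiniMaster never increments scheduleRuns; once its state is switched to Alive, each event results in exactly one pass.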
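4. Random.shuffle randomizes the order of the Worker information that the Master holds for the cluster. Internally it copies the elements into a buffer and repeatedly swaps randomly chosen positions; the Master's own workers collection is left unchanged.
5. Only Workers in the ALIVE state can take part in resource allocation, so the Worker list is filtered on WorkerState.ALIVE before it is shuffled: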

    // Drivers take strict precedence over executors
    val shuffledAliveWorkers = Random.shuffle(workers.toSeq.filter(_.state == WorkerState.ALIVE))
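
The filter-then-shuffle step can be tried in isolation. The sketch below assumes a simplified Worker case class and a two-value State type in place of Spark's WorkerInfo and WorkerState; both are made up for this example. The random generator is passed in so the caller can seed it.

```scala
import scala.util.Random

object AliveWorkerShuffle {
  // Hypothetical stand-ins for Spark's WorkerState and WorkerInfo.
  sealed trait State
  case object Alive extends State
  case object Dead extends State
  final case class Worker(id: String, state: State)

  // Keep only ALIVE workers, then return them in a random order.
  // shuffle returns a new collection; the input sequence is not modified.
  def shuffledAlive(workers: Seq[Worker], rng: Random): Seq[Worker] =
    rng.shuffle(workers.filter(_.state == Alive))
}
```

Whatever order comes back, it contains exactly the ALIVE workers, and the original sequence keeps its order.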

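6. When spark-submit specifies that the Driver runs in cluster mode, the Driver is added to the waitingDrivers queue. Each DriverDescription records what the Driver requires of a Worker at launch time, including memory and cores: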

    private[deploy] case class DriverDescription(
        jarUrl: String,
        mem: Int,
        cores: Int,
        supervise: Boolean,
        command: Command) {
      override def toString: String = s"DriverDescription (${command.mainClass})"
    }

Workers are tried in the randomly shuffled order, and the Driver is launched on the first one that meets these resource requirements:

    private def launchDriver(worker: WorkerInfo, driver: DriverInfo) {
      logInfo("Launching driver " + driver.id + " on worker " + worker.id)
      worker.addDriver(driver)  // the Worker records the Driver
      driver.worker = Some(worker)  // the Driver records the Worker, so each knows the other
      worker.endpoint.send(LaunchDriver(driver.id, driver.desc))  // the Master sends the instruction to the Worker
      driver.state = DriverState.RUNNING  // mark the Driver's state as RUNNING
    }

The Master sends an instruction to the remote Worker, telling it to launch the Driver:

    worker.endpoint.send(LaunchDriver(driver.id, driver.desc))
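
The round-robin placement of waiting Drivers in schedule() can be reduced to a small, self-contained sketch. Worker, Driver and place are names made up here, and the sketch returns a map of placements instead of sending any messages. It also assumes that placing a Driver reserves its memory and cores on the chosen Worker, which is the bookkeeping worker.addDriver is assumed to do.

```scala
object DriverPlacement {
  // Hypothetical, simplified stand-ins for WorkerInfo and DriverDescription.
  final case class Worker(id: String, memoryFree: Int, coresFree: Int)
  final case class Driver(id: String, mem: Int, cores: Int)

  // Returns driverId -> workerId for every driver that could be placed.
  // Drivers that fit on no worker are absent from the result (still waiting).
  def place(workers: Vector[Worker], drivers: List[Driver]): Map[String, String] = {
    // Mutable view of (free memory, free cores) per worker.
    val free = workers.map(w => (w.memoryFree, w.coresFree)).toArray
    var placed = Map.empty[String, String]
    var curPos = 0 // carried over between drivers, as in schedule()
    for (driver <- drivers) {
      var launched = false
      var visited = 0
      while (visited < workers.size && !launched) {
        val (mem, cores) = free(curPos)
        visited += 1
        if (mem >= driver.mem && cores >= driver.cores) {
          // Assumption: reserving resources here mirrors worker.addDriver.
          free(curPos) = (mem - driver.mem, cores - driver.cores)
          placed = placed + (driver.id -> workers(curPos).id)
          launched = true
        }
        curPos = (curPos + 1) % workers.size
      }
    }
    placed
  }
}
```

Because curPos is not reset between Drivers, consecutive Drivers tend to land on different Workers, and a Driver that fits nowhere is visited against every Worker exactly once before being skipped.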

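7. The Driver must be launched first; all subsequent resource scheduling follows from that.
8. By default Spark launches Executors for applications in FIFO order: all submitted applications sit in a waiting queue, first in, first out, and the next application is allocated resources only after the applications ahead of it have been served.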

    /**
     * Schedule and launch executors on workers
     */
    private def startExecutorsOnWorkers(): Unit = {
      // Right now this is a very simple FIFO scheduler. We keep trying to fit in the first app
      // in the queue, then the second app, etc.
      for (app <- waitingApps if app.coresLeft > 0) {
        val coresPerExecutor: Option[Int] = app.desc.coresPerExecutor
        // Filter out workers that don't have enough resources to launch an executor
        val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
          .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
            worker.coresFree >= coresPerExecutor.getOrElse(1))
          .sortBy(_.coresFree).reverse
        val assignedCores = scheduleExecutorsOnWorkers(app, usableWorkers, spreadOutApps)
        // Now that we've decided how many cores to allocate on each worker, let's allocate them
        for (pos <- 0 until usableWorkers.length if assignedCores(pos) > 0) {
          allocateWorkerResourceToExecutors(
            app, assignedCores(pos), coresPerExecutor, usableWorkers(pos))
        }
      }
    }
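
The FIFO behavior is easier to see with the per-Worker details stripped away. The sketch below is a deliberate simplification: it treats the cluster as one pool of free cores, which the real method does not do, and App and allocate are hypothetical names.

```scala
object FifoCores {
  // Hypothetical stand-in for ApplicationInfo: only the cores still wanted.
  final case class App(id: String, coresLeft: Int)

  // Walk the waiting queue in submission order. Each app takes as many of the
  // remaining free cores as it still needs before the next app is considered.
  // Apps with coresLeft == 0 are skipped, as in startExecutorsOnWorkers().
  def allocate(waitingApps: List[App], freeCores: Int): List[(String, Int)] = {
    var remaining = freeCores
    for (app <- waitingApps if app.coresLeft > 0) yield {
      val granted = math.min(app.coresLeft, remaining)
      remaining -= granted
      app.id -> granted
    }
  }
}
```

With 8 free cores and a queue of apps wanting 6, 0 and 4 cores, the first app is fully served, the second is skipped, and the third gets only the 2 cores that are left.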

9. Before allocating Executors to an application, the Master checks whether the application still needs cores; if it does not, no Executors are allocated for it:

    for (app <- waitingApps if app.coresLeft > 0) {

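10. Before Executors are allocated, candidate Workers must be ALIVE and must satisfy the application's per-Executor memory and core requirements. The Workers that qualify are then sorted in descending order of free cores, producing the usableWorkers array ordered from most to fewest free cores: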

    val usableWorkers = workers.toArray.filter(_.state == WorkerState.ALIVE)
      .filter(worker => worker.memoryFree >= app.desc.memoryPerExecutorMB &&
        worker.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree).reverse
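
The same filter and sort can be run on plain data. In this sketch Worker is a simplified, made-up case class with an alive flag in place of WorkerState, and usable takes the application's per-Executor requirements as plain parameters.

```scala
object UsableWorkers {
  // Hypothetical, simplified stand-in for WorkerInfo.
  final case class Worker(id: String, alive: Boolean, memoryFree: Int, coresFree: Int)

  // Keep workers that are alive and can host at least one executor,
  // then order them from most to fewest free cores.
  def usable(
      workers: Seq[Worker],
      memoryPerExecutorMB: Int,
      coresPerExecutor: Option[Int]): Seq[Worker] =
    workers
      .filter(_.alive)
      .filter(w => w.memoryFree >= memoryPerExecutorMB &&
        w.coresFree >= coresPerExecutor.getOrElse(1))
      .sortBy(_.coresFree)
      .reverse
}
```

When coresPerExecutor is None, a Worker needs only one free core to qualify, which is what getOrElse(1) expresses.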

Under FIFO scheduling, the spreadOutApps flag (spark.deploy.spreadOut, default true) makes each application run across as many nodes as possible:

    // As a temporary workaround before better ways of configuring memory, we allow users to set
    // a flag that will perform round-robin scheduling across the nodes (spreading out each app
    // among all the nodes) instead of trying to consolidate each app onto a small # of nodes.
    private val spreadOutApps = conf.getBoolean("spark.deploy.spreadOut", true)

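11. There are two ways to allocate Executors to an application. The first spreads the Executors across as many Workers in the cluster as possible, which tends to give better data locality. The second, used when spark.deploy.spreadOut is false, consolidates the application onto as few Workers as possible.
12. When cores are actually assigned across the cluster, the Master satisfies the application's request as far as it can, capped by the total free cores on the usable Workers: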

    var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
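
As a quick worked example of that cap, the helper below takes the cores an application still wants and the free cores on each usable Worker. The object and parameter names are made up for this example.

```scala
object CoresToAssign {
  // The app never gets more than it asked for,
  // and never more than the usable workers can offer in total.
  def coresToAssign(coresLeft: Int, freeCoresPerWorker: Seq[Int]): Int =
    math.min(coresLeft, freeCoresPerWorker.sum)
}
```

An application that still wants 10 cores on two Workers with 4 free cores each can be given at most 8; one that wants 3 is given 3.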

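13. If each Worker may host only one Executor for the current application, each iteration assigns one core to that Executor and the Executor count for the Worker stays at 1; otherwise each iteration assigns cores to a new Executor: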

    // If we are launching one executor per worker, then every iteration assigns 1 core
    // to the executor. Otherwise, every iteration assigns cores to a new executor.
    if (oneExecutorPerWorker) {
      assignedExecutors(pos) = 1
    } else {
      assignedExecutors(pos) += 1
    }
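
The difference between spreading out and consolidating shows up clearly in a small simulation. The sketch below is only loosely modeled on the allocation loop: it hands out one core per iteration (the one-Executor-per-Worker case), ignores memory limits entirely, and uses the made-up names CoreSpreading and assign.

```scala
object CoreSpreading {
  // freeCores(i) is the free core count of usable worker i (already sorted).
  // Returns how many cores end up assigned on each worker.
  def assign(coresWanted: Int, freeCores: Array[Int], spreadOut: Boolean): Array[Int] = {
    val assigned = Array.fill(freeCores.length)(0)
    var toAssign = math.min(coresWanted, freeCores.sum)

    // A worker can take another core if cores remain to hand out
    // and the worker still has an unassigned free core.
    def canTake(pos: Int): Boolean =
      toAssign > 0 && freeCores(pos) - assigned(pos) > 0

    var candidates = freeCores.indices.filter(canTake)
    while (candidates.nonEmpty) {
      for (pos <- candidates) {
        var keepGoing = true
        while (keepGoing && canTake(pos)) {
          toAssign -= 1
          assigned(pos) += 1
          // Spread out: take one core, then move on to the next worker.
          // Consolidate: keep filling this worker until it is full.
          if (spreadOut) keepGoing = false
        }
      }
      candidates = candidates.filter(canTake)
    }
    assigned
  }
}
```

For an application wanting 6 cores on three Workers with 4 free cores each, spreading out yields 2, 2, 2, while consolidating yields 4, 2, 0.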

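14. Once the Executor information for the current application is ready, the Master sends an instruction to the Worker over remote communication to actually start the ExecutorBackend process: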

    private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
      logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
      worker.addExecutor(exec)
      worker.endpoint.send(LaunchExecutor(masterUrl,
        exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
      exec.application.driver.send(
        ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
    }

    worker.endpoint.send(LaunchExecutor(masterUrl,
      exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))

15. Immediately afterwards, the Master sends an ExecutorAdded message to the application's Driver:

    exec.application.driver.send(
      ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
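
The two-message handshake in launchExecutor can be mimicked without any RPC layer. In the sketch below, Endpoint is a hypothetical in-memory inbox, and the two message case classes carry fewer fields than Spark's real LaunchExecutor and ExecutorAdded; only the order and the recipients of the messages follow the excerpt above.

```scala
import scala.collection.mutable.ArrayBuffer

object LaunchMessages {
  // Simplified, hypothetical versions of the two messages.
  sealed trait Message
  final case class LaunchExecutor(appId: String, execId: Int, cores: Int, memory: Int)
      extends Message
  final case class ExecutorAdded(execId: Int, workerId: String, cores: Int, memory: Int)
      extends Message

  // In-memory endpoint that only records what it receives (no networking).
  final class Endpoint {
    val inbox: ArrayBuffer[Message] = ArrayBuffer.empty[Message]
    def send(m: Message): Unit = { inbox += m; () }
  }

  // First tell the worker to start the executor, then tell the driver about it.
  def launchExecutor(
      worker: Endpoint,
      driver: Endpoint,
      workerId: String,
      appId: String,
      execId: Int,
      cores: Int,
      memory: Int): Unit = {
    worker.send(LaunchExecutor(appId, execId, cores, memory))
    driver.send(ExecutorAdded(execId, workerId, cores, memory))
  }
}
```

After one call, the Worker's inbox holds a single LaunchExecutor and the Driver's inbox a single ExecutorAdded describing the same Executor.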

That is as far as this analysis goes for now.
