Reading the Spark Source Code (3): From Cluster Startup to Job Submission
1. Master Startup

The Master startup process mainly does two things:

1) Start a daemon thread that periodically checks Workers for timeouts; the default timeout is 60s.
checkForWorkerTimeOutTask = forwardMessageThread.scheduleAtFixedRate(new Runnable {
  override def run(): Unit = Utils.tryLogNonFatalError {
    self.send(CheckForWorkerTimeOut)
  }
}, 0, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)
case CheckForWorkerTimeOut => { timeOutDeadWorkers() }
/** Check for, and remove, any timed-out workers */
private def timeOutDeadWorkers() {
  // Copy the workers into an array so we don't modify the hashset while iterating through it
  val currentTime = System.currentTimeMillis()
  val toRemove = workers.filter(_.lastHeartbeat < currentTime - WORKER_TIMEOUT_MS).toArray
  for (worker <- toRemove) {
    if (worker.state != WorkerState.DEAD) {
      logWarning("Removing %s because we got no heartbeat in %d seconds".format(
        worker.id, WORKER_TIMEOUT_MS / 1000))
      removeWorker(worker)
    } else {
      if (worker.lastHeartbeat < currentTime - ((REAPER_ITERATIONS + 1) * WORKER_TIMEOUT_MS)) {
        workers -= worker // we've seen this DEAD worker in the UI, etc. for long enough; cull it
      }
    }
  }
}
private def removeWorker(worker: WorkerInfo) {
  logInfo("Removing worker " + worker.id + " on " + worker.host + ":" + worker.port)
  worker.setState(WorkerState.DEAD)
  idToWorker -= worker.id
  addressToWorker -= worker.endpoint.address
  for (exec <- worker.executors.values) {
    logInfo("Telling app of lost executor: " + exec.id)
    exec.application.driver.send(ExecutorUpdated(
      exec.id, ExecutorState.LOST, Some("worker lost"), None))
    exec.application.removeExecutor(exec)
  }
  for (driver <- worker.drivers.values) {
    if (driver.desc.supervise) {
      logInfo(s"Re-launching ${driver.id}")
      relaunchDriver(driver)
    } else {
      logInfo(s"Not re-launching ${driver.id} because it was not supervised")
      removeDriver(driver.id, DriverState.ERROR, None)
    }
  }
  persistenceEngine.removeWorker(worker)
}

When the Master finds that a Worker's heartbeat has timed out, it removes that Worker. If Executors are running on that Worker, each Executor is removed and the Driver that owns it is notified. If a Driver is running on that Worker, the Master checks whether the Driver is supervised to decide whether to relaunch it.
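The 60s threshold comes from the spark.worker.timeout property (in seconds). A minimal tuning sketch, shown as SparkConf entries for readability; in a real deployment this would normally go into spark-defaults.conf or SPARK_MASTER_OPTS rather than being built programmatically:

import org.apache.spark.SparkConf

// Illustrative only: raise the worker timeout from the 60s default to 120s.
// The Master multiplies this value by 1000 to obtain WORKER_TIMEOUT_MS.
val masterConf = new SparkConf()
  .set("spark.worker.timeout", "120")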
2) Leader election
val (persistenceEngine_, leaderElectionAgent_) = RECOVERY_MODE match {
  case "ZOOKEEPER" =>
    logInfo("Persisting recovery state to ZooKeeper")
    val zkFactory = new ZooKeeperRecoveryModeFactory(conf, serializer)
    (zkFactory.createPersistenceEngine(), zkFactory.createLeaderElectionAgent(this))
  case "FILESYSTEM" =>
    val fsFactory = new FileSystemRecoveryModeFactory(conf, serializer)
    (fsFactory.createPersistenceEngine(), fsFactory.createLeaderElectionAgent(this))
  case "CUSTOM" =>
    val clazz = Utils.classForName(conf.get("spark.deploy.recoveryMode.factory"))
    val factory = clazz.getConstructor(classOf[SparkConf], classOf[Serializer])
      .newInstance(conf, serializer)
      .asInstanceOf[StandaloneRecoveryModeFactory]
    (factory.createPersistenceEngine(), factory.createLeaderElectionAgent(this))
  case _ =>
    (new BlackHolePersistenceEngine(), new MonarchyLeaderAgent(this))
}

ZooKeeper mode is the most commonly used.
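To actually run in ZooKeeper mode, the Master needs the recovery-mode properties below. This is an illustrative sketch only: the ZooKeeper quorum address and directory are placeholders, and in practice these properties are usually passed to the Master JVM via SPARK_DAEMON_JAVA_OPTS rather than through a SparkConf object.

import org.apache.spark.SparkConf

// Properties consulted by the RECOVERY_MODE match above; the values are placeholders.
val haConf = new SparkConf()
  .set("spark.deploy.recoveryMode", "ZOOKEEPER")
  .set("spark.deploy.zookeeper.url", "zk1:2181,zk2:2181,zk3:2181")
  .set("spark.deploy.zookeeper.dir", "/spark") // znode prefix under which state is persisted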
Leader election involves the following scenarios:
1) On the first startup, one of the Masters is elected leader.
2) The currently active Master is found to be down, and a Master in the standby state is elected leader.
3) A Master in the active state discovers that it is no longer active.
The first case is the simplest. In the second case, the newly elected Master must read the persisted state from the persistence engine and restore it into memory. In the third case, the Master simply shuts itself down.
case ElectedLeader => {
  val (storedApps, storedDrivers, storedWorkers) = persistenceEngine.readPersistedData(rpcEnv)
  state = if (storedApps.isEmpty && storedDrivers.isEmpty && storedWorkers.isEmpty) {
    RecoveryState.ALIVE
  } else {
    RecoveryState.RECOVERING
  }
  logInfo("I have been elected leader! New state: " + state)
  if (state == RecoveryState.RECOVERING) {
    beginRecovery(storedApps, storedDrivers, storedWorkers)
    recoveryCompletionTask = forwardMessageThread.schedule(new Runnable {
      override def run(): Unit = Utils.tryLogNonFatalError {
        self.send(CompleteRecovery)
      }
    }, WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS)
  }
}
case RevokedLeadership => {
  logError("Leadership has been revoked -- master shutting down.")
  System.exit(0)
}

2. Worker Startup
Worker startup is simpler: the Worker mainly registers itself with the Master.

registerWithMaster()

After receiving the Worker's registration request, the Master saves the Worker's information to the persistence engine, replies that registration succeeded, and triggers a round of scheduling:
case RegisterWorker(
    id, workerHost, workerPort, workerRef, cores, memory, workerUiPort, publicAddress) => {
  logInfo("Registering worker %s:%d with %d cores, %s RAM".format(
    workerHost, workerPort, cores, Utils.megabytesToString(memory)))
  if (state == RecoveryState.STANDBY) {
    context.reply(MasterInStandby)
  } else if (idToWorker.contains(id)) {
    context.reply(RegisterWorkerFailed("Duplicate worker ID"))
  } else {
    val worker = new WorkerInfo(id, workerHost, workerPort, cores, memory,
      workerRef, workerUiPort, publicAddress)
    if (registerWorker(worker)) {
      persistenceEngine.addWorker(worker)
      context.reply(RegisteredWorker(self, masterWebUiUrl))
      schedule()
    } else {
      val workerAddress = worker.endpoint.address
      logWarning("Worker registration failed. Attempted to re-register worker at same " +
        "address: " + workerAddress)
      context.reply(RegisterWorkerFailed("Attempted to re-register worker at same address: "
        + workerAddress))
    }
  }
}

After the Worker receives the Master's confirmation that registration succeeded, it starts sending heartbeats to the Master on a timer:
case RegisteredWorker(masterRef, masterWebUiUrl) =>
  logInfo("Successfully registered with master " + masterRef.address.toSparkURL)
  registered = true
  changeMaster(masterRef, masterWebUiUrl)
  forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = Utils.tryLogNonFatalError {
      self.send(SendHeartbeat)
    }
  }, 0, HEARTBEAT_MILLIS, TimeUnit.MILLISECONDS)
  if (CLEANUP_ENABLED) {
    logInfo(
      s"Worker cleanup enabled; old application directories will be deleted in: $workDir")
    forwordMessageScheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = Utils.tryLogNonFatalError {
        self.send(WorkDirCleanup)
      }
    }, CLEANUP_INTERVAL_MILLIS, CLEANUP_INTERVAL_MILLIS, TimeUnit.MILLISECONDS)
  }

The heartbeat interval defaults to a quarter of the timeout, i.e. a heartbeat every 15s.
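As a rough sketch of where that quarter comes from (this mirrors, rather than quotes, the corresponding line in Worker.scala, whose exact form may vary between Spark versions):

import org.apache.spark.SparkConf

// spark.worker.timeout defaults to 60 (seconds); the Worker heartbeats four times per
// timeout window, so 60 * 1000 / 4 = 15000 ms by default.
val workerConf = new SparkConf()
val heartbeatMillis = workerConf.getLong("spark.worker.timeout", 60) * 1000 / 4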
3. Job Submission

The job submission process is more complex and involves several components: the Master, Worker, Client, Driver, and Executor.

First, let's start the analysis from the initialization of SparkContext.

During initialization, SparkContext does two particularly important things: 1) it starts the HeartbeatReceiver; 2) it starts the TaskScheduler.
// We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
// retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
_heartbeatReceiver = env.rpcEnv.setupEndpoint(
  HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))

// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master)
_schedulerBackend = sched
_taskScheduler = ts
_dagScheduler = new DAGScheduler(this)
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()

Starting the TaskScheduler in turn does two main things: it starts the Driver endpoint and the Client.
The code that starts the Driver endpoint:

// TODO (prashant) send conf instead of properties
driverEndpoint = rpcEnv.setupEndpoint(ENDPOINT_NAME, createDriverEndpoint(properties))

After the Driver endpoint starts, it mainly sends ReviveOffers messages to itself on a timer:
override def onStart() {
  // Periodically revive offers to allow delay scheduling to work
  val reviveIntervalMs = conf.getTimeAsMs("spark.scheduler.revive.interval", "1s")

  reviveThread.scheduleAtFixedRate(new Runnable {
    override def run(): Unit = Utils.tryLogNonFatalError {
      Option(self).foreach(_.send(ReviveOffers))
    }
  }, 0, reviveIntervalMs, TimeUnit.MILLISECONDS)
}

The code that starts the Client:
client = new AppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
client.start()

After the Client starts, it registers the Application with the Master:
override def onStart(): Unit = {
  try {
    registerWithMaster(1)
  } catch {
    case e: Exception =>
      logWarning("Failed to connect to master", e)
      markDisconnected()
      stop()
  }
}

After the Master receives the Client's registration request, it replies that registration succeeded, performs a round of resource scheduling, and tells the Workers to launch Executors (LaunchExecutor):
case RegisterApplication(description, driver) => {
  // TODO Prevent repeated registrations from some driver
  if (state == RecoveryState.STANDBY) {
    // ignore, don't send response
  } else {
    logInfo("Registering app " + description.name)
    val app = createApplication(description, driver)
    registerApplication(app)
    logInfo("Registered app " + description.name + " with ID " + app.id)
    persistenceEngine.addApplication(app)
    driver.send(RegisteredApplication(app.id, self))
    schedule()
  }
}

It is worth looking at the resource scheduling process here:
/** Return whether the specified worker can launch an executor for this app. */
def canLaunchExecutor(pos: Int): Boolean = {
  val keepScheduling = coresToAssign >= minCoresPerExecutor
  val enoughCores = usableWorkers(pos).coresFree - assignedCores(pos) >= minCoresPerExecutor

  // If we allow multiple executors per worker, then we can always launch new executors.
  // Otherwise, if there is already an executor on this worker, just give it more cores.
  val launchingNewExecutor = !oneExecutorPerWorker || assignedExecutors(pos) == 0
  if (launchingNewExecutor) {
    val assignedMemory = assignedExecutors(pos) * memoryPerExecutor
    val enoughMemory = usableWorkers(pos).memoryFree - assignedMemory >= memoryPerExecutor
    val underLimit = assignedExecutors.sum + app.executors.size < app.executorLimit
    keepScheduling && enoughCores && enoughMemory && underLimit
  } else {
    // We're adding cores to an existing executor, so no need
    // to check memory and executor limits
    keepScheduling && enoughCores
  }
}

// Keep launching executors until no more workers can accommodate any
// more executors, or if we have reached this application's limits
var freeWorkers = (0 until numUsable).filter(canLaunchExecutor)
while (freeWorkers.nonEmpty) {
  freeWorkers.foreach { pos =>
    var keepScheduling = true
    while (keepScheduling && canLaunchExecutor(pos)) {
      coresToAssign -= minCoresPerExecutor
      assignedCores(pos) += minCoresPerExecutor

      // If we are launching one executor per worker, then every iteration assigns 1 core
      // to the executor. Otherwise, every iteration assigns cores to a new executor.
      if (oneExecutorPerWorker) {
        assignedExecutors(pos) = 1
      } else {
        assignedExecutors(pos) += 1
      }

      // Spreading out an application means spreading out its executors across as
      // many workers as possible. If we are not spreading out, then we should keep
      // scheduling executors on this worker until we use all of its resources.
      // Otherwise, just move on to the next worker.
      if (spreadOutApps) {
        keepScheduling = false
      }
    }
  }
  freeWorkers = freeWorkers.filter(canLaunchExecutor)
}
assignedCores

Let's first look at a few important parameters: coresToAssign, coresPerExecutor, and memoryPerExecutor.
1)coresToAssign
var coresToAssign = math.min(app.coresLeft, usableWorkers.map(_.coresFree).sum)
private[master] def coresLeft: Int = requestedCores - coresGranted
private val requestedCores = desc.maxCores.getOrElse(defaultCores)
private val maxCores = conf.getOption("spark.cores.max").map(_.toInt)
OptionAssigner(args.totalExecutorCores, STANDALONE | MESOS, ALL_DEPLOY_MODES, sysProp = "spark.cores.max"),
case TOTAL_EXECUTOR_CORES => totalExecutorCores = value
protected final String TOTAL_EXECUTOR_CORES = "--total-executor-cores";

From the parameter parsing and passing chain above, coresToAssign is exactly the --total-executor-cores value passed to spark-submit.
2)coresPerExecutor
val coresPerExecutor = app.desc.coresPerExecutor
val coresPerExecutor = conf.getOption("spark.executor.cores").map(_.toInt)
OptionAssigner(args.executorCores, STANDALONE | YARN, ALL_DEPLOY_MODES, sysProp = "spark.executor.cores"),
case EXECUTOR_CORES => executorCores = value
protected final String EXECUTOR_CORES = "--executor-cores";

So coresPerExecutor comes from --executor-cores.
3)memoryPerExecutor
val memoryPerExecutor = app.desc.memoryPerExecutorMB
_executorMemory = _conf.getOption("spark.executor.memory")
  .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
  .orElse(Option(System.getenv("SPARK_MEM")).map(warnSparkMem))
  .map(Utils.memoryStringToMb)
  .getOrElse(1024)
OptionAssigner(args.executorMemory, STANDALONE | MESOS | YARN, ALL_DEPLOY_MODES, sysProp = "spark.executor.memory"),
case EXECUTOR_MEMORY => executorMemory = value
protected final String EXECUTOR_MEMORY = "--executor-memory";

So memoryPerExecutor comes from --executor-memory.
Besides these three parameters, one more parameter is important: spreadOutApps.

private val spreadOutApps = conf.getBoolean("spark.deploy.spreadOut", true)

By default spreadOutApps is true, which means Executors are distributed across the Workers as evenly as possible. If not set, memoryPerExecutor defaults to 1g; if coresPerExecutor is not set, each Worker launches at most one Executor, and an Executor may end up holding many cores while being allocated only memoryPerExecutor worth of memory.
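As an illustrative example of how these parameters interact (the cluster and numbers below are made up): suppose three Workers each offer 8 cores and 16 GB, and the application is submitted with the equivalent of --total-executor-cores 6 --executor-cores 2 --executor-memory 2g.

import org.apache.spark.SparkConf

// Illustrative settings only, equivalent to the spark-submit flags above.
val appConf = new SparkConf()
  .set("spark.cores.max", "6")        // coresToAssign: at most 6 cores for the whole app
  .set("spark.executor.cores", "2")   // coresPerExecutor: 2 cores per executor
  .set("spark.executor.memory", "2g") // memoryPerExecutor: 2 GB per executor

// With spark.deploy.spreadOut = true (the default), the Master assigns 2 cores on each of
// the three Workers, i.e. one 2-core / 2 GB executor per Worker.
// With spreadOut = false, it packs executors onto as few Workers as possible,
// e.g. three 2-core executors on the first Worker.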
After the Master finishes scheduling, it tells the Worker to launch the Executor (LaunchExecutor):

/**
 * Allocate a worker's resources to one or more executors.
 * @param app the info of the application which the executors belong to
 * @param assignedCores number of cores on this worker for this application
 * @param coresPerExecutor number of cores per executor
 * @param worker the worker info
 */
private def allocateWorkerResourceToExecutors(
    app: ApplicationInfo,
    assignedCores: Int,
    coresPerExecutor: Option[Int],
    worker: WorkerInfo): Unit = {
  // If the number of cores per executor is specified, we divide the cores assigned
  // to this worker evenly among the executors with no remainder.
  // Otherwise, we launch a single executor that grabs all the assignedCores on this worker.
  val numExecutors = coresPerExecutor.map { assignedCores / _ }.getOrElse(1)
  val coresToAssign = coresPerExecutor.getOrElse(assignedCores)
  for (i <- 1 to numExecutors) {
    val exec = app.addExecutor(worker, coresToAssign)
    launchExecutor(worker, exec)
    app.state = ApplicationState.RUNNING
  }
}

While telling the Worker to launch the Executor, the Master also notifies the Client with ExecutorAdded:
private def launchExecutor(worker: WorkerInfo, exec: ExecutorDesc): Unit = {
  logInfo("Launching executor " + exec.fullId + " on worker " + worker.id)
  worker.addExecutor(exec)
  worker.endpoint.send(LaunchExecutor(masterUrl,
    exec.application.id, exec.id, exec.application.desc, exec.cores, exec.memory))
  exec.application.driver.send(
    ExecutorAdded(exec.id, worker.id, worker.hostPort, exec.cores, exec.memory))
}

After receiving the Master's instruction, the Worker starts a thread that launches the Executor as a child process:
workerThread = new Thread("ExecutorRunner for " + fullId) {
  override def run() { fetchAndRunExecutor() }
}
workerThread.start()

The code that launches the Executor process:
// Launch the process
val builder = CommandUtils.buildProcessBuilder(appDesc.command, new SecurityManager(conf),
  memory, sparkHome.getAbsolutePath, substituteVariables)
val command = builder.command()
val formattedCommand = command.asScala.mkString("\"", "\" \"", "\"")
logInfo(s"Launch command: $formattedCommand")

builder.directory(executorDir)
builder.environment.put("SPARK_EXECUTOR_DIRS", appLocalDirs.mkString(File.pathSeparator))
// In case we are running this from within the Spark Shell, avoid creating a "scala"
// parent process for the executor command
builder.environment.put("SPARK_LAUNCH_WITH_SCALA", "0")

// Add webUI log urls
val baseUrl =
  s"http://$publicAddress:$webUiPort/logPage/?appId=$appId&executorId=$execId&logType="
builder.environment.put("SPARK_LOG_URL_STDERR", s"${baseUrl}stderr")
builder.environment.put("SPARK_LOG_URL_STDOUT", s"${baseUrl}stdout")

process = builder.start()

Here you can see that the command is passed in via appDesc, i.e. appDesc.command:
val command = Command("org.apache.spark.executor.CoarseGrainedExecutorBackend",
  args, sc.executorEnvs, classPathEntries ++ testingClassPath, libraryPathEntries, javaOpts)

Now it is clear: what gets launched is org.apache.spark.executor.CoarseGrainedExecutorBackend.
Its main method registers the Executor endpoint:

env.rpcEnv.setupEndpoint("Executor", new CoarseGrainedExecutorBackend(
  env.rpcEnv, driverUrl, executorId, sparkHostPort, cores, userClassPath, env))

From the earlier analysis of the Master startup process, we know that once an RpcEndpoint is registered, its onStart() method is invoked:
override def onStart() {
  logInfo("Connecting to driver: " + driverUrl)
  rpcEnv.asyncSetupEndpointRefByURI(driverUrl).flatMap { ref =>
    // This is a very fast action so we can use "ThreadUtils.sameThread"
    driver = Some(ref)
    ref.ask[RegisterExecutorResponse](
      RegisterExecutor(executorId, self, hostPort, cores, extractLogUrls))
  }(ThreadUtils.sameThread).onComplete {
    // This is a very fast action so we can use "ThreadUtils.sameThread"
    case Success(msg) => Utils.tryLogNonFatalError {
      Option(self).foreach(_.send(msg)) // msg must be RegisterExecutorResponse
    }
    case Failure(e) => {
      logError(s"Cannot register with driver: $driverUrl", e)
      System.exit(1)
    }
  }(ThreadUtils.sameThread)
}

So the first thing the Executor does after starting is register itself with the Driver.
As mentioned above, the Driver endpoint is started when the TaskScheduler starts during SparkContext initialization. When the Driver receives the Executor's registration, it sends a reply back to the Executor:

context.reply(RegisteredExecutor(executorAddress.host))

After receiving the Driver's reply, the Executor starts sending heartbeats on a timer; the default heartbeat interval is 10s:
private def startDriverHeartbeater(): Unit = {
  val intervalMs = conf.getTimeAsMs("spark.executor.heartbeatInterval", "10s")

  // Wait a random interval so the heartbeats don't end up in sync
  val initialDelay = intervalMs + (math.random * intervalMs).asInstanceOf[Int]

  val heartbeatTask = new Runnable() {
    override def run(): Unit = Utils.logUncaughtExceptions(reportHeartBeat())
  }
  heartbeater.scheduleAtFixedRate(heartbeatTask, initialDelay, intervalMs, TimeUnit.MILLISECONDS)
}
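The interval is controlled by spark.executor.heartbeatInterval, as seen above. A small, illustrative tuning sketch; it should stay much smaller than the timeouts the driver uses to decide an executor is lost:

import org.apache.spark.SparkConf

// Illustrative only: lengthen the executor-to-driver heartbeat from 10s to 20s.
val execConf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "20s")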
[Figure omitted: rough flow chart of the startup and registration process described above]
The preceding sections analyzed the SparkContext initialization process; next we continue with how a job is submitted.

RDD operations fall into two categories: transformations and actions.

A transformation does not trigger job submission; it merely creates a new RDD and passes the current RDD to the new one as a dependency, as in the map() operation below:
/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

An action triggers job submission, as in the collect() operation:
/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

This calls SparkContext's runJob() method, which in turn calls dagScheduler.runJob() ...
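A minimal, self-contained sketch of this lazy/eager distinction (the app name, master URL, and variable names are purely illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object LazyVsEager {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-vs-eager").setMaster("local[2]"))

    val nums    = sc.parallelize(1 to 10) // creates an RDD, no job yet
    val doubled = nums.map(_ * 2)         // transformation: a new MapPartitionsRDD, still no job

    // Action: this goes through SparkContext.runJob() and actually submits a job.
    val result = doubled.collect()
    println(result.mkString(","))

    sc.stop()
  }
}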
The complete call chain is as follows:

Action --> SparkContext.runJob() --> dagScheduler.runJob() --> dagScheduler.submitJob() --> eventProcessLoop.post(JobSubmitted) --> dagScheduler.handleJobSubmitted() --> taskScheduler.submitTasks() --> CoarseGrainedSchedulerBackend.launchTasks() --> executor.launchTask() --> threadPool.execute(TaskRunner)

The whole flow involves the conversion RDD --> DAG --> Stage --> TaskSet as well as the resource scheduling process. This part is more involved and will be analyzed in follow-up posts.