Main Components of SparkContext


1. Overview

This walkthrough is based on Spark 1.6.0.
A SparkContext is the prerequisite for submitting and running an application. Let's start with the Scaladoc on SparkContext:

```scala
/**
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 *
 * Only one SparkContext may be active per JVM.  You must `stop()` the active SparkContext before
 * creating a new one.  This limitation may eventually be removed; see SPARK-2243 for more details.
 *
 * @param config a Spark Config object describing the application configuration. Any settings in
 *   this config overrides the default configs as well as system properties.
 */
```

In other words, SparkContext is the entry point to Spark, playing the role of an application's main function. Multiple SparkContexts can be created within a single JVM process, but only one may be active at a time. If you need a new SparkContext instance, you must first call stop() on the currently active one.
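
As a rough illustration of that rule (the application names below are made up), a second SparkContext in the same JVM can only be created after stopping the first:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EntryPointSketch {
  def main(args: Array[String]): Unit = {
    // First context: master and app name are required (see section 2)
    val conf = new SparkConf().setMaster("local[2]").setAppName("first-app")
    val sc = new SparkContext(conf)
    // ... create RDDs, accumulators, broadcast variables ...
    sc.stop() // must stop the active context before creating another one in this JVM

    val sc2 = new SparkContext(conf.setAppName("second-app"))
    sc2.stop()
  }
}
```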


2. Initial Setup

First, the current CallSite is recorded, and SparkContext checks whether multiple instances are allowed via the spark.driver.allowMultipleContexts property, which defaults to false.

```scala
class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {
  // The call site of the current SparkContext: the user class closest to the top of the stack
  // and the Scala/Spark core class closest to the bottom.
  private val creationSite: CallSite = Utils.getCallSite()

  // By default only one SparkContext instance is allowed. If allowMultipleContexts is set to true
  // in the config (SparkConf), Spark only logs a warning when multiple active SparkContexts exist
  // instead of throwing an exception -- be careful with this. Defaults to false when not configured.
  private val allowMultipleContexts: Boolean =
    config.getBoolean("spark.driver.allowMultipleContexts", false)

  // Ensure the uniqueness of the SparkContext instance and mark the current one as under
  // construction, to prevent multiple SparkContexts from becoming active at the same time.
  // NOTE: this must be placed at the beginning of the SparkContext constructor.
  SparkContext.markPartiallyConstructed(this, allowMultipleContexts)
  ...
}
```
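
A minimal sketch (the application name is made up): with this flag set, a second active SparkContext only triggers the warning mentioned above instead of an exception.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("multi-context-demo")
  .set("spark.driver.allowMultipleContexts", "true") // downgrade the error to a warning
```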

Next, the SparkConf is cloned and its settings are validated. Most importantly, the SparkConf must define spark.master (the deployment mode) and spark.app.name (the application name); otherwise an exception is thrown.

```scala
private var _conf: SparkConf = _
...
_conf = config.clone()
_conf.validateSettings()

if (!_conf.contains("spark.master")) {
  throw new SparkException("A master URL must be set in your configuration")
}
if (!_conf.contains("spark.app.name")) {
  throw new SparkException("An application name must be set in your configuration")
}
```

3. Creating the Execution Environment SparkEnv

SparkEnv is Spark's execution environment object and holds many objects related to Executor execution. In local mode the Driver creates the Executor itself; in local-cluster or Standalone deployments the Executor is created inside the CoarseGrainedExecutorBackend process launched by a Worker. A SparkEnv therefore lives in either the Driver or a CoarseGrainedExecutorBackend process.
The SparkEnv is created mainly through SparkEnv's createDriverEnv method, which takes four arguments: conf, isLocal, listenerBus, and numberCores (the number of cores the driver needs to run executors in local mode).

```scala
// Whether we are running in local mode
def isLocal: Boolean = (master == "local" || master.startsWith("local["))

// An asynchronous listener bus for Spark events, implementing the listener pattern
private[spark] val listenerBus = new LiveListenerBus
...

/**
 * The number of driver cores to use for execution in local mode, 0 otherwise.
 */
private[spark] def numDriverCores(master: String): Int = {
  def convertToInt(threads: String): Int = {
    if (threads == "*") Runtime.getRuntime.availableProcessors() else threads.toInt
  }
  master match {
    case "local" => 1
    case SparkMasterRegex.LOCAL_N_REGEX(threads) => convertToInt(threads)
    case SparkMasterRegex.LOCAL_N_FAILURES_REGEX(threads, _) => convertToInt(threads)
    case _ => 0 // driver is not used for execution
  }
}
...

// This function allows components created by SparkEnv to be mocked in unit tests:
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
}
```

SparkEnv's createDriverEnv method eventually calls create to build the SparkEnv. Its construction steps are:
1. Create the security manager SecurityManager;
2. Create the RpcEnv;
3. Create the Akka-based distributed messaging system ActorSystem (note: deprecated since Spark 1.4.0);
4. Create the map-output tracker MapOutputTracker;
5. Create the ShuffleManager;
6. Create the memory manager MemoryManager;
7. Create the block-transfer service NettyBlockTransferService;
8. Create the BlockManagerMaster;
9. Create the block manager BlockManager;
10. Create the broadcast manager BroadcastManager;
11. Create the cache manager CacheManager;
12. Create the metrics system MetricsSystem;
13. Create the OutputCommitCoordinator;
14. Create the SparkEnv itself.

These modules are all important and deserve a closer look one by one; detailed write-ups will be linked later. The sketch below shows how the pieces hang off the process-wide SparkEnv.
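
A rough sketch (run on the driver after the SparkContext is up, assuming the SparkEnv fields are accessible as public vals): most of the objects listed above are reachable from SparkEnv.get.

```scala
import org.apache.spark.SparkEnv

val env = SparkEnv.get
val blockManager     = env.blockManager      // block storage (step 9)
val shuffleManager   = env.shuffleManager    // shuffle implementation (step 5)
val mapOutputTracker = env.mapOutputTracker  // map output tracking (step 4)
val metricsSystem    = env.metricsSystem     // metrics (step 12)
```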


4. Creating the SparkUI

SparkUI provides browser-accessible pages with styling, layout, and rich monitoring data. It is built on an event-listening mechanism: emitted events are buffered, and a timer-driven dispatcher hands them to the listeners registered for them, which then update the monitoring data. If the UI is not needed, set spark.ui.enabled to false.

```scala
_ui =
  if (conf.getBoolean("spark.ui.enabled", true)) {
    Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
      _env.securityManager, appName, startTime = startTime))
  } else {
    // For tests, do not enable the UI
    None
  }
```
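
A minimal sketch of the related settings: disable the UI entirely, or move it off the default port 4040.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.enabled", "false") // skips SparkUI.createLiveUI above
  .set("spark.ui.port", "4050")     // only takes effect when the UI stays enabled
```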

5. Hadoop-Related Configuration

By default Spark uses HDFS as its distributed file system, so the relevant Hadoop configuration must be obtained:

```scala
_hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)
```

The resulting configuration includes:

  • The Amazon S3 credentials AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, loaded into the Hadoop Configuration;
  • Every SparkConf property starting with spark.hadoop., copied into the Hadoop Configuration with the prefix stripped;
  • The SparkConf property spark.buffer.size, copied into the Hadoop Configuration as io.file.buffer.size.
```scala
/**
 * Return an appropriate (subclass) of Configuration. Creating config can initializes some Hadoop
 * subsystems.
 */
def newConfiguration(conf: SparkConf): Configuration = {
  val hadoopConf = new Configuration()

  // Note: this null check is around more than just access to the "conf" object to maintain
  // the behavior of the old implementation of this code, for backwards compatibility.
  if (conf != null) {
    // Explicitly check for S3 environment variables
    if (System.getenv("AWS_ACCESS_KEY_ID") != null &&
        System.getenv("AWS_SECRET_ACCESS_KEY") != null) {
      val keyId = System.getenv("AWS_ACCESS_KEY_ID")
      val accessKey = System.getenv("AWS_SECRET_ACCESS_KEY")
      hadoopConf.set("fs.s3.awsAccessKeyId", keyId)
      hadoopConf.set("fs.s3n.awsAccessKeyId", keyId)
      hadoopConf.set("fs.s3a.access.key", keyId)
      hadoopConf.set("fs.s3.awsSecretAccessKey", accessKey)
      hadoopConf.set("fs.s3n.awsSecretAccessKey", accessKey)
      hadoopConf.set("fs.s3a.secret.key", accessKey)
    }
    // Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
    conf.getAll.foreach { case (key, value) =>
      if (key.startsWith("spark.hadoop.")) {
        hadoopConf.set(key.substring("spark.hadoop.".length), value)
      }
    }
    val bufferSize = conf.get("spark.buffer.size", "65536")
    hadoopConf.set("io.file.buffer.size", bufferSize)
  }
  hadoopConf
}
```
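
A minimal sketch (the HDFS address and app name are made up): spark.hadoop.* properties and spark.buffer.size end up in sc.hadoopConfiguration, as implemented by newConfiguration above.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("hadoop-conf-demo")
  .set("spark.hadoop.fs.defaultFS", "hdfs://namenode:8020") // copied as fs.defaultFS
  .set("spark.buffer.size", "131072")                       // copied as io.file.buffer.size
val sc = new SparkContext(conf)
println(sc.hadoopConfiguration.get("fs.defaultFS"))         // hdfs://namenode:8020
println(sc.hadoopConfiguration.get("io.file.buffer.size"))  // 131072
sc.stop()
```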

6. Executor Environment Variables

The environment variables in executorEnvs are sent to the Master when the application registers; after the Master schedules the application onto Workers, each Worker uses the information in executorEnvs to launch its Executor.
The Executor memory size is set via spark.executor.memory, or alternatively through the SPARK_EXECUTOR_MEMORY or SPARK_MEM environment variables.

```scala
_executorMemory = _conf.getOption("spark.executor.memory")
  .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
  .orElse(Option(System.getenv("SPARK_MEM"))
  .map(warnSparkMem))
  .map(Utils.memoryStringToMb)
  .getOrElse(1024)
```

executorEnvs is backed by a HashMap:

```scala
// Environment variables to pass to our executors.
private[spark] val executorEnvs = HashMap[String, String]()

// Convert java options to env vars as a work around
// since we can't set env vars directly in sbt.
for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
  value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
  executorEnvs(envKey) = value
}
Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
  executorEnvs("SPARK_PREPEND_CLASSES") = v
}
// The Mesos scheduler backend relies on this environment variable to set executor memory.
// TODO: Set this only in the Mesos scheduler.
executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
executorEnvs ++= _conf.getExecutorEnv
executorEnvs("SPARK_USER") = sparkUser
```
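
A minimal sketch (the variable name and value are made up): these settings feed the _executorMemory and executorEnvs logic shown above.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "2g")         // takes precedence over SPARK_EXECUTOR_MEMORY / SPARK_MEM
  .setExecutorEnv("MY_ENV_VAR", "some-value") // merged into executorEnvs via _conf.getExecutorEnv
```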

7. Creating the Task Scheduler TaskScheduler

The TaskScheduler is another key component of SparkContext. It submits tasks, requests resources from the cluster manager, ships tasks to the cluster and runs them, retries failed tasks, and relaunches slow (straggler) tasks on other nodes.
The TaskScheduler handles task scheduling and resource assignment, while the SchedulerBackend communicates with the Master and Workers to track the resources allocated to the application on each Worker. The TaskScheduler:

  • Creates and maintains a TaskSetManager for each TaskSet, tracking task locality and failure information;
  • Relaunches straggler tasks on other nodes;
  • Reports progress to the DAGScheduler, including fetch-failed errors when shuffle output is lost.

The TaskScheduler is created as follows:

```scala
// Create and start the scheduler
// sched: SchedulerBackend
// ts: TaskScheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master)
```

The createTaskScheduler method matches the deployment mode from the master URL, creates a TaskSchedulerImpl, and pairs it with the appropriate SchedulerBackend.

```scala
/**
 * Create a task scheduler based on a given master URL.
 * Return a 2-tuple of the scheduler backend and the task scheduler.
 */
private def createTaskScheduler(
    sc: SparkContext,
    master: String): (SchedulerBackend, TaskScheduler) = {
  import SparkMasterRegex._

  // When running locally, don't try to re-execute tasks on failure.
  val MAX_LOCAL_TASK_FAILURES = 1

  master match {
    case "local" =>
      val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
      val backend = new LocalBackend(sc.getConf, scheduler, 1)
      scheduler.initialize(backend)
      (backend, scheduler)

    case LOCAL_N_REGEX(threads) =>
      ...  // similar

    case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
      ...  // similar

    case SPARK_REGEX(sparkUrl) =>
      ...  // similar

    case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
      ...  // similar

    case "yarn-standalone" | "yarn-cluster" =>
      ...  // similar

    case "yarn-client" =>
      ...  // similar

    case MESOS_REGEX(mesosUrl) =>
      ...  // similar

    case SIMR_REGEX(simrUrl) =>
      ...  // similar

    case zkUrl if zkUrl.startsWith("zk://") =>
      logWarning("Master URL for a multi-master Mesos cluster managed by ZooKeeper should be " +
        "in the form mesos://zk://host:port. Current Master URL will stop working in Spark 2.0.")
      createTaskScheduler(sc, "mesos://" + zkUrl)

    case _ =>
      throw new SparkException("Could not parse Master URL: '" + master + "'")
  }
}
```
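
For reference, a few illustrative master URLs and the cases they fall into above (host names are made up):

```scala
val masters = Seq(
  "local",                     // "local": 1 thread, no task retries
  "local[4]",                  // LOCAL_N_REGEX: 4 threads
  "local[4, 2]",               // LOCAL_N_FAILURES_REGEX: 4 threads, maxFailures = 2
  "local-cluster[2, 1, 1024]", // LOCAL_CLUSTER_REGEX: 2 workers, 1 core and 1024 MB each
  "spark://host:7077",         // SPARK_REGEX: standalone cluster
  "yarn-client",               // YARN, driver runs in the client process
  "yarn-cluster",              // YARN, driver runs inside the cluster
  "mesos://host:5050"          // MESOS_REGEX
)
```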



8. Creating and Starting the DAGScheduler

The DAGScheduler does the preparatory work before tasks are handed to the TaskScheduler: it creates Jobs, splits the RDDs in the DAG into Stages, submits Stages, and so on.
The DAGScheduler is created as follows:

```scala
@volatile private var _dagScheduler: DAGScheduler = _
_dagScheduler = new DAGScheduler(this)
```

The DAGScheduler's data structures mainly track the mapping between jobIds and stageIds, Stages, ActiveJobs, and the locations of cached RDD partitions.
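
A rough sketch of the stage split (assumes a live SparkContext `sc`; the input path is made up): the shuffle introduced by reduceByKey becomes a stage boundary, so the count() action produces a job with one ShuffleMapStage and one ResultStage.

```scala
val counts = sc.textFile("/path/to/input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // shuffle dependency => new stage
counts.count()          // action => DAGScheduler builds and submits the job
```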

9. Starting the TaskScheduler

Starting the TaskScheduler in effect calls start on its backend:

```scala
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
_taskScheduler.start()

override def start() {
  backend.start()
  ...
}
```

10. Starting the Metrics System MetricsSystem

The MetricsSystem involves three concepts:

  • Instance: who is using the metrics system.
    Spark distinguishes Master, Worker, Application, Driver, and Executor instances;
  • Source: where metrics are collected from.
    There are two kinds: Spark internal sources (MasterSource, WorkerSource, etc.) and common sources (JvmSource);
  • Sink: where metrics are sent.
    Spark currently provides ConsoleSink, CsvSink, JmxSink, MetricsServlet, GraphiteSink, and others; MetricsServlet is the default Sink.

Starting the MetricsSystem involves:
1. Registering the Sources;
2. Registering the Sinks;
3. Adding the Sinks to Jetty's ServletContextHandlers.
Once the MetricsSystem is started, the ServletContextHandlers associated with the Sinks are attached to the Spark UI via attachHandler:

```scala
// Attach the driver metrics servlet handler to the web ui after the metrics system is started.
metricsSystem.getServletHandlers.foreach(handler => ui.foreach(_.attachHandler(handler)))
```
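
A minimal sketch (the file path is made up): spark.metrics.conf points the MetricsSystem at a metrics.properties file, which would define sinks such as ConsoleSink or CsvSink.

```scala
import org.apache.spark.SparkConf

// The referenced file might contain, for example:
//   *.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
//   *.sink.console.period=10
//   *.sink.console.unit=seconds
val conf = new SparkConf().set("spark.metrics.conf", "/path/to/metrics.properties")
```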

11. Creating and Starting the ExecutorAllocationManager

The ExecutorAllocationManager manages the executors that have been allocated.
It is created and started as follows:

```scala
// Optionally scale number of executors dynamically based on workload. Exposed for testing.
val dynamicAllocationEnabled = Utils.isDynamicAllocationEnabled(_conf)
if (!dynamicAllocationEnabled && _conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
  logWarning("Dynamic Allocation and num executors both set, thus dynamic allocation disabled.")
}

_executorAllocationManager =
  if (dynamicAllocationEnabled) {
    Some(new ExecutorAllocationManager(this, listenerBus, _conf))
  } else {
    None
  }
_executorAllocationManager.foreach(_.start())
```

By default no ExecutorAllocationManager is created; set spark.dynamicAllocation.enabled to true to enable it. The ExecutorAllocationManager reads and validates configuration such as the minimum and maximum number of executors and the number of tasks each executor can run. Its start method registers an ExecutorAllocationListener on the listenerBus; by listening to bus events, it dynamically adds and removes executors, requesting new executors as the workload grows and killing and removing executors that have been idle past their timeout.
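
A minimal sketch of the properties involved (values are illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "1")
  .set("spark.dynamicAllocation.maxExecutors", "20")
  .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
  .set("spark.shuffle.service.enabled", "true") // the external shuffle service is required
```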


12. Creating and Starting the ContextCleaner

The ContextCleaner cleans up RDDs, ShuffleDependencies, and Broadcast objects that have gone out of application scope.

```scala
_cleaner =
  if (_conf.getBoolean("spark.cleaner.referenceTracking", true)) {
    Some(new ContextCleaner(this))
  } else {
    None
  }
_cleaner.foreach(_.start())
```

The ContextCleaner consists of:

  • referenceQueue: holds the references to top-level AnyRef objects that have been collected;
  • referenceBuffer: holds the weak references to those AnyRefs;
  • listeners: an array of listeners for cleanup events;
  • cleaningThread: the thread that performs the actual cleanup.

13. Registering Listeners and Posting Environment Updates

The initialization of SparkContext may affect its environment, so the environment must be updated:

```scala
// Register the listeners specified in the config's spark.extraListeners property
// and start the listener bus
setupAndStartListenerBus()
postEnvironmentUpdate()
postApplicationStart()
```
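
A rough sketch of a custom listener (the class name is made up): it could be registered programmatically as below, or by listing its fully qualified name in spark.extraListeners.

```scala
import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationStart}

class StartLoggingListener extends SparkListener {
  override def onApplicationStart(event: SparkListenerApplicationStart): Unit = {
    println(s"App ${event.appName} started by ${event.sparkUser} at ${event.time}")
  }
}

sc.addSparkListener(new StartLoggingListener) // assumes a live SparkContext `sc`
```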

During SparkContext initialization, if spark.jars is set, each jar it lists is added by addJar to the directory referenced by httpFileServer's jarDir. Every added jar triggers a call to postEnvironmentUpdate to refresh the environment; adding files works the same way and also calls postEnvironmentUpdate.

```scala
_jars = _conf.getOption("spark.jars").map(_.split(",")).map(_.filter(_.size != 0)).toSeq.flatten
_files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.size != 0))
  .toSeq.flatten
...

/**
 * Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.
 * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
 * filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
 */
def addJar(path: String) {
  ...
  postEnvironmentUpdate()
}
...

/**
 * Add a file to be downloaded with this Spark job on every node.
 * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
 * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
 * use `SparkFiles.get(fileName)` to find its download location.
 *
 * A directory can be given if the recursive option is set to true. Currently directories are only
 * supported for Hadoop-supported filesystems.
 */
def addFile(path: String, recursive: Boolean): Unit = {
  ...
  postEnvironmentUpdate()
}
```
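
A rough usage sketch (paths are made up; assumes a live SparkContext `sc`): both calls end with postEnvironmentUpdate(), as described above.

```scala
sc.addJar("/path/to/my-udfs.jar")        // made available to executors for task class loading
sc.addFile("/path/to/lookup-table.csv")  // downloaded to every node

// Inside a task, the downloaded copy can be located with SparkFiles:
val localPath = org.apache.spark.SparkFiles.get("lookup-table.csv")
```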

postEnvironmentUpdate proceeds as follows:
1. It calls SparkEnv's environmentDetails method to collect the JVM parameters, Spark properties, system properties, classpath, and so on;
2. It creates a SparkListenerEnvironmentUpdate event and posts it to the listenerBus; the event is handled by the environment listener and ultimately drives the contents of the Environment page in the UI.

```scala
/** Post the environment update event once the task scheduler is ready */
private def postEnvironmentUpdate() {
  if (taskScheduler != null) {
    val schedulingMode = getSchedulingMode.toString
    val addedJarPaths = addedJars.keys.toSeq
    val addedFilePaths = addedFiles.keys.toSeq
    val environmentDetails = SparkEnv.environmentDetails(conf, schedulingMode, addedJarPaths,
      addedFilePaths)
    val environmentUpdate = SparkListenerEnvironmentUpdate(environmentDetails)
    listenerBus.post(environmentUpdate)
  }
}
```

postApplicationStart simply posts a SparkListenerApplicationStart event to the listenerBus:

```scala
/** Post the application start event */
private def postApplicationStart() {
  // Note: this code assumes that the task scheduler has been initialized and has contacted
  // the cluster manager to get an application ID (in case the cluster manager provides one).
  listenerBus.post(SparkListenerApplicationStart(appName, Some(applicationId),
    startTime, sparkUser, applicationAttemptId, schedulerBackend.getDriverLogUrls))
}
```

14. Creating DAGSchedulerSource, BlockManagerSource, and ExecutorAllocationManagerSource

Before registering these sources, taskScheduler's postStartHook method is called to wait for the backend to be ready.

```scala
// Post init
_taskScheduler.postStartHook()
_env.metricsSystem.registerSource(_dagScheduler.metricsSource)
_env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
_executorAllocationManager.foreach { e =>
  _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
}
```

15. Marking the SparkContext as Active

At the very end of initialization, the SparkContext's state is changed from contextBeingConstructed (under construction) to activeContext (active):

```scala
// In order to prevent multiple SparkContexts from being active at the same time, mark this
// context as having finished construction.
// NOTE: this must be placed at the end of the SparkContext constructor.
SparkContext.setActiveContext(this, allowMultipleContexts)
```

With that, construction of the SparkContext is complete. Since it touches so many components, each module deserves separate study; links to those write-ups will be added later.
