1. Overview


    /**
     * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
     * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
     *
     * Only one SparkContext may be active per JVM.  You must `stop()` the active SparkContext before
     * creating a new one.  This limitation may eventually be removed; see SPARK-2243 for more details.
     *
     * @param config a Spark Config object describing the application configuration. Any settings in
     *   this config overrides the default configs as well as system properties.
     */


2. Initial setup


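    class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationClient {

      // The call site at which this SparkContext was created. It records the user class closest
      // to the top of the stack and the Scala or Spark core class closest to the bottom.
      private val creationSite: CallSite = Utils.getCallSite()

      // By default only one SparkContext instance may be active. If allowMultipleContexts is set
      // to true in config (SparkConf), Spark logs a warning instead of throwing an exception when
      // several active SparkContext instances exist, so use it with care.
      // Defaults to false when not configured.
      private val allowMultipleContexts: Boolean =
        config.getBoolean("spark.driver.allowMultipleContexts", false)

      // Guarantees that the SparkContext instance is unique by marking the current one as under
      // construction, which prevents several instances from becoming active at the same time.
      // NOTE: this must be placed at the beginning of the SparkContext constructor.
      SparkContext.markPartiallyConstructed(this, allowMultipleContexts)
      ...
    }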


```scala
private var _conf: SparkConf = _
...
_conf = config.clone()
_conf.validateSettings()

if (!_conf.contains("spark.master")) {
  throw new SparkException("A master URL must be set in your configuration")
}
if (!_conf.contains("spark.app.name")) {
  throw new SparkException("An application name must be set in your configuration")
}
```

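3. Creating the execution environment, SparkEnv

The SparkEnv is created mainly through SparkEnv's createDriverEnv method, which takes four arguments: conf, isLocal, listenerBus, and the number of cores the driver needs to run executors in local mode.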

    // Whether Spark is running in local mode
    def isLocal: Boolean = (master == "local" || master.startsWith("local["))

    // Events are handled through the listener pattern
    // An asynchronous listener bus for Spark events
    private[spark] val listenerBus = new LiveListenerBus
    ...
    /**
     * The number of driver cores to use for execution in local mode, 0 otherwise.
     */
    private[spark] def numDriverCores(master: String): Int = {
      def convertToInt(threads: String): Int = {
        if (threads == "*") Runtime.getRuntime.availableProcessors() else threads.toInt
      }
      master match {
        case "local" => 1
        case SparkMasterRegex.LOCAL_N_REGEX(threads) => convertToInt(threads)
        case SparkMasterRegex.LOCAL_N_FAILURES_REGEX(threads, _) => convertToInt(threads)
        case _ => 0 // driver is not used for execution
      }
    }
    ...
    // This function allows components created by SparkEnv to be mocked in unit tests:
    private[spark] def createSparkEnv(
        conf: SparkConf,
        isLocal: Boolean,
        listenerBus: LiveListenerBus): SparkEnv = {
      SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
    }
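The mapping from a master URL to a driver core count can be tried out in isolation. The following is a minimal sketch in plain Scala with no Spark dependency; the object name `LocalMasterSketch` and the two regular expressions are assumptions that stand in for `SparkMasterRegex.LOCAL_N_REGEX` and `LOCAL_N_FAILURES_REGEX`, whose definitions are not shown in the excerpt.

```scala
object LocalMasterSketch {
  // Hypothetical stand-ins for SparkMasterRegex.LOCAL_N_REGEX and LOCAL_N_FAILURES_REGEX.
  private val LocalN = """local\[([0-9]+|\*)\]""".r
  private val LocalNFailures = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r

  // Same test as isLocal in the excerpt.
  def isLocal(master: String): Boolean =
    master == "local" || master.startsWith("local[")

  // "*" means "use every available processor".
  private def toCores(threads: String): Int =
    if (threads == "*") Runtime.getRuntime.availableProcessors() else threads.toInt

  // Cores the driver uses for execution in local mode, 0 for any cluster master.
  def numDriverCores(master: String): Int = master match {
    case "local"                    => 1
    case LocalN(threads)            => toCores(threads)
    case LocalNFailures(threads, _) => toCores(threads)
    case _                          => 0
  }
}
```

For example, `LocalMasterSketch.numDriverCores("local[4]")` yields 4, while a cluster URL such as `spark://host:7077` yields 0 because the driver does not execute tasks there.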

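1. Create the security manager, SecurityManager;
2. Create the RpcEnv;
3. Create the Akka-based distributed messaging system, ActorSystem (note: deprecated since Spark 1.4.0);
4. Create the map output tracker, MapOutputTracker;
5. Create the ShuffleManager;
6. Create the memory manager, MemoryManager;
7. Create the block transfer service, NettyBlockTransferService;
8. Create the BlockManagerMaster;
9. Create the block manager, BlockManager;
10. Create the broadcast manager, BroadcastManager;
11. Create the cache manager, CacheManager;
12. Create the metrics system, MetricsSystem;
13. Create the OutputCommitCoordinator;
14. Create the SparkEnv.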


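4. Creating the SparkUI

SparkUI serves styled pages with rich monitoring data that can be viewed in a browser. It is built on an event listener mechanism: posted events are placed in a buffer, and a scheduler takes them out and dispatches them to the listeners registered for each event, which then update the monitoring data. If the SparkUI is not needed, set spark.ui.enabled to false.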

    _ui =
      if (conf.getBoolean("spark.ui.enabled", true)) {
        Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
          _env.securityManager, appName, startTime = startTime))
      } else {
        // For tests, do not enable the UI
        None
      }
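The buffer-then-dispatch flow described above can be illustrated with a small sketch. It is not Spark's `LiveListenerBus`: every name here (`SimpleListenerBus`, `JobCounter`, the `Event` types) is hypothetical, and dispatch happens in an explicit `drain()` call rather than on a background thread so that the behaviour stays deterministic.

```scala
import java.util.concurrent.LinkedBlockingQueue
import scala.collection.mutable.ArrayBuffer

object ListenerBusSketch {
  // Hypothetical event types; Spark's own events are SparkListenerEvent subclasses.
  sealed trait Event
  final case class JobStart(jobId: Int) extends Event
  final case class JobEnd(jobId: Int) extends Event

  trait Listener {
    def onEvent(event: Event): Unit
  }

  final class SimpleListenerBus {
    private val queue = new LinkedBlockingQueue[Event]()
    private val listeners = ArrayBuffer.empty[Listener]

    def addListener(listener: Listener): Unit = listeners += listener

    // Posting only buffers the event; no listener runs yet.
    def post(event: Event): Unit = queue.put(event)

    // Take buffered events out in order and hand each one to every listener.
    // Returns the number of events dispatched.
    def drain(): Int = {
      var dispatched = 0
      var event = queue.poll()
      while (event != null) {
        listeners.foreach(_.onEvent(event))
        dispatched += 1
        event = queue.poll()
      }
      dispatched
    }
  }

  // A listener that keeps simple monitoring counters, as a UI page might.
  final class JobCounter extends Listener {
    var started = 0
    var finished = 0
    def onEvent(event: Event): Unit = event match {
      case JobStart(_) => started += 1
      case JobEnd(_)   => finished += 1
    }
  }
}
```

Decoupling `post` from dispatch means the code that produces events never waits for slow listeners, which is the main reason for placing a buffer between the two.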

5. Hadoop-related configuration


_hadoopConfiguration = SparkHadoopUtil.get.newConfiguration(_conf)


  • Loads the Amazon S3 credentials AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY into the Hadoop Configuration;
  • Copies every SparkConf property whose name starts with spark.hadoop. into the Hadoop Configuration;
  • Copies the SparkConf property spark.buffer.size to the Hadoop Configuration setting io.file.buffer.size.

    /**
     * Return an appropriate (subclass) of Configuration. Creating config can initializes some Hadoop
     * subsystems.
     */
    def newConfiguration(conf: SparkConf): Configuration = {
      val hadoopConf = new Configuration()

      // Note: this null check is around more than just access to the "conf" object to maintain
      // the behavior of the old implementation of this code, for backwards compatibility.
      if (conf != null) {
        // Explicitly check for S3 environment variables
        if (System.getenv("AWS_ACCESS_KEY_ID") != null &&
            System.getenv("AWS_SECRET_ACCESS_KEY") != null) {
          val keyId = System.getenv("AWS_ACCESS_KEY_ID")
          val accessKey = System.getenv("AWS_SECRET_ACCESS_KEY")

          hadoopConf.set("fs.s3.awsAccessKeyId", keyId)
          hadoopConf.set("fs.s3n.awsAccessKeyId", keyId)
          hadoopConf.set("fs.s3a.access.key", keyId)
          hadoopConf.set("fs.s3.awsSecretAccessKey", accessKey)
          hadoopConf.set("fs.s3n.awsSecretAccessKey", accessKey)
          hadoopConf.set("fs.s3a.secret.key", accessKey)
        }
        // Copy any "spark.hadoop.foo=bar" system properties into conf as "foo=bar"
        conf.getAll.foreach { case (key, value) =>
          if (key.startsWith("spark.hadoop.")) {
            hadoopConf.set(key.substring("spark.hadoop.".length), value)
          }
        }
        val bufferSize = conf.get("spark.buffer.size", "65536")
        hadoopConf.set("io.file.buffer.size", bufferSize)
      }

      hadoopConf
    }
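The prefix-stripping rule is easy to check without Hadoop on the classpath. The sketch below models both configurations as plain `Map[String, String]` values, which is an assumption made for illustration; `HadoopConfSketch` and `toHadoopSettings` are hypothetical names.

```scala
object HadoopConfSketch {
  private val Prefix = "spark.hadoop."

  // Derive the Hadoop-side settings from Spark-side settings:
  // keep only "spark.hadoop.*" keys, strip the prefix, then add the buffer size.
  def toHadoopSettings(sparkSettings: Map[String, String]): Map[String, String] = {
    val copied = sparkSettings.collect {
      case (key, value) if key.startsWith(Prefix) =>
        key.substring(Prefix.length) -> value
    }
    // Same default as in the excerpt above.
    val bufferSize = sparkSettings.getOrElse("spark.buffer.size", "65536")
    copied + ("io.file.buffer.size" -> bufferSize)
  }
}
```

With this rule, a Spark setting such as `spark.hadoop.fs.defaultFS` reaches Hadoop as `fs.defaultFS`, while unrelated keys such as `spark.app.name` are not copied.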

6. Executor environment variables


    _executorMemory = _conf.getOption("spark.executor.memory")
      .orElse(Option(System.getenv("SPARK_EXECUTOR_MEMORY")))
      .orElse(Option(System.getenv("SPARK_MEM"))
      .map(warnSparkMem))
      .map(Utils.memoryStringToMb)
      .getOrElse(1024)
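The lookup order in that chain (configuration first, then two environment variables, then a 1024 MB default) can be sketched as follows. The environment is passed in as a `Map` so the example is deterministic, and `memoryStringToMb` here is a simplified, hypothetical parser rather than Spark's `Utils.memoryStringToMb`; treating a bare number as a byte count is an assumption.

```scala
object ExecutorMemorySketch {
  // Simplified parser: accepts suffixes k, m, g (case-insensitive).
  // Assumption: a bare number is a byte count.
  def memoryStringToMb(text: String): Int = {
    val lower = text.trim.toLowerCase
    if (lower.endsWith("g")) lower.dropRight(1).toInt * 1024
    else if (lower.endsWith("m")) lower.dropRight(1).toInt
    else if (lower.endsWith("k")) lower.dropRight(1).toInt / 1024
    else (lower.toLong / 1024 / 1024).toInt
  }

  // First match wins: conf setting, SPARK_EXECUTOR_MEMORY, SPARK_MEM, then 1024 MB.
  def resolve(conf: Map[String, String], env: Map[String, String]): Int =
    conf.get("spark.executor.memory")
      .orElse(env.get("SPARK_EXECUTOR_MEMORY"))
      .orElse(env.get("SPARK_MEM"))
      .map(memoryStringToMb)
      .getOrElse(1024)
}
```

Chaining `orElse` on `Option` keeps the precedence explicit: each fallback is consulted only when everything before it is absent.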


    // Environment variables to pass to our executors.
    private[spark] val executorEnvs = HashMap[String, String]()

    // Convert java options to env vars as a work around
    // since we can't set env vars directly in sbt.
    for { (envKey, propKey) <- Seq(("SPARK_TESTING", "spark.testing"))
      value <- Option(System.getenv(envKey)).orElse(Option(System.getProperty(propKey)))} {
      executorEnvs(envKey) = value
    }
    Option(System.getenv("SPARK_PREPEND_CLASSES")).foreach { v =>
      executorEnvs("SPARK_PREPEND_CLASSES") = v
    }
    // The Mesos scheduler backend relies on this environment variable to set executor memory.
    // TODO: Set this only in the Mesos scheduler.
    executorEnvs("SPARK_EXECUTOR_MEMORY") = executorMemory + "m"
    executorEnvs ++= _conf.getExecutorEnv
    executorEnvs("SPARK_USER") = sparkUser

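7. Creating the task scheduler, TaskScheduler


  • Creates and maintains a TaskSetManager for each TaskSet, and tracks task locality and failure information;
  • Retries straggler tasks on other nodes;
  • Reports execution status to the DAGScheduler, including fetch failed errors when shuffle output is lost.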


    // Create and start the scheduler
    // sched: SchedulerBackend
    // ts: TaskScheduler
    val (sched, ts) = SparkContext.createTaskScheduler(this, master)


    /**
     * Create a task scheduler based on a given master URL.
     * Return a 2-tuple of the scheduler backend and the task scheduler.
     */
    private def createTaskScheduler(
        sc: SparkContext,
        master: String): (SchedulerBackend, TaskScheduler) = {
      import SparkMasterRegex._

      // When running locally, don't try to re-execute tasks on failure.
      val MAX_LOCAL_TASK_FAILURES = 1

      master match {
        case "local" =>
          val scheduler = new TaskSchedulerImpl(sc, MAX_LOCAL_TASK_FAILURES, isLocal = true)
          val backend = new LocalBackend(sc.getConf, scheduler, 1)
          scheduler.initialize(backend)
          (backend, scheduler)

        case LOCAL_N_REGEX(threads) =>
          ...  // similar

        case LOCAL_N_FAILURES_REGEX(threads, maxFailures) =>
          ...  // similar

        case SPARK_REGEX(sparkUrl) =>
          ...  // similar

        case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave) =>
          ...  // similar

        case "yarn-standalone" | "yarn-cluster" =>
          ...  // similar

        case "yarn-client" =>
          ...  // similar

        case MESOS_REGEX(mesosUrl) =>
          ...  // similar

        case SIMR_REGEX(simrUrl) =>
          ...  // similar

        case zkUrl if zkUrl.startsWith("zk://") =>
          logWarning("Master URL for a multi-master Mesos cluster managed by ZooKeeper should be " +
            "in the form mesos://zk://host:port. Current Master URL will stop working in Spark 2.0.")
          createTaskScheduler(sc, "mesos://" + zkUrl)

        case _ =>
          throw new SparkException("Could not parse Master URL: '" + master + "'")
      }
    }
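The dispatch on the master URL can be sketched on its own, leaving out the schedulers and backends that the real method builds. In this sketch the `Mode` hierarchy and the regular expressions are assumptions introduced for illustration; only the shape of the pattern match, including the `zk://` rewrite and the error for unknown URLs, follows the excerpt.

```scala
object MasterUrlSketch {
  // Hypothetical result type; the real method returns (SchedulerBackend, TaskScheduler).
  sealed trait Mode
  case object Local extends Mode
  final case class LocalThreads(threads: String) extends Mode
  final case class Standalone(url: String) extends Mode
  final case class Yarn(deployMode: String) extends Mode
  final case class Mesos(url: String) extends Mode

  // Assumed patterns, standing in for the SparkMasterRegex members.
  private val LocalN = """local\[([0-9]+|\*)\]""".r
  private val SparkUrl = """spark://(.*)""".r
  private val MesosUrl = """mesos://(.*)""".r

  def classify(master: String): Mode = master match {
    case "local"                            => Local
    case LocalN(threads)                    => LocalThreads(threads)
    case SparkUrl(url)                      => Standalone(url)
    case "yarn-standalone" | "yarn-cluster" => Yarn("cluster")
    case "yarn-client"                      => Yarn("client")
    case MesosUrl(url)                      => Mesos(url)
    // Legacy form: rewrite zk://... to mesos://zk://... and try again.
    case zkUrl if zkUrl.startsWith("zk://") => classify("mesos://" + zkUrl)
    case _ =>
      throw new IllegalArgumentException("Could not parse Master URL: '" + master + "'")
  }
}
```

Because Scala regex extractors must match the whole string, the order of the cases only matters for the guard and the wildcard at the end.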


8. Creating and starting the DAGScheduler


    @volatile private var _dagScheduler: DAGScheduler = _
    ...
    _dagScheduler = new DAGScheduler(this)


9. Starting the TaskScheduler


    // start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
    // constructor
    _taskScheduler.start()

    // In TaskSchedulerImpl:
    override def start() {
      backend.start()
      ...
    }

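10. Starting the metrics system, MetricsSystem


  • Instance: specifies who is using the metrics system;
  • Source: specifies where metrics are collected from;
    There are two kinds of Source: Spark internal sources (MasterSource, WorkerSource, and so on) and common sources (JvmSource);
  • Sink: specifies where metrics are written to.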

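1. Register the Sources;
2. Register the Sinks;
3. Add a Jetty ServletContextHandler for the Sinks.

Once the MetricsSystem has started, Spark iterates over the ServletContextHandlers associated with the Sinks and calls attachHandler to bind each of them to the Spark UI.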

    // Attach the driver metrics servlet handler to the web ui after the metrics system is started.
    metricsSystem.getServletHandlers.foreach(handler => ui.foreach(_.attachHandler(handler)))
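The three roles (instance, source, sink) fit together roughly as in the sketch below. It is a toy model rather than Spark's MetricsSystem: the traits, `SimpleMetricsSystem`, `InMemorySink` and the dotted metric-name format are all assumptions made for illustration.

```scala
import scala.collection.mutable.ArrayBuffer

object MetricsSketch {
  // Where measurements come from: a named group of gauges read on demand.
  trait Source {
    def sourceName: String
    def gauges: Map[String, () => Long]
  }

  // Where measurements go.
  trait Sink {
    def report(metric: String, value: Long): Unit
  }

  // `instance` names who is using the metrics system, for example "driver".
  final class SimpleMetricsSystem(val instance: String) {
    private val sources = ArrayBuffer.empty[Source]
    private val sinks = ArrayBuffer.empty[Sink]

    def registerSource(source: Source): Unit = sources += source
    def registerSink(sink: Sink): Unit = sinks += sink

    // Read every gauge of every source once and push the value to every sink.
    def report(): Unit =
      for {
        source <- sources
        (name, gauge) <- source.gauges
        sink <- sinks
      } sink.report(s"$instance.${source.sourceName}.$name", gauge())
  }

  // A sink that only remembers what it was given, handy for inspection.
  final class InMemorySink extends Sink {
    val seen = ArrayBuffer.empty[(String, Long)]
    def report(metric: String, value: Long): Unit = seen += ((metric, value))
  }
}
```

Keeping sources and sinks behind separate interfaces lets either side be added or replaced without touching the other.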

11. Creating and starting the ExecutorAllocationManager


    // Optionally scale number of executors dynamically based on workload. Exposed for testing.
    val dynamicAllocationEnabled = Utils.isDynamicAllocationEnabled(_conf)
    if (!dynamicAllocationEnabled && _conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
      logWarning("Dynamic Allocation and num executors both set, thus dynamic allocation disabled.")
    }

    _executorAllocationManager =
      if (dynamicAllocationEnabled) {
        Some(new ExecutorAllocationManager(this, listenerBus, _conf))
      } else {
        None
      }
    _executorAllocationManager.foreach(_.start())
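The warning message suggests that dynamic allocation only takes effect when no fixed executor count is requested. A minimal sketch of such a check follows, assuming the fixed count comes from a `spark.executor.instances` setting; that key and the `Map`-based configuration are assumptions, since the body of `Utils.isDynamicAllocationEnabled` is not shown here.

```scala
object DynamicAllocationSketch {
  // Enabled only if requested AND no fixed number of executors is configured.
  // Assumption: the fixed count is read from "spark.executor.instances".
  def isDynamicAllocationEnabled(conf: Map[String, String]): Boolean = {
    val requested = conf.getOrElse("spark.dynamicAllocation.enabled", "false").toBoolean
    val fixedExecutors = conf.getOrElse("spark.executor.instances", "0").toInt
    requested && fixedExecutors == 0
  }
}
```

Under this rule, setting both options leaves dynamic allocation off, which is the case the warning in the code above reports.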


12. Creating and starting the ContextCleaner


    _cleaner =
      if (_conf.getBoolean("spark.cleaner.referenceTracking", true)) {
        Some(new ContextCleaner(this))
      } else {
        None
      }
    _cleaner.foreach(_.start())


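  • referenceQueue: holds references to top-level AnyRef objects;
  • referenceBuffer: holds the weak references to those AnyRef objects;
  • listeners: the array of listeners notified of cleanup work;
  • cleaningThread: the thread that performs the actual cleanup.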
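How a reference queue and a reference buffer cooperate can be shown with the JDK's `java.lang.ref` classes. This is a sketch under stated assumptions, not Spark's ContextCleaner: `CleanupRef`, `Cleaner` and the integer `resourceId` are hypothetical, and cleanup is driven by an explicit `cleanPending()` call instead of a dedicated cleaning thread.

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}
import scala.collection.mutable

object CleanerSketch {
  // A weak reference that remembers which resource to release once its referent is gone.
  // `referent` is only passed to the superclass, so no strong reference is kept here.
  final class CleanupRef(referent: AnyRef, val resourceId: Int, queue: ReferenceQueue[AnyRef])
    extends WeakReference[AnyRef](referent, queue)

  final class Cleaner {
    // The JVM places a CleanupRef here after its referent has been collected.
    private val referenceQueue = new ReferenceQueue[AnyRef]
    // Keeps the CleanupRef objects themselves alive until they have been processed.
    private val referenceBuffer = mutable.Set.empty[CleanupRef]
    // Ids of resources that were cleaned, in processing order.
    val cleaned = mutable.ArrayBuffer.empty[Int]

    def register(obj: AnyRef, resourceId: Int): CleanupRef = {
      val ref = new CleanupRef(obj, resourceId, referenceQueue)
      referenceBuffer += ref
      ref
    }

    // Process every reference currently waiting in the queue; returns how many were handled.
    def cleanPending(): Int = {
      var count = 0
      var ref = referenceQueue.poll()
      while (ref != null) {
        ref match {
          case c: CleanupRef =>
            referenceBuffer -= c
            cleaned += c.resourceId
            count += 1
          case _ => // not one of ours, ignore
        }
        ref = referenceQueue.poll()
      }
      count
    }
  }
}
```

In a long-running process a background thread would call a blocking `referenceQueue.remove()` in a loop. Calling `ref.enqueue()` by hand, as a test might, simulates the garbage collector noticing that the referent is unreachable.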

13. Registering listeners and posting environment updates


    // Register the listeners named in the spark.extraListeners property of config,
    // then start the listener bus
    setupAndStartListenerBus()
    postEnvironmentUpdate()
    postApplicationStart()


    _jars = _conf.getOption("spark.jars").map(_.split(",")).map(_.filter(_.size != 0)).toSeq.flatten
    _files = _conf.getOption("spark.files").map(_.split(",")).map(_.filter(_.size != 0))
      .toSeq.flatten
    ...
    /**
     * Adds a JAR dependency for all tasks to be executed on this SparkContext in the future.
     * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
     * filesystems), an HTTP, HTTPS or FTP URI, or local:/path for a file on every worker node.
     */
    def addJar(path: String) {
      ...
      postEnvironmentUpdate()
    }
    ...
    /**
     * Add a file to be downloaded with this Spark job on every node.
     * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
     * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
     * use `SparkFiles.get(fileName)` to find its download location.
     *
     * A directory can be given if the recursive option is set to true. Currently directories are only
     * supported for Hadoop-supported filesystems.
     */
    def addFile(path: String, recursive: Boolean): Unit = {
      ...
      postEnvironmentUpdate()
    }
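The first two lines above turn an optional comma-separated setting into a list of paths and drop empty entries. A small stand-alone sketch of the same transformation follows; `PathListSketch` and `parse` are hypothetical names.

```scala
object PathListSketch {
  // None -> empty list; Some("a,,b") -> Seq("a", "b").
  def parse(setting: Option[String]): Seq[String] =
    setting
      .map(_.split(",").filter(_.nonEmpty).toSeq)
      .getOrElse(Seq.empty)
}
```

Filtering out empty strings makes a value such as `a.jar,,b.jar` harmless, so a stray comma in the configuration does not produce an empty path.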

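1. Calls SparkEnv's environmentDetails method to gather the environment's JVM information, Spark properties, system properties, classpath entries, and so on;
2. Creates the environmentUpdate event and posts it to the listenerBus. The EnvironmentListener receives this event, which ultimately determines what the EnvironmentPage displays.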

    /** Post the environment update event once the task scheduler is ready */
    private def postEnvironmentUpdate() {
      if (taskScheduler != null) {
        val schedulingMode = getSchedulingMode.toString
        val addedJarPaths = addedJars.keys.toSeq
        val addedFilePaths = addedFiles.keys.toSeq
        val environmentDetails = SparkEnv.environmentDetails(conf, schedulingMode, addedJarPaths,
          addedFilePaths)
        val environmentUpdate = SparkListenerEnvironmentUpdate(environmentDetails)
        listenerBus.post(environmentUpdate)
      }
    }


    /** Post the application start event */
    private def postApplicationStart() {
      // Note: this code assumes that the task scheduler has been initialized and has contacted
      // the cluster manager to get an application ID (in case the cluster manager provides one).
      listenerBus.post(SparkListenerApplicationStart(appName, Some(applicationId),
        startTime, sparkUser, applicationAttemptId, schedulerBackend.getDriverLogUrls))
    }

14. Creating the DAGSchedulerSource, BlockManagerSource and ExecutorAllocationManagerSource


    // Post init
    _taskScheduler.postStartHook()
    _env.metricsSystem.registerSource(_dagScheduler.metricsSource)
    _env.metricsSystem.registerSource(new BlockManagerSource(_env.blockManager))
    _executorAllocationManager.foreach { e =>
      _env.metricsSystem.registerSource(e.executorAllocationManagerSource)
    }

15. Marking the SparkContext as active


    // In order to prevent multiple SparkContexts from being active at the same time, mark this
    // context as having finished construction.
    // NOTE: this must be placed at the end of the SparkContext constructor.
    SparkContext.setActiveContext(this, allowMultipleContexts)
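The one-active-context rule amounts to a guarded, lock-protected assignment. The sketch below illustrates the idea only; `ActiveContextSketch`, its `Context` class and the exception type are hypothetical, and unlike Spark it simply accepts a second context when `allowMultiple` is true instead of logging a warning.

```scala
object ActiveContextSketch {
  // Hypothetical stand-in for SparkContext.
  final class Context(val name: String)

  private val lock = new Object
  private var active: Option[Context] = None

  // Record `ctx` as the active context, rejecting a second one unless explicitly allowed.
  def setActive(ctx: Context, allowMultiple: Boolean): Unit = lock.synchronized {
    active match {
      case Some(other) if (other ne ctx) && !allowMultiple =>
        throw new IllegalStateException(
          s"Only one context may be active; '${other.name}' is still running")
      case _ =>
        active = Some(ctx)
    }
  }

  // Counterpart of stop(): forget the active context so that a new one can be created.
  def clearActive(): Unit = lock.synchronized { active = None }

  def activeName: Option[String] = lock.synchronized { active.map(_.name) }
}
```

Doing the check and the assignment under one lock is what keeps two threads from both passing the check and both becoming active.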


