Spark Source Code Analysis: SparkContext


1. SparkContext is the entry point for Spark application development and is responsible for the connection to the Spark cluster. It can be used to create RDDs, accumulators, broadcast variables, and to perform a range of other operations on the cluster.
2. SparkContext initialization mainly does the following:
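
As a quick user-side illustration of that claim (the app name, master, and data values below are placeholders), the following sketch shows a SparkContext being used as the entry point to create an RDD, an accumulator, and a broadcast variable, using the Spark 1.x API that this article analyzes:

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal driver program: SparkContext is the entry point to the cluster
    val conf = new SparkConf().setAppName("sparkcontext-demo").setMaster("local[2]")
    val sc = new SparkContext(conf)

    val rdd = sc.parallelize(1 to 100)        // an RDD
    val acc = sc.accumulator(0, "multiples")  // an accumulator (Spark 1.x API)
    val factor = sc.broadcast(10)             // a broadcast variable

    // Count how many elements are multiples of the broadcast factor
    rdd.foreach(x => if (x % factor.value == 0) acc += 1)
    println(acc.value)  // 10

    sc.stop()
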

  1. Initialize SparkEnv
  2. Initialize TaskScheduler
  3. Initialize DAGScheduler
  4. Initialize SparkUI

3. SparkEnv-related code
During initialization, SparkContext calls its own createSparkEnv() method:

    // Create the Spark execution environment (cache, map output tracker, etc)
    // The Spark runtime environment is created during initialization
    _env = createSparkEnv(_conf, isLocal, listenerBus)
    SparkEnv.set(_env)

SparkContext's createSparkEnv() method in turn calls SparkEnv.createDriverEnv():

    // This function allows components created by SparkEnv to be mocked in unit tests:
    private[spark] def createSparkEnv(
        conf: SparkConf,
        isLocal: Boolean,
        listenerBus: LiveListenerBus): SparkEnv = {
      // Create the driver-side Spark environment by calling SparkEnv.createDriverEnv()
      SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
    }

SparkEnv.createDriverEnv() then calls SparkEnv.create(). The create() method initializes a series of components such as cacheManager, blockManagerMaster, blockManager, broadcastManager (manages broadcast variables), and mapOutputTracker (tracks the output of map-stage tasks), and only then constructs the new SparkEnv.

4. After SparkEnv is created, create the TaskScheduler

    // Create and start the scheduler
    val (sched, ts) = SparkContext.createTaskScheduler(this, master)
    // Assign sched to _schedulerBackend
    _schedulerBackend = sched
    // Assign ts to _taskScheduler
    _taskScheduler = ts

The scheduler is created differently depending on the deployment mode.
The TaskScheduler is responsible for the actual physical scheduling of each individual task.

    // This is the standalone mode commonly used when submitting Spark applications:
    // step 1: create a TaskSchedulerImpl
    // step 2: build the master URLs
    // step 3: create the backend, which is used to submit tasks to the executors
    // step 4: call TaskSchedulerImpl's initialize() method, which creates the scheduling pool
    case SPARK_REGEX(sparkUrl) =>
      // Create the task scheduler
      val scheduler = new TaskSchedulerImpl(sc)
      val masterUrls = sparkUrl.split(",").map("spark://" + _)
      // The backend is responsible for acquiring and scheduling cluster resources.
      // It extends CoarseGrainedSchedulerBackend, which acts on the driver as the
      // representative (proxy) of the executors running on the workers.
      val backend = new SparkDeploySchedulerBackend(scheduler, sc, masterUrls)
      // Call TaskSchedulerImpl's initialize() method, passing in the backend
      scheduler.initialize(backend)
      // Return the backend and the scheduler
      (backend, scheduler)
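
To make the dispatch on the master string concrete, here is a minimal, self-contained sketch of the same pattern-matching technique; the regexes and branch bodies are simplified illustrations, not the actual Spark source:

    object MasterUrlDispatchDemo {
      // Simplified versions of the regexes SparkContext.createTaskScheduler matches against
      val LOCAL_N_REGEX = """local\[([0-9]+|\*)\]""".r
      val SPARK_REGEX   = """spark://(.*)""".r

      def describe(master: String): String = master match {
        case "local"           => "local mode, single thread"
        case LOCAL_N_REGEX(n)  => s"local mode with $n threads"
        case SPARK_REGEX(urls) => s"standalone mode, master(s): $urls"
        case other             => s"handled by another cluster manager: $other"
      }

      def main(args: Array[String]): Unit = {
        Seq("local", "local[4]", "spark://host1:7077,host2:7077", "yarn-client")
          .foreach(m => println(s"$m -> ${describe(m)}"))
      }
    }
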

Below is the initialize() method in TaskSchedulerImpl:

    // TaskScheduler's initialization method; a SchedulerBackend must be passed in
    def initialize(backend: SchedulerBackend) {
      // Save the backend on the TaskSchedulerImpl
      this.backend = backend
      // temporarily set rootPool name to empty
      // rootPool is the root scheduling pool
      rootPool = new Pool("", schedulingMode, 0, 0)
      schedulableBuilder = {
        schedulingMode match {
          // Task scheduling mode, FIFO -- first in first out (the default)
          case SchedulingMode.FIFO =>
            new FIFOSchedulableBuilder(rootPool)
          case SchedulingMode.FAIR =>
            new FairSchedulableBuilder(rootPool, conf)
        }
      }
      // Build the scheduling pools
      schedulableBuilder.buildPools()
    }
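
As a usage note, the schedulingMode matched above is controlled by the spark.scheduler.mode configuration. A short sketch of switching from the default FIFO to FAIR when building the SparkConf (app name and master are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Ask TaskSchedulerImpl to build a FairSchedulableBuilder instead of the default FIFO one
    val conf = new SparkConf()
      .setAppName("fair-scheduling-demo")
      .setMaster("local[2]")
      .set("spark.scheduler.mode", "FAIR")

    val sc = new SparkContext(conf)
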

5. After the TaskScheduler is created, create the DAGScheduler
The DAGScheduler splits a job into multiple stages of tasks with dependency relationships (based on the RDD dependencies) and then submits them to the TaskScheduler for concrete processing.
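
For example, a small word-count job whose lineage contains one shuffle is split by the DAGScheduler into a ShuffleMapStage (the reading and mapping steps) and a final ResultStage whose results are collected back to the driver. The input path is a placeholder and an existing SparkContext `sc` (like the one created in section 1) is assumed:

    // Assumes an existing SparkContext `sc`
    val counts = sc.textFile("hdfs:///tmp/data.txt")   // placeholder input path
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)   // shuffle dependency: stage boundary
      .collect()            // action: hands the job to the DAGScheduler
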

Create a DAGScheduler instance.
The DAGScheduler is mainly responsible for dividing a job into stages, finding the preferred locations for running tasks, tracking whether RDDs and stage outputs have been materialized (cached), and other related work.

    // Initialize the DAGScheduler; `this` is the SparkContext
    _dagScheduler = new DAGScheduler(this)

This invokes the DAGScheduler auxiliary constructor this(sc, sc.taskScheduler):

    // Pass in the SparkContext and the TaskScheduler (already initialized in the previous step)
    def this(sc: SparkContext) = this(sc, sc.taskScheduler)

6. Start the TaskScheduler

    // Start the TaskScheduler, after the DAGScheduler constructor has set the
    // DAGScheduler reference on it
    _taskScheduler.start()

This calls TaskSchedulerImpl's start() method, which in turn calls start() on its backend; in standalone mode that backend is SparkDeploySchedulerBackend, so we need to look at SparkDeploySchedulerBackend's start() method.
Here is part of that code:

    // ApplicationDescription describes the currently running application, e.g. its maxCores and name
    val appDesc = new ApplicationDescription(sc.appName, maxCores, sc.executorMemory,
      command, appUIAddress, sc.eventLogDir, sc.eventLogCodec, coresPerExecutor)
    // Create the AppClient, which communicates with the cluster on behalf of the application
    // and sends the application registration request to the master
    client = new AppClient(sc.env.rpcEnv, masters, appDesc, this, conf)
    // Start the client endpoint that handles this communication
    client.start()
    // Set the state to SUBMITTED
    launcherBackend.setState(SparkAppHandle.State.SUBMITTED)
    // Wait for registration to complete
    waitForRegistration()
    // Set the state to RUNNING
    launcherBackend.setState(SparkAppHandle.State.RUNNING)

7. SparkUI
Spark's web UI.

    _ui =
      if (conf.getBoolean("spark.ui.enabled", true)) {
        Some(SparkUI.createLiveUI(this, _conf, listenerBus, _jobProgressListener,
          _env.securityManager, appName, startTime = startTime))
      } else {
        // For tests, do not enable the UI
        None
      }

SparkUI's createLiveUI() method is called to create the SparkUI:

    def createLiveUI(
        sc: SparkContext,
        conf: SparkConf,
        listenerBus: SparkListenerBus,
        jobProgressListener: JobProgressListener,
        securityManager: SecurityManager,
        appName: String,
        startTime: Long): SparkUI = {
      create(Some(sc), conf, listenerBus, securityManager, appName,
        jobProgressListener = Some(jobProgressListener), startTime = startTime)
    }
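
As a usage note, the spark.ui.enabled flag checked above, together with spark.ui.port (which defaults to 4040), can be set on the SparkConf before the SparkContext is created; the values here are only an example:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("ui-demo")
      .set("spark.ui.enabled", "true")  // set to "false" to skip creating the SparkUI
      .set("spark.ui.port", "4050")     // move the web UI off the default port 4040
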

8. Initialize the BlockManager

    // Initialize the BlockManager, passing in the applicationId
    // so that the BlockManager manages blocks for this application
    _env.blockManager.initialize(_applicationId)

This calls BlockManager's initialize() method:

    /**
     * Initializes the BlockManager with the given appId. This is not performed in the constructor as
     * the appId may not be known at BlockManager instantiation time (in particular for the driver,
     * where it is only learned after registration with the TaskScheduler).
     *
     * This method initializes the BlockTransferService and ShuffleClient, registers with the
     * BlockManagerMaster, starts the BlockManagerWorker endpoint, and registers with a local shuffle
     * service if configured.
     */
    def initialize(appId: String): Unit = {
      // Initialize the blockTransferService
      blockTransferService.init(this)
      shuffleClient.init(appId)
      // Generate a BlockManagerId
      blockManagerId = BlockManagerId(
        executorId, blockTransferService.hostName, blockTransferService.port)
      shuffleServerId = if (externalShuffleServiceEnabled) {
        logInfo(s"external shuffle service port = $externalShuffleServicePort")
        BlockManagerId(executorId, blockTransferService.hostName, externalShuffleServicePort)
      } else {
        blockManagerId
      }
      // Register this BlockManager with the BlockManagerMaster, passing the BlockManagerId,
      // the maximum memory, and the slave endpoint
      master.registerBlockManager(blockManagerId, maxMemory, slaveEndpoint)
      // Register Executors' configuration with the local shuffle service, if one should exist.
      if (externalShuffleServiceEnabled && !blockManagerId.isDriver) {
        registerWithExternalShuffleServer()
      }
    }
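
From the user's point of view, the BlockManager initialized here is what ultimately stores cached RDD partitions: persisting an RDD writes its partitions as blocks through each executor's BlockManager. A small sketch, assuming an existing SparkContext `sc` (data values are placeholders):

    import org.apache.spark.storage.StorageLevel

    // Assumes an existing SparkContext `sc`
    val cached = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_AND_DISK)
    cached.count()  // first action materializes the partitions as blocks in the BlockManager
    cached.count()  // later actions read the cached blocks instead of recomputing
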

9. At this point, the important work in SparkContext initialization is essentially complete! If there are any mistakes, please point them out promptly so as not to mislead readers. Thank you!
