Spark 2.0.x Source Code Deep Dive: SparkContext
WeChat: 519292115
Email: taosiyuan163@163.com
Original content; please respect the author and do not repost!!
Spark is currently one of the hottest frameworks in the big-data field: it efficiently supports offline batch processing, real-time computation, machine learning and more. Reading its source code will deepen your understanding of the framework.
In the coming chapters I will dissect the core components of Spark 2.0.x one by one, including SparkContext, SparkEnv, RpcEnv, NettyRpc, BlockManager, OutputTracker, TaskScheduler, DAGScheduler and others.
SparkContext is the first object a programmer's code creates, and it is constructed on the Driver side. Besides connecting to the cluster, its construction initializes the core components, including the DAGScheduler, TaskScheduler, SparkEnv, accumulators and so on.
/**
 * Main entry point for Spark functionality. A SparkContext represents the connection to a Spark
 * cluster, and can be used to create RDDs, accumulators and broadcast variables on that cluster.
 *
 * Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before
 * creating a new one. This limitation may eventually be removed; see SPARK-2243 for more details.
 *
 * @param config a Spark Config object describing the application configuration. Any settings in
 *   this config overrides the default configs as well as system properties.
 */
class SparkContext(config: SparkConf) extends Logging {

The first object created here is mainly responsible for listening to Spark cluster events; similar to the MetricsSystem, the two also exchange messages:
// An asynchronous listener bus for Spark events
private[spark] val listenerBus = new LiveListenerBus(this)
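To make the "asynchronous listener bus" idea concrete, here is a hypothetical, heavily simplified stand-in for LiveListenerBus (the names MiniListenerBus, JobStart, JobEnd are all my own, not Spark's): events are appended to a blocking queue and drained by a single daemon thread, so posting never blocks the caller.

```scala
import java.util.concurrent.LinkedBlockingQueue

sealed trait SparkEvent
case class JobStart(jobId: Int) extends SparkEvent
case class JobEnd(jobId: Int) extends SparkEvent

class MiniListenerBus {
  private val queue = new LinkedBlockingQueue[SparkEvent]()
  @volatile private var handled = Vector.empty[SparkEvent]

  // Single dispatcher thread: takes events off the queue and "delivers" them.
  private val dispatcher = new Thread(() => {
    while (true) {
      val e = queue.take() // blocks until an event arrives
      handled :+= e        // a real bus would fan out to registered listeners here
    }
  })
  dispatcher.setDaemon(true)
  dispatcher.start()

  // Posting only enqueues; the caller never waits for the listeners.
  def post(event: SparkEvent): Unit = queue.put(event)
  def processed: Vector[SparkEvent] = handled
}
```

The real LiveListenerBus adds listener registration, queue capacity limits and dropped-event accounting, but the producer/dispatcher split is the same.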
First, the Driver's identifying marks, including its address/host name and its executor id, are written into the SparkConf:
// Set Spark driver host and port system properties. This explicitly sets the configuration
// instead of relying on the default value of the config constant.
_conf.set(DRIVER_HOST_ADDRESS, _conf.get(DRIVER_HOST_ADDRESS))
_conf.setIfMissing("spark.driver.port", "0")
_conf.set("spark.executor.id", SparkContext.DRIVER_IDENTIFIER)
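The difference between set and setIfMissing above matters: set always overwrites, while setIfMissing only fills in a default when the key is absent (so a user-supplied spark.driver.port survives). A minimal sketch of those semantics, with a hypothetical MiniConf class of my own:

```scala
import scala.collection.mutable

// Hypothetical sketch of SparkConf's set / setIfMissing semantics.
class MiniConf {
  private val settings = mutable.Map.empty[String, String]

  // Always overwrites any existing value.
  def set(key: String, value: String): this.type = { settings(key) = value; this }

  // Only writes when the key is not already present.
  def setIfMissing(key: String, value: String): this.type = {
    if (!settings.contains(key)) settings(key) = value
    this
  }

  def get(key: String): String = settings(key)
  def contains(key: String): Boolean = settings.contains(key)
}
```

For example, after `conf.set("spark.driver.port", "7077")`, a later `conf.setIfMissing("spark.driver.port", "0")` leaves the value at "7077".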
Before SparkEnv is created, a JobProgressListener is created. It listens for each event and forwards the event messages to the listenerBus for processing, which includes listening to the events posted while SparkEnv is being created.
// "_jobProgressListener" should be set up before creating SparkEnv because when creating
// "SparkEnv", some messages will be posted to "listenerBus" and we should not miss them.
// Listens for events and forwards the event messages to the listenerBus created above
_jobProgressListener = new JobProgressListener(_conf)
listenerBus.addListener(jobProgressListener)

// Create the Spark execution environment (cache, map output tracker, etc)
// Now the Spark execution environment gets created
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)

What is actually created here is the driver-side env:
// This function allows components created by SparkEnv to be mocked in unit tests:
private[spark] def createSparkEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus): SparkEnv = {
  // Create the DriverEnv
  SparkEnv.createDriverEnv(conf, isLocal, listenerBus, SparkContext.numDriverCores(master))
}

Execution then moves into the initialization method in the SparkEnv object. SparkEnv itself, and the RpcEnv, MapOutputTracker, blockTransferService, blockManager and other components created inside it, will all be covered in separate later chapters.
/**
 * Create a SparkEnv for the driver.
 */
private[spark] def createDriverEnv(
    conf: SparkConf,
    isLocal: Boolean,
    listenerBus: LiveListenerBus,
    numCores: Int,
    mockOutputCommitCoordinator: Option[OutputCommitCoordinator] = None): SparkEnv = {
  // First assert that DRIVER_HOST_ADDRESS is set
  assert(conf.contains(DRIVER_HOST_ADDRESS),
    s"${DRIVER_HOST_ADDRESS.key} is not set on the driver!")
  // Assert that spark.driver.port is set
  assert(conf.contains("spark.driver.port"), "spark.driver.port is not set on the driver!")
  // The address to bind to
  val bindAddress = conf.get(DRIVER_BIND_ADDRESS)
  // The host address to advertise
  val advertiseAddress = conf.get(DRIVER_HOST_ADDRESS)
  val port = conf.get("spark.driver.port").toInt
  // Check whether I/O encryption is enabled
  val ioEncryptionKey = if (conf.get(IO_ENCRYPTION_ENABLED)) {
    Some(CryptoStreamUtils.createKey(conf))
  } else {
    None
  }
  // Delegate to the generic create method of SparkEnv
  create(
    conf,
    SparkContext.DRIVER_IDENTIFIER,
    bindAddress,
    advertiseAddress,
    Option(port),
    isLocal,
    numCores,
    ioEncryptionKey,
    listenerBus = listenerBus,
    mockOutputCommitCoordinator = mockOutputCommitCoordinator
  )
}
// A low-level status-reporting API that monitors job and stage progress
_statusTracker = new SparkStatusTracker(this)
// We need to register "HeartbeatReceiver" before "createTaskScheduler" because Executor will
// retrieve "HeartbeatReceiver" in the constructor. (SPARK-6640)
// Register the heartbeatReceiver Endpoint with the rpcEnv and get back its Reference.
// Worth noting: from here on, all master-slave style components register themselves and look
// up the corresponding Endpoint via setupEndpoint and setupEndpointRef
_heartbeatReceiver = env.rpcEnv.setupEndpoint(
  // Registered by its name together with the endpoint object
  HeartbeatReceiver.ENDPOINT_NAME, new HeartbeatReceiver(this))
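The setupEndpoint / setupEndpointRef pattern can be sketched as a name-to-endpoint registry: an endpoint registers under a well-known name, and peers later resolve a reference to it by that same name. This is a hypothetical toy (MiniRpcEnv and this Endpoint trait are my own names, not Spark's RpcEnv API, which is asynchronous and network-transparent):

```scala
import scala.collection.concurrent.TrieMap

trait Endpoint { def receive(msg: Any): Any }

class MiniRpcEnv {
  // Thread-safe registry from endpoint name to endpoint instance.
  private val endpoints = TrieMap.empty[String, Endpoint]

  // Register an endpoint under a name and hand back a "reference" to it.
  def setupEndpoint(name: String, endpoint: Endpoint): Endpoint = {
    endpoints.put(name, endpoint)
    endpoint
  }

  // Resolve a previously registered endpoint by name.
  def setupEndpointRef(name: String): Option[Endpoint] = endpoints.get(name)
}
```

In the real RpcEnv the "reference" is an RpcEndpointRef that can live in another JVM and supports ask/send over the network, which is why registration by name matters.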
Next comes the creation and start-up of two of the most central components, the TaskScheduler and the DAGScheduler:
// Create and start the scheduler
// createTaskScheduler mainly matches on the master string to decide how the schedulerBackend
// and taskScheduler are built
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
// Create the DAGScheduler; the details come in a later chapter
_dagScheduler = new DAGScheduler(this)
// Tell the HeartbeatReceiver that the taskScheduler has been created
_heartbeatReceiver.ask[Boolean](TaskSchedulerIsSet)

// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's
// constructor
// Start the taskScheduler
_taskScheduler.start()
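The master-string matching inside createTaskScheduler can be illustrated with a small sketch. The real method returns a (SchedulerBackend, TaskScheduler) pair for each case; here, as a hypothetical simplification, we only classify the deployment mode (the classifyMaster name and the returned labels are my own):

```scala
// Patterns mirroring the shapes createTaskScheduler distinguishes.
val LOCAL_N   = """local\[([0-9]+|\*)\]""".r
val SPARK_URL = """spark://(.*)""".r

def classifyMaster(master: String): String = master match {
  case "local"          => "local[1 core]"        // single-threaded local mode
  case LOCAL_N(threads) => s"local[$threads threads]" // local with N (or all) threads
  case SPARK_URL(_)     => "standalone cluster"   // spark://host:port
  case "yarn"           => "YARN"
  case _                => "external cluster manager"
}
```

Note how the regex vals act as extractors in the match, which is exactly the style the real createTaskScheduler uses (e.g. its LOCAL_N_REGEX and SPARK_REGEX patterns).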
Then the app id of the submitted job is fetched and the BlockManager is initialized:
// Get the Spark app id; its format is "spark-application-" + timestamp
_applicationId = _taskScheduler.applicationId()
_applicationAttemptId = taskScheduler.applicationAttemptId()
// Store the job's app id in the conf
_conf.set("spark.app.id", _applicationId)
if (_conf.getBoolean("spark.ui.reverseProxy", false)) {
  System.setProperty("spark.ui.proxyBase", "/proxy/" + _applicationId)
}
_ui.foreach(_.setAppId(_applicationId))
// Take the app id and initialize the blockManager with it
_env.blockManager.initialize(_applicationId)
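The app-id format noted in the comment above can be made concrete with a one-line sketch (makeAppId is my own hypothetical helper, not a Spark API; cluster managers such as YARN supply their own id formats instead):

```scala
// "spark-application-" followed by a millisecond timestamp, per the format
// described above.
def makeAppId(clockMillis: Long): String = s"spark-application-$clockMillis"

val appId = makeAppId(System.currentTimeMillis())
```

This id then threads through the conf ("spark.app.id"), the UI, and the BlockManager, which uses it to namespace its state per application.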
If dynamic resource allocation is enabled, an ExecutorAllocationManager object is constructed; currently this mode can only be used on YARN.
// Optionally scale number of executors dynamically based on workload. Exposed for testing.
// Whether resources are allocated dynamically; currently only supported on YARN. Enabling the
// blockManager's external shuffle service also requires this setting — more on that in a later
// chapter
val dynamicAllocationEnabled = Utils.isDynamicAllocationEnabled(_conf)
_executorAllocationManager =
  if (dynamicAllocationEnabled) {
    schedulerBackend match {
      // Responsible for connecting to the cluster manager to request or kill executors
      case b: ExecutorAllocationClient =>
        // This object triggers the policies that add or remove resources based on the
        // cluster's load
        Some(new ExecutorAllocationManager(
          schedulerBackend.asInstanceOf[ExecutorAllocationClient], listenerBus, _conf))
      case _ =>
        None
    }
  } else {
    None
  }
_executorAllocationManager.foreach(_.start())

Next, a weak-reference-based Cleaner is created for the forced cleanup of RDDs, ShuffleDependencies and Broadcasts:
_cleaner =
  if (_conf.getBoolean("spark.cleaner.referenceTracking", true)) {
    // ContextCleaner holds weak references to RDDs, ShuffleDependencies and Broadcasts and
    // cleans them up once they go out of scope — for example, when an RDD has been reclaimed
    // by GC but its corresponding data set has not, the ContextCleaner is responsible for
    // cleaning up the data set belonging to that RDD
    Some(new ContextCleaner(this))
  } else {
    None
  }
_cleaner.foreach(_.start())
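The weak-reference mechanism ContextCleaner builds on can be sketched with plain java.lang.ref types. This is a hypothetical illustration (the Tracked class is my own): the cleaner holds only weak references to tracked objects plus a ReferenceQueue; once GC collects a tracked object, its weak reference is enqueued, signalling that the object's external state can now be cleaned up.

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}

class Tracked(val name: String)

val refQueue = new ReferenceQueue[Tracked]()

var rdd: Tracked = new Tracked("rdd-42")
val weakRef = new WeakReference(rdd, refQueue)

// While "rdd" is strongly reachable, weakRef.get() still returns it and
// nothing is enqueued.

rdd = null       // drop the last strong reference
System.gc()      // request (but not guarantee) a collection
Thread.sleep(100)

// After a collection, weakRef is typically enqueued, and poll() hands it back;
// a real cleaner loops on remove() in a dedicated daemon thread and performs
// the matching cleanup (unpersist, remove shuffle files, ...) for each entry.
val maybeCleanable = Option(refQueue.poll())
```

Because System.gc() is only a hint, collection is not guaranteed on any particular run, which is why the comment above hedges with "typically".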