Spark Streaming源码初探 (2)
来源:互联网 发布:5g网络的概念 编辑:程序博客网 时间:2024/06/05 12:33
在 Spark Streaming源码初探 (1) 讲解基于Receiver方式创建DStream和简单分析StreamingContext的启动函数,本节将继续上一节的内容,主要从StreamingContext#start方法中的jobScheduler.start()开始。
简单回顾一下StreamingContext#start源码:
/** * Start the execution of the streams. * * @throws IllegalStateException if the StreamingContext is already stopped. */ def start(): Unit = synchronized { state match { case INITIALIZED => startSite.set(DStream.getCreationSite()) StreamingContext.ACTIVATION_LOCK.synchronized { StreamingContext.assertNoOtherContextIsActive() try { validate() // Start the streaming scheduler in a new thread, so that thread local properties // like call sites and job groups can be reset without affecting those of the // current thread. ThreadUtils.runInNewThread("streaming-start") { sparkContext.setCallSite(startSite.get) sparkContext.clearJobGroup() sparkContext.setLocalProperty(SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, "false") savedProperties.set(SerializationUtils.clone(sparkContext.localProperties.get())) //TODO JobScheduler启动,重点的入口 scheduler.start() } ... //省略部分代码 } }
此段代码重点:JobScheduler的启动,在JobScheduler的start方法中完成一些必要初始化工作,具体见如下!
接下来就查看org.apache.spark.streaming.scheduler.JobScheduler#start代码:
/** * org.apache.spark.streaming.scheduler.#start() */ def start(): Unit = synchronized { if (eventLoop != null) return // scheduler has already been started logDebug("Starting JobScheduler") eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") { //TODO JobScheduler的重点方法,负责job的启动和异常信息处理 override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event) override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e) } eventLoop.start() // attach rate controllers of input streams to receive batch completion updates for { inputDStream <- ssc.graph.getInputStreams rateController <- inputDStream.rateController } ssc.addStreamingListener(rateController) listenerBus.start() //TODO ReceiverTracker初始化 receiverTracker = new ReceiverTracker(ssc) //TODO 主要用来跟踪输入数据集的统计信息,并进行UI显示和监控 inputInfoTracker = new InputInfoTracker(ssc) val executorAllocClient: ExecutorAllocationClient = ssc.sparkContext.schedulerBackend match { case b: ExecutorAllocationClient => b.asInstanceOf[ExecutorAllocationClient] case _ => null } executorAllocationManager = ExecutorAllocationManager.createIfEnabled( executorAllocClient, receiverTracker, ssc.conf, ssc.graph.batchDuration.milliseconds, clock) executorAllocationManager.foreach(ssc.addStreamingListener) //TODO 重点:ReceiverTracker启动(完成ReceiverSupvisor的创建一起Receiver的启动) receiverTracker.start() //start方法中会判断是否为Receiver类型计算源而具体判断是否启动Receiver //TODO jobGenerator启动 jobGenerator.start() executorAllocationManager.foreach(_.start()) logInfo("Started JobScheduler") } //TODO JobScheduler的start方法定义结束此段代码的重点如下:
1:启动一个时间循环处理器,该事件循环处理器负责job的启动的信息处理(主要负责UI更新),并没有处理Job启动事件,Job真正启动是job直接调用run进行启动执行 job。该事件循环处理器还负责job完成信息和异常信息处理。
2:ReceiverTracker创建和启动,如果是基于Receiver的计算源的话,ReceiverTracker主要负责在Executor上启动ReceiverSupvisor和Receiver。如果不是基于Receiver的 计算源的话,同样也会启动ReceiverTracker,但是此处不会在Executor上启动ReceiverSupvisor和Receiver(org.apache.spark.streaming.DStreamGraph#getReceiverInputStreams是否为空实现是否启动Recevier)
3:inputInfoTracker创建,负责Driver端输入数据源的跟踪。下图是三者之间一个关系,StreamingContext持有JobScheduler引用,JobScheduler持有InputInfoTracker引 用
每一次在ReceiverInputDStream(基于Receiver模式)的compute生成RDD时候都会向inputInfoTracker进行汇报RDD信息,如下是ReceiverInputDStream的compute方法:
/** * Generates RDDs with blocks received by the receiver of this stream. */ override def compute(validTime: Time): Option[RDD[T]] = { val blockRDD = { if (validTime < graph.startTime) { // If this is called for any time before the start time of the context, // then this returns an empty RDD. This may happen when recovering from a // driver failure without any write ahead log to recover pre-failure data. new BlockRDD[T](ssc.sc, Array.empty) } else { // Otherwise, ask the tracker for all the blocks that have been allocated to this stream // for this batch val receiverTracker = ssc.scheduler.receiverTracker //TODO receiverTracker是driver端的receiver跟踪器 val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty) // Register the input blocks information into InputInfoTracker val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum) ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)//向inputInfoTracker进行汇报输入源信息 // Create the BlockRDD createBlockRDD(validTime, blockInfos) } } Some(blockRDD) }
4:启动jobGenerator,JobGenerator持有JobScheduler的引用:
因此在org.apache.spark.streaming.scheduler.JobGenerator#generateJobs产生job时候很容易通过jobScheduler.inputInfoTracker.getInfo(time)方法直接得到输入源的信息。
- Spark Streaming源码初探 (2)
- Spark Streaming源码初探 (1)
- Spark Streaming源码初探 (3)
- Spark Streaming初探
- Spark Streaming初探
- Spark streaming 初探
- Spark Streaming源码分析
- spark-streaming源码分析
- Spark Streaming源码解读
- Spark Streaming源码简介
- spark streaming源码解读
- spark streaming + kafka +python(编程)初探
- Spark源码走读——Spark Streaming
- spark streaming 2 ParallelCollectionRDD
- spark streaming 2 ReceiverSupervisorImpl
- spark streaming 2 BlockGenerator
- Spark Streaming 2:概述
- Spark Streaming源码学习总结(一)
- H5---面试题二
- HDU
- 剑指offer之字符串左旋右旋问题
- java中使用for循环去打印正方形,三角形,菱形等图形
- sql server 数据库 2008 r2 允许远程连接
- Spark Streaming源码初探 (2)
- XWiki initialization failed!
- 进程信息之getrusage系统调用
- PPT排版细节,写给大家看的设计书,完美总结
- javaScript中的函数定义
- 输入某年某月某日,计算出今天是今年的第几天
- CodeForces
- 为什么不能用返回值类型来判断方法是否重载呢?
- 前 、中、后缀表达式