[Spark Streaming] Dynamically Generating Jobs and Submitting Them for Execution
Source: Internet | Published by: 北京工业大学未来网络 | Editor: 程序博客网 | Time: 2024/06/05 05:39
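Preface
Spark Streaming jobs are generated dynamically by JobGenerator, once every batchDuration. Each batch submits one JobSet, because a single batch may contain several output operations and each of them produces its own job.
Overview of the flow:
- A timer periodically posts a job-generation request to eventLoop
- receiverTracker allocates blocks to the current batch
- The Jobs for the current batch are generated
- The Jobs are wrapped in a JobSet and submitted for execution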
Entry point
JobGenerator creates a timer when it is initialized:
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
Every batchDuration the timer posts a GenerateJobs(new Time(longTime)) message to eventLoop, and the event handler of eventLoop calls generateJobs(time):
case GenerateJobs(time) => generateJobs(time)
private def generateJobs(time: Time) {
  // Checkpoint all RDDs marked for checkpointing to ensure their lineages are
  // truncated periodically. Otherwise, we may run into stack overflows (SPARK-6847).
  ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
      PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
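The timer, event loop and Try/Success/Failure handling described above can be imitated without Spark. The following is only a minimal sketch under stated assumptions: the Loop class, its onTimerTick and processNext methods, the use of Long for batch times and of String for jobs are all hypothetical stand-ins, and the loop is drained by hand instead of by a dedicated thread.

```scala
import scala.collection.mutable
import scala.util.{Failure, Success, Try}

object JobGeneratorSketch {
  // Hypothetical stand-ins for the messages posted to the event loop.
  sealed trait Event
  final case class GenerateJobs(time: Long) extends Event
  final case class DoCheckpoint(time: Long) extends Event

  // Simplified loop: events are queued by post() and handled one at a time.
  class Loop {
    private val queue = mutable.Queue.empty[Event]

    def post(event: Event): Unit = queue.enqueue(event)

    // What one timer tick does: wrap the tick time in a GenerateJobs message.
    def onTimerTick(timeMs: Long): Unit = post(GenerateJobs(timeMs))

    // Handles the oldest queued event and returns a description of what happened.
    // `generate` plays the role of block allocation plus graph.generateJobs.
    def processNext(generate: Long => Seq[String]): String = queue.dequeue() match {
      case GenerateJobs(time) =>
        val outcome = Try(generate(time)) match {
          case Success(jobs) => s"submitted ${jobs.size} job(s) for $time"
          case Failure(e)    => s"error for $time: ${e.getMessage}"
        }
        // As in the excerpt, a checkpoint request follows on success and failure alike.
        post(DoCheckpoint(time))
        outcome
      case DoCheckpoint(time) => s"checkpoint for $time"
    }
  }
}
```

With this sketch, a tick for time 1000 followed by two processNext calls yields a "submitted" message and then a "checkpoint" message, even when the generate function throws.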
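Allocating blocks to the current batchTime
generateJobs first calls receiverTracker.allocateBlocksToBatch(time) to allocate the blocks that belong to the current batchTime. The call is ultimately delegated to the allocateBlocksToBatch method of receivedBlockTracker, the block manager of receiverTracker: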
def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
  if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
    val streamIdToBlocks = streamIds.map { streamId =>
      (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
    }.toMap
    val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
    if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
      timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
      lastAllocatedBatchTime = batchTime
    } else {
      logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
    }
  } else {
    logInfo(s"Possibly processed batch $batchTime needs to be processed again in WAL recovery")
  }
}

private def getReceivedBlockQueue(streamId: Int): ReceivedBlockQueue = {
  streamIdToUnallocatedBlockQueues.getOrElseUpdate(streamId, new ReceivedBlockQueue)
}
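As the code shows, the unallocated blocks of every streamId are drained from streamIdToUnallocatedBlockQueues. The entries in these queues are the block infos that the supervisor reports to receiverTracker after it has stored a block; see the post "ReceiverTracker 数据产生与存储" (ReceiverTracker: data generation and storage) for details.
Once the unallocated blockInfos of all streamIds have been collected, they are stored in timeToAllocatedBlocks: Map[Time, AllocatedBlocks] under the batch time and are read back later when the RDD is created. A batchTime that is not later than lastAllocatedBatchTime is not allocated again and is only logged.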
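The allocation step can be tried out in isolation. The sketch below is a simplified model, not the Spark class: Tracker and addBlock are hypothetical names, block infos are plain strings, batch times are Long values, and the write-ahead-log call (writeToLog) is left out entirely.

```scala
import scala.collection.mutable

object BlockAllocationSketch {
  // Hypothetical, simplified tracker: no write-ahead log, blocks are plain strings.
  class Tracker(streamIds: Seq[Int]) {
    private val unallocated = mutable.Map.empty[Int, mutable.Queue[String]]
    private val timeToAllocatedBlocks = mutable.Map.empty[Long, Map[Int, Seq[String]]]
    private var lastAllocatedBatchTime: Option[Long] = None

    private def queueFor(streamId: Int): mutable.Queue[String] =
      unallocated.getOrElseUpdate(streamId, mutable.Queue.empty[String])

    // Called when a receiver reports a stored block for a stream.
    def addBlock(streamId: Int, blockId: String): Unit = synchronized {
      queueFor(streamId) += blockId
    }

    // Drains every stream's queue into the batch. Returns false, and leaves the
    // queues untouched, when the batch time is not later than the last allocated one.
    def allocateBlocksToBatch(batchTime: Long): Boolean = synchronized {
      if (lastAllocatedBatchTime.forall(batchTime > _)) {
        val streamIdToBlocks = streamIds.map { id =>
          id -> queueFor(id).dequeueAll(_ => true).toList
        }.toMap
        timeToAllocatedBlocks.put(batchTime, streamIdToBlocks)
        lastAllocatedBatchTime = Some(batchTime)
        true
      } else {
        false
      }
    }

    // Looks up the blocks previously allocated to a batch.
    def getBlocksOfBatch(batchTime: Long): Map[Int, Seq[String]] = synchronized {
      timeToAllocatedBlocks.getOrElse(batchTime, Map.empty[Int, Seq[String]])
    }
  }
}
```

In this model a block reported after batch 1000 has been allocated stays in its queue and is picked up by the next, later batch, which mirrors the batchTime > lastAllocatedBatchTime check in the excerpt.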
Generating the Jobs for the current batchTime
The generateJobs method of DStreamGraph is called to generate the jobs for the current batchTime:
def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
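Each outputStream corresponds to one job (generateJob returns an Option, so an outputStream without an RDD for the batch contributes none). All outputStreams are traversed and a job is generated for each of them: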
// ForEachDStream
override def generateJob(time: Time): Option[Job] = {
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => createRDDWithLocalProperties(time, displayInnerRDDOps) {
        foreachFunc(rdd, time)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}
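The RDD for time is fetched first and then passed as an argument to foreachFunc. foreachFunc is supplied through the constructor; here is what it looks like for the print() output operation: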
def print(num: Int): Unit = ssc.withScope {
  def foreachFunc: (RDD[T], Time) => Unit = {
    (rdd: RDD[T], time: Time) => {
      val firstNum = rdd.take(num + 1)
      // scalastyle:off println
      println("-------------------------------------------")
      println(s"Time: $time")
      println("-------------------------------------------")
      firstNum.take(num).foreach(println)
      if (firstNum.length > num) println("...")
      println()
      // scalastyle:on println
    }
  }
  foreachRDD(context.sparkContext.clean(foreachFunc), displayInnerRDDOps = false)
}
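The foreachFunc constructed here is the function that is eventually executed with the rdd when the job runs: it calls take() on the rdd and prints the result, so the action is actually triggered inside this function. Now for how the rdd is obtained. Every DStream has a generatedRDDs: Map[Time, RDD[T]] field that stores the RDD for each time; if no RDD is stored for the requested time, it is computed through the compute() method. For ReceiverInputDStream, which needs a Receiver started on an executor to receive data: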
override def compute(validTime: Time): Option[RDD[T]] = {
  val blockRDD = {
    if (validTime < graph.startTime) {
      // If this is called for any time before the start time of the context,
      // then this returns an empty RDD. This may happen when recovering from a
      // driver failure without any write ahead log to recover pre-failure data.
      new BlockRDD[T](ssc.sc, Array.empty)
    } else {
      // Otherwise, ask the tracker for all the blocks that have been allocated to this stream
      // for this batch
      val receiverTracker = ssc.scheduler.receiverTracker
      val blockInfos = receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)

      // Register the input blocks information into InputInfoTracker
      val inputInfo = StreamInputInfo(id, blockInfos.flatMap(_.numRecords).sum)
      ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

      // Create the BlockRDD
      createBlockRDD(validTime, blockInfos)
    }
  }
  Some(blockRDD)
}
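The blocks of the batch are fetched through receiverTracker. As analyzed above, the previously unallocated blocks of every streamId were allocated to the batch and stored in timeToAllocatedBlocks: Map[Time, AllocatedBlocks]. Under the hood the blockInfos are read from this timeToAllocatedBlocks, and createBlockRDD(validTime, blockInfos) then creates the RDD from the block ids.
Finally, jobFunc is built from this RDD and foreachFunc, and a Job wrapping it is created and returned.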
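Two properties of this step are easy to miss: the RDD for a given time is computed once and then served from generatedRDDs, and building a Job runs nothing until run() is called. The sketch below illustrates both without Spark. All names in it (SketchStream, OutputOp, computeCalls) are hypothetical, the "RDD" is a plain Seq[Int], and times are Long values.

```scala
import scala.collection.mutable

object DeferredJobSketch {
  // Hypothetical stand-in for a DStream; the "RDD" is a plain Seq[Int].
  class SketchStream(computeBatch: Long => Option[Seq[Int]]) {
    private val generatedRDDs = mutable.HashMap.empty[Long, Seq[Int]]
    var computeCalls = 0 // exposed only so the caching effect can be observed

    // Returns the cached value for `time`, computing and caching it on a miss.
    def getOrCompute(time: Long): Option[Seq[Int]] =
      generatedRDDs.get(time).orElse {
        computeCalls += 1
        val computed = computeBatch(time)
        computed.foreach(rdd => generatedRDDs.put(time, rdd))
        computed
      }
  }

  // A job only stores the function; the work happens in run().
  final class Job(val time: Long, func: () => Unit) {
    def run(): Unit = func()
  }

  // Hypothetical output operation: a parent stream plus the function to apply.
  final class OutputOp(parent: SketchStream, foreachFunc: (Seq[Int], Long) => Unit) {
    def generateJob(time: Long): Option[Job] =
      parent.getOrCompute(time).map { rdd =>
        new Job(time, () => foreachFunc(rdd, time))
      }
  }

  // Output operations without data for the batch contribute no job.
  def generateJobs(ops: Seq[OutputOp], time: Long): Seq[Job] =
    ops.flatMap(_.generateJob(time))
}
```

When two output operations share one parent stream in this sketch, generating the jobs for a batch calls the compute function once, and the functions passed to OutputOp only see the data after each job's run() is invoked.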
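Wrapping the jobs in a JobSet and submitting it
Each outputStream corresponds to one Job, so a batch ends up with a sequence of jobs. A JobSet is created for that sequence and submitted through jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos)):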
jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
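The jobs are then run by jobExecutor, a thread pool whose size defaults to 1 and can be changed through spark.streaming.concurrentJobs. The setting determines how many jobs may run at the same time; with a value above 1, jobs from different batches can overlap.
The handler class JobHandler calls Job.run(), which executes the jobFunc built earlier.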
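The effect of the pool size can be observed with a plain java.util.concurrent thread pool. This is only an illustration under stated assumptions: maxObservedConcurrency is a hypothetical helper, each "job" merely sleeps for 20 ms, and the concurrentJobs parameter stands in for the configured value rather than being read from any Spark configuration.

```scala
import java.util.concurrent.{Executors, TimeUnit}

object JobExecutorSketch {
  // Runs `numJobs` dummy jobs on a fixed pool of `concurrentJobs` threads and
  // returns the highest number of jobs observed running at the same time.
  def maxObservedConcurrency(concurrentJobs: Int, numJobs: Int): Int = {
    val pool = Executors.newFixedThreadPool(concurrentJobs)
    val lock = new Object
    var running = 0
    var maxSeen = 0

    (1 to numJobs).foreach { _ =>
      pool.execute(new Runnable {
        override def run(): Unit = {
          lock.synchronized {
            running += 1
            maxSeen = math.max(maxSeen, running)
          }
          Thread.sleep(20) // stands in for the work done by a job
          lock.synchronized { running -= 1 }
        }
      })
    }

    // Accept no new jobs, then wait for the queued ones to finish.
    pool.shutdown()
    pool.awaitTermination(10, TimeUnit.SECONDS)
    lock.synchronized(maxSeen)
  }
}
```

With a pool of one thread the jobs run strictly one after another, so the helper returns 1 no matter how many jobs are submitted; a larger pool allows, but does not force, more jobs to overlap.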