Spark Learning 16: Spark Streaming Execution Flow (2)
Spark Streaming Execution Flow (1) described how SocketReceiver receives data and how BlockGenerator packages that data into blocks and stores them. This article describes how those blocks are turned into RDDs and how the resulting jobs are submitted for execution.
2. Creating Jobs
[Figure: a portion of the flow diagram from the previous article.]
During JobGenerator startup, an anonymous actor and a RecurringTimer object are created. The RecurringTimer periodically sends a GenerateJobs message to the anonymous actor; the actor handles the message by calling JobGenerator.generateJobs, which generates the jobs for the batch and submits them.
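To make the timing mechanism concrete, here is a minimal, self-contained sketch of that recurring-timer pattern. This is not Spark's actual RecurringTimer (whose implementation differs in detail); the callback stands in for sending GenerateJobs to the event actor.

// A self-contained sketch of the recurring-timer pattern used by JobGenerator
// (NOT Spark's RecurringTimer; names and details are illustrative).
class SimpleRecurringTimer(period: Long, callback: Long => Unit, name: String) {
  @volatile private var stopped = false

  private val thread = new Thread(new Runnable {
    def run(): Unit = {
      // Align the first firing to the next period boundary, so batch times
      // fall on multiples of the batch interval.
      var nextTime = (System.currentTimeMillis() / period + 1) * period
      try {
        while (!stopped) {
          val sleepMs = nextTime - System.currentTimeMillis()
          if (sleepMs > 0) Thread.sleep(sleepMs)
          callback(nextTime) // in JobGenerator: eventActor ! GenerateJobs(new Time(nextTime))
          nextTime += period
        }
      } catch {
        case _: InterruptedException => // stop() was called while sleeping
      }
    }
  }, name)
  thread.setDaemon(true)

  def start(): Unit = thread.start()
  def stop(): Unit = { stopped = true; thread.interrupt() }
}

// Usage: fire once per second, like a 1-second batch interval.
// new SimpleRecurringTimer(1000, t => println(s"GenerateJobs($t)"), "JobGenerator").start()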
The job generation and submission flow is as follows:
2.1. JobGenerator.generateJobs
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val receivedBlockInfos =
        jobScheduler.receiverTracker.getBlocksOfBatch(time).mapValues { _.toArray }
      jobScheduler.submitJobSet(JobSet(time, jobs, receivedBlockInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventActor ! DoCheckpoint(time)
}
(1) Call ReceiverTracker.allocateBlocksToBatch to allocate the received blocks to the current batch;
(2) Call DStreamGraph.generateJobs to create an array of Job objects from its output DStreams;
(3) Wrap the Job array in a JobSet (sketched below) and submit it via JobScheduler.submitJobSet.
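For context, JobSet is essentially a container pairing a batch time with that batch's jobs and received-block information. A simplified sketch, based on the Spark 1.x source (submission/processing-time bookkeeping omitted):

// Simplified from the Spark 1.x source; timing fields are omitted.
private[streaming] case class JobSet(
    time: Time,
    jobs: Seq[Job],
    receivedBlockInfo: Map[Int, Array[ReceivedBlockInfo]] = Map.empty) {

  private val incompleteJobs = new scala.collection.mutable.HashSet[Job]()
  jobs.zipWithIndex.foreach { case (job, i) => job.setId(i) } // "streaming job <time>.<i>"
  incompleteJobs ++= jobs

  def handleJobCompletion(job: Job): Unit = { incompleteJobs -= job }
  def hasCompleted: Boolean = incompleteJobs.isEmpty
}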
2.2. ReceiverTracker.allocateBlocksToBatch
def allocateBlocksToBatch(batchTime: Time): Unit = {
  if (receiverInputStreams.nonEmpty) {
    receivedBlockTracker.allocateBlocksToBatch(batchTime)
  }
}
The ReceivedBlockTracker object records the block information generated by each Receiver.
2.2.1. ReceivedBlockTracker.allocateBlocksToBatch
def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
  if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
    val streamIdToBlocks = streamIds.map { streamId =>
      (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
    }.toMap
    val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
    writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))
    timeToAllocatedBlocks(batchTime) = allocatedBlocks
    lastAllocatedBatchTime = batchTime
    allocatedBlocks
  } else {
    ...
  }
}
This drains each per-stream queue in streamIdToUnallocatedBlockQueues, wraps the block information in an AllocatedBlocks object, and puts it into the timeToAllocatedBlocks hash map keyed by batch time. The allocation event is also written to the write-ahead log (when enabled) so that it can be recovered after a driver failure.
2.3. DStreamGraph.generateJobs
def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap(outputStream => outputStream.generateJob(time))
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
This creates the jobs for every output DStream. An output DStream is one produced by calling an action on a DStream.
Every action that DStream provides ultimately creates a ForEachDStream, which holds a function object taking an RDD (and a batch time) as parameters.
Consequently, the call here ends up in ForEachDStream.generateJob.
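The most direct such action is foreachRDD. In the Spark 1.x source it is essentially the following (simplified):

// DStream.foreachRDD (simplified from Spark 1.x): wrap the user function in a
// ForEachDStream and register it as an output stream.
def foreachRDD(foreachFunc: (RDD[T], Time) => Unit) {
  new ForEachDStream(this, context.sparkContext.clean(foreachFunc)).register()
}

// register() adds the stream to the DStreamGraph's outputStreams, which is
// exactly the collection DStreamGraph.generateJobs iterates over.
private[streaming] def register(): DStream[T] = {
  ssc.graph.addOutputStream(this)
  this
}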
2.3.1. ForEachDStream.generateJob
override def generateJob(time: Time): Option[Job] = {
  parent.getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => {
        ssc.sparkContext.setCallSite(creationSite)
        foreachFunc(rdd, time)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}
(1) Call the parent's getOrCompute method (implemented in DStream itself; it is private, so subclasses do not override it). getOrCompute in turn calls compute, which is abstract in DStream and implemented by its subclasses; in the example from Spark Streaming Execution Flow (1), the call resolves to ReceiverInputDStream.compute (see the sketch after this list).
As the flow diagram shows, compute creates a BlockRDD whose contents are the BlockIds of the blocks produced by the Receiver. getOrCompute stores the RDD in the DStream's generatedRDDs hash map and finally returns it.
(2) Create a Job object whose function parameter closes over the newly created RDD.
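A simplified sketch of ReceiverInputDStream.compute, based on the Spark 1.x source (the branch that builds a write-ahead-log-backed RDD is omitted):

// Simplified from Spark 1.x; the WAL-backed BlockRDD branch is omitted.
override def compute(validTime: Time): Option[RDD[T]] = {
  val blockRDD =
    if (validTime < graph.startTime) {
      // Before the context's start time (e.g. during recovery): an empty RDD.
      new BlockRDD[T](ssc.sc, Array.empty[BlockId])
    } else {
      // The blocks that ReceivedBlockTracker.allocateBlocksToBatch assigned
      // to this stream for this batch time.
      val blockInfos =
        ssc.scheduler.receiverTracker.getBlocksOfBatch(validTime).getOrElse(id, Seq.empty)
      val blockIds = blockInfos.map(_.blockId.asInstanceOf[BlockId]).toArray
      new BlockRDD[T](ssc.sc, blockIds)
    }
  Some(blockRDD)
}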
2.4. JobScheduler.submitJobSet
def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    jobSets.put(jobSet.time, jobSet)
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
    logInfo("Added jobs for time " + jobSet.time)
  }
}
jobExecutor is a thread pool. Each Job in the JobSet is wrapped in a JobHandler object and handed to the thread pool for execution.
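In the Spark 1.x JobScheduler the pool is created roughly like this; with the default spark.streaming.concurrentJobs = 1, the jobs of successive batches execute strictly one at a time:

// Simplified from the Spark 1.x JobScheduler; requires java.util.concurrent.Executors.
private val numConcurrentJobs = ssc.conf.getInt("spark.streaming.concurrentJobs", 1)
private val jobExecutor = Executors.newFixedThreadPool(numConcurrentJobs)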
2.4.1. JobHandler
private class JobHandler(job: Job) extends Runnable {
  def run() {
    eventActor ! JobStarted(job)
    // Disable checks for existing output directories in jobs launched by the streaming scheduler,
    // since we may need to write output to an existing directory during checkpoint recovery;
    // see SPARK-4835 for more details.
    PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
      job.run()
    }
    eventActor ! JobCompleted(job)
  }
}
The handler calls the Job object's run method.
2.4.2. Job
private[streaming]
class Job(val time: Time, func: () => _) {
  var id: String = _
  var result: Try[_] = null

  def run() {
    result = Try(func())
  }

  def setId(number: Int) {
    id = "streaming job " + time + "." + number
  }

  override def toString = id
}
The run method executes the function object created in ForEachDStream.generateJob; it ultimately invokes ForEachDStream's foreachFunc parameter, a function object that takes an RDD (and a batch time) as parameters.
In the example from Spark Streaming Execution Flow (1), print was called, which registers the following function:
def print(num: Int) {
  def foreachFunc = (rdd: RDD[T], time: Time) => {
    val firstNum = rdd.take(num + 1)
    println("-------------------------------------------")
    println("Time: " + time)
    println("-------------------------------------------")
    firstNum.take(num).foreach(println)
    if (firstNum.size > num) println("...")
    println()
  }
  new ForEachDStream(this, context.sparkContext.clean(foreachFunc)).register()
}
That is, the foreachFunc shown above. The RDD.take call starts actual RDD execution, including stage splitting, task creation, and task submission.
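To tie the whole flow together, here is a minimal driver program in the spirit of the example from part (1); the host and port are placeholders. Every batch interval it exercises exactly the path described above: GenerateJobs → DStreamGraph.generateJobs → submitJobSet → JobHandler → Job.run.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingFlowDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingFlowDemo")
    // 1-second batches: the RecurringTimer fires GenerateJobs every 1000 ms.
    val ssc = new StreamingContext(conf, Seconds(1))

    // SocketReceiver + BlockGenerator store incoming lines as blocks (part 1).
    val lines = ssc.socketTextStream("localhost", 9999)

    // print() registers a ForEachDStream; its foreachFunc calls rdd.take(...),
    // which triggers stage splitting, task creation, and task submission.
    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}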