Lesson 6: Spark Streaming Source Code Walkthrough: Dynamic Job Generation and Deeper Reflections
This lesson covers:
a. Deeper reflections on Spark Streaming Job generation
b. Source code analysis of Spark Streaming Job generation
A viewpoint on streams: when we build big-data applications that are not streaming jobs, we usually launch them through a scheduler/timer (say once an hour, or once a day) rather than submitting them by hand, often driven by a JavaEE layer. Viewed from a distance, such scheduled tasks also resemble stream processing, only with a much larger batch interval. In that sense, all data processing can be seen as stream processing in a different guise; going further, all processing will eventually be unified under stream processing.
A Job in Spark Streaming is analogous to the Runnable interface a Java thread executes: it is an encapsulation of business logic. It is not the same concept as a Job in Spark Core. A Spark Core Job is an actually running unit of work; when we talk about a Spark Core Job, we are talking about some concrete thing being done. A Spark Streaming Job, by contrast, wraps Spark Core jobs and sits at a higher level of abstraction.
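To make the analogy concrete, here is a minimal sketch (DemoJob is a hypothetical name, not Spark source): the business logic is captured in a function, queued, and runs only when something later invokes it.

// Hypothetical illustration of "Job as a Runnable-like wrapper": defining the
// function does not execute it; only an explicit run() does.
class DemoJob(val time: Long, func: () => Unit) {
  def run(): Unit = func() // nothing happens until run() is called
}

object DemoJobApp {
  def main(args: Array[String]): Unit = {
    val queue = new scala.collection.mutable.Queue[DemoJob]()
    // Enqueue the job; the println inside does NOT execute yet.
    queue.enqueue(new DemoJob(0L, () => println("business logic runs now")))
    // A scheduler would later dequeue it and run it, e.g. on a thread pool.
    queue.dequeue().run()
  }
}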
JobGenerator: generates jobs from the dependency relationships among DStreams, i.e. the DStreamGraph. It produces many Jobs over time, because it keeps generating them on every batchDuration.
DStreams come in three types:
1. Input DStreams: InputDStreams built from different data sources, such as socket, Kafka, or Flume.
2. Output DStreams: output operations are logical-level actions. They are "logical" because they are actions proposed at the framework level; underneath they are still translated into physical-level actions, i.e. RDD actions.
3. Transformation DStreams: transformations are state changes, and a state change here means the process of applying business logic.
A DStream is produced in one of two ways: (1) directly from a data source, or (2) by applying a transformation to another DStream, yielding a new DStream. The sketch below shows all three DStream types and both creation paths.
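A minimal sketch (the socket host/port and the local two-core master are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DStreamTypesDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DStreamTypesDemo")
    val ssc = new StreamingContext(conf, Seconds(10)) // batchDuration = 10s

    // 1. Input DStream: built directly from a data source (socket here).
    val lines = ssc.socketTextStream("localhost", 9999)

    // 2. Transformation DStreams: each transformation yields a new DStream.
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

    // 3. Output DStream: print() is a logical-level action; the physical
    //    RDD action is triggered later by the generated Job.
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}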
Beyond the jobs generated on a fixed schedule, Spark Streaming also produces jobs in other ways, for example through aggregations or stateful operations; these are not tied directly to a single batchDuration but process data across many batchDurations. Even so, JobGenerator remains the most fundamental and central piece. To support windowing and the like, JobGenerator, in the words of its own doc comment, 'generates jobs from DStreams as well as drives checkpointing and cleaning up DStream metadata.' The sketch below shows one such multi-batch operation.
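A hedged sketch, reusing the lines DStream from the sketch above (the window and slide durations are illustrative; both must be multiples of the 10-second batchDuration):

// Each window job processes the last 30 seconds of data (three batches)
// and is emitted every 20 seconds.
val windowedCounts = lines
  .flatMap(_.split(" "))
  .map((_, 1))
  .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(20))
windowedCounts.print()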
A Spark Streaming application is triggered by time, whereas Apache Storm is triggered by events (one record after another).
The entry point of a Streaming program specifies the batchDuration:
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(10));
Durations.seconds(10) is the batch duration: with this code, JobGenerator produces a Job every 10 seconds. That Job is logical-level, meaning the Job exists and describes what should be done, but nothing has been done yet. Who does it? The physical-level RDD action underneath triggers it. Spark Streaming builds its Jobs from the dependency relationships among DStreams, which is why they are logical-level; the physical level underneath is built from RDD dependencies. DStream action-level operations are also logical-level: based on your action, Spark Streaming produces a logical-level Job, but does not run it, much like the Runnable a thread needs when it actually executes. Precisely because the Job is logical-level and no physical-level Job has been generated yet, there is still room for all kinds of scheduling and optimization. The logical-level DStream dependencies are translated into physical-level RDD dependencies, and the final operation must be an RDD action. How do we both perform this translation and keep that final RDD action from triggering and executing a job immediately? We wrap the translated result (the RDDs) the way a Runnable wraps work: the RDD dependencies are enclosed in a method, and since a method that is merely defined has not been invoked, the final RDD action does not execute right away; instead the Job is placed in a queue to be managed.
We need both the translation and the management. So we translate the DStream dependencies into RDD dependencies, and the final DStream action-level operation into an RDD action-level operation. The translated content is one body of code placed inside a function; because the function is only defined, not invoked, the RDD action inside does not trigger a job. When the JobScheduler decides to schedule this job, it takes a thread from its thread pool and executes the method we just wrapped.
In fact, if the action triggered the Job right at translation time, there would be no queue, no metadata, and so on: the whole Spark Streaming job submission would be unmanaged. A minimal illustration of the deferral trick follows.
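A self-contained sketch (plain Spark Core, not streaming; the names are illustrative): the RDD action sits inside a function body, so defining the function launches nothing.

import org.apache.spark.{SparkConf, SparkContext}

object DeferredActionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("DeferredActionDemo"))
    val rdd = sc.parallelize(1 to 100)

    // Definition only: the action inside does NOT trigger a Spark job yet.
    val jobFunc: () => Unit = () => rdd.foreach(_ => ())

    // ... the function could now sit in a queue, under a scheduler's control ...

    jobFunc() // only at invocation does the physical-level action run
    sc.stop()
  }
}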
Next, let's look at this from the source code.
The three cores of dynamic job generation in Spark Streaming:
a. JobGenerator: responsible for generating Jobs
b. JobScheduler: responsible for scheduling Jobs
c. ReceiverTracker: obtains metadata / records where the data came from
Both generation and scheduling need metadata. JobGenerator and ReceiverTracker are members of JobScheduler.
JobScheduler.start():
receiverTracker = new ReceiverTracker(ssc)
receiverTracker.start()
jobGenerator.start()
jobGenerator.start(): every streaming program invokes this at startup. Checkpointing is covered in a later lesson.
/** Start generation of jobs */
def start(): Unit = synchronized {
  if (eventLoop != null) return // generator has already been started

  // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
  // See SPARK-10125
  checkpointWriter

  eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
    override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = {
      jobScheduler.reportError("Error in job generator", e)
    }
  }
  eventLoop.start()

  if (ssc.isCheckpointPresent) {
    restart()
  } else {
    startFirstTime()
  }
}
Because jobs are generated in a continuous loop, an EventLoop is needed; onReceive is overridden here using an anonymous inner class.
EventLoop.scala:
private val eventThread = new Thread(name) {
  setDaemon(true)

  override def run(): Unit = {
    try {
      while (!stopped.get) {
        val event = eventQueue.take()
        try {
          onReceive(event)
        } catch {
          case NonFatal(e) => {
            try {
              onError(e)
            } catch {
              case NonFatal(e) => logError("Unexpected error in " + name, e)
            }
          }
        }
      }
    } catch {
      case ie: InterruptedException => // exit even if eventQueue is not empty
      case NonFatal(e) => logError("Unexpected error in " + name, e)
    }
  }
}
EventLoop.start():
def start(): Unit = {
  if (stopped.get) {
    throw new IllegalStateException(name + " has already been stopped")
  }
  // Call onStart before starting the event thread to make sure it happens before onReceive
  onStart()
  eventThread.start()
}
EventLoop holds a daemon thread internally. EventLoop.start() calls eventThread.start(), i.e. the thread's own start, after which the thread loops continuously, taking events from eventQueue and calling onReceive(event) on each.
onReceive(event) is an abstract method. Note that its implementation must not block, as the comment warns:
/**
 * Invoked in the event thread when polling events from the event queue.
 *
 * Note: Should avoid calling blocking actions in `onReceive`, or the event thread will be blocked
 * and cannot process events in time. If you want to call some blocking actions, run them in
 * another thread.
 */
protected def onReceive(event: E): Unit
One principle: a message loop should generally not process time-consuming business logic itself; it should route the work to other threads, as in the sketch below.
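A hedged sketch of the principle (not Spark source; the pool size and event type are arbitrary): the message loop only routes, and the slow work runs on a separate thread pool.

import java.util.concurrent.Executors

object RoutingLoopSketch {
  // The event thread calls onReceiveSketch; it must return quickly.
  private val workerPool = Executors.newFixedThreadPool(4)

  def onReceiveSketch(event: String): Unit = {
    workerPool.execute(new Runnable {
      override def run(): Unit = {
        Thread.sleep(1000) // time-consuming business logic, off the event thread
        println(s"handled $event")
      }
    })
  }
}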
The implementation of onReceive(event) lives in JobGenerator.scala:
eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
  override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

  override protected def onError(e: Throwable): Unit = {
    jobScheduler.reportError("Error in job generator", e)
  }
}
which routes to:
/** Processes all events */
private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}
Let's look at the generateJobs(time) method.
/** Generate jobs and perform checkpoint for the given `time`. */
private def generateJobs(time: Time) {
  // Set the SparkEnv in this thread, so that job generation code can access the environment
  // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
  // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
  SparkEnv.set(ssc.env)
  Try {
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
  }
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}
To generate a job, a fixed batch of data is needed (we won't examine that in detail here); the job is then generated against a fixed time.
graph.generateJobs(time) is the key point. outputStreams are the last DStreams in the whole graph; starting from the operation on the last DStream, jobs are produced based on time (remarkably similar to how an RDD produces a job!).
def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}
outputStream.generateJob(time), another key point!
/**
 * Generate a SparkStreaming job for the given time. This is an internal method that
 * should not be called directly. This default implementation creates a job
 * that materializes the corresponding RDD. Subclasses of DStream may override this
 * to generate their own jobs.
 */
private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) => {
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    }
    case None => None
  }
}
The key point is jobFunc: so that the generated Job can be placed into a queue, the Job body is wrapped in a function, and a function that is only defined certainly does not execute. emptyFunc does nothing. context.sparkContext.runJob(rdd, emptyFunc) walks the RDD dependencies underneath and would trigger real scheduling, but because it is enclosed in jobFunc here, it certainly will not execute yet.
/**
 * Run a job on all partitions in an RDD and return the results in an array.
 */
def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
  runJob(rdd, func, 0 until rdd.partitions.length)
}
DStream is logical-level, RDD is physical-level: 'This default implementation creates a job that materializes the corresponding RDD.'
Some(new Job(time, jobFunc)) produces the Job. This Job belongs to the spark.streaming.scheduler layer, and 'It may contain multiple Spark jobs':
/**
 * Class representing a Spark computation. It may contain multiple Spark jobs.
 */
private[streaming]
class Job(val time: Time, func: () => _) {
  private var _id: String = _
  private var _outputOpId: Int = _
  private var isSet = false
  private var _result: Try[_] = null
  private var _callSite: CallSite = null
  private var _startTime: Option[Long] = None
  private var _endTime: Option[Long] = None
Now look at DStream.getOrCompute(time): based on time it generates the RDD (the last RDD in the DAG), keyed by time with the RDD as the value:
/**
 * Get the RDD corresponding to the given time; either retrieve it from cache
 * or compute-and-cache it.
 */
private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
  // If RDD was already generated, then retrieve it from HashMap,
  // or else compute the RDD
  generatedRDDs.get(time).orElse {
    // Compute the RDD if time is valid (e.g. correct time in a sliding window)
    // of RDD generation, else generate nothing.
    if (isTimeValid(time)) {
      val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
        // Disable checks for existing output directories in jobs launched by the streaming
        // scheduler, since we may need to write output to an existing directory during checkpoint
        // recovery; see SPARK-4835 for more details. We need to have this call here because
        // compute() might cause Spark jobs to be launched.
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          compute(time)
        }
      }

      rddOption.foreach { case newRDD =>
        // Register the generated RDD for caching and checkpointing
        if (storageLevel != StorageLevel.NONE) {
          newRDD.persist(storageLevel)
          logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
        }
        if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
          newRDD.checkpoint()
          logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
        }
        generatedRDDs.put(time, newRDD)
      }
      rddOption
    } else {
      None
    }
  }
}
Here, generatedRDDs is a data structure:
// RDDs generated, marked as private[streaming] so that testsuites can access it
@transient
private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
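The pattern is plain memoization keyed by batch time. A simplified sketch of the idea (a Seq stands in for the generated RDD; names are illustrative):

import scala.collection.mutable

object GetOrComputeSketch {
  // Simplified stand-in for generatedRDDs: HashMap[Time, RDD[T]].
  private val generated = new mutable.HashMap[Long, Seq[Int]]()

  def getOrCompute(time: Long): Seq[Int] =
    generated.getOrElseUpdate(time, {
      println(s"computing for time $time") // runs only on a cache miss
      Seq(1, 2, 3) // stand-in for the RDD generated by compute(time)
    })

  def main(args: Array[String]): Unit = {
    getOrCompute(10L) // computes and caches
    getOrCompute(10L) // served from the HashMap
  }
}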
To summarize:
JobGenerator.generateJobs(time: Time) => graph.generateJobs(time) => outputStream.generateJob(time)
Back in the JobGenerator.start() method, it uses the checkpoint to decide whether this is the first start.
if (ssc.isCheckpointPresent) {
  restart()
} else {
  startFirstTime()
}
Let's look at JobGenerator.startFirstTime():
/** Starts the generator for the first time */
private def startFirstTime() {
  val startTime = new Time(timer.getStartTime())
  graph.start(startTime - graph.batchDuration)
  timer.start(startTime.milliseconds)
  logInfo("Started JobGenerator at " + startTime)
}
On first startup, graph.start tells the DStreamGraph the start time of the first batch; the important work happens in timer.start.
The timer lives inside JobGenerator and is only concerned with time; the clock it uses is just a clock.
Note the timer here! It is constructed with an anonymous function.
private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")
val clock = {
  val clockClass = ssc.sc.conf.get(
    "spark.streaming.clock", "org.apache.spark.util.SystemClock")
  try {
    Utils.classForName(clockClass).newInstance().asInstanceOf[Clock]
  } catch {
    case e: ClassNotFoundException if clockClass.startsWith("org.apache.spark.streaming") =>
      val newClockClass = clockClass.replace("org.apache.spark.streaming", "org.apache.spark")
      Utils.classForName(newClockClass).newInstance().asInstanceOf[Clock]
  }
}
RecurringTimer starts yet another daemon thread, which loops continuously. Pay attention to the callback here: to see where the callback comes from, look at how the timer is instantiated; it is in fact instantiated in JobGenerator with an anonymous function, very concise:
private[streaming]
class RecurringTimer(clock: Clock, period: Long, callback: (Long) => Unit, name: String)
  extends Logging {

  private val thread = new Thread("RecurringTimer - " + name) {
    setDaemon(true)
    override def run() { loop }
  }
In JobGenerator, RecurringTimer is instantiated with the anonymous function shown in the timer definition above.
Driven by time, this anonymous function keeps posting GenerateJobs(new Time(longTime)) messages to "JobGenerator", and the event handling in processEvent (shown earlier) indeed handles GenerateJobs(time) by calling generateJobs(time).
At this point everything is connected: jobs are generated based on batch time.
Looking again at the generateJobs method quoted in full earlier, JobGenerator.generateJobs() produces a Job in these steps (see the condensed sketch after this list):
Step 1: obtain the data for the current time window;
Step 2: generate the Job (the Job here is only an encapsulation of business logic; the dependencies among RDDs constitute the Job);
Step 3: obtain the StreamId input information corresponding to the generated Jobs;
Step 4: wrap everything into a JobSet and hand it to JobScheduler for scheduling;
Step 5: post a message to perform the checkpoint operation.
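The five steps map directly onto the body of generateJobs quoted above; here is a condensed, annotated excerpt (not compilable on its own, since it elides the enclosing class):

Try {
  jobScheduler.receiverTracker.allocateBlocksToBatch(time)  // Step 1: data (metadata) for this batch
  graph.generateJobs(time)                                  // Step 2: logical Jobs from the DStream graph
} match {
  case Success(jobs) =>
    val infos = jobScheduler.inputInfoTracker.getInfo(time) // Step 3: per-stream input info
    jobScheduler.submitJobSet(JobSet(time, jobs, infos))    // Step 4: JobSet handed to JobScheduler
  case Failure(e) =>
    jobScheduler.reportError("Error generating jobs for time " + time, e)
}
eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false)) // Step 5: checkpoint message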
Notes:
What receiverTracker receives is the metadata of the data, not the data itself; it then calls allocateBlocksToBatch(time) on it.
graph.generateJobs(time) obtains the RDD DAG dependencies; traversing from back to front, by the end of the traversal the RDD DAG dependencies have been produced.
The Job is the business logic of the code; the dependencies among RDDs are likewise wrapped into a function, in fact the last function, derived from back to front!
If the Job is built successfully, the metadata information belonging to it is obtained.
If the Job is built successfully, a JobSet is produced from the time, the batchDuration's worth of data to process, and the wrapped business logic; it contains both the data and the business logic.
JobSet:
/** Class representing a set of Jobs
  * belong to the same batch.
  */
private[streaming]
case class JobSet(
    time: Time,
    jobs: Seq[Job],
    streamIdToInputInfo: Map[Int, StreamInputInfo] = Map.empty) {
jobScheduler.submitJobSet():
def submitJobSet(jobSet: JobSet) {
  if (jobSet.jobs.isEmpty) {
    logInfo("No jobs added for time " + jobSet.time)
  } else {
    listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
    jobSets.put(jobSet.time, jobSet)
    jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
    logInfo("Added jobs for time " + jobSet.time)
  }
}
JobHandler(job) is simply a Runnable. This again shows that the job is our business logic: it represents the dependencies among RDDs, a higher-level abstraction over RDD operations provided by the Spark Streaming framework. Because it is abstract rather than physical-level, it does not execute immediately. A simplified sketch of the idea follows.
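A simplified sketch (not the full Spark source; the real JobHandler also does bookkeeping such as posting batch events):

// Hypothetical, stripped-down JobHandler: a Runnable whose run() finally
// invokes the wrapped jobFunc, executing the physical-level RDD action.
private class JobHandlerSketch(job: Job) extends Runnable {
  override def run(): Unit = {
    job.run() // Job.run() calls the func captured at generation time
  }
}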
This sharing draws on the course '源码版本定制发行班' (source-code customization release class) by Wang Jialin (王家林); many thanks to teacher Wang!
Everyone is welcome to exchange technical knowledge! Let's learn and progress together!