Lesson 6: Spark Streaming Source Code Analysis - Dynamic Job Generation and Deeper Thoughts

This lecture covers:
a. Deeper thoughts on Spark Streaming Job generation
b. Source code analysis of Spark Streaming Job generation

A perspective on streams: when building big data applications, if the workload is not a streaming job we usually launch it with a scheduler/timer (say once an hour or once a day) rather than submitting it by hand, often driven from a JavaEE layer. Viewed from a distance, such scheduled tasks are themselves a kind of stream processing, just with a much larger batch interval. In that sense all data processing can be seen as stream processing in different guises; going one step further, all processing will eventually be unified under stream processing.

A Job in Spark Streaming is like the Runnable that a Java thread executes: it is an encapsulation of the business logic and is not the same concept as a Job in Spark Core. A Spark Core Job is an actual running job; when we talk about a Spark Core Job we are talking about one concrete piece of work being done, whereas a Spark Streaming Job wraps Spark Core jobs and is a higher-level abstraction.

JobGenerator: generates jobs based on the dependencies among DStreams, i.e. the DStreamGraph. It produces many jobs, because it keeps generating them once per batchDuration.
DStreams fall into three types:
1. Input DStreams: an InputDStream can be built on different data sources, for example socket, Kafka, Flume, etc.
2. Output DStreams: an output operation is a logical-level action. It is called logical because it is an action proposed at the framework level; underneath it is still translated into a physical-level action, i.e. an RDD action.
3. Transformation DStreams: a transformation is a state transition, i.e. the process of applying the business logic.
A DStream is produced in two ways: (1) directly from a data source, or (2) by applying a transformation to another DStream, producing a new DStream.
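
A minimal word-count sketch (hypothetical app, assuming a socket source on localhost:9999, not taken from the course code) that shows all three roles at once: socketTextStream builds the input DStream, flatMap/map/reduceByKey are transformation DStreams, and print() is the output operation, i.e. the logical-level action:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))          // batchDuration = 10 seconds

    val lines = ssc.socketTextStream("localhost", 9999)        // input DStream
    val counts = lines.flatMap(_.split(" "))                   // transformation DStreams
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()                                             // output operation (logical-level action)

    ssc.start()
    ssc.awaitTermination()
  }
}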

Besides the jobs generated on the fixed schedule, Spark Streaming also produces jobs in other ways, for example aggregation operations or state-based operations; these are not driven by a single batchDuration but operate over many batchDurations. Even so, JobGenerator is the most basic and most central piece. To support things like window operations, JobGenerator, as its class comment puts it, 'generates jobs from DStreams as well as drives checkpointing and cleaning up DStream metadata.'

A Spark Streaming application is triggered by time, whereas Apache Storm is triggered by events (one record after another).

The entry point of a Streaming program specifies the batchDuration:
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(10));
Durations.seconds(10) is the batch duration. The code above means that every 10 seconds JobGenerator produces a Job, and this Job is logical-level: the Job exists and knows exactly what to do, but it has not done it yet. Who actually does it? The physical-level RDD action underneath triggers it. Spark Streaming builds its Job from the dependencies among DStreams, which is why the Job it builds is logical-level, while the physical level underneath is based on the dependencies among RDDs. A DStream action is also a logical-level operation: Spark Streaming produces a logical-level Job from your action but does not run it, just like the Runnable a thread needs for the code it will eventually execute. Precisely because the Job is logical-level and no physical-level job has been generated yet, there is still an opportunity to schedule and optimize it in various ways.

The logical-level DStream dependencies are translated into physical-level RDD dependencies, and the last operation is necessarily an RDD action. How do we perform this translation without letting that final RDD action immediately trigger and execute a job? We wrap the translated result (the RDDs) in something Runnable-like, that is, we encapsulate the RDD dependencies inside a method; since the code sits in a method that has not been called, the final RDD action is not executed immediately, and the Job is instead put into a queue to be managed.

To both perform the translation and keep the result under management, the DStream dependencies are translated into RDD dependencies and the final DStream action into an RDD action, and the translated content is placed as one block inside a function body. Because that is only a definition and has not been executed, the RDD action inside does not trigger a job. When the JobScheduler later decides to schedule this job, it takes a thread from its thread pool and executes the method we just wrapped.
In fact, if the action triggered the job right at translation time, there would be no queue, no metadata and so on, and the submission of Spark Streaming jobs would be completely unmanaged.
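
A minimal sketch of this "translate now, run later" idea (hypothetical names, not the Spark Streaming source): wrapping the final RDD action inside a plain function defers the actual Spark job until the function is invoked, which is essentially what a Streaming Job's jobFunc does.

import org.apache.spark.{SparkConf, SparkContext}

object DeferredJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DeferredJobSketch").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100)

    // "Translation" step: wrap the action in a function. Nothing runs here --
    // this mirrors how a Streaming Job wraps the final RDD action in jobFunc.
    val jobFunc: () => Long = () => rdd.map(_ * 2).count()

    // The "scheduler" decides when to run it; only this call triggers a Spark job.
    val result = jobFunc()
    println(s"count = $result")

    sc.stop()
  }
}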

Now let's look at it from the source code.
The three cores of dynamic job generation in Spark Streaming:
a. JobGenerator: responsible for generating jobs
b. JobScheduler: responsible for scheduling jobs
c. ReceiverTracker: obtains the metadata, i.e. records where the data comes from
Both generating and scheduling jobs need this metadata. JobGenerator and ReceiverTracker are members of JobScheduler.

JobScheduler.start():

    receiverTracker = new ReceiverTracker(ssc)
    receiverTracker.start()
    jobGenerator.start()

jobGenerator.start(): every streaming program calls it when it starts. Checkpointing will be covered later.

  /** Start generation of jobs */
  def start(): Unit = synchronized {
    if (eventLoop != null) return // generator has already been started

    // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
    // See SPARK-10125
    checkpointWriter

    eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }
    eventLoop.start()

    if (ssc.isCheckpointPresent) {
      restart()
    } else {
      startFirstTime()
    }
  }

Because jobs are generated in an endless loop, an EventLoop is needed; onReceive is overridden here via an anonymous inner class.

EventLoop.scala:

  private val eventThread = new Thread(name) {
    setDaemon(true)

    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = eventQueue.take()
          try {
            onReceive(event)
          } catch {
            case NonFatal(e) => {
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
            }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }
  }

EventLoop.start():

  def start(): Unit = {
    if (stopped.get) {
      throw new IllegalStateException(name + " has already been stopped")
    }
    // Call onStart before starting the event thread to make sure it happens before onReceive
    onStart()
    eventThread.start()
  }

EventLoop has a background thread inside; once started, it keeps looping, taking an event from eventQueue and calling onReceive(event).
EventLoop.start() calls eventThread.start(), i.e. the thread's start(), which is what drives that loop.
onReceive(event) itself is an abstract method, and note that it must not block inside, as the following comment warns:

  /**
   * Invoked in the event thread when polling events from the event queue.
   *
   * Note: Should avoid calling blocking actions in `onReceive`, or the event thread will be blocked
   * and cannot process events in time. If you want to call some blocking actions, run them in
   * another thread.
   */
  protected def onReceive(event: E): Unit

One principle: a message loop should generally not handle time-consuming business logic itself but route it to other threads.
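
A minimal sketch of that principle (hypothetical, not the Spark source): the event thread only drains a blocking queue and hands any heavy work to a worker pool, so the loop itself never blocks.

import java.util.concurrent.{Executors, LinkedBlockingQueue}

object EventLoopSketch {
  sealed trait Event
  case class HeavyWork(payload: String) extends Event

  private val queue = new LinkedBlockingQueue[Event]()
  private val workers = Executors.newFixedThreadPool(4)   // heavy work runs here, not on the event thread

  private val eventThread = new Thread("event-loop") {
    setDaemon(true)
    override def run(): Unit = {
      while (true) {
        queue.take() match {
          case HeavyWork(payload) =>
            // Route the expensive part to the worker pool and return to the loop immediately.
            workers.execute(new Runnable {
              override def run(): Unit = println(s"processing $payload on ${Thread.currentThread().getName}")
            })
        }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    eventThread.start()
    (1 to 3).foreach(i => queue.put(HeavyWork(s"event-$i")))
    Thread.sleep(1000)
    workers.shutdown()
  }
}
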
The implementation of onReceive(event) lives in JobGenerator.scala:

    eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }

which routes the event to processEvent:

  /** Processes all events */
  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time, clearCheckpointDataLater) =>
        doCheckpoint(time, clearCheckpointDataLater)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }

Let's look at the generateJobs(time) method.

  /** Generate jobs and perform checkpoint for the given `time`.  */
  private def generateJobs(time: Time) {
    // Set the SparkEnv in this thread, so that job generation code can access the environment
    // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
    // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
    SparkEnv.set(ssc.env)
    Try {
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
    }
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
  }

Because a job is to be generated, the data must be fixed first (we won't look at that in detail here); the job is then generated for the fixed time.
graph.generateJobs(time) is the key point: outputStreams are the last DStreams of the whole DStream lineage, and the job is produced for the given time from the operations on those last DStreams (very similar to how an RDD produces a job!).

  def generateJobs(time: Time): Seq[Job] = {
    logDebug("Generating jobs for time " + time)
    val jobs = this.synchronized {
      outputStreams.flatMap { outputStream =>
        val jobOption = outputStream.generateJob(time)
        jobOption.foreach(_.setCallSite(outputStream.creationSite))
        jobOption
      }
    }
    logDebug("Generated " + jobs.length + " jobs for time " + time)
    jobs
  }

outputStream.generateJob(time) is the key point!

  /**
   * Generate a SparkStreaming job for the given time. This is an internal method that
   * should not be called directly. This default implementation creates a job
   * that materializes the corresponding RDD. Subclasses of DStream may override this
   * to generate their own jobs.
   */
  private[streaming] def generateJob(time: Time): Option[Job] = {
    getOrCompute(time) match {
      case Some(rdd) => {
        val jobFunc = () => {
          val emptyFunc = { (iterator: Iterator[T]) => {} }
          context.sparkContext.runJob(rdd, emptyFunc)
        }
        Some(new Job(time, jobFunc))
      }
      case None => None
    }
  }

The key is jobFunc: to put the generated Job into a queue, the Job itself is wrapped inside a function, and because it is only a function it certainly does not execute. emptyFunc does nothing at all.
context.sparkContext.runJob(rdd, emptyFunc) would walk the RDD dependency chain and trigger real scheduling, but since it is wrapped inside jobFunc here it is definitely not executed yet:

  /**
   * Run a job on all partitions in an RDD and return the results in an array.
   */
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }

A DStream is logical-level and an RDD is physical-level: 'This default implementation creates a job that materializes the corresponding RDD.'
Some(new Job(time, jobFunc)) produces the Job. This Job belongs to the spark.streaming.scheduler layer, and 'It may contain multiple Spark jobs':

/**
 * Class representing a Spark computation. It may contain multiple Spark jobs.
 */
private[streaming]
class Job(val time: Time, func: () => _) {
  private var _id: String = _
  private var _outputOpId: Int = _
  private var isSet = false
  private var _result: Try[_] = null
  private var _callSite: CallSite = null
  private var _startTime: Option[Long] = None
  private var _endTime: Option[Long] = None

Now look at DStream.getOrCompute(time). It generates the RDD (the last RDD in the lineage) for the given time, keyed by time with the RDD as the value:

  /**
   * Get the RDD corresponding to the given time; either retrieve it from cache
   * or compute-and-cache it.
   */
  private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
    // If RDD was already generated, then retrieve it from HashMap,
    // or else compute the RDD
    generatedRDDs.get(time).orElse {
      // Compute the RDD if time is valid (e.g. correct time in a sliding window)
      // of RDD generation, else generate nothing.
      if (isTimeValid(time)) {

        val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
          // Disable checks for existing output directories in jobs launched by the streaming
          // scheduler, since we may need to write output to an existing directory during checkpoint
          // recovery; see SPARK-4835 for more details. We need to have this call here because
          // compute() might cause Spark jobs to be launched.
          PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
            compute(time)
          }
        }

        rddOption.foreach { case newRDD =>
          // Register the generated RDD for caching and checkpointing
          if (storageLevel != StorageLevel.NONE) {
            newRDD.persist(storageLevel)
            logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
          }
          if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
            newRDD.checkpoint()
            logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
          }
          generatedRDDs.put(time, newRDD)
        }
        rddOption
      } else {
        None
      }
    }
  }

generatedRDDs, used above, is this data structure:

  // RDDs generated, marked as private[streaming] so that testsuites can access it
  @transient
  private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
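
A minimal sketch of the same cache-or-compute idea (hypothetical names; the real getOrCompute also validates the time and handles persist/checkpoint): look the value up by its time key first, otherwise compute it and remember it, just as getOrCompute does with generatedRDDs.

import scala.collection.mutable.HashMap

object GetOrComputeSketch {
  // Time-keyed cache, analogous to generatedRDDs: HashMap[Time, RDD[T]].
  private val generated = new HashMap[Long, Seq[Int]]()

  private def compute(time: Long): Seq[Int] = {
    println(s"computing for time $time")
    Seq(1, 2, 3).map(_ * time.toInt)
  }

  // Retrieve from the cache if present, otherwise compute and cache.
  def getOrCompute(time: Long): Seq[Int] =
    generated.getOrElseUpdate(time, compute(time))

  def main(args: Array[String]): Unit = {
    getOrCompute(10)   // computes
    getOrCompute(10)   // served from the cache, no recompute
  }
}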

To sum up:
JobGenerator.generateJobs(time: Time) => graph.generateJobs(time) => outputStream.generateJob(time)

Back in the JobGenerator.start() method, it uses the checkpoint to decide whether this is the first start:

    if (ssc.isCheckpointPresent) {
      restart()
    } else {
      startFirstTime()
    }

Look at JobGenerator.startFirstTime():

  /** Starts the generator for the first time */
  private def startFirstTime() {
    val startTime = new Time(timer.getStartTime())
    graph.start(startTime - graph.batchDuration)
    timer.start(startTime.milliseconds)
    logInfo("Started JobGenerator at " + startTime)
  }

On the first start, graph.start tells the DStreamGraph the time of the first batch; the important work happens in timer.start.
The timer lives in JobGenerator and is only concerned with time; the clock it uses is just a clock.
Pay attention to the timer here: it is constructed with an anonymous function.

  private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

  val clock = {
    val clockClass = ssc.sc.conf.get(
      "spark.streaming.clock", "org.apache.spark.util.SystemClock")
    try {
      Utils.classForName(clockClass).newInstance().asInstanceOf[Clock]
    } catch {
      case e: ClassNotFoundException if clockClass.startsWith("org.apache.spark.streaming") =>
        val newClockClass = clockClass.replace("org.apache.spark.streaming", "org.apache.spark")
        Utils.classForName(newClockClass).newInstance().asInstanceOf[Clock]
    }
  }

RecurringTimer starts yet another background thread, which loops continuously. Note the callback here: to find out where the callback comes from, look at how the timer is instantiated. It is in fact instantiated in JobGenerator with an anonymous function, very concise:

private[streaming]
class RecurringTimer(clock: Clock, period: Long, callback: (Long) => Unit, name: String)
  extends Logging {

  private val thread = new Thread("RecurringTimer - " + name) {
    setDaemon(true)
    override def run() { loop }
  }

In JobGenerator, RecurringTimer is instantiated with the anonymous function:

  private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

Driven by time, this function keeps posting GenerateJobs(new Time(longTime)) messages to the "JobGenerator" event loop, and the event processing does indeed handle the GenerateJobs(time) message:

  /** Processes all events */
  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time, clearCheckpointDataLater) =>
        doCheckpoint(time, clearCheckpointDataLater)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }

At this point everything is connected: jobs are generated based on the batch time.
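
A minimal sketch of the whole cycle (hypothetical names, not the Spark source): a recurring timer fires once per period and posts a generate-jobs message to an event queue, and handling that message is where the job for the batch time would be built.

import java.util.concurrent.LinkedBlockingQueue

object RecurringTimerSketch {
  case class GenerateJobs(time: Long)

  private val events = new LinkedBlockingQueue[GenerateJobs]()

  // Recurring timer: fires the callback once per period, like RecurringTimer posting GenerateJobs.
  private def startTimer(periodMs: Long)(callback: Long => Unit): Thread = {
    val t = new Thread("recurring-timer") {
      setDaemon(true)
      override def run(): Unit = {
        while (true) {
          callback(System.currentTimeMillis())
          Thread.sleep(periodMs)
        }
      }
    }
    t.start()
    t
  }

  def main(args: Array[String]): Unit = {
    startTimer(periodMs = 1000)(time => events.put(GenerateJobs(time)))

    // Event-loop side: each message is where graph.generateJobs(time) would be called.
    (1 to 3).foreach { _ =>
      val event = events.take()
      println(s"generate jobs for batch time ${event.time}")
    }
  }
}
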
Let's look at generateJobs once more:

  /** Generate jobs and perform checkpoint for the given `time`.  */
  private def generateJobs(time: Time) {
    // Set the SparkEnv in this thread, so that job generation code can access the environment
    // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
    // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
    SparkEnv.set(ssc.env)
    Try {
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
    }
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
  }

To summarize, the steps by which JobGenerator.generateJobs() produces a Job:
Step 1: obtain the data for the current time interval;
Step 2: generate the Job (the Job here is only the wrapped business logic; the dependencies among RDDs make up the Job);
Step 3: obtain the input information for the stream ids corresponding to the generated jobs;
Step 4: wrap everything into a JobSet and hand it to the JobScheduler for scheduling;
Step 5: post a message to perform the checkpoint.
Note:
receiverTracker receives the metadata of the data, not the data itself, and allocateBlocksToBatch(time) is then applied to it;
graph.generateJobs(time) obtains the RDD DAG dependencies; traversing from back to front, the RDD DAG is fully built exactly when the traversal finishes;
the Job is the business logic of the code; like the RDD dependencies, it is wrapped into a function, which is in effect the last function, derived from back to front;
if the Job is built successfully, the metadata belonging to it is obtained;
if the Job is built successfully, a JobSet is generated from the time, the batchDuration of data to process, and the wrapped business logic; it contains both the data and the business logic.

JobSet:

/** Class representing a set of Jobs
  * belong to the same batch.
  */
private[streaming]
case class JobSet(
    time: Time,
    jobs: Seq[Job],
    streamIdToInputInfo: Map[Int, StreamInputInfo] = Map.empty) {

jobScheduler.submitJobSet():

  def submitJobSet(jobSet: JobSet) {
    if (jobSet.jobs.isEmpty) {
      logInfo("No jobs added for time " + jobSet.time)
    } else {
      listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
      jobSets.put(jobSet.time, jobSet)
      jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
      logInfo("Added jobs for time " + jobSet.time)
    }
  }

JobHandler(job) is just a Runnable. This shows once again that the Job is our business logic: it represents the dependencies among RDDs, a higher-level abstraction by the Spark Streaming framework over operations on RDDs, and because it is abstract rather than physical-level it does not execute immediately.
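
A minimal sketch of that hand-off (hypothetical names; the real JobHandler also posts JobScheduler events and manages job ids): the job's wrapped function becomes a Runnable executed on the job executor's thread pool, and only there does the deferred work finally run.

import java.util.concurrent.Executors

object SubmitJobSketch {
  // A Job in the Spark Streaming sense: a batch time plus the wrapped business logic.
  final class Job(val time: Long, func: () => Unit) {
    def run(): Unit = func()
  }

  // Simplified JobHandler: just a Runnable that runs the job on a pool thread.
  final class JobHandler(job: Job) extends Runnable {
    override def run(): Unit = {
      println(s"running job for batch ${job.time} on ${Thread.currentThread().getName}")
      job.run()   // this is where the deferred work would finally fire
    }
  }

  private val jobExecutor = Executors.newFixedThreadPool(2)

  def submitJobSet(jobs: Seq[Job]): Unit =
    jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))

  def main(args: Array[String]): Unit = {
    submitJobSet(Seq(new Job(1000L, () => println("batch 1000 done")),
                     new Job(2000L, () => println("batch 2000 done"))))
    jobExecutor.shutdown()
  }
}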

This post is based on the course 'Source Code Version Customization Release Class' by Mr. Wang Jialin; many thanks to him!
You are welcome to exchange technical knowledge here, so we can learn together and make progress together!
