Lesson 6: Spark Streaming Source Code Analysis - Dynamic Job Generation and Deeper Thoughts

This lecture covers:
a. Deeper thoughts on Spark Streaming Job generation
b. Source code analysis of Spark Streaming Job generation

A perspective on streams: when building big data applications, if the workload is not a streaming job we usually launch it with a scheduler/timer (say once an hour or once a day) rather than submitting it by hand, often driven from a JavaEE layer. Viewed from a distance, such scheduled tasks are themselves a kind of stream processing, just with a much larger batch interval. In that sense all data processing can be seen as stream processing in different guises; going one step further, all processing will eventually be unified under stream processing.

A Job in Spark Streaming is like the Runnable that a Java thread executes: it is an encapsulation of the business logic and is not the same concept as a Job in Spark Core. A Spark Core Job is an actual running job; when we talk about a Spark Core Job we are talking about one concrete piece of work being done, whereas a Spark Streaming Job wraps Spark Core jobs and is a higher-level abstraction.

JobGenerator: generates jobs based on the dependencies among DStreams, i.e. the DStreamGraph. It produces many jobs, because it keeps generating them once per batchDuration.
DStreams fall into three types:
1. Input DStreams: an InputDStream can be built on different data sources, for example socket, Kafka, Flume, etc.
2. Output DStreams: an output operation is a logical-level action. It is called logical because it is an action proposed at the framework level; underneath it is still translated into a physical-level action, i.e. an RDD action.
3. Transformation DStreams: a transformation is a state transition, i.e. the process of applying the business logic.
A DStream is produced in two ways: (1) directly from a data source, or (2) by applying a transformation to another DStream, producing a new DStream.
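
A minimal word-count sketch (hypothetical app, assuming a socket source on localhost:9999, not taken from the course code) that shows all three roles at once: socketTextStream builds the input DStream, flatMap/map/reduceByKey are transformation DStreams, and print() is the output operation, i.e. the logical-level action:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StreamingSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10))          // batchDuration = 10 seconds

    val lines = ssc.socketTextStream("localhost", 9999)        // input DStream
    val counts = lines.flatMap(_.split(" "))                   // transformation DStreams
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.print()                                             // output operation (logical-level action)

    ssc.start()
    ssc.awaitTermination()
  }
}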

Besides the jobs generated on the fixed schedule, Spark Streaming also produces jobs in other ways, for example aggregation operations or state-based operations; these are not driven by a single batchDuration but operate over many batchDurations. Even so, JobGenerator is the most basic and most central piece. To support things like window operations, JobGenerator, as its class comment puts it, 'generates jobs from DStreams as well as drives checkpointing and cleaning up DStream metadata.'

A Spark Streaming application is triggered by time, whereas Apache Storm is triggered by events (one record after another).

The entry point of a Streaming program specifies the batchDuration:
JavaStreamingContext jsc = new JavaStreamingContext(conf, Durations.seconds(10));
Durations.seconds(10) is the batch duration. The code above means that every 10 seconds JobGenerator produces a Job, and this Job is logical-level: the Job exists and knows exactly what to do, but it has not done it yet. Who actually does it? The physical-level RDD action underneath triggers it. Spark Streaming builds its Job from the dependencies among DStreams, which is why the Job it builds is logical-level, while the physical level underneath is based on the dependencies among RDDs. A DStream action is also a logical-level operation: Spark Streaming produces a logical-level Job from your action but does not run it, just like the Runnable a thread needs for the code it will eventually execute. Precisely because the Job is logical-level and no physical-level job has been generated yet, there is still an opportunity to schedule and optimize it in various ways.

The logical-level DStream dependencies are translated into physical-level RDD dependencies, and the last operation is necessarily an RDD action. How do we perform this translation without letting that final RDD action immediately trigger and execute a job? We wrap the translated result (the RDDs) in something Runnable-like, that is, we encapsulate the RDD dependencies inside a method; since the code sits in a method that has not been called, the final RDD action is not executed immediately, and the Job is instead put into a queue to be managed.

To both perform the translation and keep the result under management, the DStream dependencies are translated into RDD dependencies and the final DStream action into an RDD action, and the translated content is placed as one block inside a function body. Because that is only a definition and has not been executed, the RDD action inside does not trigger a job. When the JobScheduler later decides to schedule this job, it takes a thread from its thread pool and executes the method we just wrapped.
In fact, if the action triggered the job right at translation time, there would be no queue, no metadata and so on, and the submission of Spark Streaming jobs would be completely unmanaged.
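
A minimal sketch of this "translate now, run later" idea (hypothetical names, not the Spark Streaming source): wrapping the final RDD action inside a plain function defers the actual Spark job until the function is invoked, which is essentially what a Streaming Job's jobFunc does.

import org.apache.spark.{SparkConf, SparkContext}

object DeferredJobSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DeferredJobSketch").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 100)

    // "Translation" step: wrap the action in a function. Nothing runs here --
    // this mirrors how a Streaming Job wraps the final RDD action in jobFunc.
    val jobFunc: () => Long = () => rdd.map(_ * 2).count()

    // The "scheduler" decides when to run it; only this call triggers a Spark job.
    val result = jobFunc()
    println(s"count = $result")

    sc.stop()
  }
}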

Now let's look at it from the source code.
The three cores of dynamic job generation in Spark Streaming:
a. JobGenerator: responsible for generating jobs
b. JobScheduler: responsible for scheduling jobs
c. ReceiverTracker: obtains the metadata, i.e. records where the data comes from
Both generating and scheduling jobs need this metadata. JobGenerator and ReceiverTracker are members of JobScheduler.

JobScheduler.start():

    receiverTracker = new ReceiverTracker(ssc)
    receiverTracker.start()
    jobGenerator.start()

jobGenerator.start(): every streaming program calls it when it starts. Checkpointing will be covered later.

  /** Start generation of jobs */
  def start(): Unit = synchronized {
    if (eventLoop != null) return // generator has already been started

    // Call checkpointWriter here to initialize it before eventLoop uses it to avoid a deadlock.
    // See SPARK-10125
    checkpointWriter

    eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }
    eventLoop.start()

    if (ssc.isCheckpointPresent) {
      restart()
    } else {
      startFirstTime()
    }
  }

Because jobs are generated in an endless loop, an EventLoop is needed; onReceive is overridden here via an anonymous inner class.

EventLoop.scala:

  private val eventThread = new Thread(name) {
    setDaemon(true)

    override def run(): Unit = {
      try {
        while (!stopped.get) {
          val event = eventQueue.take()
          try {
            onReceive(event)
          } catch {
            case NonFatal(e) => {
              try {
                onError(e)
              } catch {
                case NonFatal(e) => logError("Unexpected error in " + name, e)
              }
            }
          }
        }
      } catch {
        case ie: InterruptedException => // exit even if eventQueue is not empty
        case NonFatal(e) => logError("Unexpected error in " + name, e)
      }
    }
  }

EventLoop.start():

  def start(): Unit = {
    if (stopped.get) {
      throw new IllegalStateException(name + " has already been stopped")
    }
    // Call onStart before starting the event thread to make sure it happens before onReceive
    onStart()
    eventThread.start()
  }

EventLoop has a background thread inside; once started, it keeps looping, taking an event from eventQueue and calling onReceive(event).
EventLoop.start() calls eventThread.start(), i.e. the thread's start(), which is what drives that loop.
onReceive(event) itself is an abstract method, and note that it must not block inside, as the following comment warns:

  /**
   * Invoked in the event thread when polling events from the event queue.
   *
   * Note: Should avoid calling blocking actions in `onReceive`, or the event thread will be blocked
   * and cannot process events in time. If you want to call some blocking actions, run them in
   * another thread.
   */
  protected def onReceive(event: E): Unit

One principle: a message loop should generally not handle time-consuming business logic itself but route it to other threads.
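
A minimal sketch of that principle (hypothetical, not the Spark source): the event thread only drains a blocking queue and hands any heavy work to a worker pool, so the loop itself never blocks.

import java.util.concurrent.{Executors, LinkedBlockingQueue}

object EventLoopSketch {
  sealed trait Event
  case class HeavyWork(payload: String) extends Event

  private val queue = new LinkedBlockingQueue[Event]()
  private val workers = Executors.newFixedThreadPool(4)   // heavy work runs here, not on the event thread

  private val eventThread = new Thread("event-loop") {
    setDaemon(true)
    override def run(): Unit = {
      while (true) {
        queue.take() match {
          case HeavyWork(payload) =>
            // Route the expensive part to the worker pool and return to the loop immediately.
            workers.execute(new Runnable {
              override def run(): Unit = println(s"processing $payload on ${Thread.currentThread().getName}")
            })
        }
      }
    }
  }

  def main(args: Array[String]): Unit = {
    eventThread.start()
    (1 to 3).foreach(i => queue.put(HeavyWork(s"event-$i")))
    Thread.sleep(1000)
    workers.shutdown()
  }
}
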
The implementation of onReceive(event) lives in JobGenerator.scala:

    eventLoop = new EventLoop[JobGeneratorEvent]("JobGenerator") {
      override protected def onReceive(event: JobGeneratorEvent): Unit = processEvent(event)

      override protected def onError(e: Throwable): Unit = {
        jobScheduler.reportError("Error in job generator", e)
      }
    }

which routes the event to processEvent:

  /** Processes all events */
  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time, clearCheckpointDataLater) =>
        doCheckpoint(time, clearCheckpointDataLater)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }

Let's look at the generateJobs(time) method.

  /** Generate jobs and perform checkpoint for the given `time`.  */
  private def generateJobs(time: Time) {
    // Set the SparkEnv in this thread, so that job generation code can access the environment
    // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
    // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
    SparkEnv.set(ssc.env)
    Try {
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
    }
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
  }

Because a job is to be generated, the data must be fixed first (we won't look at that in detail here); the job is then generated for the fixed time.
graph.generateJobs(time) is the key point: outputStreams are the last DStreams of the whole DStream lineage, and the job is produced for the given time from the operations on those last DStreams (very similar to how an RDD produces a job!).

  def generateJobs(time: Time): Seq[Job] = {
    logDebug("Generating jobs for time " + time)
    val jobs = this.synchronized {
      outputStreams.flatMap { outputStream =>
        val jobOption = outputStream.generateJob(time)
        jobOption.foreach(_.setCallSite(outputStream.creationSite))
        jobOption
      }
    }
    logDebug("Generated " + jobs.length + " jobs for time " + time)
    jobs
  }

outputStream.generateJob(time) is the key point!

  /**
   * Generate a SparkStreaming job for the given time. This is an internal method that
   * should not be called directly. This default implementation creates a job
   * that materializes the corresponding RDD. Subclasses of DStream may override this
   * to generate their own jobs.
   */
  private[streaming] def generateJob(time: Time): Option[Job] = {
    getOrCompute(time) match {
      case Some(rdd) => {
        val jobFunc = () => {
          val emptyFunc = { (iterator: Iterator[T]) => {} }
          context.sparkContext.runJob(rdd, emptyFunc)
        }
        Some(new Job(time, jobFunc))
      }
      case None => None
    }
  }

The key is jobFunc: to put the generated Job into a queue, the Job itself is wrapped inside a function, and because it is only a function it certainly does not execute. emptyFunc does nothing at all.
context.sparkContext.runJob(rdd, emptyFunc) would walk the RDD dependency chain and trigger real scheduling, but since it is wrapped inside jobFunc here it is definitely not executed yet:

  /**
   * Run a job on all partitions in an RDD and return the results in an array.
   */
  def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
    runJob(rdd, func, 0 until rdd.partitions.length)
  }

A DStream is logical-level and an RDD is physical-level: 'This default implementation creates a job that materializes the corresponding RDD.'
Some(new Job(time, jobFunc)) produces the Job. This Job belongs to the spark.streaming.scheduler layer, and 'It may contain multiple Spark jobs':

/**
 * Class representing a Spark computation. It may contain multiple Spark jobs.
 */
private[streaming]
class Job(val time: Time, func: () => _) {
  private var _id: String = _
  private var _outputOpId: Int = _
  private var isSet = false
  private var _result: Try[_] = null
  private var _callSite: CallSite = null
  private var _startTime: Option[Long] = None
  private var _endTime: Option[Long] = None

Now look at DStream.getOrCompute(time). It generates the RDD (the last RDD in the lineage) for the given time, keyed by time with the RDD as the value:

  /**
   * Get the RDD corresponding to the given time; either retrieve it from cache
   * or compute-and-cache it.
   */
  private[streaming] final def getOrCompute(time: Time): Option[RDD[T]] = {
    // If RDD was already generated, then retrieve it from HashMap,
    // or else compute the RDD
    generatedRDDs.get(time).orElse {
      // Compute the RDD if time is valid (e.g. correct time in a sliding window)
      // of RDD generation, else generate nothing.
      if (isTimeValid(time)) {

        val rddOption = createRDDWithLocalProperties(time, displayInnerRDDOps = false) {
          // Disable checks for existing output directories in jobs launched by the streaming
          // scheduler, since we may need to write output to an existing directory during checkpoint
          // recovery; see SPARK-4835 for more details. We need to have this call here because
          // compute() might cause Spark jobs to be launched.
          PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
            compute(time)
          }
        }

        rddOption.foreach { case newRDD =>
          // Register the generated RDD for caching and checkpointing
          if (storageLevel != StorageLevel.NONE) {
            newRDD.persist(storageLevel)
            logDebug(s"Persisting RDD ${newRDD.id} for time $time to $storageLevel")
          }
          if (checkpointDuration != null && (time - zeroTime).isMultipleOf(checkpointDuration)) {
            newRDD.checkpoint()
            logInfo(s"Marking RDD ${newRDD.id} for time $time for checkpointing")
          }
          generatedRDDs.put(time, newRDD)
        }
        rddOption
      } else {
        None
      }
    }
  }

generatedRDDs, used above, is this data structure:

  // RDDs generated, marked as private[streaming] so that testsuites can access it
  @transient
  private[streaming] var generatedRDDs = new HashMap[Time, RDD[T]]()
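
A minimal sketch of the same cache-or-compute idea (hypothetical names; the real getOrCompute also validates the time and handles persist/checkpoint): look the value up by its time key first, otherwise compute it and remember it, just as getOrCompute does with generatedRDDs.

import scala.collection.mutable.HashMap

object GetOrComputeSketch {
  // Time-keyed cache, analogous to generatedRDDs: HashMap[Time, RDD[T]].
  private val generated = new HashMap[Long, Seq[Int]]()

  private def compute(time: Long): Seq[Int] = {
    println(s"computing for time $time")
    Seq(1, 2, 3).map(_ * time.toInt)
  }

  // Retrieve from the cache if present, otherwise compute and cache.
  def getOrCompute(time: Long): Seq[Int] =
    generated.getOrElseUpdate(time, compute(time))

  def main(args: Array[String]): Unit = {
    getOrCompute(10)   // computes
    getOrCompute(10)   // served from the cache, no recompute
  }
}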

To sum up:
JobGenerator.generateJobs(time: Time) => graph.generateJobs(time) => outputStream.generateJob(time)

Back in the JobGenerator.start() method, it uses the checkpoint to decide whether this is the first start:

    if (ssc.isCheckpointPresent) {
      restart()
    } else {
      startFirstTime()
    }

Look at JobGenerator.startFirstTime():

  /** Starts the generator for the first time */
  private def startFirstTime() {
    val startTime = new Time(timer.getStartTime())
    graph.start(startTime - graph.batchDuration)
    timer.start(startTime.milliseconds)
    logInfo("Started JobGenerator at " + startTime)
  }

On the first start, graph.start tells the DStreamGraph the time of the first batch; the important work happens in timer.start.
The timer lives in JobGenerator and is only concerned with time; the clock it uses is just a clock.
Pay attention to the timer here: it is constructed with an anonymous function.

  private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

  val clock = {
    val clockClass = ssc.sc.conf.get(
      "spark.streaming.clock", "org.apache.spark.util.SystemClock")
    try {
      Utils.classForName(clockClass).newInstance().asInstanceOf[Clock]
    } catch {
      case e: ClassNotFoundException if clockClass.startsWith("org.apache.spark.streaming") =>
        val newClockClass = clockClass.replace("org.apache.spark.streaming", "org.apache.spark")
        Utils.classForName(newClockClass).newInstance().asInstanceOf[Clock]
    }
  }

RecurringTimer starts yet another background thread, which loops continuously. Note the callback here: to find out where the callback comes from, look at how the timer is instantiated. It is in fact instantiated in JobGenerator with an anonymous function, very concise:

private[streaming]
class RecurringTimer(clock: Clock, period: Long, callback: (Long) => Unit, name: String)
  extends Logging {

  private val thread = new Thread("RecurringTimer - " + name) {
    setDaemon(true)
    override def run() { loop }
  }

In JobGenerator, RecurringTimer is instantiated with the anonymous function:

  private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
    longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

Driven by time, this function keeps posting GenerateJobs(new Time(longTime)) messages to the "JobGenerator" event loop, and the event processing does indeed handle the GenerateJobs(time) message:

  /** Processes all events */
  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time, clearCheckpointDataLater) =>
        doCheckpoint(time, clearCheckpointDataLater)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }

At this point everything is connected: jobs are generated based on the batch time.
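
A minimal sketch of the whole cycle (hypothetical names, not the Spark source): a recurring timer fires once per period and posts a generate-jobs message to an event queue, and handling that message is where the job for the batch time would be built.

import java.util.concurrent.LinkedBlockingQueue

object RecurringTimerSketch {
  case class GenerateJobs(time: Long)

  private val events = new LinkedBlockingQueue[GenerateJobs]()

  // Recurring timer: fires the callback once per period, like RecurringTimer posting GenerateJobs.
  private def startTimer(periodMs: Long)(callback: Long => Unit): Thread = {
    val t = new Thread("recurring-timer") {
      setDaemon(true)
      override def run(): Unit = {
        while (true) {
          callback(System.currentTimeMillis())
          Thread.sleep(periodMs)
        }
      }
    }
    t.start()
    t
  }

  def main(args: Array[String]): Unit = {
    startTimer(periodMs = 1000)(time => events.put(GenerateJobs(time)))

    // Event-loop side: each message is where graph.generateJobs(time) would be called.
    (1 to 3).foreach { _ =>
      val event = events.take()
      println(s"generate jobs for batch time ${event.time}")
    }
  }
}
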
Let's look at generateJobs once more:

  /** Generate jobs and perform checkpoint for the given `time`.  */
  private def generateJobs(time: Time) {
    // Set the SparkEnv in this thread, so that job generation code can access the environment
    // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
    // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
    SparkEnv.set(ssc.env)
    Try {
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
    }
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
  }

To summarize, the steps by which JobGenerator.generateJobs() produces a Job:
Step 1: obtain the data for the current time interval;
Step 2: generate the Job (the Job here is only the wrapped business logic; the dependencies among RDDs make up the Job);
Step 3: obtain the input information for the stream ids corresponding to the generated jobs;
Step 4: wrap everything into a JobSet and hand it to the JobScheduler for scheduling;
Step 5: post a message to perform the checkpoint.
Note:
receiverTracker receives the metadata of the data, not the data itself, and allocateBlocksToBatch(time) is then applied to it;
graph.generateJobs(time) obtains the RDD DAG dependencies; traversing from back to front, the RDD DAG is fully built exactly when the traversal finishes;
the Job is the business logic of the code; like the RDD dependencies, it is wrapped into a function, which is in effect the last function, derived from back to front;
if the Job is built successfully, the metadata belonging to it is obtained;
if the Job is built successfully, a JobSet is generated from the time, the batchDuration of data to process, and the wrapped business logic; it contains both the data and the business logic.

JobSet:

/** Class representing a set of Jobs
  * belong to the same batch.
  */
private[streaming]
case class JobSet(
    time: Time,
    jobs: Seq[Job],
    streamIdToInputInfo: Map[Int, StreamInputInfo] = Map.empty) {

jobScheduler.submitJobSet():

  def submitJobSet(jobSet: JobSet) {
    if (jobSet.jobs.isEmpty) {
      logInfo("No jobs added for time " + jobSet.time)
    } else {
      listenerBus.post(StreamingListenerBatchSubmitted(jobSet.toBatchInfo))
      jobSets.put(jobSet.time, jobSet)
      jobSet.jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))
      logInfo("Added jobs for time " + jobSet.time)
    }
  }

JobHandler(job) is just a Runnable. This shows once again that the Job is our business logic: it represents the dependencies among RDDs, a higher-level abstraction by the Spark Streaming framework over operations on RDDs, and because it is abstract rather than physical-level it does not execute immediately.
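
A minimal sketch of that hand-off (hypothetical names; the real JobHandler also posts JobScheduler events and manages job ids): the job's wrapped function becomes a Runnable executed on the job executor's thread pool, and only there does the deferred work finally run.

import java.util.concurrent.Executors

object SubmitJobSketch {
  // A Job in the Spark Streaming sense: a batch time plus the wrapped business logic.
  final class Job(val time: Long, func: () => Unit) {
    def run(): Unit = func()
  }

  // Simplified JobHandler: just a Runnable that runs the job on a pool thread.
  final class JobHandler(job: Job) extends Runnable {
    override def run(): Unit = {
      println(s"running job for batch ${job.time} on ${Thread.currentThread().getName}")
      job.run()   // this is where the deferred work would finally fire
    }
  }

  private val jobExecutor = Executors.newFixedThreadPool(2)

  def submitJobSet(jobs: Seq[Job]): Unit =
    jobs.foreach(job => jobExecutor.execute(new JobHandler(job)))

  def main(args: Array[String]): Unit = {
    submitJobSet(Seq(new Job(1000L, () => println("batch 1000 done")),
                     new Job(2000L, () => println("batch 2000 done"))))
    jobExecutor.shutdown()
  }
}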

This post is based on the course 'Source Code Version Customization Release Class' by Mr. Wang Jialin; many thanks to him!
You are welcome to exchange technical knowledge here, so we can learn together and make progress together!
