How Spark Streaming Works


1 Starting the Stream Processing Engine

1.1 Initializing the StreamingContext

The first step is to initialize a StreamingContext. During initialization, components such as the DStreamGraph and the JobScheduler are created. The DStreamGraph is analogous to the RDD lineage DAG: it holds the directed acyclic graph of dependencies between DStreams. The JobScheduler periodically consults the DStreamGraph and, based on the incoming data, generates and runs the jobs.
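As a minimal sketch of this first step (the master URL, application name, and 1-second batch interval below are placeholders, not taken from the article), creating the context is what builds the DStreamGraph and JobScheduler internally:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Constructing the StreamingContext creates the DStreamGraph and JobScheduler internally.
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingDemo")
val ssc = new StreamingContext(conf, Seconds(1)) // the batch interval drives the JobGenerator timer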

 

1.2 Creating the InputDStream

Depending on which data source you use, a different kind of input DStream is created.
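For example (a sketch; the host, port, ZooKeeper quorum, group id, and topic below are placeholders), a socket source and a receiver-based Kafka source yield different InputDStream subclasses:

// Socket source: returns a SocketInputDStream
val lines = ssc.socketTextStream("localhost", 9999)

// Receiver-based Kafka source (spark-streaming-kafka 0.8 API): returns a KafkaInputDStream
// import org.apache.spark.streaming.kafka.KafkaUtils
// val kafkaLines = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))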

 

1.3 Starting the JobScheduler

Once the InputDStream has been created, calling StreamingContext.start() starts the application, which in turn starts the JobScheduler. When the JobScheduler starts, it instantiates the ReceiverTracker and the JobGenerator.

def start(): Unit = synchronized {
  // exit if the JobScheduler has already been started
  if (eventLoop != null) return

  logDebug("Starting JobScheduler")
  eventLoop = new EventLoop[JobSchedulerEvent]("JobScheduler") {
    override protected def onReceive(event: JobSchedulerEvent): Unit = processEvent(event)

    override protected def onError(e: Throwable): Unit = reportError("Error in job scheduler", e)
  }
  eventLoop.start()

  // register each InputDStream's rate controller as a streaming listener
  for {
    inputDStream <- ssc.graph.getInputStreams
    rateController <- inputDStream.rateController
  } ssc.addStreamingListener(rateController)

  listenerBus.start()
  // build the ReceiverTracker and the InputInfoTracker
  receiverTracker = new ReceiverTracker(ssc)
  inputInfoTracker = new InputInfoTracker(ssc)

  val executorAllocClient: ExecutorAllocationClient = ssc.sparkContext.schedulerBackend match {
    case b: ExecutorAllocationClient => b.asInstanceOf[ExecutorAllocationClient]
    case _ => null
  }

  executorAllocationManager = ExecutorAllocationManager.createIfEnabled(
    executorAllocClient,
    receiverTracker,
    ssc.conf,
    ssc.graph.batchDuration.milliseconds,
    clock)
  executorAllocationManager.foreach(ssc.addStreamingListener)
  // start the ReceiverTracker
  receiverTracker.start()
  // start the JobGenerator
  jobGenerator.start()
  executorAllocationManager.foreach(_.start())
  logInfo("Started JobScheduler")
}

 

1.4 Starting the JobGenerator

Starting the JobGenerator first checks whether this is the first run. If it is not, the state from the last checkpoint is restored; if it is the first run, startFirstTime is called, which computes the timer's start time and then starts both the DStreamGraph and the timer.

private def startFirstTime() {
  val startTime = new Time(timer.getStartTime())
  graph.start(startTime - graph.batchDuration)
  timer.start(startTime.milliseconds)
  logInfo("Started JobGenerator at " + startTime)
}

 

The timer's getStartTime method computes when the next batch interval expires; it effectively rounds the current time up to the next multiple of the batch interval: (floor(currentTime / batchInterval) + 1) × batchInterval.
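A minimal sketch of that computation (mirroring what RecurringTimer.getStartTime does; the helper name and sample numbers are illustrative):

// Next interval boundary strictly after `nowMs`, in milliseconds.
def nextStartTime(nowMs: Long, periodMs: Long): Long =
  (math.floor(nowMs.toDouble / periodMs) + 1).toLong * periodMs

// e.g. nowMs = 10250, periodMs = 1000  =>  nextStartTime = 11000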

 

2 Receiving and Storing Stream Data

2.1 Starting the ReceiverTracker

When the ReceiverTracker starts, if the set of input streams is not empty it calls launchReceivers, which sends a StartAllReceivers message to the ReceiverTrackerEndpoint to start all the Receivers.

private def launchReceivers(): Unit = {
  val receivers = receiverInputStreams.map { nis =>
    val rcvr = nis.getReceiver()
    rcvr.setReceiverId(nis.id)
    rcvr
  }

  runDummySparkJob()

  // send the message that starts all receivers
  endpoint.send(StartAllReceivers(receivers))
}

 

case StartAllReceivers(receivers) =>
  // use the receiver scheduling policy to compute, for each receiver, the executors it should run on
  val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors)
  // iterate over the receivers: record the candidate executors for each one, then start it
  for (receiver <- receivers) {
    val executors = scheduledLocations(receiver.streamId)
    updateReceiverScheduledExecutors(receiver.streamId, executors)
    // remember the receiver's preferred location
    receiverPreferredLocations(receiver.streamId) = receiver.preferredLocation
    // start each Receiver
    startReceiver(receiver, executors)
  }

Finally, a ReceiverSupervisor is created and started; it is the supervisor that starts the Receiver.
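In outline, the supervisor's start path looks roughly like this (a simplified sketch of ReceiverSupervisor, not the complete class):

// Simplified: ReceiverSupervisor.start() first runs its own onStart() hook
// (which in ReceiverSupervisorImpl starts the registered BlockGenerators, see 2.3)
// and then starts the wrapped Receiver.
def start() {
  onStart()        // e.g. start the BlockGenerators
  startReceiver()  // register with the driver, then call receiver.onStart()
}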

2.2 Receiver Startup and Data Reception

When a Receiver starts, the onStart method of the concrete subclass is invoked, and that is where data is received. Taking Kafka as an example, it creates a connection from the supplied configuration, obtains the message streams, builds a thread pool, and assigns one thread per message stream to process the data:

def onStart() {
  // Kafka connection parameters
  val props = new Properties()
  kafkaParams.foreach(param => props.put(param._1, param._2))

  val zkConnect = kafkaParams("zookeeper.connect")
  // Create the connection to the cluster
  logInfo("Connecting to Zookeeper: " + zkConnect)
  // build the consumer configuration
  val consumerConfig = new ConsumerConfig(props)
  // create the consumer connector from the configuration
  consumerConnector = Consumer.create(consumerConfig)
  logInfo("Connected to " + zkConnect)

  // build the key decoder and the value decoder
  val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
    .newInstance(consumerConfig.props)
    .asInstanceOf[Decoder[K]]
  val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
    .newInstance(consumerConfig.props)
    .asInstanceOf[Decoder[V]]

  // Create threads for each topic/message Stream we are listening
  // create the message streams
  val topicMessageStreams = consumerConnector.createMessageStreams(
    topics, keyDecoder, valueDecoder)

  // build the thread pool
  val executorPool =
    ThreadUtils.newDaemonFixedThreadPool(topics.values.sum, "KafkaMessageHandler")
  try {
    // start handling the data of each stream
    topicMessageStreams.values.foreach { streams =>
      streams.foreach { stream => executorPool.submit(new MessageHandler(stream)) }
    }
  } finally {
    executorPool.shutdown() // Just causes threads to terminate after work is done
  }
}

 

2.3 Starting the BlockGenerator to Produce Blocks

In ReceiverSupervisorImpl's onStart method, BlockGenerator.start is called to start the BlockGenerator(s):

override protected def onStart() {
  registeredBlockGenerators.asScala.foreach { _.start() }
}

On start, the generator first changes its state to Active and then starts two threads:

blockIntervalTimer: periodically closes the current batch of buffered records and packages the previous batch into a block

blockPushingThread: pushes the resulting blocks to the BlockManager

def start(): Unit = synchronized {
  if (state == Initialized) {
    // switch state to Active
    state = Active
    // start the timer that periodically wraps the buffered records into blocks
    blockIntervalTimer.start()
    // start the thread that keeps pushing the generated blocks
    blockPushingThread.start()
    logInfo("Started BlockGenerator")
  } else {
    throw new SparkException(
      s"Cannot start BlockGenerator as its not in the Initialized state [state = $state]")
  }
}

 

private def updateCurrentBuffer(time: Long): Unit = {
  try {
    var newBlock: Block = null
    synchronized {
      // if the current buffer is non-empty...
      if (currentBuffer.nonEmpty) {
        // ...swap it out and replace currentBuffer with a fresh buffer
        val newBlockBuffer = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        // build a block id
        val blockId = StreamBlockId(receiverId, time - blockIntervalMs)
        listener.onGenerateBlock(blockId)
        // build the block
        newBlock = new Block(blockId, newBlockBuffer)
      }
    }
    // if a new block was produced, put it on the push queue; when the queue is full this call
    // blocks until the pushing thread has drained blocks to the block manager
    if (newBlock != null) {
      blocksForPushing.put(newBlock)  // put is blocking when queue is full
    }
  } catch {
    case ie: InterruptedException =>
      logInfo("Block updating timer thread was interrupted")
    case e: Exception =>
      reportError("Error in block updating thread", e)
  }
}
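The other thread, blockPushingThread, runs a loop along these lines (a simplified sketch of BlockGenerator.keepPushingBlocks that omits the shutdown and draining logic):

// Simplified: drain blocksForPushing and hand each block to the listener; in
// ReceiverSupervisorImpl the listener ends up storing the block via the ReceivedBlockHandler.
private def keepPushingBlocks() {
  while (areBlocksBeingGenerated) {
    Option(blocksForPushing.poll(10, TimeUnit.MILLISECONDS)) match {
      case Some(block) => listener.onPushBlock(block.id, block.buffer)
      case None        => // no block ready yet, keep polling
    }
  }
  // ...on shutdown, any remaining queued blocks are pushed before the thread exits (omitted)
}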

 

2.4 Storing the Data

The Receiver takes care of storing the data. When individual records are small, several records are accumulated into a block before block-level storage; when the data volume is large, the data is stored directly. The accumulate-into-a-block path goes through Receiver.store, which calls ReceiverSupervisor.pushSingle. pushSingle keeps the record in memory first; the BlockGenerator's timer thread blockIntervalTimer later packages the buffered records into a block and queues it, and the block is then handled via ReceiverSupervisor.pushArrayBuffer.
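The single-record path is tiny; roughly what ReceiverSupervisorImpl does is append the record to the BlockGenerator's in-memory buffer:

// Roughly: a single record is only buffered here; blocks are formed later by blockIntervalTimer.
def pushSingle(data: Any) {
  defaultBlockGenerator.addData(data)
}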

Both paths ultimately call pushAndReportBlock. That method stores the data through ReceivedBlockHandler.storeBlock (also writing a write-ahead log if so configured), and then reports the stored block to the driver:

def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  // obtain a block id
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  // store the block
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  // number of records stored
  val numRecords = blockStoreResult.numRecords
  // build the ReceivedBlockInfo
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  // send an AddBlock message to the ReceiverTrackerEndpoint
  trackerEndpoint.askWithRetry[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}

 

 

 

3 Processing the Data

We know that an output (action) operation on a DStream is what triggers a job. Take saveAsTextFiles as an example:

def saveAsTextFiles(prefix: String, suffix: String = ""): Unit = ssc.withScope {
  // wrap a save function that internally calls RDD.saveAsTextFile
  val saveFunc = (rdd: RDD[T], time: Time) => {
    val file = rddToFileName(prefix, suffix, time)
    rdd.saveAsTextFile(file)
  }
  // apply it to every RDD via foreachRDD
  this.foreachRDD(saveFunc, displayInnerRDDOps = false)
}

 

foreachRDD creates a ForEachDStream over the current DStream and registers it with the DStreamGraph:

private def foreachRDD(
    foreachFunc: (RDD[T], Time) => Unit,
    displayInnerRDDOps: Boolean): Unit = {
  // create a ForEachDStream over this DStream and register it with the DStreamGraph
  new ForEachDStream(this,
    context.sparkContext.clean(foreachFunc, false), displayInnerRDDOps).register()
}

 

register adds this DStream to the DStreamGraph as an output stream:

private[streaming] def register(): DStream[T] = {
  // add this DStream to the DStreamGraph's output streams
  ssc.graph.addOutputStream(this)
  this
}

 

When the JobGenerator is constructed, it builds a recurring timer:

private val timer = new RecurringTimer(clock, ssc.graph.batchDuration.milliseconds,
  longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator")

 

The timer starts a background thread that keeps calling triggerActionForNextInterval; every interval this posts a GenerateJobs event, which is handled by processEvent:

 

private def processEvent(event: JobGeneratorEvent) {
  logDebug("Got event " + event)
  event match {
    case GenerateJobs(time) => generateJobs(time)
    case ClearMetadata(time) => clearMetadata(time)
    case DoCheckpoint(time, clearCheckpointDataLater) =>
      doCheckpoint(time, clearCheckpointDataLater)
    case ClearCheckpointData(time) => clearCheckpointData(time)
  }
}

 

JobGenerator#generateJobs

It calls DStreamGraph.generateJobs to produce the jobs for the batch, and then uses the JobScheduler to submit the resulting job set:

private def generateJobs(time: Time) {
  // Checkpoint all RDDs marked for checkpointing to ensure their lineages are truncated
  // periodically; otherwise a lineage that grows too long can overflow the stack.
  ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")
  Try {
    // assign the received blocks to this batch; one batch may contain several blocks
    jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
    // let the DStreamGraph generate the jobs for this batch time, using the allocated blocks
    graph.generateJobs(time) // generate jobs using allocated block
  } match {
    case Success(jobs) =>
      // on success, submit the job set
      val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
      jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
    case Failure(e) =>
      jobScheduler.reportError("Error generating jobs for time " + time, e)
      PythonDStream.stopStreamingContextIfPythonProcessIsDead(e)
  }
  // trigger checkpointing
  eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
}

# DStreamGraph.generateJobs produces the job set for the given batch time

def generateJobs(time: Time): Seq[Job] = {
  logDebug("Generating jobs for time " + time)
  // create the jobs from the DStreamGraph's output streams
  val jobs = this.synchronized {
    outputStreams.flatMap { outputStream =>
      // call DStream.generateJob to produce a job for this output stream
      val jobOption = outputStream.generateJob(time)
      jobOption.foreach(_.setCallSite(outputStream.creationSite))
      jobOption
    }
  }
  logDebug("Generated " + jobs.length + " jobs for time " + time)
  jobs
}

 

# That in turn calls DStream.generateJob to produce each job

private[streaming] def generateJob(time: Time): Option[Job] = {
  getOrCompute(time) match {
    case Some(rdd) =>
      val jobFunc = () => {
        val emptyFunc = { (iterator: Iterator[T]) => {} }
        context.sparkContext.runJob(rdd, emptyFunc)
      }
      Some(new Job(time, jobFunc))
    case None => None
  }
}

 

# Finally, the job set is submitted

submitJobSet iterates over the jobs in the set and wraps each one in a JobHandler. JobHandler is a Runnable; its run method posts a JobStarted event to the JobScheduler and then runs the job:

private class JobHandler(job: Job) extends Runnable with Logging {
  import JobScheduler._

  def run() {
    val oldProps = ssc.sparkContext.getLocalProperties
    try {
      ssc.sparkContext.setLocalProperties(SerializationUtils.clone(ssc.savedProperties.get()))
      val formattedTime = UIUtils.formatBatchTime(
        job.time.milliseconds, ssc.graph.batchDuration.milliseconds, showYYYYMMSS = false)
      val batchUrl = s"/streaming/batch/?id=${job.time.milliseconds}"
      val batchLinkText = s"[output operation ${job.outputOpId}, batch time ${formattedTime}]"

      ssc.sc.setJobDescription(
        s"""Streaming job from <a href="$batchUrl">$batchLinkText</a>""")
      ssc.sc.setLocalProperty(BATCH_TIME_PROPERTY_KEY, job.time.milliseconds.toString)
      ssc.sc.setLocalProperty(OUTPUT_OP_ID_PROPERTY_KEY, job.outputOpId.toString)
      ssc.sparkContext.setLocalProperty(RDD.CHECKPOINT_ALL_MARKED_ANCESTORS, "true")

      var _eventLoop = eventLoop
      if (_eventLoop != null) {
        _eventLoop.post(JobStarted(job, clock.getTimeMillis()))
        PairRDDFunctions.disableOutputSpecValidation.withValue(true) {
          job.run() // this is where the job is actually executed
        }
        _eventLoop = eventLoop
        if (_eventLoop != null) {
          _eventLoop.post(JobCompleted(job, clock.getTimeMillis()))
        }
      } else {
        // the JobScheduler has already been stopped
      }
    } finally {
      ssc.sparkContext.setLocalProperties(oldProps)
    }
  }
}

 

private def handleJobStart(job: Job, startTime: Long) {
  // look up the JobSet for this batch time
  val jobSet = jobSets.get(job.time)
  // has this job set started running yet?
  val isFirstJobOfJobSet = !jobSet.hasStarted
  // update the job set's start time
  jobSet.handleJobStart(job)
  if (isFirstJobOfJobSet) {
    listenerBus.post(StreamingListenerBatchStarted(jobSet.toBatchInfo))
  }
  job.setStartTime(startTime)
  listenerBus.post(StreamingListenerOutputOperationStarted(job.toOutputOperationInfo))
  logInfo("Starting job " + job.id + " from job set of time " + jobSet.time)
}