Lesson 13: Spark Streaming Source Code Walkthrough: Driver Fault Tolerance


Topics in this lesson:
1. ReceivedBlockTracker fault tolerance
2. DStream and JobGenerator fault tolerance

Because the Driver directs the execution of the entire Spark program, its fault tolerance is critical. Here we look at Driver safety from the Spark Streaming point of view: the metadata of received data is protected with a write-ahead log (WAL), while the driver-level scheduling logic is protected with checkpointing.
ReceivedBlockTracker manages the metadata of the data received by the Spark Streaming application, which is the data level; DStream and JobGenerator are the core of framework scheduling, covering the business-logic level and the job-generation level. All three hold state, and all three need fault tolerance.
Fault tolerance persists historical state so that, after a failure, the component can recover from the saved state.
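
Before diving into the source, it helps to see how these two mechanisms are switched on from the application side. Below is a minimal sketch; the paths and application name are placeholders, and the receiver-side WAL flag is shown for completeness (the driver-side tracker WAL and checkpointing both hinge on the checkpoint directory):

// Minimal sketch: wiring up driver-side fault tolerance for a streaming app.
// Paths and app name are placeholders, not taken from the article.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("DriverFaultToleranceDemo")
  // Receiver-side WAL for the block data itself.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// Providing a checkpoint directory enables metadata checkpointing
// (DStreamGraph / JobGenerator state) and the tracker's write-ahead log.
ssc.checkpoint("hdfs://namenode:8020/checkpoint/driver-demo")
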
1. ReceivedBlockTracker Fault Tolerance
The class comment of ReceivedBlockTracker already explains its role clearly:

/**
 * Class that keep track of all the received blocks, and allocate them to batches
 * when required. All actions taken by this class can be saved to a write ahead log
 * (if a checkpoint directory has been provided), so that the state of the tracker
 * (received blocks and block-to-batch allocations) can be recovered after driver failure.
 *
 * Note that when any instance of this class is created with a checkpoint directory,
 * it will try reading events from logs in the directory.
 */
private[streaming] class ReceivedBlockTracker(
    conf: SparkConf,
    hadoopConf: Configuration,
    streamIds: Seq[Int],
    clock: Clock,
    recoverFromWriteAheadLog: Boolean,
    checkpointDirOption: Option[String])
  extends Logging {

How does ReceivedBlockTracker handle incoming data? What arrives here is metadata, receivedBlockInfo, a simple case class holding the streamId, the record count, optional receiver metadata, and a ReceivedBlockStoreResult.
After ReceivedBlockTracker receives, from ReceiverSupervisorImpl, the metadata of the data a receiver has stored, it first persists that metadata via writeToLog, the so-called cold backup; only then does it append the metadata to the in-memory structure streamIdToUnallocatedBlockQueues (a HashMap[Int, ReceivedBlockQueue], where ReceivedBlockQueue is a Queue[ReceivedBlockInfo]) for JobGenerator to consume.

  /** Add received block. This event will get written to the write ahead log (if enabled). */
  def addBlock(receivedBlockInfo: ReceivedBlockInfo): Boolean = {
    try {
      val writeResult = writeToLog(BlockAdditionEvent(receivedBlockInfo))
      if (writeResult) {
        synchronized {
          getReceivedBlockQueue(receivedBlockInfo.streamId) += receivedBlockInfo
        }
        logDebug(s"Stream ${receivedBlockInfo.streamId} received " +
          s"block ${receivedBlockInfo.blockStoreResult.blockId}")
      } else {
        logDebug(s"Failed to acknowledge stream ${receivedBlockInfo.streamId} receiving " +
          s"block ${receivedBlockInfo.blockStoreResult.blockId} in the Write Ahead Log.")
      }
      writeResult
    } catch {
      case NonFatal(e) =>
        logError(s"Error adding block $receivedBlockInfo", e)
        false
    }
  }
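
For reference, the receivedBlockInfo being written to the log above has roughly the following shape (paraphrased from memory of the Spark 1.x source, so treat the field names as approximate; ReceivedBlockStoreResult is reduced to a stub here):

// Approximate shape of the metadata written to the WAL (paraphrase, not a verbatim copy).
// Simplified stand-in for the real trait, which also carries store-specific details.
trait ReceivedBlockStoreResult { def blockId: org.apache.spark.storage.StreamBlockId }

case class ReceivedBlockInfo(
    streamId: Int,                              // which input stream the block belongs to
    numRecords: Option[Long],                   // record count, if the receiver knows it
    metadataOption: Option[Any],                // optional receiver-supplied metadata
    blockStoreResult: ReceivedBlockStoreResult  // where and how the block data was stored
)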

What is saved above is the metadata of received data that has not yet been allocated to a batch. Next, let's look at how allocation works.
Before blocks are allocated to a job, writeToLog is again called first to persist the allocation event, so that it can be recovered from the log after a failure.
The allocated blocks are keyed by batch time and kept in the in-memory structure timeToAllocatedBlocks, a HashMap[Time, AllocatedBlocks], where AllocatedBlocks is a case class holding streamIdToAllocatedBlocks, a Map[Int, Seq[ReceivedBlockInfo]].
Because timeToAllocatedBlocks can hold the blocks of many batch times, it is what makes state and window operations possible.

  /**
   * Allocate all unallocated blocks to the given batch.
   * This event will get written to the write ahead log (if enabled).
   */
  def allocateBlocksToBatch(batchTime: Time): Unit = synchronized {
    if (lastAllocatedBatchTime == null || batchTime > lastAllocatedBatchTime) {
      // Data can arrive from several input sources at the same time;
      // collect the unallocated blocks of every stream.
      val streamIdToBlocks = streamIds.map { streamId =>
          (streamId, getReceivedBlockQueue(streamId).dequeueAll(x => true))
      }.toMap
      // Group the received blocks for this batch
      val allocatedBlocks = AllocatedBlocks(streamIdToBlocks)
      // Write to the log first
      if (writeToLog(BatchAllocationEvent(batchTime, allocatedBlocks))) {
        timeToAllocatedBlocks.put(batchTime, allocatedBlocks)
        lastAllocatedBatchTime = batchTime
      } else {
        logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
      }
    } else {
      // This situation occurs when:
      // 1. WAL is ended with BatchAllocationEvent, but without BatchCleanupEvent,
      // possibly processed batch job or half-processed batch job need to be processed again,
      // so the batchTime will be equal to lastAllocatedBatchTime.
      // 2. Slow checkpointing makes recovered batch time older than WAL recovered
      // lastAllocatedBatchTime.
      // This situation will only occurs in recovery time.
      logInfo(s"Possibly processed batch $batchTime need to be processed again in WAL recovery")
    }
  }
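
The AllocatedBlocks structure used above is, roughly, the following small case class (again a paraphrase of the Spark 1.x source rather than a verbatim copy; it reuses the ReceivedBlockInfo shape sketched earlier):

// Approximate shape of AllocatedBlocks: per input stream, the blocks assigned to one batch time.
case class AllocatedBlocks(streamIdToAllocatedBlocks: Map[Int, Seq[ReceivedBlockInfo]]) {
  def getBlocksOfStream(streamId: Int): Seq[ReceivedBlockInfo] =
    streamIdToAllocatedBlocks.getOrElse(streamId, Seq.empty)
}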

Note that the batchTime parameter of allocateBlocksToBatch(batchTime: Time) is passed in by JobGenerator, which calls jobScheduler.receiverTracker.allocateBlocksToBatch(time) inside its generateJobs(time: Time) method. That time originally comes from the recurring timer:
RecurringTimer(clock, ssc.graph.batchDuration.milliseconds, longTime => eventLoop.post(GenerateJobs(new Time(longTime))), "JobGenerator").
Timers play a very important role in Spark Streaming.
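
RecurringTimer itself is a private Spark Streaming utility, but the pattern it implements is simple: every batchDuration milliseconds it invokes a callback that posts GenerateJobs(new Time(...)) into JobGenerator's event loop. The sketch below re-implements that pattern for illustration only; it is not the actual Spark RecurringTimer:

// Illustrative re-implementation of the recurring-timer pattern used by JobGenerator.
import java.util.concurrent.{Executors, TimeUnit}

class SimpleRecurringTimer(periodMs: Long, callback: Long => Unit) {
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit = {
    // Fire the callback once per period, passing the wall-clock time of the tick,
    // analogous to Spark posting GenerateJobs(new Time(longTime)) on every tick.
    scheduler.scheduleAtFixedRate(
      new Runnable { def run(): Unit = callback(System.currentTimeMillis()) },
      periodMs, periodMs, TimeUnit.MILLISECONDS)
  }

  def stop(): Unit = scheduler.shutdown()
}

// Usage sketch: post a "generate jobs" message every 10 seconds.
// val timer = new SimpleRecurringTimer(10000L, t => eventLoop.post(GenerateJobs(new Time(t))))
// timer.start()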

We saw above that ReceivedBlockTracker writes metadata to the WAL when data arrives; when old data is destroyed, ReceivedBlockTracker likewise writes the cleanup event to the WAL:

  /**
   * Clean up block information of old batches. If waitForCompletion is true, this method
   * returns only after the files are cleaned up.
   */
  def cleanupOldBatches(cleanupThreshTime: Time, waitForCompletion: Boolean): Unit = synchronized {
    require(cleanupThreshTime.milliseconds < clock.getTimeMillis())
    val timesToCleanup = timeToAllocatedBlocks.keys.filter { _ < cleanupThreshTime }.toSeq
    logInfo("Deleting batches " + timesToCleanup)
    if (writeToLog(BatchCleanupEvent(timesToCleanup))) {
      timeToAllocatedBlocks --= timesToCleanup
      writeAheadLogOption.foreach(_.clean(cleanupThreshTime.milliseconds, waitForCompletion))
    } else {
      logWarning("Failed to acknowledge batch clean up in the Write Ahead Log.")
    }
  }

The writeToLog method that actually writes records to the log is shown below:

  /** Write an update to the tracker to the write ahead log */
  private def writeToLog(record: ReceivedBlockTrackerLogEvent): Boolean = {
    if (isWriteAheadLogEnabled) {
      logTrace(s"Writing record: $record")
      try {
        writeAheadLogOption.get.write(ByteBuffer.wrap(Utils.serialize(record)),
          clock.getTimeMillis())
        true
      } catch {
        case NonFatal(e) =>
          logWarning(s"Exception thrown while writing record: $record to the WriteAheadLog.", e)
          false
      }
    } else {
      true
    }
  }
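
The other half of the WAL story is recovery. When ReceivedBlockTracker is constructed with recoverFromWriteAheadLog = true and a checkpoint directory, it reads the logged events back and replays them into the in-memory structures. The following is a condensed paraphrase of that replay logic; the method and member names follow the Spark 1.x source but are reproduced from memory, so treat them as approximate:

  // Condensed paraphrase of the ReceivedBlockTracker recovery path (not verbatim Spark code).
  private def recoverPastEvents(): Unit = synchronized {
    writeAheadLogOption.foreach { wal =>
      // readAll() returns every record persisted by writeToLog, in write order.
      wal.readAll().asScala.foreach { byteBuffer =>
        Utils.deserialize[ReceivedBlockTrackerLogEvent](byteBuffer.array()) match {
          case BlockAdditionEvent(blockInfo) =>
            // Re-enqueue blocks that were received but not yet allocated to a batch.
            getReceivedBlockQueue(blockInfo.streamId) += blockInfo
          case BatchAllocationEvent(time, allocatedBlocks) =>
            // Rebuild the batch-time -> blocks mapping and advance the allocation watermark.
            streamIdToUnallocatedBlockQueues.values.foreach(_.clear())
            timeToAllocatedBlocks.put(time, allocatedBlocks)
            lastAllocatedBatchTime = time
          case BatchCleanupEvent(batchTimes) =>
            // Drop allocations that had already been cleaned up before the failure.
            timeToAllocatedBlocks --= batchTimes
        }
      }
    }
  }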

2. DStream and JobGenerator Fault Tolerance
The above is the data level; there, writing to the WAL is on by default (as long as a checkpoint directory has been provided). Now we turn to the business-logic level and the job-generation level, which use checkpointing. Checkpointing is essentially the same idea as the WAL; only the mechanism and the way data is read and written differ.
Checkpoints are generally written to HDFS, and the checkpoint interval is the batch duration (whereas at the data level a WAL write happens whenever data arrives or is destroyed). The current state is checkpointed both when a job is generated and when it completes.
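
On the application side, driver recovery from these checkpoints is usually wired up through StreamingContext.getOrCreate: if the checkpoint directory already holds a Checkpoint, the context (DStreamGraph, pending batch times, and so on) is rebuilt from it; otherwise a fresh context is created. A minimal sketch, with a placeholder path and the stream definition elided:

// Standard recovery pattern for a fault-tolerant driver.
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode:8020/checkpoint/driver-demo"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("DriverFaultToleranceDemo")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)
  // ... define DStreams and output operations here ...
  ssc
}

// If a checkpoint exists, the driver state is restored from it;
// otherwise createContext() builds a brand-new StreamingContext.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()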

After JobGenerator produces the jobs in its generateJobs method, it posts a DoCheckpoint(time, clearCheckpointDataLater = false) message:
  /** Generate jobs and perform checkpoint for the given time. */
  private def generateJobs(time: Time) {
    // Set the SparkEnv in this thread, so that job generation code can access the environment
    // Example: BlockRDDs are created in this thread, and it needs to access BlockManager
    // Update: This is probably redundant after threadlocal stuff in SparkEnv has been removed.
    SparkEnv.set(ssc.env)
    Try {
      jobScheduler.receiverTracker.allocateBlocksToBatch(time) // allocate received blocks to batch
      graph.generateJobs(time) // generate jobs using allocated block
    } match {
      case Success(jobs) =>
        val streamIdToInputInfos = jobScheduler.inputInfoTracker.getInfo(time)
        jobScheduler.submitJobSet(JobSet(time, jobs, streamIdToInputInfos))
      case Failure(e) =>
        jobScheduler.reportError("Error generating jobs for time " + time, e)
    }
    eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = false))
  }

When JobGenerator clears DStream metadata, it also posts DoCheckpoint(time, clearCheckpointDataLater = true):

  /** Clear DStream metadata for the given `time`. */
  private def clearMetadata(time: Time) {
    ssc.graph.clearMetadata(time)

    // If checkpointing is enabled, then checkpoint,
    // else mark batch to be fully processed
    if (shouldCheckpoint) {
      eventLoop.post(DoCheckpoint(time, clearCheckpointDataLater = true))
    } else {
      // If checkpointing is not enabled, then delete metadata information about
      // received blocks (block data not saved in any case). Otherwise, wait for
      // checkpointing of this batch to complete.
      val maxRememberDuration = graph.getMaxInputStreamRememberDuration()
      jobScheduler.receiverTracker.cleanupOldBlocksAndBatches(time - maxRememberDuration)
      jobScheduler.inputInfoTracker.cleanup(time - maxRememberDuration)
      markBatchFullyProcessed(time)
    }
  }
When JobGenerator receives the DoCheckpoint message, it calls its own doCheckpoint method:

  /** Processes all events */
  private def processEvent(event: JobGeneratorEvent) {
    logDebug("Got event " + event)
    event match {
      case GenerateJobs(time) => generateJobs(time)
      case ClearMetadata(time) => clearMetadata(time)
      case DoCheckpoint(time, clearCheckpointDataLater) =>
        doCheckpoint(time, clearCheckpointDataLater)
      case ClearCheckpointData(time) => clearCheckpointData(time)
    }
  }

The doCheckpoint method of JobGenerator(jobScheduler: JobScheduler):

  /** Perform checkpoint for the given `time`. */
  private def doCheckpoint(time: Time, clearCheckpointDataLater: Boolean) {
    if (shouldCheckpoint && (time - graph.zeroTime).isMultipleOf(ssc.checkpointDuration)) {
      logInfo("Checkpointing graph for time " + time)
      ssc.graph.updateCheckpointData(time)
      checkpointWriter.write(new Checkpoint(ssc, time), clearCheckpointDataLater)
    }
  }
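
The isMultipleOf guard means a DoCheckpoint message only produces an actual checkpoint file when the batch time lines up with ssc.checkpointDuration. A small illustration with hypothetical numbers, assuming a 10-second batch interval and a 30-second checkpoint interval:

// Hypothetical numbers to illustrate the guard in doCheckpoint above.
import org.apache.spark.streaming.{Duration, Time}

val zeroTime = Time(0L)
val checkpointDuration = Duration(30000L)   // 30s checkpoint interval (illustrative)
val batchTimes = Seq(Time(10000L), Time(20000L), Time(30000L), Time(40000L))

batchTimes.foreach { t =>
  val willCheckpoint = (t - zeroTime).isMultipleOf(checkpointDuration)
  println(s"$t -> checkpoint? $willCheckpoint")   // true only for the 30000 ms batch
}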

As you can see, the real checkpoint work is delegated to DStreamGraph.
In DStreamGraph:
  def updateCheckpointData(time: Time) {
    logInfo("Updating checkpoint data for time " + time)
    this.synchronized {
      outputStreams.foreach(_.updateCheckpointData(time))
    }
    logInfo("Updated checkpoint data for time " + time)
  }

  def clearCheckpointData(time: Time) {
    logInfo("Clearing checkpoint data for time " + time)
    this.synchronized {
      outputStreams.foreach(_.clearCheckpointData(time))
    }
    logInfo("Cleared checkpoint data for time " + time)
  }

DStreamGraph then routes the call to each DStream:
/**
* Refresh the list of checkpointed RDDs that will be saved along with checkpoint of
* this stream. This is an internal method that should not be called directly. This is
* a default implementation that saves only the file names of the checkpointed RDDs to
* checkpointData. Subclasses of DStream (especially those of InputDStream) may override
* this method to save custom checkpoint data.
*/
  private[streaming] def updateCheckpointData(currentTime: Time) {
    logDebug("Updating checkpoint data for time " + currentTime)
    checkpointData.update(currentTime)
    dependencies.foreach(_.updateCheckpointData(currentTime))
    logDebug("Updated checkpoint data for time " + currentTime + ": " + checkpointData)
  }

  private[streaming] def clearCheckpointData(time: Time) {
    logDebug("Clearing checkpoint data")
    checkpointData.cleanup(time)
    dependencies.foreach(_.clearCheckpointData(time))
    logDebug("Cleared checkpoint data")
  }

How is shouldCheckpoint in JobGenerator determined? It depends on what was configured when the StreamingContext was constructed, namely whether a checkpoint duration and a checkpoint directory (and hence a Checkpoint object) exist:

  // This is marked lazy so that this is initialized after checkpoint duration has been set
  // in the context and the generator has been started.
  private lazy val shouldCheckpoint = ssc.checkpointDuration != null && ssc.checkpointDir != null

The Checkpoint class:

private[streaming]
class Checkpoint(ssc: StreamingContext, val checkpointTime: Time)
  extends Logging with Serializable {
  val master = ssc.sc.master
  val framework = ssc.sc.appName
  val jars = ssc.sc.jars
  val graph = ssc.graph
  val checkpointDir = ssc.checkpointDir
  val checkpointDuration = ssc.checkpointDuration
  val pendingTimes = ssc.scheduler.getPendingTimes().toArray
  val delaySeconds = MetadataCleaner.getDelaySeconds(ssc.conf)
  val sparkConfPairs = ssc.conf.getAll

Looking at the StreamingContext constructors, a Checkpoint object is built, and checkpoint operations can actually run, only when a checkpoint path is supplied while constructing the StreamingContext.

  /**
   * Recreate a StreamingContext from a checkpoint file.
   * @param path Path to the directory that was specified as the checkpoint directory
   * @param hadoopConf Optional, configuration object if necessary for reading from
   *                   HDFS compatible filesystems
   */
  def this(path: String, hadoopConf: Configuration) =
    this(null, CheckpointReader.read(path, new SparkConf(), hadoopConf).get, null)

  /**
   * Recreate a StreamingContext from a checkpoint file using an existing SparkContext.
   * @param path Path to the directory that was specified as the checkpoint directory
   * @param sparkContext Existing SparkContext
   */
  def this(path: String, sparkContext: SparkContext) = {
    this(
      sparkContext,
      CheckpointReader.read(path, sparkContext.conf, sparkContext.hadoopConfiguration).get,
      null)
  }
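
A short usage sketch of the second constructor above (the path is a placeholder; CheckpointReader.read(...).get throws if no valid checkpoint exists at that path, which is one reason StreamingContext.getOrCreate is usually preferred in practice):

// Recreate the streaming context directly from an existing checkpoint.
import org.apache.spark.SparkContext
import org.apache.spark.streaming.StreamingContext

val sc: SparkContext = SparkContext.getOrCreate()
val restored = new StreamingContext("hdfs://namenode:8020/checkpoint/driver-demo", sc)
restored.start()
restored.awaitTermination()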

This write-up is based on teacher Wang Jialin's course "Spark Source Code Customization Release Class"; many thanks to Wang Jialin!
Everyone is welcome to exchange technical knowledge, learn together, and make progress together!
