Lesson 12: Spark Streaming Source Code Walkthrough: Executor Fault Tolerance


This lesson focuses on executor fault tolerance; driver fault tolerance will be covered in the next lesson.
Executor fault tolerance is mainly about the safety of the data the executor receives; fault tolerance of the computation itself can rely entirely on the fault tolerance of the underlying RDDs. Data safety is critical for Spark Streaming for two reasons:
First, Spark Streaming continuously receives data, continuously generates jobs, and continuously submits jobs;
Second, because it is built on Spark Core, as long as the received data is safe and reliable, failures at runtime can be recovered from automatically through RDD fault tolerance.

The simplest fault-tolerance mechanism is keeping replicas; the other is replaying from the data source (that is, re-reading data from the source for a given time span). Replication itself comes in two forms: replication through the BlockManager, configured via the storage level, so that data received on an executor is automatically backed up by the BlockManager mechanism (this is the default); and backup through a write-ahead log (WAL).

1. Backup based on the BlockManager:
The default storage level is MEMORY_AND_DISK_SER_2. With it, the received data is stored in the memory (and on the disk) of the executor hosting the receiver, and an additional copy is stored in the memory (and on the disk) of another executor. Take socketTextStream() as an example:

  /**
   * Create a input stream from TCP source hostname:port. Data is received using
   * a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
   * lines.
   * @param hostname      Hostname to connect to for receiving data
   * @param port          Port to connect to for receiving data
   * @param storageLevel  Storage level to use for storing the received objects
   *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
   */
  def socketTextStream(
      hostname: String,
      port: Int,
      storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
    ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
    socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
  }

The storageLevel used when the Receiver is created comes from this parameter.
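To make the replication choice concrete, here is a minimal usage sketch (host, port and application name are illustrative): the default call keeps two copies via the BlockManager, while an explicit storage level with replication 1 turns the second copy off.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("ReceiverReplicationDemo")
val ssc = new StreamingContext(conf, Seconds(5))

// Default: MEMORY_AND_DISK_SER_2 keeps one copy on the receiver's executor and a
// second copy on another executor, courtesy of the BlockManager.
val replicated = ssc.socketTextStream("localhost", 9999)

// If a second in-cluster copy is not wanted (for example because a WAL is used instead),
// pass a storage level with replication = 1 explicitly.
val singleCopy = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)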

private[streaming]
class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) with Logging {

In ReceiverSupervisorImpl, when the WAL is not enabled, a BlockManagerBasedBlockHandler is instantiated:

  private val receivedBlockHandler: ReceivedBlockHandler = {
    if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
      if (checkpointDirOption.isEmpty) {
        throw new SparkException(
          "Cannot enable receiver write-ahead log without checkpoint directory set. " +
            "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
            "See documentation for more details.")
      }
      new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
        receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
    } else {
      new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
    }
  }

BlockManagerBasedBlockHandler ultimately stores the data through the BlockManager:

/**
 * Implementation of a [[org.apache.spark.streaming.receiver.ReceivedBlockHandler]] which
 * stores the received blocks into a block manager with the specified storage level.
 */
private[streaming] class BlockManagerBasedBlockHandler(
    blockManager: BlockManager, storageLevel: StorageLevel)
  extends ReceivedBlockHandler with Logging {

  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

    var numRecords = None: Option[Long]

    val putResult: Seq[(BlockId, BlockStatus)] = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        numRecords = Some(arrayBuffer.size.toLong)
        blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel,
          tellMaster = true)
      case IteratorBlock(iterator) =>
        val countIterator = new CountingIterator(iterator)
        val putResult = blockManager.putIterator(blockId, countIterator, storageLevel,
          tellMaster = true)
        numRecords = countIterator.count
        putResult
      case ByteBufferBlock(byteBuffer) =>
        blockManager.putBytes(blockId, byteBuffer, storageLevel, tellMaster = true)
      case o =>
        throw new SparkException(
          s"Could not store $blockId to block manager, unexpected block type ${o.getClass.getName}")
    }
    if (!putResult.map { _._1 }.contains(blockId)) {
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
    BlockManagerBasedStoreResult(blockId, numRecords)
  }

  def cleanupOldBlocks(threshTime: Long) {
    // this is not used as blocks inserted into the BlockManager are cleared by DStream's clearing
    // of BlockRDDs.
  }
}

2. Backup based on the write-ahead log (WAL):
In ReceiverSupervisorImpl, when the WAL is enabled, a WriteAheadLogBasedBlockHandler is instantiated (see the handler-selection code above).
Note that the WAL mechanism writes into the checkpoint directory, which in production is usually on HDFS. HDFS keeps three replicas by default, so the data is safe, but writing it is relatively slow and can affect performance; for jobs with strict latency requirements this approach is not recommended.
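A minimal configuration sketch for the WAL path (the flag and checkpoint call below are the standard way to switch it on; the HDFS path is illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// The flag must be set before the StreamingContext is created; once it is on,
// a checkpoint directory on a reliable file system becomes mandatory.
val conf = new SparkConf()
  .setAppName("WalBackedReceiver")
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("hdfs:///checkpoints/wal-demo")  // illustrative path

val lines = ssc.socketTextStream("localhost", 9999)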

/**
 * Implementation of a [[org.apache.spark.streaming.receiver.ReceivedBlockHandler]] which
 * stores the received blocks in both, a write ahead log and a block manager.
 */
private[streaming] class WriteAheadLogBasedBlockHandler(
    blockManager: BlockManager,
    streamId: Int,
    storageLevel: StorageLevel,
    conf: SparkConf,
    hadoopConf: Configuration,
    checkpointDir: String,
    clock: Clock = new SystemClock
  ) extends ReceivedBlockHandler with Logging {

  private val blockStoreTimeout = conf.getInt(
    "spark.streaming.receiver.blockStoreTimeout", 30).seconds

  private val effectiveStorageLevel = {
    if (storageLevel.deserialized) {
      logWarning(s"Storage level serialization ${storageLevel.deserialized} is not supported when" +
        s" write ahead log is enabled, change to serialization false")
    }
    // With the WAL in place there is no need for a replicated storage level: the checkpoint
    // lives on HDFS, which already keeps 3 replicas by default.
    if (storageLevel.replication > 1) {
      logWarning(s"Storage level replication ${storageLevel.replication} is unnecessary when " +
        s"write ahead log is enabled, change to replication 1")
    }

    StorageLevel(storageLevel.useDisk, storageLevel.useMemory, storageLevel.useOffHeap, false, 1)
  }

  if (storageLevel != effectiveStorageLevel) {
    logWarning(s"User defined storage level $storageLevel is changed to effective storage level " +
      s"$effectiveStorageLevel when write ahead log is enabled")
  }

  // The write-ahead log itself
  // Write ahead log manages
  private val writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
    conf, checkpointDirToLogDir(checkpointDir, streamId), hadoopConf)

  // For processing futures used in parallel block storing into block manager and write ahead log
  // # threads = 2, so that both writing to BM and WAL can proceed in parallel
  implicit private val executionContext = ExecutionContext.fromExecutorService(
    ThreadUtils.newDaemonFixedThreadPool(2, this.getClass.getSimpleName))

  // Write to the BlockManager and to the WAL in parallel
  /**
   * This implementation stores the block into the block manager as well as a write ahead log.
   * It does this in parallel, using Scala Futures, and returns only after the block has
   * been stored in both places.
   */
  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

    var numRecords = None: Option[Long]
    // Serialize the block so that it can be inserted into both
    val serializedBlock = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        numRecords = Some(arrayBuffer.size.toLong)
        blockManager.dataSerialize(blockId, arrayBuffer.iterator)
      case IteratorBlock(iterator) =>
        val countIterator = new CountingIterator(iterator)
        val serializedBlock = blockManager.dataSerialize(blockId, countIterator)
        numRecords = countIterator.count
        serializedBlock
      case ByteBufferBlock(byteBuffer) =>
        byteBuffer
      case _ =>
        throw new Exception(s"Could not push $blockId to block manager, unexpected block type")
    }

    // Store the block in block manager
    val storeInBlockManagerFuture = Future {
      val putResult =
        blockManager.putBytes(blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
      if (!putResult.map { _._1 }.contains(blockId)) {
        throw new SparkException(
          s"Could not store $blockId to block manager with storage level $storageLevel")
      }
    }

    // Store the block in write ahead log
    val storeInWriteAheadLogFuture = Future {
      writeAheadLog.write(serializedBlock, clock.getTimeMillis())
    }

    // Combine the futures, wait for both to complete, and return the write ahead log record handle
    val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
    val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
    WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
  }

  def cleanupOldBlocks(threshTime: Long) {
    writeAheadLog.clean(threshTime, false)
  }

  def stop() {
    writeAheadLog.close()
    executionContext.shutdown()
  }
}

private[streaming] object WriteAheadLogBasedBlockHandler {
  def checkpointDirToLogDir(checkpointDir: String, streamId: Int): String = {
    new Path(checkpointDir, new Path("receivedData", streamId.toString)).toString
  }
}

Now look at the WriteAheadLog abstract class. A WAL writes data sequentially and reads it either sequentially or randomly; records are never updated or deleted, and a read only needs a cursor or pointer (the record handle), so it is quite fast:

/**
 * :: DeveloperApi ::
 *
 * This abstract class represents a write ahead log (aka journal) that is used by Spark Streaming
 * to save the received data (by receivers) and associated metadata to a reliable storage, so that
 * they can be recovered after driver failures. See the Spark documentation for more information
 * on how to plug in your own custom implementation of a write ahead log.
 */
@org.apache.spark.annotation.DeveloperApi
public abstract class WriteAheadLog {
  /**
   * Write the record to the log and return a record handle, which contains all the information
   * necessary to read back the written record. The time is used to the index the record,
   * such that it can be cleaned later. Note that implementations of this abstract class must
   * ensure that the written data is durable and readable (using the record handle) by the
   * time this function returns.
   */
  abstract public WriteAheadLogRecordHandle write(ByteBuffer record, long time);

  /**
   * Read a written record based on the given record handle.
   */
  abstract public ByteBuffer read(WriteAheadLogRecordHandle handle);

  /**
   * Read and return an iterator of all the records that have been written but not yet cleaned up.
   */
  abstract public Iterator<ByteBuffer> readAll();

  /**
   * Clean all the records that are older than the threshold time. It can wait for
   * the completion of the deletion.
   */
  abstract public void clean(long threshTime, boolean waitForCompletion);

  /**
   * Close this log and release any resources.
   */
  abstract public void close();
}
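As a toy illustration of this contract (not Spark's implementation), an append-only log only needs to remember where each record starts and how long it is; that pair plays the role of the record handle:

import java.io.{File, FileOutputStream, RandomAccessFile}

// Hypothetical minimal append-only log: write() appends sequentially and returns an
// (offset, length) handle; read() seeks straight to that position.
class ToyWriteAheadLog(path: String) {
  private val out = new FileOutputStream(path, true)
  private var position = new File(path).length()

  def write(record: Array[Byte]): (Long, Int) = {
    val handle = (position, record.length)
    out.write(record)          // strictly sequential append
    out.flush()
    position += record.length
    handle
  }

  def read(handle: (Long, Int)): Array[Byte] = {
    val in = new RandomAccessFile(path, "r")
    try {
      in.seek(handle._1)
      val buf = new Array[Byte](handle._2)
      in.readFully(buf)
      buf
    } finally {
      in.close()
    }
  }
}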

The handle that is returned is a WriteAheadLogRecordHandle; a concrete implementation of it is FileBasedWriteAheadLogSegment:

/**
 * :: DeveloperApi ::
 *
 * This abstract class represents a handle that refers to a record written in a
 * {@link org.apache.spark.streaming.util.WriteAheadLog WriteAheadLog}.
 * It must contain all the information necessary for the record to be read and returned by
 * an implemenation of the WriteAheadLog class.
 *
 * @see org.apache.spark.streaming.util.WriteAheadLog
 */
@org.apache.spark.annotation.DeveloperApi
public abstract class WriteAheadLogRecordHandle implements java.io.Serializable {}

/** Class for representing a segment of data in a write ahead log file */
private[streaming] case class FileBasedWriteAheadLogSegment(path: String, offset: Long, length: Int)
  extends WriteAheadLogRecordHandle

Now look at FileBasedWriteAheadLog, the concrete implementation of the WriteAheadLog abstract class. Note that although the comments mention HDFS, any file system supported by Hadoop can be used:

/**
 * This class manages write ahead log files.
 *
 *  - Writes records (bytebuffers) to periodically rotating log files.
 *  - Recovers the log files and the reads the recovered records upon failures.
 *  - Cleans up old log files.
 *
 * Uses [[org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter]] to write
 * and [[org.apache.spark.streaming.util.FileBasedWriteAheadLogReader]] to read.
 *
 * @param logDirectory Directory when rotating log files will be created.
 * @param hadoopConf Hadoop configuration for reading/writing log files.
 */
private[streaming] class FileBasedWriteAheadLog(
    conf: SparkConf,
    logDirectory: String,
    hadoopConf: Configuration,
    rollingIntervalSecs: Int,
    maxFailures: Int,
    closeFileAfterWrite: Boolean
  ) extends WriteAheadLog with Logging {

  import FileBasedWriteAheadLog._

  private val pastLogs = new ArrayBuffer[LogInfo]
  private val callerNameTag = getCallerName.map(c => s" for $c").getOrElse("")

  private val threadpoolName = s"WriteAheadLogManager $callerNameTag"
  private val threadpool = ThreadUtils.newDaemonCachedThreadPool(threadpoolName, 20)
  private val executionContext = ExecutionContext.fromExecutorService(threadpool)
  override protected val logName = s"WriteAheadLogManager $callerNameTag"

  private var currentLogPath: Option[String] = None
  private var currentLogWriter: FileBasedWriteAheadLogWriter = null
  private var currentLogWriterStartTime: Long = -1L
  private var currentLogWriterStopTime: Long = -1L

  initializeOrRecover()

  /**
   * Write a byte buffer to the log file. This method synchronously writes the data in the
   * ByteBuffer to HDFS. When this method returns, the data is guaranteed to have been flushed
   * to HDFS, and will be available for readers to read.
   */
  def write(byteBuffer: ByteBuffer, time: Long): FileBasedWriteAheadLogSegment = synchronized {
    var fileSegment: FileBasedWriteAheadLogSegment = null
    var failures = 0
    var lastException: Exception = null
    var succeeded = false
    while (!succeeded && failures < maxFailures) {
      try {
        fileSegment = getLogWriter(time).write(byteBuffer)
        if (closeFileAfterWrite) {
          resetWriter()
        }
        succeeded = true
      } catch {
        case ex: Exception =>
          lastException = ex
          logWarning("Failed to write to write ahead log")
          resetWriter()
          failures += 1
      }
    }
    if (fileSegment == null) {
      logError(s"Failed to write to write ahead log after $failures failures")
      throw lastException
    }
    fileSegment
  }

  def read(segment: WriteAheadLogRecordHandle): ByteBuffer = {
    val fileSegment = segment.asInstanceOf[FileBasedWriteAheadLogSegment]
    var reader: FileBasedWriteAheadLogRandomReader = null
    var byteBuffer: ByteBuffer = null
    try {
      reader = new FileBasedWriteAheadLogRandomReader(fileSegment.path, hadoopConf)
      byteBuffer = reader.read(fileSegment)
    } finally {
      reader.close()
    }
    byteBuffer
  }

  /**
   * Read all the existing logs from the log directory.
   *
   * Note that this is typically called when the caller is initializing and wants
   * to recover past state from the write ahead logs (that is, before making any writes).
   * If this is called after writes have been made using this manager, then it may not return
   * the latest the records. This does not deal with currently active log files, and
   * hence the implementation is kept simple.
   */
  def readAll(): JIterator[ByteBuffer] = synchronized {
    val logFilesToRead = pastLogs.map{ _.path} ++ currentLogPath
    logInfo("Reading from the logs:\n" + logFilesToRead.mkString("\n"))
    def readFile(file: String): Iterator[ByteBuffer] = {
      logDebug(s"Creating log reader with $file")
      val reader = new FileBasedWriteAheadLogReader(file, hadoopConf)
      CompletionIterator[ByteBuffer, Iterator[ByteBuffer]](reader, reader.close _)
    }
    if (!closeFileAfterWrite) {
      logFilesToRead.iterator.map(readFile).flatten.asJava
    } else {
      // For performance gains, it makes sense to parallelize the recovery if
      // closeFileAfterWrite = true
      seqToParIterator(threadpool, logFilesToRead, readFile).asJava
    }
  }

  /**
   * Delete the log files that are older than the threshold time.
   *
   * Its important to note that the threshold time is based on the time stamps used in the log
   * files, which is usually based on the local system time. So if there is coordination necessary
   * between the node calculating the threshTime (say, driver node), and the local system time
   * (say, worker node), the caller has to take account of possible time skew.
   *
   * If waitForCompletion is set to true, this method will return only after old logs have been
   * deleted. This should be set to true only for testing. Else the files will be deleted
   * asynchronously.
   */
  def clean(threshTime: Long, waitForCompletion: Boolean): Unit = {
    val oldLogFiles = synchronized {
      val expiredLogs = pastLogs.filter { _.endTime < threshTime }
      pastLogs --= expiredLogs
      expiredLogs
    }
    logInfo(s"Attempting to clear ${oldLogFiles.size} old log files in $logDirectory " +
      s"older than $threshTime: ${oldLogFiles.map { _.path }.mkString("\n")}")

    def deleteFile(walInfo: LogInfo): Unit = {
      try {
        val path = new Path(walInfo.path)
        val fs = HdfsUtils.getFileSystemForPath(path, hadoopConf)
        fs.delete(path, true)
        logDebug(s"Cleared log file $walInfo")
      } catch {
        case ex: Exception =>
          logWarning(s"Error clearing write ahead log file $walInfo", ex)
      }
      logInfo(s"Cleared log files in $logDirectory older than $threshTime")
    }
    oldLogFiles.foreach { logInfo =>
      if (!executionContext.isShutdown) {
        try {
          val f = Future { deleteFile(logInfo) }(executionContext)
          if (waitForCompletion) {
            import scala.concurrent.duration._
            Await.ready(f, 1 second)
          }
        } catch {
          case e: RejectedExecutionException =>
            logWarning("Execution context shutdown before deleting old WriteAheadLogs. " +
              "This would not affect recovery correctness.", e)
        }
      }
    }
  }

  /** Stop the manager, close any open log writer */
  def close(): Unit = synchronized {
    if (currentLogWriter != null) {
      currentLogWriter.close()
    }
    executionContext.shutdown()
    logInfo("Stopped write ahead log manager")
  }

  /** Get the current log writer while taking care of rotation */
  private def getLogWriter(currentTime: Long): FileBasedWriteAheadLogWriter = synchronized {
    if (currentLogWriter == null || currentTime > currentLogWriterStopTime) {
      resetWriter()
      currentLogPath.foreach {
        pastLogs += LogInfo(currentLogWriterStartTime, currentLogWriterStopTime, _)
      }
      currentLogWriterStartTime = currentTime
      currentLogWriterStopTime = currentTime + (rollingIntervalSecs * 1000)
      val newLogPath = new Path(logDirectory,
        timeToLogFile(currentLogWriterStartTime, currentLogWriterStopTime))
      currentLogPath = Some(newLogPath.toString)
      currentLogWriter = new FileBasedWriteAheadLogWriter(currentLogPath.get, hadoopConf)
    }
    currentLogWriter
  }

  /** Initialize the log directory or recover existing logs inside the directory */
  private def initializeOrRecover(): Unit = synchronized {
    val logDirectoryPath = new Path(logDirectory)
    val fileSystem = HdfsUtils.getFileSystemForPath(logDirectoryPath, hadoopConf)

    if (fileSystem.exists(logDirectoryPath) && fileSystem.getFileStatus(logDirectoryPath).isDir) {
      val logFileInfo = logFilesTologInfo(fileSystem.listStatus(logDirectoryPath).map { _.getPath })
      pastLogs.clear()
      pastLogs ++= logFileInfo
      logInfo(s"Recovered ${logFileInfo.size} write ahead log files from $logDirectory")
      logDebug(s"Recovered files are:\n${logFileInfo.map(_.path).mkString("\n")}")
    }
  }

  private def resetWriter(): Unit = synchronized {
    if (currentLogWriter != null) {
      currentLogWriter.close()
      currentLogWriter = null
    }
  }
}

private[streaming] object FileBasedWriteAheadLog {

  case class LogInfo(startTime: Long, endTime: Long, path: String)

  val logFileRegex = """log-(\d+)-(\d+)""".r

  def timeToLogFile(startTime: Long, stopTime: Long): String = {
    s"log-$startTime-$stopTime"
  }

  def getCallerName(): Option[String] = {
    val stackTraceClasses = Thread.currentThread.getStackTrace().map(_.getClassName)
    stackTraceClasses.find(!_.contains("WriteAheadLog")).flatMap(_.split("\\.").lastOption)
  }

  /** Convert a sequence of files to a sequence of sorted LogInfo objects */
  def logFilesTologInfo(files: Seq[Path]): Seq[LogInfo] = {
    files.flatMap { file =>
      logFileRegex.findFirstIn(file.getName()) match {
        case Some(logFileRegex(startTimeStr, stopTimeStr)) =>
          val startTime = startTimeStr.toLong
          val stopTime = stopTimeStr.toLong
          Some(LogInfo(startTime, stopTime, file.toString))
        case None =>
          None
      }
    }.sortBy { _.startTime }
  }

  /**
   * This creates an iterator from a parallel collection, by keeping at most `n` objects in memory
   * at any given time, where `n` is the size of the thread pool. This is crucial for use cases
   * where we create `FileBasedWriteAheadLogReader`s during parallel recovery. We don't want to
   * open up `k` streams altogether where `k` is the size of the Seq that we want to parallelize.
   */
  def seqToParIterator[I, O](
      tpool: ThreadPoolExecutor,
      source: Seq[I],
      handler: I => Iterator[O]): Iterator[O] = {
    val taskSupport = new ThreadPoolTaskSupport(tpool)
    val groupSize = tpool.getMaximumPoolSize.max(8)
    source.grouped(groupSize).flatMap { group =>
      val parallelCollection = group.par
      parallelCollection.tasksupport = taskSupport
      parallelCollection.map(handler)
    }.flatten
  }
}

The actual reading and writing are done by FileBasedWriteAheadLogRandomReader and FileBasedWriteAheadLogReader.
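Conceptually, a random read against a log file is just a seek followed by a bounded read, since the segment handle already carries the path, offset and length. A simplified sketch using the Hadoop FileSystem API (not the actual reader classes) looks like this:

import java.nio.ByteBuffer
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Simplified illustration of a handle-based read: seek to the recorded offset and
// read exactly `length` bytes back.
def readSegment(path: String, offset: Long, length: Int, hadoopConf: Configuration): ByteBuffer = {
  val hdfsPath = new Path(path)
  val fs = hdfsPath.getFileSystem(hadoopConf)
  val in = fs.open(hdfsPath)
  try {
    in.seek(offset)
    val bytes = new Array[Byte](length)
    in.readFully(bytes)
    ByteBuffer.wrap(bytes)
  } finally {
    in.close()
  }
}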
3. Replay from the data source:
In this case no replica and no extra fault-tolerance mechanism are needed: Kafka itself acts like a file-storage system. Kafka can be consumed in two ways, receiver-based and direct. With the receiver-based approach the metadata, such as the consumed offsets, is kept in ZooKeeper; after a failure Kafka can re-read the data from the stored offset (no acknowledgement has been sent yet, so Kafka does not regard the data as consumed). This can lead to duplicate consumption: if the data has been processed but the acknowledgement that updates the offset in ZooKeeper has not yet been sent, the same data will be consumed again. For this reason production systems increasingly use the direct approach, in which the application manages the offsets itself and can therefore achieve exactly-once processing.
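A minimal sketch of the direct approach (Spark 1.x Kafka 0.8 API; the broker addresses, the topic name and the surrounding StreamingContext `ssc` are assumed): no receiver and no WAL are involved, each batch reads an explicit offset range straight from the brokers, and the application is free to manage offsets itself.

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")  // illustrative brokers
val topics = Set("events")                                                     // illustrative topic

// Offsets are not committed by this DStream; the application decides when and where
// to record them, which is what makes exactly-once processing possible.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)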

Let's look at the DirectKafkaInputDStream class:

/**
 *  A stream of {@link org.apache.spark.streaming.kafka.KafkaRDD} where
 * each given Kafka topic/partition corresponds to an RDD partition.
 * The spark configuration spark.streaming.kafka.maxRatePerPartition gives the maximum number
 *  of messages
 * per second that each '''partition''' will accept.
 * Starting offsets are specified in advance,
 * and this DStream is not responsible for committing offsets,
 * so that you can control exactly-once semantics.
 * For an easy interface to Kafka-managed offsets,
 *  see {@link org.apache.spark.streaming.kafka.KafkaCluster}
 * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
 * configuration parameters</a>.
 *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
 *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
 * @param fromOffsets per-topic/partition Kafka offsets defining the (inclusive)
 *  starting point of the stream
 * @param messageHandler function for translating each message into the desired type
 */
private[streaming]
class DirectKafkaInputDStream[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[K]: ClassTag,
  T <: Decoder[V]: ClassTag,
  R: ClassTag](
    ssc_ : StreamingContext,
    val kafkaParams: Map[String, String],
    val fromOffsets: Map[TopicAndPartition, Long],
    messageHandler: MessageAndMetadata[K, V] => R
  ) extends InputDStream[R](ssc_) with Logging {
  val maxRetries = context.sparkContext.getConf.getInt(
    "spark.streaming.kafka.maxRetries", 1)

  // Keep this consistent with how other streams are named (e.g. "Flume polling stream [2]")
  private[streaming] override def name: String = s"Kafka direct stream [$id]"

  protected[streaming] override val checkpointData =
    new DirectKafkaInputDStreamCheckpointData

  /**
   * Asynchronously maintains & sends new rate limits to the receiver through the receiver tracker.
   */
  override protected[streaming] val rateController: Option[RateController] = {
    if (RateController.isBackPressureEnabled(ssc.conf)) {
      Some(new DirectKafkaRateController(id,
        RateEstimator.create(ssc.conf, context.graph.batchDuration)))
    } else {
      None
    }
  }

  protected val kc = new KafkaCluster(kafkaParams)

  // maxRateLimitPerPartition is used for rate limiting
  private val maxRateLimitPerPartition: Int = context.sparkContext.getConf.getInt(
      "spark.streaming.kafka.maxRatePerPartition", 0)
  protected def maxMessagesPerPartition: Option[Long] = {
    val estimatedRateLimit = rateController.map(_.getLatestRate().toInt)
    val numPartitions = currentOffsets.keys.size

    val effectiveRateLimitPerPartition = estimatedRateLimit
      .filter(_ > 0)
      .map { limit =>
        if (maxRateLimitPerPartition > 0) {
          Math.min(maxRateLimitPerPartition, (limit / numPartitions))
        } else {
          limit / numPartitions
        }
      }.getOrElse(maxRateLimitPerPartition)

    if (effectiveRateLimitPerPartition > 0) {
      val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
      Some((secsPerBatch * effectiveRateLimitPerPartition).toLong)
    } else {
      None
    }
  }

  protected var currentOffsets = fromOffsets

  @tailrec
  protected final def latestLeaderOffsets(retries: Int): Map[TopicAndPartition, LeaderOffset] = {
    val o = kc.getLatestLeaderOffsets(currentOffsets.keySet)
    // Either.fold would confuse @tailrec, do it manually
    if (o.isLeft) {
      val err = o.left.get.toString
      if (retries <= 0) {
        throw new SparkException(err)
      } else {
        log.error(err)
        Thread.sleep(kc.config.refreshLeaderBackoffMs)
        latestLeaderOffsets(retries - 1)
      }
    } else {
      o.right.get
    }
  }

  // limits the maximum number of messages per partition
  protected def clamp(
    leaderOffsets: Map[TopicAndPartition, LeaderOffset]): Map[TopicAndPartition, LeaderOffset] = {
    maxMessagesPerPartition.map { mmp =>
      leaderOffsets.map { case (tp, lo) =>
        tp -> lo.copy(offset = Math.min(currentOffsets(tp) + mmp, lo.offset))
      }
    }.getOrElse(leaderOffsets)
  }

  override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
    val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
    // When DirectKafkaInputDStream computes a batch, it creates a KafkaRDD
    val rdd = KafkaRDD[K, V, U, T, R](
      context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)

    // Report the record number and metadata of this batch interval to InputInfoTracker.
    val offsetRanges = currentOffsets.map { case (tp, fo) =>
      val uo = untilOffsets(tp)
      OffsetRange(tp.topic, tp.partition, fo, uo.offset)
    }
    val description = offsetRanges.filter { offsetRange =>
      // Don't display empty ranges.
      offsetRange.fromOffset != offsetRange.untilOffset
    }.map { offsetRange =>
      s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
        s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
    }.mkString("\n")
    // Copy offsetRanges to immutable.List to prevent from being modified by the user
    val metadata = Map(
      "offsets" -> offsetRanges.toList,
      StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
    val inputInfo = StreamInputInfo(id, rdd.count, metadata)
    ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

    currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
    Some(rdd)
  }

  override def start(): Unit = {
  }

  def stop(): Unit = {
  }

  private[streaming]
  class DirectKafkaInputDStreamCheckpointData extends DStreamCheckpointData(this) {
    def batchForTime: mutable.HashMap[Time, Array[(String, Int, Long, Long)]] = {
      data.asInstanceOf[mutable.HashMap[Time, Array[OffsetRange.OffsetRangeTuple]]]
    }

    override def update(time: Time) {
      batchForTime.clear()
      generatedRDDs.foreach { kv =>
        val a = kv._2.asInstanceOf[KafkaRDD[K, V, U, T, R]].offsetRanges.map(_.toTuple).toArray
        batchForTime += kv._1 -> a
      }
    }

    override def cleanup(time: Time) { }

    override def restore() {
      // this is assuming that the topics don't change during execution, which is true currently
      val topics = fromOffsets.keySet
      val leaders = KafkaCluster.checkErrors(kc.findLeaders(topics))

      batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
         logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}")
         generatedRDDs += t -> new KafkaRDD[K, V, U, T, R](
           context.sparkContext, kafkaParams, b.map(OffsetRange(_)), leaders, messageHandler)
      }
    }
  }

  /**
   * A RateController to retrieve the rate from RateEstimator.
   */
  private[streaming] class DirectKafkaRateController(id: Int, estimator: RateEstimator)
    extends RateController(id, estimator) {
    override def publish(rate: Long): Unit = ()
  }
}

Every time a batch is generated, latestLeaderOffsets is called to obtain the latest offsets; subtracting the offsets already processed by the previous batch yields this batch's offset range, which fixes the data source of the RDD.
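Continuing the direct-stream sketch above, the offset range of each batch can be read back from the generated KafkaRDD through HasOffsetRanges and persisted together with the output (the offset store below is hypothetical):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directStream.foreachRDD { rdd =>
  // Each KafkaRDD knows exactly which offsets it covers for every topic/partition.
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { records =>
    // process the records of this partition ...
  }

  // After the output succeeds, record untilOffset per partition in your own store
  // (database, ZooKeeper, etc.) so the next run can resume from there.
  ranges.foreach { r =>
    // offsetStore.save(r.topic, r.partition, r.untilOffset)   // hypothetical sink
  }
}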
Now look at the KafkaRDD that compute generates:

/**
 * A batch-oriented interface for consuming from Kafka.
 * Starting and ending offsets are specified in advance,
 * so that you can control exactly-once semantics.
 * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
 * configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers" to be set
 * with Kafka broker(s) specified in host1:port1,host2:port2 form.
 * @param offsetRanges offset ranges that define the Kafka data belonging to this RDD
 * @param messageHandler function for translating each message into the desired type
 */
private[kafka]
class KafkaRDD[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[_]: ClassTag,
  T <: Decoder[_]: ClassTag,
  R: ClassTag] private[spark] (
    sc: SparkContext,
    kafkaParams: Map[String, String],
    val offsetRanges: Array[OffsetRange],
    leaders: Map[TopicAndPartition, (String, Int)],
    messageHandler: MessageAndMetadata[K, V] => R
  ) extends RDD[R](sc, Nil) with Logging with HasOffsetRanges {
  override def getPartitions: Array[Partition] = {
    offsetRanges.zipWithIndex.map { case (o, i) =>
        val (host, port) = leaders(TopicAndPartition(o.topic, o.partition))
        new KafkaRDDPartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port)
    }.toArray
  }

  override def count(): Long = offsetRanges.map(_.count).sum

  override def countApprox(
      timeout: Long,
      confidence: Double = 0.95
  ): PartialResult[BoundedDouble] = {
    val c = count
    new PartialResult(new BoundedDouble(c, 1.0, c, c), true)
  }

  override def isEmpty(): Boolean = count == 0L

  override def take(num: Int): Array[R] = {
    val nonEmptyPartitions = this.partitions
      .map(_.asInstanceOf[KafkaRDDPartition])
      .filter(_.count > 0)

    if (num < 1 || nonEmptyPartitions.size < 1) {
      return new Array[R](0)
    }

    // Determine in advance how many messages need to be taken from each partition
    val parts = nonEmptyPartitions.foldLeft(Map[Int, Int]()) { (result, part) =>
      val remain = num - result.values.sum
      if (remain > 0) {
        val taken = Math.min(remain, part.count)
        result + (part.index -> taken.toInt)
      } else {
        result
      }
    }

    val buf = new ArrayBuffer[R]
    val res = context.runJob(
      this,
      (tc: TaskContext, it: Iterator[R]) => it.take(parts(tc.partitionId)).toArray,
      parts.keys.toArray)
    res.foreach(buf ++= _)
    buf.toArray
  }

  override def getPreferredLocations(thePart: Partition): Seq[String] = {
    val part = thePart.asInstanceOf[KafkaRDDPartition]
    // TODO is additional hostname resolution necessary here
    Seq(part.host)
  }

  private def errBeginAfterEnd(part: KafkaRDDPartition): String =
    s"Beginning offset ${part.fromOffset} is after the ending offset ${part.untilOffset} " +
      s"for topic ${part.topic} partition ${part.partition}. " +
      "You either provided an invalid fromOffset, or the Kafka topic has been damaged"

  private def errRanOutBeforeEnd(part: KafkaRDDPartition): String =
    s"Ran out of messages before reaching ending offset ${part.untilOffset} " +
    s"for topic ${part.topic} partition ${part.partition} start ${part.fromOffset}." +
    " This should not happen, and indicates that messages may have been lost"

  private def errOvershotEnd(itemOffset: Long, part: KafkaRDDPartition): String =
    s"Got ${itemOffset} > ending offset ${part.untilOffset} " +
    s"for topic ${part.topic} partition ${part.partition} start ${part.fromOffset}." +
    " This should not happen, and indicates a message may have been skipped"

  override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
    val part = thePart.asInstanceOf[KafkaRDDPartition]
    assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
    if (part.fromOffset == part.untilOffset) {
      log.info(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
        s"skipping ${part.topic} ${part.partition}")
      Iterator.empty
    } else {
      new KafkaRDDIterator(part, context)
    }
  }

  private class KafkaRDDIterator(
      part: KafkaRDDPartition,
      context: TaskContext) extends NextIterator[R] {

    context.addTaskCompletionListener{ context => closeIfNeeded() }

    log.info(s"Computing topic ${part.topic}, partition ${part.partition} " +
      s"offsets ${part.fromOffset} -> ${part.untilOffset}")

    val kc = new KafkaCluster(kafkaParams)
    val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
      .newInstance(kc.config.props)
      .asInstanceOf[Decoder[K]]
    val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
      .newInstance(kc.config.props)
      .asInstanceOf[Decoder[V]]
    val consumer = connectLeader
    var requestOffset = part.fromOffset
    var iter: Iterator[MessageAndOffset] = null

    // The idea is to use the provided preferred host, except on task retry attempts,
    // to minimize number of kafka metadata requests
    private def connectLeader: SimpleConsumer = {
      if (context.attemptNumber > 0) {
        kc.connectLeader(part.topic, part.partition).fold(
          errs => throw new SparkException(
            s"Couldn't connect to leader for topic ${part.topic} ${part.partition}: " +
              errs.mkString("\n")),
          consumer => consumer
        )
      } else {
        kc.connect(part.host, part.port)
      }
    }

    private def handleFetchErr(resp: FetchResponse) {
      if (resp.hasError) {
        val err = resp.errorCode(part.topic, part.partition)
        if (err == ErrorMapping.LeaderNotAvailableCode ||
          err == ErrorMapping.NotLeaderForPartitionCode) {
          log.error(s"Lost leader for topic ${part.topic} partition ${part.partition}, " +
            s" sleeping for ${kc.config.refreshLeaderBackoffMs}ms")
          Thread.sleep(kc.config.refreshLeaderBackoffMs)
        }
        // Let normal rdd retry sort out reconnect attempts
        throw ErrorMapping.exceptionFor(err)
      }
    }

    private def fetchBatch: Iterator[MessageAndOffset] = {
      val req = new FetchRequestBuilder()
        .addFetch(part.topic, part.partition, requestOffset, kc.config.fetchMessageMaxBytes)
        .build()
      val resp = consumer.fetch(req)
      handleFetchErr(resp)
      // kafka may return a batch that starts before the requested offset
      resp.messageSet(part.topic, part.partition)
        .iterator
        .dropWhile(_.offset < requestOffset)
    }

    override def close(): Unit = {
      if (consumer != null) {
        consumer.close()
      }
    }

    override def getNext(): R = {
      if (iter == null || !iter.hasNext) {
        iter = fetchBatch
      }
      if (!iter.hasNext) {
        assert(requestOffset == part.untilOffset, errRanOutBeforeEnd(part))
        finished = true
        null.asInstanceOf[R]
      } else {
        val item = iter.next()
        if (item.offset >= part.untilOffset) {
          assert(item.offset == part.untilOffset, errOvershotEnd(item.offset, part))
          finished = true
          null.asInstanceOf[R]
        } else {
          requestOffset = item.nextOffset
          messageHandler(new MessageAndMetadata(
            part.topic, part.partition, item.message, item.offset, keyDecoder, valueDecoder))
        }
      }
    }
  }
}

private[kafka]
object KafkaRDD {
  import KafkaCluster.LeaderOffset

  /**
   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
   * configuration parameters</a>.
   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
   * @param fromOffsets per-topic/partition Kafka offsets defining the (inclusive)
   *  starting point of the batch
   * @param untilOffsets per-topic/partition Kafka offsets defining the (exclusive)
   *  ending point of the batch
   * @param messageHandler function for translating each message into the desired type
   */
  def apply[
    K: ClassTag,
    V: ClassTag,
    U <: Decoder[_]: ClassTag,
    T <: Decoder[_]: ClassTag,
    R: ClassTag](
      sc: SparkContext,
      kafkaParams: Map[String, String],
      fromOffsets: Map[TopicAndPartition, Long],
      untilOffsets: Map[TopicAndPartition, LeaderOffset],
      messageHandler: MessageAndMetadata[K, V] => R
    ): KafkaRDD[K, V, U, T, R] = {
    val leaders = untilOffsets.map { case (tp, lo) =>
        tp -> (lo.host, lo.port)
    }.toMap

    val offsetRanges = fromOffsets.map { case (tp, fo) =>
        val uo = untilOffsets(tp)
        OffsetRange(tp.topic, tp.partition, fo, uo.offset)
    }.toArray

    new KafkaRDD[K, V, U, T, R](sc, kafkaParams, offsetRanges, leaders, messageHandler)
  }
}

The Kafka direct approach is the classic fault-tolerance choice.
Keep in mind that every form of fault tolerance costs some performance. Not every application needs zero data loss; often a bounded amount of loss (say 5%) is acceptable. When the completeness requirement is low, the extra fault-tolerance configuration may simply not be needed.
One more point: if 1 block out of 1000 is lost, that still counts as data loss, and under the current mechanism all of the blocks have to be re-read and reprocessed; the granularity is too coarse. This could be refined by modifying the source code of the direct Kafka approach.

This post is based on the course "Source Code Customization and Release" taught by Wang Jialin (王家林); many thanks to him!
Technical discussion is welcome. Let's learn and make progress together!
