Lesson 12: Spark Streaming Source Code Analysis — Executor Fault Tolerance
This lesson focuses on executor fault tolerance; driver fault tolerance is covered in the next lesson.

Executor fault tolerance is mainly about keeping the data received by executors safe; fault tolerance of the computation itself can rely entirely on the fault tolerance of the underlying RDDs. Data safety matters enormously to Spark Streaming, for two reasons:

First, Spark Streaming continuously receives data, continuously generates jobs, and continuously submits jobs;

Second, because it is built on Spark Core, as long as the data is safe and reliable, runtime failures can be recovered from automatically through RDD fault tolerance.

The simplest fault-tolerance mechanism is replication; the other option is replay from the data source (that is, re-reading data from some past window directly from the source). Replication itself comes in two flavors: configure a storage level so that the BlockManager makes a backup — data received by an executor is then replicated naturally by the BlockManager machinery, and this is the default; or use a write-ahead log (WAL).
1. Backup via the BlockManager:
The default storage level is MEMORY_AND_DISK_SER_2: besides being stored in the memory (and on the disk) of the machine hosting the receiver's executor, the received data gets one more copy in the memory (and on the disk) of another executor. Take socketTextStream() as an example:
/**
 * Create a input stream from TCP source hostname:port. Data is received using
 * a TCP socket and the receive bytes is interpreted as UTF8 encoded `\n` delimited
 * lines.
 * @param hostname      Hostname to connect to for receiving data
 * @param port          Port to connect to for receiving data
 * @param storageLevel  Storage level to use for storing the received objects
 *                      (default: StorageLevel.MEMORY_AND_DISK_SER_2)
 */
def socketTextStream(
    hostname: String,
    port: Int,
    storageLevel: StorageLevel = StorageLevel.MEMORY_AND_DISK_SER_2
  ): ReceiverInputDStream[String] = withNamedScope("socket text stream") {
  socketStream[String](hostname, port, SocketReceiver.bytesToLines, storageLevel)
}
The storageLevel used when the Receiver is created comes from here.
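As a quick illustration of how this default plays out in user code, the sketch below creates one stream with the default two-replica level and one with an explicit single-copy level. Hostnames and ports are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StorageLevelDemo")
val ssc = new StreamingContext(conf, Seconds(5))

// Default: MEMORY_AND_DISK_SER_2, i.e. a second copy lands on another executor.
val replicated = ssc.socketTextStream("localhost", 9999)

// Explicit single copy: cheaper, but a lost executor means lost blocks.
val singleCopy = ssc.socketTextStream("localhost", 9998, StorageLevel.MEMORY_AND_DISK_SER)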
private[streaming]
class SocketReceiver[T: ClassTag](
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel
  ) extends Receiver[T](storageLevel) with Logging {
In ReceiverSupervisorImpl, when the WAL is not enabled, a BlockManagerBasedBlockHandler is instantiated:
private val receivedBlockHandler: ReceivedBlockHandler = {
  if (WriteAheadLogUtils.enableReceiverLog(env.conf)) {
    if (checkpointDirOption.isEmpty) {
      throw new SparkException(
        "Cannot enable receiver write-ahead log without checkpoint directory set. " +
          "Please use streamingContext.checkpoint() to set the checkpoint directory. " +
          "See documentation for more details.")
    }
    new WriteAheadLogBasedBlockHandler(env.blockManager, receiver.streamId,
      receiver.storageLevel, env.conf, hadoopConf, checkpointDirOption.get)
  } else {
    new BlockManagerBasedBlockHandler(env.blockManager, receiver.storageLevel)
  }
}
BlockManagerBasedBlockHandler ultimately stores the data through the BlockManager:
/**
 * Implementation of a [[org.apache.spark.streaming.receiver.ReceivedBlockHandler]] which
 * stores the received blocks into a block manager with the specified storage level.
 */
private[streaming] class BlockManagerBasedBlockHandler(
    blockManager: BlockManager, storageLevel: StorageLevel)
  extends ReceivedBlockHandler with Logging {

  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

    var numRecords = None: Option[Long]

    val putResult: Seq[(BlockId, BlockStatus)] = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        numRecords = Some(arrayBuffer.size.toLong)
        blockManager.putIterator(blockId, arrayBuffer.iterator, storageLevel,
          tellMaster = true)
      case IteratorBlock(iterator) =>
        val countIterator = new CountingIterator(iterator)
        val putResult = blockManager.putIterator(blockId, countIterator, storageLevel,
          tellMaster = true)
        numRecords = countIterator.count
        putResult
      case ByteBufferBlock(byteBuffer) =>
        blockManager.putBytes(blockId, byteBuffer, storageLevel, tellMaster = true)
      case o =>
        throw new SparkException(
          s"Could not store $blockId to block manager, unexpected block type ${o.getClass.getName}")
    }
    if (!putResult.map { _._1 }.contains(blockId)) {
      throw new SparkException(
        s"Could not store $blockId to block manager with storage level $storageLevel")
    }
    BlockManagerBasedStoreResult(blockId, numRecords)
  }

  def cleanupOldBlocks(threshTime: Long) {
    // this is not used as blocks inserted into the BlockManager are cleared by DStream's clearing
    // of BlockRDDs.
  }
}
2. Backup via the WAL:
In ReceiverSupervisorImpl, when the WAL is enabled, a WriteAheadLogBasedBlockHandler is instantiated instead.

As we will see, the WAL mechanism writes into the checkpoint directory, which in production usually lives on HDFS, where there are 3 replicas by default. This is safe, but it is time-consuming and can hurt performance, so it is not recommended when latency requirements are strict.
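Enabling the WAL is purely a matter of configuration plus a checkpoint directory. A minimal sketch — the HDFS path is a placeholder:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("WALDemo")
  // This is the flag that WriteAheadLogUtils.enableReceiverLog() checks above.
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(5))
// Without this, the SparkException above is thrown. The WAL files end up under
// <checkpointDir>/receivedData/<streamId> (see checkpointDirToLogDir below).
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoint")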
/**
 * Implementation of a [[org.apache.spark.streaming.receiver.ReceivedBlockHandler]] which
 * stores the received blocks in both, a write ahead log and a block manager.
 */
private[streaming] class WriteAheadLogBasedBlockHandler(
    blockManager: BlockManager,
    streamId: Int,
    storageLevel: StorageLevel,
    conf: SparkConf,
    hadoopConf: Configuration,
    checkpointDir: String,
    clock: Clock = new SystemClock
  ) extends ReceivedBlockHandler with Logging {

  private val blockStoreTimeout = conf.getInt(
    "spark.streaming.receiver.blockStoreTimeout", 30).seconds

  private val effectiveStorageLevel = {
    if (storageLevel.deserialized) {
      logWarning(s"Storage level serialization ${storageLevel.deserialized} is not supported when" +
        s" write ahead log is enabled, change to serialization false")
    }
    // With the WAL in place there is no need for a replicated storage level:
    // the checkpoint directory lives on HDFS, which already keeps 3 replicas by default.
    if (storageLevel.replication > 1) {
      logWarning(s"Storage level replication ${storageLevel.replication} is unnecessary when " +
        s"write ahead log is enabled, change to replication 1")
    }

    StorageLevel(storageLevel.useDisk, storageLevel.useMemory, storageLevel.useOffHeap, false, 1)
  }

  if (storageLevel != effectiveStorageLevel) {
    logWarning(s"User defined storage level $storageLevel is changed to effective storage level " +
      s"$effectiveStorageLevel when write ahead log is enabled")
  }

  // Write ahead log manages
  private val writeAheadLog = WriteAheadLogUtils.createLogForReceiver(
    conf, checkpointDirToLogDir(checkpointDir, streamId), hadoopConf)

  // For processing futures used in parallel block storing into block manager and write ahead log
  // # threads = 2, so that both writing to BM and WAL can proceed in parallel
  implicit private val executionContext = ExecutionContext.fromExecutorService(
    ThreadUtils.newDaemonFixedThreadPool(2, this.getClass.getSimpleName))

  /**
   * This implementation stores the block into the block manager as well as a write ahead log.
   * It does this in parallel, using Scala Futures, and returns only after the block has
   * been stored in both places.
   */
  def storeBlock(blockId: StreamBlockId, block: ReceivedBlock): ReceivedBlockStoreResult = {

    var numRecords = None: Option[Long]
    // Serialize the block so that it can be inserted into both
    val serializedBlock = block match {
      case ArrayBufferBlock(arrayBuffer) =>
        numRecords = Some(arrayBuffer.size.toLong)
        blockManager.dataSerialize(blockId, arrayBuffer.iterator)
      case IteratorBlock(iterator) =>
        val countIterator = new CountingIterator(iterator)
        val serializedBlock = blockManager.dataSerialize(blockId, countIterator)
        numRecords = countIterator.count
        serializedBlock
      case ByteBufferBlock(byteBuffer) =>
        byteBuffer
      case _ =>
        throw new Exception(s"Could not push $blockId to block manager, unexpected block type")
    }

    // Store the block in block manager
    val storeInBlockManagerFuture = Future {
      val putResult =
        blockManager.putBytes(blockId, serializedBlock, effectiveStorageLevel, tellMaster = true)
      if (!putResult.map { _._1 }.contains(blockId)) {
        throw new SparkException(
          s"Could not store $blockId to block manager with storage level $storageLevel")
      }
    }

    // Store the block in write ahead log
    val storeInWriteAheadLogFuture = Future {
      writeAheadLog.write(serializedBlock, clock.getTimeMillis())
    }

    // Combine the futures, wait for both to complete, and return the write ahead log record handle
    val combinedFuture = storeInBlockManagerFuture.zip(storeInWriteAheadLogFuture).map(_._2)
    val walRecordHandle = Await.result(combinedFuture, blockStoreTimeout)
    WriteAheadLogBasedStoreResult(blockId, numRecords, walRecordHandle)
  }

  def cleanupOldBlocks(threshTime: Long) {
    writeAheadLog.clean(threshTime, false)
  }

  def stop() {
    writeAheadLog.close()
    executionContext.shutdown()
  }
}

private[streaming] object WriteAheadLogBasedBlockHandler {
  def checkpointDirToLogDir(checkpointDir: String, streamId: Int): String = {
    new Path(checkpointDir, new Path("receivedData", streamId.toString)).toString
  }
}
Next, the WriteAheadLog abstract class. A WAL writes data sequentially and reads it sequentially or randomly; there are no in-place updates or deletes of records, and a read only needs a cursor or pointer, so it remains quite fast:
/**
 * :: DeveloperApi ::
 *
 * This abstract class represents a write ahead log (aka journal) that is used by Spark Streaming
 * to save the received data (by receivers) and associated metadata to a reliable storage, so that
 * they can be recovered after driver failures. See the Spark documentation for more information
 * on how to plug in your own custom implementation of a write ahead log.
 */
@org.apache.spark.annotation.DeveloperApi
public abstract class WriteAheadLog {
  /**
   * Write the record to the log and return a record handle, which contains all the information
   * necessary to read back the written record. The time is used to the index the record,
   * such that it can be cleaned later. Note that implementations of this abstract class must
   * ensure that the written data is durable and readable (using the record handle) by the
   * time this function returns.
   */
  abstract public WriteAheadLogRecordHandle write(ByteBuffer record, long time);

  /**
   * Read a written record based on the given record handle.
   */
  abstract public ByteBuffer read(WriteAheadLogRecordHandle handle);

  /**
   * Read and return an iterator of all the records that have been written but not yet cleaned up.
   */
  abstract public Iterator<ByteBuffer> readAll();

  /**
   * Clean all the records that are older than the threshold time. It can wait for
   * the completion of the deletion.
   */
  abstract public void clean(long threshTime, boolean waitForCompletion);

  /**
   * Close this log and release any resources.
   */
  abstract public void close();
}
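The doc comment mentions plugging in a custom implementation (if I read WriteAheadLogUtils correctly, it is looked up via the spark.streaming.receiver.writeAheadLog.class / spark.streaming.driver.writeAheadLog.class settings). Purely to make the contract concrete, here is a toy in-memory sketch — it is not durable, so it would defeat the purpose of a WAL in production:

import java.nio.ByteBuffer
import java.util.{Iterator => JIterator}
import scala.collection.JavaConverters._
import scala.collection.mutable
import org.apache.spark.streaming.util.{WriteAheadLog, WriteAheadLogRecordHandle}

// Toy handle: just a key into the in-memory map.
class InMemoryRecordHandle(val id: Long) extends WriteAheadLogRecordHandle

class InMemoryWriteAheadLog extends WriteAheadLog {
  private var nextId = 0L
  // id -> (record time, payload); a real WAL would append to durable storage instead
  private val records = mutable.LinkedHashMap[Long, (Long, ByteBuffer)]()

  override def write(record: ByteBuffer, time: Long): WriteAheadLogRecordHandle =
    this.synchronized {
      val id = nextId
      nextId += 1
      records(id) = (time, record.duplicate())
      new InMemoryRecordHandle(id)
    }

  override def read(handle: WriteAheadLogRecordHandle): ByteBuffer =
    this.synchronized {
      records(handle.asInstanceOf[InMemoryRecordHandle].id)._2.duplicate()
    }

  override def readAll(): JIterator[ByteBuffer] =
    this.synchronized {
      records.values.map(_._2.duplicate()).toIterator.asJava
    }

  override def clean(threshTime: Long, waitForCompletion: Boolean): Unit =
    this.synchronized {
      // Drop every record written with a timestamp older than the threshold.
      records.retain { case (_, (time, _)) => time >= threshTime }
    }

  override def close(): Unit = this.synchronized { records.clear() }
}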
The handle returned by write() is a WriteAheadLogRecordHandle; its concrete implementation is FileBasedWriteAheadLogSegment:
/**
 * :: DeveloperApi ::
 *
 * This abstract class represents a handle that refers to a record written in a
 * {@link org.apache.spark.streaming.util.WriteAheadLog WriteAheadLog}.
 * It must contain all the information necessary for the record to be read and returned by
 * an implementation of the WriteAheadLog class.
 *
 * @see org.apache.spark.streaming.util.WriteAheadLog
 */
@org.apache.spark.annotation.DeveloperApi
public abstract class WriteAheadLogRecordHandle implements java.io.Serializable {
}
/** Class for representing a segment of data in a write ahead log file */
private[streaming] case class FileBasedWriteAheadLogSegment(path: String, offset: Long, length: Int)
  extends WriteAheadLogRecordHandle
Now the concrete implementation of the WriteAheadLog abstract class, FileBasedWriteAheadLog. Note that although its comments talk about HDFS, any file system supported by Hadoop can actually be used:
/**
 * This class manages write ahead log files.
 *
 *  - Writes records (bytebuffers) to periodically rotating log files.
 *  - Recovers the log files and the reads the recovered records upon failures.
 *  - Cleans up old log files.
 *
 * Uses [[org.apache.spark.streaming.util.FileBasedWriteAheadLogWriter]] to write
 * and [[org.apache.spark.streaming.util.FileBasedWriteAheadLogReader]] to read.
 *
 * @param logDirectory Directory when rotating log files will be created.
 * @param hadoopConf Hadoop configuration for reading/writing log files.
 */
private[streaming] class FileBasedWriteAheadLog(
    conf: SparkConf,
    logDirectory: String,
    hadoopConf: Configuration,
    rollingIntervalSecs: Int,
    maxFailures: Int,
    closeFileAfterWrite: Boolean
  ) extends WriteAheadLog with Logging {

  import FileBasedWriteAheadLog._

  private val pastLogs = new ArrayBuffer[LogInfo]
  private val callerNameTag = getCallerName.map(c => s" for $c").getOrElse("")

  private val threadpoolName = s"WriteAheadLogManager $callerNameTag"
  private val threadpool = ThreadUtils.newDaemonCachedThreadPool(threadpoolName, 20)
  private val executionContext = ExecutionContext.fromExecutorService(threadpool)
  override protected val logName = s"WriteAheadLogManager $callerNameTag"

  private var currentLogPath: Option[String] = None
  private var currentLogWriter: FileBasedWriteAheadLogWriter = null
  private var currentLogWriterStartTime: Long = -1L
  private var currentLogWriterStopTime: Long = -1L

  initializeOrRecover()

  /**
   * Write a byte buffer to the log file. This method synchronously writes the data in the
   * ByteBuffer to HDFS. When this method returns, the data is guaranteed to have been flushed
   * to HDFS, and will be available for readers to read.
   */
  def write(byteBuffer: ByteBuffer, time: Long): FileBasedWriteAheadLogSegment = synchronized {
    var fileSegment: FileBasedWriteAheadLogSegment = null
    var failures = 0
    var lastException: Exception = null
    var succeeded = false
    while (!succeeded && failures < maxFailures) {
      try {
        fileSegment = getLogWriter(time).write(byteBuffer)
        if (closeFileAfterWrite) {
          resetWriter()
        }
        succeeded = true
      } catch {
        case ex: Exception =>
          lastException = ex
          logWarning("Failed to write to write ahead log")
          resetWriter()
          failures += 1
      }
    }
    if (fileSegment == null) {
      logError(s"Failed to write to write ahead log after $failures failures")
      throw lastException
    }
    fileSegment
  }

  def read(segment: WriteAheadLogRecordHandle): ByteBuffer = {
    val fileSegment = segment.asInstanceOf[FileBasedWriteAheadLogSegment]
    var reader: FileBasedWriteAheadLogRandomReader = null
    var byteBuffer: ByteBuffer = null
    try {
      reader = new FileBasedWriteAheadLogRandomReader(fileSegment.path, hadoopConf)
      byteBuffer = reader.read(fileSegment)
    } finally {
      reader.close()
    }
    byteBuffer
  }

  /**
   * Read all the existing logs from the log directory.
   *
   * Note that this is typically called when the caller is initializing and wants
   * to recover past state from the write ahead logs (that is, before making any writes).
   * If this is called after writes have been made using this manager, then it may not return
   * the latest the records. This does not deal with currently active log files, and
   * hence the implementation is kept simple.
   */
  def readAll(): JIterator[ByteBuffer] = synchronized {
    val logFilesToRead = pastLogs.map{ _.path} ++ currentLogPath
    logInfo("Reading from the logs:\n" + logFilesToRead.mkString("\n"))
    def readFile(file: String): Iterator[ByteBuffer] = {
      logDebug(s"Creating log reader with $file")
      val reader = new FileBasedWriteAheadLogReader(file, hadoopConf)
      CompletionIterator[ByteBuffer, Iterator[ByteBuffer]](reader, reader.close _)
    }
    if (!closeFileAfterWrite) {
      logFilesToRead.iterator.map(readFile).flatten.asJava
    } else {
      // For performance gains, it makes sense to parallelize the recovery if
      // closeFileAfterWrite = true
      seqToParIterator(threadpool, logFilesToRead, readFile).asJava
    }
  }

  /**
   * Delete the log files that are older than the threshold time.
   *
   * Its important to note that the threshold time is based on the time stamps used in the log
   * files, which is usually based on the local system time. So if there is coordination necessary
   * between the node calculating the threshTime (say, driver node), and the local system time
   * (say, worker node), the caller has to take account of possible time skew.
   *
   * If waitForCompletion is set to true, this method will return only after old logs have been
   * deleted. This should be set to true only for testing. Else the files will be deleted
   * asynchronously.
   */
  def clean(threshTime: Long, waitForCompletion: Boolean): Unit = {
    val oldLogFiles = synchronized {
      val expiredLogs = pastLogs.filter { _.endTime < threshTime }
      pastLogs --= expiredLogs
      expiredLogs
    }
    logInfo(s"Attempting to clear ${oldLogFiles.size} old log files in $logDirectory " +
      s"older than $threshTime: ${oldLogFiles.map { _.path }.mkString("\n")}")

    def deleteFile(walInfo: LogInfo): Unit = {
      try {
        val path = new Path(walInfo.path)
        val fs = HdfsUtils.getFileSystemForPath(path, hadoopConf)
        fs.delete(path, true)
        logDebug(s"Cleared log file $walInfo")
      } catch {
        case ex: Exception =>
          logWarning(s"Error clearing write ahead log file $walInfo", ex)
      }
      logInfo(s"Cleared log files in $logDirectory older than $threshTime")
    }
    oldLogFiles.foreach { logInfo =>
      if (!executionContext.isShutdown) {
        try {
          val f = Future { deleteFile(logInfo) }(executionContext)
          if (waitForCompletion) {
            import scala.concurrent.duration._
            Await.ready(f, 1 second)
          }
        } catch {
          case e: RejectedExecutionException =>
            logWarning("Execution context shutdown before deleting old WriteAheadLogs. " +
              "This would not affect recovery correctness.", e)
        }
      }
    }
  }

  /** Stop the manager, close any open log writer */
  def close(): Unit = synchronized {
    if (currentLogWriter != null) {
      currentLogWriter.close()
    }
    executionContext.shutdown()
    logInfo("Stopped write ahead log manager")
  }

  /** Get the current log writer while taking care of rotation */
  private def getLogWriter(currentTime: Long): FileBasedWriteAheadLogWriter = synchronized {
    if (currentLogWriter == null || currentTime > currentLogWriterStopTime) {
      resetWriter()
      currentLogPath.foreach {
        pastLogs += LogInfo(currentLogWriterStartTime, currentLogWriterStopTime, _)
      }
      currentLogWriterStartTime = currentTime
      currentLogWriterStopTime = currentTime + (rollingIntervalSecs * 1000)
      val newLogPath = new Path(logDirectory,
        timeToLogFile(currentLogWriterStartTime, currentLogWriterStopTime))
      currentLogPath = Some(newLogPath.toString)
      currentLogWriter = new FileBasedWriteAheadLogWriter(currentLogPath.get, hadoopConf)
    }
    currentLogWriter
  }

  /** Initialize the log directory or recover existing logs inside the directory */
  private def initializeOrRecover(): Unit = synchronized {
    val logDirectoryPath = new Path(logDirectory)
    val fileSystem = HdfsUtils.getFileSystemForPath(logDirectoryPath, hadoopConf)

    if (fileSystem.exists(logDirectoryPath) && fileSystem.getFileStatus(logDirectoryPath).isDir) {
      val logFileInfo = logFilesTologInfo(fileSystem.listStatus(logDirectoryPath).map { _.getPath })
      pastLogs.clear()
      pastLogs ++= logFileInfo
      logInfo(s"Recovered ${logFileInfo.size} write ahead log files from $logDirectory")
      logDebug(s"Recovered files are:\n${logFileInfo.map(_.path).mkString("\n")}")
    }
  }

  private def resetWriter(): Unit = synchronized {
    if (currentLogWriter != null) {
      currentLogWriter.close()
      currentLogWriter = null
    }
  }
}

private[streaming] object FileBasedWriteAheadLog {

  case class LogInfo(startTime: Long, endTime: Long, path: String)

  val logFileRegex = """log-(\d+)-(\d+)""".r

  def timeToLogFile(startTime: Long, stopTime: Long): String = {
    s"log-$startTime-$stopTime"
  }

  def getCallerName(): Option[String] = {
    val stackTraceClasses = Thread.currentThread.getStackTrace().map(_.getClassName)
    stackTraceClasses.find(!_.contains("WriteAheadLog")).flatMap(_.split("\\.").lastOption)
  }

  /** Convert a sequence of files to a sequence of sorted LogInfo objects */
  def logFilesTologInfo(files: Seq[Path]): Seq[LogInfo] = {
    files.flatMap { file =>
      logFileRegex.findFirstIn(file.getName()) match {
        case Some(logFileRegex(startTimeStr, stopTimeStr)) =>
          val startTime = startTimeStr.toLong
          val stopTime = stopTimeStr.toLong
          Some(LogInfo(startTime, stopTime, file.toString))
        case None => None
      }
    }.sortBy { _.startTime }
  }

  /**
   * This creates an iterator from a parallel collection, by keeping at most `n` objects in memory
   * at any given time, where `n` is the size of the thread pool. This is crucial for use cases
   * where we create `FileBasedWriteAheadLogReader`s during parallel recovery. We don't want to
   * open up `k` streams altogether where `k` is the size of the Seq that we want to parallelize.
   */
  def seqToParIterator[I, O](
      tpool: ThreadPoolExecutor,
      source: Seq[I],
      handler: I => Iterator[O]): Iterator[O] = {
    val taskSupport = new ThreadPoolTaskSupport(tpool)
    val groupSize = tpool.getMaximumPoolSize.max(8)
    source.grouped(groupSize).flatMap { group =>
      val parallelCollection = group.par
      parallelCollection.tasksupport = taskSupport
      parallelCollection.map(handler)
    }.flatten
  }
}
The actual reading and writing are delegated to FileBasedWriteAheadLogRandomReader and FileBasedWriteAheadLogReader.
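To make "a read only needs a pointer" concrete: given a segment handle (path, offset, length), a random read is essentially a seek followed by a fixed-length read. Here is a simplified sketch using the Hadoop FileSystem API (the real FileBasedWriteAheadLogRandomReader additionally validates a length prefix stored with each record):

import java.nio.ByteBuffer
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Simplified random read of one WAL record from its (path, offset, length) handle.
def readSegment(path: String, offset: Long, length: Int, conf: Configuration): ByteBuffer = {
  val p = new Path(path)
  val in = p.getFileSystem(conf).open(p)
  try {
    in.seek(offset)              // jump straight to the record, no scanning
    val buf = new Array[Byte](length)
    in.readFully(buf)            // the handle already knows the exact record length
    ByteBuffer.wrap(buf)
  } finally {
    in.close()
  }
}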
3. Replay from the data source:
With replay, no replica and no extra fault-tolerance machinery are needed; Kafka effectively acts as a file storage system. Kafka can be consumed in two ways: receiver-based and direct. In the receiver-based approach, metadata such as offsets is kept in ZooKeeper; after a failure, Kafka can re-read from the recorded offset (since no acknowledgment had been sent, Kafka does not consider the data consumed). This can cause duplicate consumption: if the data has been consumed but the failure strikes before the acknowledgment synchronizes the metadata in ZooKeeper, the same data is read again. That is why production systems increasingly use the direct approach, where you manage the offsets yourself and can achieve exactly-once processing.
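For reference, this is roughly what the direct mode looks like on the user side. The broker list is a placeholder, and loadOffsetsFromMyStore() is a hypothetical helper standing in for wherever you persist offsets (ZooKeeper, HBase, an RDBMS, ...):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

// Start each topic/partition exactly where the previous run left off.
val fromOffsets: Map[TopicAndPartition, Long] = loadOffsetsFromMyStore()

// Keep both key and value of each message.
val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key, mmd.message)

val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets, messageHandler)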
Now look at the DirectKafkaInputDStream class:
/**
 * A stream of {@link org.apache.spark.streaming.kafka.KafkaRDD} where
 * each given Kafka topic/partition corresponds to an RDD partition.
 * The spark configuration spark.streaming.kafka.maxRatePerPartition gives the maximum number
 * of messages
 * per second that each '''partition''' will accept.
 * Starting offsets are specified in advance,
 * and this DStream is not responsible for committing offsets,
 * so that you can control exactly-once semantics.
 * For an easy interface to Kafka-managed offsets,
 * see {@link org.apache.spark.streaming.kafka.KafkaCluster}
 * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
 * configuration parameters</a>.
 *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
 *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
 * @param fromOffsets per-topic/partition Kafka offsets defining the (inclusive)
 *  starting point of the stream
 * @param messageHandler function for translating each message into the desired type
 */
private[streaming]
class DirectKafkaInputDStream[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[K]: ClassTag,
  T <: Decoder[V]: ClassTag,
  R: ClassTag](
    ssc_ : StreamingContext,
    val kafkaParams: Map[String, String],
    val fromOffsets: Map[TopicAndPartition, Long],
    messageHandler: MessageAndMetadata[K, V] => R
  ) extends InputDStream[R](ssc_) with Logging {
  val maxRetries = context.sparkContext.getConf.getInt(
    "spark.streaming.kafka.maxRetries", 1)

  // Keep this consistent with how other streams are named (e.g. "Flume polling stream [2]")
  private[streaming] override def name: String = s"Kafka direct stream [$id]"

  protected[streaming] override val checkpointData =
    new DirectKafkaInputDStreamCheckpointData

  /**
   * Asynchronously maintains & sends new rate limits to the receiver through the receiver tracker.
   */
  override protected[streaming] val rateController: Option[RateController] = {
    if (RateController.isBackPressureEnabled(ssc.conf)) {
      Some(new DirectKafkaRateController(id,
        RateEstimator.create(ssc.conf, context.graph.batchDuration)))
    } else {
      None
    }
  }

  protected val kc = new KafkaCluster(kafkaParams)

  // maxRateLimitPerPartition is used for rate limiting
  private val maxRateLimitPerPartition: Int = context.sparkContext.getConf.getInt(
      "spark.streaming.kafka.maxRatePerPartition", 0)

  protected def maxMessagesPerPartition: Option[Long] = {
    val estimatedRateLimit = rateController.map(_.getLatestRate().toInt)
    val numPartitions = currentOffsets.keys.size

    val effectiveRateLimitPerPartition = estimatedRateLimit
      .filter(_ > 0)
      .map { limit =>
        if (maxRateLimitPerPartition > 0) {
          Math.min(maxRateLimitPerPartition, (limit / numPartitions))
        } else {
          limit / numPartitions
        }
      }.getOrElse(maxRateLimitPerPartition)

    if (effectiveRateLimitPerPartition > 0) {
      val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
      Some((secsPerBatch * effectiveRateLimitPerPartition).toLong)
    } else {
      None
    }
  }

  protected var currentOffsets = fromOffsets

  @tailrec
  protected final def latestLeaderOffsets(retries: Int): Map[TopicAndPartition, LeaderOffset] = {
    val o = kc.getLatestLeaderOffsets(currentOffsets.keySet)
    // Either.fold would confuse @tailrec, do it manually
    if (o.isLeft) {
      val err = o.left.get.toString
      if (retries <= 0) {
        throw new SparkException(err)
      } else {
        log.error(err)
        Thread.sleep(kc.config.refreshLeaderBackoffMs)
        latestLeaderOffsets(retries - 1)
      }
    } else {
      o.right.get
    }
  }

  // limits the maximum number of messages per partition
  protected def clamp(
    leaderOffsets: Map[TopicAndPartition, LeaderOffset]): Map[TopicAndPartition, LeaderOffset] = {
    maxMessagesPerPartition.map { mmp =>
      leaderOffsets.map { case (tp, lo) =>
        tp -> lo.copy(offset = Math.min(currentOffsets(tp) + mmp, lo.offset))
      }
    }.getOrElse(leaderOffsets)
  }

  override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
    val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
    // DirectKafkaInputDStream generates a KafkaRDD when it computes a batch
    val rdd = KafkaRDD[K, V, U, T, R](
      context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)

    // Report the record number and metadata of this batch interval to InputInfoTracker.
    val offsetRanges = currentOffsets.map { case (tp, fo) =>
      val uo = untilOffsets(tp)
      OffsetRange(tp.topic, tp.partition, fo, uo.offset)
    }
    val description = offsetRanges.filter { offsetRange =>
      // Don't display empty ranges.
      offsetRange.fromOffset != offsetRange.untilOffset
    }.map { offsetRange =>
      s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
        s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
    }.mkString("\n")
    // Copy offsetRanges to immutable.List to prevent from being modified by the user
    val metadata = Map(
      "offsets" -> offsetRanges.toList,
      StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
    val inputInfo = StreamInputInfo(id, rdd.count, metadata)
    ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

    currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
    Some(rdd)
  }

  override def start(): Unit = {
  }

  def stop(): Unit = {
  }

  private[streaming]
  class DirectKafkaInputDStreamCheckpointData extends DStreamCheckpointData(this) {
    def batchForTime: mutable.HashMap[Time, Array[(String, Int, Long, Long)]] = {
      data.asInstanceOf[mutable.HashMap[Time, Array[OffsetRange.OffsetRangeTuple]]]
    }

    override def update(time: Time) {
      batchForTime.clear()
      generatedRDDs.foreach { kv =>
        val a = kv._2.asInstanceOf[KafkaRDD[K, V, U, T, R]].offsetRanges.map(_.toTuple).toArray
        batchForTime += kv._1 -> a
      }
    }

    override def cleanup(time: Time) { }

    override def restore() {
      // this is assuming that the topics don't change during execution, which is true currently
      val topics = fromOffsets.keySet
      val leaders = KafkaCluster.checkErrors(kc.findLeaders(topics))

      batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
        logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}")
        generatedRDDs += t -> new KafkaRDD[K, V, U, T, R](
          context.sparkContext, kafkaParams, b.map(OffsetRange(_)), leaders, messageHandler)
      }
    }
  }

  /**
   * A RateController to retrieve the rate from RateEstimator.
   */
  private[streaming] class DirectKafkaRateController(id: Int, estimator: RateEstimator)
    extends RateController(id, estimator) {
    override def publish(rate: Long): Unit = ()
  }
}
Each time a batch is generated, latestLeaderOffsets is called to get the latest offsets; subtracting the offsets processed by the previous batch yields this batch's offset range, which pins down the RDD's data source.
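This is also what makes manual offset management possible on the user side: each batch RDD carries its offset ranges. A common pattern is to read them in foreachRDD and persist them together with the output (saveOffsetToMyStore() is a hypothetical helper):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Works because compute() above always returns a KafkaRDD,
  // which mixes in HasOffsetRanges.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process rdd and write the results out ...

  // Persist offsets atomically with (or right after) the output, so that a
  // restart resumes from untilOffset without reprocessing or skipping data.
  offsetRanges.foreach { o =>
    saveOffsetToMyStore(o.topic, o.partition, o.untilOffset)
  }
}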
Now look at the KafkaRDD that compute() generates:
/**
 * A batch-oriented interface for consuming from Kafka.
 * Starting and ending offsets are specified in advance,
 * so that you can control exactly-once semantics.
 * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
 * configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers" to be set
 * with Kafka broker(s) specified in host1:port1,host2:port2 form.
 * @param offsetRanges offset ranges that define the Kafka data belonging to this RDD
 * @param messageHandler function for translating each message into the desired type
 */
private[kafka]
class KafkaRDD[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[_]: ClassTag,
  T <: Decoder[_]: ClassTag,
  R: ClassTag] private[spark] (
    sc: SparkContext,
    kafkaParams: Map[String, String],
    val offsetRanges: Array[OffsetRange],
    leaders: Map[TopicAndPartition, (String, Int)],
    messageHandler: MessageAndMetadata[K, V] => R
  ) extends RDD[R](sc, Nil) with Logging with HasOffsetRanges {
  override def getPartitions: Array[Partition] = {
    offsetRanges.zipWithIndex.map { case (o, i) =>
      val (host, port) = leaders(TopicAndPartition(o.topic, o.partition))
      new KafkaRDDPartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port)
    }.toArray
  }

  override def count(): Long = offsetRanges.map(_.count).sum

  override def countApprox(
      timeout: Long,
      confidence: Double = 0.95
  ): PartialResult[BoundedDouble] = {
    val c = count
    new PartialResult(new BoundedDouble(c, 1.0, c, c), true)
  }

  override def isEmpty(): Boolean = count == 0L

  override def take(num: Int): Array[R] = {
    val nonEmptyPartitions = this.partitions
      .map(_.asInstanceOf[KafkaRDDPartition])
      .filter(_.count > 0)

    if (num < 1 || nonEmptyPartitions.size < 1) {
      return new Array[R](0)
    }

    // Determine in advance how many messages need to be taken from each partition
    val parts = nonEmptyPartitions.foldLeft(Map[Int, Int]()) { (result, part) =>
      val remain = num - result.values.sum
      if (remain > 0) {
        val taken = Math.min(remain, part.count)
        result + (part.index -> taken.toInt)
      } else {
        result
      }
    }

    val buf = new ArrayBuffer[R]
    val res = context.runJob(
      this,
      (tc: TaskContext, it: Iterator[R]) => it.take(parts(tc.partitionId)).toArray,
      parts.keys.toArray)
    res.foreach(buf ++= _)
    buf.toArray
  }

  override def getPreferredLocations(thePart: Partition): Seq[String] = {
    val part = thePart.asInstanceOf[KafkaRDDPartition]
    // TODO is additional hostname resolution necessary here
    Seq(part.host)
  }

  private def errBeginAfterEnd(part: KafkaRDDPartition): String =
    s"Beginning offset ${part.fromOffset} is after the ending offset ${part.untilOffset} " +
      s"for topic ${part.topic} partition ${part.partition}. " +
      "You either provided an invalid fromOffset, or the Kafka topic has been damaged"

  private def errRanOutBeforeEnd(part: KafkaRDDPartition): String =
    s"Ran out of messages before reaching ending offset ${part.untilOffset} " +
      s"for topic ${part.topic} partition ${part.partition} start ${part.fromOffset}." +
      " This should not happen, and indicates that messages may have been lost"

  private def errOvershotEnd(itemOffset: Long, part: KafkaRDDPartition): String =
    s"Got ${itemOffset} > ending offset ${part.untilOffset} " +
      s"for topic ${part.topic} partition ${part.partition} start ${part.fromOffset}." +
      " This should not happen, and indicates a message may have been skipped"

  override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
    val part = thePart.asInstanceOf[KafkaRDDPartition]
    assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
    if (part.fromOffset == part.untilOffset) {
      log.info(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
        s"skipping ${part.topic} ${part.partition}")
      Iterator.empty
    } else {
      new KafkaRDDIterator(part, context)
    }
  }

  private class KafkaRDDIterator(
      part: KafkaRDDPartition,
      context: TaskContext) extends NextIterator[R] {

    context.addTaskCompletionListener{ context => closeIfNeeded() }

    log.info(s"Computing topic ${part.topic}, partition ${part.partition} " +
      s"offsets ${part.fromOffset} -> ${part.untilOffset}")

    val kc = new KafkaCluster(kafkaParams)
    val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
      .newInstance(kc.config.props)
      .asInstanceOf[Decoder[K]]
    val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
      .newInstance(kc.config.props)
      .asInstanceOf[Decoder[V]]
    val consumer = connectLeader
    var requestOffset = part.fromOffset
    var iter: Iterator[MessageAndOffset] = null

    // The idea is to use the provided preferred host, except on task retry attempts,
    // to minimize number of kafka metadata requests
    private def connectLeader: SimpleConsumer = {
      if (context.attemptNumber > 0) {
        kc.connectLeader(part.topic, part.partition).fold(
          errs => throw new SparkException(
            s"Couldn't connect to leader for topic ${part.topic} ${part.partition}: " +
              errs.mkString("\n")),
          consumer => consumer
        )
      } else {
        kc.connect(part.host, part.port)
      }
    }

    private def handleFetchErr(resp: FetchResponse) {
      if (resp.hasError) {
        val err = resp.errorCode(part.topic, part.partition)
        if (err == ErrorMapping.LeaderNotAvailableCode ||
          err == ErrorMapping.NotLeaderForPartitionCode) {
          log.error(s"Lost leader for topic ${part.topic} partition ${part.partition}, " +
            s" sleeping for ${kc.config.refreshLeaderBackoffMs}ms")
          Thread.sleep(kc.config.refreshLeaderBackoffMs)
        }
        // Let normal rdd retry sort out reconnect attempts
        throw ErrorMapping.exceptionFor(err)
      }
    }

    private def fetchBatch: Iterator[MessageAndOffset] = {
      val req = new FetchRequestBuilder()
        .addFetch(part.topic, part.partition, requestOffset, kc.config.fetchMessageMaxBytes)
        .build()
      val resp = consumer.fetch(req)
      handleFetchErr(resp)
      // kafka may return a batch that starts before the requested offset
      resp.messageSet(part.topic, part.partition)
        .iterator
        .dropWhile(_.offset < requestOffset)
    }

    override def close(): Unit = {
      if (consumer != null) {
        consumer.close()
      }
    }

    override def getNext(): R = {
      if (iter == null || !iter.hasNext) {
        iter = fetchBatch
      }
      if (!iter.hasNext) {
        assert(requestOffset == part.untilOffset, errRanOutBeforeEnd(part))
        finished = true
        null.asInstanceOf[R]
      } else {
        val item = iter.next()
        if (item.offset >= part.untilOffset) {
          assert(item.offset == part.untilOffset, errOvershotEnd(item.offset, part))
          finished = true
          null.asInstanceOf[R]
        } else {
          requestOffset = item.nextOffset
          messageHandler(new MessageAndMetadata(
            part.topic, part.partition, item.message, item.offset, keyDecoder, valueDecoder))
        }
      }
    }
  }
}

private[kafka]
object KafkaRDD {
  import KafkaCluster.LeaderOffset

  /**
   * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
   * configuration parameters</a>.
   *   Requires "metadata.broker.list" or "bootstrap.servers" to be set with Kafka broker(s),
   *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
   * @param fromOffsets per-topic/partition Kafka offsets defining the (inclusive)
   *  starting point of the batch
   * @param untilOffsets per-topic/partition Kafka offsets defining the (exclusive)
   *  ending point of the batch
   * @param messageHandler function for translating each message into the desired type
   */
  def apply[
    K: ClassTag,
    V: ClassTag,
    U <: Decoder[_]: ClassTag,
    T <: Decoder[_]: ClassTag,
    R: ClassTag](
      sc: SparkContext,
      kafkaParams: Map[String, String],
      fromOffsets: Map[TopicAndPartition, Long],
      untilOffsets: Map[TopicAndPartition, LeaderOffset],
      messageHandler: MessageAndMetadata[K, V] => R
    ): KafkaRDD[K, V, U, T, R] = {
    val leaders = untilOffsets.map { case (tp, lo) =>
      tp -> (lo.host, lo.port)
    }.toMap

    val offsetRanges = fromOffsets.map { case (tp, fo) =>
      val uo = untilOffsets(tp)
      OffsetRange(tp.topic, tp.partition, fo, uo.offset)
    }.toArray

    new KafkaRDD[K, V, U, T, R](sc, kafkaParams, offsetRanges, leaders, messageHandler)
  }
}
The Kafka direct approach is the classic replay-based fault-tolerance approach.

Keep in mind that every form of fault tolerance costs some performance. Not every application is intolerant of data loss — often a bounded fraction of loss (say 5%) is acceptable — so when the completeness requirement is low, configuring extra fault tolerance may simply be unnecessary.

One more point: if 1 block out of 1000 is lost, that still counts as loss, and under the current mechanism all of the blocks must be re-read and re-processed. That granularity is too coarse; it could be improved by modifying the source code of the direct Kafka path.
This session is based on 王家林's course "源码版本定制发行班" — many thanks to 王家林!

Everyone is welcome to exchange technical knowledge. Let's learn together and improve together!