第15课:Spark Streaming源码解读之No Receivers彻底思考
来源:互联网 发布:矩阵乘法计算公式 编辑:程序博客网 时间:2024/05/01 03:56
本期内容:
Direct Access
Kafka
前面有几期我们讲了带Receiver的Spark Streaming 应用的相关源码解读。但是现在开发Spark Streaming的应用越来越多的采用No Receivers(Direct Approach)的方式。No Receivers具有更强的控制度,语义一致性。其实No Receivers的方式更符合我们读取数据,操作数据的思路的。因为Spark 本身是一个计算框架,他底层会有数据来源,如果没有Receivers,我们直接操作数据来源,这其实是一种更自然的方式。
- 在程序代码中,通过以下方式来以Direct使用kafka:
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder]( ssc, kafkaParams, topicsSet)
- 其背后通过以下方法生成了一个DirectKafkaInputDStream:
/** * Create an input stream that directly pulls messages from Kafka Brokers * without using any receiver. This stream can guarantee that each message * from Kafka is included in transformations exactly once (see points below). * * Points to note: * - No receivers: This stream does not use any receiver. It directly queries Kafka * - Offsets: This does not use Zookeeper to store offsets. The consumed offsets are tracked * by the stream itself. For interoperability with Kafka monitoring tools that depend on * Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application. * You can access the offsets used in each batch from the generated RDDs (see * [[org.apache.spark.streaming.kafka.HasOffsetRanges]]). * - Failure Recovery: To recover from driver failures, you have to enable checkpointing * in the [[StreamingContext]]. The information on consumed offset can be * recovered from the checkpoint. See the programming guide for details (constraints, etc.). * - End-to-end semantics: This stream ensures that every records is effectively received and * transformed exactly once, but gives no guarantees on whether the transformed data are * outputted exactly once. For end-to-end exactly-once semantics, you have to either ensure * that the output operation is idempotent, or use transactions to output records atomically. * See the programming guide for more details. * * @param ssc StreamingContext object * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration"> * configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers" * to be set with Kafka broker(s) (NOT zookeeper servers), specified in * host1:port1,host2:port2 form. * If not starting from a checkpoint, "auto.offset.reset" may be set to "largest" or "smallest" * to determine where the stream starts (defaults to "largest") * @param topics Names of the topics to consume * @tparam K type of Kafka message key * @tparam V type of Kafka message value * @tparam KD type of Kafka message key decoder * @tparam VD type of Kafka message value decoder * @return DStream of (Kafka message key, Kafka message value) */ def createDirectStream[ K: ClassTag, V: ClassTag, KD <: Decoder[K]: ClassTag, VD <: Decoder[V]: ClassTag] ( ssc: StreamingContext, kafkaParams: Map[String, String], topics: Set[String] ): InputDStream[(K, V)] = { val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message) val kc = new KafkaCluster(kafkaParams) val fromOffsets = getFromOffsets(kc, kafkaParams, topics) new DirectKafkaInputDStream[K, V, KD, VD, (K, V)]( ssc, kafkaParams, fromOffsets, messageHandler) }
- 来看生成的DirectKafkaInputDStream,可见其内部确实没有像SocketInputDStream那样生成receiver,而是直接通过compute方法生成了KafkaRDD:
override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = { val untilOffsets = clamp(latestLeaderOffsets(maxRetries)) val rdd = KafkaRDD[K, V, U, T, R]( context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler) // Report the record number and metadata of this batch interval to InputInfoTracker. val offsetRanges = currentOffsets.map { case (tp, fo) => val uo = untilOffsets(tp) OffsetRange(tp.topic, tp.partition, fo, uo.offset) } val description = offsetRanges.filter { offsetRange => // Don't display empty ranges. offsetRange.fromOffset != offsetRange.untilOffset }.map { offsetRange => s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" + s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}" }.mkString("\n") // Copy offsetRanges to immutable.List to prevent from being modified by the user val metadata = Map( "offsets" -> offsetRanges.toList, StreamInputInfo.METADATA_KEY_DESCRIPTION -> description) val inputInfo = StreamInputInfo(id, rdd.count, metadata) ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo) currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset) Some(rdd) }
KafkaRDD中复写了getPartitions方法:
/** * A batch-oriented interface for consuming from Kafka. * Starting and ending offsets are specified in advance, * so that you can control exactly-once semantics. * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration"> * configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers" to be set * with Kafka broker(s) specified in host1:port1,host2:port2 form. * @param offsetRanges offset ranges that define the Kafka data belonging to this RDD * @param messageHandler function for translating each message into the desired type */private[kafka]class KafkaRDD[ K: ClassTag, V: ClassTag, U <: Decoder[_]: ClassTag, T <: Decoder[_]: ClassTag, R: ClassTag] private[spark] ( sc: SparkContext, kafkaParams: Map[String, String], val offsetRanges: Array[OffsetRange], leaders: Map[TopicAndPartition, (String, Int)], messageHandler: MessageAndMetadata[K, V] => R ) extends RDD[R](sc, Nil) with Logging with HasOffsetRanges { override def getPartitions: Array[Partition] = { offsetRanges.zipWithIndex.map { case (o, i) => val (host, port) = leaders(TopicAndPartition(o.topic, o.partition)) new KafkaRDDPartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port) }.toArray }
KafkaRDD中复写了compute方法:
override def compute(thePart: Partition, context: TaskContext): Iterator[R] = { val part = thePart.asInstanceOf[KafkaRDDPartition] assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part)) if (part.fromOffset == part.untilOffset) { log.info(s"Beginning offset ${part.fromOffset} is the same as ending offset " + s"skipping ${part.topic} ${part.partition}") Iterator.empty } else { new KafkaRDDIterator(part, context) } }
compute方法生成了KafkaRDDIterator:
private class KafkaRDDIterator( part: KafkaRDDPartition, context: TaskContext) extends NextIterator[R] { context.addTaskCompletionListener{ context => closeIfNeeded() } log.info(s"Computing topic ${part.topic}, partition ${part.partition} " + s"offsets ${part.fromOffset} -> ${part.untilOffset}") val kc = new KafkaCluster(kafkaParams) val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties]) .newInstance(kc.config.props) .asInstanceOf[Decoder[K]] val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties]) .newInstance(kc.config.props) .asInstanceOf[Decoder[V]] val consumer = connectLeader var requestOffset = part.fromOffset var iter: Iterator[MessageAndOffset] = null // The idea is to use the provided preferred host, except on task retry attempts, // to minimize number of kafka metadata requests private def connectLeader: SimpleConsumer = { if (context.attemptNumber > 0) { kc.connectLeader(part.topic, part.partition).fold( errs => throw new SparkException( s"Couldn't connect to leader for topic ${part.topic} ${part.partition}: " + errs.mkString("\n")), consumer => consumer ) } else { kc.connect(part.host, part.port) } }
KafkaRDDIterator会调用KafkaCluster的connect方法,KafkaCluster的connect方法返回了一个 SimpleConsumer,如果想自定义控制kafka消息的消费,则可自定义Kafka的consumer。
我们再重新思考有Receiver和No Receiver的Spark Streaming应用 Direct访问的好处:
1. 不需要缓存,不会出现OOM等问题(数据缓存在Kafka中)
2. 如果采用Receiver的方式,Receiver和Worker上Executor绑定了,不方便做分布式(配置一下也可以做)。如果采用Direct的方式,直接是RDD操作,数据默认分布在多个Executor上,天然就是分布式的。
3. 数据消费的问题,在实际操作的时候,如果采用Receiver的方式,如果数据操作来不及消费,Delay多次之后,Spark Streaming程序有可能崩溃。如果是Direct的方式,就不会。
4. 完全的语义一致性,不会重复消费,且只被消费一次。
- 第15课:Spark Streaming源码解读之No Receivers彻底思考 本节课分享Spark Streaming源码解读之No Receivers彻底思考,企业级开发Spark Strea
- Spark定制班第15课:Spark Streaming源码解读之No Receivers彻底思考
- 第15课:Spark Streaming源码解读之No Receivers彻底思考
- 第15课:Spark Streaming源码解读之No Receivers彻底思考
- 第15课:Spark Streaming源码解读之No Receivers彻底思考
- 15、Spark Streaming源码解读之No Receivers彻底思考
- 15. Spark Streaming源码解读之No Receivers彻底思考
- Spark 定制版:015~Spark Streaming源码解读之No Receivers彻底思考
- 第15课:spark streaming源码解读之No Receives彻底思考
- Spark Streaming源码解读之No Receivers详解
- Spark学习笔记(15)Spark Streaming源码解读之No Receivers
- 第8课:Spark Streaming源码解读之RDD生成全生命周期彻底研究和思考
- 第8课:Spark Streaming源码解读之RDD生成全生命周期彻底研究和思考
- 第8课:Spark Streaming源码解读之RDD生成全生命周期彻底研究和思考
- 第8课:Spark Streaming源码解读之RDD生成全生命周期彻底研究和思考
- 第8课:Spark Streaming源码解读之RDD生成全生命周期彻底研究和思考
- Spark定制班第8课:Spark Streaming源码解读之RDD生成全生命周期彻底研究和思考
- 第9课:Spark Streaming源码解读之Receiver在Driver的精妙实现全生命周期彻底研究和思考
- Java千百问_03基础语法(016)_main方法是什么
- Bulls and Cows
- 计蒜之道 青云的机房组网方案(中等)
- AsyncTask的使用
- Opensips 进程结构
- 第15课:Spark Streaming源码解读之No Receivers彻底思考
- 什么是线程执行器Executor
- 318. Maximum Product of Word Lengths
- 【连载】大话系统架构决策 - 伸缩性
- 第七届ACM山东省赛-D Swiss-system tournament
- 虚拟机的文件拖不出来
- Object.create() --- javascript一种新的对象创建方式
- 一些优秀项目的体会
- 位运算求相反数