Lesson 15: Spark Streaming Source Code Analysis: A Thorough Look at the No Receivers Approach


Topics covered in this session:
Direct Access
Kafka

In several previous sessions we walked through the source code of Receiver-based Spark Streaming applications. Today, however, more and more Spark Streaming applications are written with the No Receivers (Direct Approach) style. No Receivers gives you tighter control and stronger semantic consistency. It also matches more closely how we naturally read and operate on data: Spark itself is a computation framework that sits on top of data sources, and when there are no Receivers we operate on the data source directly, which is the more natural way to work.

1. In application code, Kafka is consumed in Direct mode as follows (a fuller, runnable usage sketch follows this walkthrough):
KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)
2. Behind the scenes, the following method creates a DirectKafkaInputDStream:
/**
 * Create an input stream that directly pulls messages from Kafka Brokers
 * without using any receiver. This stream can guarantee that each message
 * from Kafka is included in transformations exactly once (see points below).
 *
 * Points to note:
 *  - No receivers: This stream does not use any receiver. It directly queries Kafka
 *  - Offsets: This does not use Zookeeper to store offsets. The consumed offsets are tracked
 *    by the stream itself. For interoperability with Kafka monitoring tools that depend on
 *    Zookeeper, you have to update Kafka/Zookeeper yourself from the streaming application.
 *    You can access the offsets used in each batch from the generated RDDs (see
 *    [[org.apache.spark.streaming.kafka.HasOffsetRanges]]).
 *  - Failure Recovery: To recover from driver failures, you have to enable checkpointing
 *    in the [[StreamingContext]]. The information on consumed offset can be
 *    recovered from the checkpoint. See the programming guide for details (constraints, etc.).
 *  - End-to-end semantics: This stream ensures that every records is effectively received and
 *    transformed exactly once, but gives no guarantees on whether the transformed data are
 *    outputted exactly once. For end-to-end exactly-once semantics, you have to either ensure
 *    that the output operation is idempotent, or use transactions to output records atomically.
 *    See the programming guide for more details.
 *
 * @param ssc StreamingContext object
 * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
 *   configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers"
 *   to be set with Kafka broker(s) (NOT zookeeper servers), specified in
 *   host1:port1,host2:port2 form.
 *   If not starting from a checkpoint, "auto.offset.reset" may be set to "largest" or "smallest"
 *   to determine where the stream starts (defaults to "largest")
 * @param topics Names of the topics to consume
 * @tparam K type of Kafka message key
 * @tparam V type of Kafka message value
 * @tparam KD type of Kafka message key decoder
 * @tparam VD type of Kafka message value decoder
 * @return DStream of (Kafka message key, Kafka message value)
 */
def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag] (
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      topics: Set[String]
  ): InputDStream[(K, V)] = {
    val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
    val kc = new KafkaCluster(kafkaParams)
    val fromOffsets = getFromOffsets(kc, kafkaParams, topics)
    new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
      ssc, kafkaParams, fromOffsets, messageHandler)
  }
3. Looking at the generated DirectKafkaInputDStream, we can see that it does not create a receiver the way SocketInputDStream does; instead, its compute method directly produces a KafkaRDD:
override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
  val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
  val rdd = KafkaRDD[K, V, U, T, R](
    context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)

  // Report the record number and metadata of this batch interval to InputInfoTracker.
  val offsetRanges = currentOffsets.map { case (tp, fo) =>
    val uo = untilOffsets(tp)
    OffsetRange(tp.topic, tp.partition, fo, uo.offset)
  }
  val description = offsetRanges.filter { offsetRange =>
    // Don't display empty ranges.
    offsetRange.fromOffset != offsetRange.untilOffset
  }.map { offsetRange =>
    s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
      s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
  }.mkString("\n")
  // Copy offsetRanges to immutable.List to prevent from being modified by the user
  val metadata = Map(
    "offsets" -> offsetRanges.toList,
    StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
  val inputInfo = StreamInputInfo(id, rdd.count, metadata)
  ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

  currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
  Some(rdd)
}
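To put steps 1 to 3 in context, here is a minimal, self-contained usage sketch (not taken from the course). The broker addresses, topic name, batch interval and the spark.streaming.kafka.maxRatePerPartition value are illustrative assumptions; the offset access follows the HasOffsetRanges pattern mentioned in the scaladoc above.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectKafkaDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("DirectKafkaDemo")
      // Optional rate limit: cap the records read per Kafka partition per second.
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
    val ssc = new StreamingContext(conf, Seconds(5))

    // "metadata.broker.list" points at Kafka brokers, NOT ZooKeeper.
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "broker1:9092,broker2:9092", // hypothetical brokers
      "auto.offset.reset" -> "smallest")
    val topicsSet = Set("test-topic") // hypothetical topic

    val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    directStream.foreachRDD { rdd =>
      // The RDDs produced by the direct stream implement HasOffsetRanges,
      // so the offsets consumed in each batch are visible to the application.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offsetRanges.foreach { o =>
        println(s"${o.topic} ${o.partition} offsets ${o.fromOffset} -> ${o.untilOffset}")
      }
      println(s"records in this batch: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}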

KafkaRDD overrides the getPartitions method:

/**
 * A batch-oriented interface for consuming from Kafka.
 * Starting and ending offsets are specified in advance,
 * so that you can control exactly-once semantics.
 * @param kafkaParams Kafka <a href="http://kafka.apache.org/documentation.html#configuration">
 * configuration parameters</a>. Requires "metadata.broker.list" or "bootstrap.servers" to be set
 * with Kafka broker(s) specified in host1:port1,host2:port2 form.
 * @param offsetRanges offset ranges that define the Kafka data belonging to this RDD
 * @param messageHandler function for translating each message into the desired type
 */
private[kafka]
class KafkaRDD[
  K: ClassTag,
  V: ClassTag,
  U <: Decoder[_]: ClassTag,
  T <: Decoder[_]: ClassTag,
  R: ClassTag] private[spark] (
    sc: SparkContext,
    kafkaParams: Map[String, String],
    val offsetRanges: Array[OffsetRange],
    leaders: Map[TopicAndPartition, (String, Int)],
    messageHandler: MessageAndMetadata[K, V] => R
  ) extends RDD[R](sc, Nil) with Logging with HasOffsetRanges {

  override def getPartitions: Array[Partition] = {
    offsetRanges.zipWithIndex.map { case (o, i) =>
        val (host, port) = leaders(TopicAndPartition(o.topic, o.partition))
        new KafkaRDDPartition(i, o.topic, o.partition, o.fromOffset, o.untilOffset, host, port)
    }.toArray
  }
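Because getPartitions builds exactly one KafkaRDDPartition per OffsetRange, every Kafka (topic, partition, offset range) becomes one Spark partition. The batch-oriented KafkaUtils.createRDD API makes that mapping easy to observe; the broker, topic and offsets below are illustrative assumptions, not values from the course.

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object KafkaRDDPartitionDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("KafkaRDDPartitionDemo"))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // hypothetical broker

    // Two offset ranges over two Kafka partitions of a hypothetical topic.
    val offsetRanges = Array(
      OffsetRange("test-topic", 0, 0L, 100L),
      OffsetRange("test-topic", 1, 0L, 50L))

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    // One Spark partition per OffsetRange, exactly as KafkaRDD.getPartitions builds them.
    println(s"number of partitions: ${rdd.partitions.length}") // expected: 2

    sc.stop()
  }
}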

KafkaRDD also overrides the compute method:

override def compute(thePart: Partition, context: TaskContext): Iterator[R] = {
  val part = thePart.asInstanceOf[KafkaRDDPartition]
  assert(part.fromOffset <= part.untilOffset, errBeginAfterEnd(part))
  if (part.fromOffset == part.untilOffset) {
    log.info(s"Beginning offset ${part.fromOffset} is the same as ending offset " +
      s"skipping ${part.topic} ${part.partition}")
    Iterator.empty
  } else {
    new KafkaRDDIterator(part, context)
  }
}

The compute method creates a KafkaRDDIterator:

private class KafkaRDDIterator(
    part: KafkaRDDPartition,
    context: TaskContext) extends NextIterator[R] {

  context.addTaskCompletionListener{ context => closeIfNeeded() }

  log.info(s"Computing topic ${part.topic}, partition ${part.partition} " +
    s"offsets ${part.fromOffset} -> ${part.untilOffset}")

  val kc = new KafkaCluster(kafkaParams)
  val keyDecoder = classTag[U].runtimeClass.getConstructor(classOf[VerifiableProperties])
    .newInstance(kc.config.props)
    .asInstanceOf[Decoder[K]]
  val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
    .newInstance(kc.config.props)
    .asInstanceOf[Decoder[V]]
  val consumer = connectLeader
  var requestOffset = part.fromOffset
  var iter: Iterator[MessageAndOffset] = null

  // The idea is to use the provided preferred host, except on task retry attempts,
  // to minimize number of kafka metadata requests
  private def connectLeader: SimpleConsumer = {
    if (context.attemptNumber > 0) {
      kc.connectLeader(part.topic, part.partition).fold(
        errs => throw new SparkException(
          s"Couldn't connect to leader for topic ${part.topic} ${part.partition}: " +
            errs.mkString("\n")),
        consumer => consumer
      )
    } else {
      kc.connect(part.host, part.port)
    }
  }

KafkaRDDIterator calls KafkaCluster's connect method, which returns a SimpleConsumer. If you want custom control over how Kafka messages are consumed, you can plug in your own consumption logic, as sketched below.
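For that kind of control, the Direct API itself already provides a hook: an overload of createDirectStream accepts explicit starting offsets and a messageHandler, so the application decides where consumption begins and how each MessageAndMetadata becomes a record. A minimal sketch follows; the topic, offsets and the (partition, offset, value) handler are assumptions for illustration.

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils

object CustomDirectStream {
  // Start the stream from offsets the application manages itself (e.g. restored
  // from external storage) and keep each record's partition/offset next to its value.
  def streamFromStoredOffsets(
      ssc: StreamingContext,
      kafkaParams: Map[String, String]): InputDStream[(Int, Long, String)] = {
    // Hypothetical starting offsets for a hypothetical topic.
    val fromOffsets = Map(
      TopicAndPartition("test-topic", 0) -> 1234L,
      TopicAndPartition("test-topic", 1) -> 5678L)

    // Custom handler: expose (partition, offset, value) instead of (key, value).
    val messageHandler = (mmd: MessageAndMetadata[String, String]) =>
      (mmd.partition, mmd.offset, mmd.message())

    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (Int, Long, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)
  }
}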

Let us now step back and compare Receiver-based and No Receiver Spark Streaming applications, and summarize the benefits of Direct access:
1. No buffering inside Spark is needed, so problems such as OOM do not arise (the data stays cached in Kafka).
2. With the Receiver approach, a Receiver is bound to an Executor on a Worker, which makes distributed consumption inconvenient (though it can be achieved with extra configuration). With the Direct approach you work with RDDs directly, the data is spread across multiple Executors by default, and consumption is naturally distributed.
3. On the consumption side, with the Receiver approach, if processing cannot keep up with the incoming data and delays accumulate over several batches, the Spark Streaming application may crash. This does not happen with the Direct approach, since each batch pulls only the offset range it is scheduled to process.
4. Much stronger semantic consistency: every record is received and transformed exactly once, with no duplicate consumption. As the scaladoc quoted above notes, end-to-end exactly-once output additionally requires the output operation to be idempotent or transactional; a sketch of the transactional pattern follows this list.
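As a follow-up to point 4, the scaladoc quoted earlier recommends either an idempotent output operation or writing records and offsets atomically in one transaction. The sketch below only shows the shape of the transactional pattern; saveResultsAndOffsetsInOneTransaction is a hypothetical stand-in for whatever transactional store you use, and the word-count logic is just an example workload.

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

object ExactlyOnceOutput {
  // Hypothetical transactional sink: persists the batch results together with the
  // offsets that produced them in a single transaction, so a replayed batch can be
  // detected (its offsets are already stored) and skipped.
  def saveResultsAndOffsetsInOneTransaction(
      results: Array[(String, Long)],
      offsets: Array[OffsetRange]): Unit = {
    // e.g. BEGIN; INSERT results; UPSERT offsets; COMMIT (left to the concrete store)
  }

  def writeExactlyOnce(directStream: InputDStream[(String, String)]): Unit = {
    directStream.foreachRDD { rdd: RDD[(String, String)] =>
      // Read the offsets before any transformation: only the RDD produced by the
      // direct stream itself implements HasOffsetRanges.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val counts = rdd.map { case (_, value) => (value, 1L) }.reduceByKey(_ + _).collect()
      saveResultsAndOffsetsInOneTransaction(counts, offsetRanges)
    }
  }
}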
