Spark Streaming Kafka: extending the source to specify start and end offsets


These days most Spark Streaming jobs connect to Kafka through createDirectStream. The official docs recommend it over the older receiver-based stream, which kept consumer offsets in ZooKeeper and had to go back to ZooKeeper for the offset on every request.

The newer createDirectStream takes a different approach: instead of leaning on ZooKeeper, the direct stream computes the offset range for each batch itself and tracks the consumed offsets inside Spark (in the DStream and its checkpoints), which is exactly what lets us control where consumption starts and ends.

createDirectStream has four overloads.

def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag] (
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    topics: Set[String]): InputDStream[(K, V)] = {
  val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
  val kc = new KafkaCluster(kafkaParams)
  val fromOffsets = getFromOffsets(kc, kafkaParams, topics)
  new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
    ssc, kafkaParams, fromOffsets, messageHandler)
}


This is the overload with the fewest parameters: a StreamingContext instance, the Kafka parameters, and a set of topics.
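For context, a minimal, hedged sketch of calling this overload (the broker list, batch interval and topic name below are placeholders of my own):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("direct-stream-demo")
val ssc = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val topics = Set("my_topic")

// With StringDecoder for key and value, the result is an InputDStream[(String, String)]
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)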


Start with this overload. The first line of the body is an anonymous function: Kafka records are produced as key/value pairs and arrive as key/value pairs on the consumer side, and this function (the messageHandler) decides what each record becomes in the resulting stream. The default shown above simply returns the (K, V) tuple, so this overload of createDirectStream yields a stream of (K, V) elements.
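As a small illustration of my own (not from the original code), here is a messageHandler that keeps topic, partition and offset alongside the value; it would be passed to one of the overloads that accept a handler:

import kafka.message.MessageAndMetadata

// Stream elements become (topic, partition, offset, value) instead of (key, value)
val handler = (mmd: MessageAndMetadata[String, String]) =>
  (mmd.topic, mmd.partition, mmd.offset, mmd.message())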

The second line builds a KafkaCluster object, which wraps a number of helpers: fetching the latest offsets of a topic, partition metadata, and so on.

The third line is the crucial one. As the variable name suggests, it obtains the starting offsets through getFromOffsets; let's look at that method:

private[kafka] def getFromOffsets(
    kc: KafkaCluster,
    kafkaParams: Map[String, String],
    topics: Set[String]
  ): Map[TopicAndPartition, Long] = {
  // Read auto.offset.reset from the Kafka config: start from the earliest or the latest offsets
  val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)
  val result = for {
    topicPartitions <- kc.getPartitions(topics).right
    leaderOffsets <- (if (reset == Some("smallest")) {
      kc.getEarliestLeaderOffsets(topicPartitions)
    } else {
      kc.getLatestLeaderOffsets(topicPartitions)
    }).right
  } yield {
    leaderOffsets.map { case (tp, lo) =>
      (tp, lo.offset)
    }
  }
  KafkaCluster.checkErrors(result)
}
In short, this method looks at the Kafka configuration to decide whether the starting offsets should be the latest or the earliest available.
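In other words, whether the very first batch starts at the beginning or the end of each partition hangs on a single Kafka parameter (0.8 consumer setting names shown; the broker address is a placeholder):

val kafkaParams = Map(
  "metadata.broker.list" -> "broker1:9092",
  // "smallest" = start from the earliest retained offset; "largest" (the default) = start from the latest
  "auto.offset.reset"    -> "smallest")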
As mentioned above, createDirectStream has four overloads; the others let you pass explicit starting offsets and a custom message handler, in both Scala and Java flavours.
With the starting offsets in hand, the stream is initialised with them. So the stock API already lets us choose where consumption starts, as sketched below; choosing where it ends is what requires modifying part of the source.
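Here is a hedged sketch of the overload that takes an explicit fromOffsets map; the offsets and topic are placeholders, e.g. values restored from an external store:

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val fromOffsets = Map(
  TopicAndPartition("my_topic", 0) -> 1000L,
  TopicAndPartition("my_topic", 1) -> 1042L)

val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.key(), mmd.message())

val stream = KafkaUtils.createDirectStream[
  String, String, StringDecoder, StringDecoder, (String, String)](
  ssc, kafkaParams, fromOffsets, messageHandler)

The end offsets, by contrast, are recomputed internally for every batch, so to control them we have to step into the implementation, starting with how the direct stream is constructed: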
new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](ssc, kafkaParams, fromOffsets, messageHandler)
This class extends InputDStream and implements compute, which Streaming calls for each batch to decide what data the batch covers:
override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
  val untilOffsets = clamp(latestLeaderOffsets(maxRetries))
  val rdd = KafkaRDD[K, V, U, T, R](
    context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)

  // Report the record number and metadata of this batch interval to InputInfoTracker.
  val offsetRanges = currentOffsets.map { case (tp, fo) =>
    val uo = untilOffsets(tp)
    OffsetRange(tp.topic, tp.partition, fo, uo.offset)
  }
  val description = offsetRanges.filter { offsetRange =>
    // Don't display empty ranges.
    offsetRange.fromOffset != offsetRange.untilOffset
  }.map { offsetRange =>
    s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
      s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
  }.mkString("\n")

  // Copy offsetRanges to immutable.List to prevent from being modified by the user
  val metadata = Map(
    "offsets" -> offsetRanges.toList,
    StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
  val inputInfo = StreamInputInfo(id, rdd.count, metadata)
  ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)

  currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
  Some(rdd)
}
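As an aside, the offsetRanges assembled here are the same ones applications read back per batch with the standard pattern below (reusing the stream from one of the sketches above); it is unrelated to the modification, but handy for verifying which range was actually consumed:

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // Must be called on the KafkaRDD itself, before any transformation
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"${r.topic}-${r.partition}: ${r.fromOffset} -> ${r.untilOffset}")
  }
}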

The key line is the very first one: val untilOffsets = clamp(latestLeaderOffsets(maxRetries)). As the name says, untilOffsets is where the batch stops reading; the rest of the method only builds the KafkaRDD and reports metadata. Before clamp runs, latestLeaderOffsets is executed, and that method fetches the latest offsets from the partition leaders:
protected final def latestLeaderOffsets(retries: Int): Map[TopicAndPartition, LeaderOffset] = {
  val o = kc.getLatestLeaderOffsets(currentOffsets.keySet)
  // Either.fold would confuse @tailrec, do it manually
  if (o.isLeft) {
    val err = o.left.get.toString
    if (retries <= 0) {
      throw new SparkException(err)
    } else {
      log.error(err)
      Thread.sleep(kc.config.refreshLeaderBackoffMs)
      latestLeaderOffsets(retries - 1)
    }
  } else {
    o.right.get
  }
}
val o = kc.getLatestLeaderOffsets(currentOffsets.keySet) 
This line is what actually determines the end offsets; the remainder of the method only handles failures while fetching them: on a Left the call is retried, and once retries <= 0 a SparkException(err) is thrown.
So to pin the end offsets to values of our choosing, all we have to change is what latestLeaderOffsets returns; a simplified sketch of the idea follows, and the full classes are attached at the end of the post.
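The essence of the change fits in a few lines: take the leader offsets Kafka reports and cap each one at the end offset we want. The sketch below is a simplified version of that idea, assumed to live in the same org.apache.spark.streaming.kafka package as the attached classes; endOffsets is a hypothetical caller-supplied map, and the attached class handles the out-of-range case slightly differently.

import kafka.common.TopicAndPartition
import org.apache.spark.streaming.kafka.KafkaCluster.LeaderOffset

// Cap the "latest" leader offsets at the desired end offsets, so clamp() can never reach past them
def capAtEndOffsets(
    latest: Map[TopicAndPartition, LeaderOffset],
    endOffsets: Map[TopicAndPartition, Long]): Map[TopicAndPartition, LeaderOffset] =
  latest.map { case (tp, lo) =>
    tp -> lo.copy(offset = math.min(lo.offset, endOffsets.getOrElse(tp, lo.offset)))
  }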

Note: the modified code has to call some Streaming-internal members. Fortunately Spark Streaming is fairly accommodating here: those members are only private to the streaming package (private[streaming]), so anything located inside that package can use them.

That is why, when extending the source, the new classes are created under the org.apache.spark.streaming package (the attached classes live in org.apache.spark.streaming.kafka).

This is my first article, so if anything is unclear, feel free to discuss it with me!

Below are the two Scala classes I modified; reading them alongside the explanation above should make things easier to follow.






package org.apache.spark.streaming.kafka


import java.util.Locale


import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.Decoder


import scala.reflect.ClassTag
import org.apache.spark.SparkException
import org.apache.spark.streaming.kafka.KafkaCluster.{Err, LeaderOffset}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.StreamingContext
import org.slf4j.LoggerFactory


/**
  * KafkaManager: a helper around KafkaCluster that exposes createDirectStream variants
  * accepting a caller-supplied map of end offsets.
  *
  * Author: A_p
  *
  * @version 1.0.0
  */
class KafkaManager(val kafkaParams: Map[String, String]) extends Serializable {
  val logger = LoggerFactory.getLogger(classOf[KafkaManager])
  private val kc = new KafkaCluster(kafkaParams)






  /*
    Determine the starting offsets automatically (from auto.offset.reset: earliest or latest)
    and build the stream with a caller-supplied end-offset map.
   */
  def createDirectStreamZYKJ[
  K: ClassTag,
  V: ClassTag,
  KD <: Decoder[K] : ClassTag,
  VD <: Decoder[V] : ClassTag
  ](ssc: StreamingContext, kafkaParams: Map[String, String], topics: Set[String], resultMap: => Map[TopicAndPartition, Long]): InputDStream[(K, V)] = {
    val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
    val fromOffsets = getFromOffsets(kc, kafkaParams, topics)


    createDirectStreamZYKJ[K, V, KD, VD](ssc, kafkaParams, fromOffsets, topics, resultMap)
  }


  /*
    Build the stream with explicitly specified starting offsets and end offsets.
   */
  def createDirectStreamZYKJ[
  K: ClassTag,
  V: ClassTag,
  KD <: Decoder[K] : ClassTag,
  VD <: Decoder[V] : ClassTag
  ](ssc: StreamingContext, kafkaParams: Map[String, String], fromOffset: Map[TopicAndPartition, Long], topics: Set[String], untilOffsets : => Map[TopicAndPartition, Long]): InputDStream[(K, V)] = {
    val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
    new Customize_DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
      ssc, kafkaParams, fromOffset, messageHandler, untilOffsets)
  }


  /**
    * Determine the starting offsets for the given topics from auto.offset.reset
    * ("smallest" = earliest available, otherwise latest).
    *
    * @param kc          KafkaCluster helper
    * @param kafkaParams Kafka configuration
    * @param topics      topics to read
    * @return starting offset for each TopicAndPartition
    */
  private[this] def getFromOffsets(kc: KafkaCluster, kafkaParams: Map[String, String], topics: Set[String]): Map[TopicAndPartition, Long] = {
    val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase(Locale.ROOT))
    val result = for {
      topicPartitions <- kc.getPartitions(topics).right
      leaderOffsets <- (if (reset == Some("smallest")) {
        kc.getEarliestLeaderOffsets(topicPartitions)
      } else {
        kc.getLatestLeaderOffsets(topicPartitions)
      }).right
    } yield {
      leaderOffsets.map { case (tp, lo) =>
        (tp, lo.offset)


      }
    }
    KafkaCluster.checkErrors(result)


  }


  /** Fetch the current latest offset of every partition of the given topics. */
  def getTopicPartitionInfo(topic: Set[String]): Map[TopicAndPartition, Long] = {
    val leaderOffsets = kc.getLatestLeaderOffsets(kc.getPartitions(topic).right.get)
    leaderOffsets.right.get.map {
      lines =>
        lines._1 -> lines._2.offset
    }
  }


}




package org.apache.spark.streaming.kafka


import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.Decoder
import org.apache.spark.{Logging, SparkException}
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.{DStreamCheckpointData, InputDStream}
import org.apache.spark.streaming.kafka.KafkaCluster.{Err, LeaderOffset}
import org.apache.spark.streaming.scheduler.{RateController, StreamInputInfo}
import org.apache.spark.streaming.scheduler.rate.RateEstimator
import org.slf4j.LoggerFactory


import scala.annotation.tailrec
import scala.collection.mutable
import scala.reflect.ClassTag


/**
  * Customize_DirectKafkaInputDStream: a copy of DirectKafkaInputDStream whose
  * latestLeaderOffsets honours a caller-supplied map of end offsets (the by-name
  * resultMap parameter), so that each partition stops at a chosen offset.
  *
  * Author: A_p
  *
  * @version 1.0.0
  */


private[streaming] class Customize_DirectKafkaInputDStream[
K: ClassTag,
V: ClassTag,
U <: Decoder[K] : ClassTag,
T <: Decoder[V] : ClassTag,
R: ClassTag](
    _ssc: StreamingContext,
    val kafkaParams: Map[String, String],
    val fromOffsets: Map[TopicAndPartition, Long],
    messageHandler: MessageAndMetadata[K, V] => R,
    resultMap: => Map[TopicAndPartition, Long]
  ) extends InputDStream[R](_ssc) with Logging {
  val maxRetries = context.sparkContext.getConf.getInt(
    "spark.streaming.kafka.maxRetries", 1)
  val logger = LoggerFactory.getLogger(this.getClass.getName)


  private[streaming] override def name: String = s"Kafka direct stream [$id]"


  protected[streaming] override val checkpointData =
    new DirectKafkaInputDStreamCheckpointData


  override def start(): Unit = {}


  override def stop(): Unit = {}


  override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
    val leaderOffsets: Map[TopicAndPartition, LeaderOffset] = latestLeaderOffsets(maxRetries)


    val untilOffsets = clamp(leaderOffsets)


    // Log the end offset chosen for each partition (kept at ERROR level so it stands out in the driver log)
    untilOffsets.foreach { f =>
      logger.error(s"end offset: (${f._1}, ${f._2.offset})")
    }


    //TODO connect kafkaStream
    val rdd = KafkaRDD[K, V, U, T, R](context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)


    val offsetRanges = currentOffsets.map { case (tp, fo) =>
      val uo = untilOffsets(tp)
      OffsetRange(tp.topic, tp.partition, fo, uo.offset)
    }
    val description = offsetRanges.filter { offsetRange =>
      offsetRange.fromOffset != offsetRange.untilOffset
    }.map { offsetRange =>
      s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
        s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
    }.mkString("\n")


    // Copy offsetRanges to an immutable List so the user cannot modify them
    val metadata = Map(
      "offsets" -> offsetRanges.toList,
      StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)


    val inputInfo = StreamInputInfo(id, rdd.count, metadata)


    ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)


    currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)


    Some(rdd)




  }




  override protected[streaming] val rateController: Option[RateController] = {
    if (RateController.isBackPressureEnabled(ssc.conf)) {
      Some(new DirectKafkaRateController(id,
        RateEstimator.create(ssc.conf, context.graph.batchDuration)))
    } else {
      None
    }
  }


  protected val kc = new KafkaCluster(kafkaParams)


  private val maxRateLimitPerPartition: Long = context.sparkContext.getConf.getLong(
    "spark.streaming.kafka.maxRatePerPartition", 0)




  /**
    * Backpressure: based on the rate estimated from recent batches, optionally cap how many
    * messages each partition contributes to this batch; anything above the cap is left for
    * later batches to relieve pressure on Streaming.
    *
    * @param offsets latest offset per partition
    * @return per-partition message cap for this batch, or None if no limit applies
    */
  protected[streaming] def maxMessagesPerPartition(offsets: Map[TopicAndPartition, Long]): Option[Map[TopicAndPartition, Long]] = {


    val estimatedRateLimit = rateController.map(_.getLatestRate())




    val effectiveRateLimitPerPartition = estimatedRateLimit.filter(_ > 0) match {
      case Some(rate) =>
        val lagPerPartition = offsets.map { case (tp, offset) =>
          tp -> Math.max(offset - currentOffsets(tp), 0)
        }
        val totalLag = lagPerPartition.values.sum


        lagPerPartition.map { case (tp, lag) =>
          val backpressureRate = Math.round(lag / totalLag.toFloat * rate)
          tp -> (if (maxRateLimitPerPartition > 0) {
            Math.min(backpressureRate, maxRateLimitPerPartition)
          } else backpressureRate)
        }
      case None => offsets.map { case (tp, offset) => tp -> maxRateLimitPerPartition }
    }


    if (effectiveRateLimitPerPartition.values.sum > 0) {
      val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
      Some(effectiveRateLimitPerPartition.map {
        case (tp, limit) => tp -> (secsPerBatch * limit).toLong
      })
    } else {
      None
    }
  }


  protected var currentOffsets = fromOffsets


  @tailrec
  protected final def latestLeaderOffsets(retries: Int): Map[TopicAndPartition, LeaderOffset] = {
    val oo: Either[Err, Map[TopicAndPartition, LeaderOffset]] = kc.getLatestLeaderOffsets(currentOffsets.keySet)


    // Log the start offsets of this batch and the latest offsets reported by the leaders
    this.currentOffsets.foreach { fo =>
      logger.error(s"start offset: $fo")
    }
    oo.right.foreach { fo =>
      fo.foreach { of =>
        logger.error(s"latest offset: (${of._1}, ${of._2.offset})")
      }
    }




    // The actual modification: instead of returning the leaders' latest offsets directly,
    // substitute the caller-supplied end offset (resultMap) for each partition. If the
    // requested end offset lies beyond what the leader currently holds, fall back to
    // currentOffsets for that partition, i.e. read nothing from it in this batch.
    val o = oo.right.map { lines =>
      lines.map { case (tp, lo) =>
        val until =
          if (resultMap(tp) > lo.offset) currentOffsets(tp)
          else resultMap(tp)
        (tp, KafkaCluster.LeaderOffset(lo.host, lo.port, until))
      }
    }




    if (o.isLeft) {
      val err = o.left.get.toString
      if (retries <= 0) {
        throw new SparkException(err)
      } else {
        logError(err)
        Thread.sleep(kc.config.refreshLeaderBackoffMs)
        latestLeaderOffsets(retries - 1)
      }
    } else {
      o.right.get
    }
  }


  /**
    * Apply the per-partition message cap (backpressure / spark.streaming.kafka.maxRatePerPartition)
    * on top of the candidate end offsets.
    *
    * @param leaderOffsets candidate end offset per partition
    * @return the end offsets actually used for this batch
    */
  protected def clamp(leaderOffsets: Map[TopicAndPartition, LeaderOffset]): Map[TopicAndPartition, LeaderOffset] = {
    val offsets = leaderOffsets.mapValues(lo => lo.offset)


    maxMessagesPerPartition(offsets).map { mmp =>
      mmp.map {
        case (tp, messages) =>


          val lo = leaderOffsets(tp)
          tp -> lo.copy(offset = Math.min(currentOffsets(tp) + messages, lo.offset))
      }
    }.getOrElse(leaderOffsets)
  }


  /**
    * Checkpoint support: restore the RDDs that existed before a failure and rebuild
    * the stream from the offsets recorded in them.
    */
  private[streaming]
  class DirectKafkaInputDStreamCheckpointData extends DStreamCheckpointData(this) {
    def batchForTime: mutable.HashMap[Time, Array[(String, Int, Long, Long)]] = {
      data.asInstanceOf[mutable.HashMap[Time, Array[OffsetRange.OffsetRangeTuple]]]
    }


    override def update(time: Time): Unit = {
      batchForTime.clear()
      generatedRDDs.foreach { kv =>
        val a = kv._2.asInstanceOf[KafkaRDD[K, V, U, T, R]].offsetRanges.map(_.toTuple).toArray
        batchForTime += kv._1 -> a
      }
    }


    override def cleanup(time: Time): Unit = {}


    override def restore(): Unit = {
      val topics = fromOffsets.keySet
      val leaders = KafkaCluster.checkErrors(kc.findLeaders(topics))


      batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
        logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}")
        generatedRDDs += t -> new KafkaRDD[K, V, U, T, R](
          context.sparkContext, kafkaParams, b.map(OffsetRange(_)), leaders, messageHandler)
      }
    }
  }


  /**
    * A RateController that retrieves the current rate from the RateEstimator.
    */
  private[streaming] class DirectKafkaRateController(id: Int, estimator: RateEstimator)
    extends RateController(id, estimator) {
    override def publish(rate: Long): Unit = ()
  }


}
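Finally, a hedged sketch of how a driver might use the two classes: snapshot the latest offsets once at startup via getTopicPartitionInfo and hand them to createDirectStreamZYKJ as the end offsets for every batch. The broker list, topic and batch interval are placeholders, and EndOffsetDemo is a hypothetical object of my own.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaManager

object EndOffsetDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("end-offset-demo")
    val ssc = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092",
      "auto.offset.reset"    -> "smallest")
    val topics = Set("my_topic")

    val manager = new KafkaManager(kafkaParams)

    // Snapshot of the latest offsets taken once at startup, used as the fixed end offsets.
    // Because the parameter is by-name, a different strategy could re-evaluate it per batch.
    val endOffsets = manager.getTopicPartitionInfo(topics)

    val stream = manager.createDirectStreamZYKJ[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics, endOffsets)

    stream.foreachRDD(rdd => println(s"records in this batch: ${rdd.count()}"))

    ssc.start()
    ssc.awaitTermination()
  }
}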