Extending the Spark Streaming Kafka source to specify start and end offsets
These days nearly everyone connects Spark Streaming to Kafka with createDirectStream. The official docs say it beats the older receiver-based stream, which kept consumer offsets in ZooKeeper and had to hit ZooKeeper on every request to fetch them. The direct stream that replaced it drops that dependency: Spark tracks the consumed offsets itself, batch by batch, instead of going through ZooKeeper.
createDirectStream has four overloads:
def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag](
    ssc: StreamingContext,
    kafkaParams: Map[String, String],
    topics: Set[String]): InputDStream[(K, V)] = {
  val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
  val kc = new KafkaCluster(kafkaParams)
  val fromOffsets = getFromOffsets(kc, kafkaParams, topics)
  new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
    ssc, kafkaParams, fromOffsets, messageHandler)
}
This is the overload with the fewest parameters: a StreamingContext instance, the Kafka parameters, and a set of topics.
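As a point of reference, a minimal sketch of calling this overload might look like the following (the broker address, topic name, and app name are illustrative assumptions):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(
  new SparkConf().setAppName("direct-stream-demo").setMaster("local[2]"), Seconds(5))
// hypothetical broker address and topic
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("demo-topic"))
stream.map { case (k, v) => v }.print()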
Start with this method. The first line of the body is an anonymous function. Kafka messages are produced as key/value pairs and come back as key/value pairs on the consumer side, and this function decides what the stream's elements look like. The default shown above simply returns a (K, V) tuple, so calling this createDirectStream overload gives you a stream of (K, V).
The second line builds a KafkaCluster object, which wraps a number of helper methods: fetching a topic's latest offsets, its partition metadata, and so on.
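A quick sketch of what KafkaCluster offers, assuming code that lives inside the org.apache.spark.streaming.kafka package (the class's visibility is restricted in some Spark versions) and a hypothetical topic name; its query methods return an Either with errors on the Left:

val kc = new KafkaCluster(kafkaParams)
val partitions = kc.getPartitions(Set("demo-topic")).right.get    // partition metadata
val latest = kc.getLatestLeaderOffsets(partitions).right.get      // newest offset per partition
val earliest = kc.getEarliestLeaderOffsets(partitions).right.get  // oldest offset per partition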
The third line is the crucial one; as the variable name says, it fetches the starting offsets via getFromOffsets. Here is that method:
private[kafka] def getFromOffsets(
    kc: KafkaCluster,
    kafkaParams: Map[String, String],
    topics: Set[String]): Map[TopicAndPartition, Long] = {
  // read auto.offset.reset to decide whether to start from the earliest or latest offsets
  val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase)
  val result = for {
    topicPartitions <- kc.getPartitions(topics).right
    leaderOffsets <- (if (reset == Some("smallest")) {
      kc.getEarliestLeaderOffsets(topicPartitions)
    } else {
      kc.getLatestLeaderOffsets(topicPartitions)
    }).right
  } yield {
    leaderOffsets.map { case (tp, lo) =>
      (tp, lo.offset)
    }
  }
  KafkaCluster.checkErrors(result)
}
In short, this method uses the Kafka configuration to decide whether the starting offsets are the latest or the earliest available.
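So the only knob this overload exposes is auto.offset.reset. A sketch of the two choices, with an assumed broker address:

// start from the earliest offsets still held on the brokers
val fromEarliest = Map(
  "metadata.broker.list" -> "localhost:9092",
  "auto.offset.reset"    -> "smallest")
// anything else, including the default "largest", starts from the newest offsets
val fromLatest = Map("metadata.broker.list" -> "localhost:9092")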
As mentioned above, createDirectStream has four overloads; between them you can supply the starting offsets and the message handler yourself, in both Scala and Java flavors.
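For example, one of the richer overloads takes explicit starting offsets plus a message handler. A sketch, reusing ssc and kafkaParams from the sketch above, that starts topic demo-topic, partition 0 at offset 0 and keeps only the message value (the names are assumptions):

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val fromOffsets = Map(TopicAndPartition("demo-topic", 0) -> 0L)
// the stream's element type follows the handler's return type: here, String
val valueOnly = (mmd: MessageAndMetadata[String, String]) => mmd.message
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, String](
  ssc, kafkaParams, fromOffsets, valueOnly)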
At this point the starting offsets have been obtained and the stream gets initialized, so specifying the starting offset is already supported. To pin down the ending offset as well, we need to change part of the source.
new DirectKafkaInputDStream[K, V, KD, VD, (K, V)](ssc, kafkaParams, fromOffsets, messageHandler) builds the stream. The class extends InputDStream and implements compute, which Streaming calls to decide where each batch's data comes from:
override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
  val untilOffsets = clamp(latestLeaderOffsets(maxRetries))  // <-- the line to focus on
  val rdd = KafkaRDD[K, V, U, T, R](
    context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)
  // Report the record number and metadata of this batch interval to InputInfoTracker.
  val offsetRanges = currentOffsets.map { case (tp, fo) =>
    val uo = untilOffsets(tp)
    OffsetRange(tp.topic, tp.partition, fo, uo.offset)
  }
  val description = offsetRanges.filter { offsetRange =>
    // Don't display empty ranges.
    offsetRange.fromOffset != offsetRange.untilOffset
  }.map { offsetRange =>
    s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
      s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
  }.mkString("\n")
  // Copy offsetRanges to immutable.List to prevent from being modified by the user
  val metadata = Map(
    "offsets" -> offsetRanges.toList,
    StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
  val inputInfo = StreamInputInfo(id, rdd.count, metadata)
  ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
  currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
  Some(rdd)
}
Look at the first line: before clamp runs, latestLeaderOffsets is called, and, as the name suggests, it fetches the latest offsets that will become the ending offsets. So all we have to change is the return value of latestLeaderOffsets; nothing else needs touching. Here is the method:

protected final def latestLeaderOffsets(retries: Int): Map[TopicAndPartition, LeaderOffset] = {
  val o = kc.getLatestLeaderOffsets(currentOffsets.keySet)
  // Either.fold would confuse @tailrec, do it manually
  if (o.isLeft) {
    val err = o.left.get.toString
    if (retries <= 0) {
      throw new SparkException(err)
    } else {
      log.error(err)
      Thread.sleep(kc.config.refreshLeaderBackoffMs)
      latestLeaderOffsets(retries - 1)
    }
  } else {
    o.right.get
  }
}

The first line,

val o = kc.getLatestLeaderOffsets(currentOffsets.keySet)
is what determines the ending offsets. The rest of the method just handles failures while fetching them: a Left result triggers a sleep and a retry, and once retries <= 0 it throws SparkException(err).
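The same retry pattern is easy to lift out on its own. A hypothetical helper that retries any Either-returning call the way latestLeaderOffsets does:

import scala.annotation.tailrec

@tailrec
def withRetries[E, A](retries: Int, backoffMs: Long)(call: => Either[E, A]): A =
  call match {
    case Right(a) => a
    // out of retries: give up and surface the error
    case Left(err) if retries <= 0 => throw new RuntimeException(err.toString)
    // otherwise back off and try again
    case Left(_) =>
      Thread.sleep(backoffMs)
      withRetries(retries - 1, backoffMs)(call)
  }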
Note: the extension needs to call some of Streaming's internals. Fortunately Streaming is fairly lenient about this: those members are only private to the streaming package (private[streaming]), so anything inside that package can reach them. That is why the extension classes below have to live under the org.apache.spark.streaming package.
This is my first post, so if anything is unclear, feel free to discuss it with me! Below are the two Scala classes I modified; reading along with them should make everything easier to follow.
package org.apache.spark.streaming.kafka
import java.util.Locale
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.Decoder
import scala.reflect.ClassTag
import org.apache.spark.SparkException
import org.apache.spark.streaming.kafka.KafkaCluster.{Err, LeaderOffset}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.StreamingContext
import org.slf4j.LoggerFactory
/**
 * Class: KafkaManager
 * Author: A_p
 *
 * @version 1.0.0
 */
class KafkaManager(val kafkaParams: Map[String, String]) extends Serializable {
val logger = LoggerFactory.getLogger(classOf[KafkaManager])
private val kc = new KafkaCluster(kafkaParams)
/*
 * Pick up the current offsets (earliest or latest, per auto.offset.reset)
 * as the starting offsets automatically.
 */
def createDirectStreamZYKJ[
K: ClassTag,
V: ClassTag,
KD <: Decoder[K] : ClassTag,
VD <: Decoder[V] : ClassTag
](ssc: StreamingContext,
  kafkaParams: Map[String, String],
  topics: Set[String],
  resultMap: => Map[TopicAndPartition, Long]): InputDStream[(K, V)] = {
  val fromOffsets = getFromOffsets(kc, kafkaParams, topics)
  createDirectStreamZYKJ[K, V, KD, VD](ssc, kafkaParams, fromOffsets, topics, resultMap)
}
/*
 * Specify both the starting offsets and the ending offsets.
 */
def createDirectStreamZYKJ[
K: ClassTag,
V: ClassTag,
KD <: Decoder[K] : ClassTag,
VD <: Decoder[V] : ClassTag
](ssc: StreamingContext,
  kafkaParams: Map[String, String],
  fromOffset: Map[TopicAndPartition, Long],
  topics: Set[String],
  untilOffsets: => Map[TopicAndPartition, Long]): InputDStream[(K, V)] = {
  val messageHandler = (mmd: MessageAndMetadata[K, V]) => (mmd.key, mmd.message)
  new Customize_DirectKafkaInputDStream[K, V, KD, VD, (K, V)](
    ssc, kafkaParams, fromOffset, messageHandler, untilOffsets)
}
/**
 * Compute the starting offsets for the given topics (earliest or latest,
 * depending on auto.offset.reset).
 *
 * @param kc
 * @param kafkaParams
 * @param topics
 * @return
 */
private[this] def getFromOffsets(kc: KafkaCluster, kafkaParams: Map[String, String], topics: Set[String]): Map[TopicAndPartition, Long] = {
val reset = kafkaParams.get("auto.offset.reset").map(_.toLowerCase(Locale.ROOT))
val result = for {
topicPartitions <- kc.getPartitions(topics).right
leaderOffsets <- (if (reset == Some("smallest")) {
kc.getEarliestLeaderOffsets(topicPartitions)
} else {
kc.getLatestLeaderOffsets(topicPartitions)
}).right
} yield {
leaderOffsets.map { case (tp, lo) =>
(tp, lo.offset)
}
}
KafkaCluster.checkErrors(result)
}
/** Latest leader offset for every partition of the given topics. */
def getTopicPartitionInfo(topic: Set[String]): Map[TopicAndPartition, Long] = {
  val leaderOffsets = kc.getLatestLeaderOffsets(kc.getPartitions(topic).right.get)
  leaderOffsets.right.get.map { case (tp, lo) =>
    tp -> lo.offset
  }
}
}
package org.apache.spark.streaming.kafka
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.Decoder
import org.apache.spark.{Logging, SparkException}
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream.{DStreamCheckpointData, InputDStream}
import org.apache.spark.streaming.kafka.KafkaCluster.{Err, LeaderOffset}
import org.apache.spark.streaming.scheduler.{RateController, StreamInputInfo}
import org.apache.spark.streaming.scheduler.rate.RateEstimator
import org.slf4j.LoggerFactory
import scala.annotation.tailrec
import scala.collection.mutable
import scala.reflect.ClassTag
/**
 * Class: Customize_DirectKafkaInputDStream
 * Author: A_p
 *
 * @version 1.0.0
 */
private[streaming] class Customize_DirectKafkaInputDStream[
K: ClassTag,
V: ClassTag,
U <: Decoder[K] : ClassTag,
T <: Decoder[V] : ClassTag,
R: ClassTag](
  _ssc: StreamingContext,
  val kafkaParams: Map[String, String],
  val fromOffsets: Map[TopicAndPartition, Long],
  messageHandler: MessageAndMetadata[K, V] => R,
  resultMap: => Map[TopicAndPartition, Long]
) extends InputDStream[R](_ssc) with Logging {
val maxRetries = context.sparkContext.getConf.getInt(
"spark.streaming.kafka.maxRetries", 1)
val logger = LoggerFactory.getLogger(this.getClass.getName)
private[streaming] override def name: String = s"Kafka direct stream [$id]"
protected[streaming] override val checkpointData =
new DirectKafkaInputDStreamCheckpointData
override def start(): Unit = {}
override def stop(): Unit = {}
override def compute(validTime: Time): Option[KafkaRDD[K, V, U, T, R]] = {
val leaderOffsets: Map[TopicAndPartition, LeaderOffset] = latestLeaderOffsets(maxRetries)
val untilOffsets = clamp(leaderOffsets)
untilOffsets.foreach { case (tp, lo) =>
  logger.error(s"end offset: ($tp, ${lo.offset})")
}
//TODO connect kafkaStream
val rdd = KafkaRDD[K, V, U, T, R](context.sparkContext, kafkaParams, currentOffsets, untilOffsets, messageHandler)
val offsetRanges = currentOffsets.map { case (tp, fo) =>
val uo = untilOffsets(tp)
OffsetRange(tp.topic, tp.partition, fo, uo.offset)
}
val description = offsetRanges.filter { offsetRange =>
offsetRange.fromOffset != offsetRange.untilOffset
}.map { offsetRange =>
s"topic: ${offsetRange.topic}\tpartition: ${offsetRange.partition}\t" +
s"offsets: ${offsetRange.fromOffset} to ${offsetRange.untilOffset}"
}.mkString("\n")
// Copy offsetRanges to an immutable.List to prevent the user from modifying it
val metadata = Map(
"offsets" -> offsetRanges.toList,
StreamInputInfo.METADATA_KEY_DESCRIPTION -> description)
val inputInfo = StreamInputInfo(id, rdd.count, metadata)
ssc.scheduler.inputInfoTracker.reportInfo(validTime, inputInfo)
currentOffsets = untilOffsets.map(kv => kv._1 -> kv._2.offset)
Some(rdd)
}
override protected[streaming] val rateController: Option[RateController] = {
if (RateController.isBackPressureEnabled(ssc.conf)) {
Some(new DirectKafkaRateController(id,
RateEstimator.create(ssc.conf, context.graph.batchDuration)))
} else {
None
}
}
protected val kc = new KafkaCluster(kafkaParams)
private val maxRateLimitPerPartition: Long = context.sparkContext.getConf.getLong(
"spark.streaming.kafka.maxRatePerPartition", 0)
/**
 * Based on how much each batch can actually process, decide whether the
 * backpressure mechanism kicks in; if it does, records beyond the limit are
 * deferred to later batches, relieving pressure on Streaming.
 *
 * @param offsets latest offset per partition
 * @return max number of messages per partition for this batch, if limited
 */
protected[streaming] def maxMessagesPerPartition(offsets: Map[TopicAndPartition, Long]): Option[Map[TopicAndPartition, Long]] = {
val estimatedRateLimit = rateController.map(_.getLatestRate())
val effectiveRateLimitPerPartition = estimatedRateLimit.filter(_ > 0) match {
case Some(rate) =>
val lagPerPartition = offsets.map { case (tp, offset) =>
tp -> Math.max(offset - currentOffsets(tp), 0)
}
val totalLag = lagPerPartition.values.sum
lagPerPartition.map { case (tp, lag) =>
val backpressureRate = Math.round(lag / totalLag.toFloat * rate)
tp -> (if (maxRateLimitPerPartition > 0) {
Math.min(backpressureRate, maxRateLimitPerPartition)
} else backpressureRate)
}
case None => offsets.map { case (tp, offset) => tp -> maxRateLimitPerPartition }
}
if (effectiveRateLimitPerPartition.values.sum > 0) {
val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
Some(effectiveRateLimitPerPartition.map {
case (tp, limit) => tp -> (secsPerBatch * limit).toLong
})
} else {
None
}
}
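// Worked example with illustrative numbers: two partitions lag 800 and 200
// records behind, the rate estimator reports 500 records/sec, and the batch
// interval is 10s. The lag-weighted limits come out to 800/1000 * 500 = 400
// and 200/1000 * 500 = 100 records/sec, so this batch is capped at 4000 and
// 1000 records for the two partitions respectively.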
protected var currentOffsets = fromOffsets
@tailrec
protected final def latestLeaderOffsets(retries: Int): Map[TopicAndPartition, LeaderOffset] = {
val oo: Either[Err, Map[TopicAndPartition, LeaderOffset]] =
  kc.getLatestLeaderOffsets(currentOffsets.keySet)
currentOffsets.foreach { fo =>
  logger.error(s"start offset: $fo")
}
oo.right.foreach { offsets =>
  offsets.foreach { case (tp, lo) =>
    logger.error(s"latest offset: ($tp, ${lo.offset})")
  }
}
// Cap each partition at the requested end offset (resultMap); if that offset
// is not available on the broker yet, hold at the current offset so this
// batch reads nothing from the partition.
val o = oo.right.map { offsets =>
  offsets.map { case (tp, lo) =>
    val capped =
      if (resultMap(tp) > lo.offset) currentOffsets(tp)
      else resultMap(tp)
    (tp, KafkaCluster.LeaderOffset(lo.host, lo.port, capped))
  }
}
if (o.isLeft) {
val err = o.left.get.toString
if (retries <= 0) {
throw new SparkException(err)
} else {
logError(err)
Thread.sleep(kc.config.refreshLeaderBackoffMs)
latestLeaderOffsets(retries - 1)
}
} else {
o.right.get
}
}
/**
 * Limit each partition's end offset according to the rate limit computed by
 * maxMessagesPerPartition.
 *
 * @param leaderOffsets candidate end offsets
 * @return end offsets, possibly clamped
 */
protected def clamp(leaderOffsets: Map[TopicAndPartition, LeaderOffset]): Map[TopicAndPartition, LeaderOffset] = {
val offsets = leaderOffsets.mapValues(lo => lo.offset)
maxMessagesPerPartition(offsets).map { mmp =>
mmp.map {
case (tp, messages) =>
val lo = leaderOffsets(tp)
tp -> lo.copy(offset = Math.min(currentOffsets(tp) + messages, lo.offset))
}
}.getOrElse(leaderOffsets)
}
/**
 * On recovery, restore the RDDs generated before the failure from the
 * Streaming checkpoint, pull their offsets back out, and rebuild the stream.
 */
private[streaming]
class DirectKafkaInputDStreamCheckpointData extends DStreamCheckpointData(this) {
def batchForTime: mutable.HashMap[Time, Array[(String, Int, Long, Long)]] = {
data.asInstanceOf[mutable.HashMap[Time, Array[OffsetRange.OffsetRangeTuple]]]
}
override def update(time: Time): Unit = {
batchForTime.clear()
generatedRDDs.foreach { kv =>
val a = kv._2.asInstanceOf[KafkaRDD[K, V, U, T, R]].offsetRanges.map(_.toTuple).toArray
batchForTime += kv._1 -> a
}
}
override def cleanup(time: Time): Unit = {}
override def restore(): Unit = {
val topics = fromOffsets.keySet
val leaders = KafkaCluster.checkErrors(kc.findLeaders(topics))
batchForTime.toSeq.sortBy(_._1)(Time.ordering).foreach { case (t, b) =>
logInfo(s"Restoring KafkaRDD for time $t ${b.mkString("[", ", ", "]")}")
generatedRDDs += t -> new KafkaRDD[K, V, U, T, R](
context.sparkContext, kafkaParams, b.map(OffsetRange(_)), leaders, messageHandler)
}
}
}
/**
 * A RateController that just retrieves the latest rate from the RateEstimator.
 */
private[streaming] class DirectKafkaRateController(id: Int, estimator: RateEstimator)
extends RateController(id, estimator) {
override def publish(rate: Long): Unit = ()
}
}
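Putting the two classes together, here is a hypothetical driver (the broker address, topic name, and app name are assumptions): it asks KafkaManager for the brokers' current latest offsets and passes them in as the per-batch end offsets. Because the untilOffsets parameter is by-name, that expression is re-evaluated as each batch runs, so every batch is capped at whatever the latest offsets are at that moment.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaManager

object BoundedOffsetDemo {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("bounded-offset-demo").setMaster("local[2]"), Seconds(10))
    val topics = Set("demo-topic")                    // hypothetical topic
    val kafkaParams = Map(
      "metadata.broker.list" -> "localhost:9092",     // hypothetical broker
      "auto.offset.reset"    -> "smallest")
    val km = new KafkaManager(kafkaParams)
    // by-name argument: recomputed per batch
    val stream = km.createDirectStreamZYKJ[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics, km.getTopicPartitionInfo(topics))
    stream.foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))
    ssc.start()
    ssc.awaitTermination()
  }
}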