KafkaUtils.createDirectStream


Reposted from: http://blog.selfup.cn/1665.html

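The official documentation describes this newer interface at length. In short, it does not go through ZooKeeper: it reads directly from Kafka and tracks offsets itself, which makes it considerably faster than KafkaUtils.createStream. The trade-off is that offsets are no longer recorded in ZooKeeper, so ZooKeeper-based tools can no longer monitor them.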

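Our project needed to try this interface while still monitoring offsets, so the only option was to do what the official documentation suggests and write the offsets to ZooKeeper ourselves.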

Method 1

def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag] (
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      topics: Set[String]
  ): InputDStream[(K, V)] = {...}

This method takes only three parameters and is the easiest to use. However, on every start it reads from the latest offset by default, or from the earliest offset if auto.offset.reset="smallest" is set.

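Clearly, neither starting position is suitable for production: starting from the latest offset skips whatever arrived while the job was down, and starting from the earliest offset reprocesses everything still retained in the topic.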
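As a rough model of the rule just described, the sketch below picks the starting position from kafkaParams in plain Scala. StartPosition and resolve are hypothetical names used only for illustration, not part of the Spark API, and the broker address is a placeholder.

```scala
object StartPosition {
  sealed trait Position
  case object Latest extends Position
  case object Earliest extends Position

  // Models the behaviour described in the text: "smallest" selects the earliest
  // offset; any other value, or no value at all, selects the latest offset.
  def resolve(kafkaParams: Map[String, String]): Position =
    kafkaParams.get("auto.offset.reset").map(_.toLowerCase) match {
      case Some("smallest") => Earliest
      case _                => Latest
    }

  def main(args: Array[String]): Unit = {
    // Placeholder broker list (assumption); replace with real brokers.
    val defaults = Map("metadata.broker.list" -> "broker1:9092")
    println(resolve(defaults))
    println(resolve(defaults + ("auto.offset.reset" -> "smallest")))
  }
}
```

Either way the position is derived from configuration alone, not from where the previous run stopped, which is why Method 2 is needed.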

Method 2

def createDirectStream[
    K: ClassTag,
    V: ClassTag,
    KD <: Decoder[K]: ClassTag,
    VD <: Decoder[V]: ClassTag,
    R: ClassTag] (
      ssc: StreamingContext,
      kafkaParams: Map[String, String],
      fromOffsets: Map[TopicAndPartition, Long],
      messageHandler: MessageAndMetadata[K, V] => R
  ): InputDStream[R] = {...}

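This method lets you set the starting offsets at startup, but its parameters are considerably more complicated to set up. The first is fromOffsets: Map[TopicAndPartition, Long]; see the code below.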

// Kafka 0.8.x packages
import kafka.common.TopicAndPartition
import kafka.utils.{ZKGroupTopicDirs, ZkUtils}

// Partitions of each configured topic, as registered in ZooKeeper
val topic2Partitions = ZkUtils.getPartitionsForTopics(zkClient, Config.kafkaConfig.topic)
var fromOffsets: Map[TopicAndPartition, Long] = Map()

topic2Partitions.foreach { case (topic, partitions) =>
  val topicDirs = new ZKGroupTopicDirs(Config.kafkaConfig.kafkaGroupId, topic)

  partitions.foreach(partition => {
    val zkPath = s"${topicDirs.consumerOffsetDir}/$partition"
    // Create the node if it is missing so the read below does not fail on a nonexistent path
    ZkUtils.makeSurePersistentPathExists(zkClient, zkPath)
    val untilOffset = zkClient.readData[String](zkPath)

    val tp = TopicAndPartition(topic, partition)
    // Use the stored offset when it is present and numeric; otherwise fall back to the latest offset
    val offset = try {
      if (untilOffset == null || untilOffset.trim == "")
        getMaxOffset(tp)
      else
        untilOffset.toLong
    } catch {
      case e: Exception => getMaxOffset(tp)
    }
    fromOffsets += (tp -> offset)
    logger.info(s"Offset init: set offset of $topic/$partition as $offset")
  })
}

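Here getMaxOffset returns the latest (largest) offset of a partition. On the first run of the Spark job, or when the data in ZooKeeper has been deleted or is invalid, consumption starts from that offset. The code is as follows: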

private def getMaxOffset(tp: TopicAndPartition): Long = {
  val request = OffsetRequest(immutable.Map(tp -> PartitionOffsetRequestInfo(OffsetRequest.LatestTime, 1)))

  ZkUtils.getLeaderForPartition(zkClient, tp.topic, tp.partition) match {
    case Some(brokerId) => {
      ZkUtils.readDataMaybeNull(zkClient, ZkUtils.BrokerIdsPath + "/" + brokerId)._1 match {
        case Some(brokerInfoString) => {
          Json.parseFull(brokerInfoString) match {
            case Some(m) =>
              val brokerInfo = m.asInstanceOf[Map[String, Any]]
              val host = brokerInfo.get("host").get.asInstanceOf[String]
              val port = brokerInfo.get("port").get.asInstanceOf[Int]
              new SimpleConsumer(host, port, 10000, 100000, "getMaxOffset")
                .getOffsetsBefore(request)
                .partitionErrorAndOffsets(tp)
                .offsets
                .head
            case None =>
              throw new BrokerNotAvailableException("Broker id %d does not exist".format(brokerId))
          }
        }
        case None =>
          throw new BrokerNotAvailableException("Broker id %d does not exist".format(brokerId))
      }
    }
    case None =>
      throw new Exception("No broker for partition %s - %s".format(tp.topic, tp.partition))
  }
}
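The fallback rule used when building fromOffsets (use the stored value if it is a valid number, otherwise ask Kafka for the latest offset) can be isolated into a small pure function, which makes it easy to test without ZooKeeper or Kafka. This is only a sketch: OffsetInit and resolveStartOffset are hypothetical names, and the fallback argument stands in for a call to getMaxOffset.

```scala
import scala.util.Try

object OffsetInit {
  // Returns the stored offset when it parses as a Long; otherwise evaluates `fallback`.
  // `fallback` is passed by name, so the (potentially expensive) lookup runs only when needed.
  // Unlike the listing above, the stored value is trimmed before parsing,
  // so " 15 " is accepted instead of triggering the fallback.
  def resolveStartOffset(stored: String, fallback: => Long): Long =
    Option(stored)                           // null becomes None
      .map(_.trim)
      .filter(_.nonEmpty)                    // empty or blank becomes None
      .flatMap(s => Try(s.toLong).toOption)  // non-numeric becomes None
      .getOrElse(fallback)
}
```

In the loop that fills fromOffsets, this would be called as OffsetInit.resolveStartOffset(untilOffset, getMaxOffset(tp)).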

Next comes the messageHandler parameter. So that later processing can see which topic each record came from, it produces a (topic, message) tuple:

val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message())

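Next, read each RDD's offset ranges (messages below is the DStream returned by createDirectStream) and write them to ZooKeeper once the batch has been saved: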

var offsetRanges = Array[OffsetRange]()
messages.transform { rdd =>
  offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd
}.foreachRDD(rdd => {
  rdd.foreachPartition(HBasePuter.batchSave)
  offsetRanges.foreach(o => {
    val topicDirs = new ZKGroupTopicDirs(Config.kafkaConfig.kafkaGroupId, o.topic)
    val zkPath = s"${topicDirs.consumerOffsetDir}/${o.partition}"
    ZkUtils.updatePersistentPath(zkClient, zkPath, o.untilOffset.toString)
    logger.info(s"Offset update: set offset of ${o.topic}/${o.partition} as ${o.untilOffset.toString}")
  })
})
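Two details of the loop above are worth spelling out: where the offsets end up in ZooKeeper, and why the save runs before the offset write. The sketch below models both with an in-memory map in place of ZooKeeper. OffsetCommitSketch, Range and saveThenCommit are hypothetical names; the path layout is the one ZKGroupTopicDirs.consumerOffsetDir is assumed to produce in Kafka 0.8.x, which is also where ZooKeeper-based monitoring tools look.

```scala
import scala.collection.mutable
import scala.util.control.NonFatal

object OffsetCommitSketch {
  // Hypothetical stand-in for OffsetRange, reduced to the fields used here.
  final case class Range(topic: String, partition: Int, untilOffset: Long)

  // Assumed layout: /consumers/<group>/offsets/<topic>/<partition>
  def offsetPath(group: String, r: Range): String =
    s"/consumers/$group/offsets/${r.topic}/${r.partition}"

  // Runs `save` first and records the offsets only if it did not throw.
  // `store` is an in-memory stand-in for ZooKeeper.
  def saveThenCommit(group: String, ranges: Seq[Range], store: mutable.Map[String, String])
                    (save: () => Unit): Boolean =
    try {
      save()
      ranges.foreach(r => store.update(offsetPath(group, r), r.untilOffset.toString))
      true
    } catch {
      // Offsets stay untouched, so the same batch is read again after a restart.
      case NonFatal(_) => false
    }
}
```

Because offsets are written only after a successful save, a crash between the two steps replays the batch instead of losing it. The sink therefore has to tolerate duplicate writes.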

Finally, an example of batchSave:

def batchSave(iter: Iterator[(String, String)]): Unit = {
  iter.foreach(item => {
    val topic = item._1
    val message = item._2
    ...
  })
}
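The body of batchSave is elided in the original. One common shape for such a function, shown here purely as an assumption and not as the author's code, is to group the partition's records into fixed-size batches before writing them. The batchSize and write parameters are hypothetical additions so the sketch can run without HBase; the real function takes only the iterator, as foreachPartition requires.

```scala
object BatchSaveSketch {
  // Splits one partition's records into groups of at most `batchSize`
  // and hands each group to `write`. Returns the number of records written.
  def batchSave(iter: Iterator[(String, String)], batchSize: Int)
               (write: Seq[(String, String)] => Unit): Int = {
    var written = 0
    iter.grouped(batchSize).foreach { group =>
      write(group)
      written += group.size
    }
    written
  }
}
```

Grouping keeps memory bounded to one batch at a time, since the iterator is consumed lazily.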
