Spark Streaming reading Kafka with zero data loss (Part 3)


Approach 2:
In approach 2, every time a Spark Streaming batch finishes consuming data from Kafka, the consumed Kafka offsets are written back to ZooKeeper. When the program crashes or is upgraded, it can resume reading from where it left off, giving zero data loss and at-least-once semantics. By contrast, relying on checkpoints can cause duplicate consumption, and the offsets tracked by Spark Streaming can drift out of sync with the offsets kept in ZooKeeper, leading to lost or reprocessed data. Instead, we trigger the offset update when the DStream triggers an action, specifically in the output operation, which guarantees that offsets are only written to ZooKeeper for data that has actually been consumed. Without further ado, here is the code.

import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import kafka.utils.{ZKGroupTopicDirs, ZkUtils}
import org.I0Itec.zkclient.ZkClient
import org.I0Itec.zkclient.exception.ZkMarshallingError
import org.I0Itec.zkclient.serialize.ZkSerializer
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

def start(ssc: StreamingContext,
          brokerList: String,
          zkConnect: String,
          groupId: String,
          topic: String,
          pOffsets: scala.collection.mutable.Map[Int, Long]): InputDStream[(String, String)] = {
  // ZkClient that reads and writes znode data as plain UTF-8 strings,
  // which is how the offsets are stored below.
  val zkClient = new ZkClient(zkConnect, 60000, 60000, new ZkSerializer {
    override def serialize(data: Object): Array[Byte] = {
      try {
        data.toString.getBytes("UTF-8")
      } catch {
        case _: ZkMarshallingError => null
      }
    }
    override def deserialize(bytes: Array[Byte]): Object = {
      try {
        new String(bytes, "UTF-8")
      } catch {
        case _: ZkMarshallingError => null
      }
    }
  })

  val kafkaParams = Map(
    "metadata.broker.list" -> brokerList,
    "group.id" -> groupId,
    "zookeeper.connect" -> zkConnect,
    "auto.offset.reset" -> kafka.api.OffsetRequest.SmallestTimeString)
  val topics = topic.split(",").toSet
  val topicDirs = new ZKGroupTopicDirs(groupId, topic)

  var kafkaStream: InputDStream[(String, String)] = null
  val fromOffsets = scala.collection.mutable.Map.empty[TopicAndPartition, Long]
  if (pOffsets.nonEmpty) {
    // Offsets were passed in (e.g. read back from ZooKeeper): resume from them.
    pOffsets.foreach { po =>
      val partition = TopicAndPartition(topic, po._1)
      fromOffsets += ((partition, po._2))
    }
    val messageHandler = (mmd: MessageAndMetadata[String, String]) => (mmd.topic, mmd.message())
    kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, (String, String)](
      ssc, kafkaParams, fromOffsets.toMap, messageHandler)
  } else {
    // No saved offsets yet: fall back to auto.offset.reset (smallest).
    kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
  }

  var offsetRanges = Array[OffsetRange]()
  kafkaStream.transform { rdd =>
    // Capture each batch's offset ranges while the RDD is still a KafkaRDD.
    offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
  }.foreachRDD { rdd =>
    // Output operation: persist the offsets of the processed batch to ZooKeeper.
    // Note this stores o.fromOffset (the start of the batch); storing o.untilOffset
    // instead would resume after the last processed record rather than replaying the batch.
    for (o <- offsetRanges) {
      ZkUtils.updatePersistentPath(zkClient, s"${topicDirs.consumerOffsetDir}/${o.partition}", o.fromOffset.toString)
    }
  }
  kafkaStream
}
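For context, here is a minimal sketch of how the helper above might be wired into a driver program. The application name, master URL, broker and ZooKeeper addresses, group id, topic and the println processing are placeholders added for illustration, not part of the original post.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("kafka-zk-offsets").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))

// First run: pass an empty map so the stream starts according to auto.offset.reset;
// on restart, fill this map with the offsets previously saved in ZooKeeper.
val savedOffsets = scala.collection.mutable.Map.empty[Int, Long]

val stream = start(ssc, "broker1:9092", "zk1:2181", "my-group", "my-topic", savedOffsets)

// The caller's own output operation; the offset commit registered inside
// start() runs as a separate output operation on the same batches.
stream.foreachRDD { rdd =>
  rdd.map(_._2).foreach(println)
}

ssc.start()
ssc.awaitTermination()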

The offsets to start reading from can be specified explicitly. That raises the question: how do I find out which Kafka offsets have already been consumed? To be continued.
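One possible way to do this, shown here as a sketch rather than the post's own solution, is to read back the per-partition znodes that the foreachRDD above wrote, using the same ZKGroupTopicDirs path and a ZkClient built with the same string serializer. The readOffsets helper name is assumed for illustration.

import kafka.utils.ZKGroupTopicDirs
import org.I0Itec.zkclient.ZkClient
import scala.collection.JavaConverters._

// Hypothetical helper: read back the offsets that start() persisted to ZooKeeper.
// Returns an empty map on the very first run, when no offsets exist yet.
// Assumes zkClient was created with the same UTF-8 string ZkSerializer as above.
def readOffsets(zkClient: ZkClient, groupId: String, topic: String): scala.collection.mutable.Map[Int, Long] = {
  val topicDirs = new ZKGroupTopicDirs(groupId, topic)
  val offsets = scala.collection.mutable.Map.empty[Int, Long]
  if (zkClient.exists(topicDirs.consumerOffsetDir)) {
    // One child znode per partition, named by the partition id, holding the offset as a string.
    val partitions = zkClient.getChildren(topicDirs.consumerOffsetDir).asScala
    partitions.foreach { p =>
      val offset = zkClient.readData[String](s"${topicDirs.consumerOffsetDir}/$p")
      offsets += (p.toInt -> offset.toLong)
    }
  }
  offsets
}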
