Spark Streaming Programming (3): Consuming Data from Kafka
Spark Streaming supports consuming from Kafka in the following ways.
The Kafka version I experimented with is 0.10, using the spark-streaming-kafka-0-8 integration; I have not looked into the spark-streaming-kafka-0-10 branch.
The spark-streaming-kafka-0-8 integration supports Kafka 0.8.2.1 and later, and offers two approaches:
(1) Receiver-based Approach: built on the Kafka high-level consumer API; a Receiver running on an executor is responsible for receiving the data.
(2) Direct Approach: built on the Kafka simple consumer API; there is no receiver.
A Maven project needs to add the following dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.1.0</version>
</dependency>
Receiver-based approach code (usage is explained in the comments):
package com.lgh.sparkstreaming

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

/**
 * Created by Administrator on 2017/8/23.
 */
object KafkaWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 4) {
      System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
      System.exit(1)
    }
    // Arguments: ZooKeeper address, consumer group name, topic names (comma-separated), thread count
    val Array(zkQuorum, group, topics, numThreads) = args
    // setMaster("local[2]") is for local debugging only
    val sparkConf = new SparkConf().setAppName("KafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint("checkpoint")
    // Map of topic name -> number of consumer threads reading that topic
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    // Arguments: StreamingContext, Kafka's ZooKeeper address, consumer group, topic map
    val kafkamessage = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
    // _._2 extracts the actual message payload from each (key, message) pair
    val lines = kafkamessage.map(_._2)
    val words = lines.flatMap(_.split(" "))
    // Word counts over a 10-minute sliding window, sliding every 2 seconds
    val wordCounts = words.map(x => (x, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
    wordCounts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
Direct approach code:
package com.lgh.sparkstreaming

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

/**
 * Created by Administrator on 2017/8/23.
 */
object DirectKafkaWordCount {
  def main(args: Array[String]) {
    if (args.length < 2) {
      System.err.println(s"""
        |Usage: DirectKafkaWordCount <brokers> <topics>
        |  <brokers> is a list of one or more Kafka brokers
        |  <topics> is a list of one or more kafka topics to consume from
        |
        """.stripMargin)
      System.exit(1)
    }
    // brokers: list of Kafka brokers, comma-separated
    // topics: Kafka topics, comma-separated
    val Array(brokers, topics) = args

    // Create context with 2 second batch interval
    val sparkConf = new SparkConf().setAppName("DirectKafkaWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(2))

    // Create direct kafka stream with brokers and topics
    val topicsSet = topics.split(",").toSet
    val kafkaParams = Map[String, String]("metadata.broker.list" -> brokers)
    val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicsSet)

    // Get the lines, split them into words, count the words and print
    val lines = messages.map(_._2)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1L)).reduceByKey(_ + _)
    wordCounts.print()

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
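The direct stream also exposes the Kafka offsets of each batch, which is useful if you want to inspect or manage them yourself. A minimal sketch, assuming the messages stream from the example above (the println output is purely illustrative):

import org.apache.spark.streaming.kafka.HasOffsetRanges

messages.foreachRDD { rdd =>
  // Each RDD produced by createDirectStream carries its Kafka offset ranges;
  // the cast must be done on the stream's RDDs directly, before any transformation
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offsetRanges.foreach { o =>
    println(s"topic=${o.topic} partition=${o.partition} " +
      s"from=${o.fromOffset} until=${o.untilOffset}")
  }
}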
Differences between the two approaches:
1. Simplified Parallelism
The Direct approach creates as many RDD partitions as there are Kafka partitions and reads them in parallel; Kafka partitions and RDD partitions have a one-to-one correspondence. An illustration of the contrast follows.
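With the Receiver-based approach, by contrast, reading Kafka partitions in parallel means creating several input streams yourself and unioning them. A rough sketch, reusing the ssc, zkQuorum, group, and topicMap values from the KafkaWordCount example, with an assumed receiver count:

// numReceivers is an assumed value; each stream gets its own Receiver
val numReceivers = 4
val kafkaStreams = (1 to numReceivers).map { _ =>
  KafkaUtils.createStream(ssc, zkQuorum, group, topicMap)
}
// Union the per-receiver streams into a single DStream for downstream processing
val unifiedStream = ssc.union(kafkaStreams)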
2. Efficiency
To guarantee that no data is lost, the Receiver-based approach has to enable the Write Ahead Log (WAL), which replicates the data a second time into fault-tolerant storage and is therefore less efficient. The Direct approach needs no WAL, since lost data can simply be re-read from Kafka.
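A minimal sketch of turning the WAL on for the receiver example above; the checkpoint path is hypothetical and must live on fault-tolerant storage such as HDFS:

val sparkConf = new SparkConf()
  .setAppName("KafkaWordCount")
  // Standard Spark configuration switch that enables the receiver WAL
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")
val ssc = new StreamingContext(sparkConf, Seconds(2))
// The WAL segments are written under the checkpoint directory
ssc.checkpoint("hdfs:///user/spark/checkpoint")  // hypothetical path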
3. Exactly-once semantics
The Receiver-based approach uses the Kafka high-level consumer API and stores consumer offsets in ZooKeeper; combined with the Write Ahead Log, it achieves at-least-once semantics.
The Direct approach uses the Kafka simple consumer API and tracks offsets in the Spark checkpoint, which makes exactly-once semantics achievable (provided the output operation is idempotent or transactional); see the sketch below.
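For the checkpointed offsets to survive a driver restart, the context must be created through StreamingContext.getOrCreate rather than built fresh every time. A sketch under those assumptions, with a hypothetical checkpoint path:

val checkpointDir = "hdfs:///user/spark/direct-checkpoint"  // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("DirectKafkaWordCount")
  val ssc = new StreamingContext(conf, Seconds(2))
  ssc.checkpoint(checkpointDir)
  // build the direct stream and output operations here, as in DirectKafkaWordCount
  ssc
}

// On a fresh start this calls createContext(); after a restart it rebuilds the
// context, including the consumed offsets, from the checkpoint data
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()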