Seamlessly Integrating Spark Streaming 2.0.0 with Kafka
Kafka is a distributed publish-subscribe messaging system, essentially a message queue whose key advantage is that data is persisted to disk (Kafka itself is not the focus of this article, so we will not dwell on it). Kafka has many use cases, for example as a buffering queue between asynchronous systems. A very common design looks like this: write some data (for example, logs) into Kafka for durable storage, have another service consume that data from Kafka and perform business-level analysis, then write the results to HBase or HDFS. Because this design is so common, big-data stream-processing frameworks such as Storm already support seamless integration with Kafka, and Spark, as a rising star, provides native Kafka support as well. A small producer sketch that feeds such a pipeline follows below.
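The "write logs into Kafka, consume them downstream" pattern assumes something is producing into the topic in the first place. Here is a minimal sketch of such a producer using the plain kafka-clients API (which the spark-streaming-kafka-0-10 dependency below pulls in transitively); the broker address localhost:9092 and the topic name access-log are placeholder values, not from the original post:

import java.util.Properties
import org.apache.kafka.clients.producer.{ KafkaProducer, ProducerRecord }

object LogProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    // placeholder broker list; point this at your own cluster
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    (1 to 10).foreach { i =>
      // without a key, records are spread across the topic's partitions
      producer.send(new ProducerRecord[String, String]("access-log", s"log line $i"))
    }
    producer.flush()
    producer.close()
  }
}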
This article walks through Spark Streaming + Kafka in practice. First, the project's Maven pom.xml:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>sprakStream</groupId>
  <artifactId>sprakStream</artifactId>
  <version>0.0.1-SNAPSHOT</version>
  <dependencies>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.11</artifactId>
      <version>2.0.0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-sql_2.11</artifactId>
      <version>2.0.0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming_2.11</artifactId>
      <version>2.0.0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-mllib_2.11</artifactId>
      <version>2.0.0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
      <version>2.0.0</version>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-client</artifactId>
      <version>1.2.1</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>1.2.1</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>redis.clients</groupId>
      <artifactId>jedis</artifactId>
      <version>2.8.0</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>org.postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>9.4-1202-jdbc4</version>
      <scope>provided</scope>
    </dependency>
    <dependency>
      <groupId>net.sf.json-lib</groupId>
      <artifactId>json-lib</artifactId>
      <version>2.2.3</version>
    </dependency>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-pool2</artifactId>
      <version>2.2</version>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>${basedir}/src/main/scala</sourceDirectory>
    <testSourceDirectory>${basedir}/src/test/scala</testSourceDirectory>
    <resources>
      <resource>
        <directory>${basedir}/src/main/resources</directory>
      </resource>
    </resources>
    <testResources>
      <testResource>
        <directory>${basedir}/src/test/resources</directory>
      </testResource>
    </testResources>
    <plugins>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.1</version>
        <configuration>
          <source>1.8</source>
          <target>1.8</target>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.2</version>
        <configuration>
          <createDependencyReducedPom>true</createDependencyReducedPom>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <artifactSet>
                <includes>
                  <include>*:*</include>
                </includes>
              </artifactSet>
              <filters>
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
              <transformers>
                <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                  <resource>reference.conf</resource>
                </transformer>
                <transformer implementation="org.apache.maven.plugins.shade.resource.DontIncludeResourceTransformer">
                  <resource>log4j.properties</resource>
                </transformer>
              </transformers>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
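The consumer code below references a small constants object, com.sprakStream.util.AppConstant, which is not shown in the original post. A minimal sketch of what it presumably looks like is given here; the broker list and topic name are placeholder values to adjust to your environment:

package com.sprakStream.util

// Assumed shape of the constants object used by the streaming job below;
// broker list and topic name are placeholders, not from the original post.
object AppConstant {
  val KAFKA_HOST = "kafka01:9092,kafka02:9092,kafka03:9092"
  val KAFKA_TOPIC = "access-log"
}

With the dependencies and constants in place, the streaming job itself follows.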
package com.sprakStream.demo

import java.util.Properties
import java.util.regex.Matcher

import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.TaskContext
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.ConsumerStrategies
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.HasOffsetRanges
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.OffsetRange

import com.sprakStream.bean.IpMapper
import com.sprakStream.util.AppConstant
import com.sprakStream.util.CommUtil
import com.logger.util.LoggerUtil

object KafkaExampleOffset {
  def main(args: Array[String]): Unit = {
    //val conf = new SparkConf()
    //val sc = new SparkContext()
    // Home environment
    // System.setProperty("spark.sql.warehouse.dir", "D:\\tools\\spark-2.0.0-bin-hadoop2.6");
    // System.setProperty("hadoop.home.dir", "D:\\tools\\hadoop-2.6.0");
    // Office environment
    System.setProperty("spark.sql.warehouse.dir", "D:\\DevelopTool\\spark-2.0.0-bin-hadoop2.6")
    println("success to Init...")

    val url = "jdbc:postgresql://172.16.12.190:5432/dataex_tmp"
    val prop = new Properties()
    prop.put("user", "postgres")
    prop.put("password", "issing")

    val conf = new SparkConf().setAppName("wordcount").setMaster("local")
    val ssc = new StreamingContext(conf, Seconds(2))
    val sparkSession = SparkSession.builder().config(conf).getOrCreate()

    // Utilities is a project-local helper object (logging setup and the Apache log regex)
    val util = Utilities
    util.setupLogging()
    // Construct a regular expression (regex) to extract fields from raw Apache log lines
    val pattern = util.apacheLogPattern()

    // hostname:port of the Kafka brokers, not ZooKeeper
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> AppConstant.KAFKA_HOST,
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "example",
      "enable.auto.commit" -> (false: java.lang.Boolean)
      // "auto.offset.reset" -> "latest"   // reset to the latest offset (default)
      // "auto.offset.reset" -> "earliest" // reset to the earliest offset
      // "auto.offset.reset" -> "none"     // throw an exception if no previous offset exists for the consumer group
    )

    // List of topics you want to listen for from Kafka
    val topics = List(AppConstant.KAFKA_TOPIC).toSet

    /**
     * Kafka offsets
     *
     * Read Kafka data starting from the specified offsets.
     * Note: because of the exactly-once mechanism of the direct stream, each record is read only once.
     * Once starting offsets are given, reading resumes from where the previous Streaming run stopped.
     */
    // Observed in testing: only partitions that appear in the offsets map are consumed;
    // the consumer works through the data partition by partition.
    // The L suffix marks a Long literal, e.g. 5000L means "start consuming from offset 5000".
    val offsets = Map[TopicPartition, Long](
      new TopicPartition(AppConstant.KAFKA_TOPIC, 0) -> 5000L,
      new TopicPartition(AppConstant.KAFKA_TOPIC, 1) -> 5000L,
      new TopicPartition(AppConstant.KAFKA_TOPIC, 2) -> 5000L)

    // Obtain the Kafka data via KafkaUtils.createDirectStream(...); Kafka settings come from kafkaParams
    val line = KafkaUtils.createDirectStream(
      ssc,
      PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, offsets))

    // Process the data
    line.foreachRDD(mess => {
      // Get the offset ranges of this batch
      val offsetsList = mess.asInstanceOf[HasOffsetRanges].offsetRanges
      mess.foreachPartition(lines => {
        lines.foreach(line => {
          // Record the offsets of the partition being processed here
          val o: OffsetRange = offsetsList(TaskContext.get.partitionId)
          // println("--topic::" + o.topic + "--partition:" + o.partition + "--fromOffset:" + o.fromOffset + "--untilOffset:" + o.untilOffset)
          // Consume the record here
          println("The kafka line is " + line)
          LoggerUtil.loggerToBuffer(line.toString())
        })
      })
    })

    // Kick it off
    ssc.checkpoint("/user/root/spark/checkpoint")
    ssc.start()
    ssc.awaitTermination()
    println("KafkaExample - finished.................................")
  }
}

object SQLContextSingleton2 {
  @transient private var instance: SQLContext = _
  def getInstance(sparkContext: SparkContext): SQLContext = {
    if (instance == null) {
      instance = new SQLContext(sparkContext)
    }
    instance
  }
}
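Note that the job disables enable.auto.commit and looks up each batch's offset ranges, but it never persists them anywhere, so a restart without the hard-coded offsets map falls back to auto.offset.reset. A minimal sketch of one common option, letting the 0-10 connector commit the consumed ranges back to Kafka after each batch, is shown below; it assumes the CanCommitOffsets API of spark-streaming-kafka-0-10 and plugs into the "line" stream created above (storing the ranges in ZooKeeper, HBase, or a relational table instead works the same way, only the commit call changes):

import org.apache.spark.streaming.kafka010.{ CanCommitOffsets, HasOffsetRanges }

// drop-in replacement for the foreachRDD block above
line.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch here ...
  // asynchronously commit the consumed ranges back to Kafka, stored under group.id "example"
  line.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}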
Note: the code above has been tested and works; adapt it to your needs. If you have any questions, please leave a comment!