A complete Spark Streaming example
Blog: http://www.fanlegefan.com
Summary
This post walks through a simple Spark Streaming example. The overall flow is: read data from Kafka in real time, compute PV, UV, and sum(money), and finally store the results in Redis. Expressed in SQL, the computation is roughly:
select time, page, count(*) pv, count(distinct user) uv, sum(money) from test group by page, time
Sample data format:
user,page,money,time
smith,iphone4.html,578.02,1500618981283
andrew,mac.html,277.62,1500618981285
smith,note.html,388.56,1500618981285
Pushing data to Kafka
Start Kafka
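The original post does not show the startup commands. As a minimal sketch, assuming a stock local single-broker Kafka distribution with the default config files, starting ZooKeeper and the broker and creating the test topic looks roughly like this:

bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test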
Generate test data
package com.fan.spark.stream

import java.text.DecimalFormat
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

import scala.util.Random

/**
  * Created by http://www.fanlegefan.com on 17-7-21.
  */
object ProduceMessage {

  def main(args: Array[String]): Unit = {

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("acks", "all")
    props.put("retries", "0")
    props.put("batch.size", "16384")
    props.put("linger.ms", "1")
    props.put("buffer.memory", "33554432")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)
    val users = Array("jack", "leo", "andy", "lucy", "jim", "smith", "iverson", "andrew")
    val pages = Array("iphone4.html", "huawei.html", "mi.html", "mac.html", "note.html", "book.html", "fanlegefan.com")
    val df = new DecimalFormat("#.00")
    val random = new Random()
    val num = 10

    for (i <- 0 to num) {
      // Build a "user,page,money,timestamp" record and send it to the "test" topic
      val message = users(random.nextInt(users.length)) + "," + pages(random.nextInt(pages.length)) +
        "," + df.format(random.nextDouble() * 1000) + "," + System.currentTimeMillis()
      producer.send(new ProducerRecord[String, String]("test", Integer.toString(i), message))
      println(message)
    }
    producer.close()
  }
}
Consuming the topic from the console shows output like this:
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
andrew,book.html,309.58,1500620213384
jack,book.html,954.01,1500620213456
iverson,book.html,823.07,1500620213456
iverson,iphone4.html,486.76,1500620213456
lucy,book.html,14.00,1500620213457
iverson,note.html,206.30,1500620213457
jack,book.html,25.30,1500620213457
jim,iphone4.html,513.82,1500620213457
lucy,mac.html,677.29,1500620213457
smith,mi.html,571.30,1500620213457
lucy,iphone4.html,113.83,1500620213457
Computing PV, UV, and the accumulated amount
Since the results are written to Redis, the code for obtaining a Redis client is as follows:
package com.fan.spark.stream

import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import redis.clients.jedis.JedisPool

/**
  * Created by http://www.fanlegefan.com on 17-7-21.
  */
object RedisClient {

  val redisHost = "127.0.0.1"
  val redisPort = 6379
  val redisTimeout = 30000

  lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)

  lazy val hook = new Thread {
    override def run = {
      println("Execute hook thread: " + this)
      pool.destroy()
    }
  }
  sys.addShutdownHook(hook.run)
}
Spark Streaming processes data in batches. For example, with batchDuration = 10, each batch covers the data received in a 10-second window. Computing PV is easy: simply keep accumulating counts. Computing UV is harder: a user who appears in this 10-second window may also have appeared in an earlier batch, and since Spark only sees one batch at a time it has no way of knowing whether that user was already counted. If we simply accumulated the per-batch counts, the UV for a full day would come out much larger than the real UV. The fix is HyperLogLog, which Redis conveniently provides out of the box. Its basic usage looks like this:
redis 127.0.0.1:6379> PFADD mykey a b c d e f g h i j
(integer) 1
redis 127.0.0.1:6379> PFCOUNT mykey
(integer) 10
Think of a b c d e f g h i j as users: for every incoming user we run a PFADD, and PFCOUNT key then returns the deduplicated UV directly. Note that this algorithm is approximate; according to the documentation the error is roughly 0.8%, which is acceptable for UV counting. You can measure the actual error yourself; I won't do that here.
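The same PFADD/PFCOUNT pattern can also be driven from Scala through Jedis. The following is only a minimal sketch, assuming the RedisClient pool defined above and a local Redis instance; the key name daily_uv and the sample user list are made up for illustration.

package com.fan.spark.stream

object HyperLogLogDemo {
  def main(args: Array[String]): Unit = {
    val jedis = RedisClient.pool.getResource
    val key = "daily_uv" // hypothetical key, for illustration only
    // Every user observed in the stream is added to the HyperLogLog;
    // duplicates do not increase the estimated cardinality.
    Seq("jack", "leo", "jack", "lucy", "jack").foreach(user => jedis.pfadd(key, user))
    // PFCOUNT returns the approximate number of distinct users (3 here, within HLL error).
    println("approximate uv = " + jedis.pfcount(key))
    jedis.close() // return the connection to the pool
  }
}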
The real-time computation code is as follows:
package com.fan.spark.stream

import java.text.SimpleDateFormat
import java.util.Date

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by http://www.fanlegefan.com on 17-7-21.
  */
object UserActionStreaming {

  def main(args: Array[String]): Unit = {

    val df = new SimpleDateFormat("yyyyMMdd")
    val group = "test"
    val topics = "test"

    val sparkConf = new SparkConf().setAppName("pvuv").setMaster("local[3]")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("/home/work/IdeaProjects/sparklearn/checkpoint")

    val topicSets = topics.split(",").toSet
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "localhost:9092",
      "group.id" -> group
    )
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topicSets)

    stream.foreachRDD(rdd => rdd.foreachPartition(partition => {
      // One Redis connection per partition, taken from the shared pool
      val jedis = RedisClient.pool.getResource
      partition.foreach(tuple => {
        val line = tuple._2
        val arr = line.split(",")
        val user = arr(0)
        val page = arr(1)
        val money = arr(2)
        val day = df.format(new Date(arr(3).toLong))
        // uv: add the user to this page's HyperLogLog for the day
        jedis.pfadd(day + "_" + page, user)
        // pv: increment the page-view counter in the daily hash
        jedis.hincrBy(day + "_pv", page, 1)
        // sum: accumulate the money spent on this page
        jedis.hincrByFloat(day + "_sum", page, money.toDouble)
      })
      jedis.close() // return the connection to the pool
    }))
    ssc.start()
    ssc.awaitTermination()
  }
}
Checking the results in Redis
127.0.0.1:6379> keys *
1) "20170721_note.html"
2) "20170721_book.html"
3) "20170721_fanlegefan.com"
4) "20170721_mac.html"
5) "20170721_pv"
6) "20170721_mi.html"
7) "20170721_iphone4.html"
8) "20170721_sum"
9) "20170721_huawei.html"
Check PV
127.0.0.1:6379> HGETALL 20170721_pv
 1) "mi.html"
 2) "112"
 3) "note.html"
 4) "107"
 5) "fanlegefan.com"
 6) "124"
 7) "huawei.html"
 8) "122"
 9) "iphone4.html"
10) "92"
11) "mac.html"
12) "103"
13) "book.html"
14) "135"
Check sum
127.0.0.1:6379> HGETALL 20170721_sum
 1) "mi.html"
 2) "56949.65999999999998948"
 3) "note.html"
 4) "56803.50999999999999801"
 5) "fanlegefan.com"
 6) "59622.50999999999999801"
 7) "huawei.html"
 8) "64456.50000000000000711"
 9) "iphone4.html"
10) "48643.07000000000001094"
11) "mac.html"
12) "51693.17999999999998906"
13) "book.html"
14) "67724.17999999999999261"
Check UV. The test data contains only 8 users, so every page's UV is 8.
127.0.0.1:6379> PFCOUNT 20170721_huawei.html
(integer) 8
127.0.0.1:6379> PFCOUNT 20170721_fanlegefan.com
(integer) 8
The data is now sitting in Redis; a scheduled job can push it to MySQL so the frontend can display it (a rough sketch of such a job follows). That is the general idea of this real-time computation pipeline.
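As a rough illustration of such a job, the sketch below reads the daily PV hash and the per-page HyperLogLogs back from Redis and inserts them into MySQL over JDBC. This is only a sketch under assumptions: the table pv_uv(day, page, pv, uv), the JDBC URL, and the credentials are invented for illustration, the MySQL JDBC driver is assumed to be on the classpath, and the job would be triggered externally (cron, Quartz, etc.).

package com.fan.spark.stream

import java.sql.DriverManager

import scala.collection.JavaConverters._

object PushToMysql {
  def main(args: Array[String]): Unit = {
    val day = "20170721"
    val jedis = RedisClient.pool.getResource
    // Hypothetical MySQL table: pv_uv(day VARCHAR, page VARCHAR, pv BIGINT, uv BIGINT)
    val conn = DriverManager.getConnection(
      "jdbc:mysql://localhost:3306/test", "root", "password")
    val stmt = conn.prepareStatement(
      "insert into pv_uv(day, page, pv, uv) values (?, ?, ?, ?)")
    // The daily pv hash holds one field per page; the uv for a page sits in its own HyperLogLog key.
    jedis.hgetAll(day + "_pv").asScala.foreach { case (page, pv) =>
      val uv = jedis.pfcount(day + "_" + page)
      stmt.setString(1, day)
      stmt.setString(2, page)
      stmt.setLong(3, pv.toLong)
      stmt.setLong(4, uv)
      stmt.executeUpdate()
    }
    stmt.close()
    conn.close()
    jedis.close()
  }
}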