A Complete Spark Streaming Example


Blog: http://www.fanlegefan.com

Original post: http://www.fanlegefan.com/archives/sparkstreaminglizi/

Summary

This post implements a simple Spark Streaming example. The overall flow is: read data from Kafka in real time, compute PV, UV, and sum(money), and finally store the results in Redis. Expressed in SQL, the computation is roughly

select time,page,count(*),count(distinct user) uv,sum(money) from test group by page,time

Sample data format:

user,page,money,time

smith,iphone4.html,578.02,1500618981283
andrew,mac.html,277.62,1500618981285
smith,note.html,388.56,1500618981285

Push the data to Kafka

Start Kafka
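The original post does not show the startup commands; below is a minimal sketch assuming a local single-broker Kafka installation run from its install directory, with the topic name test matching the rest of the example:

# start ZooKeeper and the Kafka broker (paths depend on your installation)
bin/zookeeper-server-start.sh config/zookeeper.properties
bin/kafka-server-start.sh config/server.properties
# create the "test" topic used by the producer and the streaming job
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test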


Generate data

package com.fan.spark.stream

import java.text.DecimalFormat
import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

import scala.util.Random

/**
  * Created by http://www.fanlegefan.com on 17-7-21.
  */
object ProduceMessage {

  def main(args: Array[String]): Unit = {

    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("acks", "all")
    props.put("retries", "0")
    props.put("batch.size", "16384")
    props.put("linger.ms", "1")
    props.put("buffer.memory", "33554432")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

    val producer = new KafkaProducer[String, String](props)

    val users = Array("jack", "leo", "andy", "lucy", "jim", "smith", "iverson", "andrew")
    val pages = Array("iphone4.html", "huawei.html", "mi.html", "mac.html", "note.html", "book.html", "fanlegefan.com")
    val df = new DecimalFormat("#.00")
    val random = new Random()
    val num = 10

    for (i <- 0 to num) {
      val message = users(random.nextInt(users.length)) + "," + pages(random.nextInt(pages.length)) +
        "," + df.format(random.nextDouble() * 1000) + "," + System.currentTimeMillis()
      producer.send(new ProducerRecord[String, String]("test", Integer.toString(i), message))
      println(message)
    }
    producer.close()
  }
}


Consuming from the console shows the following:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
andrew,book.html,309.58,1500620213384
jack,book.html,954.01,1500620213456
iverson,book.html,823.07,1500620213456
iverson,iphone4.html,486.76,1500620213456
lucy,book.html,14.00,1500620213457
iverson,note.html,206.30,1500620213457
jack,book.html,25.30,1500620213457
jim,iphone4.html,513.82,1500620213457
lucy,mac.html,677.29,1500620213457
smith,mi.html,571.30,1500620213457
lucy,iphone4.html,113.83,1500620213457

Compute PV, UV, and the accumulated amount

Since the results will be stored in Redis, the code for obtaining a Redis client is as follows:

package com.fan.spark.stream

import org.apache.commons.pool2.impl.GenericObjectPoolConfig
import redis.clients.jedis.JedisPool

/**
  * Created by http://www.fanlegefan.com on 17-7-21.
  */
object RedisClient {

  val redisHost = "127.0.0.1"
  val redisPort = 6379
  val redisTimeout = 30000

  lazy val pool = new JedisPool(new GenericObjectPoolConfig(), redisHost, redisPort, redisTimeout)

  lazy val hook = new Thread {
    override def run = {
      println("Execute hook thread: " + this)
      pool.destroy()
    }
  }

  sys.addShutdownHook(hook.run)
}

Spark Streaming processes data in batches. For example, with batchDuration=10 each batch processes the data received during the last 10 seconds. Computing PV is straightforward: simply count and accumulate. Computing UV is harder: a user seen within these 10 seconds may also have appeared in earlier batches, and since Spark only sees one batch at a time, it has no way of knowing whether that user has already been counted. Naively accumulating per-batch counts would make the daily UV much larger than the real value. To solve this we use HyperLogLog, which Redis conveniently provides. Here is how it is used:

redis 127.0.0.1:6379> PFADD mykey a b c d e f g h i j
(integer) 1
redis 127.0.0.1:6379> PFCOUNT mykey
(integer) 10

Think of a b c d e f g h i j as users: for every incoming user we run a PFADD, and PFCOUNT key then directly returns the deduplicated UV. Note that the algorithm is approximate; according to the documentation the standard error is about 0.8%, which is acceptable for UV. You can measure the exact error yourself; I won't test it here.
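The same PFADD/PFCOUNT flow through the Jedis client, as a minimal Scala sketch; the object name HllDemo, the key 20170721_demo.html, and the user list are made up for illustration, and RedisClient is the pool defined above:

package com.fan.spark.stream

/**
  * Minimal sketch: deduplicated UV with Redis HyperLogLog via Jedis.
  * The key "20170721_demo.html" and the users are illustrative only.
  */
object HllDemo {
  def main(args: Array[String]): Unit = {
    val jedis = RedisClient.pool.getResource
    val users = Array("jack", "leo", "jack", "lucy", "leo")
    // PFADD is idempotent per element, so repeated users are counted once
    users.foreach(user => jedis.pfadd("20170721_demo.html", user))
    // PFCOUNT returns the approximate distinct count (standard error ~0.8%)
    println(jedis.pfcount("20170721_demo.html")) // prints 3
    jedis.close()
  }
}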


The real-time computation code is as follows:

package com.fan.spark.stream

import java.text.SimpleDateFormat
import java.util.Date

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by http://www.fanlegefan.com on 17-7-21.
  */
object UserActionStreaming {

  def main(args: Array[String]): Unit = {

    val df = new SimpleDateFormat("yyyyMMdd")
    val group = "test"
    val topics = "test"

    val sparkConf = new SparkConf().setAppName("pvuv").setMaster("local[3]")
    val sc = new SparkContext(sparkConf)
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.checkpoint("/home/work/IdeaProjects/sparklearn/checkpoint")

    val topicSets = topics.split(",").toSet
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "localhost:9092",
      "group.id" -> group
    )
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc,
      kafkaParams, topicSets)

    stream.foreachRDD(rdd => rdd.foreachPartition(partition => {
      val jedis = RedisClient.pool.getResource
      partition.foreach(tuple => {
        val line = tuple._2
        val arr = line.split(",")
        val user = arr(0)
        val page = arr(1)
        val money = arr(2)
        val day = df.format(new Date(arr(3).toLong))
        // uv: add the user to the day's HyperLogLog for this page
        jedis.pfadd(day + "_" + page, user)
        // pv: increment the page counter in the day's hash
        jedis.hincrBy(day + "_pv", page, 1)
        // sum: accumulate the amount for this page
        jedis.hincrByFloat(day + "_sum", page, money.toDouble)
      })
      // return the connection to the pool once the partition is processed
      jedis.close()
    }))

    ssc.start()
    ssc.awaitTermination()
  }
}

Check the results in Redis

127.0.0.1:6379> keys *
1) "20170721_note.html"
2) "20170721_book.html"
3) "20170721_fanlegefan.com"
4) "20170721_mac.html"
5) "20170721_pv"
6) "20170721_mi.html"
7) "20170721_iphone4.html"
8) "20170721_sum"
9) "20170721_huawei.html"

View PV

127.0.0.1:6379> HGETALL 20170721_pv
 1) "mi.html"
 2) "112"
 3) "note.html"
 4) "107"
 5) "fanlegefan.com"
 6) "124"
 7) "huawei.html"
 8) "122"
 9) "iphone4.html"
10) "92"
11) "mac.html"
12) "103"
13) "book.html"
14) "135"

View sum

127.0.0.1:6379> HGETALL 20170721_sum
 1) "mi.html"
 2) "56949.65999999999998948"
 3) "note.html"
 4) "56803.50999999999999801"
 5) "fanlegefan.com"
 6) "59622.50999999999999801"
 7) "huawei.html"
 8) "64456.50000000000000711"
 9) "iphone4.html"
10) "48643.07000000000001094"
11) "mac.html"
12) "51693.17999999999998906"
13) "book.html"
14) "67724.17999999999999261"

View UV; the test data contains only 8 users, so every UV is 8

127.0.0.1:6379> PFCOUNT 20170721_huawei.html
(integer) 8
127.0.0.1:6379> PFCOUNT 20170721_fanlegefan.com
(integer) 8

The data is now in Redis. You can write a scheduled job to push it from Redis into MySQL so the front end can display it; that is roughly the whole idea of this real-time computation. A sketch of such a job follows.
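The job below reads the day's PV hash out of Redis and writes it to a hypothetical MySQL table pv_daily(dt, page, pv); the table, its columns, and the JDBC URL and credentials are assumptions made for illustration, not part of the original example:

package com.fan.spark.stream

import java.sql.DriverManager

import scala.collection.JavaConverters._

/**
  * Sketch only: push the day's PV hash from Redis into MySQL.
  * Table pv_daily(dt, page, pv) and the JDBC URL are assumed for illustration.
  */
object PushToMysql {
  def main(args: Array[String]): Unit = {
    val day = "20170721"

    // read the whole PV hash for the day from Redis
    val jedis = RedisClient.pool.getResource
    val pv = jedis.hgetAll(day + "_pv").asScala
    jedis.close()

    // upsert each page's PV into MySQL so the job can be re-run safely
    val conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/test", "root", "root")
    val stmt = conn.prepareStatement(
      "insert into pv_daily(dt, page, pv) values(?, ?, ?) on duplicate key update pv = ?")
    pv.foreach { case (page, count) =>
      stmt.setString(1, day)
      stmt.setString(2, page)
      stmt.setLong(3, count.toLong)
      stmt.setLong(4, count.toLong)
      stmt.executeUpdate()
    }
    stmt.close()
    conn.close()
  }
}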

