spark-streaming-[2]: The accumulator (updater) operation (updateStateByKey)
Source: Internet | Published: 数据分析 excel | Editor: 程序博客网 | Time: 2024/05/29 14:28
Thanks to the original author for sharing. Reference: 【Spark八十八】Spark Streaming累加器操作(updateStateByKey)
updateStateByKey(func)
Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
1. Define the state - The state can be an arbitrary data type.
2. Define the state update function - Specify with a function how to update the state using the previous state and the new values from an input stream.
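Spark Streaming's solution is an accumulator-style state. It works by defining something like a global, updatable variable: the statistic computed in each batch interval is added to (more precisely, used to update) the value obtained in the previous interval, so the accumulated value spans multiple batch intervals.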
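The idea of carrying a value across intervals can be tried without a cluster. The following is a minimal sketch in plain Scala, not Spark code: updateCount has the same shape as a per-key update function, while the object name RunningCountSketch and the sample intervals are assumptions made up for illustration.

```scala
object RunningCountSketch {
  // New values from the current interval plus the previous state give the new state.
  def updateCount(values: Seq[Int], state: Option[Int]): Option[Int] =
    Some(values.sum + state.getOrElse(0))

  def main(args: Array[String]): Unit = {
    // Hypothetical values seen for ONE key in three consecutive batch intervals.
    val intervals = Seq(Seq(1, 1), Seq.empty[Int], Seq(1))

    // Thread the state through the intervals; drop the initial None with tail.
    val states = intervals
      .scanLeft(Option.empty[Int])((state, values) => updateCount(values, state))
      .tail

    states.foreach(println) // prints Some(2), Some(2), Some(3)
  }
}
```

The second interval carries no new values, yet the state is still passed through the function and stays at 2. This mirrors how the state keeps its value between batches.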
Example: updating a (K, S) pair RDD
/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
 * @param updateFunc State update function. Note, that this function may generate a different
 *                   tuple with a different key than the input key. Therefore keys may be removed
 *                   or added in this way. It is up to the developer to decide whether to
 *                   remember the partitioner despite the key being changed.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream
 * @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
 * @param initialRDD initial state value of each key.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean,
    initialRDD: RDD[(K, S)]): DStream[(K, S)] = ssc.withScope {
  val cleanedFunc = ssc.sc.clean(updateFunc)
  val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
    cleanedFunc(it)
  }
  new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, Some(initialRDD))
}
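How should (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)] be read?
Input: an iterator of triples. In each triple, K is the key; Seq[V] is the collection of values produced for that key during the current batch interval (a Seq, for which you define the accumulation logic); Option[S] is the accumulated value from the previous interval, i.e. the key's state at the previous point in time.
Output: an iterator of pairs. In each pair, K is the key and S is the accumulated value once the current interval has been processed, i.e. the latest state.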
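To make the two shapes concrete, here is a small sketch that applies a function of this type to an in-memory iterator, with no Spark involved. The object name IteratorUpdateSketch, the sample triples, and the rule that a total of 0 yields None are assumptions chosen for illustration; they are not taken from the original program.

```scala
object IteratorUpdateSketch {
  // Per-key update. Returning None means "keep no state for this key"
  // (the zero-total rule is an illustrative assumption).
  def updateFunc(values: Seq[Int], state: Option[Int]): Option[Int] = {
    val total = values.sum + state.getOrElse(0)
    if (total == 0) None else Some(total)
  }

  // Iterator form: triples in, (key, newState) pairs out.
  // flatMap drops every key whose update returned None.
  val newUpdateFunc: Iterator[(String, Seq[Int], Option[Int])] => Iterator[(String, Int)] =
    it => it.flatMap { case (key, values, state) =>
      updateFunc(values, state).map(s => (key, s))
    }

  def main(args: Array[String]): Unit = {
    val input: Iterator[(String, Seq[Int], Option[Int])] = Iterator(
      ("hello", Seq(1, 1), Some(3)),    // 2 new occurrences + previous state 3
      ("spark", Seq(1), None),          // first time this key is seen
      ("idle", Seq.empty[Int], None)    // nothing new, no previous state
    )
    newUpdateFunc(input).foreach(println) // prints (hello,5) then (spark,1)
  }
}
```

Because the output is built with flatMap over an Option, a key can disappear from the result, which is why the iterator form is able to remove keys as well as update them.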
package com.dt.spark.main.Streaming

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
 * Created by hjw on 17/4/28.
 *
 * updateStateByKey(func)
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values for the key.
 * This can be used to maintain arbitrary state data for each key.
 * 1. Define the state - The state can be an arbitrary data type.
 * 2. Define the state update function - Specify with a function how to update the
 *    state using the previous state and the new values from an input stream.
 * The statistic computed in each batch interval is used to update the value from the
 * previous interval, so the accumulated value spans multiple batch intervals.
 */
object UpdateStateByKey {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR)

    // Function value that returns an Option[Int]: the latest state.
    // It adds the values produced for a key in the current batch interval
    // to the previous state to obtain the latest state.
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    // The input is an iterator of triples: the key, the values produced for that key
    // in the current batch interval, and the state at the previous point in time.
    // newUpdateFunc must return an Iterator[(String, Int)].
    val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      // Call updateFunc for each key (with the current values and the previous state),
      // then map the result to a (key, latest state) pair.
      iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
    }

    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local[3]")
    // Create the context with a 5 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    // Checkpointing is required for stateful operations
    ssc.checkpoint(".")

    // Initial RDD input to updateStateByKey
    val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

    // Create a ReceiverInputDStream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))

    // Update the cumulative count using updateStateByKey
    // This will give a DStream made of state (which is the cumulative count of the words)
    // Note the four arguments of updateStateByKey; the first is the state update function
    val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)
    stateDstream.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
1: Start a listener in a terminal
MacBook-Pro:~ hjw$ nc -lk 9999
2: Run the program
It prints:
-------------------------------------------
Time: 1493377725000 ms
-------------------------------------------
(world,1)
(hello,1)
3: Type more input in the terminal
MacBook-Pro:~ hjw$ nc -lk 9999
test program
world
-------------------------------------------
Time: 1493377770000 ms
-------------------------------------------
(world,2)
(hello,1)
(test,1)
(program,1)
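As the output shows, the state seeded from initialRDD has been updated: the count for world went from 1 to 2, and the new keys test and program were added.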
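The session above can be replayed offline to check the arithmetic. This is a sketch in plain Scala under stated assumptions: step is a hypothetical stand-in for one batch interval (it is not a Spark API), and the initial map plays the role of initialRDD.

```scala
object InitialStateReplay {
  // Same per-key logic as updateFunc in the program above.
  def updateCount(values: Seq[Int], state: Option[Int]): Option[Int] =
    Some(values.sum + state.getOrElse(0))

  // Hypothetical stand-in for one batch interval: every key that has state
  // or new values goes through updateCount.
  def step(state: Map[String, Int], words: Seq[String]): Map[String, Int] = {
    val newValues: Map[String, Seq[Int]] =
      words.groupBy(identity).map { case (w, ws) => (w, ws.map(_ => 1)) }
    (state.keySet ++ newValues.keySet).toSeq.flatMap { key =>
      updateCount(newValues.getOrElse(key, Seq.empty), state.get(key)).map(s => (key, s))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val initial = Map("hello" -> 1, "world" -> 1)      // plays the role of initialRDD
    val afterEmptyBatch = step(initial, Seq.empty)     // no input typed yet
    val typedLines = Seq("test program", "world")      // lines typed into nc
    val afterInput = step(afterEmptyBatch, typedLines.flatMap(_.split(" ")))
    println(afterEmptyBatch)
    println(afterInput)
  }
}
```

The first step leaves the seeded counts unchanged, which matches the first printed batch. The second step produces world = 2, hello = 1, test = 1 and program = 1, which matches the second.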