spark-streaming-[2] - Accumulator (Updater) Operations (updateStateByKey)


Thanks for sharing; adapted from: 【Spark八十八】Spark Streaming累加器操作(updateStateByKey) (Spark Streaming accumulator operations with updateStateByKey)


 updateStateByKey(func)

 Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key
  1. Define the state - The state can be an arbitrary data type.
  2. Define the state update function - Specify with a function how to update the state using the previous state and the new values from an input stream.
  Spark Streaming's solution is an accumulator (or, more accurately, an updater). The idea is to define what is effectively a global, updatable piece of state: the statistics computed in each time window are accumulated (more precisely, updated) onto the value from the previous window, so the running value spans multiple batch intervals. A minimal sketch of this pattern follows.
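
For context, here is a minimal sketch (not from the original post) of the simplest updateStateByKey overload, assuming a DStream[(String, Int)] named wordDstream and a StreamingContext whose checkpoint directory has already been set. It keeps a running word count across batches:

// Per-key update: add this batch's counts to the previous state (defaulting to 0).
val addNewCounts = (values: Seq[Int], state: Option[Int]) =>
  Some(values.sum + state.getOrElse(0))

// runningCounts carries the cumulative count of each word across all batches so far.
val runningCounts = wordDstream.updateStateByKey[Int](addNewCounts)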


The Spark source for the updateStateByKey overload that produces an updated (K, S) pair DStream:

/**
 * Return a new "state" DStream where the state for each key is updated by applying
 * the given function on the previous state of the key and the new values of each key.
 * org.apache.spark.Partitioner is used to control the partitioning of each RDD.
 * @param updateFunc State update function. Note, that this function may generate a different
 *                   tuple with a different key than the input key. Therefore keys may be removed
 *                   or added in this way. It is up to the developer to decide whether to
 *                   remember the partitioner despite the key being changed.
 * @param partitioner Partitioner for controlling the partitioning of each RDD in the new
 *                    DStream
 * @param rememberPartitioner Whether to remember the partitioner object in the generated RDDs.
 * @param initialRDD initial state value of each key.
 * @tparam S State type
 */
def updateStateByKey[S: ClassTag](
    updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
    partitioner: Partitioner,
    rememberPartitioner: Boolean,
    initialRDD: RDD[(K, S)]): DStream[(K, S)] = ssc.withScope {
  val cleanedFunc = ssc.sc.clean(updateFunc)
  val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
    cleanedFunc(it)
  }
  new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, Some(initialRDD))
}

How should (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)] be read?

Input: an iterator of triples. In each triple, K is the key; Seq[V] is the collection of values produced for that key during one batch interval (a Seq over which you define your own accumulation logic); and Option[S] is the accumulated value from the previous interval (the key's state at the previous point in time).

Output: an iterator of pairs. In each pair, K is the key and S is the accumulated value obtained after the current interval has been processed (i.e., the latest state).
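
As an illustration (the names below are hypothetical, not part of the Spark API), an ordinary per-key update function can be adapted to this iterator-based signature like this:

// perKeyUpdate is the usual (Seq[V], Option[S]) => Option[S] logic.
val perKeyUpdate = (values: Seq[Int], state: Option[Int]) =>
  Some(values.sum + state.getOrElse(0))

// Wrap it so each (key, values, previousState) triple is mapped to a (key, newState) pair;
// returning None from perKeyUpdate would drop the key from the state.
val iteratorUpdate = (it: Iterator[(String, Seq[Int], Option[Int])]) =>
  it.flatMap { case (key, values, state) =>
    perKeyUpdate(values, state).map(newState => (key, newState))
  }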

 

package com.dt.spark.main.Streaming

import org.apache.log4j.{Level, Logger}
import org.apache.spark.{HashPartitioner, SparkConf}
import org.apache.spark.streaming.{Seconds, StreamingContext}

/**
  * Created by hjw on 17/4/28.
  *
  * updateStateByKey(func)
  * Return a new "state" DStream where the state for each key is updated by applying the given
  * function on the previous state of the key and the new values for the key. This can be used
  * to maintain arbitrary state data for each key.
  * 1. Define the state - The state can be an arbitrary data type.
  * 2. Define the state update function - Specify with a function how to update the state using
  *    the previous state and the new values from an input stream.
  * Spark Streaming's solution is an accumulator (updater): define what is effectively a global,
  * updatable piece of state, and the statistics from each time window are accumulated (updated)
  * onto the value from the previous window, so the running value spans multiple intervals.
  */
object UpdateStateByKey {
  def main(args: Array[String]) {
    Logger.getLogger("org").setLevel(Level.ERROR)

    /// The per-key update function. The return type is Some(Int), representing the latest state.
    /// It adds the values produced for a key in the current interval to the previous state.
    val updateFunc = (values: Seq[Int], state: Option[Int]) => {
      val currentCount = values.sum
      val previousCount = state.getOrElse(0)
      Some(currentCount + previousCount)
    }

    /// The input is an iterator of triples: the key, the values produced for that key in the
    /// current interval, and the state at the previous point in time.
    /// newUpdateFunc must return an Iterator[(String, Int)].
    val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
      /// For each key, call updateFunc (passing the current interval's values and the previous
      /// state) to obtain the latest state, then map the result back to (key, latest state).
      iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
    }

    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount").setMaster("local[3]")
    // Create the context with a 5 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(5))
    ssc.checkpoint(".")

    // Initial RDD input to updateStateByKey
    val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

    // Create a ReceiverInputDStream on target ip:port and count the
    // words in input stream of \n delimited text (eg. generated by 'nc')
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))

    // Update the cumulative count using updateStateByKey
    // This will give a DStream made of state (which is the cumulative count of the words)
    // Note the four arguments of updateStateByKey; the first one is the state update function.
    val stateDstream = wordDstream.updateStateByKey[Int](newUpdateFunc,
      new HashPartitioner(ssc.sparkContext.defaultParallelism), true, initialRDD)
    stateDstream.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
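
As a hedged variant for local testing (not part of the original post), the same stateful count can be driven without a socket by feeding batches through queueStream; the object and queue names below are illustrative:

import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UpdateStateByKeyQueueDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UpdateStateByKeyQueueDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(".") // updateStateByKey requires a checkpoint directory

    // Each queued RDD is consumed as one micro-batch.
    val batches = mutable.Queue[RDD[String]]()
    batches += ssc.sparkContext.parallelize(Seq("hello world"))
    batches += ssc.sparkContext.parallelize(Seq("test program", "world"))

    val lines = ssc.queueStream(batches)
    val wordDstream = lines.flatMap(_.split(" ")).map((_, 1))

    // Same per-key update logic as in the example above.
    val updateFunc = (values: Seq[Int], state: Option[Int]) =>
      Some(values.sum + state.getOrElse(0))

    wordDstream.updateStateByKey[Int](updateFunc).print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(20000) // run a few batches, then stop
    ssc.stop()
  }
}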


1: In a terminal, start nc:

MacBook-Pro:~ hjw$ nc -lk 9999

2: Run the program

It prints:


-------------------------------------------
Time: 1493377725000 ms
-------------------------------------------
(world,1)
(hello,1)


3: Keep typing in the terminal:

MacBook-Pro:~ hjw$ nc -lk 9999
test  program
world

4: Program output

-------------------------------------------
Time: 1493377770000 ms
-------------------------------------------
(world,2)
(hello,1)
(test,1)
(program,1)

As you can see, the state seeded by initialRDD has been updated: "world" starts at 1 from the initial RDD, and the newly typed "world" adds 1 to give (world,2), while "test" and "program" start fresh at 1.
