Lesson 14: Spark Streaming Source Code Walkthrough: State Management with updateStateByKey and mapWithState

Source: the Internet. Editor: 程序博客网. Time: 2024/05/14 14:15


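This lesson covers a very important topic in Spark Streaming: state management. To explain it, we use two concrete methods, updateStateByKey and mapWithState, to show how Spark Streaming actually implements state management. Spark Streaming divides work into Jobs by Batch Duration, but sometimes we want to compute over the past hour, the past day, or the past week. Implementing business logic over a span that is much longer than the Batch Duration inevitably requires maintaining state.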

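In every Batch Duration Spark Streaming generates a Job, and the Job consists of RDDs. The question we face is: how do we maintain state across the RDDs of successive Batch Durations? For example, to compute a product's click count or the product rankings over a whole day, you need something like updateStateByKey or mapWithState to carry out the core step.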

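Spark itself is broad and deep, and most areas of the IT world show up in it: you can study the JVM through Spark, study distributed systems through Spark, study machine learning and graph computation through Spark, and also study architecture design and many software engineering topics through Spark. So with Spark as the vehicle we can do a great many things.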

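Can updateStateByKey and mapWithState be found in DStream? No, DStream does not define them. Both methods operate on Key-Value data, that is, on pair types. The situation is the same as with RDD, which we covered earlier: the RDD class itself does not define the Key-Value operations, so Scala implicit conversions are used to provide them.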

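Here is the DStream object. The best practice is to put implicit conversions in the static area of the type, that is, in the companion object, as is done with toPairDStreamFunctions. Before Spark 1.3 you had to write import StreamingContext._ to enable it. Now no import is needed: when you call a method such as updateStateByKey that DStream does not define, the compiler looks for an implicit conversion, finds that toPairDStreamFunctions accepts a DStream[(K, V)] and is marked implicit, and converts the DStream into a PairDStreamFunctions. We liken this to summoning a capability from the underworld, which returns there once it has been used.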

Source code of DStream.scala:

    object DStream {

      // `toPairDStreamFunctions` was in SparkContext before 1.3 and users had to
      // `import StreamingContext._` to enable it. Now we move it here to make the compiler find
      // it automatically. However, we still keep the old function in StreamingContext for backward
      // compatibility and forward to the following function directly.

      implicit def toPairDStreamFunctions[K, V](stream: DStream[(K, V)])
          (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null):
        PairDStreamFunctions[K, V] = {
        new PairDStreamFunctions[K, V](stream)
      }

      /** Get the creation site of a DStream from the stack trace of when the DStream is created. */
      private[streaming] def getCreationSite(): CallSite = {
        val SPARK_CLASS_REGEX = """^org\.apache\.spark""".r
        val SPARK_STREAMING_TESTCLASS_REGEX = """^org\.apache\.spark\.streaming\.test""".r
        val SPARK_EXAMPLES_CLASS_REGEX = """^org\.apache\.spark\.examples""".r
        val SCALA_CLASS_REGEX = """^scala""".r

        /** Filtering function that excludes non-user classes for a streaming application */
        def streamingExclustionFunction(className: String): Boolean = {
          def doesMatch(r: Regex): Boolean = r.findFirstIn(className).isDefined
          val isSparkClass = doesMatch(SPARK_CLASS_REGEX)
          val isSparkExampleClass = doesMatch(SPARK_EXAMPLES_CLASS_REGEX)
          val isSparkStreamingTestClass = doesMatch(SPARK_STREAMING_TESTCLASS_REGEX)
          val isScalaClass = doesMatch(SCALA_CLASS_REGEX)

          // If the class is a spark example class or a streaming test class then it is considered
          // as a streaming application class and don't exclude. Otherwise, exclude any
          // non-Spark and non-Scala class, as the rest would streaming application classes.
          (isSparkClass || isScalaClass) && !isSparkExampleClass && !isSparkStreamingTestClass
        }
        org.apache.spark.util.Utils.getCallSite(streamingExclustionFunction)
      }
    }
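
The companion-object lookup described above can be tried without Spark. The following is a minimal sketch in plain Scala; MiniStream, PairMiniStreamFunctions, toPairFunctions and countByKey are hypothetical names made up for this illustration and are not Spark classes or methods.

```scala
import scala.language.implicitConversions

// Hypothetical stand-in for DStream: a plain wrapper around a sequence.
class MiniStream[T](val items: Seq[T])

// Hypothetical stand-in for PairDStreamFunctions: methods that only make
// sense when the elements are key-value pairs.
class PairMiniStreamFunctions[K, V](self: MiniStream[(K, V)]) {
  def countByKey(): Map[K, Int] =
    self.items.groupBy(_._1).map { case (k, kvs) => (k, kvs.size) }
}

object MiniStream {
  // Defined in the companion object, so the compiler finds it in the
  // implicit scope of MiniStream without any import at the call site.
  implicit def toPairFunctions[K, V](s: MiniStream[(K, V)]): PairMiniStreamFunctions[K, V] =
    new PairMiniStreamFunctions[K, V](s)
}

object ImplicitDemo {
  def run(): Map[String, Int] = {
    val stream = new MiniStream(Seq(("a", 1), ("b", 2), ("a", 3)))
    // MiniStream has no countByKey; the call compiles through toPairFunctions.
    stream.countByKey()
  }
}
```

Calling ImplicitDemo.run() counts the pairs per key even though MiniStream itself declares no such method, which is the same shape as calling updateStateByKey on a DStream[(K, V)].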

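PairDStreamFunctions has many overloads of updateStateByKey, and it also has mapWithState. After either method is used, we are back to DStream-level operations. updateStateByKey applies the updateFunc function to update the data on top of the existing history and then returns a DStream.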

Source code of PairDStreamFunctions.scala:

    class PairDStreamFunctions[K, V](self: DStream[(K, V)])
        (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K])
      extends Serializable {
      ……
      @Experimental
      def mapWithState[StateType: ClassTag, MappedType: ClassTag](
          spec: StateSpec[K, V, StateType, MappedType]
        ): MapWithStateDStream[K, V, StateType, MappedType] = {
        new MapWithStateDStreamImpl[K, V, StateType, MappedType](
          self,
          spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
        )
      }
      ……
      /**
       * Return a new "state" DStream where the state for each key is updated by applying
       * the given function on the previous state of the key and the new values of each key.
       * Hash partitioning is used to generate the RDDs with `numPartitions` partitions.
       * @param updateFunc State update function. If this function returns None, the
       *                   corresponding key-value pair is eliminated.
       * @param numPartitions Number of partitions of each RDD in the new DStream.
       * @tparam S State type
       */
      def updateStateByKey[S: ClassTag](
          updateFunc: (Seq[V], Option[S]) => Option[S]
        ): DStream[(K, S)] = ssc.withScope {
        updateStateByKey(updateFunc, defaultPartitioner())
      }
      ……

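The types are declared in updateFunc: Seq[V] holds the new values that arrived for a key in the current batch, and Option[S] is the previous state of that key, which is None when the key has no state yet. Whether a computation is state-based or Batch Duration-based, it runs on RDDs, and every RDD needs partitions. defaultPartitioner uses HashPartitioner by default, which partitions by hash. Why use hashing for the Shuffle? It is worth thinking this through, so that you have sound principles behind you when you tune performance, balance load, or deal with data skew. Hashing matters because it is efficient: before Spark 1.2 the default Shuffle was hash-based for exactly this reason, since it needs no sorting. The parallelism here is defaultParallelism, exactly as in Spark Core. You can use these members in your own code as well; since they are private[streaming], pay attention to the package name you put that code in.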

Source code of defaultPartitioner in PairDStreamFunctions.scala:

    private[streaming] def defaultPartitioner(numPartitions: Int = self.ssc.sc.defaultParallelism) = {
      new HashPartitioner(numPartitions)
    }
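
To make the idea behind hash partitioning concrete, here is a minimal sketch in plain Scala. It is not Spark's HashPartitioner; HashPartitionSketch, partitionFor and distribute are hypothetical names for this illustration. It only shows the principle: a key is mapped to a partition by taking a non-negative remainder of its hashCode, so the same key always lands in the same partition and no sorting is needed.

```scala
object HashPartitionSketch {
  // Map a key to a partition index in [0, numPartitions).
  // The JVM remainder can be negative, so it is shifted back into range.
  def partitionFor(key: Any, numPartitions: Int): Int =
    if (key == null) 0
    else {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }

  // Group keys by the partition they would be sent to.
  def distribute[K](keys: Seq[K], numPartitions: Int): Map[Int, Seq[K]] =
    keys.groupBy(k => partitionFor(k, numPartitions))
}
```

Because the mapping depends only on the key and the partition count, the state for a key and the new records for that key end up in the same partition when both sides use the same partitioner.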

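Back to updateStateByKey: one of its parameters is a partitioner. Similarly, in Spark SQL, if you set a parallelism yourself and then operate on Hive tables in the Spark SQL on Hive style, does your setting take effect? Is the partitioning derived from Hive subject to your custom setting? No. This is a peculiarity of Spark SQL, and it sometimes leaves the result with too little parallelism, because later RDDs inherit the partitioning of earlier ones. If the parallelism ends up as 3 when it could have been 300, the computation suffers and problems such as GC pressure follow. In that case use repartition rather than coalesce.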

Source code of PairDStreamFunctions.scala:

    def updateStateByKey[S: ClassTag](
        updateFunc: (Seq[V], Option[S]) => Option[S],
        partitioner: Partitioner,
        initialRDD: RDD[(K, S)]
      ): DStream[(K, S)] = ssc.withScope {
      val cleanedUpdateF = sparkContext.clean(updateFunc)
      val newUpdateFunc = (iterator: Iterator[(K, Seq[V], Option[S])]) => {
        iterator.flatMap(t => cleanedUpdateF(t._2, t._3).map(s => (t._1, s)))
      }
      updateStateByKey(newUpdateFunc, partitioner, true, initialRDD)
    }
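
The wrapping step in the listing above can be sketched in plain Scala. The example assumes a running count as the state; UpdateFuncSketch, updateFunc and lift are hypothetical names for this illustration, and the rule that a total of zero drops the key is an arbitrary choice made only to show what returning None does.

```scala
object UpdateFuncSketch {
  // User-level update function for a running count.
  // newValues: values seen for the key in the current batch.
  // state: the previous total for the key, None if the key is new.
  val updateFunc: (Seq[Int], Option[Int]) => Option[Int] =
    (newValues, state) => {
      val total = state.getOrElse(0) + newValues.sum
      if (total == 0) None else Some(total) // None removes the key
    }

  // Lift a per-key function to the iterator form, in the same way as
  // newUpdateFunc: keys whose update returns None vanish from the output.
  def lift[K, V, S](f: (Seq[V], Option[S]) => Option[S])
      : Iterator[(K, Seq[V], Option[S])] => Iterator[(K, S)] =
    iterator => iterator.flatMap(t => f(t._2, t._3).map(s => (t._1, s)))
}
```

Applying the lifted function to an iterator of (key, new values, previous state) triples yields the new (key, state) pairs, which is the form the lower-level overload expects.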

This wraps updateFunc in newUpdateFunc and passes newUpdateFunc to another updateStateByKey overload, with the rememberPartitioner argument set to true.

Source code of PairDStreamFunctions.scala:

    def updateStateByKey[S: ClassTag](
        updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
        partitioner: Partitioner,
        rememberPartitioner: Boolean,
        initialRDD: RDD[(K, S)]): DStream[(K, S)] = ssc.withScope {
      val cleanedFunc = ssc.sc.clean(updateFunc)
      val newUpdateFunc = (_: Time, it: Iterator[(K, Seq[V], Option[S])]) => {
        cleanedFunc(it)
      }
      new StateDStream(self, newUpdateFunc, partitioner, rememberPartitioner, Some(initialRDD))
    }

The key point is that a new StateDStream is created.

Source code of StateDStream.scala:

    private[streaming]
    class StateDStream[K: ClassTag, V: ClassTag, S: ClassTag](
        parent: DStream[(K, V)],
        updateFunc: (Time, Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],
        partitioner: Partitioner,
        preservePartitioning: Boolean,
        initialRDD: Option[RDD[(K, S)]]
      ) extends DStream[(K, S)](parent.ssc) {
      super.persist(StorageLevel.MEMORY_ONLY_SER)

      override def dependencies: List[DStream[_]] = List(parent)

      override def slideDuration: Duration = parent.slideDuration

      override val mustCheckpoint = true

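StateDStream itself inherits from DStream. For example, if an ad-click system computes over a whole day of data, the data is persisted in memory at MEMORY_ONLY_SER, and as long as updateStateByKey keeps running, the StateDStream keeps producing a new state RDD for every batch.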

Source code of compute in StateDStream.scala:

    override def compute(validTime: Time): Option[RDD[(K, S)]] = {

      // Try to get the previous state RDD
      getOrCompute(validTime - slideDuration) match {

        case Some(prevStateRDD) =>    // If previous state RDD exists
          // Try to get the parent RDD
          parent.getOrCompute(validTime) match {
            case Some(parentRDD) =>   // If parent RDD exists, then compute as usual
              computeUsingPreviousRDD(validTime, parentRDD, prevStateRDD)
            case None =>    // If parent RDD does not exist
              // Re-apply the update function to the old state RDD
              val updateFuncLocal = updateFunc
              val finalFunc = (iterator: Iterator[(K, S)]) => {
                val i = iterator.map(t => (t._1, Seq[V](), Option(t._2)))
                updateFuncLocal(validTime, i)
              }
              val stateRDD = prevStateRDD.mapPartitions(finalFunc, preservePartitioning)
              Some(stateRDD)
          }

        case None =>    // If previous session RDD does not exist (first input data)
          // Try to get the parent RDD
          parent.getOrCompute(validTime) match {
            case Some(parentRDD) =>   // If parent RDD exists, then compute as usual
              initialRDD match {
                case None =>
                  // Define the function for the mapPartition operation on grouped RDD;
                  // first map the grouped tuple to tuples of required type,
                  // and then apply the update function
                  val updateFuncLocal = updateFunc
                  val finalFunc = (iterator: Iterator[(K, Iterable[V])]) => {
                    updateFuncLocal(validTime,
                      iterator.map(tuple => (tuple._1, tuple._2.toSeq, None)))
                  }

                  val groupedRDD = parentRDD.groupByKey(partitioner)
                  val sessionRDD = groupedRDD.mapPartitions(finalFunc, preservePartitioning)
                  // logDebug("Generating state RDD for time " + validTime + " (first)")
                  Some(sessionRDD)
                case Some(initialStateRDD) =>
                  computeUsingPreviousRDD(validTime, parentRDD, initialStateRDD)
              }
            case None => // If parent RDD does not exist, then nothing to do!
              // logDebug("Not generating state RDD (no previous state, no parent)")
              None
          }
      }
    }

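compute is the method that StateDStream overrides. When a parentRDD exists for the batch, compute passes it to computeUsingPreviousRDD. This is the crucial part, because it exposes the weakness of updateStateByKey. computeUsingPreviousRDD applies the updateFunc that was passed in, and it does so after val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner). The core logic is cogroup, which aggregates the Values of all the data by key, and this happens on every batch. The upside is that it is an ordinary RDD computation: cogroup is computed like any other RDD operation. The downside is performance: cogroup has to rescan all of the data every time, so as time passes the amount to scan keeps growing and performance keeps dropping.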

Source code of computeUsingPreviousRDD in StateDStream.scala:

    private[this] def computeUsingPreviousRDD(
        batchTime: Time,
        parentRDD: RDD[(K, V)],
        prevStateRDD: RDD[(K, S)]) = {
      // Define the function for the mapPartition operation on cogrouped RDD;
      // first map the cogrouped tuple to tuples of required type,
      // and then apply the update function
      val updateFuncLocal = updateFunc
      val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {
        val i = iterator.map { t =>
          val itr = t._2._2.iterator
          val headOption = if (itr.hasNext) Some(itr.next()) else None
          (t._1, t._2._1.toSeq, headOption)
        }
        updateFuncLocal(batchTime, i)
      }
      val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)
      val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)
      Some(stateRDD)
    }
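
The cost described above can be made visible with a small local simulation in plain Scala. This is a sketch of the idea only, with ordinary collections instead of RDDs; CogroupStateSketch and step are hypothetical names, and the calls counter is added purely to count how many keys the update function is invoked for.

```scala
object CogroupStateSketch {
  // One batch step in the style of computeUsingPreviousRDD: combine the batch
  // with the whole previous state by key, then run the update function on
  // every key found on either side.
  def step(
      batch: Seq[(String, Int)],
      prevState: Map[String, Int],
      update: (Seq[Int], Option[Int]) => Option[Int]): (Map[String, Int], Int) = {
    val grouped: Map[String, Seq[Int]] =
      batch.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }
    val allKeys = (grouped.keySet ++ prevState.keySet).toSeq
    var calls = 0
    val next = allKeys.flatMap { k =>
      calls += 1 // one update call per key, whether or not it has new data
      update(grouped.getOrElse(k, Seq.empty), prevState.get(k)).map(s => (k, s))
    }.toMap
    (next, calls)
  }
}
```

With a state of two keys and a batch that contains a single record for a third key, the update function still runs three times. The work per batch follows the size of the accumulated state rather than the size of the batch.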

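Cogrouping one RDD with another yields tuples whose value is (Iterable[V], Iterable[W]). For example, suppose one RDD holds student IDs and names and another holds student IDs and scores; after cogroup the key is the student ID and the Value is the names and the scores. This is acceptable when the data volume is small or when updateStateByKey covers a short period. If the period is too long, you can consider handling it on a schedule; persistence here is disk-based, so that by itself may not matter much. But scanning all of the data on every batch is unbearable. For example, after running for several days you will find the computation getting slower and slower: it was not slow during the first minute, but it slows down after a while.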

Source code of PairRDDFunctions.scala:

    def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
        : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
      if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
        throw new SparkException("HashPartitioner cannot partition array keys.")
      }
      val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
      cg.mapValues { case Array(vs, w1s) =>
        (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
      }
    }
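
The student example can be reproduced locally. The following sketch implements a cogroup over ordinary Scala sequences, not over RDDs; CogroupSketch and its cogroup method are hypothetical names, and the sample names and scores in the usage note are made up.

```scala
object CogroupSketch {
  // Local analogue of cogroup: for every key that appears in either input,
  // collect all values from the left side and all values from the right side.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val vs = left.filter(_._1 == k).map(_._2)
      val ws = right.filter(_._1 == k).map(_._2)
      (k, (vs, ws))
    }.toMap
  }
}
```

For example, cogrouping Seq((1, "Alice"), (2, "Bob")) with Seq((1, 90), (1, 85), (3, 70)) gives student 1 one name and two scores, student 2 a name and no scores, and student 3 no name and one score. Every key from both sides appears in the result, which is why the whole state takes part in each batch.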

mapWithState is still experimental, which means it is not yet stable. Our past experiments, however, show that mapWithState works well.

Source code of PairDStreamFunctions.scala:

    /**
     * :: Experimental ::
     * Return a [[MapWithStateDStream]] by applying a function to every key-value element of
     * this stream, while maintaining some state data for each unique key. The mapping function
     * and the other specification of the transformation (such as partitioning, timeouts and
     * initial state data) can be specified using the StateSpec class. The state data is
     * accessible in the mapping function as a parameter of type State.
     *
     * Example of using mapWithState:
     * {{{
     *    // A mapping function that maintains an integer state and returns a string.
     *    def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {
     *      // Use state.exists(), state.get(), state.update() and state.remove()
     *      // to manage the state, and return the required string
     *    }
     *
     *    val spec = StateSpec.function(mappingFunction).numPartitions(10)
     *
     *    val mapWithStateDStream = keyValueDStream.mapWithState[StateType, MappedType](spec)
     * }}}
     *
     * @param spec          Specification of this transformation
     * @tparam StateType    Type of the state data
     * @tparam MappedType   Type of the mapped data
     */
    @Experimental
    def mapWithState[StateType: ClassTag, MappedType: ClassTag](
        spec: StateSpec[K, V, StateType, MappedType]
      ): MapWithStateDStream[K, V, StateType, MappedType] = {
      new MapWithStateDStreamImpl[K, V, StateType, MappedType](
        self,
        spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
      )
    }

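mapWithState returns a MapWithStateDStream. It uses a function to keep maintaining and updating state for our key-Value elements: there is a historical state, it is updated per Key, and the function that performs the update is described by the StateSpec. The spec parameter that mapWithState accepts is of type StateSpec. A StateSpec is not a function, but it wraps one.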

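state is the historical state. You can think of state as a database, or picture it as an in-memory table. state.exists(), state.get(), state.update() and state.remove() check whether a value exists, read it, update it, and delete it, and can be understood as the corresponding table operations, such as deleting a row. state is a table that records all the historical state being maintained. mappingFunction says which key in the table to operate on and what the incoming value for that key is; the key can be used to look up the stored value in the table, and what to do with the value is the business logic inside mappingFunction. state is thus like a key-value table with two columns, key and value. All historical state lives in this table, and the table is called state.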

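During an update, state can be regarded as an index into the table: given a key, the value is updated on top of state. The in-memory table is an abstraction. Looking at state concretely, take word count, which keeps accumulating counts; in the comment example above the state type is likewise State[Int], an integer. From the in-memory-table point of view, the state markers, such as the removal flag and the timeout, are all held in memory, and the operations are simply inserts, deletes, and updates on a table.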

Source code of State.scala:

    /**
     * :: Experimental ::
     * Class for getting and updating the state in the mapping function used in the mapWithState
     * operation of a [[org.apache.spark.streaming.dstream.PairDStreamFunctions pair DStream]] (Scala)
     * or a [[org.apache.spark.streaming.api.java.JavaPairDStream JavaPairDStream]] (Java).
     *
     * Example of using State in Scala:
     * {{{
     *    // A mapping function that maintains an integer state and returns a string.
     *    def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {
     *      // Check whether the state exists
     *      if (state.exists) {
     *        val existingState = state.get  // Get the existing state
     *        val shouldRemove = ...         // Decide whether to remove the state
     *        if (shouldRemove) {
     *          state.remove()     // Remove the state
     *        } else {
     *          val newState = ...
     *          state.update(newState)    // Set the new state
     *        }
     *      } else {
     *        val initialState = ...
     *        state.update(initialState)  // Set the initial state
     *      }
     *      ... // Return value
     *    }
     *
     * }}}
     *
     * Example of using State in Java:
     * {{{
     *    // A mapping function that maintains an integer state and returns a string.
     *    Function3<String, Optional<Integer>, State<Integer>, String> mappingFunction =
     *       new Function3<String, Optional<Integer>, State<Integer>, String>() {
     *
     *         @Override
     *         public String call(String key, Optional<Integer> value, State<Integer> state) {
     *           if (state.exists()) {
     *             int existingState = state.get(); // Get the existing state
     *             boolean shouldRemove = ...; // Decide whether to remove the state
     *             if (shouldRemove) {
     *               state.remove(); // Remove the state
     *             } else {
     *               int newState = ...;
     *               state.update(newState); // Set the new state
     *             }
     *           } else {
     *             int initialState = ...; // Set the initial state
     *             state.update(initialState);
     *           }
     *           // Return value
     *         }
     *       };
     * }}}
     *
     * @tparam S Class of the state
     */
    @Experimental
    sealed abstract class State[S] {
    ……..
    private[streaming] class StateImpl[S] extends State[S] {

      private var state: S = null.asInstanceOf[S]
      private var defined: Boolean = false
      private var timingOut: Boolean = false
      private var updated: Boolean = false
      private var removed: Boolean = false
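
To show how a mapping function drives these state operations, here is a minimal sketch in plain Scala. MiniState is a hypothetical, heavily simplified stand-in for State[S] written for this illustration; it is not Spark's class and leaves out timeouts. The cap of 3 in the mapping function is likewise an arbitrary assumption, chosen only so that remove() is exercised.

```scala
// Hypothetical, simplified stand-in for State[S]: one cell per key that can
// be read, updated, or removed, with a flag recording the removal.
class MiniState[S](initial: Option[S]) {
  private var value: Option[S] = initial
  private var removed = false
  def exists: Boolean = value.isDefined
  def get: S = value.getOrElse(throw new NoSuchElementException("State is not set"))
  def update(s: S): Unit = { value = Some(s); removed = false }
  def remove(): Unit = { value = None; removed = true }
  def isRemoved: Boolean = removed
}

object MappingFunctionSketch {
  // Running count per key: add the new value to the stored count, and drop
  // the state once the count reaches the (arbitrary) cap of 3.
  def mappingFunction(key: String, value: Option[Int], state: MiniState[Int]): Option[String] = {
    val sum = value.getOrElse(0) + (if (state.exists) state.get else 0)
    if (sum >= 3) state.remove() else state.update(sum)
    Some(s"$key=$sum")
  }
}
```

The mapping function never sees the rest of the table. It receives the cell for one key, decides what the new content of that cell is, and returns the mapped output for the record.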

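In mapWithState, StateSpecImpl wraps what is passed in; its type parameters are the key and Value types, StateType, and MappedType. StateSpecImpl is a case class whose parameter is the function named function, and it has an important method, getFunction, which returns that function. In other words, a data structure is used to wrap the function. It also has getPartitioner, getInitialStateRDD, and getTimeoutInterval. This is a good programming technique: from the framework's point of view, the settings are packaged into one data structure.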

Source code of StateSpecImpl (in StateSpec.scala):

    private[streaming]
    case class StateSpecImpl[K, V, S, T](
        function: (Time, K, Option[V], State[S]) => Option[T]) extends StateSpec[K, V, S, T] {

      require(function != null)

      @volatile private var partitioner: Partitioner = null
      @volatile private var initialStateRDD: RDD[(K, S)] = null
      @volatile private var timeoutInterval: Duration = null

      override def initialState(rdd: RDD[(K, S)]): this.type = {
        this.initialStateRDD = rdd
        this
      }
      …..
      private[streaming] def getFunction(): (Time, K, Option[V], State[S]) => Option[T] = function

      private[streaming] def getInitialStateRDD(): Option[RDD[(K, S)]] = Option(initialStateRDD)

      private[streaming] def getPartitioner(): Option[Partitioner] = Option(partitioner)

      private[streaming] def getTimeoutInterval(): Option[Duration] = Option(timeoutInterval)
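
The packaging technique in the listing above, a function plus optional settings behind chainable setters that return this.type, can be sketched in plain Scala as follows. MiniSpec and all of its members are hypothetical names for this illustration; the function shape is simplified and does not match Spark's mapping function signature.

```scala
// Hypothetical spec class: wraps a function and collects optional settings.
final class MiniSpec[K, V, S, T](val function: (K, Option[V], Option[S]) => (Option[S], T)) {
  require(function != null)

  private var partitions: Option[Int] = None
  private var timeoutMs: Option[Long] = None

  // Each setter returns this.type so that calls can be chained.
  def numPartitions(n: Int): this.type = { partitions = Some(n); this }
  def timeout(ms: Long): this.type = { timeoutMs = Some(ms); this }

  // Getters expose unset values as None instead of null.
  def getNumPartitions: Option[Int] = partitions
  def getTimeoutMs: Option[Long] = timeoutMs
}
```

A caller builds the spec in one expression, for example new MiniSpec[String, Int, Int, String]((k, v, s) => (s, k)).numPartitions(10), and the framework side reads the settings back through the getters and falls back to defaults for anything left unset.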

MapWithStateDStreamImpl takes a dataStream and a StateSpecImpl. Its compute method hands the business logic to internalStream; InternalMapWithStateDStream is an internal class. Source code of MapWithStateDStreamImpl (in MapWithStateDStream.scala):

    private[streaming] class MapWithStateDStreamImpl[
        KeyType: ClassTag, ValueType: ClassTag, StateType: ClassTag, MappedType: ClassTag](
        dataStream: DStream[(KeyType, ValueType)],
        spec: StateSpecImpl[KeyType, ValueType, StateType, MappedType])
      extends MapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream.context) {
      ……
      private val internalStream =
        new InternalMapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream, spec)

      override def slideDuration: Duration = internalStream.slideDuration

      override def dependencies: List[DStream[_]] = List(internalStream)

      override def compute(validTime: Time): Option[RDD[MappedType]] = {
        internalStream.getOrCompute(validTime).map { _.flatMap[MappedType] { _.mappedData } }
      }

InternalMapWithStateDStream updates on top of the history. It is persisted at MEMORY_ONLY, and it keeps updating an internal data structure rather than creating a new data structure object.

Source code of MapWithStateDStream.scala:

    private[streaming]
    class InternalMapWithStateDStream[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
        parent: DStream[(K, V)], spec: StateSpecImpl[K, V, S, E])
      extends DStream[MapWithStateRDDRecord[K, S, E]](parent.context) {
      persist(StorageLevel.MEMORY_ONLY)

      private val partitioner = spec.getPartitioner().getOrElse(
        new HashPartitioner(ssc.sc.defaultParallelism))

      private val mappingFunction = spec.getFunction()

      override def slideDuration: Duration = parent.slideDuration

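compute in InternalMapWithStateDStream creates a new RDD for the current Batch Duration. A time is passed in, and there may be no data for it; if there is none, an empty RDD is used. The key line is Some(new MapWithStateRDD(prevStateRDD, partitionedDataRDD, mappingFunction, validTime, timeoutThresholdTime)). It receives prevStateRDD and partitionedDataRDD, and the historical records themselves do not appear; it also receives mappingFunction. The companion object InternalMapWithStateDStream holds the checkpoint duration multiplier.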

Source code of MapWithStateDStream.scala:

      override def compute(validTime: Time): Option[RDD[MapWithStateRDDRecord[K, S, E]]] = {
        // Get the previous state or create a new empty state RDD
        val prevStateRDD = getOrCompute(validTime - slideDuration) match {
          case Some(rdd) =>
            if (rdd.partitioner != Some(partitioner)) {
              // If the RDD is not partitioned the right way, let us repartition it using the
              // partition index as the key. This is to ensure that state RDD is always partitioned
              // before creating another state RDD using it
              MapWithStateRDD.createFromRDD[K, V, S, E](
                rdd.flatMap { _.stateMap.getAll() }, partitioner, validTime)
            } else {
              rdd
            }
          case None =>
            MapWithStateRDD.createFromPairRDD[K, V, S, E](
              spec.getInitialStateRDD().getOrElse(new EmptyRDD[(K, S)](ssc.sparkContext)),
              partitioner,
              validTime
            )
        }


        // Compute the new state RDD with previous state RDD and partitioned data RDD
        // Even if there is no data RDD, use an empty one to create a new state RDD
        val dataRDD = parent.getOrCompute(validTime).getOrElse {
          context.sparkContext.emptyRDD[(K, V)]
        }
        val partitionedDataRDD = dataRDD.partitionBy(partitioner)
        val timeoutThresholdTime = spec.getTimeoutInterval().map { interval =>
          (validTime - interval).milliseconds
        }
        Some(new MapWithStateRDD(
          prevStateRDD, partitionedDataRDD, mappingFunction, validTime, timeoutThresholdTime))
      }
    }
    …..
    private[streaming] object InternalMapWithStateDStream {
      private val DEFAULT_CHECKPOINT_DURATION_MULTIPLIER = 10
    }

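All the exciting parts start with MapWithStateRDD. MapWithStateRDD is an RDD, and as an RDD it contains the data of mapWithState and how to operate on it. Each partition is represented by a MapWithStateRDDRecord, which contains the StateMap data structure that maintains the state of the data.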

Source code of MapWithStateRDD.scala:

    /**
     * RDD storing the keyed states of the mapWithState operation and the corresponding mapped
     * data. Each partition of this RDD has a record of type [[MapWithStateRDDRecord]]. This
     * contains a StateMap (containing the keyed states) and the sequence of records returned by
     * the mapping function of mapWithState.
     * @param prevStateRDD The previous MapWithStateRDD on whose StateMap data `this` RDD
     *                     will be created
     * @param partitionedDataRDD The partitioned data RDD which is used update the previous StateMaps
     *                           in the `prevStateRDD` to create `this` RDD
     * @param mappingFunction  The function that will be used to update state and return new data
     * @param batchTime        The time of the batch to which this RDD belongs to. Use to update
     * @param timeoutThresholdTime The time to indicate which keys are timeout
     */
    private[streaming] class MapWithStateRDD[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
        private var prevStateRDD: RDD[MapWithStateRDDRecord[K, S, E]],
        private var partitionedDataRDD: RDD[(K, V)],
        mappingFunction: (Time, K, Option[V], State[S]) => Option[E],
        batchTime: Time,
        timeoutThresholdTime: Option[Long]
      ) extends RDD[MapWithStateRDDRecord[K, S, E]](
        partitionedDataRDD.sparkContext,
        List(
          new OneToOneDependency[MapWithStateRDDRecord[K, S, E]](prevStateRDD),
          new OneToOneDependency(partitionedDataRDD))
      ) {

The focus in MapWithStateRDD is compute, which obtains the iterators prevStateRDDIterator and dataIterator and finally returns an Iterator(newRecord).

Source code of MapWithStateRDD.scala:

    override def compute(
        partition: Partition, context: TaskContext): Iterator[MapWithStateRDDRecord[K, S, E]] = {

      val stateRDDPartition = partition.asInstanceOf[MapWithStateRDDPartition]
      val prevStateRDDIterator = prevStateRDD.iterator(
        stateRDDPartition.previousSessionRDDPartition, context)
      val dataIterator = partitionedDataRDD.iterator(
        stateRDDPartition.partitionedDataRDDPartition, context)

      val prevRecord = if (prevStateRDDIterator.hasNext) Some(prevStateRDDIterator.next()) else None
      val newRecord = MapWithStateRDDRecord.updateRecordWithData(
        prevRecord,
        dataIterator,
        mappingFunction,
        batchTime,
        timeoutThresholdTime,
        removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled
      )
      Iterator(newRecord)
    }

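MapWithStateRDDRecord involves two key data structures, mappedData and wrappedState. newStateMap is first built by copying the StateMap, and this copy is quite efficient. Then dataIterator.foreach loops over the batch, repeatedly assigning to wrappedState; mappedData is the value that is finally returned. After each call the code checks whether the state should be removed and removes it if so. The computation runs over the current Batch's data and updates newStateMap, which holds the entire history and supports removal and update. Is the historical data recomputed or traversed? No. There is no cogroup; only the current data is processed and the data structure is updated in memory, which is far more efficient.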

Source code of MapWithStateRDD.scala:

    private[streaming] object MapWithStateRDDRecord {
      def updateRecordWithData[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
        prevRecord: Option[MapWithStateRDDRecord[K, S, E]],
        dataIterator: Iterator[(K, V)],
        mappingFunction: (Time, K, Option[V], State[S]) => Option[E],
        batchTime: Time,
        timeoutThresholdTime: Option[Long],
        removeTimedoutData: Boolean
      ): MapWithStateRDDRecord[K, S, E] = {
        // Create a new state map by cloning the previous one (if it exists) or by creating an empty one
        val newStateMap = prevRecord.map { _.stateMap.copy() }.getOrElse { new EmptyStateMap[K, S]() }

        val mappedData = new ArrayBuffer[E]
        val wrappedState = new StateImpl[S]()

        // Call the mapping function on each record in the data iterator, and accordingly
        // update the states touched, and collect the data returned by the mapping function
        dataIterator.foreach { case (key, value) =>
          wrappedState.wrap(newStateMap.get(key))
          val returned = mappingFunction(batchTime, key, Some(value), wrappedState)
          if (wrappedState.isRemoved) {
            newStateMap.remove(key)
          } else if (wrappedState.isUpdated
              || (wrappedState.exists && timeoutThresholdTime.isDefined)) {
            newStateMap.put(key, wrappedState.get(), batchTime.milliseconds)
          }
          mappedData ++= returned
        }

        // Get the timed out state records, call the mapping function on each and collect the
        // data returned
        if (removeTimedoutData && timeoutThresholdTime.isDefined) {
          newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
            wrappedState.wrapTimingOutState(state)
            val returned = mappingFunction(batchTime, key, None, wrappedState)
            mappedData ++= returned
            newStateMap.remove(key)
          }
        }

        MapWithStateRDDRecord(newStateMap, mappedData)
      }
    }
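
The difference from the cogroup approach can be seen in a small local simulation in plain Scala. This sketch mirrors only the shape of the loop in the listing above, with a running sum as the assumed mapping logic; IncrementalStateSketch and step are hypothetical names, the calls counter exists only for illustration, and the eager copy of the map is a simplification made for this sketch.

```scala
import scala.collection.mutable

object IncrementalStateSketch {
  // One batch step in the style of updateRecordWithData: start from a copy of
  // the previous state map and touch only the keys present in the batch.
  def step(
      prev: Map[String, Int],
      batch: Seq[(String, Int)]): (Map[String, Int], Seq[String], Int) = {
    val newState = mutable.HashMap[String, Int](prev.toSeq: _*)
    val mapped = mutable.ArrayBuffer.empty[String]
    var calls = 0
    batch.foreach { case (key, value) =>
      calls += 1 // one mapping call per incoming record, not per stored key
      val sum = newState.getOrElse(key, 0) + value
      newState.put(key, sum)
      mapped += s"$key=$sum"
    }
    (newState.toMap, mapped.toList, calls)
  }
}
```

With a previous state of two keys and a batch of two records, the mapping logic runs twice and the untouched key is carried over without being visited. The number of calls follows the batch size rather than the size of the accumulated state.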

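In the end a MapWithStateRDDRecord is returned. From the RDD's point of view the partition has not changed, but its internals have. Previously an RDD pointed directly at records that could not be modified. Now it still points at one record, but that record is a wrapper whose contents can change, while from the RDD's point of view the data has not changed. This is a very clever design. MapWithStateRDDRecord represents the current partition; DStream operates on RDDs, and the RDD has no say over the contents of a MapWithStateRDDRecord changing while the record itself stays in place. By relying on the immutability of RDD and combining it with the mutability of MapWithStateRDDRecord, the whole process is completed efficiently.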

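One extra conclusion: an RDD itself cannot change, yet an immutable RDD can still process changing data, which is something to keep in mind when designing the data structure of a custom RDD. Saying that RDDs are immutable is correct; saying that the data an RDD processes must also be immutable is wrong, and MapWithStateRDDRecord shows this very clearly. Can an RDD only process data from an unchanging source? Of course not. The data source can change, and inside the RDD you take responsibility for those changes and maintain the data yourself. This is a very important conclusion.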
