Spark学习笔记（14）State管理之updateStateByKey解密

来源：互联网发布：python hex to ascii 编辑：程序博客网时间：2024/05/29 02:42

从这节课开始，简介Spark Streaming的状态管理。

　　Spark Streaming 是按Batch Duration来划分Job的，但我们有时需要根据业务要求按照另外的时间周期（比如说，对过去24小时、或者过去一周的数据，等等这些大于Batch Duration的周期），对数据进行处理（比如计算最近24小时的销售额排名、今年的最新销售量等）。这需要根据之前的计算结果和新时间周期的数据，计算出新的计算结果。

　　updateStateByKey和mapWithState都是针对<key,value>类型的数据进行操作，而RDD类本身并不对 <key,value>类型的数据进行操作，所以要借助隐式转换。隐式转换放在了DStream伴生对象的区域。

object DStream {

// `toPairDStreamFunctions` was in SparkContext before 1.3 and users had to

// `import StreamingContext._` to enable it. Now we move it here to make the compiler find

// it automatically. However, we still keep the old function in StreamingContext for backward

// compatibility and forward to the following function directly.

implicit def toPairDStreamFunctions[K, V](stream: DStream[(K, V)])

(implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null):

PairDStreamFunctions[K, V] = {

new PairDStreamFunctions[K, V](stream)

}

...

}

　　生成了

PairDStreamFunctions对象。

　　PairDStreamFunctions类中有updateStateByKey、mapWithState这些功能。

本期内容：

1. updateStateByKey解密

2. mapWithState解密

1. updateStateByKey解密

　　先看updateStateByKey：

/**

* Return a new "state" DStream where the state for each key is updated by applying

* the given function on the previous state of the key and the new values of each key.

* Hash partitioning is used to generate the RDDs with Spark's default number of partitions.

* @param updateFunc State update function. If `this` function returns None, then

* corresponding state key-value pair will be eliminated.

* @tparam S State type

def updateStateByKey[S: ClassTag](

updateFunc: (Seq[V], Option[S]) => Option[S]

): DStream[(K, S)] = ssc.withScope {

updateStateByKey(updateFunc, defaultPartitioner())

}

　　updateStateByKey返回的都是DStream类型。

　　根据updateFunc这个函数来更新状态。其中参数：Seq[V]是本次的数据类型，Option[S]是前次计算结果类型，本次计算结果类型也是Option[S]。

　　计算肯定需要Partitioner。因为Hash高效率且不做排序，默认Partitioner是HashPartitoner。

　　PairDStreamFunction.defaultPartitioner：

private[streaming] def defaultPartitioner(numPartitions: Int = self.ssc.sc.defaultParallelism) = {

new HashPartitioner(numPartitions)

}

　　看其中返回值类型为StateDStream的updateStateByKey：

/**

* Return a new "state" DStream where the state for each key is updated by applying

* the given function on the previous state of the key and the new values of each key.

* org.apache.spark.Partitioner is used to control the partitioning of each RDD.

* @param updateFunc State update function. Note, that this function may generate a different

* tuple with a different key than the input key. Therefore keys may be removed

* or added in this way. It is up to the developer to decide whether to

* remember the partitioner despite the key being changed.

* @param partitioner Partitioner for controlling the partitioning of each RDD in the new

* DStream

* @param rememberPartitioner Whether to remember the paritioner object in the generated RDDs.

* @tparam S State type

def updateStateByKey[S: ClassTag](

updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],

partitioner: Partitioner,

rememberPartitioner: Boolean

): DStream[(K, S)] = ssc.withScope {

new StateDStream(self, ssc.sc.clean(updateFunc), partitioner, rememberPartitioner, None)

}

　　看看这个StateDStream：

class StateDStream[K: ClassTag, V: ClassTag, S: ClassTag](

parent: DStream[(K, V)],

updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)],

partitioner: Partitioner,

preservePartitioning: Boolean,

initialRDD : Option[RDD[(K, S)]]

) extends DStream[(K, S)](parent.ssc) {

super.persist(StorageLevel.MEMORY_ONLY_SER)

...

　　是基于内存的。

　　所有像StateDStream这样的DStream子类都要覆写compute方法。

　　StateDStream.compute：

override def compute(validTime: Time): Option[RDD[(K, S)]] = {

// Try to get the previous state RDD

getOrCompute(validTime - slideDuration) match {

case Some(prevStateRDD) => { // If previous state RDD exists

// Try to get the parent RDD

parent.getOrCompute(validTime) match {

case Some(parentRDD) => { // If parent RDD exists, then compute as usual

computeUsingPreviousRDD (parentRDD, prevStateRDD)

}

case None => { // If parent RDD does not exist

// Re-apply the update function to the old state RDD

val updateFuncLocal = updateFunc

val finalFunc = (iterator: Iterator[(K, S)]) => {

val i = iterator.map(t => (t._1, Seq[V](), Option(t._2)))

updateFuncLocal(i)

}

val stateRDD = prevStateRDD.mapPartitions(finalFunc, preservePartitioning)

Some(stateRDD)

}

case None => { // If previous session RDD does not exist (first input data)

// Try to get the parent RDD

parent.getOrCompute(validTime) match {

case Some(parentRDD) => { // If parent RDD exists, then compute as usual

initialRDD match {

case None => {

// Define the function for the mapPartition operation on grouped RDD;

// first map the grouped tuple to tuples of required type,

// and then apply the update function

val updateFuncLocal = updateFunc

val finalFunc = (iterator : Iterator[(K, Iterable[V])]) => {

updateFuncLocal (iterator.map (tuple => (tuple._1, tuple._2.toSeq, None)))

}

val groupedRDD = parentRDD.groupByKey (partitioner)

val sessionRDD = groupedRDD.mapPartitions (finalFunc, preservePartitioning)

// logDebug("Generating state RDD for time " + validTime + " (first)")

Some (sessionRDD)

}

case Some (initialStateRDD) => {

computeUsingPreviousRDD(parentRDD, initialStateRDD)

}

case None => { // If parent RDD does not exist, then nothing to do!

// logDebug("Not generating state RDD (no previous state, no parent)")

None

}

　　其中会用到computeUsingPreviousRDD方法。去看看。

　　StateDStream.computeUsingPreviousRDD：

private [this] def computeUsingPreviousRDD (

parentRDD : RDD[(K, V)], prevStateRDD : RDD[(K, S)]) = {

// Define the function for the mapPartition operation on cogrouped RDD;

// first map the cogrouped tuple to tuples of required type,

// and then apply the update function

val updateFuncLocal = updateFunc

val finalFunc = (iterator: Iterator[(K, (Iterable[V], Iterable[S]))]) => {

val i = iterator.map(t => {

val itr = t._2._2.iterator

val headOption = if (itr.hasNext) Some(itr.next()) else None

(t._1, t._2._1.toSeq, headOption)

})

updateFuncLocal(i)

}

val cogroupedRDD = parentRDD.cogroup(prevStateRDD, partitioner)

val stateRDD = cogroupedRDD.mapPartitions(finalFunc, preservePartitioning)

Some(stateRDD)

}

　　由于cogroup会对所有数据进行扫描，再按key进行分组，所以性能上会有问题。特别是随着时间的推移，这样的计算到后面会越算越慢。

　　所以数据量大的计算、复杂的计算，都不建议使用updateStateByKey。

2. mapWithState解密

　　虽然有人使用mapWithState后感觉效果还可以，但源码中仍表明，mapWithState还在试验状态。

　　mapWithState方法有多个。先看第一个。

　　PairDStreamFunctions.mapWithState：

/**

* :: Experimental ::

* Return a [[MapWithStateDStream]] by applying a function to every key-value element of

* `this` stream, while maintaining some state data for each unique key. The mapping function

* and other specification (e.g. partitioners, timeouts, initial state data, etc.) of this

* transformation can be specified using [[StateSpec]] class. The state data is accessible in

* as a parameter of type [[State]] in the mapping function.

* Example of using `mapWithState`:

* {{{

* // A mapping function that maintains an integer state and return a String

* def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {

* // Use state.exists(), state.get(), state.update() and state.remove()

* // to manage state, and return the necessary string

* }

* val spec = StateSpec.function(mappingFunction).numPartitions(10)

* val mapWithStateDStream = keyValueDStream.mapWithState[StateType, MappedType](spec)

* }}}

* @param spec Specification of this transformation

* @tparam StateType Class type of the state data

* @tparam MappedType Class type of the mapped data

@Experimental

def mapWithState[StateType: ClassTag, MappedType: ClassTag](

spec: StateSpec[K, V, StateType, MappedType]

): MapWithStateDStream[K, V, StateType, MappedType] = {

new MapWithStateDStreamImpl[K, V, StateType, MappedType](

self,

spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]

)

}

　　注释中给出了一个

mapWithState

使用实例。先要定义一个mappingFunction。mappingFunction的参数中，State类型的state是历史数据，相当于一个内存数据表；key指明是对state中的哪个键进行操作；value指明键值。

　　StateSpec类型的参数中封装了mapping功能和转换的相应配置（例如：partitioners、超时设定、初始状态数据等）。

mapWithState

返回的是MapWithStateDStream类型。

　　来看看State类。其中的注释有例子参考。

/**

* :: Experimental ::

* Abstract class for getting and updating the state in mapping function used in the `mapWithState`

* operation of a [[org.apache.spark.streaming.dstream.PairDStreamFunctions pair DStream]] (Scala)

* or a [[org.apache.spark.streaming.api.java.JavaPairDStream JavaPairDStream]] (Java).

* Scala example of using `State`:

* {{{

* // A mapping function that maintains an integer state and returns a String

* def mappingFunction(key: String, value: Option[Int], state: State[Int]): Option[String] = {

* // Check if state exists

* if (state.exists) {

* val existingState = state.get // Get the existing state

* val shouldRemove = ... // Decide whether to remove the state

* if (shouldRemove) {

* state.remove() // Remove the state

* } else {

* val newState = ...

* state.update(newState) // Set the new state

* }

* } else {

* val initialState = ...

* state.update(initialState) // Set the initial state

* }

* ... // return something

* }

* }}}

...

sealed abstract class State[S] {

　　State中有exists、get、update、remove、isTimingOut等需要在子类中覆写的方法。

　　State中还有个内部实现类StateImpl：

/** Internal implementation of the [[State]] interface */

private[streaming] class StateImpl[S] extends State[S] {

private var state: S = null.asInstanceOf[S]

private var defined: Boolean = false

private var timingOut: Boolean = false

private var updated: Boolean = false

private var removed: Boolean = false

// ========= Public API =========

override def exists(): Boolean = {

defined

}

...

　　StateImpl有一些状态变量，并且覆写了State中的方法。

　　回去再看

PairDStreamFunctions中的其它mapWithState方法。

@Experimental

def mapWithState[StateType: ClassTag, MappedType: ClassTag](

spec: StateSpec[K, V, StateType, MappedType]

): MapWithStateDStream[K, V, StateType, MappedType] = {

new MapWithStateDStreamImpl[K, V, StateType, MappedType](

self,

spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]

)

}

　　先看看StateSpecImpl。StateSpecImpl是StateSpec类中的case class。

/** Internal implementation of [[org.apache.spark.streaming.StateSpec]] interface. */

private[streaming]

case class StateSpecImpl[K, V, S, T](

function: (Time, K, Option[V], State[S]) => Option[T]) extends StateSpec[K, V, S, T] {

　　其参数是一个函数。

　　StateSpecImpl中的代码片段：

require(function != null)

@volatile private var partitioner: Partitioner = null

@volatile private var initialStateRDD: RDD[(K, S)] = null

@volatile private var timeoutInterval: Duration = null

...

// ================= Private Methods =================

private[streaming] def getFunction(): (Time, K, Option[V], State[S]) => Option[T] = function

private[streaming] def getInitialStateRDD(): Option[RDD[(K, S)]] = Option(initialStateRDD)

private[streaming] def getPartitioner(): Option[Partitioner] = Option(partitioner)

private[streaming] def getTimeoutInterval(): Option[Duration] = Option(timeoutInterval)

　　有一些私有变量，及其变量的获取方法。特别是有一个函数的获取方法。

　　再看看MapWithStateDStream的子类MapWithStateDStreamImpl：

/** Internal implementation of the [[MapWithStateDStream]] */

private[streaming] class MapWithStateDStreamImpl[

KeyType: ClassTag, ValueType: ClassTag, StateType: ClassTag, MappedType: ClassTag](

dataStream: DStream[(KeyType, ValueType)],

spec: StateSpecImpl[KeyType, ValueType, StateType, MappedType])

extends MapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream.context) {

private val internalStream =

new InternalMapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream, spec)

override def slideDuration: Duration = internalStream.slideDuration

override def dependencies: List[DStream[_]] = List(internalStream)

override def compute(validTime: Time): Option[RDD[MappedType]] = {

internalStream.getOrCompute(validTime).map { _.flatMap[MappedType] { _.mappedData } }

}

　　其中生成了一个DStream子类InternalMapWithStateDStream的对象。

　　InternalMapWithStateDStream类：

private[streaming]

class InternalMapWithStateDStream[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](

parent: DStream[(K, V)], spec: StateSpecImpl[K, V, S, E])

extends DStream[MapWithStateRDDRecord[K, S, E]](parent.context) {

persist(StorageLevel.MEMORY_ONLY)

　　InternalMapWithStateDStream.compute：

/** Method that generates a RDD for the given time */

override def compute(validTime: Time): Option[RDD[MapWithStateRDDRecord[K, S, E]]] = {

// Get the previous state or create a new empty state RDD

val prevStateRDD = getOrCompute(validTime - slideDuration) match {

case Some(rdd) =>

if (rdd.partitioner != Some(partitioner)) {

// If the RDD is not partitioned the right way, let us repartition it using the

// partition index as the key. This is to ensure that state RDD is always partitioned

// before creating another state RDD using it

MapWithStateRDD.createFromRDD[K, V, S, E](

rdd.flatMap { _.stateMap.getAll() }, partitioner, validTime)

} else {

rdd

}

case None =>

MapWithStateRDD.createFromPairRDD[K, V, S, E](

spec.getInitialStateRDD().getOrElse(new EmptyRDD[(K, S)](ssc.sparkContext)),

partitioner,

validTime

)

}

// Compute the new state RDD with previous state RDD and partitioned data RDD

// Even if there is no data RDD, use an empty one to create a new state RDD

val dataRDD = parent.getOrCompute(validTime).getOrElse {

context.sparkContext.emptyRDD[(K, V)]

}

val partitionedDataRDD = dataRDD.partitionBy(partitioner)

val timeoutThresholdTime = spec.getTimeoutInterval().map { interval =>

(validTime - interval).milliseconds

}

Some(new MapWithStateRDD(

prevStateRDD, partitionedDataRDD, mappingFunction, validTime, timeoutThresholdTime))

}

　　生成了RDD子类MapWithStateRDD的对象。

　　MapWithStateRDD：

private[streaming] class MapWithStateRDD[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](

private var prevStateRDD: RDD[MapWithStateRDDRecord[K, S, E]],

private var partitionedDataRDD: RDD[(K, V)],

mappingFunction: (Time, K, Option[V], State[S]) => Option[E],

batchTime: Time,

timeoutThresholdTime: Option[Long]

) extends RDD[MapWithStateRDDRecord[K, S, E]](

partitionedDataRDD.sparkContext,

List(

new OneToOneDependency[MapWithStateRDDRecord[K, S, E]](prevStateRDD),

new OneToOneDependency(partitionedDataRDD))

) {

　　每个RDD partition是被一个MapWithStateRDDRecord类型的记录所代表，

　　MapWithStateRDD.compute：

override def compute(

partition: Partition, context: TaskContext): Iterator[MapWithStateRDDRecord[K, S, E]] = {

val stateRDDPartition = partition.asInstanceOf[MapWithStateRDDPartition]

val prevStateRDDIterator = prevStateRDD.iterator(

stateRDDPartition.previousSessionRDDPartition, context)

val dataIterator = partitionedDataRDD.iterator(

stateRDDPartition.partitionedDataRDDPartition, context)

val prevRecord = if (prevStateRDDIterator.hasNext) Some(prevStateRDDIterator.next()) else None

val newRecord = MapWithStateRDDRecord.updateRecordWithData(

prevRecord,

dataIterator,

mappingFunction,

batchTime,

timeoutThresholdTime,

removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled

)

Iterator(newRecord)

}

　　MapWithStateRDDRecord有伴生对象：

private[streaming] object MapWithStateRDDRecord {

def updateRecordWithData[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](

prevRecord: Option[MapWithStateRDDRecord[K, S, E]],

dataIterator: Iterator[(K, V)],

mappingFunction: (Time, K, Option[V], State[S]) => Option[E],

batchTime: Time,

timeoutThresholdTime: Option[Long],

removeTimedoutData: Boolean

): MapWithStateRDDRecord[K, S, E] = {

// Create a new state map by cloning the previous one (if it exists) or by creating an empty one

val newStateMap = prevRecord.map { _.stateMap.copy() }. getOrElse { new EmptyStateMap[K, S]() }

val mappedData = new ArrayBuffer[E]

val wrappedState = new StateImpl[S]()

// Call the mapping function on each record in the data iterator, and accordingly

// update the states touched, and collect the data returned by the mapping function

// 此处是mapWithState性能较好的核心代码之所在。

dataIterator.foreach { case (key, value) =>

wrappedState.wrap(newStateMap.get(key))

val returned = mappingFunction(batchTime, key, Some(value), wrappedState)

if (wrappedState.isRemoved) {

newStateMap.remove(key)

} else if (wrappedState.isUpdated

|| (wrappedState.exists && timeoutThresholdTime.isDefined)) {

newStateMap.put(key, wrappedState.get(), batchTime.milliseconds)

}

mappedData ++= returned

}

// Get the timed out state records, call the mapping function on each and collect the

// data returned

if (removeTimedoutData && timeoutThresholdTime.isDefined) {

newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>

wrappedState.wrapTimingOutState(state)

val returned = mappingFunction(batchTime, key, None, wrappedState)

mappedData ++= returned

newStateMap.remove(key)

}

MapWithStateRDDRecord(newStateMap, mappedData)

}

　　借助了RDD的不变性，同时也借助了可变化特征，完成了高效的处理过程。

　　所以不可变的RDD也可用于处理变化的数据。

阅读全文

0 0