Spark Structured Streaming: how the GroupState setTimeoutDuration timeout is triggered


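When you use state in Spark Streaming or Structured Streaming, you usually need to set a timeout on that state. On investigation, this timeout is not triggered by the clock alone; it also depends on whether input is arriving. Suppose the timeout is 10s. If no new stream data arrives at all, no batch is triggered, so the state remains valid even after the 10s have elapsed. The timeout is only evaluated once new data arrives and triggers a batch: groups whose timeout has expired are then invoked with hasTimedOut set to true, and groups that received data within the 10s have their timer restarted when the function sets the timeout again.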
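The behaviour described above can be sketched with a small toy model. This is not Spark code: the class TimeoutModel and its methods runBatch and hasState are hypothetical names made up for illustration, and the model only assumes that expired timeouts are noticed when a batch runs and that a key with data in the batch has its timer restarted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Toy model (NOT Spark code) of processing-time timeouts that are only
 * evaluated when a micro-batch runs. All names here are hypothetical.
 */
class TimeoutModel {
    // key -> processing-time deadline in milliseconds
    private final Map<String, Long> deadlines = new HashMap<>();
    private final long timeoutMs;

    TimeoutModel(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /**
     * Runs one micro-batch at clock time nowMs with data for the given keys.
     * Returns the keys whose state timed out in this batch.
     */
    List<String> runBatch(long nowMs, List<String> keysWithData) {
        List<String> timedOut = new ArrayList<>();
        // Expired deadlines are only noticed here, inside a batch.
        for (Map.Entry<String, Long> e : new HashMap<>(deadlines).entrySet()) {
            if (!keysWithData.contains(e.getKey()) && nowMs >= e.getValue()) {
                timedOut.add(e.getKey());
                deadlines.remove(e.getKey());   // plays the role of state.remove()
            }
        }
        // Keys with data get their timer (re)started, like setTimeoutDuration.
        for (String key : keysWithData) {
            deadlines.put(key, nowMs + timeoutMs);
        }
        return timedOut;
    }

    boolean hasState(String key) {
        return deadlines.containsKey(key);
    }

    public static void main(String[] args) {
        TimeoutModel model = new TimeoutModel(10_000);      // 10s timeout
        model.runBatch(0, Arrays.asList("a"));              // "a" gets state at t=0
        // No batch runs between t=0 and t=25s, so "a" keeps its state.
        System.out.println(model.hasState("a"));            // prints true
        // Data for another key triggers a batch, and only now does "a" time out.
        System.out.println(model.runBatch(25_000, Arrays.asList("b"))); // prints [a]
    }
}
```

In this sketch the state for "a" outlives its 10s timeout simply because nothing ran in between, which is the effect observed with real queries when the stream goes quiet.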


First, look at the mapGroupsWithState method:

    def mapGroupsWithState[S: Encoder, U: Encoder](
        timeoutConf: GroupStateTimeout)(
        func: (K, Iterator[V], GroupState[S]) => U): Dataset[U]

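timeoutConf supports two timeout modes. The first is processing-time based (GroupStateTimeout.ProcessingTimeTimeout), which is similar to StateSpec.timeout in Spark Streaming. The second is event-time based (GroupStateTimeout.EventTimeTimeout), which depends on the user-defined event-time timeout (set with GroupState.setTimeoutTimestamp) and on the watermark.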
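To make the event-time rule concrete, here is a minimal sketch of the comparison involved. It is a toy model, not Spark code: EventTimeTimeoutModel, watermark and hasTimedOut are hypothetical names, and the sketch assumes the simplified rule that the watermark is the largest event time seen minus a delay threshold, and that a state times out once the watermark passes its timeout timestamp.

```java
/** Toy model (NOT Spark code) of the event-time timeout rule. Names are hypothetical. */
class EventTimeTimeoutModel {
    /** Watermark = largest event time seen so far minus the allowed delay. */
    static long watermark(long maxEventTimeMs, long delayThresholdMs) {
        return maxEventTimeMs - delayThresholdMs;
    }

    /** A state times out once the watermark has moved past its timeout timestamp. */
    static boolean hasTimedOut(long watermarkMs, long timeoutTimestampMs) {
        return watermarkMs > timeoutTimestampMs;
    }

    public static void main(String[] args) {
        long timeoutTimestamp = 60_000;  // timestamp the state was asked to expire at
        // Watermark is 55s: not yet past the timeout timestamp.
        System.out.println(hasTimedOut(watermark(65_000, 10_000), timeoutTimestamp)); // prints false
        // Watermark is 65s: past the timeout timestamp.
        System.out.println(hasTimedOut(watermark(75_000, 10_000), timeoutTimestamp)); // prints true
    }
}
```

As with the processing-time mode, the watermark only advances when events arrive, so an event-time timeout also cannot fire while the stream is idle.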


Now look at the examples:

Scala example of using GroupState in mapGroupsWithState:

    // A mapping function that maintains an integer state for string keys and returns a string.
    // Additionally, it sets a timeout to remove the state if it has not received data for an hour.
    def mappingFunction(key: String, value: Iterator[Int], state: GroupState[Int]): String = {
      if (state.hasTimedOut) {                // If called when timing out, remove the state
        state.remove()
      } else if (state.exists) {              // If state exists, use it for processing
        val existingState = state.get         // Get the existing state
        val shouldRemove = ...                // Decide whether to remove the state
        if (shouldRemove) {
          state.remove()                      // Remove the state
        } else {
          val newState = ...
          state.update(newState)              // Set the new state
          state.setTimeoutDuration("1 hour")  // Set the timeout
        }
      } else {
        val initialState = ...
        state.update(initialState)            // Set the initial state
        state.setTimeoutDuration("1 hour")    // Set the timeout
      }
      ...
      // return something
    }

    dataset
      .groupByKey(...)
      .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(mappingFunction)

Java example of using GroupState:

    // A mapping function that maintains an integer state for string keys and returns a string.
    // Additionally, it sets a timeout to remove the state if it has not received data for an hour.
    MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
        new MapGroupsWithStateFunction<String, Integer, Integer, String>() {

          @Override
          public String call(String key, Iterator<Integer> value, GroupState<Integer> state) {
            if (state.hasTimedOut()) {              // If called when timing out, remove the state
              state.remove();
            } else if (state.exists()) {            // If state exists, use it for processing
              int existingState = state.get();      // Get the existing state
              boolean shouldRemove = ...;           // Decide whether to remove the state
              if (shouldRemove) {
                state.remove();                     // Remove the state
              } else {
                int newState = ...;
                state.update(newState);             // Set the new state
                state.setTimeoutDuration("1 hour"); // Set the timeout
              }
            } else {
              int initialState = ...;               // Set the initial state
              state.update(initialState);
              state.setTimeoutDuration("1 hour");   // Set the timeout
            }
            ...
            // return something
          }
        };

    // In Java, the encoders and the timeout configuration are obtained through static method calls.
    dataset
        .groupByKey(...)
        .mapGroupsWithState(
            mappingFunction, Encoders.INT(), Encoders.STRING(),
            GroupStateTimeout.ProcessingTimeTimeout());

Since: 2.2.0


ref: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/streaming/GroupState.html