Spark Structured Streaming: how GroupState setTimeoutDuration is triggered
Source: Internet | Editor: 程序博客网 | Date: 2024/05/22 13:15
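When programming with Spark Streaming or Structured Streaming, any state you keep usually needs a timeout. On investigation, this timeout is not fired purely by the clock; it also depends on the incoming data, because timeouts are only checked when a batch is triggered. For example, with a timeout of 10s, if no new data arrives on the stream during those 10s, no batch runs and the state remains valid. Once new data does arrive, a batch runs, and groups whose timeout has already elapsed are then called with hasTimedOut set to true. According to the GroupState API documentation, the timeout is also reset every time the function is called for a group, so it has to be set again on each call.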
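The behaviour described above can be illustrated with a toy model in plain Scala. This is only a sketch of the idea and not Spark's implementation: the names `Store`, `Entry` and `runBatch` are invented for illustration, and the only assumption it encodes is that expired state is detected when a batch runs, never in between.

```scala
import scala.collection.mutable

// Toy model (NOT Spark code): per-key state with a processing-time deadline
// that is only checked when a batch runs.
object TimeoutModel {
  final case class Entry(count: Int, timeoutAtMs: Long)

  class Store(timeoutMs: Long) {
    private val states = mutable.Map.empty[String, Entry]

    /** Runs one batch at time `nowMs` and returns the keys that timed out in it. */
    def runBatch(nowMs: Long, input: Seq[String]): Seq[String] = {
      // 1. Keys with new data: update the state and re-arm the timeout.
      input.groupBy(identity).foreach { case (key, values) =>
        val previous = states.get(key).map(_.count).getOrElse(0)
        states(key) = Entry(previous + values.size, nowMs + timeoutMs)
      }
      // 2. Keys without new data whose deadline has passed: time them out.
      val incoming = input.toSet
      val expired = states.iterator.collect {
        case (key, entry) if !incoming(key) && entry.timeoutAtMs < nowMs => key
      }.toList.sorted
      expired.foreach(key => states.remove(key))
      expired
    }

    def contains(key: String): Boolean = states.contains(key)
  }

  def main(args: Array[String]): Unit = {
    val store = new Store(timeoutMs = 10000L)
    println(store.runBatch(0L, Seq("a")))     // "a" gets state, deadline at 10s
    // No data for 60s means no batch, so "a" is still present here.
    println(store.contains("a"))
    println(store.runBatch(60000L, Seq("b"))) // only now is "a" timed out
  }
}
```

In this model the state for "a" outlives its 10s deadline by 50s, simply because nothing ran in between to notice that it had expired.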
First, look at the mapGroupsWithState method:
    def mapGroupsWithState[S: Encoder, U: Encoder](
        timeoutConf: GroupStateTimeout)(
        func: (K, Iterator[V], GroupState[S]) => U): Dataset[U]
There are two kinds of timeoutConf. The first is processing-time based (GroupStateTimeout.ProcessingTimeTimeout), which is similar to StateSpec.timeout in Spark Streaming. The second is event-time based (GroupStateTimeout.EventTimeTimeout), which depends on the event-time timeout set by the user and on the watermark.
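As a complement to the processing-time case, here is a minimal sketch of the event-time variant. The `Event` case class, the `countEvents` and `deadlineMs` helpers, and the 10-second gap and watermark delay are all assumptions made for illustration; the Spark calls used (`withWatermark`, `mapGroupsWithState`, `GroupState.setTimeoutTimestamp`) are the public API.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

object EventTimeTimeoutSketch {
  // Hypothetical input record: one event per user with an event-time column.
  case class Event(user: String, eventTime: Timestamp)

  // Assumed inactivity gap after which a user's state should expire.
  val GapMs: Long = 10000L

  // Deadline = latest event time seen in this call plus the gap.
  def deadlineMs(eventTimesMs: Seq[Long], gapMs: Long): Long =
    eventTimesMs.max + gapMs

  // Keeps a running event count per user and emits (user, count).
  def countEvents(
      user: String,
      events: Iterator[Event],
      state: GroupState[Long]): (String, Long) = {
    if (state.hasTimedOut) {
      // Called without data once the watermark has passed the deadline.
      val finalCount = state.get
      state.remove()
      (user, finalCount)
    } else {
      val batch = events.toSeq
      val count = state.getOption.getOrElse(0L) + batch.size
      state.update(count)
      // The timeout must be set again on every call.
      state.setTimeoutTimestamp(deadlineMs(batch.map(_.eventTime.getTime), GapMs))
      (user, count)
    }
  }

  // Wires the function into a streaming Dataset; a watermark is required
  // for EventTimeTimeout.
  def run(spark: SparkSession, events: Dataset[Event]): Dataset[(String, Long)] = {
    import spark.implicits._
    events
      .withWatermark("eventTime", "10 seconds")
      .groupByKey(_.user)
      .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(countEvents)
  }
}
```

With this configuration the expiry is driven by the watermark rather than the wall clock, so a group times out only after the watermark has advanced past the timestamp it set.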
Now for the examples.
Scala example of using GroupState in mapGroupsWithState:
    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

    // A mapping function that maintains an integer state for string keys and returns a string.
    // Additionally, it sets a timeout to remove the state if it has not received data for an hour.
    def mappingFunction(key: String, value: Iterator[Int], state: GroupState[Int]): String = {

      if (state.hasTimedOut) {                // If called when timing out, remove the state
        state.remove()

      } else if (state.exists) {              // If state exists, use it for processing
        val existingState = state.get         // Get the existing state
        val shouldRemove = ...                // Decide whether to remove the state
        if (shouldRemove) {
          state.remove()                      // Remove the state
        } else {
          val newState = ...
          state.update(newState)              // Set the new state
          state.setTimeoutDuration("1 hour")  // Set the timeout
        }

      } else {
        val initialState = ...
        state.update(initialState)            // Set the initial state
        state.setTimeoutDuration("1 hour")    // Set the timeout
      }
      ...
      // return something
    }

    dataset
      .groupByKey(...)
      .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(mappingFunction)
Java example of using GroupState:
    import java.util.Iterator;
    import org.apache.spark.api.java.function.MapGroupsWithStateFunction;
    import org.apache.spark.sql.Encoders;
    import org.apache.spark.sql.streaming.GroupState;
    import org.apache.spark.sql.streaming.GroupStateTimeout;

    // A mapping function that maintains an integer state for string keys and returns a string.
    // Additionally, it sets a timeout to remove the state if it has not received data for an hour.
    MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
        new MapGroupsWithStateFunction<String, Integer, Integer, String>() {

          @Override
          public String call(String key, Iterator<Integer> value, GroupState<Integer> state) {
            if (state.hasTimedOut()) {            // If called when timing out, remove the state
              state.remove();

            } else if (state.exists()) {          // If state exists, use it for processing
              int existingState = state.get();    // Get the existing state
              boolean shouldRemove = ...;         // Decide whether to remove the state
              if (shouldRemove) {
                state.remove();                   // Remove the state
              } else {
                int newState = ...;
                state.update(newState);           // Set the new state
                state.setTimeoutDuration("1 hour"); // Set the timeout
              }

            } else {
              int initialState = ...;
              state.update(initialState);         // Set the initial state
              state.setTimeoutDuration("1 hour"); // Set the timeout
            }
            ...
            // return something
          }
        };

    dataset
        .groupByKey(...)
        .mapGroupsWithState(
            mappingFunction,
            Encoders.INT(),      // state encoder
            Encoders.STRING(),   // output encoder
            GroupStateTimeout.ProcessingTimeTimeout());
Since: 2.2.0
Reference: https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/streaming/GroupState.html