Structured Streaming Programming Guide — annotated excerpts (to be organized into a full blog post later)
Source: Internet · Editor: 程序博客网 · Date: 2024/05/01 04:43
Handling Late Data and Watermarking section
However, to run this query for days, it's necessary for the system to bound the amount of intermediate in-memory state it accumulates. This means the system needs to know when an old aggregate can be dropped from the in-memory state because the application is not going to receive late data for that aggregate any more. To enable this, in Spark 2.1, we have introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly. You can define the watermark of a query by specifying the event time column and the threshold on how late the data is expected to be in terms of event time. For a specific window starting at time T, the engine will maintain state and allow late data to update the state until (max event time seen by the engine − late threshold > T). In other words, late data within the threshold will be aggregated, but data later than the threshold will be dropped. Let's understand this with an example. We can easily define watermarking on the previous example using withWatermark() as shown below.
In short, a watermark is a time threshold for retaining computed state: once that threshold has passed, late-arriving data can no longer update the previously computed state; data arriving within the threshold can still update the already-computed aggregate.
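The cleanup rule above can be sketched in plain Scala, without Spark. This is only an illustration of the condition "state for a window starting at T survives while (max event time seen − late threshold) ≤ T"; all names are invented for the sketch, event times are integer minutes, and tumbling 10-minute windows are used for simplicity (the Spark example below uses sliding windows).

```scala
// Illustrative sketch of watermark-based state cleanup; not a Spark API.
object WatermarkSketch {
  val threshold = 10                   // late threshold, in minutes
  var maxEventTime = 0                 // max event time seen by the "engine"
  var state = Map.empty[Int, Int]      // window start T -> event count
  var finalized = Map.empty[Int, Int]  // counts whose state was evicted
  var dropped = 0                      // events discarded as too late

  def process(eventTime: Int): Unit = {
    maxEventTime = math.max(maxEventTime, eventTime)
    val t = (eventTime / 10) * 10      // start of this event's 10-minute window
    if (maxEventTime - threshold > t)
      dropped += 1                     // window already finalized: drop the event
    else
      state = state.updated(t, state.getOrElse(t, 0) + 1)
    // evict (finalize) windows the watermark has moved past
    val (keep, done) =
      state.partition { case (start, _) => maxEventTime - threshold <= start }
    state = keep
    finalized = finalized ++ done
  }
}
```

Feeding it the event times 12, 21, 8, 25 shows both behaviors: the event at 8 arrives after the watermark (21 − 10 = 11) has passed its window start 0, so it is dropped, while the event at 25 still updates the live window starting at 20.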
import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(
    window($"timestamp", "10 minutes", "5 minutes"),
    $"word")
  .count()
In this example, we are defining the watermark of the query on the value of the column “timestamp”, and also defining “10 minutes” as the threshold of how late the data is allowed to be. If this query is run in Append output mode (discussed later in the Output Modes section), the engine will track the current event time from the column “timestamp” and wait for an additional “10 minutes” in event time before finalizing the windowed counts and adding them to the Result Table. The guide illustrates this with a diagram (not reproduced here).
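Note that the query groups by window($"timestamp", "10 minutes", "5 minutes"), i.e. 10-minute windows sliding every 5 minutes, so each event belongs to two overlapping windows. A small plain-Scala helper (an illustration I wrote for this note, not a Spark API) shows which window intervals a given event time falls into, using integer minutes and ignoring clipping at time zero:

```scala
// For an event time, list the [start, end) sliding windows containing it.
// size = window length, slide = slide interval, both in minutes.
def windowsFor(eventTime: Int, size: Int = 10, slide: Int = 5): Seq[(Int, Int)] = {
  val lastStart = (eventTime / slide) * slide   // latest window start <= eventTime
  (lastStart to (lastStart - size + slide) by -slide)
    .map(start => (start, start + size))
    .reverse
}
```

For example, an event at minute 12 lands in the windows [5, 15) and [10, 20), so its word is counted in both of those rows of the Result Table.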