Structured Streaming Programming Guide: notes on selected figures (to be organized into a blog post later)

Source: Internet; Published by: 简明python教程豆瓣; Editor: 程序博客网; Time: 2024/05/01 04:43

Handling Late Data and Watermarking section

However, to run this query for days, it’s necessary for the system to bound the amount of intermediate in-memory state it accumulates. This means the system needs to know when an old aggregate can be dropped from the in-memory state because the application is not going to receive late data for that aggregate any more. To enable this, Spark 2.1 introduced watermarking, which lets the engine automatically track the current event time in the data and attempt to clean up old state accordingly. You can define the watermark of a query by specifying the event time column and the threshold on how late the data is expected to be in terms of event time. For a specific window ending at time T, the engine will maintain state and allow late data to update the state until (max event time seen by the engine - late threshold > T). In other words, late data within the threshold will be aggregated, but data later than the threshold will be dropped. Let’s understand this with an example. We can easily define watermarking on the previous example using withWatermark() as shown below.

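In short, a watermark is a time threshold for keeping already computed results: once the threshold has passed, late-arriving data no longer updates the existing state; while still within the threshold, late data can continue to update the previously computed state.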
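The retention rule above is easy to misread, so here is a minimal plain-Scala sketch of it. It does not use Spark at all. The names WatermarkSketch, watermark and keepsState are hypothetical, times are plain minute counts, and T is assumed to be the end of the window. Real Spark only advances the watermark at trigger boundaries, which this sketch ignores.

```scala
// Plain-Scala sketch of the retention rule; it does not use Spark.
// WatermarkSketch, watermark and keepsState are hypothetical names.
object WatermarkSketch {
  // Watermark = max event time seen so far - late threshold (all in minutes).
  def watermark(maxEventTime: Long, lateThreshold: Long): Long =
    maxEventTime - lateThreshold

  // Assumption: t is the END of the window. State is kept, and late data may
  // still update it, until the watermark becomes strictly greater than t.
  def keepsState(t: Long, maxEventTime: Long, lateThreshold: Long): Boolean =
    watermark(maxEventTime, lateThreshold) <= t

  def main(args: Array[String]): Unit = {
    val threshold = 10L // "10 minutes"
    val windowEnd = 10L // window 12:00-12:10, as minutes after 12:00
    for (maxSeen <- Seq(15L, 20L, 21L))
      println(s"max event time $maxSeen -> keep state: ${keepsState(windowEnd, maxSeen, threshold)}")
  }
}
```

With a 10-minute threshold, the state of the 12:00-12:10 window is still kept when the latest event time seen is 12:20 (watermark 12:10) and becomes droppable once an event at 12:21 has been seen (watermark 12:11).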

import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        window($"timestamp", "10 minutes", "5 minutes"),
        $"word")
    .count()

In this example, we are defining the watermark of the query on the value of the column “timestamp”, and also defining “10 minutes” as the threshold of how late the data is allowed to be. If this query is run in Append output mode (discussed later in the Output Modes section), the engine will track the current event time from the column “timestamp” and wait for an additional “10 minutes” in event time before finalizing the windowed counts and adding them to the Result Table. Here is an illustration.
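To make the Append-mode behaviour described above concrete, the following plain-Scala sketch simulates it without Spark. It is a simplified model under stated assumptions, not Spark's implementation: all names (AppendModeSketch, windowsFor, step, run) are hypothetical, times are non-negative minute counts, the watermark is updated after every single event rather than per trigger, and a window counts as final once the watermark is strictly greater than its end.

```scala
// Simplified, Spark-free model of Append mode with a watermark.
// All names are hypothetical; times are minutes (assumed >= 0).
object AppendModeSketch {
  final case class Window(start: Int, end: Int) // covers [start, end)

  // All sliding windows of length `size`, sliding by `slide`, that contain t.
  def windowsFor(t: Int, size: Int, slide: Int): Seq[Window] = {
    val lastStart = t - (t % slide) // latest window start that is <= t
    (lastStart until (t - size) by -slide).map(s => Window(s, s + size))
  }

  final case class State(
    counts: Map[(Window, String), Int] = Map.empty,          // still open
    maxEventTime: Int = 0,
    emitted: Vector[((Window, String), Int)] = Vector.empty) // final rows

  def step(state: State, event: (Int, String),
           size: Int, slide: Int, threshold: Int): State = {
    val (t, word) = event
    val oldWatermark = state.maxEventTime - threshold
    // Late data for windows the watermark has already passed is dropped.
    val open = windowsFor(t, size, slide).filterNot(w => oldWatermark > w.end)
    val counts = open.foldLeft(state.counts) { (m, w) =>
      m.updated((w, word), m.getOrElse((w, word), 0) + 1)
    }
    // Advance the watermark, then move finished windows to the output.
    val maxEventTime = math.max(state.maxEventTime, t)
    val watermark = maxEventTime - threshold
    val (done, pending) = counts.partition { case ((w, _), _) => watermark > w.end }
    val finalRows = done.toVector.sortBy { case ((w, wd), _) => (w.start, wd) }
    State(pending, maxEventTime, state.emitted ++ finalRows)
  }

  def run(events: Seq[(Int, String)],
          size: Int = 10, slide: Int = 5, threshold: Int = 10): State =
    events.foldLeft(State())((s, e) => step(s, e, size, slide, threshold))
}
```

For example, running AppendModeSketch.run(Seq((7, "cat"), (8, "dog"), (14, "cat"), (9, "cat"), (21, "owl"), (4, "dog"))) counts the late event at minute 9 in the window [0, 10), because the watermark is only at minute 4 when it arrives. The window is emitted once the event at minute 21 pushes the watermark to 11, and the event at minute 4 that arrives afterwards is ignored.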

0 0