Batch interval, window length and slide duration on Spark Streaming


The relationship between batch interval, window length, and slide duration in Spark Streaming

What "batch interval" means

"Batch interval" is the basic interval at which the system will receive data in batches. This is the interval set when creating a StreamingContext. For example, if you set the batch interval to 2 seconds, then any input DStream will generate RDDs of received data at 2-second intervals.
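For concreteness, here is a minimal sketch of creating a StreamingContext with a 2-second batch interval (the master URL and app name are placeholders, not from the original post):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WindowDemo")
// Batch interval = 2 seconds: every input DStream created from this
// context will produce one RDD of received data every 2 seconds.
val ssc = new StreamingContext(conf, Seconds(2))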

What "window length" and "slide duration" mean

A window operator is defined by two parameters:
- Window length - the length of the window
- Slide duration - the interval at which the window will slide or move forward
It's a bit hard to explain the sliding of a window in words, so pictures are more useful. Take a look at slides 27 - 29 in the attached slides, or the rough timeline sketched below.
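As a stand-in for those slides, here is a rough timeline using the same numbers as the example below (batch interval 2 s, window length 4 s, slide duration 2 s):

time (s):  0----2----4----6----8----10
batches:   [ B1 ][ B2 ][ B3 ][ B4 ][ B5 ]
window at t=4  -> B1 + B2  (data from 0 to 4)
window at t=6  -> B2 + B3  (data from 2 to 6)
window at t=8  -> B3 + B4  (data from 4 to 8)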

How the three relate

Both the window length and the slide duration must be multiples of the batch interval, since received data is divided into batches of duration "batch interval". Let's take an example. Suppose we have a batch interval of 2 seconds and we have defined an input stream:

val inputStream = ssc.socketTextStream(...)

This inputStream will generate RDDs every 2 seconds, each containing the last 2 seconds of data. Now say we define a few window operations on this. The window operation is defined as DStream.window(<window length>, <slide duration>).

Code examples

val windowStream1 = inputStream.window(Seconds(4))
val windowStream2 = inputStream.window(Seconds(4), Seconds(2))
val windowStream3 = inputStream.window(Seconds(10), Seconds(4))
val windowStream4 = inputStream.window(Seconds(10), Seconds(10))
val windowStream5 = inputStream.window(Seconds(2), Seconds(2))    // same as inputStream
val windowStream6 = inputStream.window(Seconds(11), Seconds(2))   // invalid
val windowStream7 = inputStream.window(Seconds(4), Seconds(1))    // invalid

Explanation

Both windowStream1 and windowStream2 will generate RDDs containing data from the last 4 seconds, and the RDDs will be generated every 2 seconds (if the slide duration is not specified, as in windowStream1, it defaults to the inputStream's batch interval of 2 seconds). Note that these windows of data overlap: the window RDD at time 10 will contain data from times 6 to 10 (i.e. slightly after 6 to the end of 10), and the window RDD at time 12 will contain data from 8 to 12.

Similarly, windowStream3 will generate RDDs every 4 seconds, each containing data from the last 10 seconds. windowStream4 will generate non-overlapping windows, that is, RDDs every 10 seconds, each containing data from the last 10 seconds. windowStream5 is essentially the same as the inputStream.

windowStream6 and windowStream7 are invalid because one of the two parameters (the 11-second window length in windowStream6, the 1-second slide duration in windowStream7) is not a multiple of the 2-second batch interval. This is how the three are related.
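To show a window in a realistic computation, here is a sketch of a windowed word count over the same inputStream, assuming the stream carries lines of text (the word-count logic is illustrative, not part of the original example):

val words = inputStream.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
// Recompute the counts every 4 seconds over the last 10 seconds of data,
// mirroring windowStream3's parameters (both multiples of the 2-second batch interval).
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(10), Seconds(4))
windowedCounts.print()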

Hope that helped. Note that I did simplify a few details that are important when you want to define window operations over windowed streams. I am ignoring them for now. Feel free to ask more specific questions.

Summary

The batch interval is the smallest unit of time by which Spark Streaming divides the source data. When using window operations, the window length and the slide duration must both be integer multiples of the batch interval. The window length determines the span (total amount) of data involved in each computation, while the slide duration determines when a computation is triggered.

// Suppose batch interval = Seconds(1)
val windowStream = inputStream.window(Seconds(4), Seconds(2))

With the setup above, starting from time 0, the data in the window is processed once at seconds 2, 4, 6, 8, and so on. At second 2 the computation uses data from seconds 0-2; at second 4, data from seconds 0-4; at second 6, data from seconds 2-6; and so on.
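Putting the summary together, here is a minimal end-to-end sketch (the host, port, and app name are placeholders; feed it with, e.g., nc -lk 9999):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowDemo")
    val ssc  = new StreamingContext(conf, Seconds(1))          // batch interval = 1 second
    val inputStream  = ssc.socketTextStream("localhost", 9999) // placeholder source
    val windowStream = inputStream.window(Seconds(4), Seconds(2))
    windowStream.count().print() // every 2 s, print how many records arrived in the last 4 s
    ssc.start()
    ssc.awaitTermination()
  }
}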
