Batch interval, window length and slide duration on Spark Streaming
Source: Internet. Editor: 程序博客网. Time: 2024/06/04 17:43
The relationship between batch interval, window length and slide duration in Spark Streaming
What batch interval means
“Batch interval” is the basic interval at which the system will receive the data in batches. This is the interval set when creating a StreamingContext. For example, if you set the batch interval to 2 seconds, then any input DStream will generate RDDs of received data at 2-second intervals.
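A minimal sketch of where the batch interval is set, assuming the Spark Streaming (DStream) library is on the classpath. The object name, application name and `local[2]` master are placeholders chosen for illustration, not taken from the original post.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object BatchIntervalSketch {
  // App name and master are illustrative placeholders.
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("BatchIntervalSketch").setMaster("local[2]")
    // The second argument is the batch interval: every input DStream created
    // from this context produces one RDD per 2-second batch.
    new StreamingContext(conf, Seconds(2))
  }
}
```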
What window length and slide duration mean
A window operator is defined by two parameters:
- Window length - the length of the window
- Slide duration - the interval at which the window will slide or move forward
It's a bit hard to explain the sliding of a window in words, so slides may be more useful. Take a look at slides 27 - 29 in the attached slides.
How the three are related
Both the window duration and the slide duration must be multiples of the batch interval, as received data is divided into batches of duration “batch interval”. Let's take an example. Suppose we have a batch interval of 2 seconds and we have defined an input stream.
val inputStream = ssc.socketStream(...)
This inputStream will generate RDDs every 2 seconds, each containing the last 2 seconds of data. Now say we define a few window operations on it. The window operation is defined as DStream.window(<window length>, <slide duration>)
Code example
val windowStream1 = inputStream.window(Seconds(4))
val windowStream2 = inputStream.window(Seconds(4), Seconds(2))
val windowStream3 = inputStream.window(Seconds(10), Seconds(4))
val windowStream4 = inputStream.window(Seconds(10), Seconds(10))
val windowStream5 = inputStream.window(Seconds(2), Seconds(2)) // same as inputStream
val windowStream6 = inputStream.window(Seconds(11), Seconds(2)) // invalid
val windowStream7 = inputStream.window(Seconds(4), Seconds(1)) // invalid
Explanation
Both windowStream1 and windowStream2 will generate RDDs containing data from the last 4 seconds, and the RDDs will be generated every 2 seconds (if the slide duration is not specified, as in windowStream1, it is assumed to be inputStream’s batch duration = 2 sec). Note that these windows of data overlap. The window RDD at time 10 will contain data from times 6 to 10 (i.e. slightly after 6 to the end of 10), and the window RDD at time 12 will contain data from 8 to 12.
Similarly, windowStream3 will generate RDDs every 4 seconds, each containing data from the last 10 seconds. windowStream4 will generate non-overlapping windows, that is, RDDs every 10 seconds, each containing data from the last 10 seconds. windowStream5 is essentially the same as inputStream.
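The window arithmetic described above can be sketched in plain Scala, without Spark. This is only a model of the timing, not Spark's implementation; `WindowRanges`, `windowRange` and `batchesInWindow` are hypothetical names, and all times are whole seconds.

```scala
// Pure-Scala model of the window timing; no Spark dependency.
// All names here are illustrative and are not part of the Spark API.
object WindowRanges {
  /** Time range covered by the window RDD emitted at time `t`: (t - window, t]. */
  def windowRange(t: Int, windowSec: Int): (Int, Int) = (t - windowSec, t)

  /** End times of the batches that fall inside the window emitted at time `t`. */
  def batchesInWindow(t: Int, windowSec: Int, batchSec: Int): List[Int] =
    ((t - windowSec + batchSec) to t by batchSec).toList
}
```

With a 2-second batch interval and a 4-second window, `WindowRanges.batchesInWindow(10, 4, 2)` and `WindowRanges.batchesInWindow(12, 4, 2)` both include the batch ending at time 10, which is the overlap between consecutive windows of windowStream2.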
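windowStream6 and windowStream7 are invalid because one of the two parameters is not a multiple of the batch interval, that is, 2 seconds: the window length of 11 seconds in windowStream6, and the slide duration of 1 second in windowStream7. This is how the three are related.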
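The multiple-of-batch-interval rule can be written as a small check. This is a plain-Scala stand-in for the rule as stated here, not the validation code Spark itself runs; `WindowCheck` and `isValidWindow` are hypothetical names, and durations are whole seconds.

```scala
// Stand-in for the rule in the text; not Spark's own validation logic.
object WindowCheck {
  /** True when both durations are positive whole multiples of the batch interval. */
  def isValidWindow(batchSec: Int, windowSec: Int, slideSec: Int): Boolean =
    windowSec > 0 && slideSec > 0 &&
      windowSec % batchSec == 0 && slideSec % batchSec == 0
}
```

For the 2-second batch interval used above, `WindowCheck.isValidWindow(2, 10, 4)` is true (windowStream3), while `WindowCheck.isValidWindow(2, 11, 2)` and `WindowCheck.isValidWindow(2, 4, 1)` are false (windowStream6 and windowStream7).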
Hope that helped. Note that I did simplify a few details that are important when you want to define window operations over windowed streams. I am ignoring them for now. Feel free to ask more specific questions.
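Summary
The batch interval is the smallest unit of time by which Spark Streaming divides the source data. When using window, the window length and the slide duration must be integer multiples of the batch interval. The window length determines the span (total amount) of data used in a computation, and the slide duration determines when the computation is triggered.
// Suppose batch interval = Seconds(1)
val windowStream = inputStream.window(Seconds(4), Seconds(2))
With the definition above, starting from second 0, the data in the window is computed once at seconds 2, 4, 6, 8, and so on. At second 2 the computation uses the data from seconds 0-2; at second 4, the data from seconds 0-4; at second 6, the data from seconds 2-6; and so on.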
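The timeline in this summary can be reproduced with a short plain-Scala sketch. It assumes the stream starts at second 0, so a window that would reach before the start is clamped to 0; `WindowTimeline` and `timeline` are hypothetical names, not Spark API.

```scala
// Models when a window fires and which time span it covers.
// Assumes the stream starts at second 0; names are illustrative only.
object WindowTimeline {
  /** (triggerTime, dataStart, dataEnd) for every trigger up to `untilSec`. */
  def timeline(windowSec: Int, slideSec: Int, untilSec: Int): List[(Int, Int, Int)] =
    (slideSec to untilSec by slideSec).toList.map { t =>
      // Before a full window has accumulated, the span starts at 0.
      (t, math.max(0, t - windowSec), t)
    }
}
```

Calling `WindowTimeline.timeline(4, 2, 8)` yields one entry per trigger at seconds 2, 4, 6 and 8, with spans 0-2, 0-4, 2-6 and 4-8.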