Window Operations


Spark Streaming provides window operations. With a window operation you can apply a transformation to the data that falls within a sliding window, as shown in the figure below:

[Figure: a window sliding over the batches of a source DStream]

As the figure shows, the window slides over the source DStream, and two parameters control this behavior:

1. window length — the duration of the window

2. sliding interval — the interval at which the window slides, i.e. how often the windowed computation is performed

Both durations are measured in units of the batch interval, so each of them must be an integer multiple of the batch interval.

For example:

// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

The window is 30 seconds long and slides every 10 seconds. In plain terms: every 10 seconds, the counts are computed over the data from the preceding 30 seconds.
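For context, here is a minimal end-to-end sketch around that call, doing a windowed word count over a socket stream. The application name, master URL, hostname, port, and the 10-second batch interval are all assumptions chosen for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowedWordCount").setMaster("local[2]")
// Batch interval of 10 seconds; the window (30s) and slide (10s) below are integer multiples of it.
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)   // placeholder source
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))

// Reduce last 30 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
windowedWordCounts.print()

ssc.start()
ssc.awaitTermination()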

The following are some of the commonly used window operators:

window(windowLength, slideInterval)
Return a new DStream which is computed based on windowed batches of the source DStream.

countByWindow(windowLength, slideInterval)
Return a sliding window count of elements in the stream.

reduceByWindow(func, windowLength, slideInterval)
Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel.

reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.

reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation. A sketch of this incremental form follows the list below.

countByValueAndWindow(windowLength, slideInterval, [numTasks])
When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
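Continuing the word-count sketch above, here is a hedged sketch of the incremental reduceByKeyAndWindow form; pairs and ssc come from the earlier sketch, and the checkpoint directory is a placeholder:

// Checkpointing is required for the incremental (invertible) form.
ssc.checkpoint("/tmp/streaming-checkpoint")   // placeholder checkpoint directory

val incrementalWordCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // reduce: add counts for batches entering the window
  (a: Int, b: Int) => a - b,   // "inverse reduce": subtract counts for batches leaving the window
  Seconds(30),                 // window length
  Seconds(10)                  // sliding interval
)
incrementalWordCounts.print()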

