Transformations and State Management in Spark Streaming


1. Transformation Operations

map(func)

Returns a new DStream by applying the function func to each element of the source DStream.

 

flatMap(func)

Similar to map, but each input item can be mapped to zero or more output items.

 

filter(func)

Returns a new DStream containing only the elements of the source DStream for which func returns true.
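A minimal sketch tying map, flatMap, and filter together. It assumes the Spark 2.x Java API and a socket source on localhost:9999 (both are assumptions for illustration, not from the original text); the later sketches in this post reuse jssc, lines, words, and lengths from here.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BasicTransformations {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("BasicTransformations");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // flatMap: one input line becomes zero or more words
        // (in Spark 2.x the FlatMapFunction returns an Iterator)
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        // filter: keep only the elements for which the predicate returns true
        JavaDStream<String> nonEmpty = words.filter(word -> !word.isEmpty());

        // map: each input element maps to exactly one output element
        JavaDStream<Integer> lengths = nonEmpty.map(String::length);

        lengths.print();
        jssc.start();
        jssc.awaitTermination();
    }
}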

 

repartition(numPartitions)

Changes the level of parallelism of the DStream by increasing or decreasing its number of partitions.

 

union(otherStream)

Returns a new DStream that contains the union of the elements in the source DStream and otherDStream.

Useful for merging data read from different sources or different topics.
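For example (a sketch reusing jssc from the setup above; the second port is an arbitrary placeholder):

// Merge two sources, e.g. two sockets or two Kafka topics, into one stream
JavaDStream<String> lines1 = jssc.socketTextStream("localhost", 9999);
JavaDStream<String> lines2 = jssc.socketTextStream("localhost", 9998);
JavaDStream<String> merged = lines1.union(lines2);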

 

count()

Returns a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.

In effect, it counts the number of elements in each batch of the current DStream.

 

reduce(func)

Returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using func.

The function must be commutative and associative so that it can be computed correctly in parallel.
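A sketch, reusing the lengths stream from the first example:

// Sum the word lengths within each batch; (a, b) -> a + b is commutative
// and associative, so partial results can be combined in any order
JavaDStream<Integer> totalLength = lengths.reduce((a, b) -> a + b);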

 

countByValue()

When called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs, where the Long value of each key is its frequency in each RDD of the source DStream.

In effect, it counts how many times each element occurs.

 

reduceByKey(func, [numTasks])

When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs in which the values for each key are aggregated using func.
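A per-batch word count as a sketch, reusing words from the first example (mapToPair itself is covered further down; scala.Tuple2 and JavaPairDStream need to be imported):

// Pair each word with 1, then sum the 1s per key within each batch
JavaPairDStream<String, Integer> pairs = words.mapToPair(word -> new Tuple2<>(word, 1));
JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((a, b) -> a + b);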

 

join(otherStream, [numTasks])

When called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs.
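A sketch with two hypothetical keyed streams, built from the union example's sources:

// Both streams are keyed by the same String key
JavaPairDStream<String, Integer> clicks = lines1.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairDStream<String, Integer> views = lines2.mapToPair(s -> new Tuple2<>(s, 1));

// joined contains (K, (V, W)) for every key present in both batches
JavaPairDStream<String, Tuple2<Integer, Integer>> joined = clicks.join(views);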

 

cogroup(otherStream, [numTasks])

When called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, Seq[V], Seq[W]) tuples.

 

transform(func)

Applies an arbitrary RDD-to-RDD function to each RDD of the source DStream; any RDD operation can be used, and the result is a new DStream.
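A sketch reusing words from the first example:

// Apply an arbitrary RDD operation that the DStream API does not expose
// directly, here de-duplicating each batch with RDD.distinct()
JavaDStream<String> distinctWords = words.transform(rdd -> rdd.distinct());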

 

updateStateByKey(func)

Returns a new "state" DStream in which the state of each key is updated by applying func to the key's previous state and its new values.
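A sketch of a running (cross-batch) word count, reusing pairs from the reduceByKey sketch above. updateStateByKey requires checkpointing; the checkpoint path and the use of Spark 2.x's org.apache.spark.api.java.Optional are assumptions:

jssc.checkpoint("/tmp/streaming-checkpoint"); // state must be checkpointed

JavaPairDStream<String, Integer> runningCounts = pairs.updateStateByKey(
    (List<Integer> newValues, Optional<Integer> state) -> {
        // start from the previous state, or 0 for a key seen for the first time
        Integer sum = state.isPresent() ? state.get() : 0;
        for (Integer v : newValues) { // fold in this batch's values
            sum += v;
        }
        return Optional.of(sum); // becomes the key's new state
    });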

 

mapToPair(func)

Building on the word-splitting step, pairs each word with an initial count of 1:

// additional imports needed: scala.Tuple2, org.apache.spark.api.java.function.PairFunction,
// org.apache.spark.streaming.api.java.JavaPairDStream
JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Tuple2<String, Integer> call(String word) throws Exception {
        // pair each word with an initial count of 1
        return new Tuple2<String, Integer>(word, 1);
    }
});

 

 

Window Operations

window(windowLength, slideInterval)

Return a new DStream which is computed based on windowed batches of the source DStream.
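A sketch reusing words from the first example:

// A 30-second window of data, recomputed every 10 seconds; both durations
// must be multiples of the batch interval (1 second in the setup above)
JavaDStream<String> windowedWords = words.window(Durations.seconds(30), Durations.seconds(10));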

 

countByWindow(windowLength, slideInterval)

Return a sliding window count of elements in the stream.

 

reduceByWindow(func, windowLength, slideInterval)

Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel.

 

reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])

When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.

 

reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])

A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
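A sketch of the incremental form, reusing pairs from the reduceByKey sketch (the checkpoint path is a placeholder):

jssc.checkpoint("/tmp/streaming-checkpoint"); // required for the incremental version

// Maintain a 30-second windowed word count that slides every 10 seconds:
// new batches are folded in with the reduce function, and batches that
// fall out of the window are backed out with the inverse function
JavaPairDStream<String, Integer> windowedCounts = pairs.reduceByKeyAndWindow(
    (a, b) -> a + b, // reduce: add entering data
    (a, b) -> a - b, // inverse reduce: subtract leaving data
    Durations.seconds(30),
    Durations.seconds(10));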

 

countByValueAndWindow(windowLength, slideInterval, [numTasks])

When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.

 

 

Output Operations on DStreams

 

print()

Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.

Python API This is called pprint() in the Python API.

 

saveAsTextFiles(prefix, [suffix])

Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
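In the Java API this method lives on the underlying DStream; a sketch reusing wordCounts from the reduceByKey example (the output path is a placeholder):

// Writes one directory per batch, named e.g. .../counts-<TIME_IN_MS>.txt
wordCounts.dstream().saveAsTextFiles("hdfs://namenode/output/counts", "txt");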

 

saveAsObjectFiles(prefix, [suffix])

Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".

Python API This is not available in the Python API.

 

saveAsHadoopFiles(prefix, [suffix])

Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".

Python API This is not available in the Python API.

 

foreachRDD(func)

The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
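A sketch of the connection pattern this implies, reusing wordCounts from the reduceByKey example. Connection and ConnectionPool are hypothetical helpers standing in for a real client library, not Spark APIs:

wordCounts.foreachRDD(rdd -> {
    rdd.foreachPartition(partitionOfRecords -> {
        // Runs on the executors: acquire one connection per partition,
        // never per record, and never create it in the driver
        Connection connection = ConnectionPool.getConnection(); // hypothetical pool
        while (partitionOfRecords.hasNext()) {
            connection.send(partitionOfRecords.next().toString());
        }
        ConnectionPool.returnConnection(connection); // return for reuse across batches
    });
});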
