Transformations and State Management in Spark Streaming


1. Transformation Operations

map(func)

Returns a new DStream by applying the function func to each element of the source DStream.

 

flatMap(func)

Similar to map, but each input item can be mapped to zero or more output items.

 

filter(func)

Returns a new DStream containing only the elements of the source DStream for which func returns true.
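A minimal sketch tying map, flatMap, and filter together. It assumes the Spark 2.x Java API and a socket source on localhost:9999 (both are assumptions for illustration, not from the original text); the later sketches in this post reuse jssc, lines, words, and lengths from here.

import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class BasicTransformations {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf().setMaster("local[2]").setAppName("BasicTransformations");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(1));

        JavaDStream<String> lines = jssc.socketTextStream("localhost", 9999);

        // flatMap: one input line becomes zero or more words
        // (in Spark 2.x the FlatMapFunction returns an Iterator)
        JavaDStream<String> words = lines.flatMap(line -> Arrays.asList(line.split(" ")).iterator());

        // filter: keep only the elements for which the predicate returns true
        JavaDStream<String> nonEmpty = words.filter(word -> !word.isEmpty());

        // map: each input element maps to exactly one output element
        JavaDStream<Integer> lengths = nonEmpty.map(String::length);

        lengths.print();
        jssc.start();
        jssc.awaitTermination();
    }
}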

 

repartition(numPartitions)

Changes the level of parallelism of the DStream by increasing or decreasing its number of partitions.

 

union(otherStream)

Returns a new DStream that contains the union of the elements in the source DStream and otherDStream.

Useful for merging data read from different sources or different topics.
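For example (a sketch reusing jssc from the setup above; the second port is an arbitrary placeholder):

// Merge two sources, e.g. two sockets or two Kafka topics, into one stream
JavaDStream<String> lines1 = jssc.socketTextStream("localhost", 9999);
JavaDStream<String> lines2 = jssc.socketTextStream("localhost", 9998);
JavaDStream<String> merged = lines1.union(lines2);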

 

count()

Returns a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.

In effect, it counts the number of elements in each batch of the current DStream.

 

reduce(func)

Returns a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using func.

The function must be commutative and associative so that it can be computed correctly in parallel.
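A sketch, reusing the lengths stream from the first example:

// Sum the word lengths within each batch; (a, b) -> a + b is commutative
// and associative, so partial results can be combined in any order
JavaDStream<Integer> totalLength = lengths.reduce((a, b) -> a + b);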

 

countByValue()

When called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs, where the Long value of each key is its frequency in each RDD of the source DStream.

In effect, it counts how many times each element occurs.

 

reduceByKey(func, [numTasks])

When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs in which the values for each key are aggregated using func.
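A per-batch word count as a sketch, reusing words from the first example (mapToPair itself is covered further down; scala.Tuple2 and JavaPairDStream need to be imported):

// Pair each word with 1, then sum the 1s per key within each batch
JavaPairDStream<String, Integer> pairs = words.mapToPair(word -> new Tuple2<>(word, 1));
JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey((a, b) -> a + b);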

 

join(otherStream, [numTasks])

When called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs.
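A sketch with two hypothetical keyed streams, built from the union example's sources:

// Both streams are keyed by the same String key
JavaPairDStream<String, Integer> clicks = lines1.mapToPair(s -> new Tuple2<>(s, 1));
JavaPairDStream<String, Integer> views = lines2.mapToPair(s -> new Tuple2<>(s, 1));

// joined contains (K, (V, W)) for every key present in both batches
JavaPairDStream<String, Tuple2<Integer, Integer>> joined = clicks.join(views);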

 

cogroup(otherStream, [numTasks])

When called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, Seq[V], Seq[W]) tuples.

 

transform(func)

Applies an arbitrary RDD-to-RDD function to each RDD of the source DStream; any RDD operation can be used, and the result is a new DStream.
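A sketch reusing words from the first example:

// Apply an arbitrary RDD operation that the DStream API does not expose
// directly, here de-duplicating each batch with RDD.distinct()
JavaDStream<String> distinctWords = words.transform(rdd -> rdd.distinct());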

 

updateStateByKey(func)

Returns a new "state" DStream in which the state of each key is updated by applying func to the key's previous state and its new values.
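A sketch of a running (cross-batch) word count, reusing pairs from the reduceByKey sketch above. updateStateByKey requires checkpointing; the checkpoint path and the use of Spark 2.x's org.apache.spark.api.java.Optional are assumptions:

jssc.checkpoint("/tmp/streaming-checkpoint"); // state must be checkpointed

JavaPairDStream<String, Integer> runningCounts = pairs.updateStateByKey(
    (List<Integer> newValues, Optional<Integer> state) -> {
        // start from the previous state, or 0 for a key seen for the first time
        Integer sum = state.isPresent() ? state.get() : 0;
        for (Integer v : newValues) { // fold in this batch's values
            sum += v;
        }
        return Optional.of(sum); // becomes the key's new state
    });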

 

mapToPair(func)

Building on the word-splitting step, pairs each word with an initial count of 1:

// additional imports needed: scala.Tuple2, org.apache.spark.api.java.function.PairFunction,
// org.apache.spark.streaming.api.java.JavaPairDStream
JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Tuple2<String, Integer> call(String word) throws Exception {
        // pair each word with an initial count of 1
        return new Tuple2<String, Integer>(word, 1);
    }
});

 

 

Window Operations

window(windowLength, slideInterval)

Return a new DStream which is computed based on windowed batches of the source DStream.
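A sketch reusing words from the first example:

// A 30-second window of data, recomputed every 10 seconds; both durations
// must be multiples of the batch interval (1 second in the setup above)
JavaDStream<String> windowedWords = words.window(Durations.seconds(30), Durations.seconds(10));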

 

countByWindow(windowLength, slideInterval)

Return a sliding window count of elements in the stream.

 

reduceByWindow(func, windowLength, slideInterval)

Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative and commutative so that it can be computed correctly in parallel.

 

reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])

When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.

 

reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])

A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
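A sketch of the incremental form, reusing pairs from the reduceByKey sketch (the checkpoint path is a placeholder):

jssc.checkpoint("/tmp/streaming-checkpoint"); // required for the incremental version

// Maintain a 30-second windowed word count that slides every 10 seconds:
// new batches are folded in with the reduce function, and batches that
// fall out of the window are backed out with the inverse function
JavaPairDStream<String, Integer> windowedCounts = pairs.reduceByKeyAndWindow(
    (a, b) -> a + b, // reduce: add entering data
    (a, b) -> a - b, // inverse reduce: subtract leaving data
    Durations.seconds(30),
    Durations.seconds(10));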

 

countByValueAndWindow(windowLength, slideInterval, [numTasks])

When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.

 

 

Output Operations on DStreams

 

print()

Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.

Python API This is called pprint() in the Python API.

 

saveAsTextFiles(prefix, [suffix])

Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
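In the Java API this method lives on the underlying DStream; a sketch reusing wordCounts from the reduceByKey example (the output path is a placeholder):

// Writes one directory per batch, named e.g. .../counts-<TIME_IN_MS>.txt
wordCounts.dstream().saveAsTextFiles("hdfs://namenode/output/counts", "txt");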

 

saveAsObjectFiles(prefix, [suffix])

Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".

Python API This is not available in the Python API.

 

saveAsHadoopFiles(prefix, [suffix])

Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".

Python API This is not available in the Python API.

 

foreachRDD(func)

The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
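A sketch of the connection pattern this implies, reusing wordCounts from the reduceByKey example. Connection and ConnectionPool are hypothetical helpers standing in for a real client library, not Spark APIs:

wordCounts.foreachRDD(rdd -> {
    rdd.foreachPartition(partitionOfRecords -> {
        // Runs on the executors: acquire one connection per partition,
        // never per record, and never create it in the driver
        Connection connection = ConnectionPool.getConnection(); // hypothetical pool
        while (partitionOfRecords.hasNext()) {
            connection.send(partitionOfRecords.next().toString());
        }
        ConnectionPool.returnConnection(connection); // return for reuse across batches
    });
});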
