Transformations and State Management in Spark Streaming
Source: Internet | Editor: 程序博客网 | Date: 2024/05/01 19:27
1. Transformation Operations
map(func)
Applies the function func to each element of the source DStream and returns a new DStream.
flatMap(func)
Similar to map, except that each input item can be mapped to zero or more output items.
filter(func)
Returns a new DStream containing only the elements of the source DStream for which func returns true.
repartition(numPartitions)
Changes the level of parallelism in the DStream by increasing or decreasing its number of partitions.
union(otherStream)
Returns a new DStream containing the union of the elements of the source DStream and otherDStream.
Useful for merging streams that read from different data sources or different topics.
count()
Counts the elements in each RDD of the source DStream and returns a new DStream of single-element RDDs.
In other words, it reports how many elements each batch of the current DStream contains.
reduce(func)
Aggregates the elements in each RDD of the source DStream using func and returns a new DStream of single-element RDDs.
func must be commutative and associative.
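A plain-Java sketch (not the Spark API) of why func must be commutative and associative: reducing a batch left to right and reducing it as two independent "partitions" must agree, which is what lets Spark compute the reduce in parallel.

```java
import java.util.Arrays;
import java.util.List;

public class ReduceDemo {
    // Left-to-right reduction of a batch: ((0 + v1) + v2) + ...
    static int sequentialSum(List<Integer> batch) {
        return batch.stream().reduce(0, Integer::sum);
    }

    // Simulated parallel reduction: reduce each half independently, then combine
    static int parallelSum(List<Integer> batch) {
        int mid = batch.size() / 2;
        int left = sequentialSum(batch.subList(0, mid));
        int right = sequentialSum(batch.subList(mid, batch.size()));
        return left + right;
    }

    public static void main(String[] args) {
        List<Integer> batch = Arrays.asList(4, 1, 3, 2);
        // Because + is commutative and associative, both strategies agree
        System.out.println(sequentialSum(batch) == parallelSum(batch)); // true
    }
}
```

A function such as subtraction would fail this check, because the result then depends on the order in which partitions are combined.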
countByValue()
When called on a DStream of elements of type K, returns a new DStream of (K, Long) pairs, where the Long value is the number of times each key appears in each RDD of the source DStream.
In effect, it counts the occurrences of each element.
reduceByKey(func, [numTasks])
When called on a DStream of (K, V) pairs, aggregates the values of each key using func and returns a new DStream of (K, V) pairs.
join(otherStream, [numTasks])
When called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, (V, W)) pairs.
cogroup(otherStream, [numTasks])
When called on two DStreams of (K, V) and (K, W) pairs, returns a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func)
Applies an arbitrary RDD-to-RDD function to each RDD of the source DStream and returns a new DStream.
updateStateByKey(func)
Updates the state of each key based on the key's previous state and its new values, returning a new DStream of stateful key-value pairs.
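The shape of the update function can be modeled in plain Java (a sketch of the semantics, not the Spark API): it receives the new values for a key in the current batch plus the key's previous state, and returns the new state. Here the state is a running count.

```java
import java.util.List;
import java.util.Optional;

public class UpdateStateDemo {
    // Update function: new state = previous count + sum of new values for the key.
    // Optional.empty() means the key has no prior state (first time it is seen).
    static Integer updateCount(List<Integer> newValues, Optional<Integer> state) {
        int previous = state.orElse(0);
        int sum = 0;
        for (int v : newValues) {
            sum += v;
        }
        return previous + sum;
    }

    public static void main(String[] args) {
        // Batch 1: key "spark" appears twice, no prior state
        int s1 = updateCount(List.of(1, 1), Optional.empty());
        // Batch 2: key "spark" appears three more times; previous state carries over
        int s2 = updateCount(List.of(1, 1, 1), Optional.of(s1));
        System.out.println(s2); // 5
    }
}
```

Because the state accumulates across batches, Spark requires checkpointing to be enabled when using this operation.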
mapToPair(func)
Maps each element to a key-value pair; building on the word-splitting step, each word is paired with a count of 1.
JavaPairDStream<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
    private static final long serialVersionUID = 1L;
    @Override
    public Tuple2<String, Integer> call(String word) throws Exception {
        return new Tuple2<String, Integer>(word, 1);
    }
});
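What mapToPair followed by reduceByKey computes for one batch can be modeled in plain Java (no Spark), assuming the batch has already been split into words:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountModel {
    // Models mapToPair(word -> (word, 1)) followed by reduceByKey((a, b) -> a + b)
    static Map<String, Integer> countWords(List<String> words) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String w : words) {
            counts.merge(w, 1, Integer::sum); // the "reduce by key" step
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> batch = Arrays.asList("spark", "streaming", "spark");
        System.out.println(countWords(batch)); // {spark=2, streaming=1}
    }
}
```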
Window Operations
window(windowLength, slideInterval)
Return a new DStream which is computed based on windowed batches of the source DStream.
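The windowing semantics can be modeled outside Spark: with a window length of 3 batches and a slide interval of 1 batch, each window is the concatenation of the last 3 batches. A plain-Java sketch:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WindowModel {
    // Collects the elements of the last windowLength batches ending at index end (inclusive)
    static List<Integer> window(List<List<Integer>> batches, int end, int windowLength) {
        List<Integer> out = new ArrayList<>();
        for (int i = Math.max(0, end - windowLength + 1); i <= end; i++) {
            out.addAll(batches.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> batches = Arrays.asList(
            Arrays.asList(1), Arrays.asList(2), Arrays.asList(3), Arrays.asList(4));
        // Window of length 3 ending at batch index 3 covers batches 1..3
        System.out.println(window(batches, 3, 3)); // [2, 3, 4]
    }
}
```

In Spark, windowLength and slideInterval are given as durations rather than batch counts, and both must be multiples of the batch interval.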
countByWindow(windowLength, slideInterval)
Return a sliding window count of elements in the stream.
reduceByWindow(func, windowLength, slideInterval)
Return a new single-element stream, created by aggregating elements in the stream over a sliding interval using func. The function should be associative so that it can be computed correctly in parallel.
reduceByKeyAndWindow(func, windowLength, slideInterval, [numTasks])
When called on a DStream of (K, V) pairs, returns a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function func over batches in a sliding window. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
reduceByKeyAndWindow(func, invFunc, windowLength, slideInterval, [numTasks])
A more efficient version of the above reduceByKeyAndWindow() where the reduce value of each window is calculated incrementally using the reduce values of the previous window. This is done by reducing the new data that enters the sliding window, and "inverse reducing" the old data that leaves the window. An example would be that of "adding" and "subtracting" counts of keys as the window slides. However, it is applicable only to "invertible reduce functions", that is, those reduce functions which have a corresponding "inverse reduce" function (taken as parameter invFunc). Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument. Note that checkpointing must be enabled for using this operation.
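The incremental computation can be sketched in plain Java for a single key, assuming func is + and invFunc is -: the new window value is the previous window value, plus the reduce of the batch entering the window, minus the inverse reduce of the batch leaving it.

```java
public class IncrementalWindow {
    // newWindowSum = previous + entering - leaving; valid because + has the inverse -
    static int slide(int previousWindowSum, int enteringBatchSum, int leavingBatchSum) {
        return previousWindowSum + enteringBatchSum - leavingBatchSum;
    }

    public static void main(String[] args) {
        // Window [3, 5, 2] has sum 10; batch 3 leaves, batch 7 enters -> window [5, 2, 7]
        int next = slide(10, 7, 3);
        System.out.println(next); // 14
    }
}
```

This touches only two batches per slide instead of re-reducing the whole window, which is why it is faster for long windows with short slide intervals.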
countByValueAndWindow(windowLength, slideInterval, [numTasks])
When called on a DStream of (K, V) pairs, returns a new DStream of (K, Long) pairs where the value of each key is its frequency within a sliding window. Like in reduceByKeyAndWindow, the number of reduce tasks is configurable through an optional argument.
Output Operations on DStreams
print()
Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.
Python API This is called pprint() in the Python API.
saveAsTextFiles(prefix, [suffix])
Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(prefix, [suffix])
Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
Python API This is not available in the Python API.
saveAsHadoopFiles(prefix, [suffix])
Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
Python API This is not available in the Python API.
foreachRDD(func)
The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
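A common pitfall with foreachRDD is creating one connection to the external system per record. The recommended pattern is one connection per partition, reused for all records in that partition. The pattern can be modeled in plain Java, where FakeConnection is a hypothetical stand-in for a real database client:

```java
import java.util.Arrays;
import java.util.List;

public class ForeachRddModel {
    // Counts how many connections are opened, to make the pattern observable
    static int connectionsOpened = 0;

    static class FakeConnection {
        FakeConnection() { connectionsOpened++; }
        void send(String record) { /* write the record to the external system */ }
        void close() { }
    }

    // Models rdd.foreachPartition: one connection per partition, shared by its records
    static void writePartitions(List<List<String>> partitions) {
        for (List<String> partition : partitions) {
            FakeConnection conn = new FakeConnection();
            for (String record : partition) {
                conn.send(record);
            }
            conn.close();
        }
    }

    public static void main(String[] args) {
        writePartitions(Arrays.asList(
            Arrays.asList("a", "b"), Arrays.asList("c")));
        System.out.println(connectionsOpened); // 2 connections for 2 partitions, not 3 records
    }
}
```

In real Spark code the connection must be created inside the foreachPartition closure (on the executor), not in the driver, because connection objects are generally not serializable.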