Spark Streaming Introduction, DStream, and DStream Operations (from learning materials)
I. Introduction to Spark Streaming
1. Spark Streaming Overview
1.1. What is Spark Streaming
Spark Streaming is similar to Apache Storm and is used for processing streaming data. According to the official documentation, Spark Streaming provides high throughput and strong fault tolerance. It supports many input sources, such as Kafka, Flume, Twitter, ZeroMQ, and plain TCP sockets. Once data has been ingested, it can be processed with Spark's high-level primitives such as map, reduce, join, and window, and the results can be stored in many places, such as HDFS or databases. Spark Streaming also integrates seamlessly with MLlib (machine learning) and GraphX.
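To make this concrete, below is a minimal word-count sketch over a TCP socket; the host, port, batch interval, and application name are arbitrary choices for illustration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object NetworkWordCount {
  def main(args: Array[String]): Unit = {
    // Local StreamingContext with two worker threads and a 5-second batch interval
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc  = new StreamingContext(conf, Seconds(5))

    // DStream of lines received from a TCP source (placeholder host/port)
    val lines = ssc.socketTextStream("localhost", 9999)

    // Word count using DStream primitives
    val wordCounts = lines.flatMap(_.split(" "))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)

    wordCounts.print()      // output operation: triggers the computation each batch

    ssc.start()             // start receiving and processing data
    ssc.awaitTermination()  // wait for the streaming job to be stopped
  }
}
```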
1.2. Why Learn Spark Streaming
1. Ease of use
2. Fault tolerance
3. Easy integration into the Spark ecosystem
1.3. Spark vs. Storm
Spark: development language Scala; programming model DStream
Storm: development language Clojure; programming model Spout/Bolt
II. DStream
1. What is a DStream
A Discretized Stream (DStream) is the basic abstraction of Spark Streaming. It represents a continuous stream of data: either the input data stream or the result stream produced by applying Spark primitives to it. Internally, a DStream is represented as a sequence of consecutive RDDs, where each RDD holds the data of one time interval.
Operations on a DStream are applied RDD by RDD.
The actual computation is carried out by the Spark engine.
2. DStream Operations
The primitives on a DStream are similar to those of RDDs and fall into two categories: Transformations and Output Operations. Among the transformations there are also some special primitives, such as updateStateByKey(), transform(), and the various window-related primitives.
2.1. Transformations on DStreams
map(func): Return a new DStream by passing each element of the source DStream through a function func.
flatMap(func): Similar to map, but each input item can be mapped to 0 or more output items.
filter(func): Return a new DStream by selecting only the records of the source DStream on which func returns true.
repartition(numPartitions): Changes the level of parallelism in this DStream by creating more or fewer partitions.
union(otherStream): Return a new DStream that contains the union of the elements in the source DStream and otherDStream.
count(): Return a new DStream of single-element RDDs by counting the number of elements in each RDD of the source DStream.
reduce(func): Return a new DStream of single-element RDDs by aggregating the elements in each RDD of the source DStream using a function func (which takes two arguments and returns one). The function should be associative so that it can be computed in parallel.
countByValue(): When called on a DStream of elements of type K, return a new DStream of (K, Long) pairs where the value of each key is its frequency in each RDD of the source DStream.
reduceByKey(func, [numTasks]): When called on a DStream of (K, V) pairs, return a new DStream of (K, V) pairs where the values for each key are aggregated using the given reduce function. Note: By default, this uses Spark's default number of parallel tasks (2 for local mode, and in cluster mode the number is determined by the config property spark.default.parallelism) to do the grouping. You can pass an optional numTasks argument to set a different number of tasks.
join(otherStream, [numTasks]): When called on two DStreams of (K, V) and (K, W) pairs, return a new DStream of (K, (V, W)) pairs with all pairs of elements for each key.
cogroup(otherStream, [numTasks]): When called on a DStream of (K, V) and (K, W) pairs, return a new DStream of (K, Seq[V], Seq[W]) tuples.
transform(func): Return a new DStream by applying a RDD-to-RDD function to every RDD of the source DStream. This can be used to do arbitrary RDD operations on the DStream.
updateStateByKey(func): Return a new "state" DStream where the state for each key is updated by applying the given function on the previous state of the key and the new values for the key. This can be used to maintain arbitrary state data for each key.
Special Transformations
1. UpdateStateByKey Operation
The updateStateByKey primitive is used to maintain state across batches, i.e., to keep a running history. A stateful Word Count relies on this feature: without updateStateByKey, each batch of data is analyzed, its result is output, and nothing is retained for the next batch.
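A minimal sketch of a running (stateful) word count using updateStateByKey; the socket source and checkpoint directory are placeholders (checkpointing is required because the state must be stored fault-tolerantly):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("StatefulWordCount")
val ssc  = new StreamingContext(conf, Seconds(5))
ssc.checkpoint("./checkpoint")   // placeholder checkpoint directory

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// For each key, fold this batch's new values into the previously stored count
val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(newValues.sum + runningCount.getOrElse(0))

val runningCounts = pairs.updateStateByKey[Int](updateFunc)
runningCounts.print()

ssc.start()
ssc.awaitTermination()
```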
2. Transform Operation
The transform primitive allows arbitrary RDD-to-RDD functions to be applied to a DStream. It is a convenient way to extend the Spark API on streams, and it is also the mechanism through which Spark Streaming can be combined with MLlib (machine learning) and GraphX.
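A sketch of transform(): each batch is exposed as an ordinary RDD, so arbitrary RDD code can run on it, for example a join against a static (non-streaming) RDD. The blacklist contents and the "user message" line format are invented for illustration:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("TransformDemo")
val ssc  = new StreamingContext(conf, Seconds(5))

// Static (non-streaming) RDD of blacklisted users; contents are made up
val blacklist = ssc.sparkContext.parallelize(Seq(("spamUser", true)))

// Hypothetical stream of "user message" lines, keyed by the first word (the user)
val events = ssc.socketTextStream("localhost", 9999)
  .map(line => (line.split(" ")(0), line))

// transform() applies an RDD-to-RDD function to every batch
val cleaned = events.transform { rdd =>
  rdd.leftOuterJoin(blacklist)                          // join against the static RDD
     .filter { case (_, (_, flag)) => flag.isEmpty }    // drop events from blacklisted users
     .map { case (_, (line, _)) => line }
}

cleaned.print()
ssc.start()
ssc.awaitTermination()
```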
3. Window Operations
Window operations are somewhat similar to state in Storm. By setting the window length and the slide interval, you can continuously obtain the state of the stream over the most recent window of time.
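A sketch of a windowed word count using reduceByKeyAndWindow; the 30-second window length and 10-second slide interval are arbitrary, but both must be multiples of the batch interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("WindowDemo")
val ssc  = new StreamingContext(conf, Seconds(5))

val pairs = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split(" "))
  .map(word => (word, 1))

// Count words over the last 30 seconds of data, recomputed every 10 seconds
val windowedCounts =
  pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

windowedCounts.print()
ssc.start()
ssc.awaitTermination()
```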
2.2. Output Operations on DStreams
Output operations write the data of a DStream to an external system such as a database or a file system. Only when an output operation is invoked (the streaming counterpart of an RDD action) does the streaming program actually start the real computation.
print(): Prints the first ten elements of every batch of data in a DStream on the driver node running the streaming application. This is useful for development and debugging.
saveAsTextFiles(prefix, [suffix]): Save this DStream's contents as text files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsObjectFiles(prefix, [suffix]): Save this DStream's contents as SequenceFiles of serialized Java objects. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
saveAsHadoopFiles(prefix, [suffix]): Save this DStream's contents as Hadoop files. The file name at each batch interval is generated based on prefix and suffix: "prefix-TIME_IN_MS[.suffix]".
foreachRDD(func): The most generic output operator that applies a function, func, to each RDD generated from the stream. This function should push the data in each RDD to an external system, such as saving the RDD to files, or writing it over the network to a database. Note that the function func is executed in the driver process running the streaming application, and will usually have RDD actions in it that will force the computation of the streaming RDDs.
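A sketch of the usual foreachRDD pattern, assuming the wordCounts DStream from the word-count sketch earlier: the connection is created per partition on the executors rather than on the driver, so it does not have to be serialized. The raw TCP socket and port stand in for a real database or message-queue client:

```scala
wordCounts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Opened on the executor, once per partition (placeholder host/port)
    val socket = new java.net.Socket("localhost", 3456)
    val writer = new java.io.PrintWriter(socket.getOutputStream, true)
    partition.foreach { case (word, count) => writer.println(s"$word\t$count") }
    writer.close()
    socket.close()
  }
}
```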