Spark Streaming Programming Guide


[1]Overview



Spark Streaming receives live input data streams and divides the data into batches (a high-level abstraction called discretized stream, or DStream), which are then processed by the Spark engine to generate the final stream of results in batches.

[2]Basic Concepts

[1]Points to remember

A Spark Streaming application must be allocated more cores than the number of receivers it runs, so that there are cores left over for processing the received data:

·       When running locally, always use "local[n]" as the master URL, where n > number of receivers to run. Otherwise no thread is left for processing the received data (see the sketch after this list).

·       Extending the logic to running on a cluster, the number of cores allocated to the Spark Streaming application must be more than the number of receivers. Otherwise the system will receive data, but not be able to process it.
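A minimal sketch of the local case, assuming a single socket receiver (host and port are placeholders): the master must provide at least two threads, one to run the receiver and one to process the data.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("local[2]")      // n = 2 > 1 receiver
  .setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)  // creates one receiver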

[2]Caching / Persistence

·       Using the persist() method on a DStream will automatically persist every RDD of that DStream in memory (a brief sketch follows this list).

·       For window-based operations like reduceByWindow and reduceByKeyAndWindow, and state-based operations like updateStateByKey, this is implicitly true. Hence, DStreams generated by window-based operations are automatically persisted in memory, without the developer calling persist().

·       Unlike RDDs, the default persistence level of DStreams keeps the data serialized in memory. This is further discussed in the Performance Tuning section.
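A brief sketch, assuming `pairs` is an existing DStream of (word, count) pairs: windowed streams are persisted implicitly, while an ordinary DStream that is reused can be persisted explicitly.

// Windowed DStreams are persisted automatically.
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

// An ordinary DStream can be persisted explicitly if it is reused.
pairs.persist()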

Two kinds of data need to be checkpointed:

·      Metadata checkpointing:

o  Configuration – the configuration used to create the streaming application.

o  DStream operations – the set of DStream operations that define the streaming application.

o  Incomplete batches – batches whose jobs are queued but have not yet completed.

·      Data checkpointing – saving the generated RDDs to reliable storage. This is necessary for stateful transformations that combine data across multiple batches: such transformations depend on the RDDs of previous batches, so the dependency chain keeps growing over time. Periodic checkpointing cuts the dependency chain and bounds the cost of recovering from a failure.

[3]When to enable Checkpointing

·      Usage of stateful transformations – e.g. updateStateByKey or reduceByKeyAndWindow (see the sketch after this list).

·      Recovering from failures of the driver running the application – metadata checkpoints are used to recover with progress information.
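A minimal sketch of the first case, assuming `pairs` is a DStream[(String, Int)] and `checkpointDirectory` is a path on fault-tolerant storage: updateStateByKey keeps a running count per word and therefore requires checkpointing to be enabled.

ssc.checkpoint(checkpointDirectory)   // mandatory before using stateful transformations

val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(runningCount.getOrElse(0) + newValues.sum)

val runningCounts = pairs.updateStateByKey[Int](updateFunc)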

[4]How to configure Checkpointing

 streamingContext.checkpoint(checkpointDirectory)

Additionally, the application must be rewritten to have the following behavior:

·      When the program is being started for the first time, it will create a new StreamingContext, set up all the streams and then call start().

·      When the program is being restarted after failure, it will re-create a StreamingContext from the checkpoint data in the checkpoint directory.

This behavior is made simple by using StreamingContext.getOrCreate:

// Function to create and setup a new StreamingContext

def functionToCreateContext(): StreamingContext = {

  val ssc = new StreamingContext(...)   // new context

  val lines = ssc.socketTextStream(...) // create DStreams

  ...

  ssc.checkpoint(checkpointDirectory)   // set checkpoint directory

  ssc

}

// Get StreamingContext from checkpoint data or create a new one

val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

// Do additional setup on context that needs to be done,

// irrespective of whether it is being started or restarted

context. ...

// Start the context

context.start()

context.awaitTermination()

Full example:

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/RecoverableNetworkWordCount.scala

[5]Interval of checkpointing

Checkpointing RDDs to reliable storage has a cost, so:

1.       The default checkpoint interval is a multiple of the batch interval that is at least 10 seconds; typically, a checkpoint interval of 5 - 10 sliding intervals of a DStream is a good setting to try.

2.       It can be set with dstream.checkpoint(checkpointInterval), as sketched below.
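A small sketch of what that might look like, reusing the `pairs` and `updateFunc` names from the sketch above and assuming a 2-second batch interval: checkpointing every 10 seconds keeps the checkpoint interval at 5 batch intervals.

val ssc = new StreamingContext(conf, Seconds(2))          // 2-second batch interval
ssc.checkpoint(checkpointDirectory)                       // where checkpoint data is written
val stateCounts = pairs.updateStateByKey[Int](updateFunc)
stateCounts.checkpoint(Seconds(10))                       // checkpoint the RDDs every 5 batches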

[6]Accumulators, Broadcast Variables, and Checkpoints

Accumulators and broadcast variables cannot be recovered from a checkpoint. To use them in a recoverable application, create lazily instantiated singleton instances that can be re-initialized manually after the driver restarts from failure:

import org.apache.spark.SparkContext
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time
import org.apache.spark.util.LongAccumulator

object WordBlacklist {

  @volatile private var instance: Broadcast[Seq[String]] = null

  def getInstance(sc: SparkContext): Broadcast[Seq[String]] = {

    if (instance == null) {

      synchronized {

        if (instance == null) {

          val wordBlacklist = Seq("a", "b", "c")

          instance = sc.broadcast(wordBlacklist)

        }

      }

    }

    instance

  }

}

 

object DroppedWordsCounter {

  @volatile private var instance: LongAccumulator = null

 

  def getInstance(sc: SparkContext): LongAccumulator = {

    if (instance == null) {

      synchronized {

        if (instance == null) {

          instance = sc.longAccumulator("WordsInBlacklistCounter")

        }

      }

    }

    instance

  }

}

wordCounts.foreachRDD { (rdd: RDD[(String, Int)], time: Time) =>

  // Get or register the blacklist Broadcast

  val blacklist = WordBlacklist.getInstance(rdd.sparkContext)

  // Get or register the droppedWordsCounter Accumulator

  val droppedWordsCounter = DroppedWordsCounter.getInstance(rdd.sparkContext)

  // Use blacklist to drop words and use droppedWordsCounter to count them

  val counts = rdd.filter { case (word, count) =>

    if (blacklist.value.contains(word)) {

      droppedWordsCounter.add(count)

      false

    } else {

      true

    }

  }.collect().mkString("[", ", ", "]")

  val output = "Counts at time " + time + " " + counts

  println(output)

}

[7]Monitoring Applications

The following two metrics in the web UI are particularly important:

·      Processing Time - the time to process each batch of data.

·      Scheduling Delay - the time a batch waits in a queue for the processing of previous batches to finish.

[3]Performance Tuning

1.    Reducing the processing time ofeach batch of data by efficiently using cluster resources.

2.    Setting the right batch size suchthat the batches of data can be processed as fast as they are received (thatis, data processing keeps up with the data ingestion).

[1]Reducing the Batch Processing Times

[Important]Level of Parallelism in Data Receiving

Each input DStream creates a single receiver. To receive multiple data streams in parallel, create multiple input DStreams and configure them to receive different partitions of the data stream(s). For example, instead of one Kafka input DStream receiving several topics, create separate input DStreams, each receiving one topic, and union them:

val numStreams = 5

val kafkaStreams = (1 to numStreams).map { i => KafkaUtils.createStream(...) }

val unifiedStream = streamingContext.union(kafkaStreams)

unifiedStream.print()

Another important parameter is the receiver's block interval, determined by the configuration parameter spark.streaming.blockInterval.

The number of blocks determines the number of downstream tasks:

number of tasks per receiver per batch = batch interval / block interval

For example, a block interval of 200 ms will create 10 tasks per 2-second batch.

【1】If the number of tasks is too low (that is, less than the number of cores per machine), the cluster resources will be under-utilized. The number of tasks can be increased by reducing the block interval; however, the recommended minimum block interval is about 50 ms, below which task-launching overheads may become a problem (launching tasks at too high a rate, say 50 per second, makes it hard to keep sub-second latencies when sending tasks to the slaves).

【2】Alternatively, repartition the input stream explicitly with inputStream.repartition(<number of partitions>) (see the sketch below).
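A configuration sketch of both options, reusing the `unifiedStream` from the Kafka example above and assuming 2-second batches; the 200 ms block interval yields 2000 / 200 = 10 blocks (and thus 10 map tasks) per receiver per batch, and the partition count 10 is likewise an illustrative assumption.

val conf = new SparkConf()
  .set("spark.streaming.blockInterval", "200ms")   // 10 blocks per 2-second batch per receiver

// Or repartition explicitly, independent of the block interval (incurs a shuffle):
val repartitioned = unifiedStream.repartition(10)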

Level of Parallelism in Data Processing

The relevant configuration parameter for the default level of processing parallelism across the cluster (a brief sketch follows):

spark.default.parallelism
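A sketch of the two usual ways to raise processing parallelism; the value 48 and the `pairs` DStream are assumptions, not recommendations.

val conf = new SparkConf().set("spark.default.parallelism", "48")

// Or pass the number of partitions directly to a shuffle-producing operation:
val counts = pairs.reduceByKey((a: Int, b: Int) => a + b, 48)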

[2]Setting the Right Batch Interval

Monitor in the web UI that: batch processing time < batch interval.

【1】  First, test with a conservative batch interval (say, 5-10 seconds) and a low data rate; whether the system keeps up can be verified from the driver's log4j logs or by using the StreamingListener interface (see the sketch after this list).

【2】  Then optimize: gradually increase the data rate and/or reduce the batch interval while keeping the processing time below the batch interval.
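A minimal sketch of the StreamingListener approach mentioned in 【1】: it logs each batch's processing time and scheduling delay so you can confirm they stay below the batch interval.

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"Batch ${info.batchTime}: processing = ${info.processingDelay.getOrElse(-1L)} ms, " +
      s"scheduling = ${info.schedulingDelay.getOrElse(-1L)} ms")
  }
})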

[3]Memory Tuning

【1】The amount of memory required depends directly on the type of transformations used. For example, a window operation over the last 10 minutes of data, or updateStateByKey with many keys, needs a lot of memory, whereas a simple map-filter-store operation needs very little.

【2】Data received by the receivers is stored with StorageLevel.MEMORY_AND_DISK_SER_2 by default; data that does not fit in memory spills to disk, which degrades performance. So estimate the memory your streaming application needs and plan accordingly.

【3】Reduce GC overhead (a configuration sketch follows this list).
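A hedged configuration sketch for point 【3】; the concurrent collector choice and the unpersist setting follow common Spark Streaming tuning advice, and the exact flags should be adapted to your JVM and workload.

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")  // concurrent GC to keep pauses short
  .set("spark.streaming.unpersist", "true")                           // eagerly unpersist old RDDs (default: true)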

[4]Important points to remember

[1] A DStream is associated with a single receiver. To receive data in parallel, multiple receivers, i.e. multiple DStreams, need to be created.

[2] Each receiver occupies one core on an executor. Account for the receiver slots when requesting spark.cores.max, so that enough cores remain for processing the data.

[3] Data flow: data → receiver → blocks → BlockManager → tasks

N(blocks) = batchInterval / blockInterval

[4] The data of each batchInterval forms one RDD; the blocks generated during that batchInterval become the partitions of that RDD, and each partition is processed by one task. So the blockInterval (relative to the batchInterval) determines the number of tasks, i.e. the degree of parallelism; when blockInterval == batchInterval there is only a single partition.

[5] Instead of relying on batchInterval and blockInterval, the DStream can be repartitioned directly (this starts an extra job to shuffle the data).

[6] Two DStreams produce two lineages of RDDs and therefore two jobs, which are scheduled one after the other. To avoid this, union the two DStreams into a single unionRDD; the unionRDD is then processed as one job, and the partitioning of the RDDs is not affected.

[7] If the batch processing time consistently exceeds the batch interval, the executors' memory will fill up and eventually an exception will be thrown (most probably BlockNotFoundException). There is currently no way to pause the receiver; the ingestion rate can be limited via spark.streaming.receiver.maxRate (see the configuration sketch after this list).

·       The map tasks on the blocks are processed in the executors (one that received the block, and another where the block was replicated) that have the blocks, irrespective of block interval, unless non-local scheduling kicks in. Having a bigger block interval means bigger blocks. A high value of spark.locality.wait increases the chance of processing a block on the local node. A balance needs to be found between these two parameters to ensure that the bigger blocks are processed locally.
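A configuration sketch for point [7] and the locality note above; both values are illustrative assumptions, not recommendations.

val conf = new SparkConf()
  .set("spark.streaming.receiver.maxRate", "10000")  // max records per second per receiver
  .set("spark.locality.wait", "500ms")               // wait longer for block-local task scheduling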

[4]Practice

http://blog.csdn.net/hjw199089/article/category/6905134