Spark Streaming Official Documentation Review Notes - 4


Spark Streaming Memory Tuning 

Memory Tuning

Tuning the memory usage and GC behavior of Spark applications has been discussed in great detail in the Tuning Guide. It is strongly recommended that you read that. In this section, we discuss a few tuning parameters specifically in the context of Spark Streaming applications.


Memory usage and GC behavior are discussed in great detail in the Tuning Guide, and it is strongly recommended to read it. Here we only go over a few tuning parameters specific to Spark Streaming applications.


The amount of cluster memory required by a Spark Streaming application depends heavily on the type of transformations used. For example, if you want to use a window operation on the last 10 minutes of data, then your cluster should have sufficient memory to hold 10 minutes worth of data in memory. Or if you want to use updateStateByKey with a large number of keys, then the necessary memory will be high. On the contrary, if you want to do a simple map-filter-store operation, then the necessary memory will be low.


The amount of cluster memory a Spark Streaming application needs depends on the operators you use. For example, if you run a window operation over the last 10 minutes of data, the cluster must have enough memory to hold those 10 minutes of data. Similarly, using updateStateByKey over a large number of keys requires a lot of memory. By contrast, simple operators such as map or filter need little memory.
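As a rough sketch of the difference (the socket source, host, port, checkpoint path and durations below are placeholders, not part of the official guide), a windowed count over the last 10 minutes forces the cluster to hold roughly 10 minutes of data, while a plain map/filter pipeline does not:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

object MemoryFootprintSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("memory-footprint-sketch")
    val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second batches
    ssc.checkpoint("/tmp/streaming-checkpoint")          // required by window operations

    val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source

    // Heavy: roughly 10 minutes of received data must stay in memory for the window.
    val windowedCounts = lines.countByWindow(Minutes(10), Seconds(10))

    // Light: a simple map-filter pipeline keeps almost nothing in memory.
    val errors = lines.filter(_.contains("ERROR")).map(_.toUpperCase)

    windowedCounts.print()
    errors.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```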






In general, since the data received through receivers is stored with StorageLevel.MEMORY_AND_DISK_SER_2, the data that does not fit in memory will spill over to the disk. This may reduce the performance of the streaming application, and hence it is advised to provide sufficient memory as required by your streaming application. Its best to try and see the memory usage on a small scale and estimate accordingly.


In general, because received data comes in through receivers and is cached at the MEMORY_AND_DISK_SER_2 level, data that does not fit in memory is spilled to disk, which hurts the performance of the streaming job. It is therefore strongly advised to give the streaming job enough memory. The best approach is to experiment at a small scale first, observe the memory usage, and estimate from there.
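For reference, a receiver-based input stream can also be given its storage level explicitly. The sketch below (host and port are placeholders) simply passes the default MEMORY_AND_DISK_SER_2 so the choice is visible in code:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("storage-level-sketch")
val ssc  = new StreamingContext(conf, Seconds(10))

// Passing the receiver default explicitly: serialized in memory, replicated
// to a second node, and spilled to disk when memory runs short.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)
```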


Another aspect of memory tuning is garbage collection. For a streaming application that requires low latency, it is undesirable to have large pauses caused by JVM Garbage Collection.


The other aspect is the garbage collector. A streaming job needs low latency, so long pauses caused by JVM garbage collection hurt timeliness and are unacceptable.


There are a few parameters that can help you tune the memory usage and GC overheads:


The following parameters can help you tune memory usage and garbage collection:


Persistence Level of DStreams: As mentioned earlier in the Data Serialization section, the input data and RDDs are by default persisted as serialized bytes. This reduces both the memory usage and GC overheads, compared to deserialized persistence. Enabling Kryo serialization further reduces serialized sizes and memory usage. Further reduction in memory usage can be achieved with compression (see the Spark configuration spark.rdd.compress), at the cost of CPU time.


Persistence level of DStreams: as mentioned in the Data Serialization section, input data and RDDs are cached as serialized bytes by default, which reduces both memory usage and GC overhead compared with caching deserialized data. Enabling Kryo serialization further reduces the serialized size and the memory usage. Memory can be cut even further by compressing the data (see spark.rdd.compress), at the cost of CPU time.
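A minimal configuration sketch of these knobs (the app name is a placeholder; MyRecord in the comment is hypothetical):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("serialization-tuning-sketch")
  // Kryo shrinks the serialized form of persisted data compared to Java serialization.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optionally compress serialized RDD partitions as well, trading CPU time for memory.
  .set("spark.rdd.compress", "true")
// conf.registerKryoClasses(Array(classOf[MyRecord]))  // register your own classes (MyRecord is hypothetical)
```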
Clearing old data: By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared. Spark Streaming decides when to clear the data based on the transformations that are used. For example, if you are using a window operation of 10 minutes, then Spark Streaming will keep around the last 10 minutes of data, and actively throw away older data. Data can be retained for a longer duration (e.g. interactively querying older data) by setting streamingContext.remember.


Clearing old data: by default, all input data and RDDs cached by DStream transformations are cleared automatically, and when they are cleared is decided by the operators in use. For example, if a window operation works over the last 10 minutes of data, Spark Streaming keeps the last 10 minutes of data and actively discards anything older. Data can be kept for longer by setting streamingContext.remember.
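If older batches need to stay around, e.g. for interactive queries, a hedged sketch of remember() looks like this (assuming ssc is your StreamingContext; 20 minutes is an arbitrary choice):

```scala
import org.apache.spark.streaming.Minutes

// Keep generated RDDs around for at least 20 minutes instead of letting
// Spark Streaming clear them as soon as they are no longer needed.
ssc.remember(Minutes(20))
```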
CMS Garbage Collector: Use of the concurrent mark-and-sweep GC is strongly recommended for keeping GC-related pauses consistently low. Even though concurrent GC is known to reduce the overall processing throughput of the system, its use is still recommended to achieve more consistent batch processing times. Make sure you set the CMS GC on both the driver (using --driver-java-options in spark-submit) and the executors (using Spark configuration spark.executor.extraJavaOptions).


CMS (Concurrent Mark Sweep) garbage collector: using the concurrent mark-and-sweep collector is strongly recommended to keep GC-related pauses consistently low. Although concurrent GC lowers overall throughput, it is still recommended because it gives more consistent batch processing times. Make sure the CMS GC is set on both the driver and the executors.
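One way to apply this, sketched below with a common baseline flag (not the only possible GC tuning), is through Spark configuration for the executors and a spark-submit option for the driver:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("cms-gc-sketch")
  // Executors: enable CMS through Spark configuration.
  .set("spark.executor.extraJavaOptions", "-XX:+UseConcMarkSweepGC")

// Driver: the flag has to be passed when its JVM starts, e.g.
//   spark-submit --driver-java-options "-XX:+UseConcMarkSweepGC" ...
```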




Other tips: To further reduce GC overheads, here are some more tips to try.


Persist RDDs using the OFF_HEAP storage level. See more detail in the Spark Programming Guide.
Use more executors with smaller heap sizes. This will reduce the GC pressure within each JVM heap.
Other tips: to further reduce the impact of GC, the following can also be tried:
Persist RDDs with the OFF_HEAP storage level; see the Spark Programming Guide for more details.
Use more executors with smaller heaps, which reduces the GC pressure inside each JVM heap.


A note on OFF_HEAP:
One solution to this problem is off-heap memory (OFF_HEAP). Off-heap memory means allocating objects outside the JVM heap, in memory managed directly by the operating system rather than by the JVM. The result is that the heap stays small, which reduces the impact of garbage collection on the application.
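A small sketch of persisting a stream off-heap (assuming lines is an existing DStream; how OFF_HEAP is backed depends on the Spark version, as noted in the comments):

```scala
import org.apache.spark.storage.StorageLevel

// Keep the stream's RDDs outside the JVM heap so they do not add to GC pressure.
// (OFF_HEAP behavior depends on the Spark version: older releases needed
// Tachyon/Alluxio, newer ones need spark.memory.offHeap.enabled and
// spark.memory.offHeap.size to be configured.)
lines.persist(StorageLevel.OFF_HEAP)

// Smaller, more numerous executor heaps are requested at submit time, e.g. on YARN:
//   spark-submit --num-executors 8 --executor-memory 2g ...
```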


Important points to remember:


A DStream is associated with a single receiver. For attaining read parallelism multiple receivers i.e. multiple DStreams need to be created. A receiver is run within an executor. It occupies one core. Ensure that there are enough cores for processing after receiver slots are booked i.e. spark.cores.max should take the receiver slots into account. The receivers are allocated to executors in a round robin fashion.


A DStream is associated with a single receiver. To read in parallel, multiple receivers, i.e. multiple DStreams, must be created. Each receiver runs inside an executor and occupies one core, so enough cores must remain for processing once the receiver slots are reserved; the receiver cores are counted inside spark.cores.max. Receivers are assigned to executors in a round-robin fashion.
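A hedged sketch of creating several receivers for read parallelism (assuming ssc is an existing StreamingContext; hosts, ports and the receiver count are placeholders):

```scala
val numReceivers = 3
val streams = (1 to numReceivers).map { i =>
  ssc.socketTextStream("localhost", 9000 + i)   // each stream starts its own receiver and takes one core
}
// spark.cores.max must leave enough cores for processing beyond these 3 receiver slots.
```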
When data is received from a stream source, receiver creates blocks of data. A new block of data is generated every blockInterval milliseconds. N blocks of data are created during the batchInterval where N = batchInterval/blockInterval. These blocks are distributed by the BlockManager of the current executor to the block managers of other executors. After that, the Network Input Tracker running on the driver is informed about the block locations for further processing.


When data is received, the receiver produces a new block every blockInterval milliseconds, so: number of blocks per batch = batchInterval / blockInterval.
The Network Input Tracker running on the driver is informed of the block locations for further processing.
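For example (the batch interval and app name are illustrative; 200ms is the documented default block interval):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("block-interval-sketch")
  .set("spark.streaming.blockInterval", "200ms")   // 200ms is also the default

// Blocks (= partitions of the batch RDD) created per batch: N = batchInterval / blockInterval
val batchIntervalMs = 10000L   // 10-second batches
val blockIntervalMs = 200L
val blocksPerBatch  = batchIntervalMs / blockIntervalMs   // 50
```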
An RDD is created on the driver for the blocks created during the batchInterval. The blocks generated during the batchInterval are partitions of the RDD. Each partition is a task in spark. blockInterval== batchinterval would mean that a single partition is created and probably it is processed locally.


The blocks generated during a batchInterval become the partitions of that batch's RDD, and each partition is a task. If the block interval equals the batch interval, only a single partition is created, and it is probably processed locally.


The map tasks on the blocks are processed in the executors (one that received the block, and another where the block was replicated) that has the blocks irrespective of block interval, unless non-local scheduling kicks in. Having bigger blockinterval means bigger blocks. A high value of spark.locality.wait increases the chance of processing a block on the local node. A balance needs to be found out between these two parameters to ensure that the bigger blocks are processed locally.


spark.locality.wait (3000 ms by default) is how long a task waits for a data-local slot before locality is downgraded.
The tasks that process the blocks run on the executors holding them (the one that received a block and the one holding its replica), regardless of the block interval, unless non-local scheduling kicks in. A larger block interval means larger blocks, and a higher spark.locality.wait raises the chance that a block is processed on its local node; the two parameters have to be balanced so that the larger blocks are still processed locally.
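A small configuration sketch (the 6s value is illustrative, not a recommendation):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("locality-wait-sketch")
  // Default is 3s; waiting longer for a data-local slot makes it more likely
  // that a large block is processed on the node that already holds it.
  .set("spark.locality.wait", "6s")
```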




Instead of relying on batchInterval and blockInterval, you can define the number of partitions by calling inputDstream.repartition(n). This reshuffles the data in RDD randomly to create n number of partitions. Yes, for greater parallelism. Though comes at the cost of a shuffle. An RDD’s processing is scheduled by driver’s jobscheduler as a job. At a given point of time only one job is active. So, if one job is executing the other jobs are queued.


Instead of relying on the batch interval and block interval, you can also repartition explicitly, e.g. by calling inputDstream.repartition(n), which shuffles the data into n partitions. Yes, this gives better parallelism, although at the cost of a shuffle. Processing an RDD is scheduled by the driver's job scheduler as a job, so at any point in time only one job runs and the rest wait in a queue.
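A minimal sketch (assuming lines is an existing input DStream; the partition count is a placeholder):

```scala
// Redistribute each batch RDD into 10 partitions for more parallelism,
// at the cost of a shuffle.
val repartitioned = lines.repartition(10)
```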




If you have two dstreams there will be two RDDs formed and there will be two jobs created which will be scheduled one after the another. To avoid this, you can union two dstreams. This will ensure that a single unionRDD is formed for the two RDDs of the dstreams. This unionRDD is then considered as a single job. However the partitioning of the RDDs is not impacted.


If you have two DStreams, two RDDs are formed and two jobs are created, scheduled one after the other. To avoid this, you can union the two DStreams, which guarantees that a single unionRDD is formed from the two RDDs of the DStreams and is treated as one job. The partitioning of the RDDs is not affected.
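A sketch of unioning two input streams so that each batch produces a single job (assuming ssc is an existing StreamingContext; the sockets are placeholders):

```scala
val s1 = ssc.socketTextStream("localhost", 9001)
val s2 = ssc.socketTextStream("localhost", 9002)

// Either form yields one combined DStream, so each batch is scheduled as a single job.
val unioned = s1.union(s2)
// val unioned = ssc.union(Seq(s1, s2))   // convenient for many streams
```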




If the batch processing time is more than batchinterval then obviously the receiver’s memory will start filling up and will end up in throwing exceptions (most probably BlockNotFoundException). Currently there is no way to pause the receiver. Using SparkConf configuration spark.streaming.receiver.maxRate, rate of receiver can be limited.


If the batch processing time is longer than the batch interval, the receiver's memory will clearly fill up and eventually throw exceptions (most likely BlockNotFoundException). There is currently no way to pause the receiver, but the receive rate can be limited with the SparkConf setting spark.streaming.receiver.maxRate.
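A configuration sketch (the rate is a placeholder; the backpressure setting is an additional option available from Spark 1.5 onward, not mentioned in the passage above):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("rate-limit-sketch")
  // Upper bound on records per second for each receiver (value is illustrative).
  .set("spark.streaming.receiver.maxRate", "10000")
  // Alternatively, on Spark 1.5+, let the rate adapt to the processing speed.
  .set("spark.streaming.backpressure.enabled", "true")
```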
