Ensuring Zero Data Loss When Upgrading a Spark Streaming Application



Upgrading the Spark Streaming Application Code

When a running Streaming Application needs to be upgraded with new code, there are two ways to do it.
1. Start the upgraded code and run it in parallel with the old code (receiving the same data) until the new application runs stably; the old application can then be shut down.
Note: this only works with data sources that can send the same data to two different destinations (the new and the old application), such as Kafka.
2. Shut the existing context down gracefully (see StreamingContext.stop(...) or JavaStreamingContext.stop(...) for graceful shutdown options).
This guarantees that a batch of received data is fully processed before the application exits, so the upgraded application can be started afterwards and continue from the point where the old one left off.
This only works with data sources that provide source-side buffering (like Kafka and Flume), because while the old application is down and before the upgraded one is up, the data must be buffered rather than lost. Note that recovering from the old checkpoint directory will fail: the checkpoint information contains serialized objects of the application code, which are needed at deserialization time, so modified code leads to errors. In this case you must either clear the pre-upgrade checkpoint directory or configure a new checkpoint directory.

Pros and cons of these two approaches:

First approach

This avoids data loss, but it can produce duplicate data, so the code must be able to handle duplicates. That usually means tagging each record with a label such as a unique sequence number and checking it during processing, which requires an extra data field and adds extra business logic for the check.
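A minimal sketch of that per-record deduplication idea. The Record case class, the socket source, and the field names are assumptions for illustration; the duplicate check here only covers a single micro-batch, so a cross-batch check would additionally consult an external store such as HBase or Redis.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical record type: each event carries a unique sequence id.
case class Record(id: Long, payload: String)

object DedupExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dedup-example")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Placeholder source; in practice this would be the Kafka/Flume stream
    // that both the old and the upgraded application are consuming.
    val raw: DStream[Record] =
      ssc.socketTextStream("localhost", 9999)
         .map(line => Record(line.hashCode.toLong, line))

    // Deduplicate within each micro-batch by the unique id.
    val deduped = raw.transform { rdd =>
      rdd.map(r => (r.id, r))
         .reduceByKey((first, _) => first) // keep one record per id
         .values
    }

    deduped.foreachRDD(rdd => rdd.foreach(r => println(s"processing ${r.id}")))

    ssc.start()
    ssc.awaitTermination()
  }
}
```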

Second approach

Here the application can be stopped conditionally based on some flag: for example a field in a database, or a marker file on HDFS. A background thread started inside the application checks the flag periodically (say once an hour, or on a custom event) and triggers the conditional shutdown when the flag is set (a sketch of this marker-file pattern is given after the quotes below). Compared with the first approach this is easy to implement and requires few changes to the existing logic. Important: switch the upgraded application to a new checkpoint directory, otherwise you run into serialization and related errors:

Either your application will start but will run the old version of the application (without letting you know!)

Or you will get deserialization errors and your application will not start at all.
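A minimal sketch of the marker-file shutdown described above. The marker path, checkpoint directory, and socket source are placeholders; instead of a separate background thread, the driver's main thread polls via awaitTerminationOrTimeout, which is a common variant of the same idea.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GracefulStopExample {
  // Hypothetical marker location; the operator creates this file to request a shutdown.
  val markerFile = new Path("/tmp/streaming-app/stop-marker")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("graceful-stop-example")
    val ssc  = new StreamingContext(conf, Seconds(10))
    // Use a NEW checkpoint directory for the upgraded code, as discussed above.
    ssc.checkpoint("/checkpoints/my-app-v2")

    // Placeholder pipeline; replace with the real DStream logic.
    ssc.socketTextStream("localhost", 9999).print()

    ssc.start()

    // Poll for the marker instead of blocking forever in awaitTermination().
    val fs = FileSystem.get(new Configuration())
    var stopped = false
    while (!stopped) {
      // Returns true if the context terminated on its own within the timeout.
      stopped = ssc.awaitTerminationOrTimeout(60 * 1000)
      if (!stopped && fs.exists(markerFile)) {
        // Graceful stop: wait until all received data has been processed, then exit.
        ssc.stop(stopSparkContext = true, stopGracefully = true)
        stopped = true
      }
    }
  }
}
```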


Third approach:
Use exactly-once semantics based entirely on ZooKeeper and commit the offsets manually. As with the first approach, the code still needs logic to handle duplicate data. That logic only matters in one situation: the received data has already been processed, but the Kafka offsets have not yet been updated. On the next recovery, the data that was already processed but whose offsets were never committed is consumed again. This approach is suited to Kafka sources.
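A minimal sketch of that manual offset management pattern, assuming the Spark 1.6-era spark-streaming-kafka (Kafka 0.8) direct-stream integration. The broker and topic names are illustrative, and saveOffsetsToZookeeper is a placeholder for your own ZooKeeper write logic (e.g. via Curator or kafka.utils.ZkUtils).

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

object ManualOffsetExample {
  // Placeholder: persist the (topic, partition, untilOffset) triples to ZooKeeper yourself.
  def saveOffsetsToZookeeper(ranges: Array[OffsetRange]): Unit = ???

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("manual-offset-example")
    val ssc  = new StreamingContext(conf, Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics      = Set("events") // illustrative topic name

    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    var offsetRanges = Array.empty[OffsetRange]

    stream.transform { rdd =>
      // Capture this batch's offset ranges before any shuffle breaks the
      // 1:1 mapping between RDD partitions and Kafka partitions.
      offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd
    }.foreachRDD { rdd =>
      // 1) process the batch first ...
      rdd.foreach { case (_, value) => println(value) }
      // 2) ... then commit the offsets. If the driver dies between these two steps,
      //    this batch is reprocessed on restart, which is exactly the duplicate
      //    case described above that downstream logic must tolerate.
      saveOffsetsToZookeeper(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```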

After a lot of research, only the Spark Streaming Kafka integration turns out to support exactly-once semantics well.
If the source is Flume, there is still no good way to guarantee exactly-once semantics for data that has been received but not yet processed at the moment the code is upgraded.

def stop(stopSparkContext: Boolean, stopGracefully: Boolean): Unit
Stop the execution of the streams, with option of ensuring all received data has been processed.

stopSparkContext
if true, stops the associated SparkContext. The underlying SparkContext will be stopped regardless of whether this StreamingContext has been started.

stopGracefully
if true, stops gracefully by waiting for the processing of all received data to be completed
def stop(stopSparkContext: Boolean = ...): Unit
Stop the execution of the streams immediately (does not wait for all received data to be processed).


See: http://spark.apache.org/docs/1.6.0/streaming-programming-guide.html#level-of-parallelism-in-data-receiving

Upgrading Application Code

If a running Spark Streaming application needs to be upgraded with new application code, then there are two possible mechanisms.

  • The upgraded Spark Streaming application is started and run in parallel to the existing application. Once the new one (receiving the same data as the old one) has been warmed up and is ready for prime time, the old one can be brought down. Note that this can be done for data sources that support sending the data to two destinations (i.e., the earlier and upgraded applications).

  • The existing application is shutdown gracefully (see StreamingContext.stop(...) or JavaStreamingContext.stop(...) for graceful shutdown options) which ensure data that has been received is completely processed before shutdown. Then the upgraded application can be started, which will start processing from the same point where the earlier application left off. Note that this can be done only with input sources that support source-side buffering (like Kafka, and Flume) as data needs to be buffered while the previous application was down and the upgraded application is not yet up. And restarting from earlier checkpoint information of pre-upgrade code cannot be done. The checkpoint information essentially contains serialized Scala/Java/Python objects and trying to deserialize objects with new, modified classes may lead to errors. In this case, either start the upgraded app with a different checkpoint directory, or delete the previous checkpoint directory.

