Spark Streaming: Checkpoint

Source: Internet. Published: November 2016. Editor: 程序博客网. Time: 2024/06/05 23:02

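Repost: link to the original article
A Streaming application often has to run around the clock (7*24), so it must be able to withstand unexpected failures (for example a machine or system going down, or a JVM crash). To make this possible, Spark Streaming needs to checkpoint enough information to a fault-tolerant storage system so that the application can recover from failures. Spark Streaming checkpoints two types of data.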

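Metadata checkpointing - saves the information that defines the streaming computation to fault-tolerant storage such as HDFS. It is used to recover the driver. The metadata includes:
1. Configuration - all the configuration used to create the Streaming application
2. DStream operations - the set of DStream operations that define the application
3. Incomplete batches - batches whose jobs have been submitted but have not yet run or finished
Data checkpointing - saves generated RDDs to reliable storage. This is needed by some stateful transformations, in which generating an RDD depends on RDDs from previous batches, so the dependency chain keeps growing over time. To stop this unbounded growth, intermediate RDDs are periodically saved to reliable storage, which cuts the dependency chain.
In short, metadata checkpointing is mainly used to recover the driver, while checkpointing of RDD data is required for stateful transformations.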

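When checkpointing needs to be enabled

When should checkpointing be enabled? When either of the following conditions holds:
- Stateful transformations are used - if the application uses stateful operations such as updateStateByKey or reduceByKeyAndWindow (with an inverse function), a checkpoint directory must be provided to allow periodic RDD checkpointing
- The driver should be able to recover from failures
If a Streaming application has no stateful operations and can tolerate losing its progress when the driver dies and is restarted, there is no need to enable checkpointing.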
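As an illustration of the first condition, here is a minimal sketch of a stateful word count. The object name, host, port and checkpoint path are hypothetical and chosen only for the example; the Spark Streaming library is assumed to be on the classpath.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCount {
  // Merge the counts seen in the current batch into the running total for a key.
  def updateCount(newValues: Seq[Int], running: Option[Int]): Option[Int] =
    Some(newValues.sum + running.getOrElse(0))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("StatefulWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Hypothetical path; a production job would point at HDFS, S3 or similar.
    ssc.checkpoint("/tmp/streaming-checkpoint")

    // Hypothetical host and port for the text source.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .updateStateByKey[Int](updateCount _) // stateful: depends on earlier batches

    counts.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Because every batch's state RDD depends on the previous one, the call to ssc.checkpoint is what lets Spark cut the lineage periodically; without a checkpoint directory a job like this is expected to be rejected when the context starts.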

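How to use checkpoint

To enable checkpointing, set a directory in a fault-tolerant, reliable file system (HDFS, S3, etc.) where the checkpoint data will be saved. This is done by calling streamingContext.checkpoint(checkpointDirectory). In addition, if you want the application to recover from driver failures, it must behave as follows:
- When the application is started for the first time, it creates a new StreamingContext instance
- When the application is restarted after a failure, it re-creates the StreamingContext instance from the checkpoint data in the checkpoint directory
StreamingContext.getOrCreate achieves this: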

    // Function to create and setup a new StreamingContext
    def functionToCreateContext(): StreamingContext = {
        val ssc = new StreamingContext(...)   // new context
        val lines = ssc.socketTextStream(...) // create DStreams
        ...
        ssc.checkpoint(checkpointDirectory)   // set checkpoint directory
        ssc
    }

    // Get StreamingContext from checkpoint data or create a new one
    val context = StreamingContext.getOrCreate(checkpointDirectory, functionToCreateContext _)

    // Do additional setup on context that needs to be done,
    // irrespective of whether it is being started or restarted
    context. ...

    // Start the context
    context.start()
    context.awaitTermination()

If checkpointDirectory exists, the context is re-created from the checkpoint data. If the directory does not exist, functionToCreateContext is called to create a new context.

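Besides calling getOrCreate, the cluster's deploy mode must also support restarting the driver after it dies. For example, in YARN cluster mode the driver runs inside the ApplicationMaster; if the ApplicationMaster dies, YARN automatically starts a new ApplicationMaster on another node.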

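Note that as a Streaming application keeps running, the storage space taken up by checkpoint data keeps growing, so the checkpoint interval has to be chosen with care. The smaller the interval, the more often checkpoints are taken and the more space they occupy; the larger the interval, the more data and progress are lost on recovery. The usual recommendation is 5 to 10 times the batch duration.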
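The interval can be set per DStream with DStream.checkpoint. The sketch below assumes a factor of 5, the low end of the range recommended above; the object and helper names are hypothetical.

```scala
import org.apache.spark.streaming.{Duration, Seconds}
import org.apache.spark.streaming.dstream.DStream

object CheckpointInterval {
  // Derive a checkpoint interval as a whole multiple of the batch duration.
  def intervalFor(batchDuration: Duration, factor: Int = 5): Duration =
    batchDuration * factor

  // Apply that interval to a DStream; DStream.checkpoint returns the same stream.
  def withCheckpointInterval[T](stream: DStream[T], batchDuration: Duration): DStream[T] =
    stream.checkpoint(intervalFor(batchDuration))
}
```

With a 10-second batch duration, intervalFor(Seconds(10)) gives a 50-second checkpoint interval.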

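Writing out checkpoint data

As mentioned above, checkpoint data is periodically written out to reliable storage. This raises two questions:
1. When is a checkpoint taken?
2. What form does a checkpoint take?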

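When a checkpoint is taken
In Spark Streaming, JobGenerator generates the jobs for each batch. It has a timer whose period is the batchDuration set when the StreamingContext was initialized. Each time the period elapses, JobGenerator calls generateJobs to generate and submit the jobs, and afterwards calls doCheckpoint to take a checkpoint. doCheckpoint checks whether the difference between the current time and the start time of the streaming application is a multiple of the checkpoint duration, and takes a checkpoint only if it is.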
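The multiple-of check described above can be sketched in plain Scala. This is a simplified model with hypothetical names and millisecond timestamps, not Spark's actual code.

```scala
object CheckpointTiming {
  // True when the time elapsed since the application started is a whole
  // multiple of the checkpoint duration.
  def shouldCheckpoint(batchTimeMs: Long, startTimeMs: Long, checkpointDurationMs: Long): Boolean =
    (batchTimeMs - startTimeMs) % checkpointDurationMs == 0

  def main(args: Array[String]): Unit = {
    val startMs = 0L
    val batchMs = 10000L      // hypothetical 10-second batch duration
    val checkpointMs = 50000L // hypothetical 50-second checkpoint duration

    // Simulate ten timer ticks and keep the ones that trigger a checkpoint.
    val checkpointed = (1 to 10)
      .map(i => startMs + i * batchMs)
      .filter(t => shouldCheckpoint(t, startMs, checkpointMs))

    println(checkpointed.mkString(", ")) // prints "50000, 100000"
  }
}
```

Of the ten batches in this run, only the fifth and the tenth take a checkpoint.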

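The form of a checkpoint
A checkpoint ultimately takes the form of a serialized instance of the Checkpoint class written to external storage. Note that a dedicated thread writes the serialized checkpoint to external storage. The Checkpoint class contains the following data:
[Figure: fields of the Checkpoint class]
Besides the Checkpoint class, there is a CheckpointWriter class for writing checkpoints out and CheckpointReader for reading them back in.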
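To make "a serialized instance written to storage" concrete, here is a minimal round-trip sketch using plain Java serialization. MiniCheckpoint is a hypothetical, much-reduced stand-in for Spark's Checkpoint class, and the byte array stands in for the external file; Spark's real writer does more than this.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in holding only a checkpoint time and the pending batch times.
case class MiniCheckpoint(checkpointTimeMs: Long, pendingTimesMs: List[Long])

object MiniCheckpointIO {
  // Turn the object into bytes, as a writer would before storing them.
  def serialize(cp: MiniCheckpoint): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(cp) finally out.close()
    bytes.toByteArray
  }

  // Rebuild the object from bytes, as a reader would on restart.
  def deserialize(data: Array[Byte]): MiniCheckpoint = {
    val in = new ObjectInputStream(new ByteArrayInputStream(data))
    try in.readObject().asInstanceOf[MiniCheckpoint] finally in.close()
  }
}
```

Deserialization only works while the class definitions on the reading side are compatible with those that produced the bytes.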

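Limitations of checkpoint

The Spark Streaming checkpoint mechanism looks great but has one serious flaw. As noted above, what is finally flushed to external storage is the serialized data of a Checkpoint object. Once the Spark Streaming application has been recompiled, deserializing the old checkpoint data fails, and a new StreamingContext has to be created.

To deal with this, in our application combining Spark Streaming and Kafka we maintain the consumed offsets ourselves. That way, even if the application is recompiled, it can still consume data starting from the required offsets. This is only an example and is not covered in detail here.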
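One possible shape for self-managed offsets is sketched below. It assumes the spark-streaming-kafka-0-10 integration, which may not be what the original author used. The startOffsets map and the saveOffsets callback are hypothetical hooks for whatever external store holds the offsets, and startOffsets is assumed to contain an entry for every partition to be consumed.

```scala
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, HasOffsetRanges, KafkaUtils, LocationStrategies, OffsetRange}

object ManualOffsets {
  // For each partition, the next offset to read is the untilOffset of the batch just processed.
  def nextOffsets(ranges: Array[OffsetRange]): Map[TopicPartition, Long] =
    ranges.map(r => new TopicPartition(r.topic, r.partition) -> r.untilOffset).toMap

  def run(
      ssc: StreamingContext,
      kafkaParams: Map[String, Object],
      startOffsets: Map[TopicPartition, Long],         // hypothetical: loaded from your own store
      saveOffsets: Map[TopicPartition, Long] => Unit   // hypothetical: writes to your own store
  ): Unit = {
    // Start each partition at the offset recorded by the previous run.
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Assign[String, String](startOffsets.keys.toList, kafkaParams, startOffsets))

    stream.foreachRDD { rdd =>
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      println(s"records in batch: ${rdd.count()}") // placeholder for the real processing
      saveOffsets(nextOffsets(ranges))             // record progress only after processing
    }
  }
}
```

Since the offsets live outside the checkpoint, a recompiled application can create a fresh StreamingContext and still resume from where the previous build stopped.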
