Recent Evolution of Zero Data Loss Guarantee in Spark Streaming With Kafka


When properly deployed, Spark Streaming 1.2 provides a zero data loss guarantee. To enjoy this mission-critical feature, you need to fulfil the following prerequisites:

  • The input data comes from a reliable source and is received by reliable receivers
  • Application metadata is checkpointed by the application driver
  • Write ahead log is enabled

Let’s briefly describe these prerequisites. In this blog post, we focus on Kafka integration and cover only as much detail as is needed to understand them.

Reliable sources and reliable receivers

For some input sources (including Kafka), Spark Streaming can acknowledge the received data. The input data is first received by receivers and then stored in Spark with replication (by default, data is replicated to two Spark executors for fault-tolerance). Once data is replicated in Spark, the receiver can acknowledge it (e.g. by updating offsets in ZooKeeper in case of consuming from Kafka). This ensures that no input data is lost when the receiver suddenly crashes while receiving input data – the received and not yet replicated data simply won’t be acknowledged, so it will be resent by the source when the receiver eventually recovers.
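To make the receiver-based approach concrete, here is a minimal sketch of creating such a Kafka stream with KafkaUtils.createStream; the application name, ZooKeeper address, consumer group and topic name are placeholders, not values from the original post.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("kafka-receiver-demo")  // hypothetical app name
val ssc = new StreamingContext(sparkConf, Seconds(10))

// Receiver-based stream: a receiver pulls data from Kafka, Spark replicates it
// across executors, and only then are the offsets acknowledged in ZooKeeper.
val kafkaStream = KafkaUtils.createStream(
  ssc,
  "zookeeper-host:2181",               // ZooKeeper quorum (placeholder)
  "my-consumer-group",                 // consumer group id (placeholder)
  Map("my-topic" -> 1),                // topic -> number of receiver threads
  StorageLevel.MEMORY_AND_DISK_SER_2)  // replicated storage level
```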

Metadata checkpointing

The reliable sources and reliable receivers allow us to recover from a receiver failure (or a failure of the executor or the server where the receiver process runs). The much trickier thing, however, is to recover from a driver failure. Several techniques have been introduced to make that possible. One of them is checkpointing the application metadata. With this feature, the driver saves the application-critical metadata to a fault-tolerant storage like HDFS or S3, so that it can be restored later when recovering from a failure. The metadata includes:

  • configuration
  • code
  • a list of queued but not yet completed batches (just the metadata, not data)


Thanks to the metadata checkpointing, the driver is able to reconstruct the application and determine how far the application had progressed before the driver failed.
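A minimal sketch of how a driver typically enables metadata checkpointing and recovers from it, using StreamingContext.getOrCreate; the HDFS checkpoint path and application name are hypothetical.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/my-streaming-app"  // hypothetical path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)  // enables metadata checkpointing to HDFS
  // ... define input streams and the streaming computation here ...
  ssc
}

// On a fresh start the factory function is called; after a driver failure the
// context (configuration, code, queued batches) is rebuilt from the checkpoint.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```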

Possible data loss scenario

Surprisingly or not, reliable sources, reliable receivers and metadata checkpointing are still not enough to prevent potential data loss. Imagine the following bad scenario:

  • Two executors successfully receive input data from the receiver and buffer it in their memory
  • Receiver acknowledges the input source
  • Executors start processing the buffered data according to the application code
  • The driver suddenly fails
  • By design, when the driver fails, all of its executors are killed too
  • Since the executors are killed, all data buffered in their memory gets lost. In consequence, the acknowledged and buffered, but not yet processed data is lost
  • Buffered data can’t be recovered, because it’s stored only in the executors’ memory, so the data is lost forever

It sounds really bad for mission-critical applications, doesn’t it?

Write ahead log

To address the unfortunate scenario described in the previous section, the Write Ahead Log (WAL) was introduced in Spark Streaming 1.2.

With the WAL enabled, all received data is additionally written by the receiver to the checkpoint directory in a fault-tolerant file system such as HDFS or S3. Thanks to that, the data can be re-read from there when the driver recovers, even though the data buffered in executors’ memory has been lost. In this simple way, Spark Streaming provides a mechanism to avoid data loss on driver recovery!
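Enabling the WAL is a matter of configuration, as in the sketch below (the checkpoint path and application name are again hypothetical).

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("my-streaming-app")
  // write every received block to the WAL before acknowledging it
  .set("spark.streaming.receiver.writeAheadLog.enable", "true")

val ssc = new StreamingContext(conf, Seconds(10))
// the WAL files live under the checkpoint directory,
// so a fault-tolerant checkpoint location is required
ssc.checkpoint("hdfs:///checkpoints/my-streaming-app")
```

With the WAL on, a storage level without in-memory replication (e.g. StorageLevel.MEMORY_AND_DISK_SER) is usually sufficient for receiver-based streams, because the log already provides durability.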

At-least-once semantics

Although the WAL ensures zero data loss, it doesn’t guarantee exactly-once semantics for all sources. Imagine the following bad scenario that might happen when integrating Spark Streaming with Kafka:

  • Receiver receives input data and stores it reliably in WAL
  • Receiver fails before updating the Kafka offsets in Zookeeper

  • Spark Streaming assumes that input data has been successfully received (because it’s written to WAL), while Kafka thinks that data hasn’t been yet consumed (because the corresponding offsets haven’t been updated in ZooKeeper)

  • After a moment, the receiver recovers from failure
  • The unprocessed, but saved data is read from the WAL
  • Once all data is read from the WAL, the receiver resumes consumption from Kafka. Because the receiver uses the Kafka High-Level Consumer API, it starts consuming from wherever the offsets in ZooKeeper currently point. Since the offsets weren’t updated before the receiver failed, some data is processed twice

WAL disadvantages

Apart from the scenario described above, WAL has two other non-negligible disadvantages:

  • WAL reduces the throughput of receivers, as the received data must be stored reliably in a distributed file system (the workaround is to start more receivers)
  • For some input sources, it duplicates the same data. For instance, when reading from Kafka, you keep one copy of the data in the Kafka brokers (because other Kafka consumers might want to consume it) and a second copy in Spark Streaming (in the form of a WAL stored in a Hadoop API compatible file system)

Kafka direct API

To address the performance penalty introduced by the WAL and to ensure exactly-once semantics, an experimental feature called the Kafka direct API was implemented in Spark Streaming 1.3.

The idea behind this feature is brilliant. The Spark driver simply calculates which ranges of Kafka offsets define the next computation batch, and then instructs the Spark executors to consume the corresponding data directly from Kafka’s topics and partitions. In other words, this approach treats Kafka like a file system and consumes data from a topic the way it would read a file.

In this simple, but powerful design:

  • Kafka receivers are no longer needed – executors consume data directly from Kafka using the Simple Consumer API
  • the WAL is not used – we can always re-consume the data from Kafka in case of recovery from a failure
  • exactly-once semantics is guaranteed – we don’t read any duplicated data from the WAL

Please note that Kafka direct API is an experimental feature in Spark 1.3.
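Below is a minimal sketch of the direct approach in Spark 1.3, reusing the `ssc` context from the earlier examples; the broker addresses and topic name are placeholders. The driver computes offset ranges per batch and the executors read them straight from the brokers:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")  // placeholders
val topics = Set("my-topic")                                                  // placeholder

// No receiver and no WAL: each batch is defined by a range of Kafka offsets.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

directStream.foreachRDD { rdd =>
  // The exact offset ranges backing this batch are exposed to the application,
  // e.g. for storing them together with the results in a single transaction.
  val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach(r => println(s"${r.topic} [${r.partition}] ${r.fromOffset} -> ${r.untilOffset}"))
}
```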

Summary

This blog post briefly describes how Spark Streaming has continuously added features that together provide a zero data loss guarantee and exactly-once semantics (especially when reading from Kafka). It’s nice to see that each version of Spark Streaming improves the existing design, making it more robust and ready for more and more demanding production use-cases.


Reposted from: Recent Evolution of Zero Data Loss Guarantee in Spark Streaming With Kafka
