Lesson 41: Checkpoint Demystified: How Checkpoint Works and How Its Source Code Implements It

Source: Internet | Published by: 淘宝营销培训 | Editor: 程序博客网 | Time: 2024/05/29 02:38

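Part 1: What exactly is a Checkpoint?

1. In production, Spark jobs often involve a very large number of transformation RDDs (for example, a single Job containing 10,000 RDDs), or a particular transformation produces an RDD that is especially complex and slow to compute (for example, taking more than an hour). In such cases we must consider persisting the computed results.

2. Spark is good at multi-step iteration and at reuse across Jobs. If the data produced by an earlier computation can be reused, efficiency improves greatly.

3. Using persist to keep data in memory is the fastest option but also the least reliable. Keeping it on disk is not fully reliable either: disks can fail, and an administrator may wipe them.

4. Checkpoint exists to persist data more reliably. A checkpoint can be configured to keep data locally with multiple replicas, but in a normal production environment it is written to HDFS, which relies on HDFS's fault tolerance and reliability to persist the data as safely as possible.

5. Checkpoint is an advanced Spark feature for reusing RDD computation results as reliably as possible. Persisting the data to HDFS gives it the highest practical level of safety.

6. Checkpoint targets the stages in an RDD computation chain that most need durable data (stages whose RDD will be reused repeatedly later). From such a stage on, reuse is based on data persisted to HDFS or a similar store, and enabling the checkpoint mechanism on the RDD provides fault tolerance and high availability.

Before an RDD is computed, Spark first checks whether a checkpoint exists. If it does, the RDD does not need to be recomputed.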

The iterator method in RDD.scala:

    final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
      if (storageLevel != StorageLevel.NONE) {
        getOrCompute(split, context)
      } else {
        computeOrReadCheckpoint(split, context)
      }
    }

Stepping into the getOrCompute method of RDD.scala, the source is as follows:

    private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
      val blockId = RDDBlockId(id, partition.index)
      var readCachedBlock = true
      // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
      SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
        readCachedBlock = false
        computeOrReadCheckpoint(partition, context)
      }) match {
        // (the remainder of the method is not shown in this excerpt)

The fourth argument that getOrCompute passes to getOrElseUpdate is an anonymous function. It calls computeOrReadCheckpoint(partition, context), which checks whether the checkpoint already holds the data.

The computeOrReadCheckpoint source in RDD.scala:

    private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
    {
      if (isCheckpointedAndMaterialized) {
        firstParent[T].iterator(split, context)
      } else {
        compute(split, context)
      }
    }

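In computeOrReadCheckpoint, isCheckpointedAndMaterialized is a Boolean that tells whether this RDD has been checkpointed and materialized. In Spark 2.0 a checkpoint can be made in two ways: reliably or locally. isCheckpointedAndMaterialized was introduced as an alias for `isCheckpointed` to make the semantics of the return value clearer.

The isCheckpointedAndMaterialized source:

    private[spark] def isCheckpointedAndMaterialized: Boolean =
      checkpointData.exists(_.isCheckpointed)

Back in computeOrReadCheckpoint in RDD.scala: if the RDD has been checkpointed and materialized (isCheckpointedAndMaterialized is true), it calls iterator on firstParent[T]. Because a completed checkpoint truncates the lineage, that first parent is the CheckpointRDD holding the saved data. If the RDD has not been checkpointed, it runs compute.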
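The lookup order described above (cache first, then checkpoint, then compute) can be illustrated with a small stand-alone model. This is only a sketch: `ToyRDD`, `cached`, `checkpointed`, `storageLevelSet`, and `computeCalls` are hypothetical names invented for illustration and are not Spark classes or fields.

```scala
// A toy model of the read path; nothing here is a real Spark class.
object ReadPathSketch {
  // Counts how many times the expensive compute step actually ran.
  var computeCalls = 0

  class ToyRDD(data: Seq[Int]) {
    var cached: Option[Seq[Int]] = None        // stands in for the BlockManager cache
    var checkpointed: Option[Seq[Int]] = None  // stands in for a materialized checkpoint
    var storageLevelSet = false                // stands in for storageLevel != NONE

    // The expensive path: recompute from the lineage.
    private def compute(): Iterator[Int] = {
      computeCalls += 1
      data.iterator.map(_ * 2)
    }

    // Mirrors computeOrReadCheckpoint: prefer the checkpoint, else compute.
    def computeOrReadCheckpoint(): Iterator[Int] =
      checkpointed match {
        case Some(saved) => saved.iterator
        case None        => compute()
      }

    // Mirrors getOrCompute: serve the cached block, filling it on a miss.
    private def getOrCompute(): Iterator[Int] = {
      if (cached.isEmpty) cached = Some(computeOrReadCheckpoint().toList)
      cached.get.iterator
    }

    // Mirrors iterator: go through the cache only when a storage level is set.
    def iterator(): Iterator[Int] =
      if (storageLevelSet) getOrCompute() else computeOrReadCheckpoint()
  }
}
```

Once `checkpointed` is filled, later calls to `iterator()` return the saved data and leave `computeCalls` unchanged, which is the behavior the source code above aims for.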

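Part 2: How the Checkpoint mechanism works

1. Call SparkContext.setCheckpointDir to specify where RDDs that are checkpointed should store their data. On a production cluster this is a location on HDFS. As the source below shows, the method takes a single directory and creates a uniquely named (UUID) subdirectory under it on every call.

Let us look at SparkContext. It sets the directory under which the checkpoints of RDDs about to be computed are saved. When running on a cluster, the directory must be an HDFS path.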

The setCheckpointDir source in SparkContext.scala:

    def setCheckpointDir(directory: String) {

      // If we are running on a cluster, log a warning if the directory is local.
      // Otherwise, the driver may attempt to reconstruct the checkpointed RDD from
      // its own local file system, which is incorrect because the checkpoint files
      // are actually on the executor machines.
      if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
        logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
          s"must not be on the local filesystem. Directory '$directory' " +
          "appears to be on the local filesystem.")
      }

      checkpointDir = Option(directory).map { dir =>
        val path = new Path(dir, UUID.randomUUID().toString)
        val fs = path.getFileSystem(hadoopConfiguration)
        fs.mkdirs(path)
        fs.getFileStatus(path).getPath.toString
      }
    }
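The last part of setCheckpointDir creates a fresh UUID-named subdirectory under the directory it is given, so two calls with the same base directory never share checkpoint files. A minimal sketch of that idea on the local file system follows. It uses java.nio instead of the Hadoop FileSystem API, and `makeCheckpointDir` is a hypothetical helper, not a Spark method.

```scala
import java.nio.file.{Files, Path}
import java.util.UUID

object CheckpointDirSketch {
  // Creates <base>/<random UUID> and returns its absolute path as a string,
  // loosely following what setCheckpointDir does through a Hadoop FileSystem.
  def makeCheckpointDir(base: Path): String = {
    val dir = base.resolve(UUID.randomUUID().toString)
    Files.createDirectories(dir)
    dir.toAbsolutePath.toString
  }
}
```

Calling `makeCheckpointDir` twice with the same base returns two different directories, which keeps the checkpoint data of separate runs apart.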

The checkpoint method in RDD.scala marks the RDD for checkpointing. The RDD will be saved to a file inside the checkpoint directory set with `SparkContext#setCheckpointDir`, and all references to its parent RDDs will be removed. This function must be called before any job has been executed on the RDD. It is strongly recommended that the RDD be persisted in memory; otherwise saving it to a file requires recomputing it.

The checkpoint source in RDD.scala:

    def checkpoint(): Unit = RDDCheckpointData.synchronized {
      // NOTE: we use a global lock here due to complexities downstream with ensuring
      // children RDD partitions point to the correct parent partitions. In the future
      // we should revisit this consideration.
      if (context.checkpointDir.isEmpty) {
        throw new SparkException("Checkpoint directory has not been set in the SparkContext")
      } else if (checkpointData.isEmpty) {
        checkpointData = Some(new ReliableRDDCheckpointData(this))
      }
    }

Here checkpointData is an RDDCheckpointData:

    private[spark] var checkpointData: Option[RDDCheckpointData[T]] = None

RDDCheckpointData identifies an RDD that is to be checkpointed. When an RDD is marked for checkpointing, the Spark framework creates an RDDCheckpointData for it internally:

    private[spark] abstract class RDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
      extends Serializable {

      import CheckpointState._

      // The checkpoint state of the associated RDD.
      protected var cpState = Initialized

      // The RDD that contains our checkpointed data
      private var cpRDD: Option[CheckpointRDD[T]] = None

      // TODO: are we sure we need to use a global lock in the following methods?

      /**
       * Return whether the checkpoint data for this RDD is already persisted.
       */
      def isCheckpointed: Boolean = RDDCheckpointData.synchronized {
        cpState == Checkpointed
      }

      /**
       * Materialize this RDD and persist its content.
       * This is called immediately after the first action invoked on this RDD has completed.
       */
      final def checkpoint(): Unit = {
        // Guard against multiple threads checkpointing the same RDD by
        // atomically flipping the state of this RDDCheckpointData
        RDDCheckpointData.synchronized {
          if (cpState == Initialized) {
            cpState = CheckpointingInProgress
          } else {
            return
          }
        }

        val newRDD = doCheckpoint()

        // Update our state and truncate the RDD lineage
        RDDCheckpointData.synchronized {
          cpRDD = Some(newRDD)
          cpState = Checkpointed
          rdd.markCheckpointed()
        }
      }

      /**
       * Materialize this RDD and persist its content.
       *
       * Subclasses should override this method to define custom checkpointing behavior.
       * @return the checkpoint RDD created in the process.
       */
      protected def doCheckpoint(): CheckpointRDD[T]

      /**
       * Return the RDD that contains our checkpointed data.
       * This is only defined if the checkpoint state is `Checkpointed`.
       */
      def checkpointRDD: Option[CheckpointRDD[T]] = RDDCheckpointData.synchronized { cpRDD }

      /**
       * Return the partitions of the resulting checkpoint RDD.
       * For tests only.
       */
      def getPartitions: Array[Partition] = RDDCheckpointData.synchronized {
        cpRDD.map(_.partitions).getOrElse { Array.empty }
      }

    }

    /**
     * Global lock for synchronizing checkpoint operations.
     */
    private[spark] object RDDCheckpointData

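2. When an RDD is checkpointed, all the RDDs it depends on are removed from the computation chain.

3. As a best practice, call persist before calling checkpoint so that the RDD's data is kept in memory or on disk. The reason is that checkpoint is lazy: it only takes effect once a Job runs. After that Job completes, Spark walks the lineage from back to front to find the RDDs marked for checkpointing, and then launches a new Job to perform the actual checkpoint for each marked RDD. Without persist, that second Job has to recompute the RDD.

4. Checkpoint changes the RDD's lineage.

5. When we call checkpoint on an RDD, the framework automatically creates an RDDCheckpointData for it. As soon as a Job has run on the RDD, the checkpoint method of RDDCheckpointData is triggered, and it calls doCheckpoint internally. In production this is the doCheckpoint of ReliableRDDCheckpointData, which in turn calls writeRDDToCheckpointDirectory on ReliableCheckpointRDD. Inside writeRDDToCheckpointDirectory a runJob is triggered to write the current RDD's data to the checkpoint directory, and a ReliableCheckpointRDD instance is created.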
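The advice in point 3 (persist before checkpoint) follows from the checkpoint write being a second job over the same RDD. The stand-alone sketch below counts how often an expensive computation runs with and without a cache. `run`, `evaluate`, and `persistFirst` are hypothetical names for illustration only. No Spark API is involved.

```scala
object PersistBeforeCheckpointSketch {
  // Runs one "action" followed by the checkpoint write and returns how many
  // times the expensive computation was executed.
  def run(persistFirst: Boolean): Int = {
    var computeCalls = 0
    var cache: Option[Vector[Int]] = None

    // Stands in for evaluating the RDD: reuse the cache when it is filled.
    def evaluate(): Vector[Int] = cache.getOrElse {
      computeCalls += 1
      val result = Vector(1, 2, 3).map(_ * 10)
      if (persistFirst) cache = Some(result)
      result
    }

    evaluate()                         // job 1: the action the user called
    val checkpointed = evaluate()      // job 2: the checkpoint write reads the RDD again
    require(checkpointed == Vector(10, 20, 30))
    computeCalls
  }
}
```

Under these assumptions `run(persistFirst = false)` computes the data twice, while `run(persistFirst = true)` computes it once and serves the checkpoint write from the cache.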

The checkpoint method in RDDCheckpointData.scala performs the actual checkpoint. Inside an RDDCheckpointData.synchronized block it first checks and flips cpState; it then calls doCheckpoint() outside the lock and finally records the result in a second synchronized block.

The checkpoint method source in RDDCheckpointData.scala:

    final def checkpoint(): Unit = {
      // Guard against multiple threads checkpointing the same RDD by
      // atomically flipping the state of this RDDCheckpointData
      RDDCheckpointData.synchronized {
        if (cpState == Initialized) {
          cpState = CheckpointingInProgress
        } else {
          return
        }
      }

      val newRDD = doCheckpoint()

      // Update our state and truncate the RDD lineage
      RDDCheckpointData.synchronized {
        cpRDD = Some(newRDD)
        cpState = Checkpointed
        rdd.markCheckpointed()
      }
    }
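The three-state guard used by this method (Initialized, CheckpointingInProgress, Checkpointed) can be modeled without Spark. In the sketch below, `ToyCheckpointData` and its `write` parameter are hypothetical stand-ins for RDDCheckpointData and doCheckpoint(), and it uses a per-instance lock rather than the global lock object of the real code.

```scala
object CheckpointStateSketch {
  sealed trait State
  case object Initialized extends State
  case object CheckpointingInProgress extends State
  case object Checkpointed extends State

  // Hypothetical stand-in for RDDCheckpointData; `write` plays the role of doCheckpoint().
  class ToyCheckpointData(write: () => String) {
    private val lock = new Object
    private var state: State = Initialized
    private var result: Option[String] = None

    def isCheckpointed: Boolean = lock.synchronized { state == Checkpointed }
    def checkpointResult: Option[String] = lock.synchronized { result }

    def checkpoint(): Unit = {
      // Only the first caller moves Initialized -> CheckpointingInProgress.
      val proceed = lock.synchronized {
        if (state == Initialized) { state = CheckpointingInProgress; true } else false
      }
      if (proceed) {
        val written = write() // the slow write runs outside the lock
        lock.synchronized {
          result = Some(written)
          state = Checkpointed
        }
      }
    }
  }
}
```

A second call to `checkpoint()` returns immediately because the state is no longer Initialized, so the write runs at most once per instance.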

doCheckpoint is declared in RDDCheckpointData.scala as an abstract method; there is no implementation at this level:

    protected def doCheckpoint(): CheckpointRDD[T]

RDDCheckpointData has two subclasses: LocalRDDCheckpointData and ReliableRDDCheckpointData. ReliableRDDCheckpointData provides the concrete doCheckpoint, which calls writeRDDToCheckpointDirectory.

The doCheckpoint source in ReliableRDDCheckpointData.scala:

    protected override def doCheckpoint(): CheckpointRDD[T] = {
      val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

      // Optionally clean our checkpoint files if the reference is out of scope
      if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
        rdd.context.cleaner.foreach { cleaner =>
          cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
        }
      }

      logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
      newRDD
    }

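writeRDDToCheckpointDirectory writes the RDD's data to checkpoint files and returns a ReliableCheckpointRDD.

- First it obtains the SparkContext and assigns it to the variable sc.

- It builds checkpointDirPath from checkpointDir.

- fs is the file system that checkpointDirPath belongs to; the directory is created on it.

- Next comes sc.broadcast, which broadcasts the serializable Hadoop configuration to all Executors.

- Then sc.runJob triggers a job that writes the data of the current RDD into the checkpoint directory.

- Finally it returns a ReliableCheckpointRDD. Whichever RDD is checkpointed, the result is a ReliableCheckpointRDD whose data source is the data under checkpointDirPath.toString and whose partitioner is originalRDD.partitioner. Here originalRDD is the RDD being checkpointed.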

The writeRDDToCheckpointDirectory source is as follows:

    def writeRDDToCheckpointDirectory[T: ClassTag](
        originalRDD: RDD[T],
        checkpointDir: String,
        blockSize: Int = -1): ReliableCheckpointRDD[T] = {

      val sc = originalRDD.sparkContext

      // Create the output path for the checkpoint
      val checkpointDirPath = new Path(checkpointDir)
      val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
      if (!fs.mkdirs(checkpointDirPath)) {
        throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
      }

      // Save to file, and reload it as an RDD
      val broadcastedConf = sc.broadcast(
        new SerializableConfiguration(sc.hadoopConfiguration))
      // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
      sc.runJob(originalRDD,
        writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

      if (originalRDD.partitioner.nonEmpty) {
        writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
      }

      val newRDD = new ReliableCheckpointRDD[T](
        sc, checkpointDirPath.toString, originalRDD.partitioner)
      if (newRDD.partitions.length != originalRDD.partitions.length) {
        throw new SparkException(
          s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
            s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
      }
      newRDD
    }

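ReliableCheckpointRDD is an RDD that reads back checkpoint files previously written to reliable storage. Its partitioner is the one passed in when the ReliableCheckpointRDD is constructed; if none is passed, the code falls back to the partitioner file in the checkpoint directory. getPartitions builds one partition per part-file. getPreferredLocations provides data locality by asking fs.getFileBlockLocations where the file's blocks are stored. compute reads the data through ReliableCheckpointRDD.readCheckpointFile.

ReliableCheckpointRDD.scala: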

    private[spark] class ReliableCheckpointRDD[T: ClassTag](
        sc: SparkContext,
        val checkpointPath: String,
        _partitioner: Option[Partitioner] = None
      ) extends CheckpointRDD[T](sc) {

      @transient private val hadoopConf = sc.hadoopConfiguration
      @transient private val cpath = new Path(checkpointPath)
      @transient private val fs = cpath.getFileSystem(hadoopConf)
      private val broadcastedConf = sc.broadcast(new SerializableConfiguration(hadoopConf))

      // Fail fast if checkpoint directory does not exist
      require(fs.exists(cpath), s"Checkpoint directory does not exist: $checkpointPath")

      /**
       * Return the path of the checkpoint directory this RDD reads data from.
       */
      override val getCheckpointFile: Option[String] = Some(checkpointPath)
      override val partitioner: Option[Partitioner] = {
        _partitioner.orElse {
          ReliableCheckpointRDD.readCheckpointedPartitionerFile(context, checkpointPath)
        }
      }

      /**
       * Return partitions described by the files in the checkpoint directory.
       *
       * Since the original RDD may belong to a prior application, there is no way to know a
       * priori the number of partitions to expect. This method assumes that the original set of
       * checkpoint files are fully preserved in a reliable storage across application lifespans.
       */
      protected override def getPartitions: Array[Partition] = {
        // listStatus can throw exception if path does not exist.
        val inputFiles = fs.listStatus(cpath)
          .map(_.getPath)
          .filter(_.getName.startsWith("part-"))
          .sortBy(_.getName.stripPrefix("part-").toInt)
        // Fail fast if input files are invalid
        inputFiles.zipWithIndex.foreach { case (path, i) =>
          if (path.getName != ReliableCheckpointRDD.checkpointFileName(i)) {
            throw new SparkException(s"Invalid checkpoint file: $path")
          }
        }
        Array.tabulate(inputFiles.length)(i => new CheckpointRDDPartition(i))
      }

      /**
       * Return the locations of the checkpoint file associated with the given partition.
       */
      protected override def getPreferredLocations(split: Partition): Seq[String] = {
        val status = fs.getFileStatus(
          new Path(checkpointPath, ReliableCheckpointRDD.checkpointFileName(split.index)))
        val locations = fs.getFileBlockLocations(status, 0, status.getLen)
        locations.headOption.toList.flatMap(_.getHosts).filter(_ != "localhost")
      }

      /**
       * Read the content of the checkpoint file associated with the given partition.
       */
      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        val file = new Path(checkpointPath, ReliableCheckpointRDD.checkpointFileName(split.index))
        ReliableCheckpointRDD.readCheckpointFile(file, broadcastedConf, context)
      }

    }
    ......
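The validation in getPartitions (keep only the part- files, sort them by numeric suffix, and require that they form the sequence 0, 1, 2, ... with no gaps) can be tried out on plain file names. This is a sketch under one stated assumption: the naming scheme in the hypothetical `checkpointFileName` below ("part-" followed by a zero-padded five-digit index) is assumed for illustration and is not taken from the excerpt above.

```scala
object PartFileSketch {
  // Assumed naming scheme: "part-" plus a zero-padded five-digit index.
  def checkpointFileName(index: Int): String = "part-%05d".format(index)

  // Returns the partition indices described by the file names, or throws
  // if the part-files do not form the contiguous sequence 0, 1, 2, ...
  def partitionIndices(fileNames: Seq[String]): Seq[Int] = {
    val parts = fileNames
      .filter(_.startsWith("part-"))
      .sortBy(_.stripPrefix("part-").toInt)
    parts.zipWithIndex.foreach { case (name, i) =>
      if (name != checkpointFileName(i)) {
        throw new IllegalStateException(s"Invalid checkpoint file: $name")
      }
    }
    parts.indices
  }
}
```

Files that do not start with "part-" are ignored, while a missing index makes the check fail fast instead of silently producing an RDD with fewer partitions.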

Now look at ReliableCheckpointRDD.readCheckpointFile, which the compute method of ReliableCheckpointRDD.scala calls. readCheckpointFile reads the contents of the given checkpoint file. It wraps the file input stream fileInputStream in a deserialization stream (deserializeStream) and then turns that stream into an Iterator.

The readCheckpointFile source in ReliableCheckpointRDD.scala:

    def readCheckpointFile[T](
        path: Path,
        broadcastedConf: Broadcast[SerializableConfiguration],
        context: TaskContext): Iterator[T] = {
      val env = SparkEnv.get
      val fs = path.getFileSystem(broadcastedConf.value.value)
      val bufferSize = env.conf.getInt("spark.buffer.size", 65536)
      val fileInputStream = fs.open(path, bufferSize)
      val serializer = env.serializer.newInstance()
      val deserializeStream = serializer.deserializeStream(fileInputStream)

      // Register an on-task-completion callback to close the input stream.
      context.addTaskCompletionListener(context => deserializeStream.close())

      deserializeStream.asIterator.asInstanceOf[Iterator[T]]
    }
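The idea of exposing a deserialization stream as a lazy Iterator can be sketched with plain Java serialization. This is not Spark's serializer: `write` and `read` are hypothetical helpers, and the sketch closes the stream when it reaches end-of-stream, whereas the code above closes it from a task-completion callback.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, EOFException}
import java.io.{ObjectInputStream, ObjectOutputStream}

object ReadCheckpointSketch {
  // Serializes the elements one after another, as a checkpoint writer might.
  def write(elements: Seq[String]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    elements.foreach(e => out.writeObject(e))
    out.close()
    bytes.toByteArray
  }

  // Wraps the stream in a lazy iterator that ends, and closes the stream,
  // when the end of the data is reached.
  def read(data: Array[Byte]): Iterator[String] = {
    val in = new ObjectInputStream(new ByteArrayInputStream(data))
    new Iterator[String] {
      private var nextValue: Option[String] = None
      private var finished = false

      // Reads one element ahead unless one is already buffered.
      private def fetch(): Unit =
        if (nextValue.isEmpty && !finished) {
          try {
            nextValue = Some(in.readObject().asInstanceOf[String])
          } catch {
            case _: EOFException =>
              finished = true
              in.close()
          }
        }

      def hasNext: Boolean = { fetch(); nextValue.isDefined }

      def next(): String = {
        fetch()
        val value = nextValue.getOrElse(throw new NoSuchElementException("end of stream"))
        nextValue = None
        value
      }
    }
  }
}
```

Elements are deserialized only as the iterator is consumed, so a partition is never loaded into memory as a whole just to be read back.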

The cleanCheckpoint method in ReliableRDDCheckpointData.scala cleans up the checkpoint files associated with an RDD's data:

    def cleanCheckpoint(sc: SparkContext, rddId: Int): Unit = {
      checkpointPath(sc, rddId).foreach { path =>
        path.getFileSystem(sc.hadoopConfiguration).delete(path, true)
      }
    }

In production we do not use LocalCheckpointRDD. Its getPartitions simply creates CheckpointRDDPartition objects from the range (0 until numPartitions) converted with toArray, and its compute method just throws an exception.

The LocalCheckpointRDD source:

    private[spark] class LocalCheckpointRDD[T: ClassTag](
        sc: SparkContext,
        rddId: Int,
        numPartitions: Int)
      extends CheckpointRDD[T](sc) {
      ......
      protected override def getPartitions: Array[Partition] = {
        (0 until numPartitions).toArray.map { i => new CheckpointRDDPartition(i) }
      }
      ......
      override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
        throw new SparkException(
          s"Checkpoint block ${RDDBlockId(rddId, partition.index)} not found! Either the executor " +
          s"that originally checkpointed this partition is no longer alive, or the original RDD is " +
          s"unpersisted. If this problem persists, you may consider using `rdd.checkpoint()` " +
          s"instead, which is slower than local checkpointing but more fault-tolerant.")
      }

    }

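When the best student hears of the Way, he practices it diligently; when the average student hears of the Way, he half keeps it and half loses it; when the poorest student hears of the Way, he laughs out loud at it. If he did not laugh, it would not be worthy of being the Way.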
