Lesson 41: Checkpoint Demystified: How Checkpoint Works and How Its Source Code Implements It

Source: Internet  Editor: 程序博客网  Time: 2024/05/29 02:38

I. What Exactly Is Checkpoint?

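1. In production, a Spark job often involves a very large number of RDDs produced by transformations (for example, a single job containing 10,000 RDDs), or an individual transformation produces an RDD that is especially complex and slow to compute (for example, one that takes more than an hour). In such cases we have to consider persisting the computed results.

2. Spark is good at multi-step iterative computation and at reuse across jobs. If the data produced by an earlier computation can be reused, efficiency improves greatly.

3. Using persist to keep data in memory is the fastest option but also the least reliable. Keeping it on disk is not completely reliable either: a disk can fail, or an administrator may wipe it.

4. Checkpoint exists to persist data more reliably. A checkpoint can be kept locally with multiple replicas, but in a normal production environment it is written to HDFS, which lets it rely on the fault tolerance and high reliability of HDFS to persist the data as safely as possible.

5. Checkpoint is an advanced Spark feature for reusing computed RDD data as reliably as possible. It protects the data by persisting it to HDFS.

6. Checkpoint targets the stages of an RDD computation chain that particularly need persisting, because the RDD of that stage will be used repeatedly later. From such a stage onward, data is persisted to and reused from HDFS or a similar store. Enabling the checkpoint mechanism on the RDD provides fault tolerance and high availability.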

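Before an RDD partition is computed, Spark first checks whether a checkpoint exists. If it does, the partition is read from the checkpoint and is not computed again.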

The iterator method in RDD.scala:

    final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
      if (storageLevel != StorageLevel.NONE) {
        getOrCompute(split, context)
      } else {
        computeOrReadCheckpoint(split, context)
      }
    }

 

Next, the getOrCompute method in RDD.scala (the excerpt stops at the match expression):

    private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
      val blockId = RDDBlockId(id, partition.index)
      var readCachedBlock = true
      // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
      SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
        readCachedBlock = false
        computeOrReadCheckpoint(partition, context)
      }) match {

 

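The fourth argument that getOrCompute passes to getOrElseUpdate is an anonymous function. It runs only when the block is not already cached, and it calls computeOrReadCheckpoint(partition, context), which reads from the checkpoint if one exists.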

The computeOrReadCheckpoint method in RDD.scala:

    private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
    {
      if (isCheckpointedAndMaterialized) {
        firstParent[T].iterator(split, context)
      } else {
        compute(split, context)
      }
    }

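In computeOrReadCheckpoint, isCheckpointedAndMaterialized is a Boolean that reports whether this RDD has been checkpointed and materialized. Spark 2.0 supports two kinds of checkpoint: reliable and local. isCheckpointedAndMaterialized was introduced as an alias for `isCheckpointed` to make the meaning of the return value clearer.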

The isCheckpointedAndMaterialized method:

    private[spark] def isCheckpointedAndMaterialized: Boolean =
      checkpointData.exists(_.isCheckpointed)

 

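Back in computeOrReadCheckpoint in RDD.scala: if the RDD has been checkpointed and materialized (isCheckpointedAndMaterialized is true), it calls iterator on firstParent[T], which at that point is the CheckpointRDD holding the saved data. If the RDD has not been checkpointed, it calls compute.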
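
The decision path described above can be illustrated with a small model that does not depend on Spark. This is only a sketch: ToyRDD, persistRequested, checkpointed, and computeCalls are hypothetical names invented here, and a single in-memory sequence stands in for a partition.

```scala
// Toy model only: ToyRDD is a hypothetical stand-in, not Spark's RDD class.
object IteratorPathSketch {
  final class ToyRDD(computeFn: () => Seq[Int]) {
    var parent: Option[ToyRDD] = None   // becomes the "checkpoint RDD" after checkpointing
    var persistRequested = false        // stands in for storageLevel != StorageLevel.NONE
    var checkpointed = false            // stands in for isCheckpointedAndMaterialized
    var computeCalls = 0                // counts how often the real computation ran
    private var cache: Option[Seq[Int]] = None

    // Mirrors the shape of RDD.iterator: use the cache path only if persistence was requested.
    def iterator(): Iterator[Int] =
      if (persistRequested) getOrCompute() else computeOrReadCheckpoint()

    // Mirrors getOrCompute: fill the cache on a miss, then serve from it.
    private def getOrCompute(): Iterator[Int] = {
      if (cache.isEmpty) cache = Some(computeOrReadCheckpoint().toList)
      cache.get.iterator
    }

    // Mirrors computeOrReadCheckpoint: read from the parent once checkpointed, else compute.
    private def computeOrReadCheckpoint(): Iterator[Int] =
      if (checkpointed) parent.get.iterator()
      else { computeCalls += 1; computeFn().iterator }
  }

  def demo(): (List[Int], Int) = {
    val expensive = new ToyRDD(() => Seq(1, 2, 3))
    val saved = expensive.iterator().toList           // first pass runs the computation
    expensive.parent = Some(new ToyRDD(() => saved))  // simulate the checkpoint RDD
    expensive.checkpointed = true
    val second = expensive.iterator().toList          // second pass reads the saved data
    (second, expensive.computeCalls)
  }
}
```

In this model the second read returns the same data while computeCalls stays at 1, which is the effect the checkpoint branch is meant to achieve.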

II. How Checkpoint Works

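1. Call SparkContext.setCheckpointDir to specify where a checkpointed RDD stores its data. On a production cluster this is a directory on HDFS. setCheckpointDir accepts a single directory path, and Spark creates a uniquely named subdirectory under it.

   Let us look at SparkContext, which sets the directory under which the RDDs about to be computed will be checkpointed. When running on a cluster, the directory must be an HDFS path.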

The setCheckpointDir method in SparkContext.scala:

    def setCheckpointDir(directory: String) {

      // If we are running on a cluster, log a warning if the directory is local.
      // Otherwise, the driver may attempt to reconstruct the checkpointed RDD from
      // its own local file system, which is incorrect because the checkpoint files
      // are actually on the executor machines.
      if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
        logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
          s"must not be on the local filesystem. Directory '$directory' " +
          "appears to be on the local filesystem.")
      }

      checkpointDir = Option(directory).map { dir =>
        val path = new Path(dir, UUID.randomUUID().toString)
        val fs = path.getFileSystem(hadoopConfiguration)
        fs.mkdirs(path)
        fs.getFileStatus(path).getPath.toString
      }
    }
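
The last part of setCheckpointDir creates a fresh, randomly named subdirectory under the directory the caller supplies. A rough local-filesystem analogue, assuming java.nio in place of the Hadoop FileSystem API, might look like this. CheckpointDirSketch and createCheckpointDir are hypothetical names used only for illustration.

```scala
import java.nio.file.{Files, Path}
import java.util.UUID

// Sketch only: uses the local file system, not Hadoop's FileSystem as Spark does.
object CheckpointDirSketch {
  // Create a uniquely named subdirectory under `base` and return its absolute path.
  def createCheckpointDir(base: Path): Path = {
    val dir = base.resolve(UUID.randomUUID().toString)
    Files.createDirectories(dir)
    dir.toAbsolutePath
  }
}
```

Because the subdirectory name is a random UUID, two calls with the same base directory yield two separate directories, so checkpoint files written under one do not collide with those under the other.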

 

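The checkpoint method in RDD.scala marks the RDD for checkpointing. The RDD will be saved to a file inside the checkpoint directory set with `SparkContext#setCheckpointDir`, and all references to its parent RDDs will be removed. This function must be called before any job has been executed on this RDD. It is strongly recommended to persist the RDD in memory first; otherwise saving it to a file requires recomputing it.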

The checkpoint method in RDD.scala:

    def checkpoint(): Unit = RDDCheckpointData.synchronized {
      // NOTE: we use a global lock here due to complexities downstream with ensuring
      // children RDD partitions point to the correct parent partitions. In the future
      // we should revisit this consideration.
      if (context.checkpointDir.isEmpty) {
        throw new SparkException("Checkpoint directory has not been set in the SparkContext")
      } else if (checkpointData.isEmpty) {
        checkpointData = Some(new ReliableRDDCheckpointData(this))
      }
    }

 

Here checkpointData is an Option[RDDCheckpointData[T]]:

    private[spark] var checkpointData: Option[RDDCheckpointData[T]] = None

RDDCheckpointData marks an RDD as one to be checkpointed. When an RDD is marked for checkpointing, the Spark framework creates an RDDCheckpointData for it internally:

    private[spark] abstract class RDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
      extends Serializable {

      import CheckpointState._

      // The checkpoint state of the associated RDD.
      protected var cpState = Initialized

      // The RDD that contains our checkpointed data
      private var cpRDD: Option[CheckpointRDD[T]] = None

      // TODO: are we sure we need to use a global lock in the following methods?

      /**
       * Return whether the checkpoint data for this RDD is already persisted.
       */
      def isCheckpointed: Boolean = RDDCheckpointData.synchronized {
        cpState == Checkpointed
      }

      /**
       * Materialize this RDD and persist its content.
       * This is called immediately after the first action invoked on this RDD has completed.
       */
      final def checkpoint(): Unit = {
        // Guard against multiple threads checkpointing the same RDD by
        // atomically flipping the state of this RDDCheckpointData
        RDDCheckpointData.synchronized {
          if (cpState == Initialized) {
            cpState = CheckpointingInProgress
          } else {
            return
          }
        }

        val newRDD = doCheckpoint()

        // Update our state and truncate the RDD lineage
        RDDCheckpointData.synchronized {
          cpRDD = Some(newRDD)
          cpState = Checkpointed
          rdd.markCheckpointed()
        }
      }

      /**
       * Materialize this RDD and persist its content.
       *
       * Subclasses should override this method to define custom checkpointing behavior.
       * @return the checkpoint RDD created in the process.
       */
      protected def doCheckpoint(): CheckpointRDD[T]

      /**
       * Return the RDD that contains our checkpointed data.
       * This is only defined if the checkpoint state is `Checkpointed`.
       */
      def checkpointRDD: Option[CheckpointRDD[T]] = RDDCheckpointData.synchronized { cpRDD }

      /**
       * Return the partitions of the resulting checkpoint RDD.
       * For tests only.
       */
      def getPartitions: Array[Partition] = RDDCheckpointData.synchronized {
        cpRDD.map(_.partitions).getOrElse { Array.empty }
      }

    }

    /**
     * Global lock for synchronizing checkpoint operations.
     */
    private[spark] object RDDCheckpointData

 

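2. When an RDD is checkpointed, all the RDDs it depends on are removed from its computation chain.

3. As a best practice, call persist before calling checkpoint so that the RDD's data is kept in memory or on disk. The reason is that checkpoint is lazy: a job must run, and only after that job finishes does Spark trace back from the last RDD to find which RDDs were marked for checkpointing. It then launches a new job to carry out the actual checkpoint of each marked RDD. Without persist, that second job has to compute the RDD again.

4. Checkpoint changes the lineage of the RDD.

5. When we call the checkpoint method on an RDD, the framework automatically creates an RDDCheckpointData. As soon as a job has run on the RDD, the checkpoint method of RDDCheckpointData is triggered, and it calls doCheckpoint internally. In production this is the doCheckpoint of ReliableRDDCheckpointData, which calls writeRDDToCheckpointDirectory of ReliableCheckpointRDD. Inside writeRDDToCheckpointDirectory, runJob is triggered to write the data of the current RDD to the checkpoint directory, and a ReliableCheckpointRDD instance is created.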
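
The points above translate into a short usage pattern: set the checkpoint directory, persist, mark the RDD, then run an action. The following is a minimal sketch that assumes Spark is on the classpath and runs in local mode. The object name, the application name, and the path /tmp/spark-checkpoint-sketch are placeholders chosen for illustration; on a cluster the directory would be an HDFS path.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Usage sketch only; names and paths are illustrative assumptions.
object CheckpointUsageSketch {
  // Returns (isCheckpointed, checkpoint file location) after the first action.
  def run(checkpointDir: String): (Boolean, Option[String]) = {
    val conf = new SparkConf().setAppName("checkpoint-usage-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      sc.setCheckpointDir(checkpointDir)

      val expensive = sc.parallelize(1 to 1000, 4).map(x => x.toLong * x)

      // Persist first so the separate checkpoint job can reuse the cached data.
      expensive.persist(StorageLevel.MEMORY_AND_DISK)
      // Only marks the RDD; nothing is written until a job has run on it.
      expensive.checkpoint()

      // The first action runs the job; the checkpoint is written right after it.
      expensive.count()

      (expensive.isCheckpointed, expensive.getCheckpointFile)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    println(run("/tmp/spark-checkpoint-sketch"))
  }
}
```

Calling checkpoint() before the first action matters here, since the method only marks the RDD and the write happens after a job has completed on it.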



 

The checkpoint method in RDDCheckpointData.scala performs the actual checkpoint: inside an RDDCheckpointData.synchronized block it first checks cpState, and then it calls doCheckpoint().

The checkpoint method in RDDCheckpointData.scala:

    final def checkpoint(): Unit = {
      // Guard against multiple threads checkpointing the same RDD by
      // atomically flipping the state of this RDDCheckpointData
      RDDCheckpointData.synchronized {
        if (cpState == Initialized) {
          cpState = CheckpointingInProgress
        } else {
          return
        }
      }

      val newRDD = doCheckpoint()

      // Update our state and truncate the RDD lineage
      RDDCheckpointData.synchronized {
        cpRDD = Some(newRDD)
        cpState = Checkpointed
        rdd.markCheckpointed()
      }
    }
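
The guard in this method is a small state machine: Initialized, then CheckpointingInProgress, then Checkpointed, with the slow write done outside the lock. The sketch below models that pattern without Spark. ToyCheckpointData and its write parameter are hypothetical stand-ins for RDDCheckpointData and doCheckpoint(), and it uses a per-instance lock for simplicity, whereas Spark synchronizes on the global RDDCheckpointData object.

```scala
// Toy model only: not Spark's RDDCheckpointData.
object CheckpointStateSketch {
  object State extends Enumeration {
    val Initialized, CheckpointingInProgress, Checkpointed = Value
  }

  // `write` plays the role of doCheckpoint() and returns where the data was saved.
  final class ToyCheckpointData(write: () => String) {
    private val lock = new Object
    private var state = State.Initialized
    private var location: Option[String] = None

    def isCheckpointed: Boolean = lock.synchronized { state == State.Checkpointed }

    def checkpointLocation: Option[String] = lock.synchronized { location }

    def checkpoint(): Unit = {
      // Flip the state under the lock so that only one caller proceeds.
      val shouldRun = lock.synchronized {
        if (state == State.Initialized) {
          state = State.CheckpointingInProgress
          true
        } else {
          false
        }
      }
      if (shouldRun) {
        val written = write() // the slow part runs outside the lock
        lock.synchronized {
          location = Some(written)
          state = State.Checkpointed
        }
      }
    }
  }

  def demo(): (Int, Boolean) = {
    var writes = 0
    val data = new ToyCheckpointData(() => { writes += 1; "checkpoint-dir/rdd-0" })
    data.checkpoint()
    data.checkpoint() // returns early because the state is no longer Initialized
    (writes, data.isCheckpointed)
  }
}
```

In demo the write function runs once even though checkpoint() is called twice, which is what the state check is there to guarantee.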

 

doCheckpoint is declared in RDDCheckpointData.scala without an implementation:

    protected def doCheckpoint(): CheckpointRDD[T]

The subclasses of RDDCheckpointData are LocalRDDCheckpointData and ReliableRDDCheckpointData. ReliableRDDCheckpointData provides the concrete doCheckpoint, which calls writeRDDToCheckpointDirectory.

The doCheckpoint method in ReliableRDDCheckpointData.scala:

      protected override def doCheckpoint(): CheckpointRDD[T] = {
        val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

        // Optionally clean our checkpoint files if the reference is out of scope
        if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
          rdd.context.cleaner.foreach { cleaner =>
            cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
          }
        }

        logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
        newRDD
      }

    }

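writeRDDToCheckpointDirectory writes the RDD's data to checkpoint files and returns a ReliableCheckpointRDD.

• First it obtains the SparkContext and assigns it to sc.

• It creates checkpointDirPath from checkpointDir.

• fs gets the file system for that path, and the directory is created on it.

• Then sc.broadcast broadcasts the Hadoop configuration, wrapped in a SerializableConfiguration, to all executors.

• Next, sc.runJob triggers a job that writes the data of the current RDD to the checkpoint directory.

• Finally it returns a ReliableCheckpointRDD. Whichever RDD is checkpointed, the result is a ReliableCheckpointRDD whose data source is the data under checkpointDirPath.toString and whose partitioner is originalRDD.partitioner. Here originalRDD is the RDD being checkpointed.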

The source of writeRDDToCheckpointDirectory:

    def writeRDDToCheckpointDirectory[T: ClassTag](
        originalRDD: RDD[T],
        checkpointDir: String,
        blockSize: Int = -1): ReliableCheckpointRDD[T] = {

      val sc = originalRDD.sparkContext

      // Create the output path for the checkpoint
      val checkpointDirPath = new Path(checkpointDir)
      val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
      if (!fs.mkdirs(checkpointDirPath)) {
        throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
      }

      // Save to file, and reload it as an RDD
      val broadcastedConf = sc.broadcast(
        new SerializableConfiguration(sc.hadoopConfiguration))
      // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
      sc.runJob(originalRDD,
        writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

      if (originalRDD.partitioner.nonEmpty) {
        writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
      }

      val newRDD = new ReliableCheckpointRDD[T](
        sc, checkpointDirPath.toString, originalRDD.partitioner)
      if (newRDD.partitions.length != originalRDD.partitions.length) {
        throw new SparkException(
          s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
            s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
      }
      newRDD
    }
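
The overall shape of this method (create the directory, write one file per partition, reload, and check that the partition count still matches) can be sketched on the local file system. This is an illustration under simplifying assumptions: plain Java serialization and java.nio replace Spark's serializer and the Hadoop FileSystem, everything runs in one process instead of in a job, and WriteCheckpointSketch with its methods are hypothetical names.

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}
import java.nio.file.{Files, Path}

// Sketch only: a single-process analogue of "save to file, and reload it".
object WriteCheckpointSketch {
  def partFileName(index: Int): String = "part-%05d".format(index)

  // Write each partition to its own file under `dir`.
  def writePartitions(partitions: Seq[Seq[Int]], dir: Path): Unit = {
    Files.createDirectories(dir)
    partitions.zipWithIndex.foreach { case (data, i) =>
      val out = new ObjectOutputStream(Files.newOutputStream(dir.resolve(partFileName(i))))
      try out.writeObject(data.toArray) finally out.close()
    }
  }

  // Read one partition back from its file.
  def readPartition(dir: Path, index: Int): Seq[Int] = {
    val in = new ObjectInputStream(Files.newInputStream(dir.resolve(partFileName(index))))
    try in.readObject().asInstanceOf[Array[Int]].toList finally in.close()
  }

  // Write all partitions, verify the file count, then reload them in order.
  def roundTrip(partitions: Seq[Seq[Int]]): Seq[Seq[Int]] = {
    val dir = Files.createTempDirectory("checkpoint-sketch")
    writePartitions(partitions, dir)
    val fileCount = dir.toFile.listFiles().count(_.getName.startsWith("part-"))
    require(fileCount == partitions.length,
      s"expected ${partitions.length} partition files but found $fileCount")
    partitions.indices.map(i => readPartition(dir, i))
  }
}
```

The require call plays the same role as the partition-count comparison at the end of the Spark method: a mismatch means the reloaded data cannot stand in for the original.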

 

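ReliableCheckpointRDD is an RDD that reads from checkpoint files previously written to reliable storage. Its partitioner is the one passed in when the ReliableCheckpointRDD is constructed or, if none is passed, the one read back from the checkpoint directory. getPartitions builds the partitions one by one. getPreferredLocations provides data locality, using fs.getFileBlockLocations to find where the file's blocks are stored. compute reads the data through ReliableCheckpointRDD.readCheckpointFile.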

ReliableCheckpointRDD.scala:

    private[spark] class ReliableCheckpointRDD[T: ClassTag](
        sc: SparkContext,
        val checkpointPath: String,
        _partitioner: Option[Partitioner] = None
      ) extends CheckpointRDD[T](sc) {

      @transient private val hadoopConf = sc.hadoopConfiguration
      @transient private val cpath = new Path(checkpointPath)
      @transient private val fs = cpath.getFileSystem(hadoopConf)
      private val broadcastedConf = sc.broadcast(new SerializableConfiguration(hadoopConf))

      // Fail fast if checkpoint directory does not exist
      require(fs.exists(cpath), s"Checkpoint directory does not exist: $checkpointPath")

      /**
       * Return the path of the checkpoint directory this RDD reads data from.
       */
      override val getCheckpointFile: Option[String] = Some(checkpointPath)
      override val partitioner: Option[Partitioner] = {
        _partitioner.orElse {
          ReliableCheckpointRDD.readCheckpointedPartitionerFile(context, checkpointPath)
        }
      }

      /**
       * Return partitions described by the files in the checkpoint directory.
       *
       * Since the original RDD may belong to a prior application, there is no way to know a
       * priori the number of partitions to expect. This method assumes that the original set of
       * checkpoint files are fully preserved in a reliable storage across application lifespans.
       */
      protected override def getPartitions: Array[Partition] = {
        // listStatus can throw exception if path does not exist.
        val inputFiles = fs.listStatus(cpath)
          .map(_.getPath)
          .filter(_.getName.startsWith("part-"))
          .sortBy(_.getName.stripPrefix("part-").toInt)
        // Fail fast if input files are invalid
        inputFiles.zipWithIndex.foreach { case (path, i) =>
          if (path.getName != ReliableCheckpointRDD.checkpointFileName(i)) {
            throw new SparkException(s"Invalid checkpoint file: $path")
          }
        }
        Array.tabulate(inputFiles.length)(i => new CheckpointRDDPartition(i))
      }

      /**
       * Return the locations of the checkpoint file associated with the given partition.
       */
      protected override def getPreferredLocations(split: Partition): Seq[String] = {
        val status = fs.getFileStatus(
          new Path(checkpointPath, ReliableCheckpointRDD.checkpointFileName(split.index)))
        val locations = fs.getFileBlockLocations(status, 0, status.getLen)
        locations.headOption.toList.flatMap(_.getHosts).filter(_ != "localhost")
      }

      /**
       * Read the content of the checkpoint file associated with the given partition.
       */
      override def compute(split: Partition, context: TaskContext): Iterator[T] = {
        val file = new Path(checkpointPath, ReliableCheckpointRDD.checkpointFileName(split.index))
        ReliableCheckpointRDD.readCheckpointFile(file, broadcastedConf, context)
      }

    }
    ......
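
The getPartitions logic (filter the part- files, sort them by index, and fail fast if any index is missing) can be tried out on plain file names. This is a sketch with hypothetical names: PartitionFilesSketch and validatePartFiles are not Spark APIs, the five-digit part-%05d naming is an assumption about what checkpointFileName produces, and "_partitioner" appears in the usage note only as an example of a non-partition file in the same directory.

```scala
// Sketch only: works on file names, not on a Hadoop FileSystem listing.
object PartitionFilesSketch {
  // Assumed naming scheme for partition files (five-digit, zero-padded index).
  def partFileName(index: Int): String = "part-%05d".format(index)

  // Returns the number of partitions, or an error message for the first invalid file.
  def validatePartFiles(fileNames: Seq[String]): Either[String, Int] = {
    val parts = fileNames
      .filter(_.startsWith("part-"))
      .sortBy(_.stripPrefix("part-").toInt)
    // After sorting, the file at position i must be exactly the file for partition i.
    parts.zipWithIndex.collectFirst {
      case (name, i) if name != partFileName(i) => s"Invalid checkpoint file: $name"
    } match {
      case Some(error) => Left(error)
      case None => Right(parts.length)
    }
  }
}
```

For example, validatePartFiles(Seq("part-00001", "_partitioner", "part-00000")) yields Right(2), while a listing with part-00000 and part-00002 but no part-00001 yields a Left naming part-00002. The number of partitions is derived entirely from the files found, which is why a missing file has to be treated as an error.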

 

Now look at ReliableCheckpointRDD.readCheckpointFile, which compute calls in ReliableCheckpointRDD.scala. readCheckpointFile reads the contents of the given checkpoint file: it wraps the file input stream fileInputStream in a deserialization stream, deserializeStream, and then turns deserializeStream into an Iterator.

The readCheckpointFile method in ReliableCheckpointRDD.scala:

      def readCheckpointFile[T](
          path: Path,
          broadcastedConf: Broadcast[SerializableConfiguration],
          context: TaskContext): Iterator[T] = {
        val env = SparkEnv.get
        val fs = path.getFileSystem(broadcastedConf.value.value)
        val bufferSize = env.conf.getInt("spark.buffer.size", 65536)
        val fileInputStream = fs.open(path, bufferSize)
        val serializer = env.serializer.newInstance()
        val deserializeStream = serializer.deserializeStream(fileInputStream)

        // Register an on-task-completion callback to close the input stream.
        context.addTaskCompletionListener(context => deserializeStream.close())

        deserializeStream.asIterator.asInstanceOf[Iterator[T]]
      }

    }
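
Turning a deserialization stream into an Iterator means reading one object at a time and treating end of stream as the end of the iteration. The sketch below shows that idea with plain Java serialization. It is an assumption-laden illustration, not Spark's serializer: ReadStreamSketch and its methods are hypothetical names, and the stream is closed when the end is reached rather than through a task-completion listener.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, EOFException}
import java.io.{ObjectInputStream, ObjectOutputStream}

// Sketch only: Java serialization stands in for Spark's DeserializationStream.
object ReadStreamSketch {
  // Write the items one by one, the way records are appended to a partition file.
  def serialize(items: Seq[String]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    items.foreach(item => out.writeObject(item))
    out.close()
    bytes.toByteArray
  }

  // Lazily read objects until the stream is exhausted, then close it.
  def asIterator(in: ObjectInputStream): Iterator[String] = new Iterator[String] {
    private var nextItem: Option[String] = None
    private var finished = false

    private def fetch(): Unit = {
      if (nextItem.isEmpty && !finished) {
        try {
          nextItem = Some(in.readObject().asInstanceOf[String])
        } catch {
          case _: EOFException =>
            finished = true
            in.close()
        }
      }
    }

    def hasNext: Boolean = { fetch(); nextItem.nonEmpty }

    def next(): String = {
      fetch()
      val item = nextItem.getOrElse(throw new NoSuchElementException("end of stream"))
      nextItem = None
      item
    }
  }

  def roundTrip(items: Seq[String]): List[String] =
    asIterator(new ObjectInputStream(new ByteArrayInputStream(serialize(items)))).toList
}
```

Because the iterator is lazy, nothing is read until the consumer asks for an element, so the stream has to stay open until the consumer is done with it.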

 

The cleanCheckpoint method in ReliableRDDCheckpointData.scala removes the checkpoint files associated with an RDD:

    def cleanCheckpoint(sc: SparkContext, rddId: Int): Unit = {
      checkpointPath(sc, rddId).foreach { path =>
        path.getFileSystem(sc.hadoopConfiguration).delete(path, true)
      }
    }

 

 

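In production we do not use LocalCheckpointRDD. Its getPartitions simply creates one CheckpointRDDPartition per index from (0 until numPartitions).toArray, and its compute method just throws an exception, whose message explains that the locally checkpointed block could not be found.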

LocalCheckpointRDD source:

    private[spark] class LocalCheckpointRDD[T: ClassTag](
        sc: SparkContext,
        rddId: Int,
        numPartitions: Int)
      extends CheckpointRDD[T](sc) {
      ......
      protected override def getPartitions: Array[Partition] = {
        (0 until numPartitions).toArray.map { i => new CheckpointRDDPartition(i) }
      }
      ......
      override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
        throw new SparkException(
          s"Checkpoint block ${RDDBlockId(rddId, partition.index)} not found! Either the executor " +
          s"that originally checkpointed this partition is no longer alive, or the original RDD is " +
          s"unpersisted. If this problem persists, you may consider using `rdd.checkpoint()` " +
          s"instead, which is slower than local checkpointing but more fault-tolerant.")
      }

    }

 

 


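When the best students hear of the Way, they practice it diligently; when average students hear of the Way, they half keep it and half lose it; when the poorest students hear of the Way, they laugh out loud at it. If they did not laugh, it would not be worthy of being the Way.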

