Spark Source Code Reading Notes, RDD (Part 5): How checkpoint works in RDD


----------------------------目录----------------------------

Why do we need checkpoint?

What checkpoint does

Source code analysis

------------------------------------------------------------


Why do we need checkpoint?

As we know, both checkpoint and persist "save" data. persist can store data in three forms: on disk, in memory, or serialized; the Tachyon (off-heap) option was not yet functional at the time. The StorageLevel class looks like this:

class StorageLevel private(
    private var _useDisk: Boolean,      // store on disk
    private var _useMemory: Boolean,    // store in memory
    private var _useOffHeap: Boolean,   // off-heap, i.e. Tachyon (now Alluxio)
    private var _deserialized: Boolean, // keep the data deserialized
    private var _replication: Int = 1)
  extends Externalizable {
  /* body omitted */
}
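For context, here is a minimal sketch of how these storage levels are used with persist (sc is the usual SparkContext; the input path is hypothetical):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs://namenode:8020/data/input.txt")  // hypothetical path
lines.persist(StorageLevel.MEMORY_ONLY)              // memory only; blocks may be evicted under pressure
val words = lines.flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_AND_DISK_SER)      // serialized in memory, spilling to executor-local disk
words.count()                                        // the first action materializes and caches the data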

(1) When persist puts data in memory: while processing large amounts of data, newly persisted blocks may evict earlier ones. So persisting to memory is very fast, but it is arguably the least reliable form of storage.

(2) When the data is stored on disk, it lives in the local directories of the machines running the executors, and the default StorageLevel replication is 1. If the disk holding our data is destroyed, the data is lost.
(3) checkpoint is designed to solve the problem in (2): it writes the data to HDFS, relying on HDFS's fault tolerance and reliability to achieve more durable persistence.
Note: in cluster mode, HDFS typically keeps 3 replicas of each block across three nodes (a, b, c), with a on one rack and b, c on another rack (one copy of the data becomes three copies, spread over three nodes on two racks).


What checkpoint does
When checkpoint marks the current RDD for checkpointing, the data is eventually written as binary files into the checkpoint directory, which is set with SparkContext.setCheckpointDir(). During checkpointing, all of the RDD's references to its parent RDDs (its lineage) are removed. Calling checkpoint on an RDD does not execute it immediately; the write is only triggered once an action runs. When the checkpointed data is needed later, ReliableCheckpointRDD's readCheckpointFile method reads the already checkpointed data back from the file path and uses it.


Source code analysis

First, setCheckpointDir sets the storage location for checkpoints; on a cluster this path must be an HDFS path:

/**
 * Set the directory under which RDDs are going to be checkpointed.
 * If running on a cluster, this directory must be an HDFS path.
 */
def setCheckpointDir(directory: String) {
  // If we are running on a cluster but the directory is local, warn the user:
  // the driver may try to reconstruct the checkpointed RDD from this directory,
  // while the files were actually written on the executors.
  if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
    logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
      s"must not be on the local filesystem. Directory '$directory' " +
      "appears to be on the local filesystem.")
  }
  checkpointDir = Option(directory).map { dir =>
    val path = new Path(dir, UUID.randomUUID().toString) // append a random UUID to the path
    val fs = path.getFileSystem(hadoopConfiguration)     // get the Hadoop FileSystem for this path
    fs.mkdirs(path)                                      // create the directory on HDFS
    fs.getFileStatus(path).getPath.toString              // return the fully qualified path as a String
  }
}
Next, the checkpoint function on RDD: it first checks whether the checkpoint directory has been set (throwing an exception if not), and then, if checkpointData has not been initialized yet, creates a ReliableRDDCheckpointData for this RDD:

def checkpoint(): Unit = RDDCheckpointData.synchronized {
  // NOTE: we use a global lock here due to complexities downstream with ensuring
  // children RDD partitions point to the correct parent partitions. In the future
  // we should revisit this consideration.
  if (context.checkpointDir.isEmpty) {
    throw new SparkException("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    checkpointData = Some(new ReliableRDDCheckpointData(this))
  }
}
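A small sketch of the two checks above (the exception message is the one thrown in the code; the driver program itself is hypothetical):

val nums = sc.parallelize(1 to 100)
// nums.checkpoint()  // would throw SparkException("Checkpoint directory has not been set in the SparkContext")
sc.setCheckpointDir("hdfs://namenode:8020/spark/checkpoints")  // hypothetical HDFS path
nums.checkpoint()     // now checkpointData = Some(new ReliableRDDCheckpointData(nums))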
We can see that the data is stored through ReliableRDDCheckpointData(this). Let's look at the ReliableRDDCheckpointData class:

/**
 * An implementation of checkpointing that writes the RDD data to reliable storage.
 * This allows drivers to be restarted on failure with previously computed state.
 */
private[spark] class ReliableRDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
  extends RDDCheckpointData[T](rdd) with Logging {

  // The directory to which the associated RDD has been checkpointed to
  // This is assumed to be a non-local path that points to some reliable storage
  // (cpDir is derived from the SparkContext's checkpoint dir and this RDD's id)
  private val cpDir: String =
    ReliableRDDCheckpointData.checkpointPath(rdd.context, rdd.id)
      .map(_.toString)
      .getOrElse { throw new SparkException("Checkpoint dir must be specified.") }

  /**
   * Return the directory to which this RDD was checkpointed.
   * If the RDD is not checkpointed yet, return None.
   */
  // Check whether this RDD has already been checkpointed; if not, return None.
  def getCheckpointDir: Option[String] = RDDCheckpointData.synchronized {
    if (isCheckpointed) {
      Some(cpDir.toString)
    } else {
      None
    }
  }

  /**
   * Materialize this RDD and write its content to a reliable DFS.
   * This is called immediately after the first action invoked on this RDD has completed.
   */
  protected override def doCheckpoint(): CheckpointRDD[T] = {
    val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

    // Optionally clean our checkpoint files if the reference is out of scope
    if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
      rdd.context.cleaner.foreach { cleaner =>
        cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
      }
    }

    logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
    newRDD
  }
}
The companion object with the same name provides two functions, checkpointPath and cleanCheckpoint:

private[spark] object ReliableRDDCheckpointData extends Logging {

  /** Return the path of the directory to which this RDD's checkpoint data is written. */
  def checkpointPath(sc: SparkContext, rddId: Int): Option[Path] = {
    sc.checkpointDir.map { dir => new Path(dir, s"rdd-$rddId") }
  }

  /** Clean up the files associated with the checkpoint data for this RDD. */
  def cleanCheckpoint(sc: SparkContext, rddId: Int): Unit = {
    checkpointPath(sc, rddId).foreach { path =>
      val fs = path.getFileSystem(sc.hadoopConfiguration)
      if (fs.exists(path)) {
        if (!fs.delete(path, true)) {
          logWarning(s"Error deleting ${path.toString()}")
        }
      }
    }
  }
}
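Putting setCheckpointDir and checkpointPath together, the checkpoint directory for a given RDD is composed roughly like this (the UUID and RDD id below are made up for illustration):

// sc.setCheckpointDir("hdfs://namenode:8020/ckpt")
//   => sc.checkpointDir = Some("hdfs://namenode:8020/ckpt/<random-UUID>")
// ReliableRDDCheckpointData.checkpointPath(sc, rddId = 7)
//   => Some(new Path("hdfs://namenode:8020/ckpt/<random-UUID>/rdd-7"))
// Each task later writes its partition as a part file under that rdd-7 directory.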

Now let's focus on the doCheckpoint function: it produces a new RDD through ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir).

/**
 * Write RDD to checkpoint files and return a ReliableCheckpointRDD representing the RDD.
 */
def writeRDDToCheckpointDirectory[T: ClassTag](
    originalRDD: RDD[T],
    checkpointDir: String,
    blockSize: Int = -1): ReliableCheckpointRDD[T] = {

  val sc = originalRDD.sparkContext

  // Create the output path for the checkpoint (checkpointDir is the directory we configured)
  val checkpointDirPath = new Path(checkpointDir)
  val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
  if (!fs.mkdirs(checkpointDirPath)) {
    throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
  }

  // Save to file, and reload it as an RDD
  val broadcastedConf = sc.broadcast(
    new SerializableConfiguration(sc.hadoopConfiguration))
  // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
  // Note: this is costly because the RDD we need is computed one more time here
  sc.runJob(originalRDD,
    writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

  if (originalRDD.partitioner.nonEmpty) {
    writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
  }

  val newRDD = new ReliableCheckpointRDD[T](
    sc, checkpointDirPath.toString, originalRDD.partitioner)
  if (newRDD.partitions.length != originalRDD.partitions.length) {
    throw new SparkException(
      s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
        s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
  }
  newRDD
}
Analysis: runJob recomputes the RDD we need, which means the RDD is computed a second time. If we persist this RDD before checkpointing it, we get a better result. Now let's look at the writePartitionToCheckpointFile function passed to runJob:

/**
 * Write a RDD partition's data to a checkpoint file.
 */
def writePartitionToCheckpointFile[T: ClassTag](
    path: String,
    broadcastedConf: Broadcast[SerializableConfiguration],
    blockSize: Int = -1)(ctx: TaskContext, iterator: Iterator[T]) {
  val env = SparkEnv.get
  val outputDir = new Path(path)
  val fs = outputDir.getFileSystem(broadcastedConf.value.value)

  val finalOutputName = ReliableCheckpointRDD.checkpointFileName(ctx.partitionId())
  val finalOutputPath = new Path(outputDir, finalOutputName)
  val tempOutputPath =
    new Path(outputDir, s".$finalOutputName-attempt-${ctx.attemptNumber()}")

  if (fs.exists(tempOutputPath)) {
    throw new IOException(s"Checkpoint failed: temporary path $tempOutputPath already exists")
  }
  val bufferSize = env.conf.getInt("spark.buffer.size", 65536)

  val fileOutputStream = if (blockSize < 0) {
    fs.create(tempOutputPath, false, bufferSize)
  } else {
    // This is mainly for testing purpose
    fs.create(tempOutputPath, false, bufferSize,
      fs.getDefaultReplication(fs.getWorkingDirectory), blockSize)
  }
  val serializer = env.serializer.newInstance()
  val serializeStream = serializer.serializeStream(fileOutputStream)
  Utils.tryWithSafeFinally {
    serializeStream.writeAll(iterator)
  } {
    serializeStream.close()
  }

  if (!fs.rename(tempOutputPath, finalOutputPath)) {
    if (!fs.exists(finalOutputPath)) {
      logInfo(s"Deleting tempOutputPath $tempOutputPath")
      fs.delete(tempOutputPath, false)
      throw new IOException("Checkpoint failed: failed to save output of task: " +
        s"${ctx.attemptNumber()} and final output path does not exist: $finalOutputPath")
    } else {
      // Some other copy of this task must've finished before us and renamed it
      logInfo(s"Final output path $finalOutputPath already exists; not overwriting it")
      if (!fs.delete(tempOutputPath, false)) {
        logWarning(s"Error deleting ${tempOutputPath}")
      }
    }
  }
}

/**
 * Write a partitioner to the given RDD checkpoint directory. This is done on a best-effort
 * basis; any exception while writing the partitioner is caught, logged and ignored.
 */
private def writePartitionerToCheckpointDir(
    sc: SparkContext, partitioner: Partitioner, checkpointDirPath: Path): Unit = {
  try {
    val partitionerFilePath = new Path(checkpointDirPath, checkpointPartitionerFileName)
    val bufferSize = sc.conf.getInt("spark.buffer.size", 65536)
    val fs = partitionerFilePath.getFileSystem(sc.hadoopConfiguration)
    val fileOutputStream = fs.create(partitionerFilePath, false, bufferSize)
    val serializer = SparkEnv.get.serializer.newInstance()
    val serializeStream = serializer.serializeStream(fileOutputStream)
    Utils.tryWithSafeFinally {
      serializeStream.writeObject(partitioner)
    } {
      serializeStream.close()
    }
    logDebug(s"Written partitioner to $partitionerFilePath")
  } catch {
    case NonFatal(e) =>
      logWarning(s"Error writing partitioner $partitioner to $checkpointDirPath")
  }
}
writePartitionToCheckpointFile runs as a task on the executors: it obtains the SparkEnv, serializes the records of its partition with the configured serializer, writes them to a temporary file in the checkpoint directory, and finally renames the temporary file to the final part file (tolerating the case where another task attempt has already written it).

Finally, let's summarize what the following call does:

val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)


1. The information of the original RDD (originalRDD) is carried over into newRDD.

2. The RDD's data is written to HDFS, under the directory we set with setCheckpointDir(directory: String).

3. The hadoopConfiguration used for originalRDD is broadcast to the executors.

4. runJob recomputes this RDD, which means caching it beforehand gives a better optimization (see the sketch after this list).
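A sketch of point 4, assuming a simple log-filtering job (input path hypothetical): persisting before the action means the extra checkpoint job reads the cached blocks instead of re-reading and re-filtering the input.

import org.apache.spark.storage.StorageLevel

val errors = sc.textFile("hdfs://namenode:8020/logs/*.log")  // hypothetical input
  .filter(_.contains("ERROR"))
errors.persist(StorageLevel.MEMORY_AND_DISK)
errors.checkpoint()
errors.count()  // computes and caches errors; the checkpoint job then reads the cached blocks
                // rather than recomputing the whole lineage from the log files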

Now let's look at the last step of doCheckpoint:

if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
  rdd.context.cleaner.foreach { cleaner =>
    cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
  }
}


This snippet registers the checkpoint files for cleanup by the ContextCleaner once the corresponding references go out of scope (only when spark.cleaner.referenceTracking.cleanCheckpoints is enabled). After checkpointing, the previous computation chain of originalRDD is truncated and only newRDD remains; in other words, newRDD becomes the root parent RDD (tracing the lineage back stops here).
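To actually enable this cleanup, the flag has to be set in the application's SparkConf, since it defaults to false as shown above (a hypothetical setup):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("checkpoint-demo")  // hypothetical application name
  .set("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
val sc = new SparkContext(conf)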

How the checkpointed RDD is read back will be covered in a later part, but what is certain is that it always goes through ReliableCheckpointRDD's readCheckpointFile method, which reads the already checkpointed data from the file path.








