Spark Source Code Reading Notes, RDD (5): How checkpoint Works in RDDs
Source: Internet    Editor: 程序博客网    Time: 2024/05/21 09:21
----------------------------Contents----------------------------
Why is checkpoint needed?
What checkpoint does
Source code analysis
------------------------------------------------------------
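Why is checkpoint needed?
As we know, both checkpoint and persist "save" data. persist can keep data on disk, in memory, and in serialized form; off-heap storage through Tachyon was not yet usable when these notes were written. The source of the storage-level class is as follows: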
class StorageLevel private(
    private var _useDisk: Boolean,       // disk
    private var _useMemory: Boolean,     // memory
    private var _useOffHeap: Boolean,    // off-heap: Tachyon, also known as Alluxio
    private var _deserialized: Boolean,  // keep the data deserialized
    private var _replication: Int = 1)
  extends Externalizable { /* class body */ }
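To make the flags concrete, here is a small sketch that prints how a few predefined storage levels map onto them. It assumes spark-core is on the classpath; `StorageLevelFlagsSketch` and `describe` are names made up for this illustration, while the accessors (`useDisk`, `useMemory`, `useOffHeap`, `deserialized`, `replication`) are the public getters for the private fields shown above.

```scala
import org.apache.spark.storage.StorageLevel

object StorageLevelFlagsSketch {
  // Render the five constructor flags of a StorageLevel as one line of text.
  def describe(level: StorageLevel): String =
    s"disk=${level.useDisk} memory=${level.useMemory} offHeap=${level.useOffHeap} " +
      s"deserialized=${level.deserialized} replication=${level.replication}"

  def main(args: Array[String]): Unit = {
    // A few of the predefined levels; the "_2" suffix means two replicas.
    val levels = Seq(
      "MEMORY_ONLY" -> StorageLevel.MEMORY_ONLY,
      "DISK_ONLY" -> StorageLevel.DISK_ONLY,
      "MEMORY_AND_DISK_SER_2" -> StorageLevel.MEMORY_AND_DISK_SER_2)
    levels.foreach { case (name, level) => println(s"$name: ${describe(level)}") }
  }
}
```

Note that every level without a `_2` suffix keeps a single replica, which is the weakness point (2) below describes.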
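(1) When persist keeps the data in memory: while processing large amounts of data, a later persist may evict data that was cached earlier. So persisting to memory is very fast, but it is also the least reliable form of storage.
(2) When the data is kept on disk: the data is written under a common directory on the local disk, and the default StorageLevel keeps only one replica. If the disk holding the data fails, the data is lost.
(3) checkpoint exists to solve the problem in (2): it writes the data to HDFS and relies on HDFS's fault tolerance and reliability to get more dependable persistence.
Note: in cluster mode, HDFS usually stores 3 replicas on three nodes (a, b, c), with a on one rack and b and c on another rack
(one piece of data becomes three replicas, placed on three nodes across two racks).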
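What checkpoint does
When checkpoint marks the current RDD for checkpointing, the RDD is saved as binary files in the checkpoint directory, which is set with SparkContext.setCheckpointDir(). During checkpointing, all references from this RDD to its parent RDDs are removed. Calling checkpoint on an RDD does not execute anything right away; an action must run to trigger it. When the checkpointed data is needed later, the readCheckpointFile method of ReliableCheckpointRDD reads it back from the file path, and the data is then used directly.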
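The lazy behaviour described above can be observed with a short local-mode sketch. It assumes spark-core is on the classpath and uses a temporary local directory, which is only acceptable in local mode; `LazyCheckpointSketch` and `run` are illustrative names, not part of Spark.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object LazyCheckpointSketch {
  // Returns (isCheckpointed before the action, isCheckpointed after the action).
  def run(): (Boolean, Boolean) = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("lazy-checkpoint-sketch")
    val sc = new SparkContext(conf)
    try {
      // Local temp dir for the sketch; on a cluster this should be an HDFS path.
      sc.setCheckpointDir(Files.createTempDirectory("checkpoint-sketch").toString)

      val rdd = sc.parallelize(1 to 100, 4).map(_ * 2)
      rdd.checkpoint()                 // only marks the RDD, writes nothing yet
      val before = rdd.isCheckpointed

      rdd.count()                      // the action triggers the checkpoint job
      val after = rdd.isCheckpointed

      println(s"checkpoint file: ${rdd.getCheckpointFile}")
      println(rdd.toDebugString)       // lineage now starts at the checkpoint RDD
      (before, after)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```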
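Source code analysis
The checkpoint storage location is set with setCheckpointDir. When running on a cluster, this path must be an HDFS path.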
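/**
 * Set the directory used to store checkpointed RDDs.
 * On a cluster, this directory must be an HDFS path.
 */
def setCheckpointDir(directory: String) {
  // If we are running on a cluster, warn when the directory is local.
  // Otherwise the driver would try to rebuild the checkpointed RDD from
  // its own local file system, while the files are actually on the
  // executor machines, hence the warning.
  if (!isLocal && Utils.nonLocalPaths(directory).isEmpty) {
    logWarning("Spark is not running in local mode, therefore the checkpoint directory " +
      s"must not be on the local filesystem. Directory '$directory' " +
      "appears to be on the local filesystem.")
  }

  checkpointDir = Option(directory).map { dir =>
    val path = new Path(dir, UUID.randomUUID().toString)  // append a random UUID to the path
    val fs = path.getFileSystem(hadoopConfiguration)      // get the Hadoop file system for this path
    fs.mkdirs(path)                                       // create the directory
    fs.getFileStatus(path).getPath.toString               // return the fully qualified path as a string
  }
}

The checkpoint function in RDD: it first checks whether the checkpoint directory has been set and throws an exception if it has not. Otherwise, if checkpointData is still empty, it creates a ReliableRDDCheckpointData for this RDD.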
def checkpoint(): Unit = RDDCheckpointData.synchronized {
  // NOTE: we use a global lock here due to complexities downstream with ensuring
  // children RDD partitions point to the correct parent partitions. In the future
  // we should revisit this consideration.
  if (context.checkpointDir.isEmpty) {
    throw new SparkException("Checkpoint directory has not been set in the SparkContext")
  } else if (checkpointData.isEmpty) {
    checkpointData = Some(new ReliableRDDCheckpointData(this))
  }
}

So it uses ReliableRDDCheckpointData(this) to store the data. Next, let's look at the ReliableRDDCheckpointData class:
/**
 * An implementation of checkpointing that writes the RDD data to reliable storage.
 * This allows drivers to be restarted on failure with previously computed state.
 */
private[spark] class ReliableRDDCheckpointData[T: ClassTag](@transient private val rdd: RDD[T])
  extends RDDCheckpointData[T](rdd) with Logging {

  // The directory to which the associated RDD has been checkpointed to
  // This is assumed to be a non-local path that points to some reliable storage
  // Resolve the checkpoint directory for this RDD and keep it in cpDir (a String)
  private val cpDir: String =
    ReliableRDDCheckpointData.checkpointPath(rdd.context, rdd.id)
      .map(_.toString)
      .getOrElse { throw new SparkException("Checkpoint dir must be specified.") }

  /**
   * Return the directory to which this RDD was checkpointed.
   * If the RDD is not checkpointed yet, return None.
   */
  // Check whether the RDD has already been checkpointed; if not, return None
  def getCheckpointDir: Option[String] = RDDCheckpointData.synchronized {
    if (isCheckpointed) {
      Some(cpDir.toString)
    } else {
      None
    }
  }

  /**
   * Materialize this RDD and write its content to a reliable DFS.
   * This is called immediately after the first action invoked on this RDD has completed.
   */
  protected override def doCheckpoint(): CheckpointRDD[T] = {
    val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)

    // Optionally clean our checkpoint files if the reference is out of scope
    if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
      rdd.context.cleaner.foreach { cleaner =>
        cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
      }
    }

    logInfo(s"Done checkpointing RDD ${rdd.id} to $cpDir, new parent is RDD ${newRDD.id}")
    newRDD
  }
}

The companion object of the same name has two functions, checkpointPath and cleanCheckpoint:
private[spark] object ReliableRDDCheckpointData extends Logging {

  /** Return the path of the directory to which this RDD's checkpoint data is written. */
  def checkpointPath(sc: SparkContext, rddId: Int): Option[Path] = {
    sc.checkpointDir.map { dir => new Path(dir, s"rdd-$rddId") }
  }

  /** Clean up the files associated with the checkpoint data for this RDD. */
  def cleanCheckpoint(sc: SparkContext, rddId: Int): Unit = {
    checkpointPath(sc, rddId).foreach { path =>
      val fs = path.getFileSystem(sc.hadoopConfiguration)
      if (fs.exists(path)) {
        if (!fs.delete(path, true)) {
          logWarning(s"Error deleting ${path.toString()}")
        }
      }
    }
  }
}
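Putting setCheckpointDir and checkpointPath together, the on-disk layout can be sketched with plain string helpers. Everything in `CheckpointLayoutSketch` is hypothetical and only mirrors the path-building steps quoted above; the `hdfs://namenode:8020/checkpoints` URL is a made-up example, and the `part-%05d` file name is my understanding of what `ReliableCheckpointRDD.checkpointFileName` produces rather than something shown in this article.

```scala
import java.util.UUID

object CheckpointLayoutSketch {
  // setCheckpointDir: <directory>/<random UUID>, one directory per application.
  def applicationDir(directory: String, uuid: String): String = s"$directory/$uuid"

  // checkpointPath: <application dir>/rdd-<rddId>, one directory per checkpointed RDD.
  def rddDir(applicationDir: String, rddId: Int): String = s"$applicationDir/rdd-$rddId"

  // One file per partition (assumed naming pattern, see the note above).
  def partitionFile(rddDir: String, partitionId: Int): String =
    s"$rddDir/" + "part-%05d".format(partitionId)

  def main(args: Array[String]): Unit = {
    val app = applicationDir("hdfs://namenode:8020/checkpoints", UUID.randomUUID().toString)
    val dir = rddDir(app, 7)
    (0 until 3).foreach(p => println(partitionFile(dir, p)))
  }
}
```

The random UUID level keeps different applications that share one base directory from overwriting each other's `rdd-<id>` directories.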
Now let's focus on the doCheckpoint function: it produces a new RDD through ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir).
/**
 * Write RDD to checkpoint files and return a ReliableCheckpointRDD representing the RDD.
 */
def writeRDDToCheckpointDirectory[T: ClassTag](
    originalRDD: RDD[T],
    checkpointDir: String,
    blockSize: Int = -1): ReliableCheckpointRDD[T] = {

  val sc = originalRDD.sparkContext

  // Create the output path for the checkpoint
  // (checkpointDir is the checkpoint directory for this RDD)
  val checkpointDirPath = new Path(checkpointDir)
  val fs = checkpointDirPath.getFileSystem(sc.hadoopConfiguration)
  if (!fs.mkdirs(checkpointDirPath)) {
    throw new SparkException(s"Failed to create checkpoint path $checkpointDirPath")
  }

  // Save to file, and reload it as an RDD
  val broadcastedConf = sc.broadcast(
    new SerializableConfiguration(sc.hadoopConfiguration))
  // TODO: This is expensive because it computes the RDD again unnecessarily (SPARK-8582)
  // (this step is costly because the RDD we want has to be computed again)
  sc.runJob(originalRDD,
    writePartitionToCheckpointFile[T](checkpointDirPath.toString, broadcastedConf) _)

  if (originalRDD.partitioner.nonEmpty) {
    writePartitionerToCheckpointDir(sc, originalRDD.partitioner.get, checkpointDirPath)
  }

  val newRDD = new ReliableCheckpointRDD[T](
    sc, checkpointDirPath.toString, originalRDD.partitioner)
  if (newRDD.partitions.length != originalRDD.partitions.length) {
    throw new SparkException(
      s"Checkpoint RDD $newRDD(${newRDD.partitions.length}) has different " +
        s"number of partitions from original RDD $originalRDD(${originalRDD.partitions.length})")
  }
  newRDD
}

Analysis: runJob recomputes the RDD we need, that is, the RDD is computed a second time. If we persist the RDD before checkpointing it, the checkpoint job can read the cached data instead of recomputing it. Next, let's look at the writePartitionToCheckpointFile function that is passed to runJob:
/**
 * Write a RDD partition's data to a checkpoint file.
 */
def writePartitionToCheckpointFile[T: ClassTag](
    path: String,
    broadcastedConf: Broadcast[SerializableConfiguration],
    blockSize: Int = -1)(ctx: TaskContext, iterator: Iterator[T]) {
  val env = SparkEnv.get
  val outputDir = new Path(path)
  val fs = outputDir.getFileSystem(broadcastedConf.value.value)

  val finalOutputName = ReliableCheckpointRDD.checkpointFileName(ctx.partitionId())
  val finalOutputPath = new Path(outputDir, finalOutputName)
  val tempOutputPath =
    new Path(outputDir, s".$finalOutputName-attempt-${ctx.attemptNumber()}")

  if (fs.exists(tempOutputPath)) {
    throw new IOException(s"Checkpoint failed: temporary path $tempOutputPath already exists")
  }
  val bufferSize = env.conf.getInt("spark.buffer.size", 65536)

  val fileOutputStream = if (blockSize < 0) {
    fs.create(tempOutputPath, false, bufferSize)
  } else {
    // This is mainly for testing purpose
    fs.create(tempOutputPath, false, bufferSize,
      fs.getDefaultReplication(fs.getWorkingDirectory), blockSize)
  }
  val serializer = env.serializer.newInstance()
  val serializeStream = serializer.serializeStream(fileOutputStream)
  Utils.tryWithSafeFinally {
    serializeStream.writeAll(iterator)
  } {
    serializeStream.close()
  }

  if (!fs.rename(tempOutputPath, finalOutputPath)) {
    if (!fs.exists(finalOutputPath)) {
      logInfo(s"Deleting tempOutputPath $tempOutputPath")
      fs.delete(tempOutputPath, false)
      throw new IOException("Checkpoint failed: failed to save output of task: " +
        s"${ctx.attemptNumber()} and final output path does not exist: $finalOutputPath")
    } else {
      // Some other copy of this task must've finished before us and renamed it
      logInfo(s"Final output path $finalOutputPath already exists; not overwriting it")
      if (!fs.delete(tempOutputPath, false)) {
        logWarning(s"Error deleting ${tempOutputPath}")
      }
    }
  }
}

/**
 * Write a partitioner to the given RDD checkpoint directory. This is done on a best-effort
 * basis; any exception while writing the partitioner is caught, logged and ignored.
 */
private def writePartitionerToCheckpointDir(
    sc: SparkContext, partitioner: Partitioner, checkpointDirPath: Path): Unit = {
  try {
    val partitionerFilePath = new Path(checkpointDirPath, checkpointPartitionerFileName)
    val bufferSize = sc.conf.getInt("spark.buffer.size", 65536)
    val fs = partitionerFilePath.getFileSystem(sc.hadoopConfiguration)
    val fileOutputStream = fs.create(partitionerFilePath, false, bufferSize)
    val serializer = SparkEnv.get.serializer.newInstance()
    val serializeStream = serializer.serializeStream(fileOutputStream)
    Utils.tryWithSafeFinally {
      serializeStream.writeObject(partitioner)
    } {
      serializeStream.close()
    }
    logDebug(s"Written partitioner to $partitionerFilePath")
  } catch {
    case NonFatal(e) =>
      logWarning(s"Error writing partitioner $partitioner to $checkpointDirPath")
  }
}

writePartitionToCheckpointFile obtains the environment (SparkEnv), the file system and a serializer inside the task, and then writes the records of the partition into that partition's checkpoint file.
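The part of writePartitionToCheckpointFile that is easy to miss is the write-to-temporary-file-then-rename step, which keeps a half-written file from ever appearing under the final name. Below is a minimal sketch of the same pattern using only java.nio on a local file system instead of the Hadoop FileSystem API. `CheckpointWriteSketch` and `writePartition` are hypothetical names, records are plain strings rather than serialized objects, and the `part-%05d` name is an assumption about checkpointFileName.

```scala
import java.io.IOException
import java.nio.charset.StandardCharsets
import java.nio.file.{FileAlreadyExistsException, Files, Path}

object CheckpointWriteSketch {
  // Write one partition's records to <outputDir>/part-NNNNN via a temp file.
  def writePartition(outputDir: Path, partitionId: Int, attempt: Int,
      records: Seq[String]): Path = {
    val finalName = "part-%05d".format(partitionId)
    val finalPath = outputDir.resolve(finalName)
    val tempPath = outputDir.resolve(s".$finalName-attempt-$attempt")

    // Same guard as the Spark code: a leftover temp file means something is wrong.
    if (Files.exists(tempPath)) {
      throw new IOException(s"Checkpoint failed: temporary path $tempPath already exists")
    }

    // Write everything to the temp file first.
    Files.write(tempPath, records.mkString("\n").getBytes(StandardCharsets.UTF_8))

    try {
      // Without REPLACE_EXISTING the move fails if the final file already exists.
      Files.move(tempPath, finalPath)
    } catch {
      case _: FileAlreadyExistsException =>
        // Another attempt of the same task finished first; keep its output.
        Files.deleteIfExists(tempPath)
    }
    finalPath
  }

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("checkpoint-write-sketch")
    println(writePartition(dir, partitionId = 0, attempt = 0, records = Seq("a", "b")))
  }
}
```

As in the Spark code, a second attempt that loses the race does not overwrite the file written by the first one; it only removes its own temporary file.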
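Finally, let's summarize:
val newRDD = ReliableCheckpointRDD.writeRDDToCheckpointDirectory(rdd, cpDir)
1. The contents of originalRDD (its partition data and, if present, its partitioner) are carried over into newRDD.
2. The RDD is stored in HDFS, under the directory passed to setCheckpointDir(directory: String).
3. The SparkContext's hadoopConfiguration is broadcast so that each task can open the file system.
4. runJob computes the RDD again, so caching the RDD beforehand gives better performance.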
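Point 4 can be checked with a rough local-mode experiment that counts how often the map function runs, with and without persist before checkpoint. This is a sketch under assumptions: spark-core 2.x or later on the classpath (for `longAccumulator`), local mode with no task retries so that the accumulator is a fair counter, and made-up names `PersistBeforeCheckpointSketch` and `mapCalls`.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistBeforeCheckpointSketch {
  // Run one action on a checkpointed RDD and report how many times
  // the map function was evaluated in total.
  def mapCalls(persistFirst: Boolean): Long = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("persist-before-checkpoint")
    val sc = new SparkContext(conf)
    try {
      sc.setCheckpointDir(Files.createTempDirectory("checkpoint-sketch").toString)
      val calls = sc.longAccumulator("map calls")

      val rdd = sc.parallelize(1 to 1000, 4).map { x => calls.add(1); x * 2 }
      if (persistFirst) rdd.persist(StorageLevel.MEMORY_AND_DISK)
      rdd.checkpoint()

      // The action computes the RDD; the checkpoint job that follows either
      // recomputes it or reads the cached blocks, depending on persistFirst.
      rdd.count()
      calls.value.longValue()
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    println(s"without persist: ${mapCalls(persistFirst = false)} map calls")
    println(s"with persist:    ${mapCalls(persistFirst = true)} map calls")
  }
}
```

With the behaviour described above, the run without persist should report about twice as many map calls as the run with persist, because the checkpoint job repeats the computation.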
Now let's look at the last part of doCheckpoint.
if (rdd.conf.getBoolean("spark.cleaner.referenceTracking.cleanCheckpoints", false)) {
  rdd.context.cleaner.foreach { cleaner =>
    cleaner.registerRDDCheckpointDataForCleanup(newRDD, rdd.id)
  }
}
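This snippet does not clear the lineage itself. When spark.cleaner.referenceTracking.cleanCheckpoints is true, it only registers the checkpoint files with the cleaner so that they can be deleted once the RDD is no longer referenced. The lineage of originalRDD is cut after doCheckpoint returns: RDDCheckpointData.checkpoint() stores newRDD and calls rdd.markCheckpointed(), which clears the old dependencies. From then on the earlier computation chain is gone and only newRDD remains, so newRDD becomes the root parent RDD (tracing back the lineage stops here).
How the checkpointed RDD is read back is left for a later part, but it always goes through the readCheckpointFile method of ReliableCheckpointRDD, which reads the checkpointed data from the file path.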