Spark Performance Tuning: Using checkpoint


Overview

A checkpoint is, as the name suggests, a snapshot-like save point. In a Spark job the DAG of computations can be very long, and the cluster must execute the entire DAG to produce a result. If intermediate data is lost partway through, Spark recomputes it from scratch by replaying the RDD lineage, which is expensive. We can place intermediate results in memory or on disk with cache or persist, but that still does not guarantee the data is safe: if the memory holding it is lost or the disk fails, Spark again recomputes everything from the original lineage. Hence checkpoint: it takes the important intermediate data in the DAG and saves it, checkpoint-style, to a highly available location (usually HDFS).
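
As a minimal sketch of the two options this paragraph contrasts (assuming an existing SparkContext named sc and an RDD named rdd; the HDFS path is illustrative):

import org.apache.spark.storage.StorageLevel

// persist keeps a copy in memory and/or on local disk: fast to reuse, but if an
// executor is lost, its blocks disappear and Spark recomputes them from the lineage.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

// checkpoint writes the data to a highly available store (typically HDFS), so the
// data survives executor failures and the lineage no longer needs replaying.
sc.setCheckpointDir("hdfs://namenode:8020/checkPointDir")
rdd.checkpoint()
rdd.count() // an action is needed to actually materialize the checkpoint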


RDD Lineage

Before diving into checkpoint, let's first look at RDD dependencies, using a wordcount computation as the example:

scala> sc.textFile("hdfs://leen:8020/user/hive/warehouse/tools.db/cde_prd").flatMap(_.split("\\t")).map((_,1)).reduceByKey(_+_)
res0: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:28

scala> res0.toDebugString
res1: String =
(2) ShuffledRDD[4] at reduceByKey at <console>:28 []
 +-(2) MapPartitionsRDD[3] at map at <console>:28 []
    |  MapPartitionsRDD[2] at flatMap at <console>:28 []
    |  hdfs://leen:8020/user/hive/warehouse/tools.db/cde_prd MapPartitionsRDD[1] at textFile at <console>:28 []
    |  hdfs://leen:8020/user/hive/warehouse/tools.db/cde_prd HadoopRDD[0] at textFile at <console>:28 []

1. textFile first creates a HadoopRDD, which reads the HDFS data as key/value pairs where the key is the byte offset of a line and the value is the line itself. Since the offset is usually of no use, the HadoopRDD is then converted to a MapPartitionsRDD that keeps only the line data.
2. flatMap produces a MapPartitionsRDD.
3. map produces another MapPartitionsRDD.
4. reduceByKey produces a ShuffledRDD (the sketch after this list shows how to inspect these dependencies programmatically).
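
To see the dependency types directly rather than through toDebugString, a small sketch (assuming the same spark-shell session, with sc available) can print each RDD's dependencies; narrow operations such as map and flatMap report OneToOneDependency, while reduceByKey reports ShuffleDependency:

val words  = sc.textFile("hdfs://leen:8020/user/hive/warehouse/tools.db/cde_prd")
  .flatMap(_.split("\\t"))            // narrow: OneToOneDependency
val pairs  = words.map((_, 1))        // narrow: OneToOneDependency
val counts = pairs.reduceByKey(_ + _) // wide: ShuffleDependency

println(words.dependencies)  // e.g. List(org.apache.spark.OneToOneDependency@...)
println(counts.dependencies) // e.g. List(org.apache.spark.ShuffleDependency@...)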


How to Create a Checkpoint

1. First, set the HDFS checkpoint directory through the SparkContext; calling checkpoint without it throws an exception:

scala> res0.checkpoint
org.apache.spark.SparkException: Checkpoint directory has not been set in the SparkContext

scala> sc.setCheckpointDir("hdfs://leen:8020/checkPointDir")

After running the code above, a directory is created in HDFS:
/checkPointDir/9ae90c62-a7ff-442a-bbf0-e5c8cdd7982d


2. Then call checkpoint:

scala> res0.checkpoint

HDFS still contains no data at this point, which shows that checkpoint is lazy, like a transformation: nothing is written until an action triggers a job.

scala> res0.count()
INFO ReliableRDDCheckpointData: Done checkpointing RDD 4 to hdfs://leen:8020/checkPointDir/9ae90c62-a7ff-442a-bbf0-e5c8cdd7982d/rdd-4, new parent is RDD 5
res5: Long = 73689

hive> dfs -du -h /checkPointDir/9ae90c62-a7ff-442a-bbf0-e5c8cdd7982d/rdd-4;
147    147    /checkPointDir/9ae90c62-a7ff-442a-bbf0-e5c8cdd7982d/rdd-4/_partitioner
1.2 M  1.2 M  /checkPointDir/9ae90c62-a7ff-442a-bbf0-e5c8cdd7982d/rdd-4/part-00000
1.2 M  1.2 M  /checkPointDir/9ae90c62-a7ff-442a-bbf0-e5c8cdd7982d/rdd-4/part-00001
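
The laziness can also be checked directly with the public isCheckpointed and getCheckpointFile methods; a sketch, assuming an RDD named rdd and a checkpoint directory already set:

rdd.checkpoint()
println(rdd.isCheckpointed)    // false: no action has run, nothing written yet
rdd.count()                    // the action triggers the checkpoint job
println(rdd.isCheckpointed)    // true
println(rdd.getCheckpointFile) // Some(hdfs://.../rdd-N)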

Note, however, that this effectively runs the computation twice: the action computes the RDD once, and the checkpoint job then recomputes it again in order to write it out. The usual practice is therefore to cache before checkpointing, so the flow runs only once: the checkpoint job reads the freshly cached data from memory and writes it to HDFS, as follows:

rdd.cache()
rdd.checkpoint()
rdd.collect()
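
As a related alternative (not used in this article), Spark also offers localCheckpoint(), which truncates the lineage like checkpoint() but stores the data on executor storage instead of HDFS, trading reliability for speed; a brief sketch:

rdd.localCheckpoint() // no checkpoint directory needed; data stays on executors
rdd.count()           // the action materializes it; lost if an executor fails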

The source itself strongly recommends persisting the RDD before checkpointing, and once the checkpoint succeeds, all references to the parent RDDs are removed:

  /**
   * Mark this RDD for checkpointing. It will be saved to a file inside the checkpoint
   * directory set with `SparkContext#setCheckpointDir` and all references to its parent
   * RDDs will be removed. This function must be called before any job has been
   * executed on this RDD. It is strongly recommended that this RDD is persisted in
   * memory, otherwise saving it on a file will require recomputation.
   */
  def checkpoint(): Unit = RDDCheckpointData.synchronized {
    // NOTE: we use a global lock here due to complexities downstream with ensuring
    // children RDD partitions point to the correct parent partitions. In the future
    // we should revisit this consideration.
    if (context.checkpointDir.isEmpty) {
      throw new SparkException("Checkpoint directory has not been set in the SparkContext")
    } else if (checkpointData.isEmpty) {
      checkpointData = Some(new ReliableRDDCheckpointData(this))
    }
  }

The RDD lineage has been truncated:

scala> res0.toDebugString
res6: String =
(2) ShuffledRDD[4] at reduceByKey at <console>:28 []
 |  ReliableCheckpointRDD[5] at count at <console>:30 []
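
Putting the whole flow together, here is a self-contained sketch of a standalone application (the app name is illustrative; the paths match the shell session above):

import org.apache.spark.{SparkConf, SparkContext}

object CheckpointDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CheckpointDemo"))

    // 1. Set the checkpoint directory first, or checkpoint() will throw.
    sc.setCheckpointDir("hdfs://leen:8020/checkPointDir")

    val counts = sc.textFile("hdfs://leen:8020/user/hive/warehouse/tools.db/cde_prd")
      .flatMap(_.split("\\t"))
      .map((_, 1))
      .reduceByKey(_ + _)

    // 2. Cache before checkpointing so the data is computed only once.
    counts.cache()
    counts.checkpoint()

    // 3. The first action runs the job and writes the checkpoint from the cache.
    println(counts.count())
    println(counts.toDebugString) // lineage now ends at a ReliableCheckpointRDD

    sc.stop()
  }
}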
