Why is cache said to be a special case of persist?

Source: Internet  Published: 《算法的乐趣》  Editor: 程序博客网  Date: 2024/04/29 17:11

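People working with Spark often come across the statement that cache is a special case of persist.

Let's read the source code to see what that statement actually means: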

\spark-1.5.0\core\src\main\scala\org\apache\spark\rdd\RDD.scala


  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

  /** Persist this RDD with the default storage level (`MEMORY_ONLY`). */
  def cache(): this.type = persist()

  /**
   * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
   *
   * @param blocking Whether to block until all blocks are deleted.
   * @return This RDD.
   */
  def unpersist(blocking: Boolean = true): this.type = {
    logInfo("Removing RDD " + id + " from persistence list")
    sc.unpersistRDD(id, blocking)
    storageLevel = StorageLevel.NONE
    this
  }


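The source above makes this clear: cache() simply calls persist(), which in turn calls persist(StorageLevel.MEMORY_ONLY). In other words, calling cache is the same as calling persist with the StorageLevel fixed to MEMORY_ONLY, which is why cache is said to be a special case of persist.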
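The delegation chain can be reproduced without Spark in a few lines. The sketch below is a toy model, not Spark code: `MiniRDD` and the `Level` objects are hypothetical names invented for illustration, and the class records nothing except the chosen level.

```scala
// Simplified stand-ins for Spark's StorageLevel constants (hypothetical, illustration only).
sealed trait Level
case object NONE extends Level
case object MEMORY_ONLY extends Level
case object MEMORY_AND_DISK extends Level

// Toy class that mimics only the persist/cache/unpersist signatures of RDD.
class MiniRDD {
  private var level: Level = NONE

  // General form: the caller chooses the level.
  def persist(newLevel: Level): this.type = { level = newLevel; this }

  // No-argument form: fixes the level to MEMORY_ONLY.
  def persist(): this.type = persist(MEMORY_ONLY)

  // cache() adds nothing of its own; it only forwards to persist().
  def cache(): this.type = persist()

  def unpersist(): this.type = { level = NONE; this }

  def getStorageLevel: Level = level
}

object CacheIsPersistDemo {
  def main(args: Array[String]): Unit = {
    val viaCache = new MiniRDD().cache()
    val viaPersist = new MiniRDD().persist(MEMORY_ONLY)
    // Both routes end with the same level.
    println(viaCache.getStorageLevel == viaPersist.getStorageLevel)
  }
}
```

Every path through cache() ends in persist(MEMORY_ONLY), so cache can never produce a level that persist cannot, while persist can produce levels that cache cannot.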


While we are here, let's also go over the storage (cache) levels.

First, the location of the source file:

\spark-1.5.0\core\src\main\scala\org\apache\spark\storage\StorageLevel.scala


/**
 * Various [[org.apache.spark.storage.StorageLevel]] defined and utility functions for creating
 * new storage levels.
 */
object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(false, false, true, false) // specifically for Tachyon

……………………………………………..

 

// This class takes five parameters; compare them with the arguments passed in the companion object above
class StorageLevel private(
    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean, // whether to store the RDD in the Tachyon distributed memory system
    private var _deserialized: Boolean, // false means stored serialized; Kryo serialization works well here, saving memory at the cost of CPU
    private var _replication: Int = 1)
  extends Externalizable {

  // TODO: Also add fields for caching priority, dataset ID, and flushing.
  private def this(flags: Int, replication: Int) {
    this((flags & 8) != 0, (flags & 4) != 0, (flags & 2) != 0, (flags & 1) != 0, replication)
  }

  def this() = this(false, true, false, false)  // For deserialization

  def useDisk: Boolean = _useDisk
  def useMemory: Boolean = _useMemory
  def useOffHeap: Boolean = _useOffHeap
  def deserialized: Boolean = _deserialized
  def replication: Int = _replication

...............................................................................................
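The private constructor `this(flags: Int, replication: Int)` in the listing unpacks the four booleans from a single Int using the masks 8, 4, 2 and 1. The following standalone sketch shows that packing and unpacking; `StorageFlags`, `Flags`, `encode` and `decode` are hypothetical names made up for this illustration and are not part of Spark.

```scala
// Hypothetical helper for illustration; only the bit masks (8, 4, 2, 1) come from the listing.
object StorageFlags {
  final case class Flags(
      useDisk: Boolean,
      useMemory: Boolean,
      useOffHeap: Boolean,
      deserialized: Boolean)

  // Pack the four booleans into one Int, one bit per flag.
  def encode(f: Flags): Int = {
    var bits = 0
    if (f.useDisk) bits |= 8
    if (f.useMemory) bits |= 4
    if (f.useOffHeap) bits |= 2
    if (f.deserialized) bits |= 1
    bits
  }

  // Unpack with the same masks the private constructor uses.
  def decode(bits: Int): Flags =
    Flags((bits & 8) != 0, (bits & 4) != 0, (bits & 2) != 0, (bits & 1) != 0)

  def main(args: Array[String]): Unit = {
    // Same argument order as MEMORY_ONLY = new StorageLevel(false, true, false, true)
    val memoryOnly = Flags(useDisk = false, useMemory = true, useOffHeap = false, deserialized = true)
    println(encode(memoryOnly)) // 4 + 1 = 5
    println(decode(12))         // 8 + 4: disk and memory, stored serialized
  }
}
```

Read this way, each named level is just one combination of the four bits plus a replication count: MEMORY_AND_DISK_SER is disk and memory with the deserialized bit off, and the `_2` variants differ only in replication.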


So when MEMORY_ONLY does not fit, we can call persist ourselves and pass an explicit StorageLevel to get the cache level we need.
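As a rough usage sketch, assuming Spark is on the classpath and a local master is acceptable, choosing a level explicitly might look like this. The object name `PersistLevelExample`, the helper `persistSquares`, the app name and the sample data are all made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistLevelExample {
  // Hypothetical helper: builds a small RDD and marks it with an explicit level.
  def persistSquares(sc: SparkContext): RDD[Long] =
    sc.parallelize(1 to 1000)
      .map(n => n.toLong * n)
      .persist(StorageLevel.MEMORY_AND_DISK_SER) // instead of cache(), which is always MEMORY_ONLY

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for this sketch.
    val conf = new SparkConf().setAppName("persist-level-example").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val squares = persistSquares(sc)
      println(squares.getStorageLevel == StorageLevel.MEMORY_AND_DISK_SER)
      println(squares.count()) // the first action materializes the persisted blocks
      squares.unpersist()      // resets the level to StorageLevel.NONE
    } finally {
      sc.stop()
    }
  }
}
```

Note that once an RDD has been assigned a level, calling persist again with a different level throws an UnsupportedOperationException, so call unpersist() first if the level has to change.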




