Storage Levels and Storage Calls


The following is an explanation of the StorageLevel class:

/**
 * :: DeveloperApi ::
 * Flags for controlling the storage of an RDD. Each StorageLevel records whether to use memory,
 * or ExternalBlockStore, whether to drop the RDD to disk if it falls out of memory or
 * ExternalBlockStore, whether to keep the data in memory in a serialized format, and whether
 * to replicate the RDD partitions on multiple nodes.
 *
 * The [[org.apache.spark.storage.StorageLevel$]] singleton object contains some static constants
 * for commonly useful storage levels. To create your own storage level object, use the
 * factory method of the singleton object (`StorageLevel(...)`).
 */
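As the Scaladoc says, custom levels are built through the companion object's factory method rather than the (private) constructor. A minimal sketch, assuming a spark-shell session where sc is already defined; the data and variable names are only for illustration:

import org.apache.spark.storage.StorageLevel

// Factory arguments, in order: useDisk, useMemory, useOffHeap, deserialized, replication.
// This level keeps partitions serialized in memory, spills to disk, and keeps two replicas.
val diskMemSer2 = StorageLevel(true, true, false, false, 2)

val nums = sc.parallelize(1 to 1000000)
nums.persist(diskMemSer2)
nums.count()   // the first action materializes and caches the blocks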

The constructor members of StorageLevel are as follows:

    private var _useDisk: Boolean,
    private var _useMemory: Boolean,
    private var _useOffHeap: Boolean,    // use off-heap (external) storage
    private var _deserialized: Boolean,
    private var _replication: Int = 1)   // the default replication factor is 1

The storage-level constants defined in the StorageLevel companion object are explained as follows:

object StorageLevel {
  // Do not persist any data
  val NONE = new StorageLevel(false, false, false, false)
  // Store the RDD's partitions only on the node's disk
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  // Same as DISK_ONLY, but keep one extra replica on another node
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  // Store the RDD's partitions as plain (deserialized) Java objects in the JVM. If the RDD is
  // too large and some partitions do not fit in memory, those partitions are not cached and are
  // recomputed when needed. This is the default cache level.
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  // Same as MEMORY_ONLY, but keep one extra replica on another node
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  // Store the RDD's partitions as deserialized objects in the JVM. If the RDD is too large and
  // some partitions do not fit in memory, the excess partitions are stored on disk and read
  // from there when needed.
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  // Same as MEMORY_AND_DISK, but keep one extra replica on another node
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  // Store the RDD's partitions serialized in off-heap storage (Tachyon in this Spark version)
  val OFF_HEAP = new StorageLevel(false, false, true, false)
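Each constant is just a particular combination of the five flags shown earlier, and the combination can be inspected through the public accessors. A small illustrative check (spark-shell style; the commented values follow from the definitions above):

import org.apache.spark.storage.StorageLevel

val level = StorageLevel.MEMORY_AND_DISK_SER
println(level.useDisk)       // true  - spill to disk when memory is insufficient
println(level.useMemory)     // true  - keep blocks in the JVM
println(level.useOffHeap)    // false - no external/off-heap store
println(level.deserialized)  // false - the "_SER" levels keep the data serialized
println(level.replication)   // 1     - single copy; the "_2" levels would return 2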

Next, let us look at how RDD storage is invoked.
The relationship between an RDD and its blocks is as follows:
An RDD has multiple partitions, and each partition corresponds to one block, so a single RDD maps to multiple blocks. Each block has a unique BlockId, whose name follows the pattern "rdd_" + rddId + "_" + splitIndex, where splitIndex is the index of the partition that the block belongs to.
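The naming rule can be seen directly from the RDDBlockId case class. A tiny illustration (the ids are made up):

import org.apache.spark.storage.RDDBlockId

// For RDD 42, partition 3, the block is named "rdd_42_3"
val blockId = RDDBlockId(rddId = 42, splitIndex = 3)
println(blockId.name)   // rdd_42_3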

The RDD persistence path is analyzed as follows:

  /**
   * Mark this RDD for persisting using the specified level.
   *
   * this.type is the type of the current object (this); it is used in declarations of
   * variables, function parameters and function return types.
   *
   * @param newLevel the target storage level
   * @param allowOverride whether to override any existing level with the new one
   */
  private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
    // TODO: Handle changes of StorageLevel
    if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
      throw new UnsupportedOperationException(
        "Cannot change storage level of an RDD after it was already assigned a level")
    }
    // If this is the first time this RDD is marked for persisting, register it
    // with the SparkContext for cleanups and accounting. Do this only once.
    if (storageLevel == StorageLevel.NONE) {
      sc.cleaner.foreach(_.registerRDDForCleanup(this))
      sc.persistRDD(this)
    }
    storageLevel = newLevel
    this
  }

  /**
   * Set this RDD's storage level to persist its values across operations after the first time
   * it is computed. This can only be used to assign a new storage level if the RDD does not
   * have a storage level set yet. Local checkpointing is an exception.
   */
  def persist(newLevel: StorageLevel): this.type = {
    if (isLocallyCheckpointed) {
      // This means the user previously called localCheckpoint(), which should have already
      // marked this RDD for persisting. Here we should override the old storage level with
      // one that is explicitly requested by the user (after adapting it to use disk).
      persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
    } else {
      persist(newLevel, allowOverride = false)
    }
  }

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

  /**
   * Persist this RDD with the default storage level (`MEMORY_ONLY`).
   */
  def cache(): this.type = persist()

  /**
   * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
   *
   * @param blocking Whether to block until all blocks are deleted.
   * @return This RDD.
   */
  def unpersist(blocking: Boolean = true): this.type = {
    logInfo("Removing RDD " + id + " from persistence list")
    sc.unpersistRDD(id, blocking)
    storageLevel = StorageLevel.NONE  // reset the storage level to NONE
    this
  }

  /**
   * Get the RDD's current storage level, or StorageLevel.NONE if none is set.
   */
  def getStorageLevel: StorageLevel = storageLevel
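To see how these methods behave from the caller's side, here is a small usage sketch (spark-shell style, illustrative data). Note that persist only marks the RDD: the data is cached on the first action, and per the check above the level cannot be changed without unpersisting first.

import org.apache.spark.storage.StorageLevel

val words = sc.parallelize(Seq("spark", "storage", "level"))
words.cache()                              // same as persist(StorageLevel.MEMORY_ONLY)
println(words.getStorageLevel)             // prints the current level (MEMORY_ONLY)
// words.persist(StorageLevel.DISK_ONLY)   // would throw UnsupportedOperationException here
words.unpersist()                          // drop any cached blocks and reset the level to NONE
words.persist(StorageLevel.DISK_ONLY)      // after unpersist, a new level can be assigned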

The RDD's iterator method is analyzed as follows:

  /**
   * This is the starting point of a task's execution: the computation begins here.
   *
   * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
   * This should ''not'' be called by users directly, but is available for implementors of custom
   * subclasses of RDD.
   */
  final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
    if (storageLevel != StorageLevel.NONE) {
      // If the storage level is not NONE, check the cache first; partitions that are not cached
      // are computed. SparkEnv holds the runtime environment of this node, and the BlockManager
      // manages the RDD cache: if this RDD was computed before and its result was cached,
      // subsequent runs read the cached blocks back directly through the BlockManager.
      getOrCompute(split, context)
    } else {
      // No cache: compute the partition, reading from a checkpoint if one exists
      computeOrReadCheckpoint(split, context)
    }
  }
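The effect of this branch is easy to observe from user code: the first action computes every partition, while later actions on a persisted RDD are served from the cached blocks. A small sketch (spark-shell style; the sleep is just a stand-in for expensive work):

val data = sc.parallelize(1 to 10, numSlices = 4).map { x =>
  Thread.sleep(10)   // pretend the per-record computation is expensive
  x * x
}
data.cache()
data.count()   // slow path: no cached blocks yet, so iterator() computes every partition
data.count()   // fast path: storageLevel != NONE and the blocks exist, so getOrCompute reads the cache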

The getOrCompute method is as follows:

  /**
   * Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
   */
  private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
    // Build the block id from the RDD id and the partition index
    val blockId = RDDBlockId(id, partition.index)
    var readCachedBlock = true
    // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
    // getOrElseUpdate first tries to read the block by its id and otherwise computes and stores
    // it; it is the entry point for reading and writing block data.
    SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
      // The block does not exist yet: compute the partition, or read it from a checkpoint
      readCachedBlock = false
      computeOrReadCheckpoint(partition, context)
    }) match {
      case Left(blockResult) =>
        if (readCachedBlock) {
          val existingMetrics = context.taskMetrics().inputMetrics
          existingMetrics.incBytesRead(blockResult.bytes)
          new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
            override def next(): T = {
              existingMetrics.incRecordsRead(1)
              delegate.next()
            }
          }
        } else {
          new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
        }
      case Right(iter) =>
        new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
    }
  }
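The anonymous InterruptibleIterator above simply wraps the cached iterator so that every call to next() also bumps the task's input metrics. The same decoration pattern, stripped of Spark types, looks like this (a plain-Scala sketch, not Spark code):

// Wrap an iterator so that every element read is counted, without changing the values produced.
class CountingIterator[T](delegate: Iterator[T]) extends Iterator[T] {
  var recordsRead: Long = 0L
  override def hasNext: Boolean = delegate.hasNext
  override def next(): T = {
    recordsRead += 1         // mirrors existingMetrics.incRecordsRead(1)
    delegate.next()
  }
}

val it = new CountingIterator(Iterator("a", "b", "c"))
it.foreach(_ => ())          // drain the iterator
println(it.recordsRead)      // 3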

The BlockManager.getOrElseUpdate method is analyzed as follows:

  /**
   * This method, called on executors, is the entry point through which Spark reads and writes
   * block data.
   *
   * Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
   * to compute the block, persist it, and return its values.
   *
   * @return either a BlockResult if the block was successfully cached, or an iterator if the block
   *         could not be cached.
   */
  def getOrElseUpdate[T](
      blockId: BlockId,
      level: StorageLevel,
      classTag: ClassTag[T],
      makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
    // Attempt to read the block from local or remote storage. If it's present, then we don't need
    // to go through the local-get-or-put path. This is the read path.
    get[T](blockId)(classTag) match {
      case Some(block) =>
        return Left(block)
      case _ =>
        // Need to compute the block.
    }
    // Initially we hold no locks on this block. This is the write path.
    doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
      case None =>
        // doPut() didn't hand work back to us, so the block already existed or was successfully
        // stored. Therefore, we now hold a read lock on the block.
        val blockResult = getLocalValues(blockId).getOrElse {
          // Since we held a read lock between the doPut() and get() calls, the block should not
          // have been evicted, so get() not returning the block indicates some internal error.
          releaseLock(blockId)
          throw new SparkException(s"get() failed for block $blockId even though we held a lock")
        }
        // We already hold a read lock on the block from the doPut() call and getLocalValues()
        // acquires the lock again, so we need to call releaseLock() here so that the net number
        // of lock acquisitions is 1 (since the caller will only call release() once).
        releaseLock(blockId)
        Left(blockResult)
      case Some(iter) =>
        // The put failed, likely because the data was too large to fit in memory and could not be
        // dropped to disk. Therefore, we need to pass the input iterator back to the caller so
        // that they can decide what to do with the values (e.g. process them without caching).
        Right(iter)
    }
  }
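From the caller's point of view, the return value separates "cached successfully" (Left, a BlockResult whose data can be re-read) from "could not be cached" (Right, the computed one-shot iterator handed back so the values are not lost); getOrCompute above is the real caller. A self-contained toy sketch of that Left/Right contract, with purely illustrative names and no Spark APIs:

// Illustrative names only; this is not Spark code, just the Left/Right contract in miniature.
def fakeGetOrElseUpdate(
    fitsInMemory: Boolean,
    make: () => Iterator[Int]): Either[Vector[Int], Iterator[Int]] =
  if (fitsInMemory) Left(make().toVector)   // "cached": the values were stored and can be re-read
  else Right(make())                        // "put failed": hand the one-shot iterator back

Seq(true, false).foreach { fits =>
  fakeGetOrElseUpdate(fits, () => Iterator(1, 2, 3)) match {
    case Left(values) => println(s"cached block, sum = ${values.sum}")
    case Right(iter)  => println(s"could not cache, sum = ${iter.sum}")
  }
}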