Storage Levels and Storage Invocation
Below is an annotated look at the StorageLevel class:
/**
 * :: DeveloperApi ::
 * Flags for controlling the storage of an RDD. Each StorageLevel records whether to use memory,
 * or ExternalBlockStore, whether to drop the RDD to disk if it falls out of memory or
 * ExternalBlockStore, whether to keep the data in memory in a serialized format, and whether
 * to replicate the RDD partitions on multiple nodes.
 *
 * The [[org.apache.spark.storage.StorageLevel$]] singleton object contains some static constants
 * for commonly useful storage levels. To create your own storage level object, use the
 * factory method of the singleton object (`StorageLevel(...)`).
 */
The members of the StorageLevel class are:
class StorageLevel private(
    private var _useDisk: Boolean,      // store on disk
    private var _useMemory: Boolean,    // store in memory
    private var _useOffHeap: Boolean,   // use external (off-heap) storage
    private var _deserialized: Boolean, // keep values as deserialized objects
    private var _replication: Int = 1)  // the default number of replicas is 1
The constants defined on the StorageLevel companion object are explained below:
object StorageLevel {
  // Do not persist any data
  val NONE = new StorageLevel(false, false, false, false)
  // Store the RDD's partitions only on this node's disk
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  // Same as DISK_ONLY, plus one identical replica on another node
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  // Store the RDD's partitions as raw Java objects in the JVM. If the RDD is too large for some
  // of its partitions to fit in memory, those partitions are not cached and are recomputed when
  // needed. This is the default caching level.
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  // Same as MEMORY_ONLY, plus one identical replica on another node
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  // Like MEMORY_ONLY, but stored as serialized bytes (deserialized = false)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  // Store the RDD's partitions as deserialized objects in the JVM. If the RDD is too large for
  // some partitions to fit in memory, the excess partitions are stored on disk and read back
  // when needed.
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  // Same as MEMORY_AND_DISK, plus one identical replica on another node
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  // Like MEMORY_AND_DISK, but stored as serialized bytes
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  // Store the RDD's partitions serialized in external storage (e.g. Tachyon)
  val OFF_HEAP = new StorageLevel(false, false, true, false)
  // ... (remaining members omitted)
}
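To make this concrete, here is a minimal sketch of selecting a built-in level and of building a custom one through the singleton's `StorageLevel(...)` factory method mentioned in the scaladoc above. The app name, local master setting, and demo object are our assumptions, not from the source:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("storage-level-demo").setMaster("local[2]"))
    val rdd = sc.parallelize(1 to 1000)

    // Pick a built-in level: serialized in memory, spilling to disk when memory is short
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
    rdd.count()  // first action materializes and caches the partitions
    rdd.unpersist(blocking = true)

    // Or build a custom level with the factory method:
    // useDisk = true, useMemory = true, useOffHeap = false, deserialized = false, replication = 2
    val customLevel = StorageLevel(true, true, false, false, 2)
    rdd.persist(customLevel)
    sc.stop()
  }
}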
Next we follow the call path for RDD storage.
The relationship between RDDs and Blocks is as follows:
An RDD has multiple Partitions, and each Partition corresponds to one Block, so one RDD maps to multiple Blocks. Each Block in turn has a unique identifier, its BlockId; the naming rule for an RDD data block is "rdd_" + rddId + "_" + splitIndex, where splitIndex is the index of the Partition that the block corresponds to.
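The naming rule can be checked directly with the RDDBlockId case class from org.apache.spark.storage (a @DeveloperApi); the concrete ids below are made-up examples:

import org.apache.spark.storage.RDDBlockId

// Block for partition 3 of the RDD with id 42
val blockId = RDDBlockId(rddId = 42, splitIndex = 3)
println(blockId.name)  // prints "rdd_42_3", i.e. "rdd_" + rddId + "_" + splitIndex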
The RDD persistence path breaks down as follows:
/**
 * Mark this RDD for persisting using the specified level.
 *
 * this.type is the type of the current object (this refers to the current object). It is used
 * in type declarations for variables, function parameters and return values, so persist() can
 * return the concrete RDD subtype.
 *
 * @param newLevel the target storage level
 * @param allowOverride whether to override any existing level with the new one
 */
private def persist(newLevel: StorageLevel, allowOverride: Boolean): this.type = {
  // TODO: Handle changes of StorageLevel
  if (storageLevel != StorageLevel.NONE && newLevel != storageLevel && !allowOverride) {
    throw new UnsupportedOperationException(
      "Cannot change storage level of an RDD after it was already assigned a level")
  }
  // If this is the first time this RDD is marked for persisting, register it
  // with the SparkContext for cleanups and accounting. Do this only once.
  if (storageLevel == StorageLevel.NONE) {
    sc.cleaner.foreach(_.registerRDDForCleanup(this))
    sc.persistRDD(this)
  }
  storageLevel = newLevel
  this
}

/**
 * Set this RDD's storage level to persist its values across operations after the first time
 * it is computed. This can only be used to assign a new storage level if the RDD does not
 * have a storage level set yet. Local checkpointing is an exception.
 */
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}

/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def cache(): this.type = persist()

/**
 * Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
 *
 * @param blocking Whether to block until all blocks are deleted.
 * @return This RDD.
 */
def unpersist(blocking: Boolean = true): this.type = {
  logInfo("Removing RDD " + id + " from persistence list")
  sc.unpersistRDD(id, blocking)
  storageLevel = StorageLevel.NONE  // reset the storage level to NONE
  this
}

/**
 * Get the RDD's current storage level, or StorageLevel.NONE if none is set.
 */
def getStorageLevel: StorageLevel = storageLevel
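One consequence of the allowOverride = false path is that calling persist a second time with a different level fails, while unpersist() resets the level to NONE and allows a new assignment. A minimal sketch of that contract (the local setup is our assumption):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("persist-demo").setMaster("local[2]"))
val rdd = sc.parallelize(1 to 100)

rdd.cache()  // same as persist(StorageLevel.MEMORY_ONLY)
try {
  rdd.persist(StorageLevel.DISK_ONLY)  // different level, allowOverride = false
} catch {
  case e: UnsupportedOperationException =>
    println(e.getMessage)  // "Cannot change storage level of an RDD after it was already assigned a level"
}

// unpersist() sets the level back to NONE, after which a new level may be assigned
rdd.unpersist()
rdd.persist(StorageLevel.DISK_ONLY)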
Next we look at the RDD's iterator method:
/**
 * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
 * This should ''not'' be called by users directly, but is available for implementors of custom
 * subclasses of RDD.
 *
 * This is the starting point of a Task's execution: computation begins here.
 */
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    // A storage level is set, so check the cache first and compute only on a miss.
    // SparkEnv holds the runtime environment information a node needs; the cache manager
    // calls the BlockManager to manage the RDD cache, so if this RDD was computed before
    // and its result was cached, subsequent runs read the cache back directly through
    // the BlockManager.
    getOrCompute(split, context)
  } else {
    // No cache: if a checkpoint exists, fetch the intermediate result directly,
    // otherwise compute the partition.
    computeOrReadCheckpoint(split, context)
  }
}
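Seen from the user side, this is exactly why the second action on a cached RDD is cheap. A minimal sketch (the app name and local master are our assumptions):

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("iterator-demo").setMaster("local[2]"))
val squares = sc.parallelize(1 to 1000).map(x => x * x).cache()

squares.count()  // first action: iterator() misses the cache, so every partition is computed and stored
squares.count()  // second action: iterator() finds the blocks via getOrCompute; nothing is recomputed
sc.stop()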
The getOrCompute method is as follows:
/**
 * Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
 */
private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  // Derive the block's id from the RDD's id and the partition's index
  val blockId = RDDBlockId(id, partition.index)
  var readCachedBlock = true
  // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
  // getOrElseUpdate first tries to read the block by its BlockId and otherwise computes and
  // stores it; it is the entry point for both reading and writing block data.
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag,
    () => {
      // The block does not exist yet: compute the partition, reading a checkpoint result
      // for the iteration if one is available
      readCachedBlock = false
      computeOrReadCheckpoint(partition, context)
    }) match {
    case Left(blockResult) =>
      if (readCachedBlock) {
        // The block came from the cache: account for the bytes and records read
        val existingMetrics = context.taskMetrics().inputMetrics
        existingMetrics.incBytesRead(blockResult.bytes)
        new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
          override def next(): T = {
            existingMetrics.incRecordsRead(1)
            delegate.next()
          }
        }
      } else {
        new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
      }
    case Right(iter) =>
      // The block could not be cached; iterate over the computed values directly
      new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
  }
}
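The anonymous InterruptibleIterator above is an instance of a general delegation pattern: wrap an iterator and count each record as it flows through. A standalone sketch of just that pattern (CountingIterator is our name, not Spark's):

// A minimal, standalone version of the wrap-and-count pattern used above
class CountingIterator[T](delegate: Iterator[T]) extends Iterator[T] {
  var recordsRead: Long = 0L
  override def hasNext: Boolean = delegate.hasNext
  override def next(): T = {
    recordsRead += 1  // mirrors existingMetrics.incRecordsRead(1)
    delegate.next()
  }
}

val it = new CountingIterator(Iterator(1, 2, 3))
it.foreach(_ => ())
println(it.recordsRead)  // 3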
Finally, the getOrElseUpdate method on the BlockManager:
/**
 * Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
 * to compute the block, persist it, and return its values.
 *
 * This method is Spark's entry point for reading and writing block data, and it is called
 * by the executors.
 *
 * @return either a BlockResult if the block was successfully cached, or an iterator if the block
 *         could not be cached.
 */
def getOrElseUpdate[T](
    blockId: BlockId,
    level: StorageLevel,
    classTag: ClassTag[T],
    makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
  // Attempt to read the block from local or remote storage. If it's present, then we don't need
  // to go through the local-get-or-put path. This is the entry point for reads.
  get[T](blockId)(classTag) match {
    case Some(block) =>
      return Left(block)
    case _ =>
      // Need to compute the block.
  }
  // Initially we hold no locks on this block. This is the entry point for writes.
  doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
    case None =>
      // doPut() didn't hand work back to us, so the block already existed or was successfully
      // stored. Therefore, we now hold a read lock on the block.
      val blockResult = getLocalValues(blockId).getOrElse {
        // Since we held a read lock between the doPut() and get() calls, the block should not
        // have been evicted, so get() not returning the block indicates some internal error.
        releaseLock(blockId)
        throw new SparkException(s"get() failed for block $blockId even though we held a lock")
      }
      // We already hold a read lock on the block from the doPut() call and getLocalValues()
      // acquires the lock again, so we need to call releaseLock() here so that the net number
      // of lock acquisitions is 1 (since the caller will only call release() once).
      releaseLock(blockId)
      Left(blockResult)
    case Some(iter) =>
      // The put failed, likely because the data was too large to fit in memory and could not be
      // dropped to disk. Therefore, we need to pass the input iterator back to the caller so
      // that they can decide what to do with the values (e.g. process them without caching).
      Right(iter)
  }
}
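Stripped of locking and storage tiers, the contract reduces to a get-or-else-update over a block store: Left means "the block is cached, read it back", Right means "caching failed, stream the values straight back to the caller". A toy model of just that contract (all names are ours; the capacity check stands in for "fits in memory or on disk", and the toy materializes the iterator, which the real code avoids):

import scala.collection.mutable

class ToyBlockStore[T](capacity: Int) {
  private val blocks = mutable.Map.empty[String, Seq[T]]

  def getOrElseUpdate(blockId: String,
                      makeIterator: () => Iterator[T]): Either[Seq[T], Iterator[T]] = {
    blocks.get(blockId) match {
      case Some(values) => Left(values)  // cache hit, like Left(blockResult)
      case None =>
        val values = makeIterator().toSeq
        if (values.size <= capacity) {   // stand-in for a successful doPutIterator
          blocks(blockId) = values
          Left(values)
        } else {
          Right(values.iterator)         // put failed: caller consumes the values uncached
        }
    }
  }
}

val store = new ToyBlockStore[Int](capacity = 10)
println(store.getOrElseUpdate("rdd_42_0", () => Iterator(1, 2, 3)))      // Left(List(1, 2, 3))
println(store.getOrElseUpdate("rdd_42_1", () => Iterator.range(0, 100))) // Right(<iterator>)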