Spark Source Code Walkthrough (7): Memory Management
Source: Internet. Editor: 程序博客网. Time: 2024/05/22 15:54
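Spark's memory is managed mainly by MemoryManager. The memory it manages is divided into two parts, StorageMemory and ExecutionMemory, and ExecutionMemory is further divided into onHeap and offHeap.
StorageMemory is used mainly by BlockManager and belongs to Spark's storage system. ExecutionMemory is used mainly for running tasks, chiefly for writing results during the shuffle.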

    @GuardedBy("this")
    protected val storageMemoryPool = new StorageMemoryPool(this)
    @GuardedBy("this")
    protected val onHeapExecutionMemoryPool = new ExecutionMemoryPool(this, "on-heap execution")
    @GuardedBy("this")
    protected val offHeapExecutionMemoryPool = new ExecutionMemoryPool(this, "off-heap execution")

First, look at how the size of each region is set:

    storageMemoryPool.incrementPoolSize(storageMemory)
    onHeapExecutionMemoryPool.incrementPoolSize(onHeapExecutionMemory)
    offHeapExecutionMemoryPool.incrementPoolSize(conf.getSizeAsBytes("spark.memory.offHeap.size", 0))

MemoryManager has two main subclasses: StaticMemoryManager and UnifiedMemoryManager.
UnifiedMemoryManager is the default, so the analysis below uses it as the example:

    val useLegacyMemoryManager = conf.getBoolean("spark.memory.useLegacyMode", false)
    val memoryManager: MemoryManager =
      if (useLegacyMemoryManager) {
        new StaticMemoryManager(conf, numUsableCores)
      } else {
        UnifiedMemoryManager(conf, numUsableCores)
      }

With every parameter left at its default, the size of each region is determined as follows:

    object UnifiedMemoryManager {

      // Set aside a fixed amount of memory for non-storage, non-execution purposes.
      // This serves a function similar to `spark.memory.fraction`, but guarantees that we reserve
      // sufficient memory for the system even for small heaps. E.g. if we have a 1GB JVM, then
      // the memory used for execution and storage will be (1024 - 300) * 0.75 = 543MB by default.
      private val RESERVED_SYSTEM_MEMORY_BYTES = 300 * 1024 * 1024

      def apply(conf: SparkConf, numCores: Int): UnifiedMemoryManager = {
        val maxMemory = getMaxMemory(conf)
        new UnifiedMemoryManager(
          conf,
          maxMemory = maxMemory,
          storageRegionSize =
            (maxMemory * conf.getDouble("spark.memory.storageFraction", 0.5)).toLong,
          numCores = numCores)
      }

      /**
       * Return the total amount of memory shared between execution and storage, in bytes.
       */
      private def getMaxMemory(conf: SparkConf): Long = {
        val systemMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
        val reservedMemory = conf.getLong("spark.testing.reservedMemory",
          if (conf.contains("spark.testing")) 0 else RESERVED_SYSTEM_MEMORY_BYTES)
        val minSystemMemory = reservedMemory * 1.5
        if (systemMemory < minSystemMemory) {
          throw new IllegalArgumentException(s"System memory $systemMemory must " +
            s"be at least $minSystemMemory. Please use a larger heap size.")
        }
        val usableMemory = systemMemory - reservedMemory
        val memoryFraction = conf.getDouble("spark.memory.fraction", 0.75)
        (usableMemory * memoryFraction).toLong
      }
    }
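
    private[spark] class UnifiedMemoryManager private[memory] (
        conf: SparkConf,
        val maxMemory: Long,
        storageRegionSize: Long,
        numCores: Int)
      extends MemoryManager(
        conf,
        numCores,
        storageRegionSize,
        maxMemory - storageRegionSize) {

From the code above, with default parameters the on-heap memory managed by MemoryManager is laid out like this: 300M of the heap is reserved for the system; spark.memory.fraction (0.75) of the remainder is shared by storage and execution; and within that shared region storage initially gets spark.memory.storageFraction (0.5), with execution taking the rest. (The original figure is not reproduced here.)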
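The sizing arithmetic above can be sketched as a small standalone calculation. This is only an illustration: `MemoryLayout`, `Regions` and `regions` are hypothetical names that do not exist in Spark, and the defaults (0.75 and 0.5) are the ones quoted in the code above.

```scala
// Illustrative sketch only: mirrors the arithmetic of getMaxMemory() and apply()
// quoted above. The names below are hypothetical, not Spark classes.
object MemoryLayout {
  // Same constant as RESERVED_SYSTEM_MEMORY_BYTES in the quoted code (300MB).
  val ReservedBytes: Long = 300L * 1024 * 1024

  final case class Regions(maxMemory: Long, storage: Long, execution: Long)

  def regions(
      systemMemory: Long,
      memoryFraction: Double = 0.75,
      storageFraction: Double = 0.5): Regions = {
    // The quoted code rejects heaps smaller than 1.5x the reserved amount.
    require(systemMemory >= ReservedBytes * 1.5, "heap too small")
    val maxMemory = ((systemMemory - ReservedBytes) * memoryFraction).toLong
    val storage = (maxMemory * storageFraction).toLong
    // Execution initially receives whatever storage does not take.
    Regions(maxMemory, storage, maxMemory - storage)
  }
}
```

For a 1GB heap, `MemoryLayout.regions(1024L * 1024 * 1024).maxMemory` equals 543MB expressed in bytes, which matches the comment in the quoted source. Note that in a real JVM `Runtime.getRuntime.maxMemory` is typically somewhat smaller than the configured heap size, so actual values will be lower.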
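StorageMemory is used mainly through BlockManager's MemoryStore, and on-heap ExecutionMemory is used mainly through TaskMemoryManager. The methods for requesting memory are acquireStorageMemory() and acquireExecutionMemory() respectively.
First, look at the simpler of the two, acquireStorageMemory():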

    override def acquireStorageMemory(
        blockId: BlockId,
        numBytes: Long,
        evictedBlocks: mutable.Buffer[(BlockId, BlockStatus)]): Boolean = synchronized {
      assertInvariant()
      assert(numBytes >= 0)
      if (numBytes > maxStorageMemory) {
        // Fail fast if the block simply won't fit
        logInfo(s"Will not store $blockId as the required space ($numBytes bytes) exceeds our " +
          s"memory limit ($maxStorageMemory bytes)")
        return false
      }
      if (numBytes > storageMemoryPool.memoryFree) {
        // There is not enough free memory in the storage pool, so try to borrow free memory from
        // the execution pool.
        val memoryBorrowedFromExecution = Math.min(onHeapExecutionMemoryPool.memoryFree, numBytes)
        onHeapExecutionMemoryPool.decrementPoolSize(memoryBorrowedFromExecution)
        storageMemoryPool.incrementPoolSize(memoryBorrowedFromExecution)
      }
      storageMemoryPool.acquireMemory(blockId, numBytes, evictedBlocks)
    }

The logic here is fairly simple. If the requested size exceeds maxStorageMemory, the method returns false immediately. Otherwise it checks whether the storage pool's memoryFree covers the request: if it does, the memory is allocated directly; if not, the method first tries to borrow free memory from the execution pool and then allocates.
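There is one question about how much memory is borrowed, though. Would it be more reasonable to change

    val memoryBorrowedFromExecution = Math.min(onHeapExecutionMemoryPool.memoryFree, numBytes)

to

    val memoryBorrowedFromExecution = Math.min(onHeapExecutionMemoryPool.memoryFree, numBytes - storageMemoryPool.memoryFree)

so that only the shortfall is borrowed? It is unclear whether the author of this code wrote it this way on purpose.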
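To make the difference between the two borrowing rules concrete, here is a minimal sketch that applies each rule to the same pool state. `BorrowSketch` and `Pools` are hypothetical names used only for illustration (sizes can be read as MB); they model pool sizes only and are not Spark's StorageMemoryPool or ExecutionMemoryPool.

```scala
// Illustrative model only; not Spark code. All names here are hypothetical.
object BorrowSketch {
  final case class Pools(storageSize: Long, storageUsed: Long, execSize: Long, execUsed: Long) {
    def storageFree: Long = storageSize - storageUsed
    def execFree: Long = execSize - execUsed
  }

  // Move n units of capacity from the execution pool to the storage pool.
  private def move(p: Pools, n: Long): Pools =
    p.copy(storageSize = p.storageSize + n, execSize = p.execSize - n)

  /** Rule in the quoted code: borrow up to numBytes, ignoring storage's own free space. */
  def borrowQuoted(p: Pools, numBytes: Long): Pools =
    if (numBytes > p.storageFree) move(p, math.min(p.execFree, numBytes)) else p

  /** Rule proposed in the article: borrow only what storage is actually short of. */
  def borrowShortfall(p: Pools, numBytes: Long): Pools =
    if (numBytes > p.storageFree) move(p, math.min(p.execFree, numBytes - p.storageFree)) else p
}
```

With `Pools(600, 500, 600, 0)` and a 300-unit request, the quoted rule shrinks the execution pool to 300, while the shortfall rule leaves it at 400 because storage already had 100 free. Both variants leave enough room to store the block; they differ only in how much capacity is taken from execution.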
Next, look at acquireExecutionMemory(), which is considerably more complex:

    /**
     * Try to acquire up to `numBytes` of execution memory for the current task and return the
     * number of bytes obtained, or 0 if none can be allocated.
     *
     * This call may block until there is enough free memory in some situations, to make sure each
     * task has a chance to ramp up to at least 1 / 2N of the total memory pool (where N is the # of
     * active tasks) before it is forced to spill. This can happen if the number of tasks increase
     * but an older task had a lot of memory already.
     */
    override private[memory] def acquireExecutionMemory(
        numBytes: Long,
        taskAttemptId: Long,
        memoryMode: MemoryMode): Long = synchronized {
      assertInvariant()
      assert(numBytes >= 0)
      memoryMode match {
        case MemoryMode.ON_HEAP =>

          /**
           * Grow the execution pool by evicting cached blocks, thereby shrinking the storage pool.
           *
           * When acquiring memory for a task, the execution pool may need to make multiple
           * attempts. Each attempt must be able to evict storage in case another task jumps in
           * and caches a large block between the attempts. This is called once per attempt.
           */
          def maybeGrowExecutionPool(extraMemoryNeeded: Long): Unit = {
            if (extraMemoryNeeded > 0) {
              // There is not enough free memory in the execution pool, so try to reclaim memory from
              // storage. We can reclaim any free memory from the storage pool. If the storage pool
              // has grown to become larger than `storageRegionSize`, we can evict blocks and reclaim
              // the memory that storage has borrowed from execution.
              val memoryReclaimableFromStorage =
                math.max(storageMemoryPool.memoryFree, storageMemoryPool.poolSize - storageRegionSize)
              if (memoryReclaimableFromStorage > 0) {
                // Only reclaim as much space as is necessary and available:
                val spaceToReclaim = storageMemoryPool.freeSpaceToShrinkPool(
                  math.min(extraMemoryNeeded, memoryReclaimableFromStorage))
                storageMemoryPool.decrementPoolSize(spaceToReclaim)
                onHeapExecutionMemoryPool.incrementPoolSize(spaceToReclaim)
              }
            }
          }

          /**
           * The size the execution pool would have after evicting storage memory.
           *
           * The execution memory pool divides this quantity among the active tasks evenly to cap
           * the execution memory allocation for each task. It is important to keep this greater
           * than the execution pool size, which doesn't take into account potential memory that
           * could be freed by evicting storage. Otherwise we may hit SPARK-12155.
           *
           * Additionally, this quantity should be kept below `maxMemory` to arbitrate fairness
           * in execution memory allocation across tasks, Otherwise, a task may occupy more than
           * its fair share of execution memory, mistakenly thinking that other tasks can acquire
           * the portion of storage memory that cannot be evicted.
           */
          def computeMaxExecutionPoolSize(): Long = {
            maxMemory - math.min(storageMemoryUsed, storageRegionSize)
          }

          onHeapExecutionMemoryPool.acquireMemory(
            numBytes, taskAttemptId, maybeGrowExecutionPool, computeMaxExecutionPoolSize)

        case MemoryMode.OFF_HEAP =>
          // For now, we only support on-heap caching of data, so we do not need to interact with
          // the storage pool when allocating off-heap memory. This will change in the future, though.
          offHeapExecutionMemoryPool.acquireMemory(numBytes, taskAttemptId)
      }
    }

    /**
     * Try to acquire up to `numBytes` of memory for the given task and return the number of bytes
     * obtained, or 0 if none can be allocated.
     *
     * This call may block until there is enough free memory in some situations, to make sure each
     * task has a chance to ramp up to at least 1 / 2N of the total memory pool (where N is the # of
     * active tasks) before it is forced to spill. This can happen if the number of tasks increase
     * but an older task had a lot of memory already.
     *
     * @param numBytes number of bytes to acquire
     * @param taskAttemptId the task attempt acquiring memory
     * @param maybeGrowPool a callback that potentially grows the size of this pool. It takes in
     *                      one parameter (Long) that represents the desired amount of memory by
     *                      which this pool should be expanded.
     * @param computeMaxPoolSize a callback that returns the maximum allowable size of this pool
     *                           at this given moment. This is not a field because the max pool
     *                           size is variable in certain cases. For instance, in unified
     *                           memory management, the execution pool can be expanded by evicting
     *                           cached blocks, thereby shrinking the storage pool.
     *
     * @return the number of bytes granted to the task.
     */
    private[memory] def acquireMemory(
        numBytes: Long,
        taskAttemptId: Long,
        maybeGrowPool: Long => Unit = (additionalSpaceNeeded: Long) => Unit,
        computeMaxPoolSize: () => Long = () => poolSize): Long = lock.synchronized {
      assert(numBytes > 0, s"invalid number of bytes requested: $numBytes")

      // TODO: clean up this clunky method signature

      // Add this task to the taskMemory map just so we can keep an accurate count of the number
      // of active tasks, to let other tasks ramp down their memory in calls to `acquireMemory`
      if (!memoryForTask.contains(taskAttemptId)) {
        memoryForTask(taskAttemptId) = 0L
        // This will later cause waiting tasks to wake up and check numTasks again
        lock.notifyAll()
      }

      // Keep looping until we're either sure that we don't want to grant this request (because this
      // task would have more than 1 / numActiveTasks of the memory) or we have enough free
      // memory to give it (we always let each task get at least 1 / (2 * numActiveTasks)).
      // TODO: simplify this to limit each task to its own slot
      while (true) {
        val numActiveTasks = memoryForTask.keys.size
        val curMem = memoryForTask(taskAttemptId)

        // In every iteration of this loop, we should first try to reclaim any borrowed execution
        // space from storage. This is necessary because of the potential race condition where new
        // storage blocks may steal the free execution memory that this task was waiting for.
        maybeGrowPool(numBytes - memoryFree)

        // Maximum size the pool would have after potentially growing the pool.
        // This is used to compute the upper bound of how much memory each task can occupy. This
        // must take into account potential free memory as well as the amount this pool currently
        // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
        // we did not take into account space that could have been freed by evicting cached blocks.
        val maxPoolSize = computeMaxPoolSize()
        val maxMemoryPerTask = maxPoolSize / numActiveTasks
        val minMemoryPerTask = poolSize / (2 * numActiveTasks)

        // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
        val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
        // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
        val toGrant = math.min(maxToGrant, memoryFree)

        // We want to let each task get at least 1 / (2 * numActiveTasks) before blocking;
        // if we can't give it this much now, wait for other tasks to free up memory
        // (this happens if older tasks allocated lots of memory before N grew)
        if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
          logInfo(s"TID $taskAttemptId waiting for at least 1/2N of $poolName pool to be free")
          lock.wait()
        } else {
          memoryForTask(taskAttemptId) += toGrant
          return toGrant
        }
      }
      0L  // Never reached
    }

Borrowing appears here as well: when the Execution Pool runs short of memory it borrows from the Storage Pool. The difference is that this borrowing is more forceful. If the Storage Pool had previously borrowed memory from the Execution Pool, the Storage Pool must give that memory back, whether or not it currently holds data. Cached data whose storage level uses disk (level.useDisk) is written to disk; otherwise it is simply dropped. The code that drops a block from the memory store is shown below:

    /**
     * Drop a block from memory, possibly putting it on disk if applicable. Called when the memory
     * store reaches its limit and needs to free up space.
     *
     * If `data` is not put on disk, it won't be created.
     *
     * Return the block status if the given block has been updated, else None.
     */
    def dropFromMemory(
        blockId: BlockId,
        data: () => Either[Array[Any], ByteBuffer]): Option[BlockStatus] = {

      logInfo(s"Dropping block $blockId from memory")
      val info = blockInfo.get(blockId).orNull

      // If the block has not already been dropped
      if (info != null && pendingToRemove.putIfAbsent(blockId, currentTaskAttemptId) == 0L) {
        try {
          info.synchronized {
            // required ? As of now, this will be invoked only for blocks which are ready
            // But in case this changes in future, adding for consistency sake.
            if (!info.waitForReady()) {
              // If we get here, the block write failed.
              logWarning(s"Block $blockId was marked as failure. Nothing to drop")
              return None
            } else if (blockInfo.get(blockId).isEmpty) {
              logWarning(s"Block $blockId was already dropped.")
              return None
            }
            var blockIsUpdated = false
            val level = info.level

            // Drop to disk, if storage level requires
            if (level.useDisk && !diskStore.contains(blockId)) {
              logInfo(s"Writing block $blockId to disk")
              data() match {
                case Left(elements) =>
                  diskStore.putArray(blockId, elements, level, returnValues = false)
                case Right(bytes) =>
                  diskStore.putBytes(blockId, bytes, level)
              }
              blockIsUpdated = true
            }

            // Actually drop from memory store
            val droppedMemorySize =
              if (memoryStore.contains(blockId)) memoryStore.getSize(blockId) else 0L
            val blockIsRemoved = memoryStore.remove(blockId)
            if (blockIsRemoved) {
              blockIsUpdated = true
            } else {
              logWarning(s"Block $blockId could not be dropped from memory as it does not exist")
            }

            val status = getCurrentBlockStatus(blockId, info)
            if (info.tellMaster) {
              reportBlockStatus(blockId, info, status, droppedMemorySize)
            }
            if (!level.useDisk) {
              // The block is completely gone from this node; forget it so we can put() it again later.
              blockInfo.remove(blockId)
            }
            if (blockIsUpdated) {
              return Some(status)
            }
          }
        } finally {
          pendingToRemove.remove(blockId)
        }
      }
      None
    }
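How much the execution pool can take back from storage follows from the `memoryReclaimableFromStorage` expression in maybeGrowExecutionPool quoted earlier. The sketch below restates that rule as pure functions. `ReclaimSketch` and its method names are hypothetical, and the sketch assumes eviction always frees the full amount requested, which the real freeSpaceToShrinkPool does not guarantee.

```scala
// Illustrative sketch only; names are hypothetical and eviction is assumed to always succeed.
object ReclaimSketch {
  /**
   * Storage memory that execution may reclaim: either all currently free storage memory,
   * or everything storage holds beyond its configured region, whichever is larger.
   */
  def reclaimable(storagePoolSize: Long, storageUsed: Long, storageRegionSize: Long): Long =
    math.max(storagePoolSize - storageUsed, storagePoolSize - storageRegionSize)

  /** Only as much as is both needed and reclaimable is moved to the execution pool. */
  def spaceToReclaim(
      extraNeeded: Long,
      storagePoolSize: Long,
      storageUsed: Long,
      storageRegionSize: Long): Long = {
    val r = reclaimable(storagePoolSize, storageUsed, storageRegionSize)
    if (extraNeeded > 0 && r > 0) math.min(extraNeeded, r) else 0L
  }
}
```

For example, with a storage pool that has grown to 800 units against a 600-unit region and holds 700 units of blocks, `reclaimable(800, 700, 600)` is 200: 100 units are free, and the other 100 would require evicting cached blocks through the drop path shown above.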
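Now consider the following scenario.
Suppose an Executor is given a 2048M heap and every other parameter keeps its default value.
Then:
MaxMemory = (2048M - 300M) * 0.75 = 1311M
StorageMemory = 1311M * 0.5 = 655.5M
ExecutionMemory = 1311M - 655.5M = 655.5M
At this point neither StorageMemory nor ExecutionMemory is in use.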
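a. The first Task is assigned to this Executor and needs 1000M of memory while writing its output. ExecutionMemory (655.5M) is clearly not enough, so 344.5M must be borrowed from StorageMemory. The check passes because all of StorageMemory is free, so the two regions are resized: StorageMemory becomes 311M and ExecutionMemory becomes 1000M.
The computation then gives:
maxMemoryPerTask = (1311M - math.min(0M, 655.5M)) / 1 = 1311M
minMemoryPerTask = 1000M / 2 = 500M
maxToGrant = math.min(1000M, math.max(0, 1311M - 0M)) = 1000M
toGrant = math.min(1000M, 1000M) = 1000M
So this Task is granted the full 1000M.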
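b. Suppose 200M of data is now stored in StorageMemory, leaving it with only 111M free.
c. The Executor then receives a new Task that needs 500M of execution memory.
The execution pool has no free memory left, so it can reclaim only the 111M that is free in StorageMemory. ExecutionMemory is resized to 1111M and StorageMemory to 200M.
maxMemoryPerTask = (1311M - math.min(200M, 655.5M)) / 2 = 555.5M
minMemoryPerTask = 1111M / 4 = 277.75M
maxToGrant = math.min(500M, math.max(0, 555.5M - 0M)) = 500M
toGrant = math.min(500M, 111M) = 111M
Now toGrant < 500M and 0M + toGrant < minMemoryPerTask, so the second Task has to wait until the first Task finishes or releases some memory. If the first Task happens to run for a long time (say 10 minutes), the second Task will be blocked for up to 10 minutes: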

    if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
      logInfo(s"TID $taskAttemptId waiting for at least 1/2N of $poolName pool to be free")
      lock.wait()
    }

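The scenario above can be replayed with a small model of a single iteration of the loop in acquireMemory. `GrantSketch`, `decide`, `Granted` and `MustWait` are hypothetical names for illustration; sizes are in MB and use Double so that the fractional values from the example (555.5, 277.75) carry over, whereas the real code works with Long byte counts.

```scala
// Illustrative sketch only; models one pass of the while loop, not Spark's actual class.
object GrantSketch {
  sealed trait Outcome
  final case class Granted(mb: Double) extends Outcome
  case object MustWait extends Outcome

  def decide(
      numBytes: Double,    // amount requested by the task
      curMem: Double,      // amount the task already holds
      poolSize: Double,    // current execution pool size (after any growth)
      memoryFree: Double,  // free memory in the execution pool
      maxPoolSize: Double, // result of computeMaxPoolSize()
      numActiveTasks: Int): Outcome = {
    val maxMemoryPerTask = maxPoolSize / numActiveTasks
    val minMemoryPerTask = poolSize / (2 * numActiveTasks)
    val maxToGrant = math.min(numBytes, math.max(0.0, maxMemoryPerTask - curMem))
    val toGrant = math.min(maxToGrant, memoryFree)
    // Same condition as the quoted code: block while the task is below its 1/2N share.
    if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) MustWait
    else Granted(toGrant)
  }
}
```

Step a corresponds to `GrantSketch.decide(1000, 0, 1000, 1000, 1311, 1)`, which yields `Granted(1000.0)`. Step c corresponds to `GrantSketch.decide(500, 0, 1111, 111, 1111, 2)`, which yields `MustWait`, because 111 is below both the request and the 277.75 minimum share.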
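The cause of this behavior is that the memory used by each Task within the same Executor process is not limited and managed individually.
Using StaticMemoryManager can mitigate the problem to some extent, but at the cost of lower memory utilization.
Another option would be not to make the thread wait here when it cannot get enough memory. After all, a task that is short of memory can spill to disk, which is slower but still better than waiting for a long time.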
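The last suggestion can be sketched as a non-blocking variant of the grant computation: instead of waiting for the 1/2N share, the pool hands back whatever it can give right now and the caller spills when the grant falls short. This is a hypothetical design for illustration only, not how the quoted Spark code behaves; `NoWaitSketch`, `grantWithoutWaiting` and `shouldSpill` are invented names.

```scala
// Hypothetical alternative for illustration; the quoted Spark code blocks in lock.wait() instead.
object NoWaitSketch {
  /** Grant whatever is available right now, capped at the task's 1/N share; never blocks. */
  def grantWithoutWaiting(
      numBytes: Long,
      curMem: Long,
      memoryFree: Long,
      maxPoolSize: Long,
      numActiveTasks: Int): Long = {
    val maxMemoryPerTask = maxPoolSize / numActiveTasks
    val maxToGrant = math.min(numBytes, math.max(0L, maxMemoryPerTask - curMem))
    math.min(maxToGrant, memoryFree)
  }

  /** The caller treats a short grant as the signal to spill to disk. */
  def shouldSpill(requested: Long, granted: Long): Boolean = granted < requested
}
```

In step c of the scenario, `grantWithoutWaiting(500, 0, 111, 1111, 2)` returns 111 and `shouldSpill(500, 111)` is true, so the second Task would proceed immediately with 111 units and spill the rest. The trade-off is that a task can be forced to spill before it ever reaches the 1/2N share that the blocking version is designed to guarantee.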