Reading the Spark Source Code (7): Memory Management


Spark's memory is managed by MemoryManager. The memory it manages is divided into two parts, StorageMemory and ExecutionMemory, and ExecutionMemory is further split into an on-heap and an off-heap pool.

StorageMemory is used mainly by the BlockManager and is part of Spark's storage subsystem, while ExecutionMemory is used mainly by running tasks, chiefly for writing out shuffle results.

  @GuardedBy("this")  protected val storageMemoryPool = new StorageMemoryPool(this)  @GuardedBy("this")  protected val onHeapExecutionMemoryPool = new ExecutionMemoryPool(this, "on-heap execution")  @GuardedBy("this")  protected val offHeapExecutionMemoryPool = new ExecutionMemoryPool(this, "off-heap execution")
First, look at how the size of each pool is set:

  storageMemoryPool.incrementPoolSize(storageMemory)
  onHeapExecutionMemoryPool.incrementPoolSize(onHeapExecutionMemory)
  offHeapExecutionMemoryPool.incrementPoolSize(conf.getSizeAsBytes("spark.memory.offHeap.size", 0))
MemoryManager has two main subclasses: StaticMemoryManager and UnifiedMemoryManager.


Since UnifiedMemoryManager is the default, the analysis below uses UnifiedMemoryManager as the example:

    val useLegacyMemoryManager = conf.getBoolean("spark.memory.useLegacyMode", false)
    val memoryManager: MemoryManager =
      if (useLegacyMemoryManager) {
        new StaticMemoryManager(conf, numUsableCores)
      } else {
        UnifiedMemoryManager(conf, numUsableCores)
      }
Assuming all parameters are left at their defaults, the size of each region works out as follows:

object UnifiedMemoryManager {

  // Set aside a fixed amount of memory for non-storage, non-execution purposes.
  // This serves a function similar to `spark.memory.fraction`, but guarantees that we reserve
  // sufficient memory for the system even for small heaps. E.g. if we have a 1GB JVM, then
  // the memory used for execution and storage will be (1024 - 300) * 0.75 = 543MB by default.
  private val RESERVED_SYSTEM_MEMORY_BYTES = 300 * 1024 * 1024

  def apply(conf: SparkConf, numCores: Int): UnifiedMemoryManager = {
    val maxMemory = getMaxMemory(conf)
    new UnifiedMemoryManager(
      conf,
      maxMemory = maxMemory,
      storageRegionSize =
        (maxMemory * conf.getDouble("spark.memory.storageFraction", 0.5)).toLong,
      numCores = numCores)
  }

  /**
   * Return the total amount of memory shared between execution and storage, in bytes.
   */
  private def getMaxMemory(conf: SparkConf): Long = {
    val systemMemory = conf.getLong("spark.testing.memory", Runtime.getRuntime.maxMemory)
    val reservedMemory = conf.getLong("spark.testing.reservedMemory",
      if (conf.contains("spark.testing")) 0 else RESERVED_SYSTEM_MEMORY_BYTES)
    val minSystemMemory = reservedMemory * 1.5
    if (systemMemory < minSystemMemory) {
      throw new IllegalArgumentException(s"System memory $systemMemory must " +
        s"be at least $minSystemMemory. Please use a larger heap size.")
    }
    val usableMemory = systemMemory - reservedMemory
    val memoryFraction = conf.getDouble("spark.memory.fraction", 0.75)
    (usableMemory * memoryFraction).toLong
  }
}

private[spark] class UnifiedMemoryManager private[memory] (
    conf: SparkConf,
    val maxMemory: Long,
    storageRegionSize: Long,
    numCores: Int)
  extends MemoryManager(
    conf,
    numCores,
    storageRegionSize,
    maxMemory - storageRegionSize) {
From the code above, with default parameters, the on-heap memory managed by MemoryManager is laid out as follows: 300MB of the heap is reserved for the system, 75% of the remainder (spark.memory.fraction) forms the region shared by storage and execution, and that region is split evenly between StorageMemory and ExecutionMemory (spark.memory.storageFraction = 0.5).
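As a quick check of the arithmetic, here is a minimal standalone sketch (plain Scala, not Spark code) that reproduces the sizing for the 1GB heap mentioned in the source comment above; the heap size is only an illustrative input:

object MemorySizingSketch {
  def main(args: Array[String]): Unit = {
    val systemMemory    = 1024L * 1024 * 1024  // illustrative 1GB executor heap
    val reservedMemory  = 300L * 1024 * 1024   // RESERVED_SYSTEM_MEMORY_BYTES
    val memoryFraction  = 0.75                 // spark.memory.fraction default
    val storageFraction = 0.5                  // spark.memory.storageFraction default

    val maxMemory = ((systemMemory - reservedMemory) * memoryFraction).toLong
    val storageRegionSize = (maxMemory * storageFraction).toLong

    def mb(bytes: Long): Double = bytes / (1024.0 * 1024.0)

    // Prints 543.0 / 271.5 / 271.5 MB, matching the "(1024 - 300) * 0.75" comment above.
    println(s"maxMemory (storage + execution)      = ${mb(maxMemory)} MB")
    println(s"storage region (before any borrowing) = ${mb(storageRegionSize)} MB")
    println(s"execution region (before any borrowing) = ${mb(maxMemory - storageRegionSize)} MB")
  }
}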


StorageMemory is used mainly through the BlockManager's MemoryStore, while on-heap ExecutionMemory is used mainly through TaskMemoryManager. The corresponding request methods are acquireStorageMemory() and acquireExecutionMemory().

First, look at the relatively simple acquireStorageMemory() method:

  override def acquireStorageMemory(
      blockId: BlockId,
      numBytes: Long,
      evictedBlocks: mutable.Buffer[(BlockId, BlockStatus)]): Boolean = synchronized {
    assertInvariant()
    assert(numBytes >= 0)
    if (numBytes > maxStorageMemory) {
      // Fail fast if the block simply won't fit
      logInfo(s"Will not store $blockId as the required space ($numBytes bytes) exceeds our " +
        s"memory limit ($maxStorageMemory bytes)")
      return false
    }
    if (numBytes > storageMemoryPool.memoryFree) {
      // There is not enough free memory in the storage pool, so try to borrow free memory from
      // the execution pool.
      val memoryBorrowedFromExecution = Math.min(onHeapExecutionMemoryPool.memoryFree, numBytes)
      onHeapExecutionMemoryPool.decrementPoolSize(memoryBorrowedFromExecution)
      storageMemoryPool.incrementPoolSize(memoryBorrowedFromExecution)
    }
    storageMemoryPool.acquireMemory(blockId, numBytes, evictedBlocks)
  }
The logic here is fairly simple: if the requested size exceeds the maximum storage memory, return false immediately; otherwise, check whether the storage pool's memoryFree can satisfy the request. If it can, allocate directly; if not, try to borrow free space from the execution pool.

There is one question about the amount of memory borrowed here, though. Would changing

      val memoryBorrowedFromExecution = Math.min(onHeapExecutionMemoryPool.memoryFree, numBytes)
to

      val memoryBorrowedFromExecution = Math.min(onHeapExecutionMemoryPool.memoryFree, numBytes - storageMemoryPool.memoryFree)
be more reasonable? As written, the code borrows up to the full request even when the storage pool already has some free space of its own, so it can shift more pool capacity away from execution than is strictly needed.

It is not clear whether the author of this code did this deliberately.
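To make the difference concrete, here is a small numeric comparison under a hypothetical pool state (values in MB, chosen arbitrarily):

  // Hypothetical state: storage already has 150MB free, a 200MB block arrives,
  // and the execution pool has 400MB free that could be borrowed.
  val executionFree = 400.0  // onHeapExecutionMemoryPool.memoryFree
  val storageFree   = 150.0  // storageMemoryPool.memoryFree
  val numBytes      = 200.0  // size of the block being stored

  // As written: borrows up to the whole request, even though storage
  // could already cover 150MB of it from its own free space.
  val borrowedAsWritten  = math.min(executionFree, numBytes)               // 200.0
  // Proposed variant: borrow only the shortfall storage cannot cover itself.
  val borrowedAsProposed = math.min(executionFree, numBytes - storageFree) //  50.0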


Now look at acquireExecutionMemory(), which is considerably more complex:

  /**
   * Try to acquire up to `numBytes` of execution memory for the current task and return the
   * number of bytes obtained, or 0 if none can be allocated.
   *
   * This call may block until there is enough free memory in some situations, to make sure each
   * task has a chance to ramp up to at least 1 / 2N of the total memory pool (where N is the # of
   * active tasks) before it is forced to spill. This can happen if the number of tasks increase
   * but an older task had a lot of memory already.
   */
  override private[memory] def acquireExecutionMemory(
      numBytes: Long,
      taskAttemptId: Long,
      memoryMode: MemoryMode): Long = synchronized {
    assertInvariant()
    assert(numBytes >= 0)
    memoryMode match {
      case MemoryMode.ON_HEAP =>

        /**
         * Grow the execution pool by evicting cached blocks, thereby shrinking the storage pool.
         *
         * When acquiring memory for a task, the execution pool may need to make multiple
         * attempts. Each attempt must be able to evict storage in case another task jumps in
         * and caches a large block between the attempts. This is called once per attempt.
         */
        def maybeGrowExecutionPool(extraMemoryNeeded: Long): Unit = {
          if (extraMemoryNeeded > 0) {
            // There is not enough free memory in the execution pool, so try to reclaim memory from
            // storage. We can reclaim any free memory from the storage pool. If the storage pool
            // has grown to become larger than `storageRegionSize`, we can evict blocks and reclaim
            // the memory that storage has borrowed from execution.
            val memoryReclaimableFromStorage =
              math.max(storageMemoryPool.memoryFree, storageMemoryPool.poolSize - storageRegionSize)
            if (memoryReclaimableFromStorage > 0) {
              // Only reclaim as much space as is necessary and available:
              val spaceToReclaim = storageMemoryPool.freeSpaceToShrinkPool(
                math.min(extraMemoryNeeded, memoryReclaimableFromStorage))
              storageMemoryPool.decrementPoolSize(spaceToReclaim)
              onHeapExecutionMemoryPool.incrementPoolSize(spaceToReclaim)
            }
          }
        }

        /**
         * The size the execution pool would have after evicting storage memory.
         *
         * The execution memory pool divides this quantity among the active tasks evenly to cap
         * the execution memory allocation for each task. It is important to keep this greater
         * than the execution pool size, which doesn't take into account potential memory that
         * could be freed by evicting storage. Otherwise we may hit SPARK-12155.
         *
         * Additionally, this quantity should be kept below `maxMemory` to arbitrate fairness
         * in execution memory allocation across tasks, Otherwise, a task may occupy more than
         * its fair share of execution memory, mistakenly thinking that other tasks can acquire
         * the portion of storage memory that cannot be evicted.
         */
        def computeMaxExecutionPoolSize(): Long = {
          maxMemory - math.min(storageMemoryUsed, storageRegionSize)
        }

        onHeapExecutionMemoryPool.acquireMemory(
          numBytes, taskAttemptId, maybeGrowExecutionPool, computeMaxExecutionPoolSize)

      case MemoryMode.OFF_HEAP =>
        // For now, we only support on-heap caching of data, so we do not need to interact with
        // the storage pool when allocating off-heap memory. This will change in the future, though.
        offHeapExecutionMemoryPool.acquireMemory(numBytes, taskAttemptId)
    }
  }
In the ON_HEAP branch above, the actual per-task arbitration is delegated to ExecutionMemoryPool.acquireMemory(), shown below:

  /**
   * Try to acquire up to `numBytes` of memory for the given task and return the number of bytes
   * obtained, or 0 if none can be allocated.
   *
   * This call may block until there is enough free memory in some situations, to make sure each
   * task has a chance to ramp up to at least 1 / 2N of the total memory pool (where N is the # of
   * active tasks) before it is forced to spill. This can happen if the number of tasks increase
   * but an older task had a lot of memory already.
   *
   * @param numBytes number of bytes to acquire
   * @param taskAttemptId the task attempt acquiring memory
   * @param maybeGrowPool a callback that potentially grows the size of this pool. It takes in
   *                      one parameter (Long) that represents the desired amount of memory by
   *                      which this pool should be expanded.
   * @param computeMaxPoolSize a callback that returns the maximum allowable size of this pool
   *                           at this given moment. This is not a field because the max pool
   *                           size is variable in certain cases. For instance, in unified
   *                           memory management, the execution pool can be expanded by evicting
   *                           cached blocks, thereby shrinking the storage pool.
   *
   * @return the number of bytes granted to the task.
   */
  private[memory] def acquireMemory(
      numBytes: Long,
      taskAttemptId: Long,
      maybeGrowPool: Long => Unit = (additionalSpaceNeeded: Long) => Unit,
      computeMaxPoolSize: () => Long = () => poolSize): Long = lock.synchronized {
    assert(numBytes > 0, s"invalid number of bytes requested: $numBytes")

    // TODO: clean up this clunky method signature

    // Add this task to the taskMemory map just so we can keep an accurate count of the number
    // of active tasks, to let other tasks ramp down their memory in calls to `acquireMemory`
    if (!memoryForTask.contains(taskAttemptId)) {
      memoryForTask(taskAttemptId) = 0L
      // This will later cause waiting tasks to wake up and check numTasks again
      lock.notifyAll()
    }

    // Keep looping until we're either sure that we don't want to grant this request (because this
    // task would have more than 1 / numActiveTasks of the memory) or we have enough free
    // memory to give it (we always let each task get at least 1 / (2 * numActiveTasks)).
    // TODO: simplify this to limit each task to its own slot
    while (true) {
      val numActiveTasks = memoryForTask.keys.size
      val curMem = memoryForTask(taskAttemptId)

      // In every iteration of this loop, we should first try to reclaim any borrowed execution
      // space from storage. This is necessary because of the potential race condition where new
      // storage blocks may steal the free execution memory that this task was waiting for.
      maybeGrowPool(numBytes - memoryFree)

      // Maximum size the pool would have after potentially growing the pool.
      // This is used to compute the upper bound of how much memory each task can occupy. This
      // must take into account potential free memory as well as the amount this pool currently
      // occupies. Otherwise, we may run into SPARK-12155 where, in unified memory management,
      // we did not take into account space that could have been freed by evicting cached blocks.
      val maxPoolSize = computeMaxPoolSize()
      val maxMemoryPerTask = maxPoolSize / numActiveTasks
      val minMemoryPerTask = poolSize / (2 * numActiveTasks)

      // How much we can grant this task; keep its share within 0 <= X <= 1 / numActiveTasks
      val maxToGrant = math.min(numBytes, math.max(0, maxMemoryPerTask - curMem))
      // Only give it as much memory as is free, which might be none if it reached 1 / numTasks
      val toGrant = math.min(maxToGrant, memoryFree)

      // We want to let each task get at least 1 / (2 * numActiveTasks) before blocking;
      // if we can't give it this much now, wait for other tasks to free up memory
      // (this happens if older tasks allocated lots of memory before N grew)
      if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
        logInfo(s"TID $taskAttemptId waiting for at least 1/2N of $poolName pool to be free")
        lock.wait()
      } else {
        memoryForTask(taskAttemptId) += toGrant
        return toGrant
      }
    }
    0L  // Never reached
  }
Borrowing exists here as well: when the execution pool does not have enough memory, it borrows from the storage pool. The difference is that this borrowing is more forceful. If the storage pool previously borrowed memory from the execution pool, the execution pool can take it back regardless of whether that memory currently holds data: cached blocks whose storage level includes disk are written to disk, otherwise they are simply dropped. Below is the code that evicts a block from the MemoryStore:

  /**
   * Drop a block from memory, possibly putting it on disk if applicable. Called when the memory
   * store reaches its limit and needs to free up space.
   *
   * If `data` is not put on disk, it won't be created.
   *
   * Return the block status if the given block has been updated, else None.
   */
  def dropFromMemory(
      blockId: BlockId,
      data: () => Either[Array[Any], ByteBuffer]): Option[BlockStatus] = {
    logInfo(s"Dropping block $blockId from memory")
    val info = blockInfo.get(blockId).orNull

    // If the block has not already been dropped
    if (info != null && pendingToRemove.putIfAbsent(blockId, currentTaskAttemptId) == 0L) {
      try {
        info.synchronized {
          // required ? As of now, this will be invoked only for blocks which are ready
          // But in case this changes in future, adding for consistency sake.
          if (!info.waitForReady()) {
            // If we get here, the block write failed.
            logWarning(s"Block $blockId was marked as failure. Nothing to drop")
            return None
          } else if (blockInfo.get(blockId).isEmpty) {
            logWarning(s"Block $blockId was already dropped.")
            return None
          }
          var blockIsUpdated = false
          val level = info.level

          // Drop to disk, if storage level requires
          if (level.useDisk && !diskStore.contains(blockId)) {
            logInfo(s"Writing block $blockId to disk")
            data() match {
              case Left(elements) =>
                diskStore.putArray(blockId, elements, level, returnValues = false)
              case Right(bytes) =>
                diskStore.putBytes(blockId, bytes, level)
            }
            blockIsUpdated = true
          }

          // Actually drop from memory store
          val droppedMemorySize =
            if (memoryStore.contains(blockId)) memoryStore.getSize(blockId) else 0L
          val blockIsRemoved = memoryStore.remove(blockId)
          if (blockIsRemoved) {
            blockIsUpdated = true
          } else {
            logWarning(s"Block $blockId could not be dropped from memory as it does not exist")
          }
          val status = getCurrentBlockStatus(blockId, info)
          if (info.tellMaster) {
            reportBlockStatus(blockId, info, status, droppedMemorySize)
          }
          if (!level.useDisk) {
            // The block is completely gone from this node; forget it so we can put() it again later.
            blockInfo.remove(blockId)
          }
          if (blockIsUpdated) {
            return Some(status)
          }
        }
      } finally {
        pendingToRemove.remove(blockId)
      }
    }
    None
  }
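From the application's point of view, whether an evicted cached block survives depends on its StorageLevel; here is a minimal illustrative sketch (the function and RDD names and the element type are placeholders, not taken from the code above):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.storage.StorageLevel

  // rddA and rddB stand for RDDs built elsewhere in the application.
  def cacheWithDifferentLevels(rddA: RDD[Int], rddB: RDD[Int]): Unit = {
    // Disk-backed level: when dropFromMemory evicts this block, it is written to disk first
    // (the level.useDisk branch above), so it can still be read back later.
    rddA.persist(StorageLevel.MEMORY_AND_DISK)

    // Memory-only level: on eviction the block is simply removed from the MemoryStore and
    // must be recomputed from lineage if it is needed again.
    rddB.persist(StorageLevel.MEMORY_ONLY)
  }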


A careful reading of the memory-acquisition code above reveals a pitfall. Consider the following scenario:

Suppose an executor is given a 2048MB heap and all other parameters are left at their defaults.

Then:

MaxMemory = (2048M - 300M) * 0.75 = 1311M

StorageMemory = 1311M * 0.5 = 655.5M

ExecutionMemory = 1311M - 655.5M = 655.5M


Initially, neither StorageMemory nor ExecutionMemory is occupied at all.

a. Now suppose the first task is assigned to this executor, and writing its output requires 1000MB of execution memory. ExecutionMemory is clearly not enough, so it borrows from StorageMemory; the borrow is allowed, and the two regions are resized, with StorageMemory shrinking to 311MB and ExecutionMemory growing to 1000MB.

The computation then gives:

maxMemoryPerTask = (1311M - math.min(0M, 655.5M)) / 1 = 1311M

minMemoryPerTask =  1000M / 2 = 500M

maxToGrant = math.min(1000M, math.max(0, 1311M - 0M)) = 1000M

toGrant = math.min(1000M, 1000M) = 1000M

So this task is ultimately granted 1000MB.

b. Suppose 200MB of data is then cached in StorageMemory, leaving it with only 111MB of free space.

c. The executor now receives a second task, which requests 500MB of execution memory.

This time only 111MB can be borrowed from StorageMemory, so ExecutionMemory is resized to 1111MB and StorageMemory to 200MB:

maxMemoryPerTask =  (1311M - math.min(200M, 655.5M)) / 2 = 555.5M

minMemoryPerTask = 1111M / 4 = 277.75M

maxToGrant = math.min(500M, math.max(0, 555.5M - 0M)) = 500M

toGrant = math.min(500M, 111M) = 111M

Since toGrant < 500M and 0M + toGrant < minMemoryPerTask, the second task has no choice but to wait until the first task finishes and frees some memory. If, unluckily, the first task runs for a long time (say 10 minutes), the second task will end up waiting for as long as 10 minutes.

      if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
        logInfo(s"TID $taskAttemptId waiting for at least 1/2N of $poolName pool to be free")
        lock.wait()
      }
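To make steps a through c easy to replay, here is a minimal standalone sketch (plain Scala, not Spark code) that applies the same formulas as maybeGrowExecutionPool and ExecutionMemoryPool.acquireMemory; all sizes are in MB and the pool state is the hypothetical one from the example:

object UnifiedMemoryScenario {
  val maxMemory = 1311.0          // (2048 - 300) * 0.75
  val storageRegionSize = 655.5   // maxMemory * spark.memory.storageFraction (0.5)

  var storagePoolSize = 655.5
  var storageUsed = 0.0
  var executionPoolSize = 655.5
  var executionUsed = 0.0

  // Mirrors maybeGrowExecutionPool: reclaim storage's free space, or space storage borrowed.
  def maybeGrowExecutionPool(extraNeeded: Double): Unit = if (extraNeeded > 0) {
    val reclaimable = math.max(storagePoolSize - storageUsed, storagePoolSize - storageRegionSize)
    val reclaimed = math.max(0.0, math.min(extraNeeded, reclaimable))
    storagePoolSize -= reclaimed
    executionPoolSize += reclaimed
  }

  // Mirrors one round of ExecutionMemoryPool.acquireMemory for a single task attempt.
  def acquire(numBytes: Double, curMem: Double, numActiveTasks: Int): Double = {
    maybeGrowExecutionPool(numBytes - (executionPoolSize - executionUsed))
    val maxPoolSize = maxMemory - math.min(storageUsed, storageRegionSize)
    val maxMemoryPerTask = maxPoolSize / numActiveTasks
    val minMemoryPerTask = executionPoolSize / (2 * numActiveTasks)
    val maxToGrant = math.min(numBytes, math.max(0.0, maxMemoryPerTask - curMem))
    val toGrant = math.min(maxToGrant, executionPoolSize - executionUsed)
    if (toGrant < numBytes && curMem + toGrant < minMemoryPerTask) {
      // In the real code the task blocks in lock.wait() here and retries later.
      println(f"task must wait: toGrant = $toGrant%.2f MB, minMemoryPerTask = $minMemoryPerTask%.2f MB")
      0.0
    } else {
      executionUsed += toGrant
      toGrant
    }
  }

  def main(args: Array[String]): Unit = {
    println(s"a. task 1 granted: ${acquire(1000.0, 0.0, 1)} MB") // 1000.0
    storageUsed += 200.0                                          // b. 200MB cached
    println(s"c. task 2 granted: ${acquire(500.0, 0.0, 2)} MB")  // prints the wait message, 0.0
  }
}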

The root cause of this behavior is that the memory used by each task within the same executor process is not limited or managed individually.


Using StaticMemoryManager instead can alleviate the problem to some extent, but at the cost of lower memory utilization.
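For reference, legacy mode sizes the storage and shuffle/execution regions with fixed, independent fractions, which is why free memory in one region cannot be used by the other; a minimal configuration sketch (the fraction values shown are illustrative):

  import org.apache.spark.SparkConf

  // Opt back into StaticMemoryManager and size its fixed regions explicitly.
  val legacyConf = new SparkConf()
    .set("spark.memory.useLegacyMode", "true")
    .set("spark.storage.memoryFraction", "0.6")  // fixed share of heap for cached blocks
    .set("spark.shuffle.memoryFraction", "0.2")  // fixed share of heap for shuffle/execution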


Alternatively, instead of making the thread wait here when it cannot obtain enough memory, the data could be spilled to disk; that is slower, but still better than a long wait.
