Lesson 40: CacheManager Demystified: How CacheManager Works, Flow Diagram, and Source Code Walkthrough

Source: Internet | Published by: education video site Zhihu | Editor: 程序博客网 | Time: 2024/05/14 14:06

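CacheManager manages the cache, and a cache can be memory-based or disk-based. CacheManager operates on data through BlockManager.

When a Task runs its computation, it calls the RDD's compute method. Let's look at the compute method of MapPartitionsRDD: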

Source code of MapPartitionsRDD:

private[spark] class MapPartitionsRDD[U: ClassTag, T: ClassTag](
    var prev: RDD[T],
    f: (TaskContext, Int, Iterator[T]) => Iterator[U],  // (TaskContext, partition index, iterator)
    preservesPartitioning: Boolean = false)
  extends RDD[U](prev) {

  override val partitioner = if (preservesPartitioning) firstParent[T].partitioner else None

  override def getPartitions: Array[Partition] = firstParent[T].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    f(context, split.index, firstParent[T].iterator(split, context))

  override def clearDependencies() {
    super.clearDependencies()
    prev = null
  }
}

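The actual work in compute is done through iterator: MapPartitionsRDD applies its function to the iterator of the parent RDD, so its result depends on the parent's computation. iterator is an internal RDD method that reads the data from the cache if it is cached and computes it otherwise. It is not meant to be called directly by users, but it is available when implementing custom RDD subclasses.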
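The chaining described above can be sketched without Spark. This is a minimal model, not Spark's implementation: MiniRDD, SeqRDD and MiniMapPartitionsRDD are hypothetical names, a partition is identified by a plain Int, and caching is left out so that iterator simply delegates to compute.

```scala
object ComputeChainSketch {
  // Hypothetical stand-in for an RDD: one partition is just an Iterator.
  trait MiniRDD[T] {
    def compute(split: Int): Iterator[T]
    // No caching in this sketch, so iterator simply delegates to compute.
    final def iterator(split: Int): Iterator[T] = compute(split)
  }

  // Source data set: each inner Seq plays the role of one partition.
  final class SeqRDD[T](partitions: Vector[Seq[T]]) extends MiniRDD[T] {
    def compute(split: Int): Iterator[T] = partitions(split).iterator
  }

  // Mirrors the shape of MapPartitionsRDD: apply f to the parent's iterator.
  final class MiniMapPartitionsRDD[U, T](
      prev: MiniRDD[T],
      f: (Int, Iterator[T]) => Iterator[U]) extends MiniRDD[U] {
    def compute(split: Int): Iterator[U] = f(split, prev.iterator(split))
  }

  // Builds source -> double -> filter and evaluates one partition.
  def demo(split: Int): List[Int] = {
    val source = new SeqRDD(Vector(Seq(1, 2, 3), Seq(4, 5)))
    val doubled = new MiniMapPartitionsRDD[Int, Int](source, (_, it) => it.map(_ * 2))
    val filtered = new MiniMapPartitionsRDD[Int, Int](doubled, (_, it) => it.filter(_ > 2))
    filtered.iterator(split).toList
  }
}
```

Calling ComputeChainSketch.demo(0) pulls elements through the whole chain one at a time; nothing is materialized until toList consumes the final iterator.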

The iterator method in RDD.scala:

final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    getOrCompute(split, context)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}

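In the iterator method of RDD.scala, storageLevel != StorageLevel.NONE means the data may be stored in memory or on disk, so getOrCompute(split, context) is called. If the partition has already been computed once, the next request can ask CacheManager for the data instead of computing it again.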
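The effect of that branch can be illustrated with a small sketch. It assumes a hypothetical CachedRDD class in which a Boolean flag replaces StorageLevel and an in-memory map replaces BlockManager; computeCalls is added only to make the difference observable.

```scala
import scala.collection.mutable

object IteratorCacheSketch {
  // Hypothetical simplification: a persistence flag instead of StorageLevel.
  final class CachedRDD(data: Vector[Seq[Int]], persisted: Boolean) {
    var computeCalls = 0
    private val cache = mutable.Map.empty[Int, Vector[Int]]

    private def compute(split: Int): Iterator[Int] = {
      computeCalls += 1
      data(split).iterator.map(_ + 1)
    }

    // Look the partition up in the cache; compute and store it on a miss.
    private def getOrCompute(split: Int): Iterator[Int] =
      cache.getOrElseUpdate(split, compute(split).toVector).iterator

    // Same decision as RDD.iterator: use the cache only when persisted.
    def iterator(split: Int): Iterator[Int] =
      if (persisted) getOrCompute(split) else compute(split)
  }
}
```

With persisted = true, reading the same partition twice runs compute once; with persisted = false, it runs on every read.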

Source code of getOrCompute in RDD.scala:

private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  val blockId = RDDBlockId(id, partition.index)
  var readCachedBlock = true
  // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
    readCachedBlock = false
    computeOrReadCheckpoint(partition, context)
  }) match {
    case Left(blockResult) =>
      if (readCachedBlock) {
        val existingMetrics = context.taskMetrics().inputMetrics
        existingMetrics.incBytesRead(blockResult.bytes)
        new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
          override def next(): T = {
            existingMetrics.incRecordsRead(1)
            delegate.next()
          }
        }
      } else {
        new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
      }
    case Right(iter) =>
      new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
  }
}

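When a cache exists, it may be held in memory or on disk, and getOrCompute reads from it; when no cache exists, the RDD has to be recomputed. Why would recomputation ever be needed? Suppose the data is cached in memory as 1 million partitions and the next step needs memory for its computation. Memory for the running computation takes priority over memory occupied by previously cached data. If the space of 10,000 partitions has to be freed, BlockManager drops that cached data from memory to disk. If the storage level is not a memory-and-disk level, the cached data of those 10,000 partitions may be lost: the remaining 990,000 partitions can be reused, while the 10,000 have to be recomputed.

While it is working, Cache retains as much data as it can, but the data is not guaranteed to be complete. If the current computation needs memory, the cached data in memory has to give up space. If the RDD was persisted with a level that also allows Disk, part of the cached data can move from memory to disk; otherwise that data is lost.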
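The trade-off can be made concrete with a toy model. The sketch below is an illustration under stated assumptions, not Spark's eviction logic: ToyStore is a hypothetical class, capacity is counted in blocks rather than bytes, and the oldest blocks are chosen as eviction victims.

```scala
import scala.collection.mutable

object EvictionSketch {
  // Hypothetical toy store: space is counted in blocks, not bytes.
  final class ToyStore(useDisk: Boolean) {
    // LinkedHashMap keeps insertion order, used here as the eviction order.
    private val memory = mutable.LinkedHashMap.empty[Int, String]
    private val disk = mutable.Map.empty[Int, String]

    def put(blockId: Int, value: String): Unit = memory(blockId) = value

    // Free `blocks` slots for computation: evicted blocks go to disk if the
    // storage level allows it, otherwise they are simply lost.
    def reserveForComputation(blocks: Int): Unit = {
      val victims = memory.keys.take(blocks).toList
      victims.foreach { id =>
        val value = memory.remove(id).get
        if (useDisk) disk(id) = value
      }
    }

    def get(blockId: Int): Option[String] = memory.get(blockId).orElse(disk.get(blockId))
  }

  // Number of blocks that must be recomputed after `evicted` blocks are pushed out.
  def lostBlocks(total: Int, evicted: Int, useDisk: Boolean): Int = {
    val store = new ToyStore(useDisk)
    (0 until total).foreach(i => store.put(i, s"partition-$i"))
    store.reserveForComputation(evicted)
    (0 until total).count(i => store.get(i).isEmpty)
  }
}
```

Scaled down from the numbers in the text: with 100 cached blocks and 10 evicted, a memory-only store loses 10 blocks, while a store that may spill to disk loses none.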

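getOrCompute returns an Iterator. Once the data has been cached, BlockManager manages it, and the previously cached data can be retrieved by blockId. Concretely, when CacheManager fetches cached data, it goes through BlockManager:

In getOrElseUpdate: if the block exists, it is retrieved; if it does not exist, the supplied `makeIterator` function is called to compute the block, the block is persisted, and its values are returned.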

Source code of getOrElseUpdate in BlockManager.scala:

def getOrElseUpdate[T](
    blockId: BlockId,
    level: StorageLevel,
    classTag: ClassTag[T],
    makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
  // Attempt to read the block from local or remote storage. If it's present, then we don't need
  // to go through the local-get-or-put path.
  get[T](blockId)(classTag) match {
    case Some(block) =>
      return Left(block)
    case _ =>
      // Need to compute the block.
  }
  // Initially we hold no locks on this block.
  doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
    case None =>
      // doPut() didn't hand work back to us, so the block already existed or was successfully
      // stored. Therefore, we now hold a read lock on the block.
      val blockResult = getLocalValues(blockId).getOrElse {
        // Since we held a read lock between the doPut() and get() calls, the block should not
        // have been evicted, so get() not returning the block indicates some internal error.
        releaseLock(blockId)
        throw new SparkException(s"get() failed for block $blockId even though we held a lock")
      }
      // We already hold a read lock on the block from the doPut() call and getLocalValues()
      // acquires the lock again, so we need to call releaseLock() here so that the net number
      // of lock acquisitions is 1 (since the caller will only call release() once).
      releaseLock(blockId)
      Left(blockResult)
    case Some(iter) =>
      // The put failed, likely because the data was too large to fit in memory and could not be
      // dropped to disk. Therefore, we need to pass the input iterator back to the caller so
      // that they can decide what to do with the values (e.g. process them without caching).
      Right(iter)
  }
}
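The Left/Right contract of this method can be modeled in a few lines. The sketch below is a simplification under assumptions: ToyBlockManager is hypothetical, blocks are plain Vectors, there are no locks, and "the put failed" is simulated by a block exceeding maxBlockSize.

```scala
import scala.collection.mutable

object GetOrElseUpdateSketch {
  // Hypothetical toy block manager: a block with more than maxBlockSize
  // elements is treated as too large to store.
  final class ToyBlockManager(maxBlockSize: Int) {
    private val blocks = mutable.Map.empty[String, Vector[Int]]

    def getOrElseUpdate(
        blockId: String,
        makeIterator: () => Iterator[Int]): Either[Vector[Int], Iterator[Int]] =
      blocks.get(blockId) match {
        case Some(block) => Left(block) // cache hit
        case None =>
          val values = makeIterator().toVector
          if (values.size <= maxBlockSize) {
            blocks(blockId) = values
            Left(values) // computed and stored
          } else {
            Right(values.iterator) // could not store: hand the values back uncached
          }
      }
  }
}
```

Left means the caller receives a stored block, whether it was already cached or has just been computed; Right means the values are usable but were not cached, so a later request will compute them again.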

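getOrElseUpdate in BlockManager.scala calls get[T](blockId), which fetches a block from the block manager, either local or remote. If the block is stored locally, get acquires a read lock on it; if the block is fetched from a remote block manager, no lock is acquired. The read lock is released automatically once the `data` iterator of the result has been fully consumed.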

Source code of the get method in BlockManager.scala:

def get[T: ClassTag](blockId: BlockId): Option[BlockResult] = {
  val local = getLocalValues(blockId)
  if (local.isDefined) {
    logInfo(s"Found block $blockId locally")
    return local
  }
  val remote = getRemoteValues[T](blockId)
  if (remote.isDefined) {
    logInfo(s"Found block $blockId remotely")
    return remote
  }
  None
}

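From the local side, BlockManager's get works as follows: if the data is local, get calls getLocalValues to fetch it. If the data is in memory (level.useMemory is set and memoryStore contains the blockId), it is read from memoryStore; if the data is on disk (level.useDisk is set and diskStore contains the blockId), it is read from diskStore. In other words, locally cached data may be in memory or on disk.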

Source code of getLocalValues in BlockManager.scala:

def getLocalValues(blockId: BlockId): Option[BlockResult] = {
  logDebug(s"Getting local block $blockId")
  blockInfoManager.lockForReading(blockId) match {
    case None =>
      logDebug(s"Block $blockId was not found")
      None
    case Some(info) =>
      val level = info.level
      logDebug(s"Level for block $blockId is $level")
      if (level.useMemory && memoryStore.contains(blockId)) {
        val iter: Iterator[Any] = if (level.deserialized) {
          memoryStore.getValues(blockId).get
        } else {
          serializerManager.dataDeserializeStream(
            blockId, memoryStore.getBytes(blockId).get.toInputStream())(info.classTag)
        }
        val ci = CompletionIterator[Any, Iterator[Any]](iter, releaseLock(blockId))
        Some(new BlockResult(ci, DataReadMethod.Memory, info.size))
      } else if (level.useDisk && diskStore.contains(blockId)) {
        val iterToReturn: Iterator[Any] = {
          val diskBytes = diskStore.getBytes(blockId)
          if (level.deserialized) {
            val diskValues = serializerManager.dataDeserializeStream(
              blockId,
              diskBytes.toInputStream(dispose = true))(info.classTag)
            maybeCacheDiskValuesInMemory(info, blockId, level, diskValues)
          } else {
            val stream = maybeCacheDiskBytesInMemory(info, blockId, level, diskBytes)
              .map {_.toInputStream(dispose = false)}
              .getOrElse { diskBytes.toInputStream(dispose = true) }
            serializerManager.dataDeserializeStream(blockId, stream)(info.classTag)
          }
        }
        val ci = CompletionIterator[Any, Iterator[Any]](iterToReturn, releaseLock(blockId))
        Some(new BlockResult(ci, DataReadMethod.Disk, info.size))
      } else {
        handleLocalReadFailure(blockId)
      }
  }
}
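In the listing above, the values are wrapped in a CompletionIterator whose completion action is releaseLock(blockId), so the read lock is released when the data has been consumed. The idea can be sketched as follows; ToyCompletionIterator is a hypothetical simplified class written for illustration, not Spark's own.

```scala
object CompletionIteratorSketch {
  // Wraps an iterator and runs `onComplete` exactly once, when hasNext first
  // reports that the underlying iterator is exhausted.
  final class ToyCompletionIterator[A](sub: Iterator[A], onComplete: () => Unit)
      extends Iterator[A] {
    private var completed = false

    def hasNext: Boolean = {
      val more = sub.hasNext
      if (!more && !completed) {
        completed = true
        onComplete() // e.g. release the read lock on the block
      }
      more
    }

    def next(): A = sub.next()
  }
}
```

A consequence of this design is that the callback fires only if the consumer reaches the end of the iterator; a caller that stops early has to release the resource some other way.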

From the remote side, BlockManager's get calls the getRemoteValues method.

Source code of getRemoteValues in BlockManager.scala:

private def getRemoteValues[T: ClassTag](blockId: BlockId): Option[BlockResult] = {
  val ct = implicitly[ClassTag[T]]
  getRemoteBytes(blockId).map { data =>
    val values =
      serializerManager.dataDeserializeStream(blockId, data.toInputStream(dispose = true))(ct)
    new BlockResult(values, DataReadMethod.Network, data.size)
  }
}

getRemoteValues calls getRemoteBytes, which fetches the data from a remote node through blockTransferService.fetchBlockSync.

Source code of getRemoteBytes in BlockManager.scala:

def getRemoteBytes(blockId: BlockId): Option[ChunkedByteBuffer] = {
  logDebug(s"Getting remote block $blockId")
  require(blockId != null, "BlockId is null")
  var runningFailureCount = 0
  var totalFailureCount = 0
  val locations = getLocations(blockId)
  val maxFetchFailures = locations.size
  var locationIterator = locations.iterator
  while (locationIterator.hasNext) {
    val loc = locationIterator.next()
    logDebug(s"Getting remote block $blockId from $loc")
    val data = try {
      blockTransferService.fetchBlockSync(
        loc.host, loc.port, loc.executorId, blockId.toString).nioByteBuffer()
    } catch {
      case NonFatal(e) =>
        runningFailureCount += 1
        totalFailureCount += 1

        if (totalFailureCount >= maxFetchFailures) {
          // Give up trying anymore locations. Either we've tried all of the original locations,
          // or we've refreshed the list of locations from the master, and have still
          // hit failures after trying locations from the refreshed list.
          logWarning(s"Failed to fetch block after $totalFailureCount fetch failures. " +
            s"Most recent failure cause:", e)
          return None
        }

        logWarning(s"Failed to fetch remote block $blockId " +
          s"from $loc (failed attempt $runningFailureCount)", e)

        // If there is a large number of executors then locations list can contain a
        // large number of stale entries causing a large number of retries that may
        // take a significant amount of time. To get rid of these stale entries
        // we refresh the block locations after a certain number of fetch failures
        if (runningFailureCount >= maxFailuresBeforeLocationRefresh) {
          locationIterator = getLocations(blockId).iterator
          logDebug(s"Refreshed locations from the driver " +
            s"after ${runningFailureCount} fetch failures.")
          runningFailureCount = 0
        }

        // This location failed, so we retry fetch from a different one by returning null here
        null
    }

    if (data != null) {
      return Some(new ChunkedByteBuffer(data))
    }
    logDebug(s"The value of block $blockId is null")
  }
  logDebug(s"Block $blockId not found")
  None
}
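The retry loop above keeps two counters: a total that bounds the overall number of attempts, and a running count that triggers a refresh of the location list. A minimal sketch of the same bookkeeping follows. The names fetchFromAny, locate, fetch and maxFailuresBeforeRefresh are hypothetical: locate stands in for getLocations and fetch for the call to blockTransferService.fetchBlockSync, and both are supplied by the caller.

```scala
import scala.util.control.NonFatal

object RemoteFetchSketch {
  // Tries each location in turn; gives up once the number of failures reaches
  // the number of locations that were known at the start.
  def fetchFromAny[A](
      locate: () => Seq[String],
      fetch: String => A,
      maxFailuresBeforeRefresh: Int): Option[A] = {
    val locations = locate()
    val maxFetchFailures = locations.size
    var locationIterator = locations.iterator
    var runningFailures = 0
    var totalFailures = 0
    var result: Option[A] = None
    var givenUp = false
    while (result.isEmpty && !givenUp && locationIterator.hasNext) {
      val loc = locationIterator.next()
      try {
        result = Some(fetch(loc))
      } catch {
        case NonFatal(_) =>
          runningFailures += 1
          totalFailures += 1
          if (totalFailures >= maxFetchFailures) {
            givenUp = true // every known location has had its chance
          } else if (runningFailures >= maxFailuresBeforeRefresh) {
            locationIterator = locate().iterator // drop stale entries
            runningFailures = 0
          }
      }
    }
    result
  }
}
```

Unlike the original, this sketch avoids early return statements and records the outcome in a local variable; the counting logic is otherwise the same.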

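To sum up BlockManager's get method: if the data is available locally, it is returned from the local store; if it is available remotely, it is fetched from the remote node and returned; if it is found in neither place, None is returned. The return type of get is Option[BlockResult], and an Option has two possible cases: (1) if there is content, the result is Some[BlockResult]; (2) if there is no content, the result is None. This is basic Option usage.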

The source code of Option.scala is as follows:

sealed abstract class Option[+A] extends Product with Serializable {
  self =>
  .....
final case class Some[+A](x: A) extends Option[A] {
  def isEmpty = false
  def get = x
}
.......
case object None extends Option[Nothing] {
  def isEmpty = true
  def get = throw new NoSuchElementException("None.get")
}
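For readers less familiar with Option, here is a minimal usage example of the two cases. The lookup table and the block ids are made up for illustration.

```scala
object OptionSketch {
  // Hypothetical lookup table standing in for locally cached blocks.
  private val localBlocks = Map("rdd_0_0" -> Vector(1, 2, 3))

  // Map.get already returns an Option: Some(value) on a hit, None on a miss.
  def get(blockId: String): Option[Vector[Int]] = localBlocks.get(blockId)

  // Pattern matching forces both cases to be handled explicitly.
  def describe(blockId: String): String = get(blockId) match {
    case Some(block) => s"found ${block.size} values"
    case None        => "not found, must compute"
  }
}
```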

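Back in BlockManager's getOrElseUpdate method, the result returned by get is pattern matched. If there is data, the Some(block) case returns Left(block): the block was obtained. If there is no data, the result is None and the block has to be computed.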

Source code of BlockManager's getOrElseUpdate:

def getOrElseUpdate[T](
    blockId: BlockId,
    level: StorageLevel,
    classTag: ClassTag[T],
    makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
  // Attempt to read the block from local or remote storage. If it's present, then we don't need
  // to go through the local-get-or-put path.
  get[T](blockId)(classTag) match {
    case Some(block) =>
      return Left(block)
    case _ =>
      // Need to compute the block.
  }
  ......

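Back in the getOrCompute method of RDD.scala: when it calls SparkEnv.get.blockManager.getOrElseUpdate, it passes blockId, storageLevel and elementClassTag, and as the fourth argument an anonymous function that calls computeOrReadCheckpoint(partition, context). Inside getOrElseUpdate, the data is looked up by blockId. If cached data is found, it is returned; if not, doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) is called to compute it. The second parameter of doPutIterator, makeIterator, is the anonymous function that was passed into getOrElseUpdate, and calling it produces the Iterator over the data.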

Source code of RDD.getOrCompute:

private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  val blockId = RDDBlockId(id, partition.index)
  var readCachedBlock = true
  // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
    readCachedBlock = false
    computeOrReadCheckpoint(partition, context)
  })
  …….
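The readCachedBlock flag in this excerpt relies on the anonymous function being invoked only on a cache miss: the flag starts as true and is flipped inside the function body. A small sketch of that trick, assuming a hypothetical in-memory cache in place of BlockManager:

```scala
import scala.collection.mutable

object ReadCachedFlagSketch {
  // Hypothetical cache: the thunk runs only on a miss.
  private val cache = mutable.Map.empty[Int, Vector[Int]]

  private def getOrElseUpdate(id: Int, makeIterator: () => Iterator[Int]): Vector[Int] =
    cache.getOrElseUpdate(id, makeIterator().toVector)

  // Returns the data plus whether it was served from the cache.
  def getOrCompute(id: Int): (Vector[Int], Boolean) = {
    var readCachedBlock = true
    val data = getOrElseUpdate(id, () => {
      readCachedBlock = false // only reached when the block has to be computed
      Iterator(id, id + 1)
    })
    (data, readCachedBlock)
  }
}
```

The first call for a given id reports false (the block was computed) and the second reports true (the block was read from the cache), which is the distinction getOrCompute uses when deciding whether to update the input metrics.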

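As for computeOrReadCheckpoint: if the RDD has been checkpointed and the checkpoint has been materialized, the data is read directly from the iterator of the parent RDD (once checkpointing has completed, the parent is the checkpoint RDD); if no checkpoint has been materialized, the RDD's data is recomputed.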

Source code of computeOrReadCheckpoint in RDD.scala:

private[spark] def computeOrReadCheckpoint(split: Partition, context: TaskContext): Iterator[T] =
{
  if (isCheckpointedAndMaterialized) {
    firstParent[T].iterator(split, context)
  } else {
    compute(split, context)
  }
}
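The same decision can be shown in a toy model. Everything here is hypothetical: ToyRDD holds its checkpoint in an Option instead of a separate checkpoint RDD, and computeCalls exists only so the effect can be observed.

```scala
object CheckpointSketch {
  // `checkpointData` is Some(...) once the checkpoint has been materialized;
  // `computeCalls` counts how often the lineage is re-run.
  final class ToyRDD(source: Vector[Int]) {
    var computeCalls = 0
    private var checkpointData: Option[Vector[Int]] = None

    private def compute(): Iterator[Int] = {
      computeCalls += 1
      source.iterator.map(_ * 10)
    }

    // Materializing the checkpoint runs the computation once and saves it.
    def checkpoint(): Unit = {
      checkpointData = Some(compute().toVector)
    }

    def computeOrReadCheckpoint(): Iterator[Int] = checkpointData match {
      case Some(saved) => saved.iterator // read the saved data, skip the lineage
      case None        => compute()
    }
  }
}
```

After checkpoint() has run, further reads no longer increase computeCalls.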

If getOrElseUpdate in BlockManager.scala does not find the data for the given blockId, it calls doPutIterator, which persists the data again through BlockManager.

Source code of the getOrElseUpdate method in BlockManager.scala:

def getOrElseUpdate[T](
    blockId: BlockId,
    level: StorageLevel,
    classTag: ClassTag[T],
    makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
  // Attempt to read the block from local or remote storage. If it's present, then we don't need
  // to go through the local-get-or-put path.
  get[T](blockId)(classTag) match {
    case Some(block) =>
      return Left(block)
    case _ =>
      // Need to compute the block.
  }
  // Initially we hold no locks on this block.
  doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
  …….

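getOrElseUpdate in BlockManager.scala calls doPutIterator. doPutIterator takes the data produced by makeIterator, whether read from the parent RDD's checkpoint or recomputed, and stores it in memory; if there is not enough memory and the storage level allows disk, the data is persisted to disk instead.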

Source code of the doPutIterator method in BlockManager.scala:

private def doPutIterator[T](
    blockId: BlockId,
    iterator: () => Iterator[T],
    level: StorageLevel,
    classTag: ClassTag[T],
    tellMaster: Boolean = true,
    keepReadLock: Boolean = false): Option[PartiallyUnrolledIterator[T]] = {
  doPut(blockId, level, classTag, tellMaster = tellMaster, keepReadLock = keepReadLock) { info =>
    val startTimeMs = System.currentTimeMillis
    var iteratorFromFailedMemoryStorePut: Option[PartiallyUnrolledIterator[T]] = None
    // Size of the block in bytes
    var size = 0L
    if (level.useMemory) {
      // Put it in memory first, even if it also has useDisk set to true;
      // We will drop it to disk later if the memory store can't hold it.
      if (level.deserialized) {
        memoryStore.putIteratorAsValues(blockId, iterator(), classTag) match {
          case Right(s) =>
            size = s
          case Left(iter) =>
            // Not enough space to unroll this block; drop to disk if applicable
            if (level.useDisk) {
              logWarning(s"Persisting block $blockId to disk instead.")
              diskStore.put(blockId) { fileOutputStream =>
                serializerManager.dataSerializeStream(blockId, fileOutputStream, iter)(classTag)
              }
              size = diskStore.getSize(blockId)
            } else {
              iteratorFromFailedMemoryStorePut = Some(iter)
            }
        }
      } else { // !level.deserialized
        memoryStore.putIteratorAsBytes(blockId, iterator(), classTag, level.memoryMode) match {
          case Right(s) =>
            size = s
          case Left(partiallySerializedValues) =>
            // Not enough space to unroll this block; drop to disk if applicable
            if (level.useDisk) {
              logWarning(s"Persisting block $blockId to disk instead.")
              diskStore.put(blockId) { fileOutputStream =>
                partiallySerializedValues.finishWritingToStream(fileOutputStream)
              }
              size = diskStore.getSize(blockId)
            } else {
              iteratorFromFailedMemoryStorePut = Some(partiallySerializedValues.valuesIterator)
            }
        }
      }

    } else if (level.useDisk) {
      diskStore.put(blockId) { fileOutputStream =>
        serializerManager.dataSerializeStream(blockId, fileOutputStream, iterator())(classTag)
      }
      size = diskStore.getSize(blockId)
    }

    val putBlockStatus = getCurrentBlockStatus(blockId, info)
    val blockWasSuccessfullyStored = putBlockStatus.storageLevel.isValid
    if (blockWasSuccessfullyStored) {
      // Now that the block is in either the memory or disk store, tell the master about it.
      info.size = size
      if (tellMaster && info.tellMaster) {
        reportBlockStatus(blockId, putBlockStatus)
      }
      addUpdatedBlockStatusToTaskMetrics(blockId, putBlockStatus)
      logDebug("Put block %s locally took %s".format(blockId, Utils.getUsedTimeMs(startTimeMs)))
      if (level.replication > 1) {
        val remoteStartTime = System.currentTimeMillis
        val bytesToReplicate = doGetLocalBytes(blockId, info)
        // [SPARK-16550] Erase the typed classTag when using default serialization, since
        // NettyBlockRpcServer crashes when deserializing repl-defined classes.
        // TODO(ekl) remove this once the classloader issue on the remote end is fixed.
        val remoteClassTag = if (!serializerManager.canUseKryo(classTag)) {
          scala.reflect.classTag[Any]
        } else {
          classTag
        }
        try {
          replicate(blockId, bytesToReplicate, level, remoteClassTag)
        } finally {
          bytesToReplicate.unmap()
        }
        logDebug("Put block %s remotely took %s"
          .format(blockId, Utils.getUsedTimeMs(remoteStartTime)))
      }
    }
    assert(blockWasSuccessfullyStored == iteratorFromFailedMemoryStorePut.isEmpty)
    iteratorFromFailedMemoryStorePut
  }
}
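The storage decision in doPutIterator boils down to three outcomes: stored in memory, stored on disk, or not stored at all, in which case the values are handed back. The sketch below models only that decision. Level and ToyStores are hypothetical, sizes are counted in elements, and the data is fully materialized up front, whereas the real memory store unrolls the iterator incrementally.

```scala
import scala.collection.mutable

object PutSketch {
  // Hypothetical storage level with just the two flags this sketch needs.
  final case class Level(useMemory: Boolean, useDisk: Boolean)

  final class ToyStores(memoryLimit: Int) {
    val memory = mutable.Map.empty[String, Vector[Int]]
    val disk = mutable.Map.empty[String, Vector[Int]]

    // Returns None when the block was stored, or Some(values) when it could
    // not be stored anywhere and the caller must use the values uncached.
    def putIterator(id: String, values: Iterator[Int], level: Level): Option[Iterator[Int]] = {
      val unrolled = values.toVector
      if (level.useMemory && unrolled.size <= memoryLimit) {
        memory(id) = unrolled // memory is tried first
        None
      } else if (level.useDisk) {
        disk(id) = unrolled // fall back to disk when the level allows it
        None
      } else {
        Some(unrolled.iterator) // nothing stored: hand the values back
      }
    }
  }
}
```

The None/Some return value mirrors the Option returned by doPutIterator, which getOrElseUpdate then turns into Left or Right.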

The inner workings of CacheManager can be summarized as follows:

Figure 9-1: Cache diagram

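First, the RDD's iterator method is called:

(I) If the RDD is cached in memory or on disk, the data is fetched through BlockManager from Local or Remote.

- If the cached data is obtained successfully: BlockManager first tries to get the data locally and, if that fails, fetches it from a remote node.

- If the cached data cannot be obtained directly: Spark first checks whether the current RDD has been checkpointed. If it has, the checkpointed data is read directly; otherwise the data must be computed. Because the RDD is supposed to be cached, the result is then persisted again through BlockManager.

1) If the persistence level caches to disk only, the data is written to disk directly through BlockManager's doPut method (replication must be taken into account).

2) If memory is specified for caching, memory is tried first: memoryStore's unrollSafely method attempts to unroll the data into memory safely. If there is not enough memory, some space is freed first, and the newest data to be cached is then placed in the freed space.

(II) If the RDD is not cached in memory or on disk, the data is computed directly through the RDD's compute method, possibly taking Checkpoint into account.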
