Spark Source Code Reading Notes: DiskStore
BlockManager delegates the actual storage of data to BlockStore. BlockStore is an abstract class with three implementations: DiskStore (disk-level persistence), MemoryStore (memory-level persistence), and TachyonStore (persistence on the Tachyon in-memory distributed file system).
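To see where DiskStore sits relative to the user-facing API: persisting an RDD with a disk-only storage level makes the BlockManager hand each computed partition to DiskStore. A minimal sketch, assuming a local-mode application with arbitrary data:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Arbitrary local application, for illustration only.
val sc = new SparkContext(new SparkConf().setAppName("diskstore-demo").setMaster("local[2]"))
val rdd = sc.parallelize(1 to 1000)

// DISK_ONLY means each partition is serialized and written to a local file
// managed by DiskStore / DiskBlockManager (under a blockmgr-* directory).
rdd.persist(StorageLevel.DISK_ONLY)
rdd.count()  // the first action triggers computation and the on-disk write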
DiskStore uses DiskBlockManager to maintain the mapping between a Block and its file on disk, and so stores Blocks in disk files. DiskBlockManager derives the root directories for storing Blocks from the configured local root directories (there may be several, comma-separated): YARN_LOCAL_DIRS or LOCAL_DIRS in YARN mode, SPARK_LOCAL_DIRS or spark.local.dir in other modes (default System.getProperty("java.io.tmpdir")). For each configured root it creates a Block root directory of the form …/blockmgr-UUID.randomUUID.toString (YARN mode) or …/spark-UUID.randomUUID.toString/blockmgr-UUID.randomUUID.toString (other modes). Under each of these roots, DiskBlockManager also creates conf.getInt("spark.diskStore.subDirectories", 64) subdirectories to hold the Block files: each Block is hashed by its name into one of the subdirectories and stored there in a file named after the Block's name.
Creates and maintains the logical mapping between logical blocks and physical on-disk locations. By default, one block is mapped to one file with a name given by its BlockId. However, it is also possible to have a block map to only a segment of a file, by calling mapBlockToFileSegment().
Block files are hashed among the directories listed in spark.local.dir (or in SPARK_LOCAL_DIRS, if it’s set).
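The directory layout described above is entirely configuration-driven. A sketch of the relevant settings, with made-up paths and the default subdirectory count:

import org.apache.spark.SparkConf

// Hypothetical values, for illustration only: two comma-separated local roots
// and the default 64 hash subdirectories per root.
val conf = new SparkConf()
  .set("spark.local.dir", "/data1/spark-tmp,/data2/spark-tmp")
  .set("spark.diskStore.subDirectories", "64")
// With this configuration DiskBlockManager creates a blockmgr-<UUID> directory
// under each root and hashes every block file into one of the 64 subdirectories.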
DiskBlockManager properties
- blockManager: BlockManager
- subDirsPerLocalDir: Int = blockManager.conf.getInt("spark.diskStore.subDirectories", 64)
The number of subdirectories created under each local root directory. Subdirectories exist to avoid having too many inodes at the top level; as the source comment puts it: "Create one local directory for each path mentioned in spark.local.dir; then, inside this directory, create multiple subdirectories that we will hash files into, in order to avoid having really large inodes at the top level."
- localDirs: Array[File]
The local root directories that hold the Block files, derived from YARN_LOCAL_DIRS or LOCAL_DIRS (YARN mode), or from SPARK_LOCAL_DIRS or spark.local.dir (other modes, default System.getProperty("java.io.tmpdir")).
- subDirs: Array[Array[File]], of size localDirs.length × subDirsPerLocalDir
A two-dimensional array holding all of the subdirectories.
DiskBlockManager methods
- getFile(filename: String): File
Looks up a file by its file name. The method first hashes filename into a subdirectory (subDirs[hash % localDirs.length][(hash / localDirs.length) % subDirsPerLocalDir]), then checks whether that subdirectory exists and creates it if it does not.
/** Looks up a file by hashing it into one of our local subdirectories. */
def getFile(filename: String): File = {
  // Figure out which local directory it hashes to, and which subdirectory in that
  val hash = Utils.nonNegativeHash(filename)
  val dirId = hash % localDirs.length
  val subDirId = (hash / localDirs.length) % subDirsPerLocalDir

  // Create the subdirectory if it doesn't already exist
  var subDir = subDirs(dirId)(subDirId)
  if (subDir == null) {
    subDir = subDirs(dirId).synchronized {
      val old = subDirs(dirId)(subDirId)
      if (old != null) {
        old
      } else {
        val newDir = new File(localDirs(dirId), "%02x".format(subDirId))
        if (!newDir.exists() && !newDir.mkdir()) {
          throw new IOException(s"Failed to create local dir in $newDir.")
        }
        subDirs(dirId)(subDirId) = newDir
        newDir
      }
    }
  }

  new File(subDir, filename)
}
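For intuition, the directory selection in getFile can be sketched outside Spark. This is illustration only: it assumes two local roots and the default 64 subdirectories, uses a made-up block file name, and approximates Utils.nonNegativeHash with math.abs of hashCode:

// Standalone sketch of getFile's directory hashing (not Spark code).
val filename = "rdd_0_1"                  // hypothetical block file name
val hash = math.abs(filename.hashCode)    // stand-in for Utils.nonNegativeHash
val numLocalDirs = 2                      // assumed number of configured local roots
val subDirsPerLocalDir = 64               // default spark.diskStore.subDirectories
val dirId = hash % numLocalDirs
val subDirId = (hash / numLocalDirs) % subDirsPerLocalDir
// The file would live at localDirs(dirId)/"%02x".format(subDirId)/filename
println(s"dirId=$dirId, subDir=${"%02x".format(subDirId)}")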
- getFile(blockId: BlockId): File = getFile(blockId.name)
Returns the file for the given BlockId.
- containsBlock(blockId: BlockId): Boolean = getFile(blockId.name).exists()
Checks whether the Block for the given BlockId is stored on this local disk.
- getAllFiles(): Seq[File]
Returns all files currently stored on disk.

/** List all the files currently stored on disk by the disk manager. */
def getAllFiles(): Seq[File] = {
  // Get all the files inside the array of array of directories
  subDirs.flatten.filter(_ != null).flatMap { dir =>
    val files = dir.listFiles()
    if (files != null) files else Seq.empty
  }
}

- getAllBlocks(): Seq[BlockId] = getAllFiles().map(f => BlockId(f.getName))
Returns the BlockIds of all Blocks stored on disk.
- createTempLocalBlock(): (TempLocalBlockId, File)
Creates a local temporary file for storing intermediate results.

/** Produces a unique block id and File suitable for storing local intermediate results. */
def createTempLocalBlock(): (TempLocalBlockId, File) = {
  var blockId = new TempLocalBlockId(UUID.randomUUID())
  while (getFile(blockId).exists()) {
    blockId = new TempLocalBlockId(UUID.randomUUID())
  }
  (blockId, getFile(blockId))
}
- createTempShuffleBlock(): (TempShuffleBlockId, File)
Creates a temporary file used by sort-based shuffle: "Produces a unique block id and File suitable for storing shuffled intermediate results."
def createTempShuffleBlock(): (TempShuffleBlockId, File) = {
  var blockId = new TempShuffleBlockId(UUID.randomUUID())
  while (getFile(blockId).exists()) {
    blockId = new TempShuffleBlockId(UUID.randomUUID())
  }
  (blockId, getFile(blockId))
}
DiskStore properties
- blockManager: BlockManager
- diskManager: DiskBlockManager
- minMemoryMapBytes: Long = blockManager.conf.getLong("spark.storage.memoryMapThreshold", 2 * 1024L * 1024L)
The threshold for memory-mapping a file: when the requested length reaches this value, getBytes memory-maps the file region instead of reading it directly into a byte buffer. For example, with the default threshold of 2 MB, a 10 MB block file is memory-mapped, whereas a 100 KB file is read straight into a heap ByteBuffer.
DiskStore methods
- def putBytes(blockId: BlockId, _bytes: ByteBuffer, level: StorageLevel): PutResult
Stores the byte buffer for the given BlockId to disk.
override def putBytes(blockId: BlockId, _bytes: ByteBuffer, level: StorageLevel): PutResult = {
  // So that we do not modify the input offsets !
  // duplicate does not copy buffer, so inexpensive
  val bytes = _bytes.duplicate()
  logDebug(s"Attempting to put block $blockId")
  val startTime = System.currentTimeMillis
  val file = diskManager.getFile(blockId)
  val channel = new FileOutputStream(file).getChannel
  while (bytes.remaining > 0) {
    channel.write(bytes)
  }
  channel.close()
  val finishTime = System.currentTimeMillis
  logDebug("Block %s stored as %s file on disk in %d ms".format(
    file.getName, Utils.bytesToString(bytes.limit), finishTime - startTime))
  PutResult(bytes.limit(), Right(bytes.duplicate()))
}
- putIterator(blockId: BlockId, values: Iterator[Any], level: StorageLevel, returnValues: Boolean): PutResult
Stores the data of the Iterator for the given BlockId to disk: the method serializes the Iterator and writes the result into the corresponding file.
override def putIterator(
    blockId: BlockId,
    values: Iterator[Any],
    level: StorageLevel,
    returnValues: Boolean): PutResult = {
  logDebug(s"Attempting to write values for block $blockId")
  val startTime = System.currentTimeMillis
  val file = diskManager.getFile(blockId)
  val outputStream = new FileOutputStream(file)
  try {
    try {
      blockManager.dataSerializeStream(blockId, outputStream, values)
    } finally {
      // Close outputStream here because it should be closed before file is deleted.
      outputStream.close()
    }
  } catch {
    case e: Throwable =>
      if (file.exists()) {
        file.delete()
      }
      throw e
  }
  val length = file.length

  val timeTaken = System.currentTimeMillis - startTime
  logDebug("Block %s stored as %s file on disk in %d ms".format(
    file.getName, Utils.bytesToString(length), timeTaken))

  if (returnValues) {
    // Return a byte buffer for the contents of the file
    val buffer = getBytes(blockId).get
    PutResult(length, Right(buffer))
  } else {
    PutResult(length, null)
  }
}
- putArray(blockId: BlockId, values: Array[Any], level: StorageLevel, returnValues: Boolean): PutResult
Stores the Array for the given BlockId to disk; the Array is converted to an Iterator and serialized into the corresponding file via putIterator.
override def putArray(
    blockId: BlockId,
    values: Array[Any],
    level: StorageLevel,
    returnValues: Boolean): PutResult = {
  putIterator(blockId, values.toIterator, level, returnValues)
}
- getBytes(file: File, offset: Long, length: Long): Option[ByteBuffer]
Low-level method that reads length bytes starting at offset from the file. It compares length against minMemoryMapBytes: if length reaches the threshold, the file region is memory-mapped; otherwise the bytes are read directly into a byte buffer.
private def getBytes(file: File, offset: Long, length: Long): Option[ByteBuffer] = {
  val channel = new RandomAccessFile(file, "r").getChannel
  try {
    // For small files, directly read rather than memory map
    if (length < minMemoryMapBytes) {
      val buf = ByteBuffer.allocate(length.toInt)
      channel.position(offset)
      while (buf.remaining() != 0) {
        if (channel.read(buf) == -1) {
          throw new IOException("Reached EOF before filling buffer\n" +
            s"offset=$offset\nfile=${file.getAbsolutePath}\nbuf.remaining=${buf.remaining}")
        }
      }
      buf.flip()
      Some(buf)
    } else {
      Some(channel.map(MapMode.READ_ONLY, offset, length))
    }
  } finally {
    channel.close()
  }
}
- getBytes(blockId: BlockId): Option[ByteBuffer]
Reads the contents stored on disk for the given BlockId.
override def getBytes(blockId: BlockId): Option[ByteBuffer] = {
  val file = diskManager.getFile(blockId.name)
  getBytes(file, 0, file.length)
}
- getBytes(segment: FileSegment): Option[ByteBuffer] = getBytes(segment.file, segment.offset, segment.length)
Reads the contents described by a FileSegment, which holds a file together with the offset and length of the data to read: FileSegment(val file: File, val offset: Long, val length: Long).
- getValues(blockId: BlockId): Option[Iterator[Any]]
Reads the contents for the given BlockId and deserializes them into an Iterator.
override def getValues(blockId: BlockId): Option[Iterator[Any]] = {
  getBytes(blockId).map(buffer => blockManager.dataDeserialize(blockId, buffer))
}
- getValues(blockId: BlockId, serializer: Serializer): Option[Iterator[Any]]
Reads the contents for the given BlockId and deserializes them into an Iterator using a custom Serializer.
/**
 * A version of getValues that allows a custom serializer. This is used as part of the
 * shuffle short-circuit code.
 */
def getValues(blockId: BlockId, serializer: Serializer): Option[Iterator[Any]] = {
  // TODO: Should bypass getBytes and use a stream based implementation, so that
  // we won't use a lot of memory during e.g. external sort merge.
  getBytes(blockId).map(bytes => blockManager.dataDeserialize(blockId, bytes, serializer))
}
- getSize(blockId: BlockId): Long = diskManager.getFile(blockId.name).length
Returns the size of the Block stored on this local disk for the given BlockId.
- remove(blockId: BlockId): Boolean
Removes the Block stored for the given BlockId.
override def remove(blockId: BlockId): Boolean = {
  val file = diskManager.getFile(blockId.name)
  // If consolidation mode is used With HashShuffleMananger, the physical filename for the block
  // is different from blockId.name. So the file returns here will not be exist, thus we avoid to
  // delete the whole consolidated file by mistake.
  if (file.exists()) {
    file.delete()
  } else {
    false
  }
}
- contains(blockId: BlockId): Boolean
Checks whether a Block for the given BlockId is stored on disk.
override def contains(blockId: BlockId): Boolean = {
  val file = diskManager.getFile(blockId.name)
  file.exists()
}