Shuffle Read and Write Operations (Part 1)
Source: Internet  Editor: 程序博客网  Date: 2024/06/05 22:54
Below is the runTask method of ShuffleMapTask. Its main job is to call the write method of HashShuffleWriter, which performs the actual shuffle write.
override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  // Record when deserialization starts
  val deserializeStartTime = System.currentTimeMillis()
  // Obtain an instance of the closure serializer
  val ser = SparkEnv.get.closureSerializer.newInstance()
  // Deserialize the RDD and its ShuffleDependency; the bytes come from taskBinary
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  // Time the executor spent on deserialization
  _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime

  metrics = Some(context.taskMetrics)
  var writer: ShuffleWriter[Any, Any] = null
  try {
    // Get the ShuffleManager
    val manager = SparkEnv.get.shuffleManager
    // Get the ShuffleWriter for this partition. dep.shuffleHandle identifies the shuffle;
    // partitionId is the partition of the current RDD, so each write covers one partition
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    // Compute this partition with rdd.iterator(), then pass the records to writer.write()
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    // Stop the writer and return the resulting MapStatus
    writer.stop(success = true).get
  } catch {
    case e: Exception =>
      try {
        if (writer != null) {
          writer.stop(success = false)
        }
      } catch {
        case e: Exception =>
          log.debug("Could not stop writer", e)
      }
      throw e
  }
}
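The error-handling shape of runTask can be isolated from Spark. The following is a minimal sketch, not Spark code: the `Writer` trait, `runWith`, and the `String` status are hypothetical stand-ins that only mirror the write/stop calls seen in the excerpt. It shows that the writer is stopped with `success = false` on failure and that the original exception is still rethrown.

```scala
object RunTaskSketch {
  // Hypothetical stand-in for a shuffle writer; only the two calls used in runTask.
  trait Writer {
    def write(records: Iterator[(String, Int)]): Unit
    def stop(success: Boolean): Option[String]
  }

  // Write all records, then stop the writer; on failure, stop it with
  // success = false and rethrow the original exception.
  def runWith(makeWriter: () => Writer, records: Iterator[(String, Int)]): String = {
    var writer: Writer = null
    try {
      writer = makeWriter()
      writer.write(records)
      writer.stop(success = true).get
    } catch {
      case e: Exception =>
        try {
          if (writer != null) writer.stop(success = false)
        } catch {
          // A failure while stopping must not hide the original error.
          case inner: Exception => Console.err.println(s"Could not stop writer: $inner")
        }
        throw e
    }
  }
}
```

The inner try/catch exists so that a failure during cleanup is only logged, leaving the first exception as the one the caller sees.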
The write method of HashShuffleWriter is shown below:
/**
 * Write a bunch of records to this task's output.
 *
 * This method does two things:
 * 1) Decides whether map-side aggregation is needed. For example, if <hello,1> and
 *    <hello,1> are both to be written, they are first combined into <hello,2>
 *    before the write continues.
 * 2) Uses the partitioner to decide which file each <k,val> pair is written to.
 */
override def write(records: Iterator[Product2[K, V]]): Unit = {
  // Check whether an aggregator is defined, i.e. whether map-side aggregation may apply
  val iter = if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) {
      // Map-side combine is requested: aggregate the records here.
      // reduceByKey is split in two: one half of the aggregation runs in the
      // ShuffleMapTask, the other half runs on the reduce side.
      dep.aggregator.get.combineValuesByKey(records, context)
    } else {
      records
    }
  } else {
    require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
    records
  }

  // Use getPartition to decide which file each <k,val> pair is written to
  for (elem <- iter) {
    // elem is a <k,val> pair; the partitioner maps its key to a bucket (output partition)
    val bucketId = dep.partitioner.getPartition(elem._1)
    // shuffle was obtained from FileShuffleBlockResolver.forMapTask;
    // bucketId selects the writer (file), elem._1 is the key, elem._2 the value
    shuffle.writers(bucketId).write(elem._1, elem._2)
  }
}
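The two steps of write (combine, then route by partition) can be illustrated without Spark. This is a minimal sketch under stated assumptions: `getPartition`, `combineByKey`, and `writeToBuckets` are hypothetical helpers, the partitioner is assumed to be hash-based with a non-negative modulo, and in-memory vectors stand in for the per-reducer files.

```scala
object ShuffleWriteSketch {
  // Assumed hash-style partitioner: non-negative modulo of the key's hashCode.
  def getPartition(key: Any, numPartitions: Int): Int = {
    val raw = key.hashCode % numPartitions
    if (raw < 0) raw + numPartitions else raw
  }

  // Map-side combine for a word count: sum the values of records sharing a key.
  def combineByKey(records: Iterator[(String, Int)]): Iterator[(String, Int)] = {
    val combined = scala.collection.mutable.LinkedHashMap.empty[String, Int]
    for ((k, v) <- records) combined(k) = combined.getOrElse(k, 0) + v
    combined.iterator
  }

  // Route each combined record to a bucket; each bucket stands in for one output file.
  def writeToBuckets(records: Iterator[(String, Int)],
                     numBuckets: Int): Vector[Vector[(String, Int)]] = {
    val buckets = Array.fill(numBuckets)(Vector.newBuilder[(String, Int)])
    for ((k, v) <- combineByKey(records)) {
      buckets(getPartition(k, numBuckets)) += ((k, v))
    }
    buckets.map(_.result()).toVector
  }
}
```

Given the records (hello,1), (hello,1), (world,1), the combine step leaves a single (hello,2) record, so only two records reach the buckets instead of three.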
The class-level documentation of FileShuffleBlockResolver reads as follows:
/**
 * Manages assigning disk-based block writers to shuffle tasks. Each shuffle task gets one file
 * per reducer (this set of files is called a ShuffleFileGroup).
 *
 * As an optimization to reduce the number of physical shuffle files produced, multiple shuffle
 * blocks are aggregated into the same file. There is one "combined shuffle file" per reducer
 * per concurrently executing shuffle task. As soon as a task finishes writing to its shuffle
 * files, it releases them for another task.
 *
 * Regarding the implementation of this feature, shuffle files are identified by a 3-tuple:
 *   - shuffleId: The unique id given to the entire shuffle stage.
 *   - bucketId: The id of the output partition (i.e., reducer id)
 *   - fileId: The unique id identifying a group of "combined shuffle files." Only one task at a
 *       time owns a particular fileId, and this id is returned to a pool when the task finishes.
 * Each shuffle file is then mapped to a FileSegment, which is a 3-tuple (file, offset, length)
 * that specifies where in a given file the actual block data is located.
 *
 * Shuffle file metadata is stored in a space-efficient manner. Rather than simply mapping
 * ShuffleBlockIds directly to FileSegments, each ShuffleFileGroup maintains a list of offsets for
 * each block stored in each file. In order to find the location of a shuffle block, we search the
 * files within the ShuffleFileGroups associated with the block's reducer.
 */
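The offset-list idea in that comment can be sketched as follows. This is an illustrative model only, not Spark's ShuffleFileGroup: the `FileGroup` class, the `Segment` case class, and the file naming scheme are hypothetical. Each reducer has one file, and for every map task that wrote into the group we keep the offset and length of its block in each file.

```scala
object FileGroupSketch {
  // (file, offset, length), mirroring the FileSegment 3-tuple described above.
  final case class Segment(file: String, offset: Long, length: Long)

  // Hypothetical simplified file group: one file per reducer (bucket).
  final class FileGroup(shuffleId: Int, fileId: Int, numBuckets: Int) {
    private val mapIds = scala.collection.mutable.ArrayBuffer.empty[Int]
    private val offsets =
      Array.fill(numBuckets)(scala.collection.mutable.ArrayBuffer.empty[Long])
    private val lengths =
      Array.fill(numBuckets)(scala.collection.mutable.ArrayBuffer.empty[Long])

    // Record where one finished map task's blocks landed in each reducer file.
    def recordMapOutput(mapId: Int, blockOffsets: Array[Long], blockLengths: Array[Long]): Unit = {
      require(blockOffsets.length == numBuckets && blockLengths.length == numBuckets)
      mapIds += mapId
      for (b <- 0 until numBuckets) {
        offsets(b) += blockOffsets(b)
        lengths(b) += blockLengths(b)
      }
    }

    // Look up the block written by mapId for the given reducer, if this group holds it.
    def getFileSegmentFor(mapId: Int, reducerId: Int): Option[Segment] = {
      val idx = mapIds.indexOf(mapId)
      if (idx < 0) None
      else Some(Segment(
        s"sketch_shuffle_${shuffleId}_${reducerId}_$fileId", // hypothetical file name
        offsets(reducerId)(idx),
        lengths(reducerId)(idx)))
    }
  }
}
```

Storing per-file offset lists keeps the metadata proportional to the number of blocks, without a separate map entry from every block id to a full segment object.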
The forMapTask method of this class is as follows:
/**
 * Get a ShuffleWriterGroup for the given map task, which will register it as complete
 * when the writers are closed successfully.
 * mapId corresponds to the partition ID of the RDD.
 */
def forMapTask(shuffleId: Int, mapId: Int, numBuckets: Int, serializer: Serializer,
    writeMetrics: ShuffleWriteMetrics): ShuffleWriterGroup = {
  new ShuffleWriterGroup {
    shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numBuckets))
    private val shuffleState = shuffleStates(shuffleId)
    private var fileGroup: ShuffleFileGroup = null

    val openStartTime = System.nanoTime
    val serializerInstance = serializer.newInstance()
    // If consolidateShuffleFiles is true, the writers append to the files of a pooled
    // ShuffleFileGroup (one file per bucket) instead of creating new files for this
    // task. The default is false.
    val writers: Array[DiskBlockObjectWriter] = if (consolidateShuffleFiles) {
      fileGroup = getUnusedFileGroup() // take a file group that no task is using
      Array.tabulate[DiskBlockObjectWriter](numBuckets) { bucketId =>
        // mapId corresponds to the partition ID of the RDD
        val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
        blockManager.getDiskWriter(blockId, fileGroup(bucketId), serializerInstance, bufferSize,
          writeMetrics)
      }
    } else {
      Array.tabulate[DiskBlockObjectWriter](numBuckets) { bucketId =>
        // mapId corresponds to the partition ID of the RDD
        val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
        // Resolve the final file for this block, then write to a temporary file
        // derived from it rather than to blockFile itself
        val blockFile = blockManager.diskBlockManager.getFile(blockId)
        val tmp = Utils.tempFileWith(blockFile)
        blockManager.getDiskWriter(blockId, tmp, serializerInstance, bufferSize, writeMetrics)
      }
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, so should be included in the shuffle write time.
    writeMetrics.incShuffleWriteTime(System.nanoTime - openStartTime)

    override def releaseWriters(success: Boolean) {
      if (consolidateShuffleFiles) {
        if (success) {
          val offsets = writers.map(_.fileSegment().offset)
          val lengths = writers.map(_.fileSegment().length)
          // mapId corresponds to the partition ID of the RDD
          fileGroup.recordMapOutput(mapId, offsets, lengths)
        }
        recycleFileGroup(fileGroup)
      } else {
        // mapId corresponds to the partition ID of the RDD
        shuffleState.completedMapTasks.add(mapId)
      }
    }
    // (the excerpt in the original post ends here)
  }
}
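To see why consolidation reduces the number of files, the acquire/recycle cycle can be modelled with a small pool. This is a sketch under assumptions: `FileGroupPool`, `acquire`, and `release` are hypothetical names loosely modelled on the getUnusedFileGroup and recycleFileGroup calls in the excerpt, and the file counts follow the "one file per reducer" rule from the class comment.

```scala
object FileGroupPoolSketch {
  // Hypothetical pool of file-group ids: a finished task returns its id for reuse.
  final class FileGroupPool {
    private val unused = scala.collection.mutable.Queue.empty[Int]
    private var nextFileId = 0

    // Reuse a released group if one exists, otherwise create a new one.
    def acquire(): Int =
      if (unused.nonEmpty) unused.dequeue()
      else { val id = nextFileId; nextFileId += 1; id }

    def release(fileId: Int): Unit = { unused += fileId }

    def groupsCreated: Int = nextFileId
  }

  // With consolidation: one file per reducer per file group ever created.
  def consolidatedFiles(pool: FileGroupPool, numReducers: Int): Int =
    pool.groupsCreated * numReducers

  // Without consolidation: one file per reducer per map task.
  def unconsolidatedFiles(numMapTasks: Int, numReducers: Int): Int =
    numMapTasks * numReducers
}
```

For example, four map tasks that run two at a time with three reducers need only two file groups, so six files with consolidation versus twelve without it.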