Lecture 217: The Working Mechanism and Source Code of HashShuffleWriter in Spark Shuffle
Notes taken from Wang Jialin's video lectures; feel free to repost.
1. Obtaining the shuffleManager
In a Spark job, every stage except the last one is map-side; the tasks in Stage 2 of the figure are ShuffleMapTasks. ShuffleMapTask's runTask method looks up the shuffleManager from SparkEnv.
override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val deserializeStartTime = System.currentTimeMillis()
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime

  metrics = Some(context.taskMetrics)
  var writer: ShuffleWriter[Any, Any] = null
  try {
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    writer.stop(success = true).get
  } // catch/finally omitted in these notes
2. There are three kinds of shuffleManager: HashShuffleManager, SortShuffleManager, and tungsten-sort
// Let the user specify short names for shuffle managers
val shortShuffleMgrNames = Map(
  "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
  "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager",
  "tungsten-sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)
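The lookup above can be sketched without Spark. Here `resolveShuffleMgr` is a hypothetical helper, not a Spark API: short names map to the built-in implementations, and any unrecognized name is passed through unchanged so users can plug in their own ShuffleManager class.

```scala
// Sketch of the short-name resolution used by SparkEnv, outside Spark.
// resolveShuffleMgr is a hypothetical helper mirroring the logic above.
object ShuffleMgrResolution {
  private val shortShuffleMgrNames = Map(
    "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
    "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager",
    "tungsten-sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")

  /** Resolve a user-supplied name: short names map to built-ins,
   *  anything else is treated as a fully qualified class name. */
  def resolveShuffleMgr(name: String): String =
    shortShuffleMgrNames.getOrElse(name.toLowerCase, name)
}
```

Because the fallback returns the name itself, a custom class such as `com.example.MyShuffleManager` resolves to itself and is then instantiated by reflection.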
3. When spark.shuffle.manager is set to hash (note that the default in the code above is sort), the writer used is HashShuffleWriter. Let's look at HashShuffleWriter's write method.
HashShuffleWriter's write method takes the input RDD's iterator and writes it out as this task's output, to be read by the next stage.
It first checks whether aggregation is required: with map-side combine, combineValuesByKey is applied; otherwise the records pass through unchanged. It then loops over iter, whose element type is Iterator[Product2[K, Any]] (the first element is the key, the second the value); the partitioner's getPartition on the key yields the bucketId, and each (key, value) pair is written to the corresponding shuffle file.
/** Write a bunch of records to this task's output */
override def write(records: Iterator[Product2[K, V]]): Unit = {
  val iter = if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) {
      dep.aggregator.get.combineValuesByKey(records, context)
    } else {
      records
    }
  } else {
    require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
    records
  }

  for (elem <- iter) {
    val bucketId = dep.partitioner.getPartition(elem._1)
    shuffle.writers(bucketId).write(elem._1, elem._2)
  }
}
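The bucketId in the loop above comes from the stage's partitioner; with the common HashPartitioner it is essentially a non-negative modulo of the key's hashCode. A standalone sketch of that idea (`hashBucket` is a hypothetical stand-in, not Spark's class):

```scala
// Standalone sketch of HashPartitioner-style bucket assignment.
// hashBucket is a hypothetical helper mirroring the idea, not Spark code.
object BucketSketch {
  /** Map a key to a bucket in [0, numBuckets), keeping the result
   *  non-negative even when hashCode is negative. */
  def hashBucket(key: Any, numBuckets: Int): Int = {
    val raw = key.hashCode % numBuckets
    if (raw < 0) raw + numBuckets else raw
  }
}
```

The non-negative adjustment matters because Java's `%` can return a negative remainder, and a negative index would make `shuffle.writers(bucketId)` throw.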
4. The write method of DiskBlockObjectWriter in Spark 1.6.0
/**
 * Writes a key-value pair.
 */
def write(key: Any, value: Any) {
  if (!initialized) {
    open()
  }
  objOut.writeKey(key)
  objOut.writeValue(value)
  recordWritten()
}
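The pattern here is lazy initialization: the underlying file stream is opened only on the first write, so a writer whose bucket receives no records never touches the disk. A minimal standalone sketch of that idiom (`LazyWriter` is hypothetical and buffers in memory instead of writing to disk):

```scala
// Minimal sketch of the lazy-open idiom used by DiskBlockObjectWriter.
// LazyWriter is a hypothetical class; it buffers records rather than
// opening a real file, but the open-on-first-write logic is the same.
class LazyWriter {
  private var initialized = false
  private var opens = 0
  private val records = scala.collection.mutable.ArrayBuffer.empty[(Any, Any)]

  private def open(): Unit = { opens += 1; initialized = true }

  /** Open the sink on first use, then record the key-value pair. */
  def write(key: Any, value: Any): Unit = {
    if (!initialized) open()
    records += ((key, value))
  }

  def openCount: Int = opens
  def size: Int = records.size
}
```

However many pairs are written, `open()` runs at most once, which is exactly why DiskBlockObjectWriter guards it with the `initialized` flag.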
5. What used to be FileShuffleBlockManager.scala is FileShuffleBlockResolver in Spark 2.1.0
The purpose of forMapTask: obtain a ShuffleWriterGroup for a given map task; the task is registered as complete once its writers are closed successfully.
/**
 * Get a ShuffleWriterGroup for the given map task, which will register it as complete
 * when the writers are closed successfully
 */
def forMapTask(shuffleId: Int, mapId: Int, numReducers: Int, serializer: Serializer,
    writeMetrics: ShuffleWriteMetrics): ShuffleWriterGroup = {
  new ShuffleWriterGroup {
    shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numReducers))
    private val shuffleState = shuffleStates(shuffleId)

    val openStartTime = System.nanoTime
    val serializerInstance = serializer.newInstance()
    val writers: Array[DiskBlockObjectWriter] = {
      Array.tabulate[DiskBlockObjectWriter](numReducers) { bucketId =>
        val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
        val blockFile = blockManager.diskBlockManager.getFile(blockId)
        val tmp = Utils.tempFileWith(blockFile)
        blockManager.getDiskWriter(blockId, tmp, serializerInstance, bufferSize, writeMetrics)
      }
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, so should be included in the shuffle write time.
    writeMetrics.incShuffleWriteTime(System.nanoTime - openStartTime)

    override def releaseWriters(success: Boolean) {
      shuffleState.completedMapTasks.add(mapId)
    }
  }
}
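Since forMapTask opens one writer per reducer for every map task, plain hash shuffle produces roughly numMapTasks × numReducers block files per shuffle, which is its well-known scalability problem. A sketch of the resulting block names (`blockNames` is a hypothetical helper; the `shuffle_<shuffleId>_<mapId>_<bucketId>` layout mirrors ShuffleBlockId and should be treated as illustrative):

```scala
// Illustrative sketch: which block files a hash shuffle would create.
// blockNames is a hypothetical helper; the name layout mirrors ShuffleBlockId.
object HashShuffleFiles {
  /** One block per (map task, reducer) pair: shuffle_<shuffleId>_<mapId>_<bucketId>. */
  def blockNames(shuffleId: Int, numMaps: Int, numReducers: Int): Seq[String] =
    for {
      mapId <- 0 until numMaps
      bucketId <- 0 until numReducers
    } yield s"shuffle_${shuffleId}_${mapId}_${bucketId}"
}
```

Even a modest job with 1000 map tasks and 1000 reducers would produce on the order of a million files, which is why sort-based shuffle (one data file plus one index file per map task) replaced hash shuffle as the default.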
6. HashShuffleWriter creates its shuffle via shuffleBlockResolver.forMapTask
private val shuffle: ShuffleWriterGroup = shuffleBlockResolver.forMapTask(
  dep.shuffleId, mapId, numOutputSplits, ser, writeMetrics)

/** A group of writers for a ShuffleMapTask, one writer per reducer. */
private[spark] trait ShuffleWriterGroup {
  val writers: Array[DiskBlockObjectWriter]  // one DiskBlockObjectWriter per reducer

  /** @param success Whether all writes were successful */
  def releaseWriters(success: Boolean)
}