Lecture 223: An Analysis of ShuffleReader in the Spark Shuffle Pluggable Framework
ShuffleReader is the interface through which a Stage reads the output of the previous Stage.
In a reduce task it reads the (possibly pre-aggregated) data produced by the mappers.
It fetches the data it needs from the previous ShuffleMapTasks; what read() returns is an Iterator, and the concrete reading logic lives in its subclasses.

private[spark] trait ShuffleReader[K, C] {
  /** Read the combined key-values for this reduce task */
  def read(): Iterator[Product2[K, C]]

  /**
   * Close this reader.
   * TODO: Add this back when we make the ShuffleReader a developer API that others can implement
   * (at which point this will likely be necessary).
   */
  // def stop(): Unit
}
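To see where read() fits, here is a minimal sketch, assuming Spark 1.6 internals, of how a reduce-side RDD obtains its reader; it mirrors what ShuffledRDD.compute does. The method name readReducePartition is made up for illustration, and because ShuffleManager is private[spark] this only compiles inside the Spark source tree.

import org.apache.spark.{ShuffleDependency, SparkEnv, TaskContext}

// Illustrative sketch only: a reduce task asks the ShuffleManager for a reader that
// covers exactly its own partition [i, i + 1) and then drains the returned iterator.
def readReducePartition[K, V, C](
    dep: ShuffleDependency[K, V, C],
    partitionIndex: Int,
    context: TaskContext): Iterator[Product2[K, C]] = {
  SparkEnv.get.shuffleManager
    .getReader(dep.shuffleHandle, partitionIndex, partitionIndex + 1, context)
    .read()
}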
Concretely, a ShuffleReader finds out where the map output lives through the MapOutputTracker: each ShuffleWriter reports a MapStatus back to the Driver, the Driver-side MapOutputTracker records it, and the reducers then query the tracker for the locations and sizes of their blocks.
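A sketch of the reader side of that exchange (the same getMapSizesByExecutorId call appears verbatim in the read() method quoted below); lookUpMapOutputs is a made-up wrapper for illustration, assuming the Spark 1.6 API.

import org.apache.spark.SparkEnv
import org.apache.spark.storage.{BlockId, BlockManagerId}

// The tracker answers with the shuffle blocks for [startPartition, endPartition),
// grouped by the executor (BlockManagerId) that wrote them, together with their sizes.
def lookUpMapOutputs(shuffleId: Int, startPartition: Int, endPartition: Int)
    : Seq[(BlockManagerId, Seq[(BlockId, Long)])] =
  SparkEnv.get.mapOutputTracker.getMapSizesByExecutorId(shuffleId, startPartition, endPartition)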
The original subclass of ShuffleReader was HashShuffleReader; in release 1.6.0 it was renamed to BlockStoreShuffleReader:
[SPARK-10704] Rename HashShuffleReader to BlockStoreShuffleReader Josh Rosen <joshrosen@databricks.com> 2015-09-22 11:50:22 -0700 Commit: 1ca5e2e, github.com/apache/spark/pull/8825
Let's look at BlockStoreShuffleReader, which extends ShuffleReader. Two things to note in its read() method:
1. It obtains the serializer via Serializer.getSerializer(dep.serializer).
2. While reading, it checks dep.mapSideCombine to decide how the fetched records are aggregated.
/**
 * Fetches and reads the partitions in range [startPartition, endPartition) from a shuffle by
 * requesting them from other nodes' block stores.
 */
private[spark] class BlockStoreShuffleReader[K, C](
    handle: BaseShuffleHandle[K, _, C],
    startPartition: Int,
    endPartition: Int,
    context: TaskContext,
    blockManager: BlockManager = SparkEnv.get.blockManager,
    mapOutputTracker: MapOutputTracker = SparkEnv.get.mapOutputTracker)
  extends ShuffleReader[K, C] with Logging {

  private val dep = handle.dependency

  /** Read the combined key-values for this reduce task */
  override def read(): Iterator[Product2[K, C]] = {
    val blockFetcherItr = new ShuffleBlockFetcherIterator(
      context,
      blockManager.shuffleClient,
      blockManager,
      mapOutputTracker.getMapSizesByExecutorId(handle.shuffleId, startPartition, endPartition),
      // Note: we use getSizeAsMb when no suffix is provided for backwards compatibility
      SparkEnv.get.conf.getSizeAsMb("spark.reducer.maxSizeInFlight", "48m") * 1024 * 1024)

    // Wrap the streams for compression based on configuration
    val wrappedStreams = blockFetcherItr.map { case (blockId, inputStream) =>
      blockManager.wrapForCompression(blockId, inputStream)
    }

    val ser = Serializer.getSerializer(dep.serializer)
    val serializerInstance = ser.newInstance()

    // Create a key/value iterator for each stream
    val recordIter = wrappedStreams.flatMap { wrappedStream =>
      // Note: the asKeyValueIterator below wraps a key/value iterator inside of a
      // NextIterator. The NextIterator makes sure that close() is called on the
      // underlying InputStream when all records have been read.
      serializerInstance.deserializeStream(wrappedStream).asKeyValueIterator
    }

    // Update the context task metrics for each record read.
    val readMetrics = context.taskMetrics.createShuffleReadMetricsForDependency()
    val metricIter = CompletionIterator[(Any, Any), Iterator[(Any, Any)]](
      recordIter.map(record => {
        readMetrics.incRecordsRead(1)
        record
      }),
      context.taskMetrics().updateShuffleReadMetrics())

    // An interruptible iterator must be used here in order to support task cancellation
    val interruptibleIter = new InterruptibleIterator[(Any, Any)](context, metricIter)

    val aggregatedIter: Iterator[Product2[K, C]] = if (dep.aggregator.isDefined) {
      if (dep.mapSideCombine) {
        // We are reading values that are already combined
        val combinedKeyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, C)]]
        dep.aggregator.get.combineCombinersByKey(combinedKeyValuesIterator, context)
      } else {
        // We don't know the value type, but also don't care -- the dependency *should*
        // have made sure its compatible w/ this aggregator, which will convert the value
        // type to the combined type C
        val keyValuesIterator = interruptibleIter.asInstanceOf[Iterator[(K, Nothing)]]
        dep.aggregator.get.combineValuesByKey(keyValuesIterator, context)
      }
    } else {
      require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
      interruptibleIter.asInstanceOf[Iterator[Product2[K, C]]]
    }

    // Sort the output if there is a sort ordering defined.
    dep.keyOrdering match {
      case Some(keyOrd: Ordering[K]) =>
        // Create an ExternalSorter to sort the data. Note that if spark.shuffle.spill is disabled,
        // the ExternalSorter won't spill to disk.
        val sorter =
          new ExternalSorter[K, C, C](context, ordering = Some(keyOrd), serializer = Some(ser))
        sorter.insertAll(aggregatedIter)
        context.taskMetrics().incMemoryBytesSpilled(sorter.memoryBytesSpilled)
        context.taskMetrics().incDiskBytesSpilled(sorter.diskBytesSpilled)
        context.internalMetricsToAccumulators(
          InternalAccumulator.PEAK_EXECUTION_MEMORY).add(sorter.peakMemoryUsedBytes)
        CompletionIterator[Product2[K, C], Iterator[Product2[K, C]]](sorter.iterator, sorter.stop())
      case None =>
        aggregatedIter
    }
  }
}
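The branch on dep.mapSideCombine is the core of read(). Below is a small, self-contained toy sketch of the difference between the two paths; the names combineValues and combineCombiners are illustrative stand-ins (not Spark API) for what Aggregator.combineValuesByKey and Aggregator.combineCombinersByKey do when the combiner type C is simply Int, as in word count.

object CombineSketch {
  // No map-side combine: the reducer receives raw (key, value) records and must build
  // the combiners itself (the combineValuesByKey path in the real code).
  def combineValues(records: Iterator[(String, Int)]): Map[String, Int] =
    records.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }

  // Map-side combine already ran: the reducer receives partial combiners (key, C) and
  // only merges them (the combineCombinersByKey path in the real code).
  def combineCombiners(partials: Iterator[(String, Int)]): Map[String, Int] =
    partials.foldLeft(Map.empty[String, Int]) { case (acc, (k, c)) =>
      acc.updated(k, acc.getOrElse(k, 0) + c)
    }

  def main(args: Array[String]): Unit = {
    val raw      = Iterator("a" -> 1, "b" -> 1, "a" -> 1) // what a reducer sees without map-side combine
    val partials = Iterator("a" -> 2, "b" -> 1)           // what it sees after map-side combine
    println(combineValues(raw))         // Map(a -> 2, b -> 1)
    println(combineCombiners(partials)) // Map(a -> 2, b -> 1)
  }
}

Both paths end with the same result; the difference is only how much of the merging work was already done on the map side before the shuffle data went over the wire.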
Where do the fetched blocks come from?
1. If a block is local, it is read through the BlockManager's getBlockData method.
2. If a block is not local, the data lives on a remote executor and has to be fetched over the network (a sketch of that path follows the code below).
The BlockManager's getBlockData method, used for the local case:
/**
 * Interface to get local block data. Throws an exception if the block cannot be found or
 * cannot be read successfully.
 */
override def getBlockData(blockId: BlockId): ManagedBuffer = {
  if (blockId.isShuffle) {
    shuffleManager.shuffleBlockResolver.getBlockData(blockId.asInstanceOf[ShuffleBlockId])
  } else {
    val blockBytesOpt = doGetLocal(blockId, asBlockResult = false)
      .asInstanceOf[Option[ByteBuffer]]
    if (blockBytesOpt.isDefined) {
      val buffer = blockBytesOpt.get
      new NioManagedBuffer(buffer)
    } else {
      throw new BlockNotFoundException(blockId.toString)
    }
  }
}
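For the remote case, ShuffleBlockFetcherIterator splits the addresses returned by the MapOutputTracker into local and remote blocks and pulls the remote ones through the BlockManager's ShuffleClient (NettyBlockTransferService by default). Below is a rough sketch of that call, assuming the Spark 1.6 network API; the fetchRemote helper is hypothetical, and the print statements stand in for the queueing of fetched buffers that the real iterator performs.

import org.apache.spark.network.buffer.ManagedBuffer
import org.apache.spark.network.shuffle.{BlockFetchingListener, ShuffleClient}

// Illustrative only: ask the executor at (host, port, execId) for a batch of shuffle
// blocks; results arrive asynchronously through the listener callbacks.
def fetchRemote(client: ShuffleClient, host: String, port: Int, execId: String,
                blockIds: Array[String]): Unit = {
  client.fetchBlocks(host, port, execId, blockIds, new BlockFetchingListener {
    override def onBlockFetchSuccess(blockId: String, data: ManagedBuffer): Unit =
      println(s"fetched $blockId (${data.size()} bytes)")
    override def onBlockFetchFailure(blockId: String, e: Throwable): Unit =
      println(s"failed to fetch $blockId: $e")
  })
}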