Lecture 217: How HashShuffleWriter Works in Spark Shuffle, with a Source Code Walkthrough

Source: Internet · Editor: 程序博客网 · Time: 2024/06/17 01:39


Notes taken from instructor Jialin's video lectures; reposting is welcome.




1. Getting the shuffleManager

In a Spark job, every stage except the last one does map-side work. The tasks in Stage 2 of the lecture's diagram are ShuffleMapTasks, and ShuffleMapTask's runTask method looks up the shuffleManager from SparkEnv.

```scala
override def runTask(context: TaskContext): MapStatus = {
  // Deserialize the RDD using the broadcast variable.
  val deserializeStartTime = System.currentTimeMillis()
  val ser = SparkEnv.get.closureSerializer.newInstance()
  val (rdd, dep) = ser.deserialize[(RDD[_], ShuffleDependency[_, _, _])](
    ByteBuffer.wrap(taskBinary.value), Thread.currentThread.getContextClassLoader)
  _executorDeserializeTime = System.currentTimeMillis() - deserializeStartTime
  metrics = Some(context.taskMetrics)
  var writer: ShuffleWriter[Any, Any] = null
  try {
    val manager = SparkEnv.get.shuffleManager
    writer = manager.getWriter[Any, Any](dep.shuffleHandle, partitionId, context)
    writer.write(rdd.iterator(partition, context).asInstanceOf[Iterator[_ <: Product2[Any, Any]]])
    writer.stop(success = true).get
  }
  // ... (catch/finally blocks omitted in the original excerpt)
```
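The control flow above can be modeled in a few lines of Python. This is a schematic sketch only: `FakeWriter` and `run_task` are hypothetical stand-ins for `manager.getWriter`, `writer.write`, and `writer.stop`, not Spark APIs.

```python
# Schematic model of runTask: get a writer, write this partition's records,
# then stop the writer and return its status (Spark returns a MapStatus here).
class FakeWriter:
    def __init__(self):
        self.records = []

    def write(self, record_iter):
        # Corresponds to writer.write(rdd.iterator(partition, context))
        self.records.extend(record_iter)

    def stop(self, success):
        # Corresponds to writer.stop(success = true).get
        return {"success": success, "count": len(self.records)}

def run_task(records):
    writer = FakeWriter()              # manager.getWriter[Any, Any](...)
    writer.write(iter(records))        # write this map partition's output
    return writer.stop(success=True)   # report completion

status = run_task([("a", 1), ("b", 2)])
```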



2. shuffleManager has three registered short names: hash (HashShuffleManager), sort (SortShuffleManager), and tungsten-sort (which also resolves to SortShuffleManager)

 
```scala
// Let the user specify short names for shuffle managers
val shortShuffleMgrNames = Map(
  "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
  "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager",
  "tungsten-sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)
```
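The name-resolution logic above is a simple map lookup with a fall-through for fully-qualified class names. Here is a Python model of it (the class-name strings come from the Spark source; `resolve_shuffle_manager` itself is a hypothetical helper, not a Spark API):

```python
# Mirrors shortShuffleMgrNames from SparkEnv: three short names, where both
# "sort" and "tungsten-sort" resolve to SortShuffleManager.
SHORT_SHUFFLE_MGR_NAMES = {
    "hash": "org.apache.spark.shuffle.hash.HashShuffleManager",
    "sort": "org.apache.spark.shuffle.sort.SortShuffleManager",
    "tungsten-sort": "org.apache.spark.shuffle.sort.SortShuffleManager",
}

def resolve_shuffle_manager(conf: dict) -> str:
    # Default is "sort", matching conf.get("spark.shuffle.manager", "sort").
    name = conf.get("spark.shuffle.manager", "sort")
    # An unknown name is treated as a fully-qualified class name, as in
    # shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName).
    return SHORT_SHUFFLE_MGR_NAMES.get(name.lower(), name)

print(resolve_shuffle_manager({"spark.shuffle.manager": "hash"}))
```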


3. When hash-based shuffle is selected, writes go through HashShuffleWriter (note that, per the code above, the default value of spark.shuffle.manager is actually "sort"). Let's look at HashShuffleWriter's write method

HashShuffleWriter's write method takes the iterator over the task's RDD partition and produces the output that the next stage will read.
It first checks whether aggregation is needed: if map-side combine is enabled, it runs combineValuesByKey; otherwise it passes records through unchanged. It then iterates over iter, whose type is Iterator[Product2[K, Any]], where each element is a (key, value) pair. For each element it computes the bucketId via dep.partitioner.getPartition(key), then writes the (key, value) pair into that bucket's shuffle file.

```scala
/** Write a bunch of records to this task's output */
override def write(records: Iterator[Product2[K, V]]): Unit = {
  val iter = if (dep.aggregator.isDefined) {
    if (dep.mapSideCombine) {
      dep.aggregator.get.combineValuesByKey(records, context)
    } else {
      records
    }
  } else {
    require(!dep.mapSideCombine, "Map-side combine without Aggregator specified!")
    records
  }
  for (elem <- iter) {
    val bucketId = dep.partitioner.getPartition(elem._1)
    shuffle.writers(bucketId).write(elem._1, elem._2)
  }
}
```
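The routing step can be sketched as a toy Python model. This is an illustration, not Spark code: `hash_partition` stands in for HashPartitioner's getPartition, and each bucket list stands in for one per-reducer shuffle file.

```python
# Toy model of the write loop: every record is routed to a bucket by
# hashing its key modulo the number of reducers, then appended to that
# bucket's "file" (here just a list).
def hash_partition(key, num_reducers: int) -> int:
    # Stand-in for dep.partitioner.getPartition(key); Python's % always
    # yields a non-negative index for a positive modulus.
    return hash(key) % num_reducers

def write(records, num_reducers):
    buckets = [[] for _ in range(num_reducers)]  # one bucket per reducer
    for key, value in records:
        buckets[hash_partition(key, num_reducers)].append((key, value))
    return buckets

buckets = write([("a", 1), ("b", 2), ("a", 3)], 3)
```

Note that all records sharing a key land in the same bucket, which is exactly what lets the reducer for that bucket see every value for its keys.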


4. DiskBlockObjectWriter's write method in Spark 1.6.0

```scala
/**
 * Writes a key-value pair.
 */
def write(key: Any, value: Any) {
  if (!initialized) {
    open()
  }
  objOut.writeKey(key)
  objOut.writeValue(value)
  recordWritten()
}
```
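The method opens the underlying stream lazily, only on the first write. A minimal Python sketch of that pattern (an assumption-laden model, not Spark code; `DiskObjectWriter` is hypothetical and an in-memory buffer stands in for the on-disk file):

```python
import io
import pickle

# Models the lazy-open pattern of DiskBlockObjectWriter: the output stream
# is created on first write, guarded by an "initialized" check.
class DiskObjectWriter:
    def __init__(self):
        self._out = None  # not opened yet, like `initialized == false`

    def _open(self):
        self._out = io.BytesIO()  # stands in for the on-disk file stream

    def write(self, key, value):
        if self._out is None:     # if (!initialized) open()
            self._open()
        # writeKey / writeValue: serialize the pair to the stream
        pickle.dump((key, value), self._out)

w = DiskObjectWriter()
w.write("a", 1)
```

Lazy opening matters here because a HashShuffleWriter holds one writer per reducer, and opening every file eagerly would touch the disk even for buckets that never receive a record.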


5. This code used to live in FileShuffleBlockManager.scala; in Spark 2.1.0 the class is FileShuffleBlockResolver

Purpose of forMapTask: obtain a ShuffleWriterGroup for the given map task; the task is registered as complete when its writers are closed successfully.


```scala
/**
 * Get a ShuffleWriterGroup for the given map task, which will register it as complete
 * when the writers are closed successfully
 */
def forMapTask(shuffleId: Int, mapId: Int, numReducers: Int, serializer: Serializer,
    writeMetrics: ShuffleWriteMetrics): ShuffleWriterGroup = {
  new ShuffleWriterGroup {
    shuffleStates.putIfAbsent(shuffleId, new ShuffleState(numReducers))
    private val shuffleState = shuffleStates(shuffleId)
    val openStartTime = System.nanoTime
    val serializerInstance = serializer.newInstance()
    val writers: Array[DiskBlockObjectWriter] = {
      Array.tabulate[DiskBlockObjectWriter](numReducers) { bucketId =>
        val blockId = ShuffleBlockId(shuffleId, mapId, bucketId)
        val blockFile = blockManager.diskBlockManager.getFile(blockId)
        val tmp = Utils.tempFileWith(blockFile)
        blockManager.getDiskWriter(blockId, tmp, serializerInstance, bufferSize, writeMetrics)
      }
    }
    // Creating the file to write to and creating a disk writer both involve interacting with
    // the disk, so should be included in the shuffle write time.
    writeMetrics.incShuffleWriteTime(System.nanoTime - openStartTime)

    override def releaseWriters(success: Boolean) {
      shuffleState.completedMapTasks.add(mapId)
    }
  }
}
```
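Because forMapTask opens one DiskBlockObjectWriter per reducer for every map task, hash-based shuffle creates on the order of numMapTasks x numReducers files. The arithmetic below is an illustration of that scaling, not Spark code:

```python
# Each map task opens numReducers writers (Array.tabulate(numReducers) above),
# so the total number of shuffle files grows multiplicatively.
def shuffle_file_count(num_map_tasks: int, num_reducers: int) -> int:
    return num_map_tasks * num_reducers

# At even moderate scale this explodes into a small-file problem:
print(shuffle_file_count(1000, 1000))  # 1000 maps x 1000 reducers = 1000000 files
```

This file explosion is the classic weakness of hash-based shuffle and part of why sort-based shuffle became the default.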


6. HashShuffleWriter creates its shuffle member via shuffleBlockResolver.forMapTask


```scala
private val shuffle: ShuffleWriterGroup = shuffleBlockResolver.forMapTask(
  dep.shuffleId, mapId, numOutputSplits, ser, writeMetrics)
```


```scala
/** A group of writers for a ShuffleMapTask, one writer per reducer. */
private[spark] trait ShuffleWriterGroup {
  val writers: Array[DiskBlockObjectWriter]  // one DiskBlockObjectWriter per reducer

  /** @param success Whether all writes succeeded; overridden in forMapTask above. */
  def releaseWriters(success: Boolean)
}
```




