The core process of a Spark RDD join

Source: Internet. Editor: 程序博客网. Date: 2024/06/03 18:36

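The join is the most central step of query execution in Spark: the question is how to relate two tables while consuming as few resources as possible. The source code below shows how it is done.
join is implemented in the PairRDDFunctions class.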

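def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues( pair =>
    // pair._1 holds the left table's values and pair._2 the right table's values for one key.
    // This is a Cartesian product: every left value is paired with every right value of that key.
    for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, w)
  )
}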

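As shown above, the RDD itself is related to the other RDD through cogroup, and the partitioner object is passed along. The join then expands each key's two value groups into pairs.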
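
To make the "cogroup, then per-key Cartesian product" idea concrete, here is a minimal sketch on plain Scala collections. It does not use Spark at all: LocalJoinSketch and its local cogroup are hypothetical stand-ins written for this illustration, and only the for-comprehension mirrors the source above.

```scala
object LocalJoinSketch {
  // Hypothetical local stand-in for cogroup: for every key seen on either side,
  // collect all left values and all right values of that key.
  def cogroup[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val l = left.groupBy(_._1)
    val r = right.groupBy(_._1)
    (l.keySet ++ r.keySet).map { k =>
      val vs = l.getOrElse(k, Seq.empty[(K, V)]).map(_._2)
      val ws = r.getOrElse(k, Seq.empty[(K, W)]).map(_._2)
      (k, (vs, ws))
    }.toMap
  }

  // Inner join: per-key Cartesian product. A key missing on one side has an
  // empty group there, so the for-comprehension yields nothing for it.
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    cogroup(left, right).toSeq.flatMap { case (k, (vs, ws)) =>
      for (v <- vs; w <- ws) yield (k, (v, w))
    }
}
```

Keys that appear on only one side drop out without any explicit check, which is why the inner join needs no special case for them.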

def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner)
    : RDD[(K, (Iterable[V], Iterable[W]))] = self.withScope {
  if (partitioner.isInstanceOf[HashPartitioner] && keyClass.isArray) {
    // Hash partitioning cannot be used when the key is an array
    throw new SparkException("Default partitioner cannot partition array keys.")
  }
  // CoGroupedRDD is the class that does the core work of a join
  val cg = new CoGroupedRDD[K](Seq(self, other), partitioner)
  cg.mapValues { case Array(vs, w1s) =>
    (vs.asInstanceOf[Iterable[V]], w1s.asInstanceOf[Iterable[W]])
  }
}

cogroup then creates a CoGroupedRDD, the object dedicated to relating several RDDs by key. Let us now walk through the CoGroupedRDD source.

override def getDependencies: Seq[Dependency[_]] = {
  rdds.map { rdd: RDD[_] =>
    if (rdd.partitioner == Some(part)) {
      logDebug("Adding one-to-one dependency with " + rdd)
      // This RDD already uses the same partitioner as the join
      new OneToOneDependency(rdd)
    } else {
      logDebug("Adding shuffle dependency with " + rdd)
      // The partitioner differs, so the data must be repartitioned: this is the shuffle
      new ShuffleDependency[K, Any, CoGroupCombiner](
        rdd.asInstanceOf[RDD[_ <: Product2[K, _]]], part, serializer)
    }
  }
}

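These are the RDD dependencies of the join. If a parent RDD is already partitioned by part, the dependency is a OneToOneDependency and no hash repartitioning is needed. Otherwise, when the parent's partitioning differs from part, its records must be hash-repartitioned onto the correct partitions, so a ShuffleDependency is used to redistribute the data by key. Each resulting split is then processed by the task responsible for it.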
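
The decision can be reduced to a single comparison. The sketch below uses simplified, hypothetical stand-ins (HashPart, Narrow, Shuffle) rather than Spark's real Partitioner and Dependency classes; it only illustrates the rule that a parent stays narrow when its partitioner equals the join's partitioner.

```scala
object DependencySketch {
  // Hypothetical, simplified stand-in for a hash partitioner.
  // Two instances are equal when they have the same number of partitions.
  final case class HashPart(numPartitions: Int)

  sealed trait Dep
  case object Narrow extends Dep  // plays the role of OneToOneDependency
  case object Shuffle extends Dep // plays the role of ShuffleDependency

  // A parent keeps a narrow dependency only if it is already partitioned by
  // exactly the partitioner the join will use; otherwise it has to be shuffled.
  // A parent with no partitioner at all (None) is always shuffled.
  def dependencyFor(parent: Option[HashPart], target: HashPart): Dep =
    if (parent == Some(target)) Narrow else Shuffle
}
```

One practical consequence: partitioning both inputs with the same partitioner before joining lets the join itself avoid a shuffle, because both parents then take the narrow branch.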

override def getPartitions: Array[Partition] = {
  // Build the partitions here; each partition is processed by one task in a worker process
  val array = new Array[Partition](part.numPartitions)
  for (i <- 0 until array.length) {
    // Each CoGroupPartition will have a dependency per contributing RDD
    array(i) = new CoGroupPartition(i, rdds.zipWithIndex.map { case (rdd, j) =>
      // Assume each RDD contributed a single dependency, and get it
      dependencies(j) match {
        case s: ShuffleDependency[_, _, _] =>
          // This RDD's data has to go through the shuffle
          None
        case _ =>
          // The partitioning is the same, so the parent's partition i is used directly
          Some(new NarrowCoGroupSplitDep(rdd, i, rdd.partitions(i)))
      }
    }.toArray)
  }
  // The joined RDDs are now described as numPartitions partitions
  array
}

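getPartitions builds one CoGroupPartition per partition of part, and each of them records, for every parent RDD, where its share of the data comes from. With a OneToOneDependency the parent's partition i is used as is and nothing is re-split. With a ShuffleDependency the entry is None, and the parent's records are redistributed by applying part to each key, so that all records with the same key land in the same split. Once equal keys from every RDD sit in the same partition, the join can be carried out partition by partition.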
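
Why equal keys end up together can be seen in a few lines of plain Scala. This is a sketch under the assumption that the partitioner is a hash partitioner computing a non-negative hashCode modulo; partitionOf and split are hypothetical helper names, not Spark APIs.

```scala
object HashPartitionSketch {
  // Map a key to a partition index. The modulo is made non-negative so that
  // keys with a negative hashCode still get a valid index; null goes to 0.
  def partitionOf(key: Any, numPartitions: Int): Int =
    if (key == null) 0
    else {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }

  // Split one keyed data set into numPartitions buckets by key.
  def split[K, V](data: Seq[(K, V)], numPartitions: Int): Vector[Seq[(K, V)]] = {
    val byIndex = data.groupBy { case (k, _) => partitionOf(k, numPartitions) }
    Vector.tabulate(numPartitions)(i => byIndex.getOrElse(i, Seq.empty[(K, V)]))
  }
}
```

If two data sets are split with the same function and the same numPartitions, bucket i of one can only ever match keys in bucket i of the other, so each bucket pair can be joined independently.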

override def compute(s: Partition, context: TaskContext): Iterator[(K, Array[Iterable[_]])] = {
  // Runs in one worker process and handles this one partition
  val split = s.asInstanceOf[CoGroupPartition]
  // Number of RDDs this RDD depends on
  val numRdds = dependencies.length

  // A list of (rdd iterator, dependency number) pairs
  // Collect the data of this split from every parent
  val rddIterators = new ArrayBuffer[(Iterator[Product2[K, Any]], Int)]
  for ((dep, depNum) <- dependencies.zipWithIndex) dep match {
    case oneToOneDependency: OneToOneDependency[Product2[K, Any]] @unchecked =>
      // The partition of parent RDD number depNum that this split depends on
      val dependencyPartition = split.narrowDeps(depNum).get.split
      // Read them from the parent
      // The partition is read inside this worker process
      val it = oneToOneDependency.rdd.iterator(dependencyPartition, context)
      rddIterators += ((it, depNum))

    case shuffleDependency: ShuffleDependency[_, _, _] =>
      // Read map outputs of shuffle
      // This parent's data has already been hash-partitioned by the shuffle,
      // so only the block that belongs to this split is fetched here
      val it = SparkEnv.get.shuffleManager
        .getReader(shuffleDependency.shuffleHandle, split.index, split.index + 1, context)
        .read()
      rddIterators += ((it, depNum))
  }

  // Create the combiner that merges the values of several RDDs
  val map = createExternalMap(numRdds)
  for ((it, depNum) <- rddIterators) {
    map.insertAll(it.map(pair => (pair._1, new CoGroupValue(pair._2, depNum))))
  }
  context.taskMetrics().incMemoryBytesSpilled(map.memoryBytesSpilled)
  context.taskMetrics().incDiskBytesSpilled(map.diskBytesSpilled)
  context.internalMetricsToAccumulators(
    InternalAccumulator.PEAK_EXECUTION_MEMORY).add(map.peakMemoryUsedBytes)
  // The result is now grouped by key
  new InterruptibleIterator(context,
    map.iterator.asInstanceOf[Iterator[(K, Array[Iterable[_]])]])
}

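Now look at the compute method, which processes the single split currently being computed. By this point the parent RDDs have been hash-partitioned so that all records with the same key belong to this split. For a OneToOneDependency the parent's partition is read directly. For a ShuffleDependency the parent RDD has been hashed into multiple partitions by the shuffle, and compute fetches only the one it needs. Once the data for the split is available, an ExternalAppendOnlyMap (created below) groups the values by key and by source RDD.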

private def createExternalMap(numRdds: Int)
    : ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner] = {
  // Create the combiner for several RDDs: key, value, rdd_index
  // value._1 is the actual value and value._2 is the index of the RDD it came from
  // Initialise the combiner for a key: rdd_index --> value
  val createCombiner: (CoGroupValue => CoGroupCombiner) = value => {
    val newCombiner = Array.fill(numRdds)(new CoGroup)
    newCombiner(value._2) += value._1
    newCombiner
  }
  // Merge one more value into an existing combiner: rdd_index --> value
  val mergeValue: (CoGroupCombiner, CoGroupValue) => CoGroupCombiner =
    (combiner, value) => {
    combiner(value._2) += value._1
    combiner
  }
  // Finally merge two combiners, RDD by RDD
  val mergeCombiners: (CoGroupCombiner, CoGroupCombiner) => CoGroupCombiner =
    (combiner1, combiner2) => {
      var depNum = 0
      while (depNum < numRdds) {
        combiner1(depNum) ++= combiner2(depNum)
        depNum += 1
      }
      combiner1
    }
  // The map manages the keys itself; these functions only track which value belongs to which rdd_index
  new ExternalAppendOnlyMap[K, CoGroupValue, CoGroupCombiner](
    createCombiner, mergeValue, mergeCombiners)
}

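This is a shared combiner for data from several RDDs: inserting records into it is enough to group them by key, with one buffer per source RDD.
compute finally returns an InterruptibleIterator over the grouped entries (grouped by key, not necessarily sorted by key), and join turns each entry into joined pairs.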
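
The three combiner functions are easier to follow on a small in-memory example. The sketch below is an assumption-laden simplification: it replaces ExternalAppendOnlyMap with an ordinary mutable map (so there is no spilling to disk), and Combiner, cogroup and the record layout (key, value, rddIndex) are hypothetical names chosen for this illustration.

```scala
import scala.collection.mutable

object CoGroupCombinerSketch {
  // One growable buffer per input RDD, indexed by the RDD's position.
  type Combiner = Array[mutable.ArrayBuffer[Any]]

  // First value seen for a key: allocate all buffers, fill the right one.
  def createCombiner(numRdds: Int)(value: Any, rddIndex: Int): Combiner = {
    val c: Combiner = Array.fill(numRdds)(mutable.ArrayBuffer.empty[Any])
    c(rddIndex) += value
    c
  }

  // Later values for the same key: append to the buffer of their source RDD.
  def mergeValue(c: Combiner, value: Any, rddIndex: Int): Combiner = {
    c(rddIndex) += value
    c
  }

  // Merge two partial results for the same key, buffer by buffer.
  def mergeCombiners(a: Combiner, b: Combiner): Combiner = {
    for (i <- a.indices) a(i) ++= b(i)
    a
  }

  // Feed (key, value, rddIndex) records through createCombiner / mergeValue.
  def cogroup[K](numRdds: Int, records: Iterator[(K, Any, Int)]): mutable.Map[K, Combiner] = {
    val map = mutable.Map.empty[K, Combiner]
    for ((k, v, i) <- records) {
      val updated = map.get(k) match {
        case Some(c) => mergeValue(c, v, i)
        case None    => createCombiner(numRdds)(v, i)
      }
      map.update(k, updated)
    }
    map
  }
}
```

In this sketch mergeCombiners is never called by cogroup; it is included because the source defines it for the case where two partial combiners for the same key have to be merged.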

Next, let us look at leftOuterJoin.

def leftOuterJoin[W](
    other: RDD[(K, W)],
    partitioner: Partitioner): RDD[(K, (V, Option[W]))] = self.withScope {
  this.cogroup(other, partitioner).flatMapValues { pair =>
    if (pair._2.isEmpty) {
      // The right table has nothing for this key, but the left rows must still be emitted
      pair._1.iterator.map(v => (v, None))
    } else {
      for (v <- pair._1.iterator; w <- pair._2.iterator) yield (v, Some(w))
    }
  }
}

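As you can see, a left outer join relates the data in exactly the same way underneath. The only difference is the check on the outside: when the right side is empty for a key, the left values are still emitted, paired with None.
A right outer join works the same way, with the roles of the two sides swapped.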
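
The per-key logic of both outer joins fits in a few lines. This sketch works on the two value groups of ONE key after cogroup, using plain Seq instead of RDDs; OuterJoinSketch, leftOuter and rightOuter are hypothetical names for this illustration, and rightOuter is written by symmetry with the leftOuterJoin source above rather than copied from Spark.

```scala
object OuterJoinSketch {
  // Left outer join for one key: keep every left value even when
  // the right group is empty, in which case it is paired with None.
  def leftOuter[V, W](vs: Seq[V], ws: Seq[W]): Seq[(V, Option[W])] =
    if (ws.isEmpty) vs.map(v => (v, None))
    else for (v <- vs; w <- ws) yield (v, Some(w))

  // Right outer join for one key: the mirror image, keeping every right value.
  def rightOuter[V, W](vs: Seq[V], ws: Seq[W]): Seq[(Option[V], W)] =
    if (vs.isEmpty) ws.map(w => (None, w))
    else for (v <- vs; w <- ws) yield (Some(v), w)
}
```

Note that when the preserved side itself is empty for a key (no left values in leftOuter), the result is empty, so keys that exist only on the other side are dropped.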

Summary

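Several RDD objects are passed in.
Each RDD is checked against the given partitioner part: if it is already partitioned that way, the matching partition is read directly; otherwise the RDD is shuffled, its data is hash-partitioned by key, and the matching partition is fetched.
The data belonging to the same partition index is delivered to the worker process that reads it.
Inside that worker process, the records that hashed to the same partition are related to each other independently of all other partitions.
Finally, an iterator over the results grouped by key is returned (the key order is not guaranteed to be sorted).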
