Spark Transformation Series: Coalesce

Source: Internet | Editor: 程序博客网 | Time: 2024/06/06 03:58

Coalesce/repartition

coalesce

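This operation regroups the partitions of the current RDD into a new RDD with the number of partitions passed in, combining the data of the existing partitions into the new ones.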

The function takes two parameters:

Parameter 1: the number of partitions to produce.

Parameter 2: whether to perform a shuffle; defaults to false.

def coalesce(numPartitions: Int, shuffle: Boolean = false)
    (implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  if (shuffle) {

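This branch handles the case where a shuffle is required.

First, a function to run over each partition of the current RDD is defined: distributePartition.

Given the number of partitions of the RDD to be generated, this function spreads the records of each partition of the current RDD across the partitions of the new RDD, starting from a randomly chosen one. The key it emits determines the index of the target partition, and the value is the original record of the current RDD.

The point of this is that when the key is hashed, the hash code of an Int key is the key itself, so the HashPartitioner maps it straight to the target partition by taking it modulo the number of partitions.

From the starting position, the position is incremented by one for every record, so records are dealt out round-robin to successive partitions. With the shuffle, the partitions of the new RDD end up holding roughly the same amount of data.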
    /** Distributes elements evenly across output partitions, starting from a random partition. */
    val distributePartition = (index: Int, items: Iterator[T]) => {
      var position = (new Random(index)).nextInt(numPartitions)
      items.map { t =>
        // Note that the hash code of the key will just be the key itself. The HashPartitioner
        // will mod it with the number of total partitions.
        position = position + 1
        (position, t)
      }
    } : Iterator[(Int, T)]
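
The round-robin keying can be tried outside Spark. The following is a minimal sketch in plain Scala, not Spark code: `DistributeSketch`, `targetPartition`, and `sizes` are hypothetical names introduced for illustration, and `targetPartition` only imitates what the quoted comment says the HashPartitioner does (take the key modulo the partition count).

```scala
import scala.util.Random

object DistributeSketch {
  // Mirrors the quoted distributePartition: random start, then one step per record.
  def distributePartition[T](numPartitions: Int)(index: Int, items: Iterator[T]): Iterator[(Int, T)] = {
    var position = new Random(index).nextInt(numPartitions)
    items.map { t =>
      position += 1
      (position, t)
    }
  }

  // Hypothetical stand-in for the partitioner: non-negative key modulo numPartitions.
  def targetPartition(key: Int, numPartitions: Int): Int = {
    val m = key % numPartitions
    if (m < 0) m + numPartitions else m
  }

  // Number of records that land in each output partition.
  def sizes(parents: Seq[Seq[Int]], numPartitions: Int): Seq[Int] = {
    val keyed = parents.zipWithIndex.flatMap { case (items, idx) =>
      distributePartition[Int](numPartitions)(idx, items.iterator).toList
    }
    val counts = keyed.groupBy { case (key, _) => targetPartition(key, numPartitions) }
    (0 until numPartitions).map(p => counts.get(p).map(_.size).getOrElse(0))
  }
}
```

When every parent partition holds a multiple of numPartitions records, the output partitions come out exactly equal whatever the random start is; otherwise each parent contributes a difference of at most one record.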
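Next, the CoalescedRDD instance is created. Because this branch needs a shuffle, the parent passed to it is a ShuffledRDD built from the current RDD. The mapPartitionsWithIndex call below is a map over the current RDD that applies distributePartition. Finally, calling values on the resulting CoalescedRDD drops the partition keys and yields the original records.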
    // include a shuffle step so that our upstream tasks are still distributed
    new CoalescedRDD(
      new ShuffledRDD[Int, T, T](mapPartitionsWithIndex(distributePartition),
      new HashPartitioner(numPartitions)),
      numPartitions).values
  } else {

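No shuffle is needed here, so a CoalescedRDD instance is created directly from the current RDD and the new partition count. The parent partitions are grouped into the new partitions. In principle, parent partitions on the same host are placed in the same group, but if one group would hold too many parent partitions (more than the 0.10 slack allows), the partition is assigned to a randomly chosen group instead. The 0.10 bound is not strictly guaranteed.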
new CoalescedRDD(this, numPartitions)
}
}
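
To make the no-shuffle grouping concrete, here is a minimal sketch in plain Scala. It is a simplification, not the locality-aware logic: it assumes no parent partition reports a preferred host and that parents are then merged in contiguous index ranges. `NarrowCoalesceSketch` and `groupParents` are hypothetical names.

```scala
object NarrowCoalesceSketch {
  // Returns, for each new partition, the indices of the parent partitions it reads.
  // Assumption: no locality information, so parents are split into contiguous ranges.
  def groupParents(numParents: Int, numPartitions: Int): Seq[Seq[Int]] = {
    // Without a shuffle the partition count cannot grow beyond the parent's.
    val n = math.min(numParents, numPartitions)
    (0 until n).map { i =>
      val start = ((i.toLong * numParents) / n).toInt
      val end = (((i.toLong + 1) * numParents) / n).toInt
      (start until end).toList
    }
  }
}
```

For example, `groupParents(10, 3)` groups the parents as 0 to 2, 3 to 5, and 6 to 9. Asking for more partitions than the parent has leaves the count unchanged, which is why growing the partition count requires shuffle = true.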

How CoalescedRDD works:

First, look at construction. The dependency list passed to the RDD constructor is Nil; the actual dependencies are obtained through getDependencies.

private[spark] class CoalescedRDD[T: ClassTag](
    @transient var prev: RDD[T],
    maxPartitions: Int,
    balanceSlack: Double = 0.10)
  extends RDD[T](prev.context, Nil) {

Next, look at how the partitions of the new RDD are generated.

override def getPartitions: Array[Partition] = {

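A PartitionCoalescer instance is created from the partition count of the new RDD, the parent RDD, and the tolerated imbalance per partition (0.10 by default). Its run function produces the partition information for the new RDD.

The balanceSlack value controls how far, measured in parent partitions, the groups may deviate from a perfectly even assignment.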
val pc = new PartitionCoalescer(maxPartitions, prev, balanceSlack)

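The run function returns an array of PartitionGroup instances.

The flow inside run falls into three cases:

1. If the parent of the current RDD is a shuffled RDD, the current RDD has the same number of partitions as the parent. This case is simple: the partitions of the current RDD and of the parent map one-to-one.

2. If the parent is not a shuffled RDD but the two RDDs have the same number of partitions, it is just as simple and is handled as in case 1.

3. If the current RDD has fewer partitions than the parent, the handling is more involved:

3.1. First, as many parent partitions as the current RDD has partitions are picked and stored in groups according to the host of each partition, so every group holds at least one parent partition.

3.2. Then the remaining parent partitions (those not yet stored in a group) are processed. If no existing group belongs to the host of the partition, a small group is picked from the existing groups (the smaller of two chosen at random) and the partition is stored there.

3.3. If a group for the partition's host already exists, the group on that host with the fewest parent partitions is selected. Two groups are also picked at random from all groups, and the smaller of the two is kept. If the size of that random group plus the tolerated slack is smaller than the size of the host's smallest group, the partition is added to the random group; otherwise it is added to the host's group.

The result of this process is the CoalescedRDDPartition information: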
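
Steps 3.2 and 3.3 can be sketched as follows. This is a plain-Scala model under stated assumptions, not the PartitionCoalescer code: `Group`, `pickBin`, and `slackFor` are hypothetical names, the comparison follows the wording of step 3.3 (Spark's exact comparison may differ), and deriving the slack as balanceSlack times the number of parent partitions is an assumption.

```scala
import scala.util.Random

object PickBinSketch {
  // A group of parent partitions tied to one host; size counts the parents in it.
  final case class Group(host: String, var size: Int = 0)

  // Assumption: the slack is balanceSlack scaled by the number of parent partitions.
  def slackFor(balanceSlack: Double, numParents: Int): Int =
    (balanceSlack * numParents).toInt

  // Chooses the group that should receive a parent partition located on `host`.
  def pickBin(groups: IndexedSeq[Group], host: Option[String], slack: Int, rnd: Random): Group = {
    // Pick two groups at random and keep the smaller one.
    val r1 = groups(rnd.nextInt(groups.size))
    val r2 = groups(rnd.nextInt(groups.size))
    val minRandom = if (r1.size < r2.size) r1 else r2

    val hostGroups = host.toList.flatMap(h => groups.filter(_.host == h))
    if (hostGroups.isEmpty) {
      minRandom // step 3.2: no group exists for this host
    } else {
      val pref = hostGroups.minBy(_.size) // smallest group on the preferred host
      // Step 3.3: give up locality only if the random group is smaller by more than the slack.
      if (minRandom.size + slack < pref.size) minRandom else pref
    }
  }
}
```

With a small slack the choice favors balance; with a large slack it favors locality, which matches the trade-off the balanceSlack parameter is described as controlling.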
  pc.run().zipWithIndex.map {
    case (pg, i) =>
      val ids = pg.arr.map(_.index).toArray
      new CoalescedRDDPartition(i, prev, ids, pg.prefLoc)
  }
}

Now look at the logic of getDependencies:

override def getDependencies: Seq[Dependency[_]] = {

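The Dependency here is a narrow dependency: each partition of the current RDD depends on one or more partitions of the parent RDD. One-to-one is typically the case after a shuffle has been performed.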
  Seq(new NarrowDependency(prev) {
    def getParents(id: Int): Seq[Int] =
      partitions(id).asInstanceOf[CoalescedRDDPartition].parentsIndices
  })
}

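Finally, look at the compute function of CoalescedRDD:

compute takes the parent partitions that back the given partition of the current RDD (parents below) and runs flatMap over them, combining the iterator of every parent partition into a single iterator.

The per-partition iterators are concatenated end to end.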

override def compute(partition: Partition, context: TaskContext): Iterator[T] = {
  partition.asInstanceOf[CoalescedRDDPartition].parents.iterator
    .flatMap { parentPartition =>
      firstParent[T].iterator(parentPartition, context)
    }
}
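
The end-to-end concatenation performed by compute can be illustrated without Spark. In this minimal sketch, `ComputeSketch` is a hypothetical name, `parents` stands in for the data of the parent partitions, and `parentsIndices` plays the role of the indices stored in a CoalescedRDDPartition.

```scala
object ComputeSketch {
  // Lazily chains the iterators of the selected parent partitions, in order.
  def compute[T](parents: IndexedSeq[Seq[T]], parentsIndices: Seq[Int]): Iterator[T] =
    parentsIndices.iterator.flatMap(i => parents(i).iterator)
}
```

Because flatMap on an iterator is lazy, a parent partition is only read once the records of the previous one have been consumed.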

Repartition

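Repartition is implemented directly on top of coalesce, so it is not covered in detail. The only difference is that shuffle defaults to false for coalesce, whereas repartition explicitly sets it to true. With this parameter set to true, the partitions of the newly generated RDD map one-to-one onto the partitions of its parent (the ShuffledRDD), so no elaborate grouping of partitions is needed. repartition also spreads data more evenly than coalesce without a shuffle.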

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null)
    : RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}
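
The difference in balance can be seen with a small model that tracks only record counts. This is a sketch in plain Scala under assumptions, not a measurement of Spark: `BalanceSketch`, `withoutShuffle`, and `withShuffle` are hypothetical names, the no-shuffle path assumes parents are merged in contiguous index ranges with no locality information, and the shuffle path imitates the round-robin keying of distributePartition.

```scala
import scala.util.Random

object BalanceSketch {
  // Output sizes without a shuffle: each new partition sums a contiguous range of parents.
  def withoutShuffle(parentSizes: IndexedSeq[Int], n: Int): Seq[Int] =
    (0 until n).map { i =>
      val start = ((i.toLong * parentSizes.size) / n).toInt
      val end = (((i.toLong + 1) * parentSizes.size) / n).toInt
      parentSizes.slice(start, end).sum
    }

  // Output sizes with a shuffle: every parent deals its records out round-robin.
  def withShuffle(parentSizes: IndexedSeq[Int], n: Int): Seq[Int] = {
    val counts = Array.fill(n)(0)
    parentSizes.zipWithIndex.foreach { case (size, index) =>
      var position = new Random(index).nextInt(n)
      (1 to size).foreach { _ =>
        position += 1
        counts(position % n) += 1
      }
    }
    counts.toList
  }
}
```

For skewed parents of sizes 100, 1, 1, 1 reduced to 2 partitions, the no-shuffle model yields 101 and 2 records, while the shuffle model keeps the two partitions within a few records of each other.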

0 0