Spark transform series: aggregateByKey


aggregateByKey

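This function can do the same job as groupByKey and reduceByKey: it aggregates the values that share a key in an RDD. Its main use is as a transform that returns an RDD whose values have a caller-specified type U, which may differ from the input value type V. It takes three arguments:

Argument 1: the initial value of type U (the zero value) used inside each partition the first time a V value is read for a key.

Argument 2: a function that, inside each partition, merges each V value of a key into the U accumulator created from argument 1.

Argument 3: a function that merges two U values for the same key that come from different partitions after the shuffle.

Compared with groupByKey and reduceByKey, this function gives you more freedom in how you organize the aggregation code.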
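The roles of the three arguments can be sketched without a cluster. The snippet below is a minimal illustration in plain Scala collections, not Spark's implementation: `aggregateByKeyLocal` is a hypothetical helper, and the nested `Seq` stands in for the partitions of an RDD. It computes a per-key (sum, count) pair, so the result type U = (Int, Int) differs from the value type V = Int.

```scala
object AggregateByKeySketch {
  // Hypothetical helper (not a Spark API): mimics aggregateByKey on in-memory "partitions".
  def aggregateByKeyLocal[K, V, U](partitions: Seq[Seq[(K, V)]], zeroValue: U)(
      seqOp: (U, V) => U,
      combOp: (U, U) => U): Map[K, U] = {
    // Step 1: inside each partition, fold every value into a per-key accumulator,
    // starting from zeroValue the first time a key is seen.
    val perPartition: Seq[Map[K, U]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, U]) { case (acc, (k, v)) =>
        acc.updated(k, seqOp(acc.getOrElse(k, zeroValue), v))
      }
    }
    // Step 2: merge the accumulators of the same key across partitions.
    perPartition.foldLeft(Map.empty[K, U]) { (merged, partial) =>
      partial.foldLeft(merged) { case (acc, (k, u)) =>
        acc.updated(k, acc.get(k).map(combOp(_, u)).getOrElse(u))
      }
    }
  }

  // Two made-up partitions of (key, value) pairs.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "b" -> 2, "a" -> 3),
    Seq("a" -> 5, "b" -> 4)
  )

  // zeroValue = (0, 0); seqOp adds one value; combOp adds two partial results.
  val sumAndCount: Map[String, (Int, Int)] =
    aggregateByKeyLocal(partitions, (0, 0))(
      (acc, v) => (acc._1 + v, acc._2 + 1),
      (x, y) => (x._1 + y._1, x._2 + y._2))

  // Derive the average per key from the (sum, count) accumulator.
  val averages: Map[String, Double] =
    sumAndCount.map { case (k, (sum, count)) => k -> (sum.toDouble / count) }
}
```

With reduceByKey the accumulator would have to be an Int like the values themselves, so an average would need an extra map step to build the pairs first; here the zero value and seqOp take care of that conversion.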

def aggregateByKey[U: ClassTag](zeroValue: U)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  aggregateByKey(zeroValue, defaultPartitioner(self))(seqOp, combOp)
}

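This is the definition you normally call. It delegates to the overload that takes an explicit Partitioner, passing in the default partitioner.

In this function:

U: ClassTag ==> the type of the aggregated value, so the returned RDD is RDD[(K, U)].

zeroValue: U ==> the initial value used the first time a key is seen in a partition. It is a value, not a function; Spark wraps it in a function (createZero in the code below) that produces a fresh copy of type U, and then calls seqOp to add the first value of the key to that copy. This is the `(v: V) => cleanedSeqOp(createZero(), v)` argument in the code below.

seqOp: (U, V) => U ==> adds each value of a key, as the partition is iterated, to the U instance created from zeroValue.

combOp: (U, U) => U ==> merges two U values for the same key that were aggregated in different partitions.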

def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)(seqOp: (U, V) => U,
    combOp: (U, U) => U): RDD[(K, U)] = self.withScope {
  // Serialize the zero value to a byte array so that we can get a new clone of it on each key
  val zeroBuffer = SparkEnv.get.serializer.newInstance().serialize(zeroValue)
  val zeroArray = new Array[Byte](zeroBuffer.limit)
  zeroBuffer.get(zeroArray)

  lazy val cachedSerializer = SparkEnv.get.serializer.newInstance()
  val createZero = () => cachedSerializer.deserialize[U](ByteBuffer.wrap(zeroArray))

  // We will clean the combiner closure later in `combineByKey`
  val cleanedSeqOp = self.context.clean(seqOp)
  combineByKeyWithClassTag[U]((v: V) => cleanedSeqOp(createZero(), v),
    cleanedSeqOp, combOp, partitioner)
}
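The reason the zero value is serialized once and deserialized for every key shows up when U is mutable: if every key shared one zero instance, values added for one key would leak into the others. The following is a rough sketch of that idea only. It assumes plain Java serialization instead of Spark's configured serializer, and the `serialize` and `deserialize` helpers are hypothetical names; only `createZero`, `zeroArray` and `seqOp` mirror identifiers from the code above.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

object ZeroValueCloneSketch {
  // Hypothetical helper: turn a value into bytes (assumption: Java serialization).
  def serialize(value: AnyRef): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(value)
    out.close()
    bytes.toByteArray
  }

  // Hypothetical helper: rebuild a fresh object from the bytes.
  def deserialize[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }

  // A mutable zero value: an empty list that seqOp appends to in place.
  val zeroValue = new java.util.ArrayList[Int]()

  // Serialize once, then hand out an independent clone on every call.
  val zeroArray: Array[Byte] = serialize(zeroValue)
  val createZero: () => java.util.ArrayList[Int] =
    () => deserialize[java.util.ArrayList[Int]](zeroArray)

  val seqOp: (java.util.ArrayList[Int], Int) => java.util.ArrayList[Int] =
    (buf, v) => { buf.add(v); buf }

  // Same shape as (v: V) => cleanedSeqOp(createZero(), v): each key starts from its own clone.
  val forA = seqOp(createZero(), 1)
  val forB = seqOp(createZero(), 2)

  // Counter-example: sharing one mutable instance mixes the values of both keys.
  val shared = new java.util.ArrayList[Int]()
  val sharedA = seqOp(shared, 1)
  val sharedB = seqOp(shared, 2)
}
```

In this sketch `forA` and `forB` each hold a single element and the original `zeroValue` stays empty, while `sharedA` and `sharedB` are the same list and hold both elements.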

1 0