Spark combineByKey Usage Example

combineByKey aggregates the records of an RDD by key; the aggregation logic is supplied to combineByKey through user-defined functions.

combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C, numPartitions: Int): RDD[(K, C)]

It transforms an RDD of type (K, V) into an RDD of type (K, C), where the combined type C need not be the same as the value type V.

combineByKey takes three functions (plus the number of output partitions):

- createCombiner: (V) ⇒ C — called the first time a key is seen within a partition; turns that key's first value into an initial combiner.
- mergeValue: (C, V) ⇒ C — merges each subsequent value for the same key into that partition's combiner.
- mergeCombiners: (C, C) ⇒ C — merges combiners for the same key across partitions.

The example below sums and counts the values for each key:

val data = Array((1, 1.0), (1, 2.0), (1, 3.0), (2, 4.0), (2, 5.0), (2, 6.0))
val rdd = sc.parallelize(data, 2)  // spread the pairs across 2 partitions

val combine1 = rdd.combineByKey(
  createCombiner = (v: Double) => (v, 1),  // first value for a key: (value, count 1)
  mergeValue = (c: (Double, Int), v: Double) => (c._1 + v, c._2 + 1),  // fold in a value, bump the count
  mergeCombiners = (c1: (Double, Int), c2: (Double, Int)) => (c1._1 + c2._1, c1._2 + c2._2),  // merge partial (sum, count) pairs across partitions
  numPartitions = 2)

combine1.collect
// res0: Array[(Int, (Double, Int))] = Array((2,(15.0,3)), (1,(6.0,3)))
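
Each combiner is a (sum, count) pair, so key 1's values sum to 6.0 over 3 records and key 2's to 15.0 over 3 records. A natural follow-up, sketched here as an addition rather than part of the original example, is to turn those pairs into per-key averages with mapValues:

val avg = combine1.mapValues { case (sum, count) => sum / count }
avg.collect
// expected: Array((2,5.0), (1,2.0))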