RDD: aggregate

Source: Internet. Editor: 程序博客网. Time: 2024/04/29 14:03

Definition

For the definition, see the RDD API documentation:

aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U
Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions and a neutral “zero value”. This function can return a different result type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U and one operation for merging two U’s, as in scala.TraversableOnce. Both of these functions are allowed to modify and return their first argument instead of creating a new U to avoid memory allocation.
zeroValue
the initial value for the accumulated result of each partition for the seqOp operator, and also the initial value for the combine results from different partitions for the combOp operator - this will typically be the neutral element (e.g. Nil for list concatenation or 0 for summation)
seqOp
an operator used to accumulate results within a partition
combOp
an associative operator used to combine results from different partitions
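
To make the point about the result type U differing from the element type T concrete, here is a minimal plain-Scala sketch that runs without Spark. The helper simulateAggregate is hypothetical (it is not part of the Spark API) and only mimics the documented behaviour: fold each partition with seqOp starting from zeroValue, then merge the partition results with combOp, again starting from zeroValue. The partition layout is an assumption made for illustration.

```scala
object AggregateAverage {
  // Hypothetical helper that mimics RDD.aggregate on in-memory "partitions".
  def simulateAggregate[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U): U = {
    // Step 1: fold every partition, each one starting from zeroValue.
    val perPartition = partitions.map(p => p.foldLeft(zeroValue)(seqOp))
    // Step 2: merge the partition results, starting from zeroValue once more.
    perPartition.foldLeft(zeroValue)(combOp)
  }

  // Assumed split of 1 to 10 into three partitions.
  val partitions: Seq[Seq[Int]] =
    Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))

  // Here T = Int and U = (Int, Int), holding (running sum, running count).
  val (sum, count) = simulateAggregate(partitions, (0, 0))(
    (acc, x) => (acc._1 + x, acc._2 + 1),   // merge one T into a U
    (a, b) => (a._1 + b._1, a._2 + b._2))   // merge two U's

  def main(args: Array[String]): Unit =
    println(s"sum=$sum count=$count mean=${sum.toDouble / count}")
}
```

In spark-shell the corresponding call would pass the same two functions to data.aggregate((0, 0)). The pair (0, 0) is a neutral zero value for both functions, so it does not distort the result no matter how many times it is applied.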

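Experiment 1: basic usage

The API documentation is fairly clear: this function aggregates the elements within each partition, then combines the per-partition results using the combine function and zeroValue. It takes two functions, seqOp and combOp.

Program

Open spark-shell and run Experiment 1 (when copying and pasting the code below, remove the comments first)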

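// Tag each element with the index of the partition it belongs to
def myfunc[T](index: Int, iter: Iterator[T]): Iterator[(Int, T)] = {
  var res = List[(Int, T)]()
  for (x <- iter) res = (index, x) :: res
  res.iterator
}
val data = sc.parallelize(1 to 10, 3)
// Show which partition each element landed in
data.mapPartitionsWithIndex(myfunc).collect
// seqOp keeps the maximum within a partition; combOp sums the partition results
data.aggregate(0)((a, b) => if (a > b) a else b, _ + _)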

Results

[Figure: Experiment 1 result]

Analysis
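
The computation in Experiment 1 can be traced by hand with a small plain-Scala sketch that needs no Spark. It assumes the three partitions hold 1 to 3, 4 to 6, and 7 to 10, which is how sc.parallelize(1 to 10, 3) typically slices this range; check the mapPartitionsWithIndex output on your own cluster to confirm. The object and value names are made up for illustration.

```scala
object Experiment1Sketch {
  // Assumed partition layout for sc.parallelize(1 to 10, 3).
  val partitions: Seq[Seq[Int]] =
    Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))

  val zeroValue = 0
  val seqOp: (Int, Int) => Int = (a, b) => if (a > b) a else b // maximum
  val combOp: (Int, Int) => Int = (a, b) => a + b              // sum

  // seqOp runs inside each partition, starting from zeroValue.
  val partitionMaxima: Seq[Int] =
    partitions.map(p => p.foldLeft(zeroValue)(seqOp))

  // combOp merges the partition results, starting from zeroValue again.
  val result: Int = partitionMaxima.foldLeft(zeroValue)(combOp)

  def main(args: Array[String]): Unit = {
    println(s"partition maxima: $partitionMaxima")
    println(s"aggregate result: $result")
  }
}
```

Under that assumption the partition maxima are 3, 6 and 10, and combOp adds them to the zero value: 0 + 3 + 6 + 10 = 19.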

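Experiment 2: zeroValue

According to the API documentation, zeroValue is the initial value for seqOp within each partition and also the initial value for combOp when the partition results are combined.

Program

Open spark-shell and run Experiment 2 (when copying and pasting the code below, remove the comments first)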

// seqOp: keep the larger of the two arguments and log the call
def seqOp(arg1: Int, arg2: Int): Int = {
  var res: Int = arg2
  if (arg1 > arg2) res = arg1
  println("seqOp:" + arg1 + "," + arg2 + "=>" + res)
  res
}
// combOp: add the two arguments and log the call
def combOp(arg1: Int, arg2: Int): Int = {
  println("combOp:" + arg1 + "," + arg2 + "=>" + (arg1 + arg2))
  arg1 + arg2
}
// Tag each element with the index of the partition it belongs to
def myfunc[T](index: Int, iter: Iterator[T]): Iterator[(Int, T)] = {
  var res = List[(Int, T)]()
  for (x <- iter) res = (index, x) :: res
  res.iterator
}
val data = sc.parallelize(1 to 10, 3)
data.mapPartitionsWithIndex(myfunc).collect
data.aggregate(11)(seqOp, combOp)

Results

Analysis
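
The effect of zeroValue = 11 can be worked through with the following plain-Scala sketch, which needs no Spark. It assumes the partitions hold 1 to 3, 4 to 6, and 7 to 10; the object name and the in-memory partition list are illustrative only. Because 11 is larger than every element, seqOp returns 11 in every partition, and combOp then adds those three results to the zero value itself.

```scala
object Experiment2Sketch {
  def seqOp(acc: Int, x: Int): Int = if (acc > x) acc else x // maximum
  def combOp(a: Int, b: Int): Int = a + b                    // sum

  // Assumed partition layout for sc.parallelize(1 to 10, 3).
  val partitions: Seq[Seq[Int]] =
    Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))

  val zeroValue = 11

  // zeroValue seeds seqOp once per partition ...
  val perPartition: Seq[Int] =
    partitions.map(p => p.foldLeft(zeroValue)(seqOp))

  // ... and seeds combOp once more when the partition results are merged.
  val result: Int = perPartition.foldLeft(zeroValue)(combOp)

  def main(args: Array[String]): Unit = {
    println(s"per-partition results: $perPartition")
    println(s"aggregate result: $result")
  }
}
```

With three partitions the zero value enters the computation four times: 11 + 11 + 11 + 11 = 44. None of the actual data contributes to the result.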

Of course, the zeroValue chosen in this experiment is rather extreme; try replacing it with 5 or 6.
Analysis
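
To compare several zero values side by side, the sketch below wraps the same simulation in a hypothetical run function (not a Spark API) and evaluates it for 0, 5, 6 and 11. As before it is plain Scala and assumes the partitions hold 1 to 3, 4 to 6, and 7 to 10.

```scala
object ZeroValueComparison {
  // Assumed partition layout for sc.parallelize(1 to 10, 3).
  val partitions: Seq[Seq[Int]] =
    Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9, 10))

  // Hypothetical helper: maximum within each partition, then a sum across
  // partitions, with zeroValue seeding both stages.
  def run(zeroValue: Int): Int = {
    val perPartition =
      partitions.map(p => p.foldLeft(zeroValue)((a, b) => math.max(a, b)))
    perPartition.foldLeft(zeroValue)(_ + _)
  }

  def main(args: Array[String]): Unit =
    Seq(0, 5, 6, 11).foreach(z => println(s"zeroValue=$z -> ${run(z)}"))
}
```

Under the assumed layout, a zero value of 5 replaces the maximum of the first partition (3 becomes 5) and is added once more by combOp, giving 5 + 5 + 6 + 10 = 26. A zero value of 6 gives 6 + 6 + 6 + 10 = 28. Only 0 behaves as a neutral element for both functions on this data and yields 19.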

Reference blogs:
[1]: http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#aggregate
[2]: http://www.iteblog.com/archives/1268

1 0