Spark RDD转换成其他数据结构

来源：互联网发布：广播知乎编辑：程序博客网时间：2024/05/06 16:25

在Spark推荐系统编程中，一般都是通过文件加载成RDD：

//在这里默认 (userId, itemId, preference)val fields = sparkContext.textFile("").split("\t").map{ field =>    field(1), field(2), field(3)}

然后直接操作RDD，诸如：

groupBy(field._1)map (field._2)sortBy(field._3)

但是某些时候，想用到scala的数据结构比如TreeMap、HashMap之类的。

在scala里，集合之间的转换可以通过fold, foldLeft, foldRight
fold只能同构转换，而foldLeft在scala程序中能用，但是在RDD不能用，这涉及到并行的问题

在scala集合中，foldLeft可以这样用：

//数据格式为//  TreeMap(userId, TreeMap(itemId, preference))val preferences = fields.foldLeft(    TreeMap[Long, TreeMap[Long, Double]]()) {     (m, e) =>      val (userId, itemId, preference) = e      val values = m.getOrElse(      userId, TreeMap[Long, Double]())      m + (userId->(values+(itemId->preference)))  }

RDD不能用foldLeft，官方提供了一个方法aggregate

Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions and a neutral “zero value”.
This function can return a different result type, U, than the type of this RDD, T.
Thus, we need one operation for merging a T into an U and one operation for merging two U’s, as in scala.TraversableOnce.
Both of these functions are allowed to modify and return their first argument instead of creating a new U to avoid memory allocation.

aggregate函数将每个分区里面的元素进行聚合，然后用combine函数将每个分区的结果和初始值(zeroValue)进行combine操作。这个函数最终返回的类型不需要和RDD中元素类型一致。

aggregate函数原型:

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

RDD用aggregate转换成其他数据结构：

//  TreeMap(userId, TreeMap(itemId, preference))val preferences = fields.aggregate(    TreeMap[Long, TreeMap[Long, Double]]()) (    {     (m, e) =>      val (userId, itemId, preference) = e      val values = m.getOrElse(      userId, TreeMap[Long, Double]())      m + (userId->(values+(itemId->preference)))  },  {    (map1, map2) =>    map1 ++ map2  })

通过这样，我们就得到了从RDD转换成TreeMap数据结构。

值得注意的是，通过aggregate函数是相当于把数据collect到Driver上，因此对转换后的数据结构操作并不能像RDD那样分布式

0 0