Spark Transformation: union

Source: Internet  Edited by: 程序博客网  Time: 2024/06/13 12:53

def union(other: RDD[T]): RDD[T]

This function is straightforward: it merges two RDDs into one and does not remove duplicates.

scala> var rdd1 = sc.makeRDD(1 to 2,1)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at <console>:21

scala> rdd1.collect
res42: Array[Int] = Array(1, 2)

scala> var rdd2 = sc.makeRDD(2 to 3,1)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:21

scala> rdd2.collect
res43: Array[Int] = Array(2, 3)

scala> rdd1.union(rdd2).collect
res44: Array[Int] = Array(1, 2, 2, 3)
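The same behavior can be reproduced outside the shell. The following is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the object name `UnionExample` and the helper `run` are made up for illustration. It shows that `union` keeps the duplicate element and that chaining `distinct()` removes it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical wrapper object; only makeRDD, union, distinct and collect are Spark API.
object UnionExample {
  // Returns (elements after union, elements after union + distinct, sorted).
  def run(sc: SparkContext): (Array[Int], Array[Int]) = {
    val rdd1 = sc.makeRDD(1 to 2, 1)
    val rdd2 = sc.makeRDD(2 to 3, 1)

    // union keeps every element of both RDDs, so 2 appears twice.
    val merged = rdd1.union(rdd2)

    // distinct() shuffles and drops duplicates; its output order is not
    // guaranteed, so the result is sorted before being returned.
    val deduped = merged.distinct().collect().sorted

    (merged.collect(), deduped)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("union-example").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      val (merged, deduped) = run(sc)
      println(merged.mkString(", "))  // 1, 2, 2, 3
      println(deduped.mkString(", ")) // 1, 2, 3
    } finally {
      sc.stop()
    }
  }
}
```

Because `distinct()` triggers a shuffle, it is worth calling only when duplicates actually need to be removed.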

Diagram

[Figure: schematic of the union of two RDDs]

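In the figure, the large boxes on the left represent the two input RDDs, and the small boxes inside them represent their partitions. The large box on the right represents the merged RDD, and the small boxes inside it are its partitions. The RDD containing V1, V2, …, U4 and the RDD containing V1, V8, …, U8 contribute all of their elements to a single RDD. V1, V1, V2 and V8 form one partition, and the remaining elements are merged in the same way. Note that merging corresponding partitions like this matches the case where both RDDs share the same partitioner; otherwise the result simply keeps every input partition, so its partition count is the sum of the inputs' partition counts.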

Source code

/**
 * Return the union of this RDD and another one. Any identical elements will appear multiple
 * times (use `.distinct()` to eliminate them).
 */
def union(other: RDD[T]): RDD[T] = {
  if (partitioner.isDefined && other.partitioner == partitioner) {
    new PartitionerAwareUnionRDD(sc, Array(this, other))
  } else {
    new UnionRDD(sc, Array(this, other))
  }
}
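The two branches in the source above can be told apart from the outside by the number of partitions in the result. The sketch below assumes Spark is on the classpath; the object name `UnionPartitions` and the helper `partitionCounts` are made up for illustration, and the partition counts 2, 3 and 4 are arbitrary. Without a shared partitioner the result keeps every input partition; with one, it keeps the partitioner's partition count.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

// Hypothetical wrapper object; parallelize, partitionBy, union and
// getNumPartitions are Spark API.
object UnionPartitions {
  // Returns (partitions of a plain union, partitions of a partitioner-aware union).
  def partitionCounts(sc: SparkContext): (Int, Int) = {
    val a = sc.parallelize(Seq(1 -> "a", 2 -> "b"), 2)
    val b = sc.parallelize(Seq(2 -> "c", 3 -> "d"), 3)

    // Neither RDD has a partitioner, so the else branch (UnionRDD) applies:
    // the result has 2 + 3 = 5 partitions.
    val plain = a.union(b)

    // Both RDDs are partitioned by the same HashPartitioner, so the
    // PartitionerAwareUnionRDD branch applies and the result keeps 4 partitions.
    val p = new HashPartitioner(4)
    val aware = a.partitionBy(p).union(b.partitionBy(p))

    (plain.getNumPartitions, aware.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("union-partitions").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      println(partitionCounts(sc)) // (5,4)
    } finally {
      sc.stop()
    }
  }
}
```

In the partitioner-aware case the result retains the partitioner, which can let a later key-based operation using the same partitioner avoid another shuffle.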

0 0