Spark Development: Transformation Operations

Source: Internet  Editor: 程序博客网  Date: 2024/06/06 19:34

Core
Transformation operations
map(func)
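Returns a new RDD formed by passing every element of the source RDD through func.
func is applied to each input element and returns exactly one output object per input.
Example: val rdd1=sc.parallelize(Array(1,2,3,4)).map(x=>2*x).collect
This maps over the data 1,2,3,4 with the function 2*x, so every element is doubled. Because collect is called at the end, rdd1 holds a local Array rather than an RDD.
Result: rdd1: Array[Int] = Array(2, 4, 6, 8)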

filter(func)
Returns a new RDD containing only the elements for which func returns true.
Example: val rdd1=sc.parallelize(Array(1,2,3,4)).filter(x=>x>1).collect
This filters the array 1,2,3,4 and keeps the elements greater than 1.
Result: rdd1: Array[Int] = Array(2, 3, 4)

flatMap(func)
Returns a new RDD. Like map, it takes a function as its argument.
Unlike map, func may return a sequence of zero or more items for each input, and all of those sequences are flattened into a single RDD.

Example:

scala> val data =Array(Array(1, 2, 3, 4, 5),Array(4,5,6))
data: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5), Array(4, 5, 6))
scala> val rdd1=sc.parallelize(data)
rdd1: org.apache.spark.rdd.RDD[Array[Int]] = ParallelCollectionRDD[4] at parallelize at <console>:29
scala> val rdd2=rdd1.flatMap(x=>x.map(y=>y))
rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[5] at flatMap at <console>:31
scala> rdd2.collect
res0: Array[Int] = Array(1, 2, 3, 4, 5, 4, 5, 6)

mapPartitions(func)
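Similar to map, but where map operates on each element of the RDD, mapPartitions (like its action counterpart foreachPartition) operates on the iterator of each partition.
If the mapping step has to create expensive helper objects repeatedly (for example, when writing RDD data to a database over JDBC, map would open one connection per element, whereas mapPartitions opens one connection per partition), mapPartitions is far more efficient than map.

val a = sc.parallelize(1 to 9, 3)
def doubleFunc(iter: Iterator[Int]) : Iterator[(Int,Int)] = {
    var res = List[(Int,Int)]()
    while (iter.hasNext) {
      val cur = iter.next
      // prepend to the list, so each partition's output comes out reversed
      res ::= ((cur, cur*2))
    }
    res.iterator
}
val result = a.mapPartitions(doubleFunc)
println(result.collect().mkString)
Result: (3,6)(2,4)(1,2)(6,12)(5,10)(4,8)(9,18)(8,16)(7,14)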
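The per-partition setup cost described above can be sketched as follows. This is only a minimal sketch, assuming a spark-shell session on Spark 2.0 or later (needed for sc.longAccumulator) where sc is predefined. The accumulator opened and the constant factor are hypothetical stand-ins for opening a real connection, which the original text does not show.

```scala
// Assumption: running inside spark-shell (Spark 2.0+), where `sc` already exists.
// `opened` counts how many times the (hypothetical) expensive setup runs.
val opened = sc.longAccumulator("resources opened")
val nums = sc.parallelize(1 to 9, 3)

val scaled = nums.mapPartitions { iter =>
  opened.add(1L)        // stands in for opening one connection per partition
  val factor = 10       // hypothetical state obtained from that resource
  iter.map(_ * factor)  // the per-element work reuses the same setup
}.collect()

// With 3 partitions and no task retries, the setup runs 3 times, not 9.
println(s"setups: ${opened.sum}, result: ${scaled.mkString(",")}")
```

With map, the same setup would have to sit inside the per-element function and would run once for each of the 9 elements.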

mapPartitionsWithIndex(func)
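Works like mapPartitions, except that func takes two parameters: the first is the index of the partition and the second is that partition's iterator.

scala> val a = sc.parallelize(1 to 9, 3)
scala> def myfunc(index: Int, iter: Iterator[Int]) : Iterator[(Int,Int,Int)] = {
    var res = List[(Int,Int,Int)]()
    // assumes every partition is non-empty; iter.next fails on an empty one
    var pre = iter.next
    while (iter.hasNext) {
        val cur = iter.next
        // emit (partition index, previous element, current element)
        res ::= ((index, pre, cur))
        pre = cur
    }
    res.iterator
}
scala> a.mapPartitionsWithIndex(myfunc).collect
res11: Array[(Int, Int, Int)] = Array((0,2,3), (0,1,2), (1,5,6), (1,4,5), (2,8,9), (2,7,8))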

sample(withReplacement, fraction, seed)
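sample draws a random sample from the RDD and produces a new RDD that contains only part of the original data. The size of the retained data is controlled by fraction, but only approximately: the result is not guaranteed to hold exactly fraction times the original number of elements.
Parameters:
withReplacement: if true, sampling uses the PoissonSampler (Poisson distribution); otherwise it uses the BernoulliSampler.
fraction: without replacement, the probability that each element is selected, which must be in [0, 1]; with replacement, the expected number of times each element is selected, which must be greater than or equal to 0.
seed: the seed for the random number generator. If it is not passed, it defaults to a randomly generated Long.

val a = sc.parallelize(1 to 9, 3)
val b = a.sample(true,0.5,4)
b.collect()
res7: Array[Int] = Array(2, 3, 4, 4, 6, 8)
val c = a.sample(false,0.5,4)
c.collect()
Array[Int] = Array(2, 3, 5, 6, 8)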

union(otherDataset)
Merges two RDDs and returns a new RDD. Duplicates are kept: (c,6) appears twice in the output below.

scala> val rdd1=sc.parallelize(List(('a',2),('b',4),('c',6),('d',9)))
scala> val rdd2=sc.parallelize(List(('c',6),('c',7),('d',8),('e',10)))
scala> val rdd3 = rdd1 union rdd2
scala> rdd3.collect()
res0: Array[(Char, Int)] = Array((a,2), (b,4), (c,6), (d,9), (c,6), (c,7), (d,8), (e,10))

intersection(otherDataset)
Returns the intersection of the two RDDs, with duplicates removed.

scala> val rdd1=sc.parallelize(List(('a',2),('b',4),('c',6),('d',9)))
scala> val rdd2=sc.parallelize(List(('c',6),('c',7),('d',8),('e',10)))
scala> val rdd3 = rdd1 intersection rdd2
scala> rdd3.collect()
Array[(Char, Int)] = Array((c,6))

distinct([numTasks])
Removes duplicate elements from the RDD.

scala> val rdd1=sc.parallelize(List(('a',2),('b',4),('c',6),('d',9),('d',9),('d',9),('d',9)))
scala> val rdd2=rdd1.distinct()
scala> rdd2.collect()
res3: Array[(Char, Int)] = Array((a,2), (c,6), (d,9), (b,4))

groupByKey([numTasks])
Takes a dataset of (K, V) pairs and returns a dataset of (K, Iterable[V]) pairs. The optional numTasks argument sets the number of tasks.

scala> val rdd1=sc.parallelize(1 to 5)
scala> val rdd2=sc.parallelize(4 to 9)
scala> rdd1.union(rdd2).map(word=>(word,1)).groupByKey().collect()
Array[(Int, Iterable[Int])] = Array((1,CompactBuffer(1)), (2,CompactBuffer(1)), (3,CompactBuffer(1)),(4,CompactBuffer(1, 1)), (5,CompactBuffer(1, 1)), (6,CompactBuffer(1)), (7,CompactBuffer(1)), (8,CompactBuffer(1)), (9,CompactBuffer(1)))

reduceByKey(func, [numTasks])
reduceByKey takes a dataset of (K, V) pairs and also returns (K, V) pairs, where each V is the result of aggregating all values of that key with func.

scala> val rdd1=sc.parallelize(1 to 5)
scala> val rdd2=sc.parallelize(4 to 9)
scala> rdd1.union(rdd2).map(word=>(word,1)).reduceByKey(_+_).collect()
Array[(Int, Int)] = Array((1,1), (2,1), (3,1), (4,2), (5,2), (6,1), (7,1), (8,1), (9,1))

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
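aggregateByKey aggregates the values that share the same key in a pair RDD. As with the other aggregation functions, the aggregation starts from a neutral initial value (zeroValue): seqOp folds each value into the accumulator within a partition, and combOp merges the accumulators of different partitions.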
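The original gives no example for aggregateByKey, so here is a minimal sketch, assuming a spark-shell session where sc is predefined. The dataset scores and the names sumCount and avg are made up for illustration. It computes a per-key average by accumulating a (sum, count) pair, which is a case where the result type differs from the value type.

```scala
// Assumption: running inside spark-shell, where `sc` already exists.
// Hypothetical input: (key, score) pairs spread over 2 partitions.
val scores = sc.parallelize(
  List(("a", 1), ("a", 3), ("b", 4), ("a", 5), ("b", 6)), 2)

// zeroValue (0, 0) is the neutral (sum, count) accumulator.
val sumCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),   // seqOp: fold one value into the accumulator
  (x, y) => (x._1 + y._1, x._2 + y._2)    // combOp: merge two partition accumulators
)

// Turn (sum, count) into an average; sort so the printed order is stable.
val avg = sumCount.mapValues { case (s, c) => s.toDouble / c }
  .collect()
  .sortBy(_._1)

println(avg.mkString(", "))
```

Here key a accumulates (9, 3) and key b accumulates (10, 2), giving averages of 3.0 and 5.0.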

sortByKey([ascending], [numTasks])
Sorts the input dataset of (K, V) pairs by key.

scala> var data = sc.parallelize(List((1,3),(1,2),(1, 4),(2,3),(3,4),(7,9),(2,4)))
scala> data.sortByKey(true).collect()
Array[(Int, Int)] = Array((1,2), (1,4), (1,3), (2,3), (2,4), (3,4), (7,9))

join(otherDataset, [numTasks])
Joins two RDDs by key. Only keys present in both RDDs appear in the result.

scala> val rdd1=sc.parallelize(List(('a',2),('b',4),('c',6),('d',9)))
scala> val rdd2=sc.parallelize(List(('c',6),('c',7),('d',8),('e',10)))
scala> val rdd3 = rdd1 join rdd2
scala> rdd3.collect()
Array[(Char, (Int, Int))] = Array((c,(6,6)), (c,(6,7)), (d,(9,8)))

cogroup(otherDataset, [numTasks])
If the input RDDs have types (K, V) and (K, W), the returned RDD has type (K, (Iterable[V], Iterable[W])). This operation is equivalent to groupWith.

scala> val rdd1=sc.parallelize(Array((1,2),(1,3)))
scala> val rdd2=sc.parallelize(Array((1,3)))
scala> rdd1.cogroup(rdd2).collect
Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2, 3),CompactBuffer(3))))
scala> rdd1.groupWith(rdd2).collect
res10: Array[(Int, (Iterable[Int], Iterable[Int]))] = Array((1,(CompactBuffer(2, 3),CompactBuffer(3))))

cartesian(otherDataset)
Computes the Cartesian product of two RDDs.

scala> val rdd1=sc.parallelize(Array(1,2,3,4))
scala> val rdd2=sc.parallelize(Array(5,6))
scala> rdd1.cartesian(rdd2).collect
res12: Array[(Int, Int)] = Array((1,5), (1,6), (2,5), (2,6), (3,5), (3,6), (4,5), (4,6))

coalesce(numPartitions)
Reduces the number of partitions of the RDD to numPartitions. The shuffle parameter defaults to false, so no shuffle is performed.

scala> val rdd1=sc.parallelize(1 to 100,3)
scala> val rdd2=rdd1.coalesce(2)
scala> rdd1.collect()
17/09/22 08:52:40 INFO spark.SparkContext: Starting job: collect at <console>:30
17/09/22 08:52:40 INFO scheduler.DAGScheduler: Got job 14 (collect at <console>:30) with 3 output partitions
scala> rdd2.collect()
17/09/22 08:52:09 INFO spark.SparkContext: Starting job: collect at <console>:32
17/09/22 08:52:09 INFO scheduler.DAGScheduler: Got job 13 (collect at <console>:32) with 2 output partitions

repartition(numPartitions)
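repartition(numPartitions) redistributes the data into exactly numPartitions partitions. Internally it simply calls coalesce with shuffle = true, so unlike the default coalesce it can also increase the number of partitions, but the full shuffle can cause heavy network overhead.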
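The relationship between coalesce and repartition can be checked by comparing partition counts. This is a minimal sketch, assuming a spark-shell session where sc is predefined. The variable names are made up for illustration.

```scala
// Assumption: running inside spark-shell, where `sc` already exists.
val base = sc.parallelize(1 to 100, 3)

val fewer        = base.coalesce(2)                  // no shuffle: 3 -> 2 partitions
val notMore      = base.coalesce(6)                  // no shuffle: cannot grow, stays at 3
val moreShuffled = base.coalesce(6, shuffle = true)  // with shuffle: 3 -> 6 partitions
val more         = base.repartition(6)               // same as coalesce(6, shuffle = true)

// Partition counts of the four derived RDDs, in the order above.
val counts = Seq(fewer, notMore, moreShuffled, more).map(_.partitions.length)
println(counts.mkString(", "))
```

Without a shuffle, coalesce can only merge existing partitions, which is why asking for 6 partitions leaves the count at 3.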

repartitionAndSortWithinPartitions(partitioner)
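repartitionAndSortWithinPartitions is a variant of repartition. Unlike repartition, it repartitions a key-value RDD according to the given partitioner and sorts the records by key within each resulting partition.
This is more efficient than calling repartition and then sorting inside each partition, because the sort is pushed down into the shuffle machinery.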
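The original gives no example for this function, so here is a minimal sketch, assuming a spark-shell session where sc is predefined. The data in pairs and the choice of a HashPartitioner with 2 partitions are assumptions made for illustration. glom() is used only to make the partition boundaries visible.

```scala
// Assumption: running inside spark-shell, where `sc` already exists.
import org.apache.spark.HashPartitioner

// Hypothetical (key, value) pairs in arbitrary order over 3 partitions.
val pairs = sc.parallelize(
  List((3, "c"), (1, "a"), (4, "d"), (2, "b"), (5, "e"), (6, "f")), 3)

// Repartition by key hash into 2 partitions and sort by key inside each one.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))

// glom() turns each partition into an array so the grouping can be inspected.
val perPartition = sorted.glom().collect().map(_.map(_._1).toList)
perPartition.zipWithIndex.foreach { case (keys, i) =>
  println(s"partition $i: ${keys.mkString(",")}")
}
```

For Int keys, HashPartitioner(2) places even keys in partition 0 and odd keys in partition 1, and each partition comes back sorted by key. The data is not globally sorted across partitions.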
