Spark RDD Operators


1. Introduction

Every Spark application contains a driver program that runs the user's main method on the cluster and executes a variety of parallel operations. The main abstraction Spark provides is the resilient distributed dataset (RDD). An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel. An RDD can be created by reading data from HDFS (or any other Hadoop-supported file system), or from an existing Scala collection.

 

2. RDD operators fall roughly into two categories:

Transformation: transformation operators. They do not trigger job submission; they describe the intermediate processing steps of a job (lazy evaluation).

Action: action operators. They cause the SparkContext to submit a job.

Transformations are executed lazily: they only record metadata (the lineage), and the actual computation starts only when an action is triggered.

 

3. Some Small Examples

scala> val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10))

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:27

Note: this only defines the RDD from the local collection; no computation happens yet.

 

scala> val rdd2 = rdd1.map(_*10)

rdd2: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:29

Note: the elements of rdd2 have not actually been multiplied by 10 yet; Spark only records that a map operator was applied and which anonymous function to run.

 

scala> val rdd3 = rdd2.filter(_<50)

rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[2] at filter at <console>:25

Note: the filter is not applied immediately either.

 

scala> rdd3.collect

res0: Array[Int] = Array(10, 20, 30, 40)
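Note: collect is an action, so this is the point where the recorded map and filter steps are actually executed and the result is brought back to the driver.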

Summary:

There are two ways to create an RDD (a minimal sketch follows the list):

1. From a file in HDFS or another Hadoop-supported file system; the RDD does not hold the data to be computed, it only records metadata.

2. From a Scala collection or array, by parallelizing it.
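A minimal sketch of both approaches for the spark-shell; the HDFS path below is a placeholder, not taken from the original session:

val rddFromFile = sc.textFile("hdfs://namenode:9000/path/to/file.txt")  // from a Hadoop-supported file system (hypothetical path)

val rddFromColl = sc.parallelize(1 to 100)                              // from a local Scala collection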

 

The five properties of an RDD (a shell sketch for inspecting some of them follows the list):

--A list of partitions (a single partition always lives on one machine, but one machine can hold several partitions)

--A function that is applied to each partition

--A list of dependencies on other RDDs (used to recompute lost data from the lineage)

--Optionally, a partitioner for key-value RDDs (the default is hash partitioning)

--Optionally, a list of preferred locations on which to compute each partition (e.g. the block locations of an HDFS file)
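Some of these properties can be inspected directly in the shell; a minimal sketch (the exact output depends on the Spark version and the RDD):

rdd1.partitions.length                         // number of partitions

rdd1.dependencies                              // dependencies on parent RDDs

rdd1.partitioner                               // Some(partitioner) for keyed RDDs, otherwise None

rdd1.preferredLocations(rdd1.partitions(0))    // preferred locations of the first partition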

 

# Check the number of partitions of the RDD

rdd1.partitions.length

 

# The number of partitions can be specified; here, five partitions

val rdd1 = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10), 5)
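With the partition count given explicitly, the check above reflects it:

rdd1.partitions.length   // 5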

 

# Sorting

rdd1.map(_*10).sortBy(x => x, true).collect

rdd1.map(_*10).sortBy(x => x + "", true).collect
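Note (not from the original session): the first call sorts numerically, while appending "" turns each element into a String, so the second call sorts lexicographically; for the values 10 to 100 that places 100 right after 10:

rdd1.map(_*10).sortBy(x => x, true).collect       // Array(10, 20, 30, ..., 100)

rdd1.map(_*10).sortBy(x => x + "", true).collect  // Array(10, 100, 20, 30, ..., 90)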

 

 

scala> val rdd4 = sc.parallelize(Array("a b c","d e f","h i j"))

rdd4: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:27

scala> rdd4.flatMap(_.split(" ")).collect

res7: Array[String] = Array(a, b, c, d, e, f, h, i, j)
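Note: flatMap maps each element and then flattens the results; a plain map with the same function would have produced an array of arrays instead of a single flat Array[String].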

 

scala> val rdd5 = sc.parallelize(List(List("a b c","a b b"),List("e f g","a f g"),List("h i j","a a b")))

rdd5: org.apache.spark.rdd.RDD[List[String]] = ParallelCollectionRDD[4] at parallelize at <console>:27

scala> rdd5.flatMap(_.flatMap(_.split(" "))).collect

res9: Array[String] = Array(a, b, c, a, b, b, e, f, g, a, f, g, h, i, j, a, a, b)

 

# Union

scala> val rdd6 = sc.parallelize(List(5,6,4,7))

rdd6: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[7] at parallelize at <console>:27

 

scala> val rdd7 = sc.parallelize(List(1,2,3,4))

rdd7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[8] at parallelize at <console>:27

 

scala> val rdd8 = rdd6.union(rdd7)

rdd8: org.apache.spark.rdd.RDD[Int] = UnionRDD[9] at union at <console>:31

 

scala> rdd8.collect

res10: Array[Int] = Array(5, 6, 4, 7, 1, 2, 3, 4)
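Note (not in the original session): unlike SQL's UNION, the union operator keeps duplicate elements; chaining distinct removes them:

scala> rdd8.distinct.collect   // would give each distinct element once, i.e. 1 through 7, in some order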

 

# Intersection

scala> val rdd9 = rdd6.intersection(rdd7)

rdd9: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[15] at intersection at <console>:31

 

scala> rdd9.collect

res11: Array[Int] = Array(4)  

 

scala> val rdd1 = sc.parallelize(List(("tom", 1), ("jerry", 2), ("kitty", 3)))

rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[16] at parallelize at <console>:27

 

scala> val rdd2 = sc.parallelize(List(("jerry", 9), ("tom", 8), ("shuke", 7)))

rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[17] at parallelize at <console>:27

 

scala> rdd1.intersection(rdd2).collect

res12: Array[(String, Int)] = Array()

Note: the intersection is empty; for pair RDDs, intersection compares the whole (key, value) tuple, and no identical pair exists in both RDDs.

# join (inner join on matching keys)

scala> rdd1.join(rdd2).collect

res13: Array[(String, (Int, Int))] = Array((tom,(1,8)), (jerry,(2,9)))

 

# Modify rdd2 (add a second value for "tom")

 

scala> val rdd2 = sc.parallelize(List(("jerry", 9), ("tom", 8), ("shuke", 7), ("tom", 2)))

rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[1] at parallelize at <console>:27

 

scala> rdd1.join(rdd2).collect

res0: Array[(String, (Int, Int))] = Array((tom,(1,8)), (tom,(1,2)), (jerry,(2,9)))

 

 

scala> rdd1.leftOuterJoin(rdd2).collect

res1: Array[(String, (Int, Option[Int]))] = Array((tom,(1,Some(2))), (tom,(1,Some(8))), (jerry,(2,Some(9))), (kitty,(3,None)))

Note: left outer join; every key from the left RDD is kept, and missing matches on the right become None.

 

scala> rdd1.rightOuterJoin(rdd2).collect

res2: Array[(String, (Option[Int], Int))] = Array((tom,(Some(1),2)), (tom,(Some(1),8)), (jerry,(Some(2),9)), (shuke,(None,7)))
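For completeness (not part of the original session), Spark also provides fullOuterJoin, which keeps the keys from both sides and wraps both values in Option:

scala> rdd1.fullOuterJoin(rdd2).collect   // would include tom, jerry, kitty and shuke, with None where a side has no match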

 

#groupByKey

scala> val rdd3 = rdd1 union rdd2

rdd3: org.apache.spark.rdd.RDD[(String, Int)] = UnionRDD[3] at union at <console>:31

 

scala> rdd3.groupByKey

res0: org.apache.spark.rdd.RDD[(String, Iterable[Int])] = ShuffledRDD[4] at groupByKey at <console>:34

 

scala> rdd3.groupByKey.collect

res1: Array[(String, Iterable[Int])] = Array((tom,CompactBuffer(1, 8, 2)), (jerry,CompactBuffer(2, 9)), (shuke,CompactBuffer(7)), (kitty,CompactBuffer(3)))

 

scala> rdd3.groupByKey.map(x => (x._1, x._2.sum)).collect

res5: Array[(String, Int)] = Array((tom,11), (jerry,11), (shuke,7), (kitty,3))

 

scala> rdd3.groupByKey.mapValues(_.sum).collect

res7: Array[(String, Int)] = Array((tom,11), (jerry,11), (shuke,7), (kitty,3))
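Note (not in the original session): for aggregations like this sum, reduceByKey is usually preferred over groupByKey because it combines values within each partition before the shuffle:

scala> rdd3.reduceByKey(_ + _).collect   // same result: Array((tom,11), (jerry,11), (shuke,7), (kitty,3)), order may differ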

 

# cogroup

scala> val rdd1 = sc.parallelize(List(("tom", 1), ("tom", 2), ("jerry", 3), ("kitty", 2)))

rdd1: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[15] at parallelize at <console>:27

 

scala> val rdd2 = sc.parallelize(List(("jerry", 2), ("tom", 1), ("shuke", 2)))

rdd2: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[16] at parallelize at <console>:27

 

scala> val rdd3 = rdd1.cogroup(rdd2)

rdd3: org.apache.spark.rdd.RDD[(String, (Iterable[Int], Iterable[Int]))] = MapPartitionsRDD[18] at cogroup at <console>:31


scala> val rdd4 = rdd3.map(t => (t._1, t._2._1.sum + t._2._2.sum))

rdd4: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[19] at map at <console>:33

 

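Note: cogroup groups the values for each key from both RDDs into a pair of Iterables, and the map above then sums every value per key across both RDDs. Collecting rdd4 for this data would give (order may vary):

scala> rdd4.collect   // Array((tom,4), (jerry,5), (kitty,2), (shuke,2))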

 

# cartesian (Cartesian product)

scala> val rdd1 = sc.parallelize(List("tom", "jerry"))

rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[20] at parallelize at <console>:27

 

scala> val rdd2 = sc.parallelize(List("tom", "kitty", "shuke"))

rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[21] at parallelize at <console>:27

 

scala> val rdd3 = rdd1.cartesian(rdd2)

rdd3: org.apache.spark.rdd.RDD[(String, String)] = CartesianRDD[22] at cartesian at <console>:31

 

scala> rdd3.collect

res9: Array[(String, String)] = Array((tom,tom), (tom,kitty), (tom,shuke), (jerry,tom), (jerry,kitty), (jerry,shuke))

 

scala> val rdd1 = sc.parallelize(List(1,2,3,4,5), 2)

rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:27

 

scala> rdd1.collect

res10: Array[Int] = Array(1, 2, 3, 4, 5)

 

scala> val rdd2 = rdd1.reduce(_+_)

rdd2: Int = 15
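Note: reduce is an action; despite the variable name, rdd2 here is a plain Int with the value 15, not an RDD.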

 

scala> rdd1.count

res11: Long = 5

 

scala> rdd1.top(2)

res12: Array[Int] = Array(5, 4)

 

scala> rdd1.take(2)

res13: Array[Int] = Array(1, 2)

 

scala> rdd1.first

res14: Int = 1

 

scala> rdd1.takeOrdered(3)
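Note: takeOrdered(n) returns the n smallest elements in ascending order; for this RDD it would return Array(1, 2, 3).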