Spark RDD API Reference Examples (Part 3)

Source: Internet. Editor: 程序博客网. Time: 2024/06/07 01:42

This article draws on the work of Zhen He.

28. getCheckpointFile

Prototype
def getCheckpointFile: Option[String]

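Description
getCheckpointFile returns the path of the RDD's checkpoint file, or None if the RDD has not been checkpointed yet. Checkpoints are mainly used in large computations so that work can be recovered from a specific point.

Example

// Set the checkpoint directory (on a cluster this must be an HDFS path)
sc.setCheckpointDir("hdfs://192.168.10.71:9000/wc")
val a = sc.parallelize(1 to 500, 5)
val b = a++a++a++a++a
// Get the checkpoint file path of b
b.getCheckpointFile
// There is no checkpoint file yet
res5: Option[String] = None
// Mark b for checkpointing; nothing is written yet because RDDs are evaluated lazily
b.checkpoint
b.getCheckpointFile
res10: Option[String] = None
// The checkpoint is only written when an action runs
b.collect
// Get the path of the checkpoint file written above
b.getCheckpointFile
res15: Option[String] = Some(hdfs://192.168.10.71:9000/wc/e7f2340a-b37b-4d97-8b48-58253e6e4464/rdd-133)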

29. getStorageLevel

Prototype
def getStorageLevel: StorageLevel

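Description
getStorageLevel returns the RDD's current storage level. Once a storage level has been assigned to an RDD, it cannot be changed to a different one.

Example

val a = sc.parallelize(1 to 100000, 2)
// The output shows that the RDD is stored in memory, deserialized, with 1 replica
a.getStorageLevel
res1: org.apache.spark.storage.StorageLevel = StorageLevel(memory, deserialized, 1 replicas)
// The storage level can be set in advance
val a = sc.parallelize(1 to 100000, 2)
a.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
a.getStorageLevel
// The output shows that the RDD is stored on disk with 1 replica
res2: org.apache.spark.storage.StorageLevel = StorageLevel(disk, 1 replicas)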

30. glom

Prototype
def glom(): RDD[Array[T]]

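Description
glom collects the elements of each partition into an array and returns an RDD whose elements are those per-partition arrays.

Example

val a = sc.parallelize(1 to 10, 3)
a.glom.collect
// Each partition becomes one array, and collect gathers those arrays into an outer array
res1:  Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))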
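To make the partition-to-array mapping concrete, here is a minimal sketch in plain Scala collections (no Spark required). The object name GlomSketch and the helper slice are hypothetical, and the index arithmetic is an assumption chosen to reproduce the split shown in the example output; it is not presented as Spark's own partitioning code.

```scala
object GlomSketch {
  // Split `data` into `numSlices` contiguous chunks.
  // Assumption: chunk i covers the index range [i*n/k, (i+1)*n/k),
  // which reproduces the 3/3/4 split seen in the glom example.
  def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    val n = data.length
    (0 until numSlices).map { i =>
      val start = (i.toLong * n / numSlices).toInt
      val end = ((i + 1).toLong * n / numSlices).toInt
      data.slice(start, end)
    }
  }

  def main(args: Array[String]): Unit = {
    // One inner list per simulated partition
    println(slice(1 to 10, 3).map(_.toList))
  }
}
```

Each inner sequence plays the role of one partition's array, and the outer sequence plays the role of the collected result.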

31. groupBy

Prototype
def groupBy[K: ClassTag](f: T => K): RDD[(K, Iterable[T])]
def groupBy[K: ClassTag](f: T => K, numPartitions: Int): RDD[(K, Iterable[T])]

Description
groupBy groups the elements of the RDD by the key that the given function returns for each element. An optional second argument sets the number of partitions of the result.

Example

val a = sc.parallelize(1 to 9, 3)
// The first argument of groupBy is a function that decides the group; its return value is the group label
// Here the labels are "even" and "odd"
a.groupBy(x => { if (x % 2 == 0) "even" else "odd" }).collect
res1: Array((even,CompactBuffer(2, 8, 4, 6)), (odd,CompactBuffer(5, 1, 3, 7, 9)))
// Here the labels are 0, 1 and 2
a.groupBy(x => (x % 3)).collect
res2: Array((0,CompactBuffer(3, 9, 6)), (1,CompactBuffer(4, 1, 7)), (2,CompactBuffer(2, 8, 5)))
// Group with a user-defined function
val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int) : Int =
{
  a % 2
}
// The second argument of groupBy sets how many partitions the grouped result is stored in; by default it matches the partition count of the source RDD
a.groupBy(x => myfunc(x), 3).collect
res2: Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))
a.groupBy(x => myfunc(x), 3).partitions.length
res4: Int = 3
// Set the number of result partitions to 1
a.groupBy(myfunc(_), 1).collect
res3: Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))
a.groupBy(myfunc(_), 1).partitions.length
res5: Int = 1

32. groupByKey [Pair]

Prototype
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]

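Description
groupByKey is very similar to groupBy but takes no function. It works on key-value RDDs and simply groups the values by key, so values with the same key end up in the same group. This makes it simpler than groupBy.

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
// Build tuples with the word length as key and the word as value
val b = a.keyBy(_.length)
// groupByKey takes no function; it groups directly by key
b.groupByKey.collect
res1: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))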
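The relationship between keyBy, groupByKey and groupBy can be sketched with plain Scala collections, without Spark. The object GroupSketch and its three helpers are hypothetical names used only for illustration; they mimic the grouping semantics on local sequences and say nothing about how Spark shuffles data.

```scala
object GroupSketch {
  // keyBy: pair every element with the key computed by f
  def keyBy[K, T](data: Seq[T])(f: T => K): Seq[(K, T)] =
    data.map(x => (f(x), x))

  // groupByKey: gather the values that share a key; no function is involved
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => (k, kvs.map(_._2)) }

  // groupBy expressed as keyBy followed by groupByKey
  def groupBy[K, T](data: Seq[T])(f: T => K): Map[K, Seq[T]] =
    groupByKey(keyBy(data)(f))

  def main(args: Array[String]): Unit = {
    val words = Seq("dog", "tiger", "lion", "cat", "spider", "eagle")
    println(groupBy(words)(_.length))
  }
}
```

In this sketch groupBy is nothing more than keyBy followed by groupByKey, which is why the two operations produce the same groups when the key function is the same.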

33. histogram [Double]

Prototype
def histogram(bucketCount: Int): (Array[Double], Array[Long])
def histogram(buckets: Array[Double], evenBuckets: Boolean = false): Array[Long]

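Description
histogram computes a histogram of the values in the RDD. The bucket boundaries form the horizontal axis and the number of elements in each bucket forms the vertical axis. The boundaries can be obtained in two ways: the first form takes the number of buckets and spaces them evenly between the minimum and maximum of the data; the second form takes the bucket boundaries explicitly. Every bucket includes its lower bound and excludes its upper bound, except the last bucket, which includes both.

Example

// Derive the boundaries from the requested number of buckets
val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)
a.histogram(6)
// 6 buckets are delimited by 7 boundary points
res1: (Array[Double], Array[Long]) = (Array(1.0, 2.5, 4.0, 5.5, 7.0, 8.5, 10.0),Array(6, 0, 1, 1, 3, 4))
// Use boundaries supplied by the caller
val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 9.0), 3)
a.histogram(Array(0.0, 3.0, 8.0))
res2: Array[Long] = Array(5, 3)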
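How the counts in the two outputs come about can be sketched with plain Scala collections. HistogramSketch, evenBounds and countInto are hypothetical names, and the sketch assumes that each bucket includes its lower bound and excludes its upper bound, with the last bucket also including its upper bound; that assumption is consistent with both outputs in the example. It is an illustration of the counting rule, not Spark's implementation.

```scala
object HistogramSketch {
  // Evenly spaced boundaries between the minimum and maximum of the data
  def evenBounds(values: Seq[Double], bucketCount: Int): Array[Double] = {
    val (lo, hi) = (values.min, values.max)
    val step = (hi - lo) / bucketCount
    // Use `hi` itself for the last boundary to avoid rounding drift
    Array.tabulate(bucketCount + 1)(i => if (i == bucketCount) hi else lo + i * step)
  }

  // Count the values that fall into each bucket delimited by `bounds`.
  // Values outside [bounds.head, bounds.last] are ignored.
  def countInto(values: Seq[Double], bounds: Array[Double]): Array[Long] = {
    val counts = new Array[Long](bounds.length - 1)
    for (v <- values if v >= bounds.head && v <= bounds.last) {
      // Index of the last boundary that is <= v; the maximum lands in the last bucket
      val idx = bounds.lastIndexWhere(_ <= v)
      counts(math.min(idx, counts.length - 1)) += 1
    }
    counts
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5)
    val bounds = evenBounds(data, 6)
    println(bounds.mkString(", "))
    println(countInto(data, bounds).mkString(", "))
  }
}
```

With explicit boundaries such as Array(0.0, 3.0, 8.0), values above 8.0 fall outside every bucket and are not counted, which is why 8.8 and 9.0 do not appear in the second result.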

34. id

Prototype
val id: Int

Description
id returns the identifier that the system assigned to the RDD. The identifier can be used to locate a particular RDD.

Example

val y = sc.parallelize(1 to 10, 10)
y.id
res1: Int = 19

35. intersection

Prototype
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
def intersection(other: RDD[T]): RDD[T]

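Description
intersection returns the elements that appear in both RDDs, that is, their intersection. The result contains no duplicates.

Example

// Intersection of plain elements
val x = sc.parallelize(1 to 20)
val y = sc.parallelize(10 to 30)
val z = x.intersection(y)
// Collect the intersection of the two sets
z.collect
res1: Array[Int] = Array(16, 12, 20, 13, 17, 14, 18, 10, 19, 15, 11)
// Intersection of tuples
val x = sc.parallelize(List(("cat",2),("wolf",1),("gnu",1)))
val y = sc.parallelize(List(("cat",1),("wolf",1),("mouse",1)))
val z = x.intersection(y)
z.collect
// Only tuples that are identical in every field count as the same element
res2: Array[(String, Int)] = Array((wolf,1))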
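The tuple case is easier to see on local data. The following sketch uses plain Scala collections; IntersectionSketch and its intersection helper are hypothetical names, and unlike the RDD version this sketch keeps the order of the left input.

```scala
object IntersectionSketch {
  // Elements present in both inputs, each reported once, in left-input order
  def intersection[T](left: Seq[T], right: Seq[T]): Seq[T] = {
    val inRight: Set[T] = right.toSet
    left.distinct.filter(inRight.contains)
  }

  def main(args: Array[String]): Unit = {
    val x = Seq(("cat", 2), ("wolf", 1), ("gnu", 1))
    val y = Seq(("cat", 1), ("wolf", 1), ("mouse", 1))
    // ("cat", 2) and ("cat", 1) differ in the second field, so they do not match
    println(intersection(x, y))
  }
}
```

Tuples are compared field by field, so sharing only the first field is not enough for two tuples to be treated as the same element.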

36. isCheckpointed

Prototype
def isCheckpointed: Boolean

Description
isCheckpointed reports whether the RDD has been checkpointed.

Example

// Set up checkpointing
val c = sc.parallelize(1 to 10)
sc.setCheckpointDir("hdfs://192.168.10.71:9000/wc")
c.isCheckpointed
res1: Boolean = false
// Evaluation is lazy: the checkpoint is only written when an action runs
c.checkpoint
c.isCheckpointed
res2: Boolean = false
// Run an action, which writes the checkpoint
c.collect
c.isCheckpointed
res3: Boolean = true

37. join [Pair]

Prototype
def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, W))]
def join[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, W))]

Description
join performs an inner join of two key-value RDDs, similar to an inner join in a database. Only keys that appear in both RDDs produce output, with one result pair for every combination of matching values.

Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"), 3)
val d = c.keyBy(_.length)
// Entries with the same key are joined together
b.join(d).collect
res0: Array[(Int, (String, String))] = Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)), (3,(dog,dog)), (3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(rat,dog)), (3,(rat,cat)), (3,(rat,gnu)), (3,(rat,bee)))
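The number of rows in the result follows from pairing every left value with every right value that has the same key. A minimal sketch with plain Scala collections shows this; JoinSketch and its join helper are hypothetical names, and the nested loop is for illustration only and is not how Spark executes a join.

```scala
object JoinSketch {
  // Inner join: one output pair per combination of left and right values sharing a key
  def join[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    for {
      (k, v) <- left
      (k2, w) <- right
      if k == k2
    } yield (k, (v, w))

  def main(args: Array[String]): Unit = {
    val b = Seq("dog", "salmon", "salmon", "rat", "elephant").map(w => (w.length, w))
    val d = Seq("dog", "cat", "gnu", "salmon", "rabbit", "turkey", "wolf", "bear", "bee").map(w => (w.length, w))
    join(b, d).foreach(println)
  }
}
```

In the example data, key 6 has two left values and three right values, giving six pairs, and key 3 has two left values and four right values, giving eight pairs. Keys 8 (left only) and 4 (right only) produce nothing.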

38. keyBy

Prototype
def keyBy[K](f: T => K): RDD[(K, T)]

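Description
keyBy applies the given function to each element to compute its key; the function can be user-defined. The result is an RDD of (key, element) tuples.

Example

val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
// Use the length of each word as the key of the element
val b = a.keyBy(_.length)
b.collect
res1: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))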

39. keys [Pair]

Prototype
def keys: RDD[K]

Description
keys returns an RDD containing the key of every tuple. Duplicate keys are kept.

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.keys.collect
// Keys may appear more than once
res2: Array[Int] = Array(3, 5, 4, 3, 7, 5)

40. leftOuterJoin [Pair]

Prototype
def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V, Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (V, Option[W]))]
def leftOuterJoin[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (V, Option[W]))]

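Description
leftOuterJoin is similar to a left outer join in a database. Every key of the left RDD appears in the result; where the right RDD has no matching key the right-hand value is None, and keys that exist only in the right RDD are dropped.

Example

val a = sc.parallelize(List(("dog",2),("salmon",2),("rat",1),("elephant",10)),3)
val b = sc.parallelize(List(("dog",2),("salmon",2),("rabbit",1),("cat",7)), 3)
a.leftOuterJoin(b).collect
// Every left-hand entry appears in the result; right-hand entries without a left-hand match are dropped
res1: Array((rat,(1,None)), (salmon,(2,Some(2))), (elephant,(10,None)), (dog,(2,Some(2))))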
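The way missing right-hand values turn into None can be sketched with plain Scala collections. LeftOuterJoinSketch and its leftOuterJoin helper are hypothetical names; the sketch only mirrors the semantics on local sequences and keeps the order of the left input, which the RDD version does not guarantee.

```scala
object LeftOuterJoinSketch {
  // Left outer join: every left entry is kept; a missing right match becomes None
  def leftOuterJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      // All right-hand values with the same key, wrapped in Some
      val matches: Seq[Option[W]] = right.filter(_._1 == k).map(kw => Some(kw._2))
      // With no match, emit a single row carrying None
      val padded = if (matches.isEmpty) Seq(Option.empty[W]) else matches
      padded.map(w => (k, (v, w)))
    }

  def main(args: Array[String]): Unit = {
    val a = Seq(("dog", 2), ("salmon", 2), ("rat", 1), ("elephant", 10))
    val b = Seq(("dog", 2), ("salmon", 2), ("rabbit", 1), ("cat", 7))
    leftOuterJoin(a, b).foreach(println)
  }
}
```

The right-hand keys rabbit and cat never appear in the output because the iteration is driven entirely by the left input.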

41. lookup

Prototype
def lookup(key: K): Seq[V]

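Description
lookup returns all values associated with the given key. It scans the whole RDD, unless the RDD has a known partitioner, in which case only the partition that the key maps to is searched.

Example

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.lookup(5)
// Finds the values with key 5 by scanning the whole RDD
res1: Seq[String] = WrappedArray(tiger, eagle)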
