Spark Programming: The Basic RDD Operators count, countApproxDistinct, countByValue, and Others

Source: Internet  Editor: 程序博客网  Time: 2024/05/21 14:51


  • 1 count

count returns the number of elements stored in an RDD.

def count(): Long

val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.count
res2: Long = 4
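As a rough mental model (plain Scala, not Spark code), count can be pictured as each partition counting its own elements, with the partial counts then added together. The object name CountModel and the manual split into two partitions of two elements are assumptions made purely for illustration; in Spark the actual split is decided by the framework.

```scala
// A local model of RDD.count built on plain Scala collections (no Spark needed).
object CountModel {
  // Assumption: the four elements end up in 2 partitions of 2 elements each,
  // imitating sc.parallelize(List(...), 2).
  val partitions: List[List[String]] =
    List("Gnu", "Cat", "Rat", "Dog").grouped(2).toList

  // Every partition reports how many elements it holds...
  val perPartition: List[Long] = partitions.map(_.size.toLong)

  // ...and the partial counts are summed to give the final result.
  val total: Long = perPartition.sum

  def main(args: Array[String]): Unit =
    println(s"per-partition = $perPartition, total = $total")
}
```

The point of the sketch is that only one number per partition has to travel back to the driver, which is why count stays cheap even when the data set itself is large.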

  • 2 countApproxDistinct

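Computes the approximate number of distinct values in an RDD. For a very large RDD spread across many nodes, this approximate computation can be faster than an exact one such as distinct().count().
The parameter relativeSD controls the accuracy of the estimate: the smaller the value, the more accurate the result, at the cost of more memory for the counter.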

def countApproxDistinct(relativeSD: Double = 0.05): Long

val a = sc.parallelize(1 to 10000, 20)
val b = a++a++a++a++a
b.countApproxDistinct(0.1)
res14: Long = 8224
b.countApproxDistinct(0.05)
res15: Long = 9750
b.countApproxDistinct(0.01)
res16: Long = 9947
b.countApproxDistinct(0.001)
res0: Long = 10000
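To make the effect of relativeSD concrete, the following plain-Scala sketch computes the relative error of each estimate from the session above. Since b is five copies of 1 to 10000, the exact number of distinct values is 10000. The object name ApproxError and the helper relativeError are illustrative names, not part of Spark.

```scala
// Relative error of the estimates shown in the session above (no Spark needed).
object ApproxError {
  // b is the union of five copies of 1 to 10000, so 10000 values are distinct.
  val exact: Long = 10000L

  // (relativeSD, estimate) pairs copied from the session output.
  val observed: Seq[(Double, Long)] =
    Seq(0.1 -> 8224L, 0.05 -> 9750L, 0.01 -> 9947L, 0.001 -> 10000L)

  // |estimate - exact| / exact
  def relativeError(estimate: Long): Double =
    math.abs(estimate - exact).toDouble / exact

  def main(args: Array[String]): Unit =
    observed.foreach { case (sd, est) =>
      println(f"relativeSD=$sd%.3f estimate=$est%d error=${relativeError(est)}%.4f")
    }
}
```

The error shrinks as relativeSD shrinks. Note that the first estimate is off by more than 0.1: relativeSD describes the expected spread of the estimator rather than a hard upper bound, so an individual run may land outside it.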

  • 3 countApproxDistinctByKey [Pair]

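This operator works on key-value pair RDDs. It is similar to countApproxDistinct above, but for each distinct key it estimates the number of distinct values associated with that key. The elements of the RDD must therefore be two-component tuples. As before, the parameter relativeSD controls the accuracy: the smaller the value, the more accurate the estimate.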

def countApproxDistinctByKey(relativeSD: Double = 0.05): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, numPartitions: Int): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner): RDD[(K, Long)]

val a = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
val b = sc.parallelize(a.takeSample(true, 10000, 0), 20)
val c = sc.parallelize(1 to b.count().toInt, 20)
val d = b.zip(c)
d.countApproxDistinctByKey(0.1).collect
res15: Array[(String, Long)] = Array((Rat,2567), (Cat,3357), (Dog,2414), (Gnu,2494))
d.countApproxDistinctByKey(0.01).collect
res16: Array[(String, Long)] = Array((Rat,2555), (Cat,2455), (Dog,2425), (Gnu,2513))
d.countApproxDistinctByKey(0.001).collect
res0: Array[(String, Long)] = Array((Rat,2562), (Cat,2464), (Dog,2451), (Gnu,2521))
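For comparison, here is a minimal plain-Scala sketch of the exact quantity that countApproxDistinctByKey estimates: the number of distinct values per key. The object DistinctByKeyModel, the function exactDistinctByKey, and the small sample data are all made up for illustration; Spark itself does not compute the result this way.

```scala
// Exact local counterpart of "distinct values per key" (no Spark needed).
object DistinctByKeyModel {
  // Group the pairs by key, then count the distinct values inside each group.
  def exactDistinctByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.groupBy(_._1).map { case (key, kvs) =>
      key -> kvs.map(_._2).distinct.size.toLong
    }

  // Hypothetical sample: "Gnu" has values 1, 1, 3 and "Cat" has values 2, 2.
  val sample: Seq[(String, Int)] =
    Seq("Gnu" -> 1, "Cat" -> 2, "Gnu" -> 1, "Gnu" -> 3, "Cat" -> 2)

  // Duplicated values are counted once per key.
  val result: Map[String, Long] = exactDistinctByKey(sample)

  def main(args: Array[String]): Unit = println(result)
}
```

In the session above every value in d is unique (the values are 1 to 10000), so the exact per-key results would add up to 10000. The four estimates for relativeSD = 0.001 add up to 9998, which is a quick sanity check on how close the approximation is.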
  • 4 countByKey

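Operates on key-value pair RDDs and counts, for each key, how many elements carry that key. The values themselves are ignored, and the result is returned to the driver as a local Map.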

def countByKey(): Map[K, Long]

val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
c.countByKey
res3: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)
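Conceptually, countByKey amounts to replacing every value with 1 and summing those ones per key. The plain-Scala sketch below reproduces the result of the session above under that reading; the object CountByKeyModel and its countByKey function are illustrative stand-ins, not Spark's own code.

```scala
// A local model of countByKey: add 1 to a key's counter for every pair seen.
object CountByKeyModel {
  def countByKey[K, V](pairs: Seq[(K, V)]): Map[K, Long] =
    pairs.foldLeft(Map.empty[K, Long]) { case (acc, (key, _)) =>
      // The value is ignored; only the key decides which counter is bumped.
      acc.updated(key, acc.getOrElse(key, 0L) + 1L)
    }

  // Same data as in the session above.
  val data: List[(Int, String)] =
    List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog"))

  val result: Map[Int, Long] = countByKey(data)

  def main(args: Array[String]): Unit = println(result)
}
```

Because the whole result is a single Map held on the driver, this operator is best suited to data with a modest number of distinct keys.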
  • 5 countByValue

Counts how many times each distinct element occurs in an RDD. The result is a Map from each element to its number of occurrences.

def countByValue(): Map[T, Long]

val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
res27: scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)
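The same result can be reproduced locally by grouping equal elements and taking the size of each group. This is only a plain-Scala sketch of the semantics; the object CountByValueModel and its countByValue function are hypothetical names chosen for the example.

```scala
// A local model of countByValue: group equal elements, then count each group.
object CountByValueModel {
  def countByValue[T](xs: Seq[T]): Map[T, Long] =
    xs.groupBy(identity).map { case (value, occurrences) =>
      value -> occurrences.size.toLong
    }

  // Same data as in the session above.
  val data: List[Int] = List(1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1)

  val result: Map[Int, Long] = countByValue(data)

  def main(args: Array[String]): Unit = println(result)
}
```

As with countByKey, the resulting Map lives entirely in driver memory. When the number of distinct elements is very large, keeping the counts distributed with rdd.map(x => (x, 1L)).reduceByKey(_ + _) is the usual alternative.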