Spark programming: basic RDD operators count, countApproxDistinct, countByValue, etc.
- 1 count
count returns the number of elements stored in the RDD.
def count(): Long
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
c.count
res2: Long = 4
- 2 countApproxDistinct
countApproxDistinct computes an approximate count of the distinct elements in the RDD. For a large RDD spread across many nodes, this is generally much faster than an exact distinct count.
The relativeSD parameter controls the precision of the estimate: smaller values give a more accurate result.
def countApproxDistinct(relativeSD: Double = 0.05): Long
val a = sc.parallelize(1 to 10000, 20)
val b = a ++ a ++ a ++ a ++ a
b.countApproxDistinct(0.1)
res14: Long = 8224
b.countApproxDistinct(0.05)
res15: Long = 9750
b.countApproxDistinct(0.01)
res16: Long = 9947
b.countApproxDistinct(0.001)
res0: Long = 10000
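For intuition, the exact distinct count of the same data can be reproduced with plain Scala collections (a local stand-in for the RDD, not Spark code; on a cluster the exact equivalent would be b.distinct.count, which requires a full shuffle, whereas countApproxDistinct only merges small per-partition sketches):

```scala
// Local stand-in for the RDD above: five copies of 1..10000.
val a = (1 to 10000).toList
val b = a ++ a ++ a ++ a ++ a

// Exact distinct count: 10000, the value the approximations above converge to.
val exact = b.distinct.size
println(exact)
```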
- 3 countApproxDistinctByKey [Pair]
This operator works on an RDD of key-value pairs and is similar to countApproxDistinct,
except that it computes, for each distinct key, an approximate count of the distinct values associated with that key. The RDD's elements must therefore be tuples. As before, the relativeSD parameter controls the precision of the estimate: smaller values give a more accurate result.
def countApproxDistinctByKey(relativeSD: Double = 0.05): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, numPartitions: Int): RDD[(K, Long)]
def countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner): RDD[(K, Long)]
val a = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog"), 2)
val b = sc.parallelize(a.takeSample(true, 10000, 0), 20)
val c = sc.parallelize(1 to b.count().toInt, 20)
val d = b.zip(c)
d.countApproxDistinctByKey(0.1).collect
res15: Array[(String, Long)] = Array((Rat,2567), (Cat,3357), (Dog,2414), (Gnu,2494))
d.countApproxDistinctByKey(0.01).collect
res16: Array[(String, Long)] = Array((Rat,2555), (Cat,2455), (Dog,2425), (Gnu,2513))
d.countApproxDistinctByKey(0.001).collect
res0: Array[(String, Long)] = Array((Rat,2562), (Cat,2464), (Dog,2451), (Gnu,2521))
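As a reference point for what is being approximated, the exact per-key distinct-value count can be written with plain Scala collections (a local sketch, not Spark API; the sample pairs here are hypothetical):

```scala
// Hypothetical local pairs standing in for the pair RDD d above.
val pairs = List(("Cat", 1), ("Cat", 2), ("Cat", 2), ("Dog", 1), ("Dog", 1))

// Exact distinct-value count per key; countApproxDistinctByKey estimates this.
val exactByKey: Map[String, Int] =
  pairs.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).distinct.size }

// "Cat" maps to 2 (distinct values 1 and 2); "Dog" maps to 1 (only value 1).
println(exactByKey)
```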
- 4 countByKey
countByKey also operates on key-value pairs, but it counts, for each key, how many pairs have that key (regardless of whether the values are distinct).
def countByKey(): Map[K, Long]
val c = sc.parallelize(List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog")), 2)
c.countByKey
res3: scala.collection.Map[Int,Long] = Map(3 -> 3, 5 -> 1)
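The same result can be reproduced locally with plain Scala collections (a sketch of the semantics of countByKey, not Spark code):

```scala
// Same pairs as the example above, as a local collection.
val pairs = List((3, "Gnu"), (3, "Yak"), (5, "Mouse"), (3, "Dog"))

// Count pairs per key: this is exactly what countByKey returns for the RDD.
// The result contains 3 -> 3 and 5 -> 1.
val byKey: Map[Int, Int] = pairs.groupBy(_._1).map { case (k, vs) => k -> vs.size }
```

Note that countByKey returns the whole map to the driver, so it is only suitable when the number of distinct keys is small.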
- 5 countByValue
countByValue counts how many times each element occurs in the RDD, returning a map from each distinct element to its number of occurrences.
def countByValue(): Map[T, Long]
val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))
b.countByValue
res27: scala.collection.Map[Int,Long] = Map(5 -> 1, 8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)
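The map returned above can be reproduced with plain Scala collections (a local sketch of the semantics, not Spark code):

```scala
// Same data as the RDD example above, as a local collection.
val xs = List(1, 2, 3, 4, 5, 6, 7, 8, 2, 4, 2, 1, 1, 1, 1, 1)

// Occurrence count per element: what countByValue returns for the RDD.
// e.g. 1 -> 6, 2 -> 3, 4 -> 2, and every other element -> 1.
val byValue: Map[Int, Int] = xs.groupBy(identity).map { case (v, occ) => v -> occ.size }
```

Like countByKey, this collects the full map to the driver, so it should only be used when the number of distinct elements is small.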