RDD-Transformation——groupByKey
Source: Internet | Editor: 程序博客网 | Date: 2024/06/07 05:18
Overview
def groupByKey(): RDD[(K, Iterable[V])]
def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])]
def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])]
For each key K in an RDD[(K, V)], this function collects all of that key's values into a single collection Iterable[V].
The numPartitions parameter sets the number of partitions of the resulting RDD; the partitioner parameter supplies a custom partitioning function.
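As a rough stand-in for these semantics, here is a sketch using plain Scala collections rather than Spark (groupByKeyLocal is a hypothetical helper for illustration, not a Spark API):

```scala
// Plain-Scala sketch of what groupByKey computes: every value sharing a
// key ends up in one collection per key. The real Spark operator
// additionally shuffles data so each key's values land in one partition.
object GroupByKeySketch {
  def groupByKeyLocal[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  def main(args: Array[String]): Unit =
    println(groupByKeyLocal(Seq(('A', 0), ('A', 2), ('B', 1), ('C', 4))))
}
```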
Usage
scala> var rdd = sc.makeRDD(Array(('A',0),('A',2),('B',1),('C',4)))
rdd: org.apache.spark.rdd.RDD[(Char, Int)] = ParallelCollectionRDD[4] at makeRDD at <console>:27

scala> rdd.groupByKey().collect
res2: Array[(Char, Iterable[Int])] = Array((A,CompactBuffer(0, 2)), (B,CompactBuffer(1)), (C,CompactBuffer(4)))
How it works
Each element is passed through a function that produces its key, turning the data into key-value pairs; elements with the same key are then placed in the same group.
In the diagram (not reproduced here), each box represents an RDD partition, and elements with the same key are merged into one group. For example, V1 and V2 are merged into a single key-value pair whose key is "V" and whose value is "V1, V2", i.e. (V, Seq(V1, V2)).
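The step the diagram describes, deriving a key from each element and then grouping equal keys, can be sketched with plain Scala collections (the take(1) keying function is an assumption chosen to match the V1/V2 example):

```scala
// Derive a key from each element, then group equal keys together:
// a local stand-in for the distributed grouping shown in the diagram.
object KeyedGrouping {
  val elems = Seq("V1", "V2", "U1", "W1")

  // key = leading letter, so "V1" and "V2" both map to key "V"
  val byKey: Map[String, Seq[String]] = elems.groupBy(_.take(1))

  def main(args: Array[String]): Unit =
    println(byKey) // the "V" group holds V1 and V2: (V, Seq(V1, V2))
}
```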
Source
Note that what follows is the source of RDD.groupBy; its final overload is what delegates to groupByKey.

/**
 * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
 * mapping to that key. The ordering of elements within each group is not guaranteed, and
 * may even differ each time the resulting RDD is evaluated.
 *
 * Note: This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
 * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
 */
def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] =
  groupBy[K](f, defaultPartitioner(this))

/**
 * Return an RDD of grouped elements. Each group consists of a key and a sequence of elements
 * mapping to that key. The ordering of elements within each group is not guaranteed, and
 * may even differ each time the resulting RDD is evaluated.
 *
 * Note: This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
 * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
 */
def groupBy[K](f: T => K, numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] =
  groupBy(f, new HashPartitioner(numPartitions))

/**
 * Return an RDD of grouped items. Each group consists of a key and a sequence of elements
 * mapping to that key. The ordering of elements within each group is not guaranteed, and
 * may even differ each time the resulting RDD is evaluated.
 *
 * Note: This operation may be very expensive. If you are grouping in order to perform an
 * aggregation (such as a sum or average) over each key, using [[PairRDDFunctions.aggregateByKey]]
 * or [[PairRDDFunctions.reduceByKey]] will provide much better performance.
 */
def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null)
    : RDD[(K, Iterable[T])] = {
  val cleanF = sc.clean(f)
  this.map(t => (cleanF(t), t)).groupByKey(p)
}
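The performance note in the scaladoc is the key takeaway: for per-key aggregation, folding values as you go (the reduceByKey/aggregateByKey approach) avoids materializing every value for a key. The two strategies can be contrasted with a plain-Scala sketch (local stand-ins for illustration, not Spark calls):

```scala
// Local contrast of the two aggregation strategies from the scaladoc note.
object SumPerKey {
  val pairs = Seq(("a", 1), ("a", 3), ("b", 2))

  // groupByKey style: first collect all values per key, then reduce them.
  def viaGroup: Map[String, Int] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  // reduceByKey style: merge each value into a running total as it arrives;
  // in Spark this merging also happens map-side, before the shuffle,
  // so far less data crosses the network.
  def viaReduce: Map[String, Int] =
    pairs.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0) + v)
    }

  def main(args: Array[String]): Unit =
    println(viaGroup == viaReduce) // both yield the same per-key sums
}
```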