【Spark Java API】Transformation(13)—zipWithIndex、zipWithUniqueId
Source: Internet | Editor: 程序博客网 | Date: 2024/05/18 14:12
zipWithIndex
Official documentation:
Zips this RDD with its element indices. The ordering is first based on the partition index and then the ordering of items within each partition. So the first item in the first partition gets index 0, and the last item in the last partition receives the largest index. This is similar to Scala's zipWithIndex but it uses Long instead of Int as the index type. This method needs to trigger a spark job when this RDD contains more than one partitions.
Function prototype:
def zipWithIndex(): JavaPairRDD[T, JLong]
This method zips each element of the RDD with its index within the RDD, producing an RDD of key/value pairs.
Source analysis:
def zipWithIndex(): RDD[(T, Long)] = withScope {
  new ZippedWithIndexRDD(this)
}

/** The start index of each partition. */
@transient private val startIndices: Array[Long] = {
  val n = prev.partitions.length
  if (n == 0) {
    Array[Long]()
  } else if (n == 1) {
    Array(0L)
  } else {
    prev.context.runJob(
      prev,
      Utils.getIteratorSize _,
      0 until n - 1, // do not need to count the last partition
      allowLocal = false
    ).scanLeft(0L)(_ + _)
  }
}

override def compute(splitIn: Partition, context: TaskContext): Iterator[(T, Long)] = {
  val split = splitIn.asInstanceOf[ZippedWithIndexRDDPartition]
  firstParent[T].iterator(split.prev, context).zipWithIndex.map { x =>
    (x._1, split.startIndex + x._2)
  }
}
As the source shows, the method returns a ZippedWithIndexRDD. That RDD first computes startIndices, the global index at which each partition starts (a running sum of the preceding partitions' sizes, which requires running a job to count them); compute then applies Scala's zipWithIndex to each partition's iterator and adds the partition's start index to obtain the global index.
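The startIndices computation above can be sketched in plain Java. This is a hypothetical stand-alone helper (not Spark code) that mirrors the scanLeft(0L)(_ + _) over partition sizes:

```java
import java.util.Arrays;

public class StartIndicesSketch {
    // Mirrors ZippedWithIndexRDD's startIndices: a running sum over the
    // partition sizes, so each partition starts where the previous one ended.
    static long[] startIndices(int[] partitionSizes) {
        int n = partitionSizes.length;
        long[] starts = new long[n];
        long acc = 0L;
        for (int i = 0; i < n; i++) {
            starts[i] = acc;          // global index of this partition's first element
            acc += partitionSizes[i]; // the last partition's size is never needed
        }
        return starts;
    }

    public static void main(String[] args) {
        // parallelize([5,1,1,4,4,2,2], 3) yields partition sizes [2, 2, 3]
        System.out.println(Arrays.toString(startIndices(new int[]{2, 2, 3})));
        // prints [0, 2, 4]
    }
}
```

Note why the last partition need not be counted: its start index depends only on the partitions before it, which is exactly why the Spark source runs the counting job over 0 until n - 1.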
Example:
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);

JavaPairRDD<Integer, Long> zipWithIndexRDD = javaRDD.zipWithIndex();
// Indices follow partition order, then order within each partition:
// [(5,0), (1,1), (1,2), (4,3), (4,4), (2,5), (2,6)]
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + zipWithIndexRDD.collect());
zipWithUniqueId
Official documentation:
Zips this RDD with generated unique Long ids. Items in the kth partition will get ids k, n+k, 2*n+k, ..., where n is the number of partitions. So there may exist gaps, but this method won't trigger a spark job, which is different from [[org.apache.spark.rdd.RDD#zipWithIndex]].
Function prototype:
def zipWithUniqueId(): JavaPairRDD[T, JLong]
This method zips each element of the RDD with a unique Long id. The ids are generated per partition: the first element of a partition gets an id equal to that partition's index, and in general the i-th element (0-based) of partition k gets id i * n + k, where n is the total number of partitions of the RDD.
Source analysis:
def zipWithUniqueId(): RDD[(T, Long)] = withScope {
  val n = this.partitions.length.toLong
  this.mapPartitionsWithIndex { case (k, iter) =>
    iter.zipWithIndex.map { case (item, i) =>
      (item, i * n + k)
    }
  }
}
As the source shows, zipWithUniqueId() uses mapPartitionsWithIndex() to obtain each partition's index k, then computes each element's id as i * n + k, where i is the element's position within the partition and n is the number of partitions. Because no partition counting is needed, no Spark job is triggered.
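The id-assignment rule can be sketched in plain Java. This is a hypothetical simulation (not Spark code) that applies i * n + k to each partition, using the partition sizes that parallelize(data, 3) would produce for the 7-element example below:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ZipWithUniqueIdSketch {
    // Simulates zipWithUniqueId: the i-th element (0-based) of partition k
    // receives id i * n + k, where n is the number of partitions.
    static List<long[]> uniqueIds(int numPartitions, int[] partitionSizes) {
        List<long[]> ids = new ArrayList<>();
        for (int k = 0; k < numPartitions; k++) {
            long[] partIds = new long[partitionSizes[k]];
            for (int i = 0; i < partitionSizes[k]; i++) {
                partIds[i] = (long) i * numPartitions + k;
            }
            ids.add(partIds);
        }
        return ids;
    }

    public static void main(String[] args) {
        // parallelize([5,1,1,4,4,2,2], 3) splits as [5,1], [1,4], [4,2,2]
        for (long[] p : uniqueIds(3, new int[]{2, 2, 3})) {
            System.out.println(Arrays.toString(p));
        }
        // partition 0 -> [0, 3]; partition 1 -> [1, 4]; partition 2 -> [2, 5, 8]
    }
}
```

Note the gaps: ids 6 and 7 are never assigned, because partitions 0 and 1 each hold only two elements. This is the trade-off versus zipWithIndex: ids are unique but not contiguous, in exchange for not triggering a job.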
Example:
List<Integer> data = Arrays.asList(5, 1, 1, 4, 4, 2, 2);
JavaRDD<Integer> javaRDD = javaSparkContext.parallelize(data, 3);

JavaPairRDD<Integer, Long> zipWithUniqueIdRDD = javaRDD.zipWithUniqueId();
// With 3 partitions ([5,1], [1,4], [4,2,2]), ids are i * 3 + k per partition:
// [(5,0), (1,3), (1,1), (4,4), (4,2), (2,5), (2,8)]
System.out.println("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~" + zipWithUniqueIdRDD.collect());