Spark Operators: Transformations
This section covers the following transformation operators:
- map
- flatMap
- distinct
- coalesce
- repartition
- randomSplit
- glom
- union
- intersection
- subtract
- mapPartitions
- mapPartitionsWithIndex
- zip
- zipPartitions
- zipWithIndex
- zipWithUniqueId
map function
map passes each element of an RDD through a function to produce a new element.
Input and output partitions correspond one to one: there are as many output partitions as input partitions.
```scala
// Read an HDFS file into an RDD
scala> val data = sc.textFile("text.txt")
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[1] at textFile at <console>:21
// Apply the map operator
scala> val result = data.map(line => line.split(','))
result: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:23
// Collect the result of the map operator
scala> result.collect
res0: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark), Array(hello, hive))
```
Note when using flatMap: flatMap treats a string as an array of characters.
Observe the following example:
```scala
scala> data.map(_.toUpperCase).collect
res32: Array[String] = Array(HELLO WORLD, HELLO SPARK, HELLO HIVE, HI SPARK)
scala> data.flatMap(_.toUpperCase).collect
res33: Array[Char] = Array(H, E, L, L, O,  , W, O, R, L, D, H, E, L, L, O,  , S, P, A, R, K, H, E, L, L, O,  , H, I, V, E, H, I,  , S, P, A, R, K)
```
Now observe the next example:
```scala
scala> data.map(x => x.split("\\s+")).collect
res34: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark), Array(hello, hive), Array(hi, spark))
scala> data.flatMap(x => x.split("\\s+")).collect
res35: Array[String] = Array(hello, world, hello, spark, hello, hive, hi, spark)
```
flatMap function
flatMap is a transformation operator similar to map, except that the resulting collections are flattened into a single sequence of elements.
It is equivalent to performing map followed by flatten.
```scala
// Apply the flatMap operator
scala> val result = data.flatMap(line => line.split(','))
result: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[3] at flatMap at <console>:23
// Collect the result of the flatMap operator
scala> result.collect
res1: Array[String] = Array(hello, world, hello, spark, hello, hive)
```
This time the result matches expectations: the strings were not treated as character arrays.
That is because the mapping function here returns Array[String], not String.
flatMap only flattens a String into characters; it does not flatten an Array[String] further into characters, only into its String elements.
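The same one-level flattening can be reproduced with plain Scala collections (no Spark needed), since flatMap behaves like map followed by flatten:

```scala
val lines = List("hello world", "hi spark")
lines.map(_.split(" "))          // a List of two Array[String] values
lines.flatMap(_.split(" "))      // List(hello, world, hi, spark)
lines.map(_.split(" ")).flatten  // equivalent to the flatMap above
```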
distinct function
Removes duplicate elements from the RDD.
```scala
scala> data.flatMap(line => line.split(',')).collect
res61: Array[String] = Array(hello, world, hello, spark, hello, hive, hi, spark)
scala> data.flatMap(line => line.split(',')).distinct.collect
res62: Array[String] = Array(hive, hello, world, spark, hi)
```
coalesce function
def coalesce(numPartitions: Int, shuffle: Boolean = false)(implicit ord: Ordering[T] = null): RDD[T]
This function repartitions an RDD (when a shuffle is performed, a HashPartitioner is used).
The first parameter is the target number of partitions; the second controls whether a shuffle is performed and defaults to false.
An example first:
```scala
scala> val data = sc.textFile("text.txt")
data: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[53] at textFile at <console>:21
scala> data.collect
res37: Array[String] = Array(hello world, hello spark, hello hive, hi spark)
scala> data.partitions.size
res38: Int = 2  // RDD data has two partitions by default
scala> val rdd1 = data.coalesce(1)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[2] at coalesce at <console>:23
scala> rdd1.partitions.size
res1: Int = 1  // rdd1 has one partition
scala> var rdd1 = data.coalesce(4)
rdd1: org.apache.spark.rdd.RDD[String] = CoalescedRDD[3] at coalesce at <console>:23
scala> rdd1.partitions.size
res2: Int = 2
// To increase the number of partitions beyond the original count, the shuffle
// parameter must be set to true; otherwise the partition count stays unchanged.
scala> var rdd1 = data.coalesce(4, true)
rdd1: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[7] at coalesce at <console>:23
scala> rdd1.partitions.size
res3: Int = 4
```
repartition function
def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]
This function is simply coalesce with its second parameter fixed to true.
```scala
scala> var rdd2 = data.repartition(1)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[11] at repartition at <console>:23
scala> rdd2.partitions.size
res4: Int = 1
scala> var rdd2 = data.repartition(4)
rdd2: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[15] at repartition at <console>:23
scala> rdd2.partitions.size
res5: Int = 4
```
randomSplit
def randomSplit(weights: Array[Double], seed: Long = Utils.random.nextLong): Array[RDD[T]]
This function splits one RDD into multiple RDDs according to the weights.
The weights parameter is an array of Doubles.
The second parameter is the seed for the random number generator and can usually be left at its default.
```scala
scala> var rdd = sc.makeRDD(1 to 10, 10)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[16] at makeRDD at <console>:21
scala> rdd.collect
res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
scala> var splitRDD = rdd.randomSplit(Array(1.0, 2.0, 3.0, 4.0))
splitRDD: Array[org.apache.spark.rdd.RDD[Int]] = Array(MapPartitionsRDD[17] at randomSplit at <console>:23, MapPartitionsRDD[18] at randomSplit at <console>:23, MapPartitionsRDD[19] at randomSplit at <console>:23, MapPartitionsRDD[20] at randomSplit at <console>:23)
// Note: the result of randomSplit is an array of RDDs
scala> splitRDD.size
res8: Int = 4
// Because the weights array contains four values, the RDD is split into four RDDs.
// The elements of the original rdd are randomly assigned to these four RDDs according
// to the weights 1.0, 2.0, 3.0, 4.0; an RDD with a higher weight receives elements
// with a higher probability. Weights that do not sum to 1 are normalized internally.
scala> splitRDD(0).collect
res10: Array[Int] = Array(1, 4)
scala> splitRDD(1).collect
res11: Array[Int] = Array(3)
scala> splitRDD(2).collect
res12: Array[Int] = Array(5, 9)
scala> splitRDD(3).collect
res13: Array[Int] = Array(2, 6, 7, 8, 10)
```
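One common use of randomSplit, sketched here under the assumption of an existing RDD named `data`, is carving out training and test sets:

```scala
// Hedged sketch: `data` stands for any existing RDD.
// Weights 0.8/0.2 give roughly an 80/20 split; a fixed seed makes it reproducible.
val Array(training, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)
```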
glom
def glom(): RDD[Array[T]]
This function converts the elements of type T in each partition of the RDD into an Array[T], so that each partition contains exactly one array element.
```scala
scala> var rdd = sc.makeRDD(1 to 10, 3)
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[38] at makeRDD at <console>:21
scala> rdd.partitions.size
res33: Int = 3  // the RDD has three partitions
scala> rdd.glom().collect
res35: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9, 10))
// glom gathers the elements of each partition into an array,
// so the result is three arrays
```
union
def union(other: RDD[T]): RDD[T]
This function is straightforward: it concatenates two RDDs without deduplication.
```scala
scala> val rdd1 = sc.makeRDD(1 to 2, 1)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at <console>:21
scala> rdd1.collect
res42: Array[Int] = Array(1, 2)
scala> val rdd2 = sc.makeRDD(2 to 3, 1)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:21
scala> rdd2.collect
res43: Array[Int] = Array(2, 3)
scala> rdd1.union(rdd2).collect
res44: Array[Int] = Array(1, 2, 2, 3)
```
intersection
def intersection(other: RDD[T]): RDD[T]
def intersection(other: RDD[T], numPartitions: Int): RDD[T]
def intersection(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
This function returns the intersection of two RDDs, with duplicates removed.
The numPartitions parameter specifies the number of partitions of the returned RDD.
The partitioner parameter specifies the partitioning function.
```scala
scala> var rdd1 = sc.makeRDD(1 to 2, 1)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[45] at makeRDD at <console>:21
scala> rdd1.collect
res42: Array[Int] = Array(1, 2)
scala> var rdd2 = sc.makeRDD(2 to 3, 1)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[46] at makeRDD at <console>:21
scala> rdd2.collect
res43: Array[Int] = Array(2, 3)
scala> rdd1.intersection(rdd2).collect
res45: Array[Int] = Array(2)
scala> var rdd3 = rdd1.intersection(rdd2)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[59] at intersection at <console>:25
scala> rdd3.partitions.size
res46: Int = 1
scala> var rdd3 = rdd1.intersection(rdd2, 2)
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[65] at intersection at <console>:25
scala> rdd3.partitions.size
res47: Int = 2
```
subtract
def subtract(other: RDD[T]): RDD[T]
def subtract(other: RDD[T], numPartitions: Int): RDD[T]
def subtract(other: RDD[T], partitioner: Partitioner)(implicit ord: Ordering[T] = null): RDD[T]
This function is similar in form to intersection, but it returns the elements that appear in this RDD and do not appear in otherRDD, without deduplication.
The parameters have the same meaning as in intersection.
```scala
scala> var rdd1 = sc.makeRDD(Seq(1, 2, 2, 3))
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[66] at makeRDD at <console>:21
scala> rdd1.collect
res48: Array[Int] = Array(1, 2, 2, 3)
scala> var rdd2 = sc.makeRDD(3 to 4)
rdd2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[67] at makeRDD at <console>:21
scala> rdd2.collect
res49: Array[Int] = Array(3, 4)
scala> rdd1.subtract(rdd2).collect
res50: Array[Int] = Array(1, 2, 2)
```
mapPartitions
def mapPartitions[U](f: (Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
This function is similar to map, except that the mapping function's argument is an iterator over each partition of the RDD rather than each individual element. When the mapping has to create expensive auxiliary objects repeatedly, mapPartitions can be much more efficient than map.
For example, suppose all the data in an RDD is to be written to a database over JDBC: with map, a connection might have to be created for every element, which is very costly; with mapPartitions, only one connection per partition is needed.
The preservesPartitioning parameter indicates whether the parent RDD's partitioner information is preserved.
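The database scenario above might be sketched as follows; this is a hypothetical illustration (the JDBC URL, credentials, and table are placeholders, not from the original article):

```scala
import java.sql.DriverManager

// Hypothetical sketch: one JDBC connection per partition instead of one per element.
val written = rdd.mapPartitions { iter =>
  val conn = DriverManager.getConnection("jdbc:mysql://host/db", "user", "pass")
  val stmt = conn.prepareStatement("INSERT INTO t (v) VALUES (?)")
  var n = 0
  iter.foreach { v =>
    stmt.setInt(1, v)
    stmt.executeUpdate()
    n += 1
  }
  stmt.close()
  conn.close()
  Iterator(n)  // rows written from this partition
}
```

Note that mapPartitions is lazy and only runs when its result is consumed; for a pure side-effecting write, the action foreachPartition is usually the better fit.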
```scala
scala> val rdd1 = sc.makeRDD(1 to 5, 2)  // rdd1 has two partitions
scala> var rdd3 = rdd1.mapPartitions{ x => {
     |   var result = List[Int]()
     |   var i = 0
     |   while (x.hasNext) {
     |     i += x.next()
     |   }
     |   result.::(i).iterator
     | }}
rdd3: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[84] at mapPartitions at <console>:23
// rdd3 sums the values within each partition of rdd1
scala> rdd3.collect
res65: Array[Int] = Array(3, 12)
scala> rdd3.partitions.size
res66: Int = 2
```
mapPartitionsWithIndex
def mapPartitionsWithIndex[U](f: (Int, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean = false)(implicit arg0: ClassTag[U]): RDD[U]
Same effect as mapPartitions, except that the mapping function takes two parameters, the first of which is the partition index.
```scala
var rdd1 = sc.makeRDD(1 to 5, 2)  // rdd1 has two partitions
var rdd2 = rdd1.mapPartitionsWithIndex{ (x, iter) => {
  var result = List[String]()
  var i = 0
  while (iter.hasNext) {
    i += iter.next()
  }
  result.::(x + "|" + i).iterator
}}
// rdd2 sums the numbers within each partition of rdd1 and prefixes each
// partition's sum with its partition index
scala> rdd2.collect
res13: Array[String] = Array(0|3, 1|12)
```
zip
def zip[U](other: RDD[U])(implicit arg0: ClassTag[U]): RDD[(T, U)]
The zip function combines two RDDs into an RDD of key/value pairs. It requires the two RDDs to have the same number of partitions and the same number of elements; otherwise an exception is thrown.
```scala
scala> var rdd1 = sc.makeRDD(1 to 5, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at makeRDD at <console>:21
scala> var rdd2 = sc.makeRDD(Seq("A","B","C","D","E"), 2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at makeRDD at <console>:21
scala> rdd1.zip(rdd2).collect
res0: Array[(Int, String)] = Array((1,A), (2,B), (3,C), (4,D), (5,E))
scala> rdd2.zip(rdd1).collect
res1: Array[(String, Int)] = Array((A,1), (B,2), (C,3), (D,4), (E,5))
scala> var rdd3 = sc.makeRDD(Seq("A","B","C","D","E"), 3)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[5] at makeRDD at <console>:21
scala> rdd1.zip(rdd3).collect
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
// If the two RDDs have different partition counts, an exception is thrown
```
zipPartitions
The zipPartitions function combines multiple RDDs partition by partition into a new RDD. The RDDs being combined must have the same number of partitions, but there is no requirement on the number of elements within each partition.
The function has several overloads, falling into three groups:
- One RDD as the argument
def zipPartitions[B, V](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]
def zipPartitions[B, V](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[V]): RDD[V]
The only difference between these two is the preservesPartitioning parameter, which controls whether the parent RDD's partitioner information is preserved.
The mapping function f takes the two RDDs' iterators as its parameters.
```scala
scala> var rdd1 = sc.makeRDD(1 to 5, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[22] at makeRDD at <console>:21
scala> var rdd2 = sc.makeRDD(Seq("A","B","C","D","E"), 2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[23] at makeRDD at <console>:21
// Element distribution across rdd1's two partitions:
scala> rdd1.mapPartitionsWithIndex{
     |   (x, iter) => {
     |     var result = List[String]()
     |     while (iter.hasNext) {
     |       result ::= ("part_" + x + "|" + iter.next())
     |     }
     |     result.iterator
     |   }
     | }.collect
res17: Array[String] = Array(part_0|2, part_0|1, part_1|5, part_1|4, part_1|3)
// Element distribution across rdd2's two partitions:
scala> rdd2.mapPartitionsWithIndex{
     |   (x, iter) => {
     |     var result = List[String]()
     |     while (iter.hasNext) {
     |       result ::= ("part_" + x + "|" + iter.next())
     |     }
     |     result.iterator
     |   }
     | }.collect
res18: Array[String] = Array(part_0|B, part_0|A, part_1|E, part_1|D, part_1|C)
// zipPartitions on rdd1 and rdd2:
scala> rdd1.zipPartitions(rdd2){
     |   (rdd1Iter, rdd2Iter) => {
     |     var result = List[String]()
     |     while (rdd1Iter.hasNext && rdd2Iter.hasNext) {
     |       result ::= (rdd1Iter.next() + "_" + rdd2Iter.next())
     |     }
     |     result.iterator
     |   }
     | }.collect
res19: Array[String] = Array(2_B, 1_A, 5_E, 4_D, 3_C)
```
- Two RDDs as arguments
def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C])(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]
def zipPartitions[B, C, V](rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[V]): RDD[V]
Same usage as above, except that the function takes two RDDs as arguments, and the mapping function f receives three iterators: one for this RDD and one for each of the two arguments.
```scala
scala> var rdd1 = sc.makeRDD(1 to 5, 2)
rdd1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[27] at makeRDD at <console>:21
scala> var rdd2 = sc.makeRDD(Seq("A","B","C","D","E"), 2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[28] at makeRDD at <console>:21
scala> var rdd3 = sc.makeRDD(Seq("a","b","c","d","e"), 2)
rdd3: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[29] at makeRDD at <console>:21
// Element distribution across rdd3's partitions:
scala> rdd3.mapPartitionsWithIndex{
     |   (x, iter) => {
     |     var result = List[String]()
     |     while (iter.hasNext) {
     |       result ::= ("part_" + x + "|" + iter.next())
     |     }
     |     result.iterator
     |   }
     | }.collect
res21: Array[String] = Array(part_0|b, part_0|a, part_1|e, part_1|d, part_1|c)
// zipPartitions on the three RDDs:
scala> var rdd4 = rdd1.zipPartitions(rdd2, rdd3){
     |   (rdd1Iter, rdd2Iter, rdd3Iter) => {
     |     var result = List[String]()
     |     while (rdd1Iter.hasNext && rdd2Iter.hasNext && rdd3Iter.hasNext) {
     |       result ::= (rdd1Iter.next() + "_" + rdd2Iter.next() + "_" + rdd3Iter.next())
     |     }
     |     result.iterator
     |   }
     | }
rdd4: org.apache.spark.rdd.RDD[String] = ZippedPartitionsRDD3[33] at zipPartitions at <console>:27
scala> rdd4.collect
res23: Array[String] = Array(2_B_b, 1_A_a, 5_E_e, 4_D_d, 3_C_c)
```
- Three RDDs as arguments
def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]
def zipPartitions[B, C, D, V](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V])(implicit arg0: ClassTag[B], arg1: ClassTag[C], arg2: ClassTag[D], arg3: ClassTag[V]): RDD[V]
Same usage as above, with just one more RDD parameter.
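The original gives no transcript for the four-way case; a minimal sketch (the RDD contents are made up for illustration) might look like this, walking the four iterators in lock step within each partition:

```scala
// Hypothetical sketch: zip the partitions of four RDDs, all with two partitions.
val r1 = sc.makeRDD(1 to 4, 2)
val r2 = sc.makeRDD(Seq("A", "B", "C", "D"), 2)
val r3 = sc.makeRDD(Seq("a", "b", "c", "d"), 2)
val r4 = sc.makeRDD(Seq("w", "x", "y", "z"), 2)
val zipped = r1.zipPartitions(r2, r3, r4) { (i1, i2, i3, i4) =>
  // Iterator.zip pairs elements positionally within the partition
  i1.zip(i2).zip(i3).zip(i4).map { case (((a, b), c), d) => s"${a}_${b}_${c}_${d}" }
}
```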
zipWithIndex
def zipWithIndex(): RDD[(T, Long)]
This function pairs each element of the RDD with that element's index (ID) in the RDD as a key/value pair.
```scala
scala> var rdd2 = sc.makeRDD(Seq("A","B","R","D","F"), 2)
rdd2: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[34] at makeRDD at <console>:21
scala> rdd2.zipWithIndex().collect
res27: Array[(String, Long)] = Array((A,0), (B,1), (R,2), (D,3), (F,4))
```
zipWithUniqueId
def zipWithUniqueId(): RDD[(T, Long)]
This function pairs each element of the RDD with a unique ID as a key/value pair. The unique IDs are generated as follows:
- the unique ID of the first element in each partition is that partition's index;
- the unique ID of the Nth element in each partition is (the previous element's unique ID) + (the RDD's total number of partitions).
See the following example:
```scala
scala> var rdd1 = sc.makeRDD(Seq("A","B","C","D","E","F"), 2)
rdd1: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[44] at makeRDD at <console>:21
// rdd1 has two partitions
scala> rdd1.zipWithUniqueId().collect
res32: Array[(String, Long)] = Array((A,0), (B,2), (C,4), (D,1), (E,3), (F,5))
// The total number of partitions is 2.
// The first element of partition 0 gets ID 0; the first element of partition 1 gets ID 1.
// In partition 0: the second element gets 0+2=2, the third gets 2+2=4.
// In partition 1: the second element gets 1+2=3, the third gets 3+2=5.
```
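The rule above reduces to a closed-form expression; a sketch (the function name is ours, not part of the Spark API):

```scala
// ID of the n-th element (0-based) in partition k, for an RDD with p partitions
def uniqueId(k: Int, n: Int, p: Int): Long = k.toLong + n.toLong * p

// e.g. with p = 2: partition 0 yields 0, 2, 4 and partition 1 yields 1, 3, 5
```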
Original article: http://lxw1234.com/archives/2015/07/363.htm