Spark RDD API in Practice
- map
Applies a transformation function on each item of the RDD and returns the result as a new RDD.
```scala
// 3 means the RDD is created with 3 partitions
var a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
// build a new RDD from the length of each element of a
var b = a.map(_.length)
// combine the two RDDs into a new RDD of pairs
var c = a.zip(b)
c.collect
res0: Array[(String, Int)] = Array((dog,3), (salmon,6), (salmon,6), (rat,3), (elephant,8))
```
- zip
Joins two RDDs by combining the i-th elements of each partition with each other. The resulting RDD will consist of two-component tuples, which are interpreted as key-value pairs by the methods provided by the PairRDDFunctions extension.
```scala
var a1 = sc.parallelize(1 to 10, 3)
var b1 = sc.parallelize(11 to 20, 3)
a1.zip(b1).collect
res1: Array[(Int, Int)] = Array((1,11), (2,12), (3,13), (4,14), (5,15), (6,16), (7,17), (8,18), (9,19), (10,20))
var a2 = sc.parallelize(1 to 10, 3)
var b2 = sc.parallelize(11 to 20, 3)
var c2 = sc.parallelize(21 to 30, 3)
a2.zip(b2).zip(c2).collect
res3: Array[((Int, Int), Int)] = Array(((1,11),21), ((2,12),22), ((3,13),23), ((4,14),24), ((5,15),25), ((6,16),26), ((7,17),27), ((8,18),28), ((9,19),29), ((10,20),30))
a2.zip(b2).zip(c2).map((x) => (x._1._1, x._1._2, x._2)).collect
res2: Array[(Int, Int, Int)] = Array((1,11,21), (2,12,22), (3,13,23), (4,14,24), (5,15,25), (6,16,26), (7,17,27), (8,18,28), (9,19,29), (10,20,30))
```
- filter
Evaluates a boolean function for each data item of the RDD and puts the items for which the function returned true into the resulting RDD.
```scala
val a = sc.parallelize(1 to 10, 3)
val b = a.filter(_ % 2 == 0)
b.collect
res4: Array[Int] = Array(2, 4, 6, 8, 10)
```
- flatMap
Similar to map, but allows emitting more than one item in the map function. map turns each element into exactly one element; flatMap turns each element into one or more elements.
```scala
var a = sc.parallelize(1 to 10, 5)
a.flatMap(1 to _).collect
res8: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
sc.parallelize(List(1, 2, 3), 2).flatMap(x => List(x, x, x)).collect
res9: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)
var x = sc.parallelize(1 to 5, 3)
x.flatMap(List.fill(scala.util.Random.nextInt(5))(_)).collect
res10: Array[Int] = Array(1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5)
```
- mapPartitions
This is a specialized map that is called only once for each partition. The entire content of the respective partition is available as a sequential stream of values via the input argument (Iterator[T]). The custom function must return yet another Iterator[U]. The combined result iterators are automatically converted into a new RDD. Please note that the tuples (3,4) and (6,7) are missing from the following result due to the partitioning we chose. In other words: the elements of each partition are processed with the given function to produce a new RDD.
```scala
val a = sc.parallelize(1 to 9, 3)
def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
  var res = List[(T, T)]()
  var pre = iter.next
  while (iter.hasNext) {
    val cur = iter.next
    res = (pre, cur) :: res   // prepend, so pairs come out in reverse order
    pre = cur
  }
  res.iterator
}
a.mapPartitions(myfunc).collect
res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8))
```
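The prepend-based `myfunc` pairs each element of a partition with its successor (in reverse order, which is why (2,3) precedes (1,2)). The same pairwise idea can be sketched with a plain Scala iterator, no Spark required:

```scala
// Pair each element with its successor, as myfunc does per partition.
// sliding(2) expresses this directly and keeps the pairs in order.
val partition = Iterator(1, 2, 3)
val pairs = partition.sliding(2).map { case Seq(a, b) => (a, b) }.toList
// pairs == List((1,2), (2,3))
```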
- mapPartitionsWithIndex
Similar to mapPartitions, but takes two parameters. The first parameter is the index of the partition and the second is an iterator through all the items within this partition. The output is an iterator containing the list of items after applying whatever transformation the function encodes.
```scala
val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)
def myfunc(index: Int, iter: Iterator[Int]): Iterator[String] = {
  iter.toList.map(x => index + "," + x).iterator
}
x.mapPartitionsWithIndex(myfunc).collect()
res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10)
```
- sample
Randomly selects a fraction of the items of a RDD and returns them in a new RDD.
```scala
val a = sc.parallelize(1 to 10000, 3)
a.sample(false, 0.1, 0).count
res24: Long = 960
a.sample(true, 0.3, 0).count
res25: Long = 2888
a.sample(true, 0.3, 13).count
res26: Long = 2985
```
- union, ++
Performs the standard set operation: A union B. union/++ puts the elements of both RDDs directly into the new RDD; zip, by contrast, combines the elements of the two RDDs into tuples, and those tuples become the new RDD's elements.
```scala
val a = sc.parallelize(1 to 3, 1)
val b = sc.parallelize(5 to 7, 1)
(a ++ b).collect
res0: Array[Int] = Array(1, 2, 3, 5, 6, 7)
```
- intersection
Returns the elements in the two RDDs which are the same.
```scala
val x = sc.parallelize(1 to 20)
val y = sc.parallelize(10 to 30)
val z = x.intersection(y)
z.collect
res74: Array[Int] = Array(16, 12, 20, 13, 17, 14, 18, 10, 19, 15, 11)
```
- distinct
Returns a new RDD that contains each unique value only once.
```scala
val c = sc.parallelize(List("Gnu", "Cat", "Rat", "Dog", "Gnu", "Rat"), 2)
c.distinct.collect
res6: Array[String] = Array(Dog, Gnu, Cat, Rat)
val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))
a.distinct(2).partitions.length
res16: Int = 2
a.distinct(3).partitions.length
res17: Int = 3
```
- groupBy
Groups the elements of the RDD according to the key returned by the supplied function.

```scala
val a = sc.parallelize(1 to 9, 3)
a.groupBy(x => { if (x % 2 == 0) "even" else "odd" }).collect
res42: Array[(String, Seq[Int])] = Array((even,ArrayBuffer(2, 4, 6, 8)), (odd,ArrayBuffer(1, 3, 5, 7, 9)))
val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int): Int = { a % 2 }
a.groupBy(myfunc).collect
res3: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))
val a = sc.parallelize(1 to 9, 3)
def myfunc(a: Int): Int = { a % 2 }
a.groupBy(x => myfunc(x), 3).collect
a.groupBy(myfunc(_), 1).collect
res7: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2, 4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))
```
- keyBy
Constructs two-component tuples (key-value pairs) by applying a function on each data item. The result of the function becomes the key and the original data item becomes the value of the newly created tuples.
```scala
val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val b = a.keyBy(_.length)
b.collect
res26: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))
```
- groupByKey
Very similar to groupBy, but instead of supplying a function, the key-component of each pair will automatically be presented to the partitioner.
```scala
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "spider", "eagle"), 2)
val b = a.keyBy(_.length)
b.groupByKey.collect
res11: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)), (6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger, eagle)))
```
- reduceByKey
This function provides the well-known reduce functionality in Spark. Please note that any function f you provide should be commutative and associative in order to generate reproducible results.
```scala
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res86: Array[(Int, String)] = Array((3,dogcatowlgnuant))
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
b.reduceByKey(_ + _).collect
res87: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))
```
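The note about reproducibility can be seen without Spark: reduceByKey is free to combine values in any grouping across partitions, so only functions for which the grouping does not matter give stable results. A plain-Scala sketch:

```scala
// String concatenation (used in the example above) is associative,
// so any grouping of the per-partition reductions agrees:
val onePartition  = ("dog" + "cat") + "owl"
val twoPartitions = "dog" + ("cat" + "owl")
// both groupings give "dogcatowl"

// Subtraction is neither associative nor commutative, so the grouping
// chosen by the partitioning would change the answer:
val groupedLeft  = (1 - 2) - 3   // -4
val groupedRight = 1 - (2 - 3)   //  2
```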
- aggregate
aggregate takes a zero value and two functions: seqOp folds each partition starting from the zero value, and combOp then merges the per-partition results, again starting from the zero value. This is why the zero value can appear several times in the results below, and why partition order can affect string results.

```scala
val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 9
val z = sc.parallelize(List("a","b","c","d","e","f"), 2)
z.aggregate("")(_ + _, _ + _)
res115: String = abcdef
z.aggregate("x")(_ + _, _ + _)
res116: String = xxdefxabc
val z = sc.parallelize(List("12","23","345","4567"), 2)
z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res141: String = 42
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res142: String = 11
val z = sc.parallelize(List("12","23","345",""), 2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res143: String = 10
```
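The results above follow mechanically from aggregate's two-phase fold. A plain-Scala simulation (`simulateAggregate` is a hypothetical helper, not a Spark API) over the explicit partition layout of `sc.parallelize(List(1,2,3,4,5,6), 2)` reproduces the first two results:

```scala
// Simulate RDD.aggregate: seqOp folds each partition from the zero
// value, then combOp merges the per-partition results, again starting
// from the zero value.
def simulateAggregate[T, U](partitions: List[List[T]], zero: U)
                           (seqOp: (U, T) => U, combOp: (U, U) => U): U = {
  val perPartition = partitions.map(_.foldLeft(zero)(seqOp))
  perPartition.foldLeft(zero)(combOp)
}

// max within each partition -> 3 and 6; sum across partitions -> 9
val r = simulateAggregate(List(List(1, 2, 3), List(4, 5, 6)), 0)(math.max(_, _), _ + _)
// r == 9, matching z.aggregate(0)(math.max(_, _), _ + _)

// With zero = "x" the zero value appears once per partition AND once
// in the final merge; Spark may deliver partitions in either order,
// which is why the REPL shows "xxdefxabc" rather than "xxabcxdef".
val s = simulateAggregate(List(List("a", "b", "c"), List("d", "e", "f")), "x")(_ + _, _ + _)
// s == "xxabcxdef"
```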
- sortByKey
This function sorts the input RDD’s data and stores it in a new RDD. The output RDD is a shuffled RDD because it stores data that is output by a reducer which has been shuffled. The implementation of this function is actually very clever. First, it uses a range partitioner to partition the data in ranges within the shuffled RDD. Then it sorts these ranges individually with mapPartitions using standard sort mechanisms.
```scala
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
c.sortByKey(true).collect
res74: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))
c.sortByKey(false).collect
res75: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))
val a = sc.parallelize(1 to 100, 5)
val b = a.cartesian(a)
val c = sc.parallelize(b.takeSample(true, 5, 13), 2)
val d = c.sortByKey(false)
d.collect
res56: Array[(Int, Int)] = Array((96,9), (84,76), (59,59), (53,65), (52,4))
```
- cogroup
cogroup groups the data of two RDDs by key, collecting each RDD's values for a given key into a separate group.
```scala
val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
b.cogroup(c).collect
res7: Array[(Int, (Iterable[String], Iterable[String]))] = Array((2,(ArrayBuffer(b),ArrayBuffer(c))), (3,(ArrayBuffer(b),ArrayBuffer(c))), (1,(ArrayBuffer(b, b),ArrayBuffer(c, c))))
val d = a.map((_, "d"))
b.cogroup(c, d).collect
res9: Array[(Int, (Iterable[String], Iterable[String], Iterable[String]))] = Array((2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))), (3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))), (1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d, d))))
val x = sc.parallelize(List((1, "apple"), (2, "banana"), (3, "orange"), (4, "kiwi")), 2)
val y = sc.parallelize(List((5, "computer"), (1, "laptop"), (1, "desktop"), (4, "iPad")), 2)
x.cogroup(y).collect
res23: Array[(Int, (Iterable[String], Iterable[String]))] = Array((4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))), (2,(ArrayBuffer(banana),ArrayBuffer())), (3,(ArrayBuffer(orange),ArrayBuffer())), (1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))), (5,(ArrayBuffer(),ArrayBuffer(computer))))
```
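The semantics can be sketched with plain Scala collections (`cogroupLocal` is a hypothetical helper written for illustration, not a Spark API): for every key present in either input, collect the values from each side separately, leaving an empty group where a side has no values for that key.

```scala
// Local sketch of cogroup: one group of values per input, per key.
def cogroupLocal[K, V](left: Seq[(K, V)], right: Seq[(K, V)]): Map[K, (Seq[V], Seq[V])] = {
  val keys = (left.map(_._1) ++ right.map(_._1)).distinct
  keys.map { k =>
    k -> (left.filter(_._1 == k).map(_._2), right.filter(_._1 == k).map(_._2))
  }.toMap
}

val x = Seq((1, "apple"), (2, "banana"))
val y = Seq((1, "laptop"), (1, "desktop"), (4, "iPad"))
val grouped = cogroupLocal(x, y)
// key 1 has values on both sides; keys 2 and 4 get an empty group on
// the side that lacks them, just like the RDD example above.
```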
- pipe
Runs the given shell command over each partition's data: every element is written to the command's stdin, and each line the command writes to stdout becomes an element of the resulting RDD:
```scala
val a = sc.parallelize(1 to 9, 3)
a.pipe("head -n 1").collect
res2: Array[String] = Array(1, 4, 7)
```
- coalesce,repartition
Adjusts the number of partitions of an RDD, producing a new RDD. repartition always performs a shuffle, while coalesce lets you specify whether to shuffle.
```scala
val y = sc.parallelize(1 to 10, 10)
val z = y.coalesce(2, false)
z.partitions.length
res9: Int = 2
```
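Intuitively, coalesce without a shuffle only merges existing partitions, which is why it can reduce but not increase the partition count without data movement. A local sketch of that merging (not Spark code; the grouping strategy here is a simplification of Spark's actual partition-coalescing logic):

```scala
// Merge adjacent partitions down to at most n partitions, without
// moving individual elements between groups (i.e., no shuffle).
def coalesceLocal[T](partitions: List[List[T]], n: Int): List[List[T]] = {
  val groupSize = math.ceil(partitions.length.toDouble / n).toInt
  partitions.grouped(groupSize).map(_.flatten).toList
}

// Mirror sc.parallelize(1 to 10, 10): ten single-element partitions.
val tenParts = (1 to 10).toList.map(List(_))
val two = coalesceLocal(tenParts, 2)
// two.length == 2, matching z.partitions.length above
```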