RDD Study Notes


1. Driver program ----> runs the main function

Shared variables: sometimes a series of tasks has to run on different nodes at the same time, and the variables used inside each function need to be shared (a short sketch follows the two kinds below):

              1. Broadcast variables: cached in memory on every node, rather than shipped with each task

              2. Accumulators: variables that only support addition
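A minimal sketch of both kinds of shared variable, assuming the same live SparkContext sc used in the examples below (the names lookup, errors and data are illustrative only, not from the notes):

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // read-only data, cached once per node
val errors = sc.accumulator(0)                       // workers may only add to it; the driver reads it
val data = sc.parallelize(List("a", "b", "c"))
data.foreach(k => if (!lookup.value.contains(k)) errors += 1)
errors.value   // 1, because "c" is not in the broadcast map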

Master URLs:

local: run locally with a single thread

local[K]: run locally with K threads in parallel (see the sketch below)
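For reference, a minimal way to use one of these master URLs when creating a context outside the spark-shell (the app name "RDDNotes" is just a placeholder):

import org.apache.spark.SparkContext
val sc = new SparkContext("local[4]", "RDDNotes")   // run locally with 4 worker threads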

aggregate:

     The function takes three arguments: an initial value zeroValue, a seqOp function, and a combOp function.

    seqOp is executed in parallel; the actual computation is carried out by the tasks on the individual executors.

    combOp is executed serially; the combOp operation is invoked from JobWaiter.taskSucceeded.

val z = sc.parallelize(List(1,2,3,4,5,6), 2)

z.aggregate(0)(math.max(_, _), _ + _)

res40: Int = 9
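A short breakdown of why the result is 9, assuming parallelize splits the six elements evenly into (1,2,3) and (4,5,6):

// partition 0: seqOp -> math.max(0, 1, 2, 3) = 3
// partition 1: seqOp -> math.max(0, 4, 5, 6) = 6
// combOp:      0 + 3 + 6 = 9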

val z =sc.parallelize(List("a","b","c","d","e","f"),2)

z.aggregate("")(_ + _, _+_)

res115: String = abcdef

cartesian:

Generates the Cartesian product:

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]
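A quick sketch; note that the result has |x| * |y| elements, so memory use grows fast (the exact output order depends on partitioning):

val x = sc.parallelize(List(1, 2, 3))
val y = sc.parallelize(List("a", "b"))
x.cartesian(y).collect
// e.g. Array((1,a), (1,b), (2,a), (2,b), (3,a), (3,b))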

coalesce:

Repartitions the data into the given number of partitions:

def coalesce ( numPartitions : Int , shuffle : Boolean= false ): RDD [T]

val y = sc.parallelize(1 to 10, 10)

val z = y.coalesce(2, false)

z.partitions.length

res9: Int = 2

cogroup:

A reportedly very powerful function: it groups up to three key-value RDDs together by their shared keys; more than that cannot be combined.

Listing Variants

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def cogroup[W](other: RDD[(K, W)], numPartitions:Int): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W](other: RDD[(K, W)], partitioner:Partitioner): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], numPartitions: Int): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def groupWith[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

Examples

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.map((_, "b"))

val c = a.map((_, "c"))

b.cogroup(c).collect

res7: Array[(Int, (Seq[String], Seq[String]))] =Array(

(2,(ArrayBuffer(b),ArrayBuffer(c))),

(3,(ArrayBuffer(b),ArrayBuffer(c))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))

)

val d = a.map((_, "d"))

b.cogroup(c, d).collect

res9: Array[(Int, (Seq[String], Seq[String],Seq[String]))] = Array(

(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d,d)))

)

val x = sc.parallelize(List((1, "apple"),(2, "banana"), (3, "orange"), (4, "kiwi")), 2)

val y = sc.parallelize(List((5, "computer"),(1, "laptop"), (1, "desktop"), (4, "iPad")), 2)

x.cogroup(y).collect

res23: Array[(Int, (Seq[String], Seq[String]))] =Array(

(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))),

(2,(ArrayBuffer(banana),ArrayBuffer())),

(3,(ArrayBuffer(orange),ArrayBuffer())),

(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),

(5,(ArrayBuffer(),ArrayBuffer(computer))))

collect, toArray:

Converts the RDD into a Scala array.

def collect(): Array[T]

def collect[U: ClassTag](f: PartialFunction[T, U]):RDD[U]

def toArray(): Array[T]

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.collect

res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu,Rat)

collectAsMap:

Similar to collect, but when the key-value pairs are converted to Scala the map structure is preserved.

def collectAsMap(): Map[K, V]

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.zip(a)

b.collectAsMap

res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1-> 1, 3 -> 3)

combineByKey:

Automatically gathers the values that share the same key into one collection; the final result consists of one such collection per distinct key.

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions:Int): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner:Partitioner, mapSideCombine: Boolean = true, serializerClass: String = null):RDD[(K, C)]

val a =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)

val c = b.zip(a)

val d = c.combineByKey(List(_), (x:List[String],y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)

d.collect

res16: Array[(Int, List[String])] = Array((1,List(cat,dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))

:: and ::: are both List combination operators: :: prepends a single element, while ::: concatenates two lists.
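To make the two operators concrete (plain Scala, no Spark required):

"dog" :: List("cat", "gnu")          // List(dog, cat, gnu)  -- :: prepends one element
List("dog") ::: List("cat", "gnu")   // List(dog, cat, gnu)  -- ::: concatenates two lists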

compute:

Executes the dependencies and computes the actual representation of the RDD. Not called directly by users.

def compute(split: Partition, context: TaskContext):Iterator[T]

context, sparkContext:

Returns the SparkContext that was used to create the RDD.

def context: SparkContext

def sparkContext: SparkContext

count:

Returns the number of items in the RDD.

def count(): Long

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.count

res2: Long = 4

countApprox:

Not sure what this one does yet; it is marked as an experimental feature.

def countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]

countByKey [Pair]:

Similar to count, but counts the values for each distinct key and returns the result as a map.

def countByKey(): Map[K, Long]

val c = sc.parallelize(List((3, "Gnu"), (3,"Yak"), (5, "Mouse"), (3, "Dog")), 2)

c.countByKey

res3: scala.collection.Map[Int,Long] = Map(3 -> 3,5 -> 1)

countByValue:

Counts how many times each value occurs and returns a value -> count map.

def countByValue(): Map[T, Long]

val b =sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b.countByValue

res27: scala.collection.Map[Int,Long] = Map(5 -> 1,8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)

countByValueApprox:

Functionality not explored yet; it is marked as an experimental feature.

def countByValueApprox(timeout: Long, confidence:Double = 0.95): PartialResult[Map[T, BoundedDouble]]

countApproxDistinct:

Approximate count of distinct values; very useful when the amount of data is large.

def countApproxDistinct(relativeSD: Double = 0.05): Long

val a = sc.parallelize(1 to 10000, 20)

val b = a++a++a++a++a

b.countApproxDistinct(0.1)

val a = sc.parallelize(1 to 30000, 30)

val b = a++a++a++a++a

b.countApproxDistinct(0.05)

res28: Long = 30097

As you can see, it computes an approximate count of the distinct values in a.

countApproxDistinctByKey [Pair]:

Similar to countApproxDistinct, but computes the approximate number of distinct values for each key, so the RDD must consist of key-value pairs. The computation is fast.

def countApproxDistinctByKey(relativeSD: Double =0.05): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,numPartitions: Int): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,partitioner: Partitioner): RDD[(K, Long)]

val a = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

val b = sc.parallelize(a.takeSample(true, 10000, 0),20)

val c = sc.parallelize(1 to b.count().toInt, 20)

val d = b.zip(c)

d.countApproxDistinctByKey(0.1).collect

res15: Array[(String, Long)] = Array((Rat,2567),(Cat,3357), (Dog,2414), (Gnu,2494))

d.countApproxDistinctByKey(0.01).collect

res16: Array[(String, Long)] = Array((Rat,2555),(Cat,2455), (Dog,2425), (Gnu,2513))

d.countApproxDistinctByKey(0.001).collect

res0: Array[(String, Long)] = Array((Rat,2562),(Cat,2464), (Dog,2451), (Gnu,2521))

As you can see, it gives an approximate count of the distinct values for each key.

dependencies:

Returns the dependencies of the current RDD (the RDDs it depends on).

final def dependencies: Seq[Dependency[_]]

val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b: org.apache.spark.rdd.RDD[Int] =ParallelCollectionRDD[32] at parallelize at <console>:12

b.dependencies.length

Int = 0

b.map(a => a).dependencies.length

res40: Int = 1

b.cartesian(a).dependencies.length

res41: Int = 2

b.cartesian(a).dependencies

res42: Seq[org.apache.spark.Dependency[_]] =List(org.apache.spark.rdd.CartesianRDD$$anon$1@576ddaaa,org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)

distinct:

Returns a new RDD that contains each unique value only once.

def distinct(): RDD[T]

def distinct(numPartitions: Int): RDD[T]

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.distinct.collect

res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))

a.distinct(2).partitions.length

res16: Int = 2

a.distinct(3).partitions.length

res17: Int = 3

The number passed to distinct is the number of partitions to use.

first:

Returns the first data item of the RDD.

def first(): T

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.first

res1: String = Gnu

filter:

A very commonly used operation. It applies a Boolean-returning function to every data item of the RDD and returns the items for which it is true as the result RDD.

def filter(f: T => Boolean): RDD[T]

val a = sc.parallelize(1 to 10, 3)

val b = a.filter(_ % 2 == 0)

b.collect

res3: Array[Int] = Array(2, 4, 6, 8, 10)

Note: the function must be able to handle every data item in the RDD. Scala provides partial functions to deal with mixed data types; they help when some data is bad and you do not want to process it, but you still want to apply a map()-like function to the good data.

Examples for mixed data without partial functions:

val b = sc.parallelize(1 to 8)

b.filter(_ < 4).collect

res15: Array[Int] = Array(1, 2, 3)

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.filter(_ < 4).collect

<console>:15: error: value < is not a memberof Any

Reason for the failure:

The < operator is not defined on Any; because the list mixes strings and numbers, its elements are typed as Any.

Handling mixed types with a partial function:

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.collect({ case a: Int    => "is integer"

            case b: String => "is string" }).collect

res17: Array[String] = Array(is string, is string, isinteger, is string)

val myfunc: PartialFunction[Any, Any] = {

  case a: Int    => "is integer"

  case b: String => "is string" }

myfunc.isDefinedAt("")    // checks whether myfunc is defined for this value

res21: Boolean = true

myfunc.isDefinedAt(1)

res22: Boolean = true

myfunc.isDefinedAt(1.5)     // not supported: a Double matches neither case

res23: Boolean = false

   

   

   

Our research group has a very strong focus on using and improving Apache Spark to solve real world problems. In order to do this we need to have a very solid understanding of the capabilities of Spark. So one of the first things we have done is to go through the entire Spark RDD API and write examples to test their functionality. This has been a very useful exercise and we would like to share the examples with everyone.

Authors of examples: Matthias Langer and Zhen He

Email addresses: m.langer@latrobe.edu.au, z.he@latrobe.edu.au

These examples have only been tested for Spark version 0.9. We assume the functionality of Spark is stable and therefore the examples should be valid for later releases.

Here is a PDF of all the examples: Spark Examples

The RDD API By Example

RDD is short for Resilient Distributed Dataset. RDDs are the workhorse of the Spark system. As a user, one can consider a RDD as a handle for a collection of individual data partitions, which are the result of some computation.

However, an RDD is actually more than that. On cluster installations, separate data partitions can be on separate nodes. Using the RDD as a handle one can access all partitions and perform computations and transformations using the contained data. Whenever a part of a RDD or an entire RDD is lost, the system is able to reconstruct the data of lost partitions by using lineage information. Lineage refers to the sequence of transformations used to produce the current RDD. As a result, Spark is able to recover automatically from most failures.

All RDDs available in Spark derive either directly or indirectly from the class RDD. This class comes with a large set of methods that perform operations on the data within the associated partitions. The class RDD is abstract. Whenever one uses a RDD, one is actually using a concrete implementation of RDD. These implementations have to overwrite some core functions to make the RDD behave as expected.

One reason why Spark has lately become a very popular system for processing big data is that it does not impose restrictions regarding what data can be stored within RDD partitions. The RDD API already contains many useful operations. But, because the creators of Spark had to keep the core API of RDDs common enough to handle arbitrary data-types, many convenience functions are missing.

The basic RDD API considers each data item as a single value. However, users often want to work with key-value pairs. Therefore Spark extended the interface of RDD to provide additional functions (PairRDDFunctions), which explicitly work on key-value pairs. Currently, there are four extensions to the RDD API available in Spark. They are as follows:

DoubleRDDFunctions

This extension contains many useful methods for aggregating numeric values. They become available if the data items of an RDD are implicitly convertible to the Scala data-type double.

PairRDDFunctions

Methods defined in this interface extension become available when the data items have a two-component tuple structure. Spark will interpret the first tuple item (i.e. tuplename._1) as the key and the second item (i.e. tuplename._2) as the associated value.

OrderedRDDFunctions

Methods defined in this interface extension become available if the data items are two-component tuples where the key is implicitly sortable.

SequenceFileRDDFunctions

This extension contains several methods that allow users to create Hadoop sequence files from RDDs. The data items must be two-component key-value tuples as required by the PairRDDFunctions. However, there are additional requirements considering the convertibility of the tuple components to Writable types.

Since Spark will make methods with extended functionality automatically available to users when the data items fulfill the above described requirements, we decided to list all possible available functions in strictly alphabetical order. We will append either of the following to the function-name to indicate it belongs to an extension that requires the data items to conform to a certain format or type (a short shell illustration follows the list below).

[Double] - Double RDD Functions

[Ordered] - OrderedRDDFunctions

[Pair] - PairRDDFunctions

[SeqFile] - SequenceFileRDDFunctions
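A small illustration of these implicit extensions in the spark-shell (where the SparkContext._ implicits are already in scope; the sample values are arbitrary):

val nums = sc.parallelize(List(1.0, 2.0, 3.0))
nums.mean()                                    // DoubleRDDFunctions
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect               // PairRDDFunctions
pairs.sortByKey().collect                      // OrderedRDDFunctions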

aggregate

The aggregate-method provides an interface for performing highly customized reductions and aggregations with a RDD. However, due to the way Scala and Spark execute and process data, care must be taken to achieve deterministic behavior. The following list contains a few observations we made while experimenting with aggregate:

    The reduce and combine functions have to be commutative and associative.

    As can be seen from the function definition below, the output of the combiner must be equal to its input. This is necessary because Spark will chain-execute it.

    The zero value is the initial value of the U component when either seqOp or combOp are executed for the first element of their domain of influence. Depending on what you want to achieve, you may have to change it. However, to make your code deterministic, make sure that your code will yield the same result regardless of the number or size of partitions.

    Do not assume any execution order for either partition computations or combining partitions.

    The neutral zeroValue is applied at the beginning of each sequence of reduces within the individual partitions and again when the output of separate partitions is combined.

    Why have two separate combine functions? The first function maps the input values into the result space. Note that the aggregation data type (1st input and output) can be different (U != T). The second function reduces these mapped values in the result space.

    Why would one want to use two input data types? Let us assume we do an archaeological site survey using a metal detector. While walking through the site we take GPS coordinates of important findings based on the output of the metal detector. Later, we intend to draw an image of a map that highlights these locations using the aggregate function. In this case the zeroValue could be an area map with no highlights. The possibly huge set of input data is stored as GPS coordinates across many partitions. seqOp could convert the GPS coordinates to map coordinates and put a marker on the map at the respective position. combOp will receive these highlights as partial maps and combine them into a single final output map. (A small code sketch of this two-type pattern follows this list.)
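A concrete, minimal instance of the U != T idea (my own sketch, not from the original document): computing the average of a collection of Ints, where U is a (sum, count) pair:

val nums = sc.parallelize(1 to 100, 4)
val (sum, count) = nums.aggregate((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),       // seqOp: fold one value into (sum, count)
  (a, b)   => (a._1 + b._1, a._2 + b._2))     // combOp: merge per-partition (sum, count) pairs
sum.toDouble / count                          // 50.5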

Listing Variants

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T)=> U, combOp: (U, U) => U): U

Examples 1

val z = sc.parallelize(List(1,2,3,4,5,6), 2)

z.aggregate(0)(math.max(_, _), _ + _)

res40: Int = 9

val z =sc.parallelize(List("a","b","c","d","e","f"),2)

z.aggregate("")(_ + _, _+_)

res115: String = abcdef

z.aggregate("x")(_ + _, _+_)

res116: String = xxdefxabc

val z =sc.parallelize(List("12","23","345","4567"),2)

z.aggregate("")((x,y) =>math.max(x.length, y.length).toString, (x,y) => x + y)

res141: String = 42

z.aggregate("")((x,y) =>math.min(x.length, y.length).toString, (x,y) => x + y)

res142: String = 11

val z =sc.parallelize(List("12","23","345",""),2)

z.aggregate("")((x,y) =>math.min(x.length, y.length).toString, (x,y) => x + y)

res143: String = 10

The main issue with the code above is that the result of the inner min is a string of length 1.

The zero in the output is due to the empty string being the last string in the list. We see this result because we are not recursively reducing any further within the partition for the final string.
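Spelled out, assuming the four strings are split evenly into ("12","23") and ("345",""):

// partition 0: min("".length, "12".length)  = 0 -> "0";  min("0".length, "23".length) = 1 -> "1"
// partition 1: min("".length, "345".length) = 0 -> "0";  min("0".length, "".length)   = 0 -> "0"
// combOp:      "" + "1" + "0" = "10"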

Examples 2

val z =sc.parallelize(List("12","23","","345"),2)

z.aggregate("")((x,y) =>math.min(x.length, y.length).toString, (x,y) => x + y)

res144: String = 11

In contrast to the previous example, this example has the empty string at the beginning of the second partition. This results in a length of zero being input to the second reduce, which then upgrades it to a length of 1. (Warning: The above example shows bad design since the output is dependent on the order of the data inside the partitions.)

cartesian

Computes the Cartesian product between two RDDs (i.e. each item of the first RDD is joined with each item of the second RDD) and returns them as a new RDD. (Warning: Be careful when using this function! Memory consumption can quickly become an issue!)

Listing Variants

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

Example

val x = sc.parallelize(List(1,2,3,4,5))

val y = sc.parallelize(List(6,7,8,9,10))

x.cartesian(y).collect

res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8),(1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9),(3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))

checkpoint

Will create a checkpoint when the RDD is computed next. Checkpointed RDDs are stored as a binary file within the checkpoint directory which can be specified using the Spark context. (Warning: Spark applies lazy evaluation. Checkpointing will not occur until an action is invoked.)

Important note: the directory "my_directory_name" should exist in all slaves. As an alternative you could use an HDFS directory URL as well.

Listing Variants

def checkpoint()

Example

sc.setCheckpointDir("my_directory_name")

val a = sc.parallelize(1 to 4)

a.checkpoint

a.count

14/02/25 18:13:53 INFO SparkContext: Starting job: count at <console>:15

...

14/02/25 18:13:53 INFO MemoryStore: Block broadcast_5 stored as values to memory (estimated size 115.7 KB, free 296.3 MB)

14/02/25 18:13:53 INFO RDDCheckpointData: Done checkpointing RDD 11 to file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11, new parent is RDD 12

res23: Long = 4

coalesce, repartition

Coalesces the associated data into a given number of partitions. repartition(numPartitions) is simply an abbreviation for coalesce(numPartitions, shuffle = true).

Listing Variants

def coalesce ( numPartitions : Int , shuffle : Boolean= false ): RDD [T]

def repartition ( numPartitions : Int ): RDD [T]

Example

val y = sc.parallelize(1 to 10, 10)

val z = y.coalesce(2, false)

z.partitions.length

res9: Int = 2

cogroup [Pair], groupWith [Pair]

A very powerful set of functions that allow grouping up to 3 key-value RDDs together using their keys.

Listing Variants

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def cogroup[W](other: RDD[(K, W)], numPartitions:Int): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W](other: RDD[(K, W)], partitioner:Partitioner): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], numPartitions: Int): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def groupWith[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

Examples

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.map((_, "b"))

val c = a.map((_, "c"))

b.cogroup(c).collect

res7: Array[(Int, (Seq[String], Seq[String]))] =Array(

(2,(ArrayBuffer(b),ArrayBuffer(c))),

(3,(ArrayBuffer(b),ArrayBuffer(c))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))

)

val d = a.map((_, "d"))

b.cogroup(c, d).collect

res9: Array[(Int, (Seq[String], Seq[String],Seq[String]))] = Array(

(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d,d)))

)

val x = sc.parallelize(List((1, "apple"),(2, "banana"), (3, "orange"), (4, "kiwi")), 2)

val y = sc.parallelize(List((5, "computer"),(1, "laptop"), (1, "desktop"), (4, "iPad")), 2)

x.cogroup(y).collect

res23: Array[(Int, (Seq[String], Seq[String]))] =Array(

(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))),

(2,(ArrayBuffer(banana),ArrayBuffer())),

(3,(ArrayBuffer(orange),ArrayBuffer())),

(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),

(5,(ArrayBuffer(),ArrayBuffer(computer))))

collect, toArray

Converts the RDD into a Scala array and returns it. If you provide a standard map-function (i.e. f = T -> U) it will be applied before inserting the values into the result array.

Listing Variants

def collect(): Array[T]

def collect[U: ClassTag](f: PartialFunction[T, U]):RDD[U]

def toArray(): Array[T]

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.collect

res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu,Rat)

collectAsMap [Pair]

Similar to collect, but works on key-value RDDs and converts them into Scala maps to preserve their key-value structure.

Listing Variants

def collectAsMap(): Map[K, V]

Example

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.zip(a)

b.collectAsMap

res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1-> 1, 3 -> 3)

combineByKey[Pair]

Very efficient implementation that combines the values of a RDD consisting of two-component tuples by applying multiple aggregators one after another.

Listing Variants

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions:Int): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner:Partitioner, mapSideCombine: Boolean = true, serializerClass: String = null):RDD[(K, C)]

Example

val a =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)

val c = b.zip(a)

val d = c.combineByKey(List(_), (x:List[String],y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)

d.collect

res16: Array[(Int, List[String])] = Array((1,List(cat,dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))

compute

Executes dependencies and computes the actual representation of the RDD. This function should not be called directly by users.

Listing Variants

def compute(split: Partition, context: TaskContext):Iterator[T]

context, sparkContext

Returns the SparkContext that was used to create the RDD.

Listing Variants

def context: SparkContext

def sparkContext: SparkContext

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.context

res8: org.apache.spark.SparkContext =org.apache.spark.SparkContext@58c1c2f1

count

Returns the number of items stored within a RDD.

Listing Variants

def count(): Long

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.count

res2: Long = 4

countApprox

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]

countByKey [Pair]

Very similar to count, but counts the values of a RDD consisting of two-component tuples for each distinct key separately.

Listing Variants

def countByKey(): Map[K, Long]

Example

val c = sc.parallelize(List((3, "Gnu"), (3,"Yak"), (5, "Mouse"), (3, "Dog")), 2)

c.countByKey

res3: scala.collection.Map[Int,Long] = Map(3 -> 3,5 -> 1)

countByKeyApprox [Pair]

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countByKeyApprox(timeout: Long, confidence: Double= 0.95): PartialResult[Map[K, BoundedDouble]]

countByValue

Returns a map that contains all unique values of the RDD and their respective occurrence counts. (Warning: This operation will finally aggregate the information in a single reducer.)

Listing Variants

def countByValue(): Map[T, Long]

Example

val b =sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b.countByValue

res27: scala.collection.Map[Int,Long] = Map(5 -> 1,8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)

countByValueApprox

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countByValueApprox(timeout: Long, confidence:Double = 0.95): PartialResult[Map[T, BoundedDouble]]

countApproxDistinct

Computes the approximate number of distinct values. For large RDDs which are spread across many nodes, this function may execute faster than other counting methods. The parameter relativeSD controls the accuracy of the computation.

Listing Variants

def countApproxDistinct(relativeSD: Double = 0.05):Long

Example

val a = sc.parallelize(1 to 10000, 20)

val b = a++a++a++a++a

b.countApproxDistinct(0.1)

res14: Long = 10784

b.countApproxDistinct(0.05)

res15: Long = 11055

b.countApproxDistinct(0.01)

res16: Long = 10040

b.countApproxDistinct(0.001)

res0: Long = 10001

countApproxDistinctByKey [Pair]

 

Similar to countApproxDistinct, but computes the approximate number of distinct values for each distinct key. Hence, the RDD must consist of two-component tuples. For large RDDs which are spread across many nodes, this function may execute faster than other counting methods. The parameter relativeSD controls the accuracy of the computation.

Listing Variants

def countApproxDistinctByKey(relativeSD: Double =0.05): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,numPartitions: Int): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,partitioner: Partitioner): RDD[(K, Long)]

Example

val a = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

val b = sc.parallelize(a.takeSample(true, 10000, 0),20)

val c = sc.parallelize(1 to b.count().toInt, 20)

val d = b.zip(c)

d.countApproxDistinctByKey(0.1).collect

res15: Array[(String, Long)] = Array((Rat,2567),(Cat,3357), (Dog,2414), (Gnu,2494))

d.countApproxDistinctByKey(0.01).collect

res16: Array[(String, Long)] = Array((Rat,2555),(Cat,2455), (Dog,2425), (Gnu,2513))

d.countApproxDistinctByKey(0.001).collect

res0: Array[(String, Long)] = Array((Rat,2562),(Cat,2464), (Dog,2451), (Gnu,2521))

dependencies

 

Returns the RDD on which this RDD depends.

Listing Variants

final def dependencies: Seq[Dependency[_]]

Example

val b =sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b: org.apache.spark.rdd.RDD[Int] =ParallelCollectionRDD[32] at parallelize at <console>:12

b.dependencies.length

Int = 0

b.map(a => a).dependencies.length

res40: Int = 1

b.cartesian(a).dependencies.length

res41: Int = 2

b.cartesian(a).dependencies

res42: Seq[org.apache.spark.Dependency[_]] =List(org.apache.spark.rdd.CartesianRDD$$anon$1@576ddaaa, org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)

distinct

 

Returns a new RDD that contains each unique value only once.

Listing Variants

def distinct(): RDD[T]

def distinct(numPartitions: Int): RDD[T]

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.distinct.collect

res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))

a.distinct(2).partitions.length

res16: Int = 2

a.distinct(3).partitions.length

res17: Int = 3

first

 

Looks for the very first data item of the RDD and returns it.

Listing Variants

def first(): T

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.first

res1: String = Gnu

filter

 

Evaluates a boolean function for each data item of the RDD and puts the items for which the function returned true into the resulting RDD.

Listing Variants

def filter(f: T => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 10, 3)

val b = a.filter(_ % 2 == 0)

b.collect

res3: Array[Int] = Array(2, 4, 6, 8, 10)

When you provide a filter function, it must be able to handle all data items contained in the RDD. Scala provides so-called partial functions to deal with mixed data-types. (Tip: Partial functions are very useful if you have some data which may be bad and you do not want to handle but for the good data (matching data) you want to apply some kind of map function. The following article is good. It teaches you about partial functions in a very nice way and explains why case has to be used for partial functions: article)

Examples for mixed data without partial functions

val b = sc.parallelize(1 to 8)

b.filter(_ < 4).collect

res15: Array[Int] = Array(1, 2, 3)

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.filter(_ < 4).collect

<console>:15: error: value < is not a memberof Any

This fails because some components of a are not implicitly comparable against integers. Collect uses the isDefinedAt property of a function-object to determine whether the test-function is compatible with each data item. Only data items that pass this test (=filter) are then mapped using the function-object.

Examples for mixed data with partial functions

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.collect({ case a: Int    => "is integer"

            case b: String => "is string" }).collect

res17: Array[String] = Array(is string, is string, isinteger, is string)

val myfunc: PartialFunction[Any, Any] = {

  case a: Int    => "is integer"

  case b: String => "is string" }

myfunc.isDefinedAt("")

res21: Boolean = true

myfunc.isDefinedAt(1)

res22: Boolean = true

myfunc.isDefinedAt(1.5)

res23: Boolean = false

Be careful! The above code works because it only checks the type itself! If you use operations on this type, you have to explicitly declare what type you want instead of Any. Otherwise the compiler does (apparently) not know what bytecode it should produce (a typed-guard workaround follows the snippet below):

val myfunc2: PartialFunction[Any, Any] = {case x if (x< 4) => "x"}

<console>:10: error: value < is not a memberof Any

val myfunc2: PartialFunction[Int, Any] = {case x if (x< 4) => "x"}

myfunc2: PartialFunction[Int,Any] = <function1>
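One way around this while keeping the Any domain is to match on the concrete type first, so the guard can use < safely (myfunc3 is just an illustrative name, not from the source):

val myfunc3: PartialFunction[Any, Any] = { case x: Int if x < 4 => "x" }
myfunc3.isDefinedAt(2)      // true
myfunc3.isDefinedAt(10)     // false: an Int, but not < 4
myfunc3.isDefinedAt("cat")  // false: not an Int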

filterWith:

An extended version of filter. The first argument has the form Int => A, where Int is the partition index and A is the type it is transformed into; the second argument has the form (T, A) => Boolean, where A is the transformed partition index and T is a data item from the RDD.

def filterWith[A: ClassTag](constructA: Int =>A)(p: (T, A) => Boolean): RDD[T]

val a = sc.parallelize(1 to 9, 3)

val b = a.filterWith(i => i)((x,i) => x % 2 == 0|| i % 2 == 0)

b.collect

res37: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)

a.filterWith(x=> x)((a, b) =>  b == 0).collect

res30: Array[Int] = Array(1, 2)

a.filterWith(x=> x)((a, b) =>  a % (b+1) == 0).collect

res33: Array[Int] = Array(1, 2, 4, 6, 8, 10)

a.filterWith(x=> x.toString)((a, b) =>  b == "2").collect

res34: Array[Int] = Array(5, 6)

Note: the first argument transforms the partition index; the second filters each data item together with its transformed partition index.
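For comparison only (my own addition, not part of the original notes): the first filterWith example above can also be written with mapPartitionsWithIndex, which hands the partition index straight to the function:

val a = sc.parallelize(1 to 9, 3)
val b = a.mapPartitionsWithIndex((i, it) => it.filter(x => x % 2 == 0 || i % 2 == 0))
b.collect   // Array(1, 2, 3, 4, 6, 7, 8, 9), the same result as the filterWith version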

   

   

   


filterWith

 

This is an extended version of filter. It takes two function arguments. The first argument must conform to Int -> T and is executed once per partition. It will transform the partition index to type T. The second function looks like (U, T) -> Boolean. T is the transformed partition index and U are the data items from the RDD. Finally the function has to return either true or false (i.e. apply the filter).

Listing Variants

def filterWith[A: ClassTag](constructA: Int =>A)(p: (T, A) => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 9, 3)

val b = a.filterWith(i => i)((x,i) => x % 2 == 0|| i % 2 == 0)

b.collect

res37: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)

a.filterWith(x=> x)((a, b) =>  b == 0).collect

res30: Array[Int] = Array(1, 2)

a.filterWith(x=> x)((a, b) =>  a % (b+1) == 0).collect

res33: Array[Int] = Array(1, 2, 4, 6, 8, 10)

a.filterWith(x=> x.toString)((a, b) =>  b == "2").collect

res34: Array[Int] = Array(5, 6)

flatMap:

Similar to map, but flatMap flattens all of the resulting items into one collection.

def flatMap[U: ClassTag](f: T => TraversableOnce[U]):RDD[U]

val a = sc.parallelize(1 to 10, 5)

a.flatMap(1 to _).collect

res47: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3,4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7,8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

sc.parallelize(List(1, 2, 3), 2).flatMap(x =>List(x, x, x)).collect

res85: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)

// The program below generates a random number of copies (up to 10) of the items in the list.

val x  =sc.parallelize(1 to 10, 3)

x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect

res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4,4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9,9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)

flatMapValues:

Similar to mapValues, but flattens the results; the function is applied only to the value part of each key-value pair, and the key is kept.

def flatMapValues[U](f: V => TraversableOnce[U]):RDD[(K, U)]

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.flatMapValues("x" + _ +"x").collect

res6: Array[(Int, Char)] = Array((3,x), (3,d), (3,o),(3,g), (3,x), (5,x), (5,t), (5,i), (5,g), (5,e), (5,r), (5,x), (4,x), (4,l),(4,i), (4,o), (4,n), (4,x), (3,x), (3,c), (3,a), (3,t), (3,x), (7,x), (7,p),(7,a), (7,n), (7,t), (7,h), (7,e), (7,r), (7,x), (5,x), (5,e), (5,a), (5,g),(5,l), (5,e), (5,x))

   

Our research group has a very strong focus on usingand improving Apache Spark to solve real world programs. In order to do this weneed to have a very solid understanding of the capabilities of Spark. So one ofthe first things we have done is to go through the entire Spark RDD API andwrite examples to test their functionality. This has been a very usefulexercise and we would like to share the examples with everyone.

Authors of examples: Matthias Langer and Zhen He

Emails addresses: m.langer@latrobe.edu.au,z.he@latrobe.edu.au

These examples have only been tested for Spark version0.9. We assume the functionality of Spark is stable and therefore the examplesshould be valid for later releases.

Here is a pdf of the all the examples: SparkExamples

The RDD API By Example

RDD is short for Resilient Distributed Dataset. RDDsare the workhorse of the Spark system. As a user, one can consider a RDD as ahandle for a collection of individual data partitions, which are the result ofsome computation.

However, an RDD is actually more than that. On clusterinstallations, separate data partitions can be on separate nodes. Using the RDDas a handle one can access all partitions and perform computations andtransformations using the contained data. Whenever a part of a RDD or an entireRDD is lost, the system is able to reconstruct the data of lost partitions byusing lineage information. Lineage refers to the sequence of transformationsused to produce the current RDD. As a result, Spark is able to recoverautomatically from most failures.

All RDDs available in Spark derive either directly or indirectly from the class RDD. This class comes with a large set of methods that perform operations on the data within the associated partitions. The class RDD is abstract. Whenever one uses a RDD, one is actually using a concretized implementation of RDD. These implementations have to overwrite some core functions to make the RDD behave as expected.

One reason why Spark has lately become a very popular system for processing big data is that it does not impose restrictions regarding what data can be stored within RDD partitions. The RDD API already contains many useful operations. But, because the creators of Spark had to keep the core API of RDDs common enough to handle arbitrary data-types, many convenience functions are missing.

The basic RDD API considers each data item as a single value. However, users often want to work with key-value pairs. Therefore Spark extended the interface of RDD to provide additional functions (PairRDDFunctions), which explicitly work on key-value pairs. Currently, there are four extensions to the RDD API available in Spark. They are as follows:

DoubleRDDFunctions

This extension contains many useful methods for aggregating numeric values. They become available if the data items of an RDD are implicitly convertible to the Scala data-type double.
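
A minimal sketch of this implicit extension (the result comments show the expected values, not captured shell output):

val d = sc.parallelize(List(1.0, 2.0, 3.0, 4.0))
// sum and mean come from DoubleRDDFunctions via an implicit conversion
d.sum    // 10.0
d.mean   // 2.5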

PairRDDFunctions

Methods defined in this interface extension become available when the data items have a two component tuple structure. Spark will interpret the first tuple item (i.e. tuplename._1) as the key and the second item (i.e. tuplename._2) as the associated value.
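
For example, as soon as the items are two-component tuples, key-based methods such as reduceByKey become available (a sketch):

val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
// reduceByKey is defined in PairRDDFunctions, not in RDD itself
pairs.reduceByKey(_ + _).collect   // Array((a,4), (b,2))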

OrderedRDDFunctions

Methods defined in this interface extension become available if the data items are two-component tuples where the key is implicitly sortable.
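
A small sketch: with an Int key, which has an implicit ordering, sortByKey becomes available:

val byKey = sc.parallelize(List((3, "c"), (1, "a"), (2, "b")))
// sortByKey comes from OrderedRDDFunctions
byKey.sortByKey().collect   // Array((1,a), (2,b), (3,c))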

SequenceFileRDDFunctions

This extension contains several methods that allow users to create Hadoop sequence files from RDDs. The data items must be two-component key-value tuples as required by the PairRDDFunctions. However, there are additional requirements considering the convertibility of the tuple components to Writable types.

Since Spark will make methods with extended functionality automatically available to users when the data items fulfill the above described requirements, we decided to list all possible available functions in strictly alphabetical order. We will append one of the following tags to the function-name to indicate it belongs to an extension that requires the data items to conform to a certain format or type.

[Double] - Double RDD Functions

[Ordered] - OrderedRDDFunctions

[Pair] - PairRDDFunctions

[SeqFile] - SequenceFileRDDFunctions

aggregate

The aggregate-method provides an interface for performing highly customized reductions and aggregations with a RDD. However, due to the way Scala and Spark execute and process data, care must be taken to achieve deterministic behavior. The following list contains a few observations we made while experimenting with aggregate:

    The reduce and combine functions have to be commutative and associative.

    As can be seen from the function definition below, the output of the combiner must be equal to its input. This is necessary because Spark will chain-execute it.

    The zero value is the initial value of the U component when either seqOp or combOp are executed for the first element of their domain of influence. Depending on what you want to achieve, you may have to change it. However, to make your code deterministic, make sure that your code will yield the same result regardless of the number or size of partitions.

    Do not assume any execution order for either partition computations or combining partitions.

    The neutral zeroValue is applied at the beginning of each sequence of reduces within the individual partitions and again when the output of separate partitions is combined.

    Why have two separate combine functions? The first function maps the input values into the result space. Note that the aggregation data type (1st input and output) can be different (U != T). The second function reduces these mapped values in the result space.

    Why would one want to use two input data types? Let us assume we do an archaeological site survey using a metal detector. While walking through the site we take GPS coordinates of important findings based on the output of the metal detector. Later, we intend to draw an image of a map that highlights these locations using the aggregate function. In this case the zeroValue could be an area map with no highlights. The possibly huge set of input data is stored as GPS coordinates across many partitions. seqOp could convert the GPS coordinates to map coordinates and put a marker on the map at the respective position. combOp will receive these highlights as partial maps and combine them into a single final output map. (A numeric sketch of using two different data types follows right after this list.)
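
As announced in the list above, here is a small numeric sketch of U != T: the inputs are plain Ints, while the aggregation type is a (sum, count) pair used to compute an average.

val nums = sc.parallelize(1 to 6, 2)
// seqOp folds an Int into the (sum, count) pair; combOp merges two pairs
val (sum, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),
  (p1, p2) => (p1._1 + p2._1, p1._2 + p2._2))
val avg = sum.toDouble / count   // 3.5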

Listing Variants

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T)=> U, combOp: (U, U) => U): U

Examples 1

val z = sc.parallelize(List(1,2,3,4,5,6), 2)

z.aggregate(0)(math.max(_, _), _ + _)

res40: Int = 9

val z =sc.parallelize(List("a","b","c","d","e","f"),2)

z.aggregate("")(_ + _, _+_)

res115: String = abcdef

z.aggregate("x")(_ + _, _+_)

res116: String = xxdefxabc

val z = sc.parallelize(List("12","23","345","4567"),2)

z.aggregate("")((x,y) =>math.max(x.length, y.length).toString, (x,y) => x + y)

res141: String = 42

z.aggregate("")((x,y) =>math.min(x.length, y.length).toString, (x,y) => x + y)

res142: String = 11

val z = sc.parallelize(List("12","23","345",""),2)

z.aggregate("")((x,y) =>math.min(x.length, y.length).toString, (x,y) => x + y)

res143: String = 10

The main issue with the code above is that the result of the inner min is a string of length 1.

The zero in the output is due to the empty string being the last string in the list. We see this result because we are not recursively reducing any further within the partition for the final string.

Examples 2

val z =sc.parallelize(List("12","23","","345"),2)

z.aggregate("")((x,y) =>math.min(x.length, y.length).toString, (x,y) => x + y)

res144: String = 11

In contrast to the previous example, this example has the empty string at the beginning of the second partition. This results in a length of zero being input to the second reduce, which then upgrades it to a length of 1. (Warning: The above example shows bad design since the output is dependent on the order of the data inside the partitions.)
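
One way to understand such partition-dependent results is to look at the partition contents directly, for example with glom (a sketch; the exact split depends on how parallelize slices the list):

val z = sc.parallelize(List("12", "23", "", "345"), 2)
// one array per partition, e.g. Array(Array(12, 23), Array("", 345)),
// showing which strings each per-partition seqOp run actually sees
z.glom.collect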

cartesian

Computes the cartesian product between two RDDs (i.e. each item of the first RDD is joined with each item of the second RDD) and returns them as a new RDD. (Warning: Be careful when using this function! Memory consumption can quickly become an issue!)

Listing Variants

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

Example

val x = sc.parallelize(List(1,2,3,4,5))

val y = sc.parallelize(List(6,7,8,9,10))

x.cartesian(y).collect

res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8),(1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9),(3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))

checkpoint

Will create a checkpoint when the RDD is computed next. Checkpointed RDDs are stored as a binary file within the checkpoint directory which can be specified using the Spark context. (Warning: Spark applies lazy evaluation. Checkpointing will not occur until an action is invoked.)

Important note: the directory "my_directory_name" should exist in all slaves. As an alternative you could use an HDFS directory URL as well.

Listing Variants

def checkpoint()

Example

sc.setCheckpointDir("my_directory_name")

val a = sc.parallelize(1 to 4)

a.checkpoint

a.count

14/02/25 18:13:53 INFO SparkContext: Starting job:count at <console>:15

...

14/02/25 18:13:53 INFO MemoryStore: Block broadcast_5 stored as values to memory (estimated size 115.7 KB, free 296.3 MB)

14/02/25 18:13:53 INFO RDDCheckpointData: Done checkpointing RDD 11 to file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11, new parent is RDD 12

res23: Long = 4

coalesce, repartition

Coalesces the associated data into a given number of partitions. repartition(numPartitions) is simply an abbreviation for coalesce(numPartitions, shuffle = true).

Listing Variants

def coalesce ( numPartitions : Int , shuffle : Boolean= false ): RDD [T]

def repartition ( numPartitions : Int ): RDD [T]

Example

val y = sc.parallelize(1 to 10, 10)

val z = y.coalesce(2, false)

z.partitions.length

res9: Int = 2
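
Note that without a shuffle coalesce can only reduce the number of partitions; to increase it, pass shuffle = true or use repartition (a sketch, reusing the RDD y from above):

y.coalesce(20, false).partitions.length   // still 10: cannot grow without a shuffle
y.coalesce(20, true).partitions.length    // 20
y.repartition(4).partitions.length        // 4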

cogroup [Pair], groupWith [Pair]

A very powerful set of functions that allow grouping up to 3 key-value RDDs together using their keys.

Listing Variants

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def cogroup[W](other: RDD[(K, W)], numPartitions:Int): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W](other: RDD[(K, W)], partitioner:Partitioner): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], numPartitions: Int): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def groupWith[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

Examples

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.map((_, "b"))

val c = a.map((_, "c"))

b.cogroup(c).collect

res7: Array[(Int, (Seq[String], Seq[String]))] =Array(

(2,(ArrayBuffer(b),ArrayBuffer(c))),

(3,(ArrayBuffer(b),ArrayBuffer(c))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))

)

val d = a.map((_, "d"))

b.cogroup(c, d).collect

res9: Array[(Int, (Seq[String], Seq[String],Seq[String]))] = Array(

(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d,d)))

)

val x = sc.parallelize(List((1, "apple"),(2, "banana"), (3, "orange"), (4, "kiwi")), 2)

val y = sc.parallelize(List((5, "computer"),(1, "laptop"), (1, "desktop"), (4, "iPad")), 2)

x.cogroup(y).collect

res23: Array[(Int, (Seq[String], Seq[String]))] =Array(

(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))),

(2,(ArrayBuffer(banana),ArrayBuffer())),

(3,(ArrayBuffer(orange),ArrayBuffer())),

(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),

(5,(ArrayBuffer(),ArrayBuffer(computer))))

collect, toArray

Converts the RDD into a Scala array and returns it. If you provide a standard map-function (i.e. f = T -> U) it will be applied before inserting the values into the result array.

Listing Variants

def collect(): Array[T]

def collect[U: ClassTag](f: PartialFunction[T, U]):RDD[U]

def toArray(): Array[T]

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.collect

res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu,Rat)
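
The second listed variant first applies a partial function and only keeps the items it is defined for; a small sketch using the same RDD c:

// keep only the names starting with "G", upper-cased; collect(pf) returns an
// RDD, so a second collect fetches the result to the driver
c.collect({ case x if x.startsWith("G") => x.toUpperCase }).collect
// Array(GNU, GNU)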

collectAsMap [Pair]

Similar to collect, but works on key-value RDDs and converts them into Scala maps to preserve their key-value structure.

Listing Variants

def collectAsMap(): Map[K, V]

Example

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.zip(a)

b.collectAsMap

res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1-> 1, 3 -> 3)

combineByKey[Pair]

Very efficient implementation that combines the values of a RDD consisting of two-component tuples by applying multiple aggregators one after another.

Listing Variants

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions:Int): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner:Partitioner, mapSideCombine: Boolean = true, serializerClass: String = null):RDD[(K, C)]

Example

val a =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)

val c = b.zip(a)

val d = c.combineByKey(List(_), (x:List[String],y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)

d.collect

res16: Array[(Int, List[String])] = Array((1,List(cat,dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))
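
Another common use is a per-key average, where the combiner type is a (sum, count) pair (a sketch):

val scores = sc.parallelize(List(("a", 1.0), ("a", 3.0), ("b", 4.0)))
val sumCount = scores.combineByKey(
  (v: Double) => (v, 1),                                              // createCombiner
  (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),        // mergeValue
  (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)) // mergeCombiners
sumCount.mapValues { case (sum, count) => sum / count }.collect
// Array((a,2.0), (b,4.0))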

compute

Executes dependencies and computes the actual representation of the RDD. This function should not be called directly by users.

Listing Variants

def compute(split: Partition, context: TaskContext):Iterator[T]

context, sparkContext

Returns the SparkContext that was used to create the RDD.

Listing Variants

def context: SparkContext

def sparkContext: SparkContext

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.context

res8: org.apache.spark.SparkContext =org.apache.spark.SparkContext@58c1c2f1

count

Returns the number of items stored within a RDD.

Listing Variants

def count(): Long

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.count

res2: Long = 4

countApprox

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]

countByKey [Pair]

Very similar to count, but counts the values of a RDD consisting of two-component tuples for each distinct key separately.

Listing Variants

def countByKey(): Map[K, Long]

Example

val c = sc.parallelize(List((3, "Gnu"), (3,"Yak"), (5, "Mouse"), (3, "Dog")), 2)

c.countByKey

res3: scala.collection.Map[Int,Long] = Map(3 -> 3,5 -> 1)

countByKeyApprox [Pair]

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countByKeyApprox(timeout: Long, confidence: Double= 0.95): PartialResult[Map[K, BoundedDouble]]

countByValue

Returns a map that contains all unique values of the RDD and their respective occurrence counts. (Warning: This operation will finally aggregate the information in a single reducer.)

Listing Variants

def countByValue(): Map[T, Long]

Example

val b =sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b.countByValue

res27: scala.collection.Map[Int,Long] = Map(5 -> 1,8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)

countByValueApprox

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countByValueApprox(timeout: Long, confidence:Double = 0.95): PartialResult[Map[T, BoundedDouble]]

countApproxDistinct

Computes the approximate number of distinct values. For large RDDs which are spread across many nodes, this function may execute faster than other counting methods. The parameter relativeSD controls the accuracy of the computation.

Listing Variants

def countApproxDistinct(relativeSD: Double = 0.05):Long

Example

val a = sc.parallelize(1 to 10000, 20)

val b = a++a++a++a++a

b.countApproxDistinct(0.1)

res14: Long = 10784

b.countApproxDistinct(0.05)

res15: Long = 11055

b.countApproxDistinct(0.01)

res16: Long = 10040

b.countApproxDistinct(0.001)

res0: Long = 10001

countApproxDistinctByKey [Pair]

 

Similar to countApproxDistinct, but computes the approximate number of distinct values for each distinct key. Hence, the RDD must consist of two-component tuples. For large RDDs which are spread across many nodes, this function may execute faster than other counting methods. The parameter relativeSD controls the accuracy of the computation.

Listing Variants

def countApproxDistinctByKey(relativeSD: Double =0.05): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,numPartitions: Int): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,partitioner: Partitioner): RDD[(K, Long)]

Example

val a = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

val b = sc.parallelize(a.takeSample(true, 10000, 0),20)

val c = sc.parallelize(1 to b.count().toInt, 20)

val d = b.zip(c)

d.countApproxDistinctByKey(0.1).collect

res15: Array[(String, Long)] = Array((Rat,2567),(Cat,3357), (Dog,2414), (Gnu,2494))

d.countApproxDistinctByKey(0.01).collect

res16: Array[(String, Long)] = Array((Rat,2555),(Cat,2455), (Dog,2425), (Gnu,2513))

d.countApproxDistinctByKey(0.001).collect

res0: Array[(String, Long)] = Array((Rat,2562),(Cat,2464), (Dog,2451), (Gnu,2521))

dependencies

 

Returns the dependencies of this RDD, i.e. the parent RDDs from which it was derived.

Listing Variants

final def dependencies: Seq[Dependency[_]]

Example

val b =sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b: org.apache.spark.rdd.RDD[Int] =ParallelCollectionRDD[32] at parallelize at <console>:12

b.dependencies.length

Int = 0

b.map(a => a).dependencies.length

res40: Int = 1

b.cartesian(a).dependencies.length

res41: Int = 2

b.cartesian(a).dependencies

res42: Seq[org.apache.spark.Dependency[_]] =List(org.apache.spark.rdd.CartesianRDD$$anon$1@576ddaaa,org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)

distinct

 

Returns a new RDD that contains each unique value only once.

Listing Variants

def distinct(): RDD[T]

def distinct(numPartitions: Int): RDD[T]

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.distinct.collect

res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))

a.distinct(2).partitions.length

res16: Int = 2

a.distinct(3).partitions.length

res17: Int = 3

first

 

Looks for the very first data item of the RDD and returns it.

Listing Variants

def first(): T

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.first

res1: String = Gnu

filter

 

Evaluates a boolean function for each data item of the RDD and puts the items for which the function returned true into the resulting RDD.

Listing Variants

def filter(f: T => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 10, 3)

val b = a.filter(_ % 2 == 0)

b.collect

res3: Array[Int] = Array(2, 4, 6, 8, 10)

When you provide a filter function, it must be able to handle all data items contained in the RDD. Scala provides so-called partial functions to deal with mixed data-types. (Tip: Partial functions are very useful if you have some data which may be bad and you do not want to handle, but for the good data (matching data) you want to apply some kind of map function. The following article is good. It teaches you about partial functions in a very nice way and explains why case has to be used for partial functions: article)

Examples for mixed data without partial functions

val b = sc.parallelize(1 to 8)

b.filter(_ < 4).collect

res15: Array[Int] = Array(1, 2, 3)

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.filter(_ < 4).collect

<console>:15: error: value < is not a memberof Any

This fails because some components of a are not implicitly comparable against integers. Collect uses the isDefinedAt property of a function-object to determine whether the test-function is compatible with each data item. Only data items that pass this test (=filter) are then mapped using the function-object.

Examples for mixed data with partial functions

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.collect({case a: Int    => "is integer"

           case b: String => "is string" }).collect

res17: Array[String] = Array(is string, is string, isinteger, is string)

val myfunc: PartialFunction[Any, Any] = {

  case a: Int    => "is integer"

  case b: String => "is string" }

myfunc.isDefinedAt("")

res21: Boolean = true

myfunc.isDefinedAt(1)

res22: Boolean = true

myfunc.isDefinedAt(1.5)

res23: Boolean = false

Be careful! The above code works because it only checks the type itself! If you use operations on this type, you have to explicitly declare what type you want instead of Any. Otherwise the compiler does (apparently) not know what bytecode it should produce:

val myfunc2: PartialFunction[Any, Any] = {case x if (x < 4) => "x"}

<console>:10: error: value < is not a member of Any

val myfunc2: PartialFunction[Int, Any] = {case x if (x < 4) => "x"}

myfunc2: PartialFunction[Int,Any] = <function1>

filterWith

 

This is an extended version of filter. It takes two function arguments. The first argument must conform to Int -> T and is executed once per partition. It will transform the partition index to type T. The second function looks like (U, T) -> Boolean. T is the transformed partition index and U are the data items from the RDD. Finally the function has to return either true or false (i.e. apply the filter).

Listing Variants

def filterWith[A: ClassTag](constructA: Int =>A)(p: (T, A) => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 9, 3)

val b = a.filterWith(i => i)((x,i) => x % 2 == 0|| i % 2 == 0)

b.collect

res37: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)

a.filterWith(x=> x)((a, b) =>  b == 0).collect

res30: Array[Int] = Array(1, 2)

a.filterWith(x=> x)((a, b) =>  a % (b+1) == 0).collect

res33: Array[Int] = Array(1, 2, 4, 6, 8, 10)

a.filterWith(x=> x.toString)((a, b) =>  b == "2").collect

res34: Array[Int] = Array(5, 6)

flatMap

 

Similar to map, but allows emitting more than one item in the map function.

Listing Variants

def flatMap[U: ClassTag](f: T =>TraversableOnce[U]): RDD[U]

Example

val a = sc.parallelize(1 to 10, 5)

a.flatMap(1 to _).collect

res47: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3,4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7,8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

sc.parallelize(List(1, 2, 3), 2).flatMap(x =>List(x, x, x)).collect

res85: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)

// The program below generates a random number ofcopies (up to 10) of the items in the list.

val x  =sc.parallelize(1 to 10, 3)

x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect

res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4,4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9,9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)

flatMapValues

 

Very similar to mapValues, but collapses the inherent structure of the values during mapping.

Listing Variants

def flatMapValues[U](f: V => TraversableOnce[U]):RDD[(K, U)]

Example

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.flatMapValues("x" + _ +"x").collect

res6: Array[(Int, Char)] = Array((3,x), (3,d), (3,o),(3,g), (3,x), (5,x), (5,t), (5,i), (5,g), (5,e), (5,r), (5,x), (4,x), (4,l),(4,i), (4,o), (4,n), (4,x), (3,x), (3,c), (3,a), (3,t), (3,x), (7,x), (7,p),(7,a), (7,n), (7,t), (7,h), (7,e), (7,r), (7,x), (5,x), (5,e), (5,a), (5,g),(5,l), (5,e), (5,x))

flatMapWith:

Similar to flatMap, but the function also has access to the partition index: the first argument constructs a value from the partition index, which is then passed to the mapping function.

def flatMapWith[A: ClassTag, U: ClassTag](constructA:Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => Seq[U]):RDD[U]

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)

a.flatMapWith(x => x, true)((x, y) => List(y,x)).collect

res58: Array[Int] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1,5, 1, 6, 2, 7, 2, 8, 2, 9)

Note: each element is returned in the form (partition index, data), flattened into one sequence.

fold:

Aggregates the values of all data items. The aggregation within each partition, and the final combination across partitions, both start from the given zeroValue.

val a = sc.parallelize(List(1,2,3), 3)

a.fold(0)(_ + _)

res59: Int = 6

a.fold(1)(_ + _)

res59: Int = 10

a.fold(2)(_ + _)

res59: Int = 14

a.fold(1)(_ - _)

res59: Int = 4

a.fold(2)(_ - _)

res59: Int = 2

foldByKey:

Folds the values by key, combining values that share the same key. This can only be used on RDDs of two-component tuples.

def foldByKey(zeroValue: V)(func: (V, V) => V):RDD[(K, V)]

def foldByKey(zeroValue: V, numPartitions: Int)(func:(V, V) => V): RDD[(K, V)]

def foldByKey(zeroValue: V, partitioner:Partitioner)(func: (V, V) => V): RDD[(K, V)]

val a = sc.parallelize(List("dog","cat", "owl", "gnu", "ant"), 2)

val b = a.map(x => (x.length, x))

b.foldByKey("")(_ + _).collect

res84: Array[(Int, String)] = Array((3,dogcatowlgnuant))

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.foldByKey("")(_ + _).collect

res85: Array[(Int, String)] = Array((4,lion),(3,dogcat), (7,panther), (5,tigereagle))

foreach:

Executes the given function for each data item.

def foreach(f: T => Unit)

val c = sc.parallelize(List("cat","dog", "tiger", "lion", "gnu","crocodile", "ant", "whale", "dolphin","spider"), 3)

c.foreach(x => println(x + "s areyummy"))

lions are yummy

gnus are yummy

crocodiles are yummy

ants are yummy

whales are yummy

dolphins are yummy

spiders are yummy

foreachPartition:

Executes the given function once for each partition; the data of each partition is accessed through an iterator.

def foreachPartition(f: Iterator[T] => Unit)

val b = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8,9), 3)

b.foreachPartition(x => println(x.reduce(_ + _)))

6

15

24

foreachWith:

forachPartition类似,第一个参数是partition的索引号,第二个参数是对应partition对应的data

def foreachWith[A: ClassTag](constructA: Int => A)(f: (T, A) => Unit)

val a = sc.parallelize(1 to 9, 3)

a.foreachWith(i => i)((x,i) => if (x % 2 == 1&& i % 2 == 0) println(x) )

1

3

7

9

getCheckpointFile:

Returns the path to the checkpoint file, or None if the RDD has not been checkpointed yet.

def getCheckpointFile: Option[String]

sc.setCheckpointDir("/home/cloudera/Documents")

val a = sc.parallelize(1 to 500, 5)

val b = a++a++a++a++a

b.getCheckpointFile

res49: Option[String] = None

b.checkpoint

b.getCheckpointFile

res54: Option[String] = None

b.collect

b.getCheckpointFile

res57: Option[String] =Some(file:/home/cloudera/Documents/cb978ffb-a346-4820-b3ba-d56580787b20/rdd-40)

getStorageLevel:

Returns the RDD's current storage level.

val a = sc.parallelize(1 to 100000, 2)

a.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)

a.getStorageLevel.description

String = Disk Serialized 1x Replicated

a.cache

java.lang.UnsupportedOperationException: Cannot changestorage level of an RDD after it was already assigned a level

glom:

Assembles the elements of each partition into an array and returns an RDD of these arrays.

def glom(): RDD[Array[T]]

val a = sc.parallelize(1 to 100, 3)

a.glom.collect

res8: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6,7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,27, 28, 29, 30, 31, 32, 33), Array(34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,65, 66), Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82,83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100))

groupBy:

Groups the data items by the key K produced by the supplied function.

def groupBy[K: ClassTag](f: T => K): RDD[(K,Seq[T])]

def groupBy[K: ClassTag](f: T => K, numPartitions:Int): RDD[(K, Seq[T])]

def groupBy[K: ClassTag](f: T => K, p:Partitioner): RDD[(K, Seq[T])]

val a = sc.parallelize(1 to 9, 3)

a.groupBy(x => { if (x % 2 == 0) "even"else "odd" }).collect

res42: Array[(String, Seq[Int])] =Array((even,ArrayBuffer(2, 4, 6, 8)), (odd,ArrayBuffer(1, 3, 5, 7, 9)))

val a = sc.parallelize(1 to 9, 3)

def myfunc(a: Int) : Int =

{

  a % 2

}

a.groupBy(myfunc).collect

res3: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2,4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))

val a = sc.parallelize(1 to 9, 3)

def myfunc(a: Int) : Int =

{

  a % 2

}

a.groupBy(x => myfunc(x), 3).collect     // the 3 here specifies the number of partitions

a.groupBy(myfunc(_), 1).collect

res7: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2,4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))

import org.apache.spark.Partitioner

class MyPartitioner extends Partitioner {

def numPartitions: Int = 2

def getPartition(key: Any): Int =

{

    key match

    {

      casenull     => 0

      case key:Int => key          % numPartitions

      case _        => key.hashCode % numPartitions

    }

  }

  override defequals(other: Any): Boolean =

  {

    other match

    {

      case h:MyPartitioner => true

      case_                => false

    }

  }

}

val a = sc.parallelize(1 to 9, 3)

val p = new MyPartitioner()

val b = a.groupBy((x:Int) => { x }, p)

val c = b.mapWith(i => i)((a, b) => (b, a))

c.collect

res42: Array[(Int, (Int, Seq[Int]))] =Array((0,(4,ArrayBuffer(4))), (0,(2,ArrayBuffer(2))), (0,(6,ArrayBuffer(6))),(0,(8,ArrayBuffer(8))), (1,(9,ArrayBuffer(9))), (1,(3,ArrayBuffer(3))),(1,(1,ArrayBuffer(1))), (1,(7,ArrayBuffer(7))), (1,(5,ArrayBuffer(5))))

groupByKey [Pair]:

group类似,但是有所不同,会按照key进行分组,value组成一个ArrayBuffer

def groupByKey(): RDD[(K, Seq[V])]

def groupByKey(numPartitions: Int): RDD[(K, Seq[V])]

def groupByKey(partitioner: Partitioner): RDD[(K,Seq[V])]

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "spider","eagle"), 2)

val b = a.keyBy(_.length)

b.groupByKey.collect

res11: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)),(6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger,eagle)))

histogram [Double]:

Computes a histogram of the values: either with a given number of evenly spaced buckets between the minimum and maximum value, or with user-defined bucket boundaries.

def histogram(bucketCount: Int): Pair[Array[Double],Array[Long]]

def histogram(buckets: Array[Double], evenBuckets:Boolean = false): Array[Long]

With evenly spaced buckets:

val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1,7.4, 7.5, 7.6, 8.8, 9.0), 3)

a.histogram(5)

res11: (Array[Double], Array[Long]) = (Array(1.1,2.68, 4.26, 5.84, 7.42, 9.0),Array(5, 0, 0, 1, 4))

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3,5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)

a.histogram(6)

res18: (Array[Double], Array[Long]) = (Array(1.0, 2.5,4.0, 5.5, 7.0, 8.5, 10.0),Array(6, 0, 1, 1, 3, 4))

With user-defined bucket boundaries:

val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1,7.4, 7.5, 7.6, 8.8, 9.0), 3)

a.histogram(Array(0.0, 3.0, 8.0))

res14: Array[Long] = Array(5, 3)

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3,5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)

a.histogram(Array(0.0, 5.0, 10.0))

res1: Array[Long] = Array(6, 9)

a.histogram(Array(0.0, 5.0, 10.0, 15.0))

res1: Array[Long] = Array(6, 8, 1)

id:

Retrieves the id that Spark has assigned to the RDD.

val y = sc.parallelize(1 to 10, 10)

y.id

res16: Int = 19

isCheckpointed:

Indicates whether the RDD has been checkpointed.

sc.setCheckpointDir("/home/cloudera/Documents")

c.isCheckpointed

res6: Boolean = false

c.checkpoint

c.isCheckpointed

res8: Boolean = false

c.collect

c.isCheckpointed

res9: Boolean = true

iterator:

Returns an iterator over a partition of the RDD. This method should never be called directly.

join [Pair]:

Performs an inner join of two key-value RDDs, much like a join in a database.

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], numPartitions: Int):RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], partitioner:Partitioner): RDD[(K, (V, W))]

val a = sc.parallelize(List("dog","salmon", "salmon", "rat", "elephant"),3)

val b = a.keyBy(_.length)

val c =

sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val d = c.keyBy(_.length)

b.join(d).collect

res17: Array[(Int, (String, String))] =Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)),(6,(rabbit,salmon)), (6,(rabbit,rabbit)), (6,(rabbit,turkey)),(6,(turkey,salmon)), (6,(turkey,rabbit)), (6,(turkey,turkey)), (3,(dog,dog)),(3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(cat,dog)), (3,(cat,cat)), (3,(cat,gnu)),(3,(cat,bee)), (3,(gnu,dog)), (3,(gnu,cat)), (3,(gnu,gnu)), (3,(gnu,bee)),(3,(bee,dog)), (3,(bee,cat)), (3,(bee,gnu)), (3,(bee,bee)), (4,(wolf,wolf)),(4,(wolf,bear)), (4,(bear,wolf)), (4,(bear,bear)))

keyBy:

Constructs key-value tuples by applying a function to each data item; the result of the function becomes the key, which is paired with the original item.

def keyBy[K](f: T => K): RDD[(K, T)]

val a = sc.parallelize(List("dog","salmon", "salmon", "rat", "elephant"),3)

val b = a.keyBy(_.length)

b.collect

res26: Array[(Int, String)] = Array((3,dog),(6,salmon), (6,salmon), (3,rat), (8,elephant))

keys [Pair]:

Extracts the keys of all key-value pairs and returns them as a new RDD.

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.keys.collect

res2: Array[Int] = Array(3, 5, 4, 3, 7, 5)

leftOuterJoin [Pair]:

Performs a left outer join of two key-value RDDs.

def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V,Option[W]))]

def leftOuterJoin[W](other: RDD[(K, W)],numPartitions: Int): RDD[(K, (V, Option[W]))]

def leftOuterJoin[W](other: RDD[(K, W)], partitioner:Partitioner): RDD[(K, (V, Option[W]))]

val a = sc.parallelize(List("dog","salmon", "salmon", "rat", "elephant"),3)

val b = a.keyBy(_.length)

val c =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val d = c.keyBy(_.length)

b.leftOuterJoin(d).collect

res1: Array[(Int, (String, Option[String]))] =Array((6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))),(6,(salmon,Some(turkey))), (6,(salmon,Some(salmon))),(6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (3,(dog,Some(dog))),(3,(dog,Some(cat))), (3,(dog,Some(gnu))), (3,(dog,Some(bee))),(3,(rat,Some(dog))), (3,(rat,Some(cat))), (3,(rat,Some(gnu))),(3,(rat,Some(bee))), (8,(elephant,None)))

lookup:

Scans the RDD for all entries with the given key and returns their values as a Seq.

def lookup(key: K): Seq[V]

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.lookup(5)

res0: Seq[String] = WrappedArray(tiger, eagle)

map:

RDD中的每个data都应用转换方法,并且把结果RDD返回

def map[U: ClassTag](f: T => U): RDD[U]

val a = sc.parallelize(List("dog","salmon", "salmon", "rat", "elephant"),3)

val b = a.map(_.length)

val c = a.zip(b)

c.collect

res0: Array[(String, Int)] = Array((dog,3),(salmon,6), (salmon,6), (rat,3), (elephant,8))

mapPartitions:

Similar to map, but the supplied function is executed once per partition and receives an iterator over all items of that partition, which allows per-partition setup and batching.

def mapPartitions[U: ClassTag](f: Iterator[T] =>Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Example 1

val a = sc.parallelize(1 to 9, 3)

def myfunc[T](iter: Iterator[T]) : Iterator[(T, T)] ={

  var res =List[(T, T)]()

  var pre =iter.next

  while(iter.hasNext)

  {

    val cur =iter.next;

    res .::=(pre, cur)

    pre = cur;

  }

  res.iterator

}

a.mapPartitions(myfunc).collect

res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6),(4,5), (8,9), (7,8))

Example 2

val x = sc.parallelize(List("1","2", "3", "4", "5", "6","7", "8", "10"), 3)

def myfunc(iter: Iterator[Int]) : Iterator[Int] = {

  var res =List[Int]()

  while(iter.hasNext) {

    val cur =iter.next;

    res = res::: List.fill(scala.util.Random.nextInt(10))(cur)

  }

  res.iterator

}

x.mapPartitions(myfunc).collect

// some of the number are not outputted at all. Thisis because the random number generated for it is zero.

res8: Array[Int] = Array(1, 2, 2, 2, 2, 3, 3, 3, 3, 3,3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 7, 7, 7, 9, 9, 10)

mapPartitionsWithContext:

mapParitions类似,但是允许获得运行状态的信息.

def mapPartitionsWithContext[U: ClassTag](f:(TaskContext, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean =false): RDD[U]
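
A minimal sketch (assuming the TaskContext exposes a partitionId member, as it does in recent Spark versions):

import org.apache.spark.TaskContext

val a = sc.parallelize(1 to 9, 3)

def myfunc(tc: TaskContext, iter: Iterator[Int]): Iterator[String] = {
  // tag every item with the id of the partition it came from
  iter.map(x => "partition " + tc.partitionId + " -> " + x)
}

a.mapPartitionsWithContext(myfunc).collect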

mapPartitionsWithIndex:

Similar to mapPartitions, but the supplied function takes two arguments: the index of the partition and an iterator over all items of that partition.

Example

val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)

def myfunc(index: Int, iter: Iterator[Int]) :Iterator[String] = {

 iter.toList.map(x => index + "," + x).iterator

}

x.mapPartitionsWithIndex(myfunc).collect()

res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5,1,6, 2,7, 2,8, 2,9, 2,10)


mapValues:

Requires a key-value RDD; the supplied function is applied to the value of each pair while the keys are left unchanged.

def mapValues[U](f: V => U): RDD[(K, U)]

Example

val a = sc.parallelize(List("dog", "tiger","lion", "cat", "panther", "eagle"), 2)

val b = a.map(x => (x.length, x))

b.mapValues("x" + _ + "x").collect

res5: Array[(Int, String)] = Array((3,xdogx),(5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))

mean [Double], meanApprox [Double]:

Computes the mean of all values; meanApprox returns an approximate result within a timeout.

def mean(): Double

def meanApprox(timeout: Long, confidence: Double =0.95): PartialResult[BoundedDouble]

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3,5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)

a.mean

res0: Double = 5.3

name, setName:

Gets or sets the name of the RDD.

@transient var name: String

def setName(_name: String)

val y = sc.parallelize(1 to 10, 10)

y.name

res13: String = null

y.setName("Fancy RDD Name")

y.name

res15: String = Fancy RDD Name

partitions:

Returns an array of the partition objects associated with this RDD.

final def partitions: Array[Partition]

b.partitions

res1: Array[org.apache.spark.Partition] =Array(org.apache.spark.rdd.ParallelCollectionPartition@691,org.apache.spark.rdd.ParallelCollectionPartition@692,org.apache.spark.rdd.ParallelCollectionPartition@693)

persist, cache

persist: assigns a storage level to the RDD.

cache: caches the RDD using the default storage level.

def cache(): RDD[T]

def persist(): RDD[T]

def persist(newLevel: StorageLevel): RDD[T]

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.getStorageLevel

res0: org.apache.spark.storage.StorageLevel =StorageLevel(false, false, false, 1)

c.cache

c.getStorageLevel

res2: org.apache.spark.storage.StorageLevel =StorageLevel(false, true, true, 1)

pipe:

Pipes the data of each partition through the given shell command and returns the output as a new RDD of strings.

def pipe(command: String): RDD[String]

def pipe(command: String, env: Map[String, String]):RDD[String]

def pipe(command: Seq[String], env: Map[String,String] = Map(), printPipeContext: (String => Unit) => Unit = null,printRDDElement: (T, String => Unit) => Unit = null): RDD[String]

val a = sc.parallelize(1 to 9, 3)

a.pipe("head -n 1").collect

res2: Array[String] = Array(1, 4, 7)

reduce:

A very commonly used action. Note that the supplied function should be commutative and associative so that the result does not depend on partitioning.

def reduce(f: (T, T) => T): T

val a = sc.parallelize(1 to 100, 3)

a.reduce(_ + _)

res41: Int = 5050

reduceByKey [Pair], reduceByKeyLocally [Pair], reduceByKeyToDriver [Pair]:

reduce类似,就是他根据key进行操作

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

def reduceByKey(func: (V, V) => V, numPartitions:Int): RDD[(K, V)]

def reduceByKey(partitioner: Partitioner, func: (V, V)=> V): RDD[(K, V)]

def reduceByKeyLocally(func: (V, V) => V): Map[K,V]

def reduceByKeyToDriver(func: (V, V) => V): Map[K,V]

Example

val a = sc.parallelize(List("dog","cat", "owl", "gnu", "ant"), 2)

val b = a.map(x => (x.length, x))

b.reduceByKey(_ + _).collect

res86: Array[(Int, String)] =Array((3,dogcatowlgnuant))

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.reduceByKey(_ + _).collect

res87: Array[(Int, String)] = Array((4,lion), (3,dogcat),(7,panther), (5,tigereagle))

rightOuterJoin [Pair]:

Performs a right outer join of two key-value RDDs.

def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K,(Option[V], W))]

def rightOuterJoin[W](other: RDD[(K, W)],numPartitions: Int): RDD[(K, (Option[V], W))]

def rightOuterJoin[W](other: RDD[(K, W)], partitioner:Partitioner): RDD[(K, (Option[V], W))]

val a = sc.parallelize(List("dog","salmon", "salmon", "rat", "elephant"),3)

val b = a.keyBy(_.length)

val c =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val d = c.keyBy(_.length)

b.rightOuterJoin(d).collect

res2: Array[(Int, (Option[String], String))] =Array((6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)),(6,(Some(salmon),turkey)), (6,(Some(salmon),salmon)),(6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (3,(Some(dog),dog)),(3,(Some(dog),cat)), (3,(Some(dog),gnu)), (3,(Some(dog),bee)),(3,(Some(rat),dog)), (3,(Some(rat),cat)), (3,(Some(rat),gnu)),(3,(Some(rat),bee)), (4,(None,wolf)), (4,(None,bear)))

sample:

RDD中随机选择一个items

def sample(withReplacement: Boolean, fraction: Double,seed: Int): RDD[T]

val a = sc.parallelize(1 to 10000, 3)

a.sample(false, 0.1, 0).count

res24: Long = 960

a.sample(true, 0.3, 0).count

res25: Long = 2888

a.sample(true, 0.3, 13).count

res26: Long = 2985

saveAsHadoopFile [Pair], saveAsHadoopDataset [Pair],saveAsNewAPIHadoopFile [Pair]:

Methods for saving the RDD through the Hadoop integration (Hadoop output formats).

saveAsObjectFile:

Saves the RDD in binary (serialized object) format.

def saveAsObjectFile(path: String)

val x = sc.parallelize(1 to 100, 3)

x.saveAsObjectFile("objFile")

val y = sc.objectFile[Array[Int]]("objFile")

y.collect

res52: Array[Int] = Array(67, 68, 69, 70, 71, 72, 73,74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93,94, 95, 96, 97, 98, 99, 100, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33)

saveAsSequenceFile:

Saves the RDD as a Hadoop sequence file.

val v = sc.parallelize(Array(("owl",3),("gnu",4), ("dog",1), ("cat",2),("ant",5)), 2)

v.saveAsSequenceFile("hd_seq_file")

saveAsTextFile: (text, compressed, HDFS)

Saves the RDD as text files.

val a = sc.parallelize(1 to 10000, 3)

a.saveAsTextFile("mydata_a")

stats [Double]:

Computes the mean, variance and standard deviation of all values of the RDD in one pass.

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0,19.02, 19.29, 11.09, 21.0), 2)

x.stats

res16: org.apache.spark.util.StatCounter = (count: 9,mean: 11.266667, stdev: 8.126859)

sortByKey [Ordered]:

RDD的打他进行排序.按照key进行排序

def sortByKey(ascending: Boolean = true, numPartitions:Int = self.partitions.size): RDD[P]

val a = sc.parallelize(List("dog","cat", "owl", "gnu", "ant"), 2)

val b = sc.parallelize(1 to a.count.toInt, 2)

val c = a.zip(b)

c.sortByKey(true).collect

res74: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1),(gnu,4), (owl,3))

c.sortByKey(false).collect

res75: Array[(String, Int)] = Array((owl,3), (gnu,4),(dog,1), (cat,2), (ant,5))

val a = sc.parallelize(1 to 100, 5)

val b = a.cartesian(a)

val c = sc.parallelize(b.takeSample(true, 5, 13), 2)

val d = c.sortByKey(false)

d.collect

res56: Array[(Int, Int)] = Array((96,9), (84,76), (59,59), (53,65), (52,4))

subtract:

Set subtraction: returns an RDD with the items of A that do not appear in B (A - B).

def subtract(other: RDD[T]): RDD[T]

def subtract(other: RDD[T], numPartitions: Int):RDD[T]

def subtract(other: RDD[T], p: Partitioner): RDD[T]

val a = sc.parallelize(1 to 9, 3)

val b = sc.parallelize(1 to 3, 3)

val c = a.subtract(b)

c.collect

res3: Array[Int] = Array(6, 9, 4, 7, 5, 8)

subtractByKey:

subtract类似,不过这个要求的key-value的格式,比较的key,key进行减

def subtractByKey[W: ClassTag](other: RDD[(K, W)]):RDD[(K, V)]

def subtractByKey[W: ClassTag](other: RDD[(K, W)],numPartitions: Int): RDD[(K, V)]

def subtractByKey[W: ClassTag](other: RDD[(K, W)], p:Partitioner): RDD[(K, V)]

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "spider","eagle"), 2)

val b = a.keyBy(_.length)

val c = sc.parallelize(List("ant","falcon", "squid"), 2)

val d = c.keyBy(_.length)

b.subtractByKey(d).collect

res15: Array[(Int, String)] = Array((4,lion))

sum [Double], sumApprox [Double]:

sum: computes the sum of all values.

sumApprox: computes an approximate sum within a timeout.

def sum(): Double

def sumApprox(timeout: Long, confidence: Double =0.95): PartialResult[BoundedDouble]

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0,19.02, 19.29, 11.09, 21.0), 2)

x.sum

res17: Double = 101.39999999999999

take:

Extracts the first n items of the RDD and returns them as an array.

def take(num: Int): Array[T]

val b = sc.parallelize(List("dog","cat", "ape", "salmon", "gnu"), 2)

b.take(2)

res18: Array[String] = Array(dog, cat)

val b = sc.parallelize(1 to 10000, 5000)

b.take(100)

res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9,10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

takeOrdered:

Orders the data items using their inherent (implicit) ordering and returns the first n items as an array.

def takeOrdered(num: Int)(implicit ord: Ordering[T]):Array[T]

val b = sc.parallelize(List("dog","cat", "ape", "salmon", "gnu"), 2)

b.takeOrdered(2)

res19: Array[String] = Array(ape, cat)

takeSample:

Returns the requested number of randomly sampled items as an array (not an RDD), in random order.

def takeSample(withReplacement: Boolean, num: Int,seed: Int): Array[T]

val x = sc.parallelize(1 to 1000, 3)

x.takeSample(true, 100, 1)

res3: Array[Int] = Array(339, 718, 810, 105, 71, 268,333, 360, 341, 300, 68, 848, 431, 449, 773, 172, 802, 339, 431, 285, 937, 301,167, 69, 330, 864, 40, 645, 65, 349, 613, 468, 982, 314, 160, 675, 232, 794,577, 571, 805, 317, 136, 860, 522, 45, 628, 178, 321, 482, 657, 114, 332, 728,901, 290, 175, 876, 227, 130, 863, 773, 559, 301, 694, 460, 839, 952, 664, 851,260, 729, 823, 880, 792, 964, 614, 821, 683, 364, 80, 875, 813, 951, 663, 344,546, 918, 436, 451, 397, 670, 756, 512, 391, 70, 213, 896, 123, 858)

toDebugString:

Returns a human-readable description of the RDD's dependency chain (lineage).

def toDebugString: String

val a = sc.parallelize(1 to 9, 3)

val b = sc.parallelize(1 to 3, 3)

val c = a.subtract(b)

c.toDebugString

res6: String =

MappedRDD[15] at subtract at <console>:16 (3partitions)

  SubtractedRDD[14]at subtract at <console>:16 (3 partitions)

   MappedRDD[12] at subtract at <console>:16 (3 partitions)

     ParallelCollectionRDD[10] at parallelize at <console>:12 (3partitions)

   MappedRDD[13] at subtract at <console>:16 (3 partitions)

     ParallelCollectionRDD[11] at parallelize at <console>:12 (3partitions)

toJavaRDD:

Wraps this RDD and returns it as a JavaRDD.

def toJavaRDD() : JavaRDD[T]

top:

Returns the largest num items according to the implicit ordering.

def top(num: Int)(implicit ord: Ordering[T]): Array[T]

val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)

c.top(2)

res28: Array[Int] = Array(9, 8)

toString:

RDD转换成可以看的序列

override def toString: String

val a = sc.parallelize(1 to 9, 3)

val b = sc.parallelize(1 to 3, 3)

val c = a.subtract(b)

c.toString

res7: String = MappedRDD[15] at subtract at<console>:16

union, ++:

Returns the union of A and B, similar to a union in a database (duplicates are not removed).

def ++(other: RDD[T]): RDD[T]

def union(other: RDD[T]): RDD[T]

val a = sc.parallelize(1 to 3, 1)

val b = sc.parallelize(5 to 7, 1)

(a ++ b).collect

res0: Array[Int] = Array(1, 2, 3, 5, 6, 7)

unpersist:

Removes the RDD's data from disk and memory. The RDD itself remains usable: if it is referenced in a later computation, Spark will regenerate it automatically from its lineage graph.

def unpersist(blocking: Boolean = true): RDD[T]

val y = sc.parallelize(1 to 10, 10)

val z = (y++y)

z.collect

z.unpersist(true)

14/04/19 03:04:57 INFO UnionRDD: Removing RDD 22 frompersistence list

14/04/19 03:04:57 INFO BlockManager: Removing RDD 22

values:

Extracts the values of all key-value pairs and returns them as a new RDD.

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.values.collect

res3: Array[String] = Array(dog, tiger, lion, cat,panther, eagle)

variance [Double], sampleVariance [Double]:

variance: computes the variance of the values.

sampleVariance: computes the sample variance of the values.

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)

a.variance

res70: Double = 10.605333333333332

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)

x.variance

res14: Double = 66.04584444444443

x.sampleVariance

res13: Double = 74.30157499999999

zip:

Joins two RDDs element-wise, producing an RDD of key-value pairs.

def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]

val a = sc.parallelize(1 to 100, 3)

val b = sc.parallelize(101 to 200, 3)

a.zip(b).collect

res1: Array[(Int, Int)] = Array((1,101), (2,102), (3,103), (4,104), (5,105), (6,106), (7,107), (8,108), (9,109), (10,110), (11,111), (12,112), (13,113), (14,114), (15,115), (16,116), (17,117), (18,118), (19,119), (20,120), (21,121), (22,122), (23,123), (24,124), (25,125), (26,126), (27,127), (28,128), (29,129), (30,130), (31,131), (32,132), (33,133), (34,134), (35,135), (36,136), (37,137), (38,138), (39,139), (40,140), (41,141), (42,142), (43,143), (44,144), (45,145), (46,146), (47,147), (48,148), (49,149), (50,150), (51,151), (52,152), (53,153), (54,154), (55,155), (56,156), (57,157), (58,158), (59,159), (60,160), (61,161), (62,162), (63,163), (64,164), (65,165), (66,166), (67,167), (68,168), (69,169), (70,170), (71,171), (72,172), (73,173), (74,174), (75,175), (76,176), (77,177), (78,...

val a = sc.parallelize(1 to 100, 3)

val b = sc.parallelize(101 to 200, 3)

val c = sc.parallelize(201 to 300, 3)

a.zip(b).zip(c).map((x) => (x._1._1, x._1._2, x._2)).collect

res12: Array[(Int, Int, Int)] = Array((1,101,201), (2,102,202), (3,103,203), (4,104,204), (5,105,205), (6,106,206), (7,107,207), (8,108,208), (9,109,209), (10,110,210), (11,111,211), (12,112,212), (13,113,213), (14,114,214), (15,115,215), (16,116,216), (17,117,217), (18,118,218), (19,119,219), (20,120,220), (21,121,221), (22,122,222), (23,123,223), (24,124,224), (25,125,225), (26,126,226), (27,127,227), (28,128,228), (29,129,229), (30,130,230), (31,131,231), (32,132,232), (33,133,233), (34,134,234), (35,135,235), (36,136,236), (37,137,237), (38,138,238), (39,139,239), (40,140,240), (41,141,241), (42,142,242), (43,143,243), (44,144,244), (45,145,245), (46,146,246), (47,147,247), (48,148,248), (49,149,249), (50,150,250), (51,151,251), (52,152,252), (53,153,253), (54,154,254), (55,155,255)...

zipPartitions:

Similar to zip, but provides more control over how the partitions are combined.

def zipPartitions[B: ClassTag, V: ClassTag](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]

def zipPartitions[B: ClassTag, V: ClassTag](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]

def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C])(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V]

def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V]

def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]

def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]

val a = sc.parallelize(0 to 9, 3)

val b = sc.parallelize(10 to 19, 3)

val c = sc.parallelize(100 to 109, 3)

def myfunc(aiter: Iterator[Int], biter: Iterator[Int], citer: Iterator[Int]): Iterator[String] =
{
  var res = List[String]()
  while (aiter.hasNext && biter.hasNext && citer.hasNext)
  {
    val x = aiter.next + " " + biter.next + " " + citer.next
    res ::= x
  }
  res.iterator
}

a.zipPartitions(b, c)(myfunc).collect

res50: Array[String] = Array(2 12 102, 1 11 101, 0 10 100, 5 15 105, 4 14 104, 3 13 103, 9 19 109, 8 18 108, 7 17 107, 6 16 106)
