RDD Study Notes


1. Driver program ----> runs the main function

Shared variables: sometimes a series of tasks has to run on different nodes at the same time, and the variables used inside each function need to be shared (a short sketch follows the two kinds below):

              1. Broadcast variables: cached in memory on every node, rather than shipped with each task

              2. Accumulators: variables that only support addition
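A minimal sketch of both kinds of shared variable, assuming the same live SparkContext sc used in the examples below (the names lookup, errors and data are illustrative only, not from the notes):

val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))   // read-only data, cached once per node
val errors = sc.accumulator(0)                       // workers may only add to it; the driver reads it
val data = sc.parallelize(List("a", "b", "c"))
data.foreach(k => if (!lookup.value.contains(k)) errors += 1)
errors.value   // 1, because "c" is not in the broadcast map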

Master URLs:

local: run locally with a single thread

local[K]: run locally with K threads in parallel (see the sketch below)
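For reference, a minimal way to use one of these master URLs when creating a context outside the spark-shell (the app name "RDDNotes" is just a placeholder):

import org.apache.spark.SparkContext
val sc = new SparkContext("local[4]", "RDDNotes")   // run locally with 4 worker threads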

aggregate:

     The function takes three arguments: an initial value zeroValue, a seqOp function, and a combOp function.

    seqOp is executed in parallel; the actual computation is carried out by the tasks on the individual executors.

    combOp is executed serially; the combOp operation is invoked from JobWaiter.taskSucceeded.

val z = sc.parallelize(List(1,2,3,4,5,6), 2)

z.aggregate(0)(math.max(_, _), _ + _)

res40: Int = 9
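A short breakdown of why the result is 9, assuming parallelize splits the six elements evenly into (1,2,3) and (4,5,6):

// partition 0: seqOp -> math.max(0, 1, 2, 3) = 3
// partition 1: seqOp -> math.max(0, 4, 5, 6) = 6
// combOp:      0 + 3 + 6 = 9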

val z =sc.parallelize(List("a","b","c","d","e","f"),2)

z.aggregate("")(_ + _, _+_)

res115: String = abcdef

cartesian:

Generates the Cartesian product:

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]
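A quick sketch; note that the result has |x| * |y| elements, so memory use grows fast (the exact output order depends on partitioning):

val x = sc.parallelize(List(1, 2, 3))
val y = sc.parallelize(List("a", "b"))
x.cartesian(y).collect
// e.g. Array((1,a), (1,b), (2,a), (2,b), (3,a), (3,b))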

coalesce:

Repartitions the data into the given number of partitions:

def coalesce ( numPartitions : Int , shuffle : Boolean= false ): RDD [T]

val y = sc.parallelize(1 to 10, 10)

val z = y.coalesce(2, false)

z.partitions.length

res9: Int = 2

cogroup:

A reportedly very powerful function: it groups up to three key-value RDDs together by their shared keys; more than that cannot be combined.

Listing Variants

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def cogroup[W](other: RDD[(K, W)], numPartitions:Int): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W](other: RDD[(K, W)], partitioner:Partitioner): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], numPartitions: Int): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def groupWith[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

Examples

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.map((_, "b"))

val c = a.map((_, "c"))

b.cogroup(c).collect

res7: Array[(Int, (Seq[String], Seq[String]))] =Array(

(2,(ArrayBuffer(b),ArrayBuffer(c))),

(3,(ArrayBuffer(b),ArrayBuffer(c))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))

)

val d = a.map((_, "d"))

b.cogroup(c, d).collect

res9: Array[(Int, (Seq[String], Seq[String],Seq[String]))] = Array(

(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d,d)))

)

val x = sc.parallelize(List((1, "apple"),(2, "banana"), (3, "orange"), (4, "kiwi")), 2)

val y = sc.parallelize(List((5, "computer"),(1, "laptop"), (1, "desktop"), (4, "iPad")), 2)

x.cogroup(y).collect

res23: Array[(Int, (Seq[String], Seq[String]))] =Array(

(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))),

(2,(ArrayBuffer(banana),ArrayBuffer())),

(3,(ArrayBuffer(orange),ArrayBuffer())),

(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),

(5,(ArrayBuffer(),ArrayBuffer(computer))))

collect, toArray:

Converts the RDD into a Scala array.

def collect(): Array[T]

def collect[U: ClassTag](f: PartialFunction[T, U]):RDD[U]

def toArray(): Array[T]

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.collect

res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu,Rat)

collectAsMap:

Similar to collect, but when the key-value pairs are converted to Scala the map structure is preserved.

def collectAsMap(): Map[K, V]

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.zip(a)

b.collectAsMap

res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1-> 1, 3 -> 3)

combineByKey:

Automatically gathers the values that share the same key into one collection; the final result consists of one such collection per distinct key.

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions:Int): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner:Partitioner, mapSideCombine: Boolean = true, serializerClass: String = null):RDD[(K, C)]

val a =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)

val c = b.zip(a)

val d = c.combineByKey(List(_), (x:List[String],y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)

d.collect

res16: Array[(Int, List[String])] = Array((1,List(cat,dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))

:: and ::: are both List combination operators: :: prepends a single element, while ::: concatenates two lists.
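To make the two operators concrete (plain Scala, no Spark required):

"dog" :: List("cat", "gnu")          // List(dog, cat, gnu)  -- :: prepends one element
List("dog") ::: List("cat", "gnu")   // List(dog, cat, gnu)  -- ::: concatenates two lists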

compute:

Executes the dependencies and computes the actual representation of the RDD. Not called directly by users.

def compute(split: Partition, context: TaskContext):Iterator[T]

context, sparkContext:

Returns the SparkContext that was used to create the RDD.

def context: SparkContext

def sparkContext: SparkContext

count:

Returns the number of items in the RDD.

def count(): Long

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.count

res2: Long = 4

countApprox:

Not sure what this one does yet; it is marked as an experimental feature.

def countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]

countByKey [Pair]:

Similar to count, but counts the values for each distinct key and returns the result as a map.

def countByKey(): Map[K, Long]

val c = sc.parallelize(List((3, "Gnu"), (3,"Yak"), (5, "Mouse"), (3, "Dog")), 2)

c.countByKey

res3: scala.collection.Map[Int,Long] = Map(3 -> 3,5 -> 1)

countByValue:

Counts how many times each value occurs and returns a value -> count map.

def countByValue(): Map[T, Long]

val b =sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b.countByValue

res27: scala.collection.Map[Int,Long] = Map(5 -> 1,8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)

countByValueApprox:

Functionality not explored yet; it is marked as an experimental feature.

def countByValueApprox(timeout: Long, confidence:Double = 0.95): PartialResult[Map[T, BoundedDouble]]

countApproxDistinct:

Approximate count of distinct values; very useful when the amount of data is large.

def countApproxDistinct(relativeSD: Double = 0.05): Long

val a = sc.parallelize(1 to 10000, 20)

val b = a++a++a++a++a

b.countApproxDistinct(0.1)

val a = sc.parallelize(1 to 30000, 30)

val b = a++a++a++a++a

b.countApproxDistinct(0.05)

res28: Long = 30097

As you can see, it computes an approximate count of the distinct values in a.

countApproxDistinctByKey [Pair]:

Similar to countApproxDistinct, but computes the approximate number of distinct values for each key, so the RDD must consist of key-value pairs. The computation is fast.

def countApproxDistinctByKey(relativeSD: Double =0.05): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,numPartitions: Int): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,partitioner: Partitioner): RDD[(K, Long)]

val a = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

val b = sc.parallelize(a.takeSample(true, 10000, 0),20)

val c = sc.parallelize(1 to b.count().toInt, 20)

val d = b.zip(c)

d.countApproxDistinctByKey(0.1).collect

res15: Array[(String, Long)] = Array((Rat,2567),(Cat,3357), (Dog,2414), (Gnu,2494))

d.countApproxDistinctByKey(0.01).collect

res16: Array[(String, Long)] = Array((Rat,2555),(Cat,2455), (Dog,2425), (Gnu,2513))

d.countApproxDistinctByKey(0.001).collect

res0: Array[(String, Long)] = Array((Rat,2562),(Cat,2464), (Dog,2451), (Gnu,2521))

As you can see, it gives an approximate count of the distinct values for each key.

dependencies:

Returns the dependencies of the current RDD (the RDDs it depends on).

final def dependencies: Seq[Dependency[_]]

val b = sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b: org.apache.spark.rdd.RDD[Int] =ParallelCollectionRDD[32] at parallelize at <console>:12

b.dependencies.length

Int = 0

b.map(a => a).dependencies.length

res40: Int = 1

b.cartesian(a).dependencies.length

res41: Int = 2

b.cartesian(a).dependencies

res42: Seq[org.apache.spark.Dependency[_]] =List(org.apache.spark.rdd.CartesianRDD$$anon$1@576ddaaa,org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)

distinct:

Returns a new RDD that contains each unique value only once.

def distinct(): RDD[T]

def distinct(numPartitions: Int): RDD[T]

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.distinct.collect

res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))

a.distinct(2).partitions.length

res16: Int = 2

a.distinct(3).partitions.length

res17: Int = 3

The number passed to distinct is the number of partitions to use.

first:

Returns the first data item of the RDD.

def first(): T

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.first

res1: String = Gnu

filter:

A very commonly used operation. It applies a Boolean-returning function to every data item of the RDD and returns the items for which it is true as the result RDD.

def filter(f: T => Boolean): RDD[T]

val a = sc.parallelize(1 to 10, 3)

val b = a.filter(_ % 2 == 0)

b.collect

res3: Array[Int] = Array(2, 4, 6, 8, 10)

Note: the function must be able to handle every data item in the RDD. Scala provides partial functions to deal with mixed data types; they help when some data is bad and you do not want to process it, but you still want to apply a map()-like function to the good data.

Examples for mixed data without partial functions:

val b = sc.parallelize(1 to 8)

b.filter(_ < 4).collect

res15: Array[Int] = Array(1, 2, 3)

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.filter(_ < 4).collect

<console>:15: error: value < is not a memberof Any

Reason for the failure:

The < operator is not defined on Any; because the list mixes strings and numbers, its elements are typed as Any.

Handling mixed types with a partial function:

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.collect({ case a: Int    => "is integer"

            case b: String => "is string" }).collect

res17: Array[String] = Array(is string, is string, isinteger, is string)

val myfunc: PartialFunction[Any, Any] = {

  case a: Int    => "is integer"

  case b: String => "is string" }

myfunc.isDefinedAt("")    // checks whether myfunc is defined for this value

res21: Boolean = true

myfunc.isDefinedAt(1)

res22: Boolean = true

myfunc.isDefinedAt(1.5)     // not supported: a Double matches neither case

res23: Boolean = false

   

   

   

Our research group has a very strong focus on using and improving Apache Spark to solve real world problems. In order to do this we need to have a very solid understanding of the capabilities of Spark. So one of the first things we have done is to go through the entire Spark RDD API and write examples to test their functionality. This has been a very useful exercise and we would like to share the examples with everyone.

Authors of examples: Matthias Langer and Zhen He

Email addresses: m.langer@latrobe.edu.au, z.he@latrobe.edu.au

These examples have only been tested for Spark version 0.9. We assume the functionality of Spark is stable and therefore the examples should be valid for later releases.

Here is a PDF of all the examples: Spark Examples

The RDD API By Example

RDD is short for Resilient Distributed Dataset. RDDs are the workhorse of the Spark system. As a user, one can consider a RDD as a handle for a collection of individual data partitions, which are the result of some computation.

However, an RDD is actually more than that. On cluster installations, separate data partitions can be on separate nodes. Using the RDD as a handle one can access all partitions and perform computations and transformations using the contained data. Whenever a part of a RDD or an entire RDD is lost, the system is able to reconstruct the data of lost partitions by using lineage information. Lineage refers to the sequence of transformations used to produce the current RDD. As a result, Spark is able to recover automatically from most failures.

All RDDs available in Spark derive either directly or indirectly from the class RDD. This class comes with a large set of methods that perform operations on the data within the associated partitions. The class RDD is abstract. Whenever one uses a RDD, one is actually using a concrete implementation of RDD. These implementations have to overwrite some core functions to make the RDD behave as expected.

One reason why Spark has lately become a very popular system for processing big data is that it does not impose restrictions regarding what data can be stored within RDD partitions. The RDD API already contains many useful operations. But, because the creators of Spark had to keep the core API of RDDs common enough to handle arbitrary data-types, many convenience functions are missing.

The basic RDD API considers each data item as a single value. However, users often want to work with key-value pairs. Therefore Spark extended the interface of RDD to provide additional functions (PairRDDFunctions), which explicitly work on key-value pairs. Currently, there are four extensions to the RDD API available in Spark. They are as follows:

DoubleRDDFunctions

This extension contains many useful methods for aggregating numeric values. They become available if the data items of an RDD are implicitly convertible to the Scala data-type double.

PairRDDFunctions

Methods defined in this interface extension become available when the data items have a two-component tuple structure. Spark will interpret the first tuple item (i.e. tuplename._1) as the key and the second item (i.e. tuplename._2) as the associated value.

OrderedRDDFunctions

Methods defined in this interface extension become available if the data items are two-component tuples where the key is implicitly sortable.

SequenceFileRDDFunctions

This extension contains several methods that allow users to create Hadoop sequence files from RDDs. The data items must be two-component key-value tuples as required by the PairRDDFunctions. However, there are additional requirements considering the convertibility of the tuple components to Writable types.

Since Spark will make methods with extended functionality automatically available to users when the data items fulfill the above described requirements, we decided to list all possible available functions in strictly alphabetical order. We will append either of the following to the function-name to indicate it belongs to an extension that requires the data items to conform to a certain format or type (a short shell illustration follows the list below).

[Double] - Double RDD Functions

[Ordered] - OrderedRDDFunctions

[Pair] - PairRDDFunctions

[SeqFile] - SequenceFileRDDFunctions
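A small illustration of these implicit extensions in the spark-shell (where the SparkContext._ implicits are already in scope; the sample values are arbitrary):

val nums = sc.parallelize(List(1.0, 2.0, 3.0))
nums.mean()                                    // DoubleRDDFunctions
val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
pairs.reduceByKey(_ + _).collect               // PairRDDFunctions
pairs.sortByKey().collect                      // OrderedRDDFunctions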

aggregate

The aggregate-method provides an interface for performing highly customized reductions and aggregations with a RDD. However, due to the way Scala and Spark execute and process data, care must be taken to achieve deterministic behavior. The following list contains a few observations we made while experimenting with aggregate:

    The reduce and combine functions have to be commutative and associative.

    As can be seen from the function definition below, the output of the combiner must be equal to its input. This is necessary because Spark will chain-execute it.

    The zero value is the initial value of the U component when either seqOp or combOp are executed for the first element of their domain of influence. Depending on what you want to achieve, you may have to change it. However, to make your code deterministic, make sure that your code will yield the same result regardless of the number or size of partitions.

    Do not assume any execution order for either partition computations or combining partitions.

    The neutral zeroValue is applied at the beginning of each sequence of reduces within the individual partitions and again when the output of separate partitions is combined.

    Why have two separate combine functions? The first function maps the input values into the result space. Note that the aggregation data type (1st input and output) can be different (U != T). The second function reduces these mapped values in the result space.

    Why would one want to use two input data types? Let us assume we do an archaeological site survey using a metal detector. While walking through the site we take GPS coordinates of important findings based on the output of the metal detector. Later, we intend to draw an image of a map that highlights these locations using the aggregate function. In this case the zeroValue could be an area map with no highlights. The possibly huge set of input data is stored as GPS coordinates across many partitions. seqOp could convert the GPS coordinates to map coordinates and put a marker on the map at the respective position. combOp will receive these highlights as partial maps and combine them into a single final output map. (A small code sketch of this two-type pattern follows this list.)
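A concrete, minimal instance of the U != T idea (my own sketch, not from the original document): computing the average of a collection of Ints, where U is a (sum, count) pair:

val nums = sc.parallelize(1 to 100, 4)
val (sum, count) = nums.aggregate((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),       // seqOp: fold one value into (sum, count)
  (a, b)   => (a._1 + b._1, a._2 + b._2))     // combOp: merge per-partition (sum, count) pairs
sum.toDouble / count                          // 50.5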

Listing Variants

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T)=> U, combOp: (U, U) => U): U

Examples 1

val z = sc.parallelize(List(1,2,3,4,5,6), 2)

z.aggregate(0)(math.max(_, _), _ + _)

res40: Int = 9

val z =sc.parallelize(List("a","b","c","d","e","f"),2)

z.aggregate("")(_ + _, _+_)

res115: String = abcdef

z.aggregate("x")(_ + _, _+_)

res116: String = xxdefxabc

val z =sc.parallelize(List("12","23","345","4567"),2)

z.aggregate("")((x,y) =>math.max(x.length, y.length).toString, (x,y) => x + y)

res141: String = 42

z.aggregate("")((x,y) =>math.min(x.length, y.length).toString, (x,y) => x + y)

res142: String = 11

val z =sc.parallelize(List("12","23","345",""),2)

z.aggregate("")((x,y) =>math.min(x.length, y.length).toString, (x,y) => x + y)

res143: String = 10

The main issue with the code above is that the result of the inner min is a string of length 1.

The zero in the output is due to the empty string being the last string in the list. We see this result because we are not recursively reducing any further within the partition for the final string.
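Spelled out, assuming the four strings are split evenly into ("12","23") and ("345",""):

// partition 0: min("".length, "12".length)  = 0 -> "0";  min("0".length, "23".length) = 1 -> "1"
// partition 1: min("".length, "345".length) = 0 -> "0";  min("0".length, "".length)   = 0 -> "0"
// combOp:      "" + "1" + "0" = "10"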

Examples 2

val z =sc.parallelize(List("12","23","","345"),2)

z.aggregate("")((x,y) =>math.min(x.length, y.length).toString, (x,y) => x + y)

res144: String = 11

In contrast to the previous example, this example has the empty string at the beginning of the second partition. This results in a length of zero being input to the second reduce, which then upgrades it to a length of 1. (Warning: The above example shows bad design since the output is dependent on the order of the data inside the partitions.)

cartesian

Computes the Cartesian product between two RDDs (i.e. each item of the first RDD is joined with each item of the second RDD) and returns them as a new RDD. (Warning: Be careful when using this function! Memory consumption can quickly become an issue!)

Listing Variants

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

Example

val x = sc.parallelize(List(1,2,3,4,5))

val y = sc.parallelize(List(6,7,8,9,10))

x.cartesian(y).collect

res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8),(1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9),(3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))

checkpoint

Will create a checkpoint when the RDD is computed next. Checkpointed RDDs are stored as a binary file within the checkpoint directory which can be specified using the Spark context. (Warning: Spark applies lazy evaluation. Checkpointing will not occur until an action is invoked.)

Important note: the directory "my_directory_name" should exist in all slaves. As an alternative you could use an HDFS directory URL as well.

Listing Variants

def checkpoint()

Example

sc.setCheckpointDir("my_directory_name")

val a = sc.parallelize(1 to 4)

a.checkpoint

a.count

14/02/25 18:13:53 INFO SparkContext: Starting job: count at <console>:15

...

14/02/25 18:13:53 INFO MemoryStore: Block broadcast_5 stored as values to memory (estimated size 115.7 KB, free 296.3 MB)

14/02/25 18:13:53 INFO RDDCheckpointData: Done checkpointing RDD 11 to file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11, new parent is RDD 12

res23: Long = 4

coalesce, repartition

Coalesces the associated data into a given number of partitions. repartition(numPartitions) is simply an abbreviation for coalesce(numPartitions, shuffle = true).

Listing Variants

def coalesce ( numPartitions : Int , shuffle : Boolean= false ): RDD [T]

def repartition ( numPartitions : Int ): RDD [T]

Example

val y = sc.parallelize(1 to 10, 10)

val z = y.coalesce(2, false)

z.partitions.length

res9: Int = 2

cogroup [Pair], groupWith [Pair]

A very powerful set of functions that allow grouping up to 3 key-value RDDs together using their keys.

Listing Variants

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def cogroup[W](other: RDD[(K, W)], numPartitions:Int): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W](other: RDD[(K, W)], partitioner:Partitioner): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], numPartitions: Int): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def groupWith[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

Examples

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.map((_, "b"))

val c = a.map((_, "c"))

b.cogroup(c).collect

res7: Array[(Int, (Seq[String], Seq[String]))] =Array(

(2,(ArrayBuffer(b),ArrayBuffer(c))),

(3,(ArrayBuffer(b),ArrayBuffer(c))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))

)

val d = a.map((_, "d"))

b.cogroup(c, d).collect

res9: Array[(Int, (Seq[String], Seq[String],Seq[String]))] = Array(

(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d,d)))

)

val x = sc.parallelize(List((1, "apple"),(2, "banana"), (3, "orange"), (4, "kiwi")), 2)

val y = sc.parallelize(List((5, "computer"),(1, "laptop"), (1, "desktop"), (4, "iPad")), 2)

x.cogroup(y).collect

res23: Array[(Int, (Seq[String], Seq[String]))] =Array(

(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))),

(2,(ArrayBuffer(banana),ArrayBuffer())),

(3,(ArrayBuffer(orange),ArrayBuffer())),

(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),

(5,(ArrayBuffer(),ArrayBuffer(computer))))

collect, toArray

Converts the RDD into a Scala array and returns it. If you provide a standard map-function (i.e. f = T -> U) it will be applied before inserting the values into the result array.

Listing Variants

def collect(): Array[T]

def collect[U: ClassTag](f: PartialFunction[T, U]):RDD[U]

def toArray(): Array[T]

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.collect

res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu,Rat)

collectAsMap [Pair]

Similar to collect, but works on key-value RDDs and converts them into Scala maps to preserve their key-value structure.

Listing Variants

def collectAsMap(): Map[K, V]

Example

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.zip(a)

b.collectAsMap

res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1-> 1, 3 -> 3)

combineByKey[Pair]

Very efficient implementation that combines the values of a RDD consisting of two-component tuples by applying multiple aggregators one after another.

Listing Variants

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions:Int): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner:Partitioner, mapSideCombine: Boolean = true, serializerClass: String = null):RDD[(K, C)]

Example

val a =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)

val c = b.zip(a)

val d = c.combineByKey(List(_), (x:List[String],y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)

d.collect

res16: Array[(Int, List[String])] = Array((1,List(cat,dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))

compute

Executes dependencies and computes the actual representation of the RDD. This function should not be called directly by users.

Listing Variants

def compute(split: Partition, context: TaskContext):Iterator[T]

context, sparkContext

Returns the SparkContext that was used to create the RDD.

Listing Variants

def context: SparkContext

def sparkContext: SparkContext

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.context

res8: org.apache.spark.SparkContext =org.apache.spark.SparkContext@58c1c2f1

count

Returns the number of items stored within a RDD.

Listing Variants

def count(): Long

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.count

res2: Long = 4

countApprox

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]

countByKey [Pair]

Very similar to count, but counts the values of a RDD consisting of two-component tuples for each distinct key separately.

Listing Variants

def countByKey(): Map[K, Long]

Example

val c = sc.parallelize(List((3, "Gnu"), (3,"Yak"), (5, "Mouse"), (3, "Dog")), 2)

c.countByKey

res3: scala.collection.Map[Int,Long] = Map(3 -> 3,5 -> 1)

countByKeyApprox [Pair]

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countByKeyApprox(timeout: Long, confidence: Double= 0.95): PartialResult[Map[K, BoundedDouble]]

countByValue

Returns a map that contains all unique values of the RDD and their respective occurrence counts. (Warning: This operation will finally aggregate the information in a single reducer.)

Listing Variants

def countByValue(): Map[T, Long]

Example

val b =sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b.countByValue

res27: scala.collection.Map[Int,Long] = Map(5 -> 1,8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)

countByValueApprox

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countByValueApprox(timeout: Long, confidence:Double = 0.95): PartialResult[Map[T, BoundedDouble]]

countApproxDistinct

Computes the approximate number of distinct values. For large RDDs which are spread across many nodes, this function may execute faster than other counting methods. The parameter relativeSD controls the accuracy of the computation.

Listing Variants

def countApproxDistinct(relativeSD: Double = 0.05):Long

Example

val a = sc.parallelize(1 to 10000, 20)

val b = a++a++a++a++a

b.countApproxDistinct(0.1)

res14: Long = 10784

b.countApproxDistinct(0.05)

res15: Long = 11055

b.countApproxDistinct(0.01)

res16: Long = 10040

b.countApproxDistinct(0.001)

res0: Long = 10001

countApproxDistinctByKey [Pair]

 

Similar to countApproxDistinct, but computes the approximate number of distinct values for each distinct key. Hence, the RDD must consist of two-component tuples. For large RDDs which are spread across many nodes, this function may execute faster than other counting methods. The parameter relativeSD controls the accuracy of the computation.

Listing Variants

def countApproxDistinctByKey(relativeSD: Double =0.05): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,numPartitions: Int): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,partitioner: Partitioner): RDD[(K, Long)]

Example

val a = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

val b = sc.parallelize(a.takeSample(true, 10000, 0),20)

val c = sc.parallelize(1 to b.count().toInt, 20)

val d = b.zip(c)

d.countApproxDistinctByKey(0.1).collect

res15: Array[(String, Long)] = Array((Rat,2567),(Cat,3357), (Dog,2414), (Gnu,2494))

d.countApproxDistinctByKey(0.01).collect

res16: Array[(String, Long)] = Array((Rat,2555),(Cat,2455), (Dog,2425), (Gnu,2513))

d.countApproxDistinctByKey(0.001).collect

res0: Array[(String, Long)] = Array((Rat,2562),(Cat,2464), (Dog,2451), (Gnu,2521))

dependencies

 

Returns the RDD on which this RDD depends.

Listing Variants

final def dependencies: Seq[Dependency[_]]

Example

val b =sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b: org.apache.spark.rdd.RDD[Int] =ParallelCollectionRDD[32] at parallelize at <console>:12

b.dependencies.length

Int = 0

b.map(a => a).dependencies.length

res40: Int = 1

b.cartesian(a).dependencies.length

res41: Int = 2

b.cartesian(a).dependencies

res42: Seq[org.apache.spark.Dependency[_]] =List(org.apache.spark.rdd.CartesianRDD$$anon$1@576ddaaa, org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)

distinct

 

Returns a new RDD that contains each unique value only once.

Listing Variants

def distinct(): RDD[T]

def distinct(numPartitions: Int): RDD[T]

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.distinct.collect

res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))

a.distinct(2).partitions.length

res16: Int = 2

a.distinct(3).partitions.length

res17: Int = 3

first

 

Looks for the very first data item of the RDD and returns it.

Listing Variants

def first(): T

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.first

res1: String = Gnu

filter

 

Evaluates a boolean function for each data item of the RDD and puts the items for which the function returned true into the resulting RDD.

Listing Variants

def filter(f: T => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 10, 3)

val b = a.filter(_ % 2 == 0)

b.collect

res3: Array[Int] = Array(2, 4, 6, 8, 10)

When you provide a filter function, it must be able to handle all data items contained in the RDD. Scala provides so-called partial functions to deal with mixed data-types. (Tip: Partial functions are very useful if you have some data which may be bad and you do not want to handle but for the good data (matching data) you want to apply some kind of map function. The following article is good. It teaches you about partial functions in a very nice way and explains why case has to be used for partial functions: article)

Examples for mixed data without partial functions

val b = sc.parallelize(1 to 8)

b.filter(_ < 4).collect

res15: Array[Int] = Array(1, 2, 3)

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.filter(_ < 4).collect

<console>:15: error: value < is not a memberof Any

This fails because some components of a are not implicitly comparable against integers. Collect uses the isDefinedAt property of a function-object to determine whether the test-function is compatible with each data item. Only data items that pass this test (=filter) are then mapped using the function-object.

Examples for mixed data with partial functions

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.collect({ case a: Int    => "is integer"

            case b: String => "is string" }).collect

res17: Array[String] = Array(is string, is string, isinteger, is string)

val myfunc: PartialFunction[Any, Any] = {

  case a: Int    => "is integer"

  case b: String => "is string" }

myfunc.isDefinedAt("")

res21: Boolean = true

myfunc.isDefinedAt(1)

res22: Boolean = true

myfunc.isDefinedAt(1.5)

res23: Boolean = false

Be careful! The above code works because it only checks the type itself! If you use operations on this type, you have to explicitly declare what type you want instead of Any. Otherwise the compiler does (apparently) not know what bytecode it should produce (a typed-guard workaround follows the snippet below):

val myfunc2: PartialFunction[Any, Any] = {case x if (x< 4) => "x"}

<console>:10: error: value < is not a memberof Any

val myfunc2: PartialFunction[Int, Any] = {case x if (x< 4) => "x"}

myfunc2: PartialFunction[Int,Any] = <function1>
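One way around this while keeping the Any domain is to match on the concrete type first, so the guard can use < safely (myfunc3 is just an illustrative name, not from the source):

val myfunc3: PartialFunction[Any, Any] = { case x: Int if x < 4 => "x" }
myfunc3.isDefinedAt(2)      // true
myfunc3.isDefinedAt(10)     // false: an Int, but not < 4
myfunc3.isDefinedAt("cat")  // false: not an Int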

filterWith:

An extended version of filter. The first argument has the form Int => A, where Int is the partition index and A is the type it is transformed into; the second argument has the form (T, A) => Boolean, where A is the transformed partition index and T is a data item from the RDD.

def filterWith[A: ClassTag](constructA: Int =>A)(p: (T, A) => Boolean): RDD[T]

val a = sc.parallelize(1 to 9, 3)

val b = a.filterWith(i => i)((x,i) => x % 2 == 0|| i % 2 == 0)

b.collect

res37: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)

a.filterWith(x=> x)((a, b) =>  b == 0).collect

res30: Array[Int] = Array(1, 2)

a.filterWith(x=> x)((a, b) =>  a % (b+1) == 0).collect

res33: Array[Int] = Array(1, 2, 4, 6, 8, 10)

a.filterWith(x=> x.toString)((a, b) =>  b == "2").collect

res34: Array[Int] = Array(5, 6)

Note: the first argument transforms the partition index; the second filters each data item together with its transformed partition index.
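For comparison only (my own addition, not part of the original notes): the first filterWith example above can also be written with mapPartitionsWithIndex, which hands the partition index straight to the function:

val a = sc.parallelize(1 to 9, 3)
val b = a.mapPartitionsWithIndex((i, it) => it.filter(x => x % 2 == 0 || i % 2 == 0))
b.collect   // Array(1, 2, 3, 4, 6, 7, 8, 9), the same result as the filterWith version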

   

   

   


filterWith

 

This is an extended version of filter. It takes two function arguments. The first argument must conform to Int -> T and is executed once per partition. It will transform the partition index to type T. The second function looks like (U, T) -> Boolean. T is the transformed partition index and U are the data items from the RDD. Finally the function has to return either true or false (i.e. apply the filter).

Listing Variants

def filterWith[A: ClassTag](constructA: Int =>A)(p: (T, A) => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 9, 3)

val b = a.filterWith(i => i)((x,i) => x % 2 == 0|| i % 2 == 0)

b.collect

res37: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)

a.filterWith(x=> x)((a, b) =>  b == 0).collect

res30: Array[Int] = Array(1, 2)

a.filterWith(x=> x)((a, b) =>  a % (b+1) == 0).collect

res33: Array[Int] = Array(1, 2, 4, 6, 8, 10)

a.filterWith(x=> x.toString)((a, b) =>  b == "2").collect

res34: Array[Int] = Array(5, 6)

flatMap:

Similar to map, but flatMap flattens all of the resulting items into one collection.

def flatMap[U: ClassTag](f: T => TraversableOnce[U]):RDD[U]

val a = sc.parallelize(1 to 10, 5)

a.flatMap(1 to _).collect

res47: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3,4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7,8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

sc.parallelize(List(1, 2, 3), 2).flatMap(x =>List(x, x, x)).collect

res85: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)

// The program below generates a random number of copies (up to 10) of the items in the list.

val x  =sc.parallelize(1 to 10, 3)

x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect

res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4,4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9,9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)

flatMapValues:

Similar to mapValues, but flattens the results; the function is applied only to the value part of each key-value pair, and the key is kept.

def flatMapValues[U](f: V => TraversableOnce[U]):RDD[(K, U)]

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.flatMapValues("x" + _ +"x").collect

res6: Array[(Int, Char)] = Array((3,x), (3,d), (3,o),(3,g), (3,x), (5,x), (5,t), (5,i), (5,g), (5,e), (5,r), (5,x), (4,x), (4,l),(4,i), (4,o), (4,n), (4,x), (3,x), (3,c), (3,a), (3,t), (3,x), (7,x), (7,p),(7,a), (7,n), (7,t), (7,h), (7,e), (7,r), (7,x), (5,x), (5,e), (5,a), (5,g),(5,l), (5,e), (5,x))

   

Our research group has a very strong focus on usingand improving Apache Spark to solve real world programs. In order to do this weneed to have a very solid understanding of the capabilities of Spark. So one ofthe first things we have done is to go through the entire Spark RDD API andwrite examples to test their functionality. This has been a very usefulexercise and we would like to share the examples with everyone.

Authors of examples: Matthias Langer and Zhen He

Emails addresses: m.langer@latrobe.edu.au,z.he@latrobe.edu.au

These examples have only been tested for Spark version0.9. We assume the functionality of Spark is stable and therefore the examplesshould be valid for later releases.

Here is a pdf of the all the examples: SparkExamples

The RDD API By Example

RDD is short for Resilient Distributed Dataset. RDDsare the workhorse of the Spark system. As a user, one can consider a RDD as ahandle for a collection of individual data partitions, which are the result ofsome computation.

However, an RDD is actually more than that. On clusterinstallations, separate data partitions can be on separate nodes. Using the RDDas a handle one can access all partitions and perform computations andtransformations using the contained data. Whenever a part of a RDD or an entireRDD is lost, the system is able to reconstruct the data of lost partitions byusing lineage information. Lineage refers to the sequence of transformationsused to produce the current RDD. As a result, Spark is able to recoverautomatically from most failures.

All RDDs available in Spark derive either directly or indirectly from the class RDD. This class comes with a large set of methods that perform operations on the data within the associated partitions. The class RDD is abstract. Whenever one uses a RDD, one is actually using a concretized implementation of RDD. These implementations have to overwrite some core functions to make the RDD behave as expected.

One reason why Spark has lately become a very popular system for processing big data is that it does not impose restrictions regarding what data can be stored within RDD partitions. The RDD API already contains many useful operations. But, because the creators of Spark had to keep the core API of RDDs common enough to handle arbitrary data-types, many convenience functions are missing.

The basic RDD API considers each data item as a single value. However, users often want to work with key-value pairs. Therefore Spark extended the interface of RDD to provide additional functions (PairRDDFunctions), which explicitly work on key-value pairs. Currently, there are four extensions to the RDD API available in Spark. They are as follows:

DoubleRDDFunctions

This extension contains many useful methods for aggregating numeric values. They become available if the data items of an RDD are implicitly convertible to the Scala data-type double.
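
A minimal sketch of this implicit extension (the result comments show the expected values, not captured shell output):

val d = sc.parallelize(List(1.0, 2.0, 3.0, 4.0))
// sum and mean come from DoubleRDDFunctions via an implicit conversion
d.sum    // 10.0
d.mean   // 2.5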

PairRDDFunctions

Methods defined in this interface extension become available when the data items have a two component tuple structure. Spark will interpret the first tuple item (i.e. tuplename._1) as the key and the second item (i.e. tuplename._2) as the associated value.
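
For example, as soon as the items are two-component tuples, key-based methods such as reduceByKey become available (a sketch):

val pairs = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
// reduceByKey is defined in PairRDDFunctions, not in RDD itself
pairs.reduceByKey(_ + _).collect   // Array((a,4), (b,2))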

OrderedRDDFunctions

Methods defined in this interface extension become available if the data items are two-component tuples where the key is implicitly sortable.
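
A small sketch: with an Int key, which has an implicit ordering, sortByKey becomes available:

val byKey = sc.parallelize(List((3, "c"), (1, "a"), (2, "b")))
// sortByKey comes from OrderedRDDFunctions
byKey.sortByKey().collect   // Array((1,a), (2,b), (3,c))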

SequenceFileRDDFunctions

This extension contains several methods that allow users to create Hadoop sequence files from RDDs. The data items must be two-component key-value tuples as required by the PairRDDFunctions. However, there are additional requirements considering the convertibility of the tuple components to Writable types.

Since Spark will make methods with extended functionality automatically available to users when the data items fulfill the above described requirements, we decided to list all possible available functions in strictly alphabetical order. We will append one of the following tags to the function-name to indicate it belongs to an extension that requires the data items to conform to a certain format or type.

[Double] - Double RDD Functions

[Ordered] - OrderedRDDFunctions

[Pair] - PairRDDFunctions

[SeqFile] - SequenceFileRDDFunctions

aggregate

The aggregate-method provides an interface for performing highly customized reductions and aggregations with a RDD. However, due to the way Scala and Spark execute and process data, care must be taken to achieve deterministic behavior. The following list contains a few observations we made while experimenting with aggregate:

    The reduce and combine functions have to be commutative and associative.

    As can be seen from the function definition below, the output of the combiner must be equal to its input. This is necessary because Spark will chain-execute it.

    The zero value is the initial value of the U component when either seqOp or combOp are executed for the first element of their domain of influence. Depending on what you want to achieve, you may have to change it. However, to make your code deterministic, make sure that your code will yield the same result regardless of the number or size of partitions.

    Do not assume any execution order for either partition computations or combining partitions.

    The neutral zeroValue is applied at the beginning of each sequence of reduces within the individual partitions and again when the output of separate partitions is combined.

    Why have two separate combine functions? The first function maps the input values into the result space. Note that the aggregation data type (1st input and output) can be different (U != T). The second function reduces these mapped values in the result space.

    Why would one want to use two input data types? Let us assume we do an archaeological site survey using a metal detector. While walking through the site we take GPS coordinates of important findings based on the output of the metal detector. Later, we intend to draw an image of a map that highlights these locations using the aggregate function. In this case the zeroValue could be an area map with no highlights. The possibly huge set of input data is stored as GPS coordinates across many partitions. seqOp could convert the GPS coordinates to map coordinates and put a marker on the map at the respective position. combOp will receive these highlights as partial maps and combine them into a single final output map. (A numeric sketch of using two different data types follows right after this list.)
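
As announced in the list above, here is a small numeric sketch of U != T: the inputs are plain Ints, while the aggregation type is a (sum, count) pair used to compute an average.

val nums = sc.parallelize(1 to 6, 2)
// seqOp folds an Int into the (sum, count) pair; combOp merges two pairs
val (sum, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),
  (p1, p2) => (p1._1 + p2._1, p1._2 + p2._2))
val avg = sum.toDouble / count   // 3.5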

Listing Variants

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T)=> U, combOp: (U, U) => U): U

Examples 1

val z = sc.parallelize(List(1,2,3,4,5,6), 2)

z.aggregate(0)(math.max(_, _), _ + _)

res40: Int = 9

val z =sc.parallelize(List("a","b","c","d","e","f"),2)

z.aggregate("")(_ + _, _+_)

res115: String = abcdef

z.aggregate("x")(_ + _, _+_)

res116: String = xxdefxabc

val z = sc.parallelize(List("12","23","345","4567"),2)

z.aggregate("")((x,y) =>math.max(x.length, y.length).toString, (x,y) => x + y)

res141: String = 42

z.aggregate("")((x,y) =>math.min(x.length, y.length).toString, (x,y) => x + y)

res142: String = 11

val z = sc.parallelize(List("12","23","345",""),2)

z.aggregate("")((x,y) =>math.min(x.length, y.length).toString, (x,y) => x + y)

res143: String = 10

The main issue with the code above is that the result of the inner min is a string of length 1.

The zero in the output is due to the empty string being the last string in the list. We see this result because we are not recursively reducing any further within the partition for the final string.

Examples 2

val z =sc.parallelize(List("12","23","","345"),2)

z.aggregate("")((x,y) =>math.min(x.length, y.length).toString, (x,y) => x + y)

res144: String = 11

In contrast to the previous example, this example has the empty string at the beginning of the second partition. This results in a length of zero being input to the second reduce, which then upgrades it to a length of 1. (Warning: The above example shows bad design since the output is dependent on the order of the data inside the partitions.)
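
One way to understand such partition-dependent results is to look at the partition contents directly, for example with glom (a sketch; the exact split depends on how parallelize slices the list):

val z = sc.parallelize(List("12", "23", "", "345"), 2)
// one array per partition, e.g. Array(Array(12, 23), Array("", 345)),
// showing which strings each per-partition seqOp run actually sees
z.glom.collect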

cartesian

Computes the cartesian product between two RDDs (i.e. each item of the first RDD is joined with each item of the second RDD) and returns them as a new RDD. (Warning: Be careful when using this function! Memory consumption can quickly become an issue!)

Listing Variants

def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)]

Example

val x = sc.parallelize(List(1,2,3,4,5))

val y = sc.parallelize(List(6,7,8,9,10))

x.cartesian(y).collect

res0: Array[(Int, Int)] = Array((1,6), (1,7), (1,8),(1,9), (1,10), (2,6), (2,7), (2,8), (2,9), (2,10), (3,6), (3,7), (3,8), (3,9),(3,10), (4,6), (5,6), (4,7), (5,7), (4,8), (5,8), (4,9), (4,10), (5,9), (5,10))

checkpoint

Will create a checkpoint when the RDD is computed next. Checkpointed RDDs are stored as a binary file within the checkpoint directory which can be specified using the Spark context. (Warning: Spark applies lazy evaluation. Checkpointing will not occur until an action is invoked.)

Important note: the directory "my_directory_name" should exist in all slaves. As an alternative you could use an HDFS directory URL as well.

Listing Variants

def checkpoint()

Example

sc.setCheckpointDir("my_directory_name")

val a = sc.parallelize(1 to 4)

a.checkpoint

a.count

14/02/25 18:13:53 INFO SparkContext: Starting job:count at <console>:15

...

14/02/25 18:13:53 INFO MemoryStore: Block broadcast_5 stored as values to memory (estimated size 115.7 KB, free 296.3 MB)

14/02/25 18:13:53 INFO RDDCheckpointData: Done checkpointing RDD 11 to file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11, new parent is RDD 12

res23: Long = 4

coalesce, repartition

Coalesces the associated data into a given number of partitions. repartition(numPartitions) is simply an abbreviation for coalesce(numPartitions, shuffle = true).

Listing Variants

def coalesce ( numPartitions : Int , shuffle : Boolean= false ): RDD [T]

def repartition ( numPartitions : Int ): RDD [T]

Example

val y = sc.parallelize(1 to 10, 10)

val z = y.coalesce(2, false)

z.partitions.length

res9: Int = 2
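
Note that without a shuffle coalesce can only reduce the number of partitions; to increase it, pass shuffle = true or use repartition (a sketch, reusing the RDD y from above):

y.coalesce(20, false).partitions.length   // still 10: cannot grow without a shuffle
y.coalesce(20, true).partitions.length    // 20
y.repartition(4).partitions.length        // 4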

cogroup [Pair], groupWith [Pair]

A very powerful set of functions that allow grouping up to 3 key-value RDDs together using their keys.

Listing Variants

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def cogroup[W](other: RDD[(K, W)], numPartitions:Int): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W](other: RDD[(K, W)], partitioner:Partitioner): RDD[(K, (Seq[V], Seq[W]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], numPartitions: Int): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def cogroup[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Seq[V],Seq[W]))]

def groupWith[W1, W2](other1: RDD[(K, W1)], other2:RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

Examples

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.map((_, "b"))

val c = a.map((_, "c"))

b.cogroup(c).collect

res7: Array[(Int, (Seq[String], Seq[String]))] =Array(

(2,(ArrayBuffer(b),ArrayBuffer(c))),

(3,(ArrayBuffer(b),ArrayBuffer(c))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c)))

)

val d = a.map((_, "d"))

b.cogroup(c, d).collect

res9: Array[(Int, (Seq[String], Seq[String],Seq[String]))] = Array(

(2,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(3,(ArrayBuffer(b),ArrayBuffer(c),ArrayBuffer(d))),

(1,(ArrayBuffer(b, b),ArrayBuffer(c, c),ArrayBuffer(d,d)))

)

val x = sc.parallelize(List((1, "apple"),(2, "banana"), (3, "orange"), (4, "kiwi")), 2)

val y = sc.parallelize(List((5, "computer"),(1, "laptop"), (1, "desktop"), (4, "iPad")), 2)

x.cogroup(y).collect

res23: Array[(Int, (Seq[String], Seq[String]))] =Array(

(4,(ArrayBuffer(kiwi),ArrayBuffer(iPad))),

(2,(ArrayBuffer(banana),ArrayBuffer())),

(3,(ArrayBuffer(orange),ArrayBuffer())),

(1,(ArrayBuffer(apple),ArrayBuffer(laptop, desktop))),

(5,(ArrayBuffer(),ArrayBuffer(computer))))

collect, toArray

Converts the RDD into a Scala array and returns it. If you provide a standard map-function (i.e. f = T -> U) it will be applied before inserting the values into the result array.

Listing Variants

def collect(): Array[T]

def collect[U: ClassTag](f: PartialFunction[T, U]):RDD[U]

def toArray(): Array[T]

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.collect

res29: Array[String] = Array(Gnu, Cat, Rat, Dog, Gnu,Rat)
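
The second listed variant first applies a partial function and only keeps the items it is defined for; a small sketch using the same RDD c:

// keep only the names starting with "G", upper-cased; collect(pf) returns an
// RDD, so a second collect fetches the result to the driver
c.collect({ case x if x.startsWith("G") => x.toUpperCase }).collect
// Array(GNU, GNU)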

collectAsMap [Pair]

Similar to collect, but works on key-value RDDs and converts them into Scala maps to preserve their key-value structure.

Listing Variants

def collectAsMap(): Map[K, V]

Example

val a = sc.parallelize(List(1, 2, 1, 3), 1)

val b = a.zip(a)

b.collectAsMap

res1: scala.collection.Map[Int,Int] = Map(2 -> 2, 1-> 1, 3 -> 3)

combineByKey[Pair]

Very efficient implementation that combines the values of a RDD consisting of two-component tuples by applying multiple aggregators one after another.

Listing Variants

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions:Int): RDD[(K, C)]

def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner:Partitioner, mapSideCombine: Boolean = true, serializerClass: String = null):RDD[(K, C)]

Example

val a =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val b = sc.parallelize(List(1,1,2,2,2,1,2,2,2), 3)

val c = b.zip(a)

val d = c.combineByKey(List(_), (x:List[String],y:String) => y :: x, (x:List[String], y:List[String]) => x ::: y)

d.collect

res16: Array[(Int, List[String])] = Array((1,List(cat,dog, turkey)), (2,List(gnu, rabbit, salmon, bee, bear, wolf)))
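
Another common use is a per-key average, where the combiner type is a (sum, count) pair (a sketch):

val scores = sc.parallelize(List(("a", 1.0), ("a", 3.0), ("b", 4.0)))
val sumCount = scores.combineByKey(
  (v: Double) => (v, 1),                                              // createCombiner
  (acc: (Double, Int), v: Double) => (acc._1 + v, acc._2 + 1),        // mergeValue
  (a: (Double, Int), b: (Double, Int)) => (a._1 + b._1, a._2 + b._2)) // mergeCombiners
sumCount.mapValues { case (sum, count) => sum / count }.collect
// Array((a,2.0), (b,4.0))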

compute

Executes dependencies and computes the actual representation of the RDD. This function should not be called directly by users.

Listing Variants

def compute(split: Partition, context: TaskContext):Iterator[T]

context, sparkContext

Returns the SparkContext that was used to create the RDD.

Listing Variants

def context: SparkContext

def sparkContext: SparkContext

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.context

res8: org.apache.spark.SparkContext =org.apache.spark.SparkContext@58c1c2f1

count

Returns the number of items stored within a RDD.

Listing Variants

def count(): Long

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.count

res2: Long = 4

countApprox

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countApprox(timeout: Long, confidence: Double = 0.95): PartialResult[BoundedDouble]

countByKey [Pair]

Very similar to count, but counts the values of a RDD consisting of two-component tuples for each distinct key separately.

Listing Variants

def countByKey(): Map[K, Long]

Example

val c = sc.parallelize(List((3, "Gnu"), (3,"Yak"), (5, "Mouse"), (3, "Dog")), 2)

c.countByKey

res3: scala.collection.Map[Int,Long] = Map(3 -> 3,5 -> 1)

countByKeyApprox [Pair]

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countByKeyApprox(timeout: Long, confidence: Double= 0.95): PartialResult[Map[K, BoundedDouble]]

countByValue

Returns a map that contains all unique values of the RDD and their respective occurrence counts. (Warning: This operation will finally aggregate the information in a single reducer.)

Listing Variants

def countByValue(): Map[T, Long]

Example

val b =sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b.countByValue

res27: scala.collection.Map[Int,Long] = Map(5 -> 1,8 -> 1, 3 -> 1, 6 -> 1, 1 -> 6, 2 -> 3, 4 -> 2, 7 -> 1)

countByValueApprox

Marked as experimental feature! Experimental features are currently not covered by this document!

Listing Variants

def countByValueApprox(timeout: Long, confidence:Double = 0.95): PartialResult[Map[T, BoundedDouble]]

countApproxDistinct

Computes the approximate number of distinct values. For large RDDs which are spread across many nodes, this function may execute faster than other counting methods. The parameter relativeSD controls the accuracy of the computation.

Listing Variants

def countApproxDistinct(relativeSD: Double = 0.05):Long

Example

val a = sc.parallelize(1 to 10000, 20)

val b = a++a++a++a++a

b.countApproxDistinct(0.1)

res14: Long = 10784

b.countApproxDistinct(0.05)

res15: Long = 11055

b.countApproxDistinct(0.01)

res16: Long = 10040

b.countApproxDistinct(0.001)

res0: Long = 10001

countApproxDistinctByKey [Pair]

 

Similar to countApproxDistinct, but computes the approximate number of distinct values for each distinct key. Hence, the RDD must consist of two-component tuples. For large RDDs which are spread across many nodes, this function may execute faster than other counting methods. The parameter relativeSD controls the accuracy of the computation.

Listing Variants

def countApproxDistinctByKey(relativeSD: Double =0.05): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,numPartitions: Int): RDD[(K, Long)]

def countApproxDistinctByKey(relativeSD: Double,partitioner: Partitioner): RDD[(K, Long)]

Example

val a = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

val b = sc.parallelize(a.takeSample(true, 10000, 0),20)

val c = sc.parallelize(1 to b.count().toInt, 20)

val d = b.zip(c)

d.countApproxDistinctByKey(0.1).collect

res15: Array[(String, Long)] = Array((Rat,2567),(Cat,3357), (Dog,2414), (Gnu,2494))

d.countApproxDistinctByKey(0.01).collect

res16: Array[(String, Long)] = Array((Rat,2555),(Cat,2455), (Dog,2425), (Gnu,2513))

d.countApproxDistinctByKey(0.001).collect

res0: Array[(String, Long)] = Array((Rat,2562),(Cat,2464), (Dog,2451), (Gnu,2521))

dependencies

 

Returns the dependencies of this RDD, i.e. the parent RDDs from which it was derived.

Listing Variants

final def dependencies: Seq[Dependency[_]]

Example

val b =sc.parallelize(List(1,2,3,4,5,6,7,8,2,4,2,1,1,1,1,1))

b: org.apache.spark.rdd.RDD[Int] =ParallelCollectionRDD[32] at parallelize at <console>:12

b.dependencies.length

Int = 0

b.map(a => a).dependencies.length

res40: Int = 1

b.cartesian(a).dependencies.length

res41: Int = 2

b.cartesian(a).dependencies

res42: Seq[org.apache.spark.Dependency[_]] =List(org.apache.spark.rdd.CartesianRDD$$anon$1@576ddaaa,org.apache.spark.rdd.CartesianRDD$$anon$2@6d2efbbd)

distinct

 

Returns a new RDD that contains each unique value only once.

Listing Variants

def distinct(): RDD[T]

def distinct(numPartitions: Int): RDD[T]

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.distinct.collect

res6: Array[String] = Array(Dog, Gnu, Cat, Rat)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10))

a.distinct(2).partitions.length

res16: Int = 2

a.distinct(3).partitions.length

res17: Int = 3

first

 

Looks for the very first data item of the RDD and returns it.

Listing Variants

def first(): T

Example

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog"), 2)

c.first

res1: String = Gnu

filter

 

Evaluates a boolean function for each data item of the RDD and puts the items for which the function returned true into the resulting RDD.

Listing Variants

def filter(f: T => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 10, 3)

val b = a.filter(_ % 2 == 0)

b.collect

res3: Array[Int] = Array(2, 4, 6, 8, 10)

When you provide a filter function, it must be able to handle all data items contained in the RDD. Scala provides so-called partial functions to deal with mixed data-types. (Tip: Partial functions are very useful if you have some data which may be bad and you do not want to handle, but for the good data (matching data) you want to apply some kind of map function. The following article is good. It teaches you about partial functions in a very nice way and explains why case has to be used for partial functions: article)

Examples for mixed data without partial functions

val b = sc.parallelize(1 to 8)

b.filter(_ < 4).collect

res15: Array[Int] = Array(1, 2, 3)

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.filter(_ < 4).collect

<console>:15: error: value < is not a memberof Any

This fails because some components of a are not implicitly comparable against integers. Collect uses the isDefinedAt property of a function-object to determine whether the test-function is compatible with each data item. Only data items that pass this test (=filter) are then mapped using the function-object.

Examples for mixed data with partial functions

val a = sc.parallelize(List("cat","horse", 4.0, 3.5, 2, "dog"))

a.collect({case a: Int    => "is integer"

           case b: String => "is string" }).collect

res17: Array[String] = Array(is string, is string, isinteger, is string)

val myfunc: PartialFunction[Any, Any] = {

  case a: Int    => "is integer"

  case b: String => "is string" }

myfunc.isDefinedAt("")

res21: Boolean = true

myfunc.isDefinedAt(1)

res22: Boolean = true

myfunc.isDefinedAt(1.5)

res23: Boolean = false

Be careful! The above code works because it only checks the type itself! If you use operations on this type, you have to explicitly declare what type you want instead of Any. Otherwise the compiler does (apparently) not know what bytecode it should produce:

val myfunc2: PartialFunction[Any, Any] = {case x if (x < 4) => "x"}

<console>:10: error: value < is not a member of Any

val myfunc2: PartialFunction[Int, Any] = {case x if (x < 4) => "x"}

myfunc2: PartialFunction[Int,Any] = <function1>

filterWith

 

This is an extended version of filter. It takes two function arguments. The first argument must conform to Int -> T and is executed once per partition. It will transform the partition index to type T. The second function looks like (U, T) -> Boolean. T is the transformed partition index and U are the data items from the RDD. Finally the function has to return either true or false (i.e. apply the filter).

Listing Variants

def filterWith[A: ClassTag](constructA: Int =>A)(p: (T, A) => Boolean): RDD[T]

Example

val a = sc.parallelize(1 to 9, 3)

val b = a.filterWith(i => i)((x,i) => x % 2 == 0|| i % 2 == 0)

b.collect

res37: Array[Int] = Array(1, 2, 3, 4, 6, 7, 8, 9)

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 5)

a.filterWith(x=> x)((a, b) =>  b == 0).collect

res30: Array[Int] = Array(1, 2)

a.filterWith(x=> x)((a, b) =>  a % (b+1) == 0).collect

res33: Array[Int] = Array(1, 2, 4, 6, 8, 10)

a.filterWith(x=> x.toString)((a, b) =>  b == "2").collect

res34: Array[Int] = Array(5, 6)

flatMap

 

Similar to map, but allows emitting more than one item in the map function.

Listing Variants

def flatMap[U: ClassTag](f: T =>TraversableOnce[U]): RDD[U]

Example

val a = sc.parallelize(1 to 10, 5)

a.flatMap(1 to _).collect

res47: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3,4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7,8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

sc.parallelize(List(1, 2, 3), 2).flatMap(x =>List(x, x, x)).collect

res85: Array[Int] = Array(1, 1, 1, 2, 2, 2, 3, 3, 3)

// The program below generates a random number ofcopies (up to 10) of the items in the list.

val x  =sc.parallelize(1 to 10, 3)

x.flatMap(List.fill(scala.util.Random.nextInt(10))(_)).collect

res1: Array[Int] = Array(1, 2, 3, 3, 3, 4, 4, 4, 4, 4,4, 4, 4, 4, 5, 5, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 8, 8, 8, 8, 8, 8, 8, 8, 9,9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 10)

flatMapValues

 

Very similar to mapValues, but collapses the inherent structure of the values during mapping.

Listing Variants

def flatMapValues[U](f: V => TraversableOnce[U]):RDD[(K, U)]

Example

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.flatMapValues("x" + _ +"x").collect

res6: Array[(Int, Char)] = Array((3,x), (3,d), (3,o),(3,g), (3,x), (5,x), (5,t), (5,i), (5,g), (5,e), (5,r), (5,x), (4,x), (4,l),(4,i), (4,o), (4,n), (4,x), (3,x), (3,c), (3,a), (3,t), (3,x), (7,x), (7,p),(7,a), (7,n), (7,t), (7,h), (7,e), (7,r), (7,x), (5,x), (5,e), (5,a), (5,g),(5,l), (5,e), (5,x))

flatMapWith:

Similar to flatMap, but the function also has access to the partition index: the first argument constructs a value from the partition index, which is then passed to the mapping function.

def flatMapWith[A: ClassTag, U: ClassTag](constructA:Int => A, preservesPartitioning: Boolean = false)(f: (T, A) => Seq[U]):RDD[U]

val a = sc.parallelize(List(1,2,3,4,5,6,7,8,9), 3)

a.flatMapWith(x => x, true)((x, y) => List(y,x)).collect

res58: Array[Int] = Array(0, 1, 0, 2, 0, 3, 1, 4, 1,5, 1, 6, 2, 7, 2, 8, 2, 9)

Note: each element is returned in the form (partition index, data), flattened into one sequence.

fold:

Aggregates the values of all data items. The aggregation within each partition, and the final combination across partitions, both start from the given zeroValue.

val a = sc.parallelize(List(1,2,3), 3)

a.fold(0)(_ + _)

res59: Int = 6

a.fold(1)(_ + _)

res59: Int = 10

a.fold(2)(_ + _)

res59: Int = 14

a.fold(1)(_ - _)

res59: Int = 4

a.fold(2)(_ - _)

res59: Int = 2

foldByKey:

Folds the values by key, combining values that share the same key. This can only be used on RDDs of two-component tuples.

def foldByKey(zeroValue: V)(func: (V, V) => V):RDD[(K, V)]

def foldByKey(zeroValue: V, numPartitions: Int)(func:(V, V) => V): RDD[(K, V)]

def foldByKey(zeroValue: V, partitioner:Partitioner)(func: (V, V) => V): RDD[(K, V)]

val a = sc.parallelize(List("dog","cat", "owl", "gnu", "ant"), 2)

val b = a.map(x => (x.length, x))

b.foldByKey("")(_ + _).collect

res84: Array[(Int, String)] = Array((3,dogcatowlgnuant))

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.foldByKey("")(_ + _).collect

res85: Array[(Int, String)] = Array((4,lion),(3,dogcat), (7,panther), (5,tigereagle))

foreach:

Executes the given function for each data item.

def foreach(f: T => Unit)

val c = sc.parallelize(List("cat","dog", "tiger", "lion", "gnu","crocodile", "ant", "whale", "dolphin","spider"), 3)

c.foreach(x => println(x + "s areyummy"))

lions are yummy

gnus are yummy

crocodiles are yummy

ants are yummy

whales are yummy

dolphins are yummy

spiders are yummy

foreachPartition:

Executes the given function once for each partition; the data of each partition is accessed through an iterator.

def foreachPartition(f: Iterator[T] => Unit)

val b = sc.parallelize(List(1, 2, 3, 4, 5, 6, 7, 8,9), 3)

b.foreachPartition(x => println(x.reduce(_ + _)))

6

15

24

foreachWith:

forachPartition类似,第一个参数是partition的索引号,第二个参数是对应partition对应的data

def foreachWith[A: ClassTag](constructA: Int => A)(f: (T, A) => Unit)

val a = sc.parallelize(1 to 9, 3)

a.foreachWith(i => i)((x,i) => if (x % 2 == 1&& i % 2 == 0) println(x) )

1

3

7

9

getCheckpointFile:

Returns the path to the checkpoint file, or None if the RDD has not been checkpointed yet.

def getCheckpointFile: Option[String]

sc.setCheckpointDir("/home/cloudera/Documents")

val a = sc.parallelize(1 to 500, 5)

val b = a++a++a++a++a

b.getCheckpointFile

res49: Option[String] = None

b.checkpoint

b.getCheckpointFile

res54: Option[String] = None

b.collect

b.getCheckpointFile

res57: Option[String] =Some(file:/home/cloudera/Documents/cb978ffb-a346-4820-b3ba-d56580787b20/rdd-40)

getStorageLevel:

Returns the RDD's current storage level.

val a = sc.parallelize(1 to 100000, 2)

a.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)

a.getStorageLevel.description

String = Disk Serialized 1x Replicated

a.cache

java.lang.UnsupportedOperationException: Cannot changestorage level of an RDD after it was already assigned a level

glom:

Assembles the elements of each partition into an array and returns an RDD of these arrays.

def glom(): RDD[Array[T]]

val a = sc.parallelize(1 to 100, 3)

a.glom.collect

res8: Array[Array[Int]] = Array(Array(1, 2, 3, 4, 5, 6,7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,27, 28, 29, 30, 31, 32, 33), Array(34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,65, 66), Array(67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82,83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100))

groupBy:

Groups the data items by the key K produced by the supplied function.

def groupBy[K: ClassTag](f: T => K): RDD[(K,Seq[T])]

def groupBy[K: ClassTag](f: T => K, numPartitions:Int): RDD[(K, Seq[T])]

def groupBy[K: ClassTag](f: T => K, p:Partitioner): RDD[(K, Seq[T])]

val a = sc.parallelize(1 to 9, 3)

a.groupBy(x => { if (x % 2 == 0) "even"else "odd" }).collect

res42: Array[(String, Seq[Int])] =Array((even,ArrayBuffer(2, 4, 6, 8)), (odd,ArrayBuffer(1, 3, 5, 7, 9)))

val a = sc.parallelize(1 to 9, 3)

def myfunc(a: Int) : Int =

{

  a % 2

}

a.groupBy(myfunc).collect

res3: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2,4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))

val a = sc.parallelize(1 to 9, 3)

def myfunc(a: Int) : Int =

{

  a % 2

}

a.groupBy(x => myfunc(x), 3).collect     // the 3 here specifies the number of partitions

a.groupBy(myfunc(_), 1).collect

res7: Array[(Int, Seq[Int])] = Array((0,ArrayBuffer(2,4, 6, 8)), (1,ArrayBuffer(1, 3, 5, 7, 9)))

import org.apache.spark.Partitioner

class MyPartitioner extends Partitioner {

def numPartitions: Int = 2

def getPartition(key: Any): Int =

{

    key match

    {

      casenull     => 0

      case key:Int => key          % numPartitions

      case _        => key.hashCode % numPartitions

    }

  }

  override defequals(other: Any): Boolean =

  {

    other match

    {

      case h:MyPartitioner => true

      case_                => false

    }

  }

}

val a = sc.parallelize(1 to 9, 3)

val p = new MyPartitioner()

val b = a.groupBy((x:Int) => { x }, p)

val c = b.mapWith(i => i)((a, b) => (b, a))

c.collect

res42: Array[(Int, (Int, Seq[Int]))] =Array((0,(4,ArrayBuffer(4))), (0,(2,ArrayBuffer(2))), (0,(6,ArrayBuffer(6))),(0,(8,ArrayBuffer(8))), (1,(9,ArrayBuffer(9))), (1,(3,ArrayBuffer(3))),(1,(1,ArrayBuffer(1))), (1,(7,ArrayBuffer(7))), (1,(5,ArrayBuffer(5))))

groupByKey [Pair]:

group类似,但是有所不同,会按照key进行分组,value组成一个ArrayBuffer

def groupByKey(): RDD[(K, Seq[V])]

def groupByKey(numPartitions: Int): RDD[(K, Seq[V])]

def groupByKey(partitioner: Partitioner): RDD[(K,Seq[V])]

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "spider","eagle"), 2)

val b = a.keyBy(_.length)

b.groupByKey.collect

res11: Array[(Int, Seq[String])] = Array((4,ArrayBuffer(lion)),(6,ArrayBuffer(spider)), (3,ArrayBuffer(dog, cat)), (5,ArrayBuffer(tiger,eagle)))

histogram [Double]:

Computes a histogram of the values: either with a given number of evenly spaced buckets between the minimum and maximum value, or with user-defined bucket boundaries.

def histogram(bucketCount: Int): Pair[Array[Double],Array[Long]]

def histogram(buckets: Array[Double], evenBuckets:Boolean = false): Array[Long]

With evenly spaced buckets:

val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1,7.4, 7.5, 7.6, 8.8, 9.0), 3)

a.histogram(5)

res11: (Array[Double], Array[Long]) = (Array(1.1,2.68, 4.26, 5.84, 7.42, 9.0),Array(5, 0, 0, 1, 4))

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3,5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)

a.histogram(6)

res18: (Array[Double], Array[Long]) = (Array(1.0, 2.5,4.0, 5.5, 7.0, 8.5, 10.0),Array(6, 0, 1, 1, 3, 4))

With user-defined bucket boundaries:

val a = sc.parallelize(List(1.1, 1.2, 1.3, 2.0, 2.1,7.4, 7.5, 7.6, 8.8, 9.0), 3)

a.histogram(Array(0.0, 3.0, 8.0))

res14: Array[Long] = Array(5, 3)

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3,5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)

a.histogram(Array(0.0, 5.0, 10.0))

res1: Array[Long] = Array(6, 9)

a.histogram(Array(0.0, 5.0, 10.0, 15.0))

res1: Array[Long] = Array(6, 8, 1)

id:

Retrieves the id that Spark has assigned to the RDD.

val y = sc.parallelize(1 to 10, 10)

y.id

res16: Int = 19

isCheckpointed:

Indicates whether the RDD has been checkpointed.

sc.setCheckpointDir("/home/cloudera/Documents")

c.isCheckpointed

res6: Boolean = false

c.checkpoint

c.isCheckpointed

res8: Boolean = false

c.collect

c.isCheckpointed

res9: Boolean = true

iterator:

Returns an iterator over a partition of the RDD. This method should never be called directly.

join [Pair]:

Performs an inner join of two key-value RDDs, much like a join in a database.

def join[W](other: RDD[(K, W)]): RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], numPartitions: Int):RDD[(K, (V, W))]

def join[W](other: RDD[(K, W)], partitioner:Partitioner): RDD[(K, (V, W))]

val a = sc.parallelize(List("dog","salmon", "salmon", "rat", "elephant"),3)

val b = a.keyBy(_.length)

val c =

sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val d = c.keyBy(_.length)

b.join(d).collect

res17: Array[(Int, (String, String))] =Array((6,(salmon,salmon)), (6,(salmon,rabbit)), (6,(salmon,turkey)),(6,(rabbit,salmon)), (6,(rabbit,rabbit)), (6,(rabbit,turkey)),(6,(turkey,salmon)), (6,(turkey,rabbit)), (6,(turkey,turkey)), (3,(dog,dog)),(3,(dog,cat)), (3,(dog,gnu)), (3,(dog,bee)), (3,(cat,dog)), (3,(cat,cat)), (3,(cat,gnu)),(3,(cat,bee)), (3,(gnu,dog)), (3,(gnu,cat)), (3,(gnu,gnu)), (3,(gnu,bee)),(3,(bee,dog)), (3,(bee,cat)), (3,(bee,gnu)), (3,(bee,bee)), (4,(wolf,wolf)),(4,(wolf,bear)), (4,(bear,wolf)), (4,(bear,bear)))

keyBy:

Constructs key-value tuples by applying a function to each data item; the result of the function becomes the key, which is paired with the original item.

def keyBy[K](f: T => K): RDD[(K, T)]

val a = sc.parallelize(List("dog","salmon", "salmon", "rat", "elephant"),3)

val b = a.keyBy(_.length)

b.collect

res26: Array[(Int, String)] = Array((3,dog),(6,salmon), (6,salmon), (3,rat), (8,elephant))

keys [Pair]:

Extracts the keys of all key-value pairs and returns them as a new RDD.

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.keys.collect

res2: Array[Int] = Array(3, 5, 4, 3, 7, 5)

leftOuterJoin [Pair]:

Performs a left outer join of two key-value RDDs.

def leftOuterJoin[W](other: RDD[(K, W)]): RDD[(K, (V,Option[W]))]

def leftOuterJoin[W](other: RDD[(K, W)],numPartitions: Int): RDD[(K, (V, Option[W]))]

def leftOuterJoin[W](other: RDD[(K, W)], partitioner:Partitioner): RDD[(K, (V, Option[W]))]

val a = sc.parallelize(List("dog","salmon", "salmon", "rat", "elephant"),3)

val b = a.keyBy(_.length)

val c =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val d = c.keyBy(_.length)

b.leftOuterJoin(d).collect

res1: Array[(Int, (String, Option[String]))] =Array((6,(salmon,Some(salmon))), (6,(salmon,Some(rabbit))),(6,(salmon,Some(turkey))), (6,(salmon,Some(salmon))),(6,(salmon,Some(rabbit))), (6,(salmon,Some(turkey))), (3,(dog,Some(dog))),(3,(dog,Some(cat))), (3,(dog,Some(gnu))), (3,(dog,Some(bee))),(3,(rat,Some(dog))), (3,(rat,Some(cat))), (3,(rat,Some(gnu))),(3,(rat,Some(bee))), (8,(elephant,None)))

lookup:

Scans the RDD for all entries with the given key and returns their values as a Seq.

def lookup(key: K): Seq[V]

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.lookup(5)

res0: Seq[String] = WrappedArray(tiger, eagle)

map:

RDD中的每个data都应用转换方法,并且把结果RDD返回

def map[U: ClassTag](f: T => U): RDD[U]

val a = sc.parallelize(List("dog","salmon", "salmon", "rat", "elephant"),3)

val b = a.map(_.length)

val c = a.zip(b)

c.collect

res0: Array[(String, Int)] = Array((dog,3),(salmon,6), (salmon,6), (rat,3), (elephant,8))

mapPartitions:

Similar to map, but the supplied function is executed once per partition and receives an iterator over all items of that partition, which allows per-partition setup and batching.

def mapPartitions[U: ClassTag](f: Iterator[T] =>Iterator[U], preservesPartitioning: Boolean = false): RDD[U]

Example 1

val a = sc.parallelize(1 to 9, 3)

def myfunc[T](iter: Iterator[T]) : Iterator[(T, T)] ={

  var res =List[(T, T)]()

  var pre =iter.next

  while(iter.hasNext)

  {

    val cur =iter.next;

    res .::=(pre, cur)

    pre = cur;

  }

  res.iterator

}

a.mapPartitions(myfunc).collect

res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6),(4,5), (8,9), (7,8))

Example 2

val x = sc.parallelize(List("1","2", "3", "4", "5", "6","7", "8", "10"), 3)

def myfunc(iter: Iterator[Int]) : Iterator[Int] = {

  var res =List[Int]()

  while(iter.hasNext) {

    val cur =iter.next;

    res = res::: List.fill(scala.util.Random.nextInt(10))(cur)

  }

  res.iterator

}

x.mapPartitions(myfunc).collect

// some of the number are not outputted at all. Thisis because the random number generated for it is zero.

res8: Array[Int] = Array(1, 2, 2, 2, 2, 3, 3, 3, 3, 3,3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 5, 7, 7, 7, 9, 9, 10)

mapPartitionsWithContext:

mapParitions类似,但是允许获得运行状态的信息.

def mapPartitionsWithContext[U: ClassTag](f:(TaskContext, Iterator[T]) => Iterator[U], preservesPartitioning: Boolean =false): RDD[U]
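
A minimal sketch (assuming the TaskContext exposes a partitionId member, as it does in recent Spark versions):

import org.apache.spark.TaskContext

val a = sc.parallelize(1 to 9, 3)

def myfunc(tc: TaskContext, iter: Iterator[Int]): Iterator[String] = {
  // tag every item with the id of the partition it came from
  iter.map(x => "partition " + tc.partitionId + " -> " + x)
}

a.mapPartitionsWithContext(myfunc).collect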

mapPartitionsWithIndex:

Similar to mapPartitions, but the supplied function takes two arguments: the index of the partition and an iterator over all items of that partition.

Example

val x = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10), 3)

def myfunc(index: Int, iter: Iterator[Int]) :Iterator[String] = {

 iter.toList.map(x => index + "," + x).iterator

}

x.mapPartitionsWithIndex(myfunc).collect()

res10: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5,1,6, 2,7, 2,8, 2,9, 2,10)


mapValues:

Requires a key-value RDD; the supplied function is applied to the value of each pair while the keys are left unchanged.

def mapValues[U](f: V => U): RDD[(K, U)]

Example

val a = sc.parallelize(List("dog", "tiger","lion", "cat", "panther", "eagle"), 2)

val b = a.map(x => (x.length, x))

b.mapValues("x" + _ + "x").collect

res5: Array[(Int, String)] = Array((3,xdogx),(5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))

mean [Double], meanApprox [Double]:

Computes the mean of all values; meanApprox returns an approximate result within a timeout.

def mean(): Double

def meanApprox(timeout: Long, confidence: Double =0.95): PartialResult[BoundedDouble]

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3,5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)

a.mean

res0: Double = 5.3

name, setName:

Gets or sets the name of the RDD.

@transient var name: String

def setName(_name: String)

val y = sc.parallelize(1 to 10, 10)

y.name

res13: String = null

y.setName("Fancy RDD Name")

y.name

res15: String = Fancy RDD Name

partitions:

Returns an array of the partition objects associated with this RDD.

final def partitions: Array[Partition]

b.partitions

res1: Array[org.apache.spark.Partition] =Array(org.apache.spark.rdd.ParallelCollectionPartition@691,org.apache.spark.rdd.ParallelCollectionPartition@692,org.apache.spark.rdd.ParallelCollectionPartition@693)

persist, cache

persist: assigns a storage level to the RDD.

cache: caches the RDD using the default storage level.

def cache(): RDD[T]

def persist(): RDD[T]

def persist(newLevel: StorageLevel): RDD[T]

val c = sc.parallelize(List("Gnu","Cat", "Rat", "Dog", "Gnu","Rat"), 2)

c.getStorageLevel

res0: org.apache.spark.storage.StorageLevel =StorageLevel(false, false, false, 1)

c.cache

c.getStorageLevel

res2: org.apache.spark.storage.StorageLevel =StorageLevel(false, true, true, 1)

pipe:

Pipes the data of each partition through the given shell command and returns the output as a new RDD of strings.

def pipe(command: String): RDD[String]

def pipe(command: String, env: Map[String, String]):RDD[String]

def pipe(command: Seq[String], env: Map[String,String] = Map(), printPipeContext: (String => Unit) => Unit = null,printRDDElement: (T, String => Unit) => Unit = null): RDD[String]

val a = sc.parallelize(1 to 9, 3)

a.pipe("head -n 1").collect

res2: Array[String] = Array(1, 4, 7)

reduce:

A very commonly used action. Note that the supplied function should be commutative and associative so that the result does not depend on partitioning.

def reduce(f: (T, T) => T): T

val a = sc.parallelize(1 to 100, 3)

a.reduce(_ + _)

res41: Int = 5050

reduceByKey [Pair], reduceByKeyLocally [Pair], reduceByKeyToDriver [Pair]:

reduce类似,就是他根据key进行操作

def reduceByKey(func: (V, V) => V): RDD[(K, V)]

def reduceByKey(func: (V, V) => V, numPartitions:Int): RDD[(K, V)]

def reduceByKey(partitioner: Partitioner, func: (V, V)=> V): RDD[(K, V)]

def reduceByKeyLocally(func: (V, V) => V): Map[K,V]

def reduceByKeyToDriver(func: (V, V) => V): Map[K,V]

Example

val a = sc.parallelize(List("dog","cat", "owl", "gnu", "ant"), 2)

val b = a.map(x => (x.length, x))

b.reduceByKey(_ + _).collect

res86: Array[(Int, String)] =Array((3,dogcatowlgnuant))

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.reduceByKey(_ + _).collect

res87: Array[(Int, String)] = Array((4,lion), (3,dogcat),(7,panther), (5,tigereagle))

rightOuterJoin [Pair]:

Performs a right outer join of two key-value RDDs.

def rightOuterJoin[W](other: RDD[(K, W)]): RDD[(K,(Option[V], W))]

def rightOuterJoin[W](other: RDD[(K, W)],numPartitions: Int): RDD[(K, (Option[V], W))]

def rightOuterJoin[W](other: RDD[(K, W)], partitioner:Partitioner): RDD[(K, (Option[V], W))]

val a = sc.parallelize(List("dog","salmon", "salmon", "rat", "elephant"),3)

val b = a.keyBy(_.length)

val c =sc.parallelize(List("dog","cat","gnu","salmon","rabbit","turkey","wolf","bear","bee"),3)

val d = c.keyBy(_.length)

b.rightOuterJoin(d).collect

res2: Array[(Int, (Option[String], String))] =Array((6,(Some(salmon),salmon)), (6,(Some(salmon),rabbit)),(6,(Some(salmon),turkey)), (6,(Some(salmon),salmon)),(6,(Some(salmon),rabbit)), (6,(Some(salmon),turkey)), (3,(Some(dog),dog)),(3,(Some(dog),cat)), (3,(Some(dog),gnu)), (3,(Some(dog),bee)),(3,(Some(rat),dog)), (3,(Some(rat),cat)), (3,(Some(rat),gnu)),(3,(Some(rat),bee)), (4,(None,wolf)), (4,(None,bear)))

sample:

RDD中随机选择一个items

def sample(withReplacement: Boolean, fraction: Double,seed: Int): RDD[T]

val a = sc.parallelize(1 to 10000, 3)

a.sample(false, 0.1, 0).count

res24: Long = 960

a.sample(true, 0.3, 0).count

res25: Long = 2888

a.sample(true, 0.3, 13).count

res26: Long = 2985

saveAsHadoopFile [Pair], saveAsHadoopDataset [Pair],saveAsNewAPIHadoopFile [Pair]:

Methods for saving the RDD through the Hadoop integration (Hadoop output formats).

saveAsObjectFile:

Saves the RDD in binary (serialized object) format.

def saveAsObjectFile(path: String)

val x = sc.parallelize(1 to 100, 3)

x.saveAsObjectFile("objFile")

val y = sc.objectFile[Array[Int]]("objFile")

y.collect

res52: Array[Int] = Array(67, 68, 69, 70, 71, 72, 73,74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93,94, 95, 96, 97, 98, 99, 100, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46,47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33)

saveAsSequenceFile:

Saves the RDD as a Hadoop sequence file.

val v = sc.parallelize(Array(("owl",3),("gnu",4), ("dog",1), ("cat",2),("ant",5)), 2)

v.saveAsSequenceFile("hd_seq_file")

saveAsTextFile: (text, compressed, HDFS)

Saves the RDD as text files.

val a = sc.parallelize(1 to 10000, 3)

a.saveAsTextFile("mydata_a")

stats [Double]:

Computes the mean, variance and standard deviation of all values of the RDD in one pass.

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0,19.02, 19.29, 11.09, 21.0), 2)

x.stats

res16: org.apache.spark.util.StatCounter = (count: 9,mean: 11.266667, stdev: 8.126859)

sortByKey [Ordered]:

RDD的打他进行排序.按照key进行排序

def sortByKey(ascending: Boolean = true, numPartitions:Int = self.partitions.size): RDD[P]

val a = sc.parallelize(List("dog","cat", "owl", "gnu", "ant"), 2)

val b = sc.parallelize(1 to a.count.toInt, 2)

val c = a.zip(b)

c.sortByKey(true).collect

res74: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1),(gnu,4), (owl,3))

c.sortByKey(false).collect

res75: Array[(String, Int)] = Array((owl,3), (gnu,4),(dog,1), (cat,2), (ant,5))

val a = sc.parallelize(1 to 100, 5)

val b = a.cartesian(a)

val c = sc.parallelize(b.takeSample(true, 5, 13), 2)

val d = c.sortByKey(false)

d.collect

res56: Array[(Int, Int)] = Array((96,9), (84,76), (59,59), (53,65), (52,4))

subtract:

Set subtraction: returns an RDD with the items of A that do not appear in B (A - B).

def subtract(other: RDD[T]): RDD[T]

def subtract(other: RDD[T], numPartitions: Int):RDD[T]

def subtract(other: RDD[T], p: Partitioner): RDD[T]

val a = sc.parallelize(1 to 9, 3)

val b = sc.parallelize(1 to 3, 3)

val c = a.subtract(b)

c.collect

res3: Array[Int] = Array(6, 9, 4, 7, 5, 8)

subtractByKey:

subtract类似,不过这个要求的key-value的格式,比较的key,key进行减

def subtractByKey[W: ClassTag](other: RDD[(K, W)]):RDD[(K, V)]

def subtractByKey[W: ClassTag](other: RDD[(K, W)],numPartitions: Int): RDD[(K, V)]

def subtractByKey[W: ClassTag](other: RDD[(K, W)], p:Partitioner): RDD[(K, V)]

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "spider","eagle"), 2)

val b = a.keyBy(_.length)

val c = sc.parallelize(List("ant","falcon", "squid"), 2)

val d = c.keyBy(_.length)

b.subtractByKey(d).collect

res15: Array[(Int, String)] = Array((4,lion))

sum [Double], sumApprox [Double]:

sum: computes the sum of all values.

sumApprox: computes an approximate sum within a timeout.

def sum(): Double

def sumApprox(timeout: Long, confidence: Double =0.95): PartialResult[BoundedDouble]

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0,19.02, 19.29, 11.09, 21.0), 2)

x.sum

res17: Double = 101.39999999999999

take:

Extracts the first n items of the RDD and returns them as an array.

def take(num: Int): Array[T]

val b = sc.parallelize(List("dog","cat", "ape", "salmon", "gnu"), 2)

b.take(2)

res18: Array[String] = Array(dog, cat)

val b = sc.parallelize(1 to 10000, 5000)

b.take(100)

res6: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9,10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69,70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89,90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100)

takeOrdered:

Orders the data items using their inherent (implicit) ordering and returns the first n items as an array.

def takeOrdered(num: Int)(implicit ord: Ordering[T]):Array[T]

val b = sc.parallelize(List("dog","cat", "ape", "salmon", "gnu"), 2)

b.takeOrdered(2)

res19: Array[String] = Array(ape, cat)

takeSample:

Returns the requested number of randomly sampled items as an array (not an RDD), in random order.

def takeSample(withReplacement: Boolean, num: Int,seed: Int): Array[T]

val x = sc.parallelize(1 to 1000, 3)

x.takeSample(true, 100, 1)

res3: Array[Int] = Array(339, 718, 810, 105, 71, 268,333, 360, 341, 300, 68, 848, 431, 449, 773, 172, 802, 339, 431, 285, 937, 301,167, 69, 330, 864, 40, 645, 65, 349, 613, 468, 982, 314, 160, 675, 232, 794,577, 571, 805, 317, 136, 860, 522, 45, 628, 178, 321, 482, 657, 114, 332, 728,901, 290, 175, 876, 227, 130, 863, 773, 559, 301, 694, 460, 839, 952, 664, 851,260, 729, 823, 880, 792, 964, 614, 821, 683, 364, 80, 875, 813, 951, 663, 344,546, 918, 436, 451, 397, 670, 756, 512, 391, 70, 213, 896, 123, 858)

toDebugString:

Returns a human-readable description of the RDD's dependency chain (lineage).

def toDebugString: String

val a = sc.parallelize(1 to 9, 3)

val b = sc.parallelize(1 to 3, 3)

val c = a.subtract(b)

c.toDebugString

res6: String =

MappedRDD[15] at subtract at <console>:16 (3partitions)

  SubtractedRDD[14]at subtract at <console>:16 (3 partitions)

   MappedRDD[12] at subtract at <console>:16 (3 partitions)

     ParallelCollectionRDD[10] at parallelize at <console>:12 (3partitions)

   MappedRDD[13] at subtract at <console>:16 (3 partitions)

     ParallelCollectionRDD[11] at parallelize at <console>:12 (3partitions)

toJavaRDD:

Wraps this RDD and returns it as a JavaRDD.

def toJavaRDD() : JavaRDD[T]

top:

Returns the largest num items according to the implicit ordering.

def top(num: Int)(implicit ord: Ordering[T]): Array[T]

val c = sc.parallelize(Array(6, 9, 4, 7, 5, 8), 2)

c.top(2)

res28: Array[Int] = Array(9, 8)

toString:

RDD转换成可以看的序列

override def toString: String

val a = sc.parallelize(1 to 9, 3)

val b = sc.parallelize(1 to 3, 3)

val c = a.subtract(b)

c.toString

res7: String = MappedRDD[15] at subtract at<console>:16

union, ++:

Returns the union of A and B, similar to a union in a database (duplicates are not removed).

def ++(other: RDD[T]): RDD[T]

def union(other: RDD[T]): RDD[T]

val a = sc.parallelize(1 to 3, 1)

val b = sc.parallelize(5 to 7, 1)

(a ++ b).collect

res0: Array[Int] = Array(1, 2, 3, 5, 6, 7)

unpersist:

Removes the RDD's data from disk and memory. The RDD itself remains usable: if it is referenced in a later computation, Spark will regenerate it automatically from its lineage graph.

def unpersist(blocking: Boolean = true): RDD[T]

val y = sc.parallelize(1 to 10, 10)

val z = (y++y)

z.collect

z.unpersist(true)

14/04/19 03:04:57 INFO UnionRDD: Removing RDD 22 frompersistence list

14/04/19 03:04:57 INFO BlockManager: Removing RDD 22

values:

Extracts the values of all key-value pairs and returns them as a new RDD.

val a = sc.parallelize(List("dog","tiger", "lion", "cat", "panther","eagle"), 2)

val b = a.map(x => (x.length, x))

b.values.collect

res3: Array[String] = Array(dog, tiger, lion, cat,panther, eagle)

variance [Double], sampleVariance [Double]:

variance: computes the variance of the values.

sampleVariance: computes the sample variance of the values.

val a = sc.parallelize(List(9.1, 1.0, 1.2, 2.1, 1.3, 5.0, 2.0, 2.1, 7.4, 7.5, 7.6, 8.8, 10.0, 8.9, 5.5), 3)

a.variance

res70: Double = 10.605333333333332

val x = sc.parallelize(List(1.0, 2.0, 3.0, 5.0, 20.0, 19.02, 19.29, 11.09, 21.0), 2)

x.variance

res14: Double = 66.04584444444443

x.sampleVariance

res13: Double = 74.30157499999999

zip:

Joins two RDDs element-wise, producing an RDD of key-value pairs.

def zip[U: ClassTag](other: RDD[U]): RDD[(T, U)]

val a = sc.parallelize(1 to 100, 3)

val b = sc.parallelize(101 to 200, 3)

a.zip(b).collect

res1: Array[(Int, Int)] = Array((1,101), (2,102), (3,103), (4,104), (5,105), (6,106), (7,107), (8,108), (9,109), (10,110), (11,111), (12,112), (13,113), (14,114), (15,115), (16,116), (17,117), (18,118), (19,119), (20,120), (21,121), (22,122), (23,123), (24,124), (25,125), (26,126), (27,127), (28,128), (29,129), (30,130), (31,131), (32,132), (33,133), (34,134), (35,135), (36,136), (37,137), (38,138), (39,139), (40,140), (41,141), (42,142), (43,143), (44,144), (45,145), (46,146), (47,147), (48,148), (49,149), (50,150), (51,151), (52,152), (53,153), (54,154), (55,155), (56,156), (57,157), (58,158), (59,159), (60,160), (61,161), (62,162), (63,163), (64,164), (65,165), (66,166), (67,167), (68,168), (69,169), (70,170), (71,171), (72,172), (73,173), (74,174), (75,175), (76,176), (77,177), (78,...

val a = sc.parallelize(1 to 100, 3)

val b = sc.parallelize(101 to 200, 3)

val c = sc.parallelize(201 to 300, 3)

a.zip(b).zip(c).map((x) => (x._1._1, x._1._2, x._2)).collect

res12: Array[(Int, Int, Int)] = Array((1,101,201), (2,102,202), (3,103,203), (4,104,204), (5,105,205), (6,106,206), (7,107,207), (8,108,208), (9,109,209), (10,110,210), (11,111,211), (12,112,212), (13,113,213), (14,114,214), (15,115,215), (16,116,216), (17,117,217), (18,118,218), (19,119,219), (20,120,220), (21,121,221), (22,122,222), (23,123,223), (24,124,224), (25,125,225), (26,126,226), (27,127,227), (28,128,228), (29,129,229), (30,130,230), (31,131,231), (32,132,232), (33,133,233), (34,134,234), (35,135,235), (36,136,236), (37,137,237), (38,138,238), (39,139,239), (40,140,240), (41,141,241), (42,142,242), (43,143,243), (44,144,244), (45,145,245), (46,146,246), (47,147,247), (48,148,248), (49,149,249), (50,150,250), (51,151,251), (52,152,252), (53,153,253), (54,154,254), (55,155,255)...

zipPartitions:

Similar to zip, but provides more control over how the partitions are combined.

def zipPartitions[B: ClassTag, V: ClassTag](rdd2: RDD[B])(f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]

def zipPartitions[B: ClassTag, V: ClassTag](rdd2: RDD[B], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B]) => Iterator[V]): RDD[V]

def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C])(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V]

def zipPartitions[B: ClassTag, C: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C]) => Iterator[V]): RDD[V]

def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D])(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]

def zipPartitions[B: ClassTag, C: ClassTag, D: ClassTag, V: ClassTag](rdd2: RDD[B], rdd3: RDD[C], rdd4: RDD[D], preservesPartitioning: Boolean)(f: (Iterator[T], Iterator[B], Iterator[C], Iterator[D]) => Iterator[V]): RDD[V]

val a = sc.parallelize(0 to 9, 3)

val b = sc.parallelize(10 to 19, 3)

val c = sc.parallelize(100 to 109, 3)

def myfunc(aiter: Iterator[Int], biter: Iterator[Int], citer: Iterator[Int]): Iterator[String] =
{
  var res = List[String]()
  while (aiter.hasNext && biter.hasNext && citer.hasNext)
  {
    val x = aiter.next + " " + biter.next + " " + citer.next
    res ::= x
  }
  res.iterator
}

a.zipPartitions(b, c)(myfunc).collect

res50: Array[String] = Array(2 12 102, 1 11 101, 0 10 100, 5 15 105, 4 14 104, 3 13 103, 9 19 109, 8 18 108, 7 17 107, 6 16 106)
