Spark Transformation and Action Operations (important)



The core of the interface Hadoop offers is the pair of MapReduce functions. Spark is an extension of MapReduce: it provides two whole families of operations (Transformations and Actions) rather than just two functions, which makes it much more convenient to use; thanks to this rich API, the code written during development can shrink by a factor of tens. (quoted from the web)


RDD:

An RDD is Spark's abstract data structure type; all data in Spark is represented as RDDs. From a programming point of view, an RDD can simply be viewed as an array. The difference from an ordinary array is that an RDD's data is stored in partitions, so the data of different partitions can be distributed across different machines and processed in parallel. Everything a Spark application does therefore amounts to turning the data it needs to process into RDDs and then applying a series of transformations and actions to them to obtain the result.
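For example, a minimal sketch of such a pipeline in the Scala shell, assuming sc is the SparkContext that spark-shell provides:

val nums = sc.parallelize(1 to 10)    // build an RDD from a local collection
val squares = nums.map(x => x * x)    // a transformation: lazily describes a new RDD
squares.reduce(_ + _)                 // an action: triggers the job and returns 385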


The two kinds of RDD operations:

Each RDD supports two sets of parallel operations: transformations and actions.

(1) Transformation: returns a new RDD, e.g. map returns a MappedRDD[U] by applying a function f to each element.

(2) Action: returns a value to the driver, e.g. reduce returns a T by combining the elements with a specified commutative and associative binary operator.


Transformations:

 

map(func)

Return a new distributed dataset formed by passing each element of the source through a function func.

 

filter(func)

Return a new dataset formed by selecting those elements of the source on which func returns true.

 

flatMap(func)

Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).

 

mapPartitions(func)

Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T. Where map is applied per element, mapPartitions is applied per partition.

 

mapPartitionsWithIndex(func)

Similar to mapPartitions, but also provides func with an integer value representing the index of the partition, so func must be of type (Int, Iterator<T>) => Iterator<U> when running on an RDD of type T.
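A small sketch of this, tagging each element with the partition it lives in:

val rdd = sc.parallelize(1 to 6, 2)   // two partitions: (1, 2, 3) and (4, 5, 6)
rdd.mapPartitionsWithIndex((idx, iter) => iter.map(x => (idx, x))).collect()
// Array((0,1), (0,2), (0,3), (1,4), (1,5), (1,6))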

 

sample(withReplacement, fraction, seed)

Sample a fraction fraction of the data, with or without replacement, using a given random number generator seed.
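For instance (the exact elements drawn depend on the seed, and the result size is only approximately fraction * count):

val data = sc.parallelize(1 to 100)
data.sample(false, 0.1, 42).collect()   // without replacement, roughly 10 elements; seed 42 makes the draw reproducible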

 

union(otherDataset)

Return a new dataset that contains the union of the elements in the source dataset and the argument.

 

intersection(otherDataset)

Return a new RDD that contains the intersection of elements in the source dataset and the argument.

 

distinct([numTasks])

Return a new dataset that contains the distinct elements of the source dataset.
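A sketch contrasting union, intersection and distinct (union keeps duplicates while intersection and distinct remove them; element order in the results is not guaranteed):

val u1 = sc.parallelize(List(1, 2, 3, 3))
val u2 = sc.parallelize(List(3, 4))
u1.union(u2).collect()         // Array(1, 2, 3, 3, 3, 4): duplicates are kept
u1.intersection(u2).collect()  // Array(3): the intersection is deduplicated
u1.distinct().collect()        // Array(1, 2, 3)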

 

groupByKey([numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. This is analogous to the key and list of values that the reduce function receives in Hadoop.
Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.
Note: By default, the level of parallelism in the output depends on the number of partitions of the parent RDD. You can pass an optional numTasks argument to set a different number of tasks.

 

reduceByKey(func, [numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument. In effect this applies a reduce function (a sum, for example) to the (K, Seq[V]) groups that groupByKey would produce.

 

aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])

When called on a dataset of (K, V) pairs, returns a dataset of (K, U) pairs where the values for each key are aggregated using the given combine functions and a neutral "zero" value. Allows an aggregated value type that is different than the input value type, while avoiding unnecessary allocations. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
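A common use is a per-key average, where the accumulator type (sum, count) differs from the value type; a minimal sketch:

val scores = sc.parallelize(List(("A", 3), ("A", 5), ("B", 2)))
val sumCount = scores.aggregateByKey((0, 0))(
  (acc, v) => (acc._1 + v, acc._2 + 1),    // seqOp: fold one value into the (sum, count) accumulator
  (a, b) => (a._1 + b._1, a._2 + b._2))    // combOp: merge accumulators from different partitions
sumCount.mapValues { case (sum, cnt) => sum.toDouble / cnt }.collect()
// Array((A,4.0), (B,2.0))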

 

sortByKey([ascending], [numTasks])

When called on a dataset of (K, V) pairs where K implements Ordered, returns a dataset of (K, V) pairs sorted by keys in ascending or descending order, as specified in the boolean ascending argument.

 

join(otherDataset, [numTasks])

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin. numTasks sets the number of parallel tasks.

 

cogroup(otherDataset, [numTasks])

When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
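A minimal cogroup sketch; a key present in only one dataset gets an empty buffer on the other side (key order in the output may vary):

val x = sc.parallelize(List((1, "a"), (2, "b")))
val y = sc.parallelize(List((1, "A"), (1, "AA"), (3, "C")))
x.cogroup(y).collect()
// Array((1,(CompactBuffer(a),CompactBuffer(A, AA))),
//       (2,(CompactBuffer(b),CompactBuffer())),
//       (3,(CompactBuffer(),CompactBuffer(C))))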

 

cartesian(otherDataset)

When called on datasets of types T and U, returns a dataset of (T, U) pairs (all pairs of elements). The Cartesian product of datasets of sizes m and n has m*n elements.
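For example:

val m = sc.parallelize(List(1, 2))
val n = sc.parallelize(List("a", "b", "c"))
m.cartesian(n).collect()   // 2 * 3 = 6 pairs
// Array((1,a), (1,b), (1,c), (2,a), (2,b), (2,c))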

 

pipe(command, [envVars])

Pipe each partition of the RDD through a shell command, e.g. a Perl or bash script. RDD elements are written to the process's stdin and lines output to its stdout are returned as an RDD of strings.
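A small sketch, assuming a Unix-like system with grep on the PATH; each element is written as one line to the command's stdin:

val words = sc.parallelize(List("spark", "hadoop", "flink"))
words.pipe("grep a").collect()   // Array(spark, hadoop): only lines containing "a" survive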

 

coalesce(numPartitions)

Decrease the number of partitions in the RDD to numPartitions. Useful for running operations more efficiently after filtering down a large dataset.

 

repartition(numPartitions)

Reshuffle the data in the RDD randomly to create either more or fewer partitions and balance it across them. This always shuffles all data over the network.
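A sketch contrasting coalesce and repartition (coalesce avoids a full shuffle when only decreasing the partition count; repartition always shuffles):

val big = sc.parallelize(1 to 1000, 100)
val small = big.filter(_ % 100 == 0)    // 10 elements left, spread over 100 partitions
small.coalesce(2).getNumPartitions      // 2, no full shuffle
small.repartition(8).getNumPartitions   // 8, data is reshuffled over the network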

 

repartitionAndSortWithinPartitions(partitioner)

Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys. This is more efficient than calling repartition and then sorting within each partition because it can push the sorting down into the shuffle machinery.
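A minimal sketch using a HashPartitioner; glom() is used only to inspect each partition's contents (elements with equal keys may appear in either order):

import org.apache.spark.HashPartitioner
val pairs = sc.parallelize(List((3, "c"), (1, "a"), (2, "b"), (1, "z")))
pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))
     .glom().collect()   // each inner array is one partition, sorted by key
// Array(Array((2,b)), Array((1,a), (1,z), (3,c)))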

 


Actions:

reduce(func)

Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.

collect()

Return all the elements of the dataset as an array at the driver program. This is usually useful after a filter or other operation that returns a sufficiently small subset of the data.

count()

Return the number of elements in the dataset.

first()

Return the first element of the dataset (similar to take(1)).

take(n)

Return an array with the first n elements of the dataset. The result is returned to the driver program.

takeSample(withReplacement, num, [seed])

Return an array with a random sample of num elements of the dataset, with or without replacement, optionally pre-specifying a random number generator seed.

takeOrdered(n, [ordering])

Return the first n elements of the RDD using either their natural order or a custom comparator.
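For example:

val nums = sc.parallelize(List(5, 3, 9, 1, 7))
nums.takeOrdered(3)                          // Array(1, 3, 5): natural (ascending) order
nums.takeOrdered(3)(Ordering[Int].reverse)   // Array(9, 7, 5): custom ordering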

saveAsTextFile(path)

Write the elements of the dataset as a text file (or set of text files) in a given directory in the local filesystem, HDFS or any other Hadoop-supported file system. Spark will call toString on each element to convert it to a line of text in the file.

saveAsSequenceFile(path) (Java and Scala)

Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is available on RDDs of key-value pairs that implement Hadoop's Writable interface. In Scala, it is also available on types that are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc).

saveAsObjectFile(path) (Java and Scala)

Write the elements of the dataset in a simple format using Java serialization, which can then be loaded using SparkContext.objectFile().
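A round-trip sketch (the output path is a hypothetical example; element order on re-read may vary):

val nums = sc.parallelize(1 to 5)
nums.saveAsObjectFile("/tmp/nums_obj")          // written via Java serialization
val back = sc.objectFile[Int]("/tmp/nums_obj")  // load it back as an RDD[Int]
back.collect()                                  // Array(1, 2, 3, 4, 5)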

countByKey()

Only available on RDDs of type (K, V). Returns a hashmap of (K, Int) pairs with the count of each key.
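For example:

val kv = sc.parallelize(List(("A", 1), ("B", 2), ("A", 3)))
kv.countByKey()   // Map(A -> 2, B -> 1), returned to the driver as a local Map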

foreach(func)

Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
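A sketch of the Accumulator pattern the note refers to, using the classic sc.accumulator API of the Spark 1.x era this table comes from (newer versions use sc.longAccumulator); a plain local variable would not be updated correctly, because the closure runs on the executors:

val acc = sc.accumulator(0)                       // shared write-only counter owned by the driver
sc.parallelize(1 to 100).foreach(x => acc += x)   // executors add to it as a side effect
acc.value                                         // 5050, read back on the driver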



Examples of common operations:

1. map & mapValues

map produces exactly one output element for each input element of the source RDD.

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", " eagle"), 2)  // 2 is the number of partitions (tasks), which speeds things up
val b = a.map(x => (x.length, x))     // each element maps to exactly one tuple, e.g. "dog" => (3, "dog")
b.mapValues("x" + _ + "x").collect    // (3,"dog") => (3,"xdogx"); mapValues transforms only the values; collect is the action that returns the result

2. flatMap

Each element of the source RDD can produce zero or more elements in the new RDD.

scala> val a = sc.parallelize(1 to 4, 2)
scala> val b = a.flatMap(x => 1 to x)   // e.g. 3 is expanded to 1, 2, 3: one element becomes several
scala> b.collect
res12: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)   // (1), (1,2), (1,2,3), (1,2,3,4)

3. mapPartitions

val mappartionaa = sc.parallelize(1 to 9, 3)   // split into 3 partitions
def myfunc[T](iter: Iterator[T]): Iterator[(T, T)] = {
  var res = List[(T, T)]()
  var pre = iter.next()        // initialize with the first element of the partition
  while (iter.hasNext) {
    val cur = iter.next()
    res = (pre, cur) :: res    // prepend the pair (pre, cur) to the result list
    pre = cur
  }
  res.iterator                 // return an iterator, as mapPartitions expects
}
println("mappartion")
mappartionaa.mapPartitions(myfunc).collect().foreach(println)   // myfunc runs once per partition

The function myfunc above pairs each element of a partition with its successor. Because the last element of each partition has no successor, (3,4) and (6,7) do not appear in the result. mapPartitions has some variants: mapPartitionsWithContext can pass state information about the processing to the user-supplied function, and mapPartitionsWithIndex passes the partition's index to the user-supplied function.

4. reduce

reduce feeds the RDD's elements to the input function two at a time; the newly produced value is then passed to the input function together with the next element, until only a single value remains.

scala> val c = sc.parallelize(1 to 10)
scala> c.reduce((x, y) => x + y)
res4: Int = 55

5. reduceByKey

As the name suggests, reduceByKey reduces the values of those elements of a KV-pair RDD that share the same key, so the values of all elements with the same key are reduced to a single value, which is then combined with the original key into a new KV pair.

scala> val a = sc.parallelize(List((1,2),(3,4),(3,6)))
scala> a.reduceByKey((x,y) => x + y).collect
res7: Array[(Int, Int)] = Array((1,2), (3,10))

6. Simple actions

val num = sc.parallelize(1 to 10)   // num: Array[Int] = Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
num.reduce(_ + _)                   // sum of all elements
num.take(5)                         // pick the first 5 elements
num.first                           // pick the first element
num.count                           // the number of elements in the dataset
num.take(5).foreach(println)        // print the first 5 elements

7. KV demos: groupByKey, sortByKey, reduceByKey

// first create a KV-pair RDD
val kv1 = sc.parallelize(List(("A",1),("B",2),("C",3),("A",4),("B",5)))

(1) kv1.sortByKey().collect        // note: the parentheses on sortByKey cannot be omitted
    // result: Array((A,1), (A,4), (B,2), (B,5), (C,3))
(2) kv1.groupByKey().collect       // no merging, similar to the input of a Hadoop reducer
    // Array((A,CompactBuffer(1, 4)), (B,CompactBuffer(2, 5)), (C,CompactBuffer(3)))
(3) kv1.reduceByKey(_+_).collect   // merges the values per key
    // Array[(String, Int)] = Array((A,5), (B,7), (C,3))

8. Deduplication

val kv2 = sc.parallelize(List(("A",4),("A",4),("C",3),("A",4),("B",5)))
kv2.distinct.collect   // remove duplicate elements
// Array[(String, Int)] = Array((A,4), (B,5), (C,3))
// if instead val kv2 = sc.parallelize(List(("A",4),("A",3),("C",3),("A",4),("B",5)))
// then the result is: Array((A,4), (B,5), (A,3), (C,3))

9. union: concatenates a and b vertically

kv1.union(kv2).collect   // combine the two datasets
// kv1: Array[(String, Int)] = Array((A,1), (B,2), (C,3), (A,4), (B,5))
// kv2: Array[(String, Int)] = Array((A,4), (A,4), (C,3), (A,4), (B,5))
// the result is a plain concatenation with no deduplication:
// Array[(String, Int)] = Array((A,1), (B,2), (C,3), (A,4), (B,5), (A,4), (A,4), (C,3), (A,4), (B,5))

10. join: by default joins horizontally on the key, like a multi-table join in a database

// join demo: stitching related records together
val format = new java.text.SimpleDateFormat("yyyy-MM-dd")
case class Register(d: java.util.Date, uuid: String, cust_id: String, lat: Float, lng: Float)
case class Click(d: java.util.Date, uuid: String, landing_page: Int)
val reg = sc.textFile("F:/HDFSinputfile/reg.tsv").map(_.split("\t"))
  .map(r => (r(1), Register(format.parse(r(0)), r(1), r(2), r(3).toFloat, r(4).toFloat)))
reg.foreach(println)
val clk = sc.textFile("F:/HDFSinputfile/clk.tsv").map(_.split("\t"))
  .map(c => (c(1), Click(format.parse(c(0)), c(1), c(2).trim.toInt)))
clk.foreach(println)
println("test join")
reg.join(clk).take(2).foreach(println)

Result:
reg:
(15dfb8e6cc4111e3a5bb600308919594,Register(Sun Mar 02 00:00:00 CST 2014,15dfb8e6cc4111e3a5bb600308919594,1,33.659943,-117.95812))
(81da510acc4111e387f3600308919594,Register(Tue Mar 04 00:00:00 CST 2014,81da510acc4111e387f3600308919594,2,33.85701,-117.85574))
clk:
(15dfb8e6cc4111e3a5bb600308919594,Click(Tue Mar 04 00:00:00 CST 2014,15dfb8e6cc4111e3a5bb600308919594,11))
(81da510acc4111e387f3600308919594,Click(Thu Mar 06 00:00:00 CST 2014,81da510acc4111e387f3600308919594,61))
test join
(81da510acc4111e387f3600308919594,(Register(Tue Mar 04 00:00:00 CST 2014,81da510acc4111e387f3600308919594,2,33.85701,-117.85574),Click(Thu Mar 06 00:00:00 CST 2014,81da510acc4111e387f3600308919594,61)))
(15dfb8e6cc4111e3a5bb600308919594,(Register(Sun Mar 02 00:00:00 CST 2014,15dfb8e6cc4111e3a5bb600308919594,1,33.659943,-117.95812),Click(Tue Mar 04 00:00:00 CST 2014,15dfb8e6cc4111e3a5bb600308919594,11)))
