Spark Function Explained: coalesce


Function signature

def coalesce(numPartitions: Int, shuffle: Boolean = false)
    (implicit ord: Ordering[T] = null): RDD[T]

  Returns a new RDD whose number of partitions equals numPartitions. If shuffle is set to true, a shuffle is performed; with the default shuffle = false, coalesce can only reduce the number of partitions, not increase it.
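To build intuition for the shuffle-free path, here is a minimal plain-Scala sketch (no Spark dependency; `CoalesceSketch` and `groupPartitions` are hypothetical names, and the modulo grouping is a simplified stand-in for Spark's actual partition coalescer, which prefers locality-aware, contiguous groupings). The idea is that coalesce with shuffle = false merely assigns each existing parent partition to one of the numPartitions output groups, so no record is moved by a shuffle:

```scala
// Hypothetical sketch: coalesce(numPartitions, shuffle = false) builds a
// narrow dependency by grouping parent partitions into output partitions.
object CoalesceSketch {
  // Assign each parent partition index to one of numPartitions groups,
  // roughly evenly. Spark's real coalescer is locality-aware; this is a
  // simplified illustration using modulo assignment.
  def groupPartitions(parentPartitions: Int, numPartitions: Int): Seq[Seq[Int]] =
    (0 until parentPartitions)
      .groupBy(_ % numPartitions) // group parent indices into output buckets
      .toSeq
      .sortBy(_._1)               // order buckets by output partition id
      .map(_._2)

  def main(args: Array[String]): Unit = {
    // 30 parent partitions merged into 2 output partitions, as in the
    // example below: each output partition reads 15 parent partitions.
    val groups = groupPartitions(parentPartitions = 30, numPartitions = 2)
    println(groups.map(_.size))
  }
}
```

Because each output partition simply reads a fixed set of parent partitions, this is a narrow dependency, which is why no shuffle stage appears in the lineage.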

Example

/**
 * User: 过往记忆
 * Date: 15-03-09
 * Time: 06:30 AM
 * blog: http://www.iteblog.com
 * Permalink: http://www.iteblog.com/archives/1279
 * The 过往记忆 blog focuses on Hadoop, Hive, Spark, Shark and Flume.
 * WeChat public account: iteblog_hadoop
 */
scala> var data = sc.parallelize(List(1, 2, 3, 4))
data: org.apache.spark.rdd.RDD[Int] =
    ParallelCollectionRDD[45] at parallelize at <console>:12

scala> data.partitions.length
res68: Int = 30

scala> val result = data.coalesce(2, false)
result: org.apache.spark.rdd.RDD[Int] = CoalescedRDD[57] at coalesce at <console>:14

scala> result.partitions.length
res77: Int = 2

scala> result.toDebugString
res75: String =
(2) CoalescedRDD[57] at coalesce at <console>:14 []
 |  ParallelCollectionRDD[45] at parallelize at <console>:12 []

scala> val result1 = data.coalesce(2, true)
result1: org.apache.spark.rdd.RDD[Int] = MappedRDD[61] at coalesce at <console>:14

scala> result1.toDebugString
res76: String =
(2) MappedRDD[61] at coalesce at <console>:14 []
 |  CoalescedRDD[60] at coalesce at <console>:14 []
 |  ShuffledRDD[59] at coalesce at <console>:14 []
 +-(30) MapPartitionsRDD[58] at coalesce at <console>:14 []
    |   ParallelCollectionRDD[45] at parallelize at <console>:12 []

  The lineage output above shows that no shuffle takes place when shuffle is false (the CoalescedRDD sits directly on the parent), whereas setting it to true inserts a ShuffledRDD into the lineage. RDD.partitions.length returns the number of partitions of an RDD.
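The shuffle = true path behaves differently: every record is reassigned to a target partition by a distributing key, which is why it can also spread data over more partitions than before. A minimal plain-Scala sketch of that redistribution idea (no Spark dependency; `ShuffleCoalesceSketch` and `redistribute` are hypothetical names, and the round-robin index key stands in for the randomized key Spark actually uses):

```scala
// Hypothetical sketch: with shuffle = true, records (not whole partitions)
// are reassigned to output partitions, so the partition count can grow
// as well as shrink.
object ShuffleCoalesceSketch {
  // Spread records over numPartitions buckets. Spark keys each record with
  // a randomized value before shuffling; here we use the record's index
  // round-robin as a deterministic stand-in.
  def redistribute[T](records: Seq[T], numPartitions: Int): Map[Int, Seq[T]] =
    records.zipWithIndex
      .groupBy { case (_, i) => i % numPartitions } // stand-in shuffle key
      .map { case (part, recs) => part -> recs.map(_._1) }

  def main(args: Array[String]): Unit = {
    // Four records spread over two output partitions.
    println(redistribute(Seq(1, 2, 3, 4), numPartitions = 2))
  }
}
```

This per-record reassignment is exactly what shows up in the lineage above as the ShuffledRDD stage.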
