Spark val b = a.flatMap(x => 1 to x) explained
Source: Internet  Editor: 程序博客网  Date: 2024/06/09 23:11
flatMap
Like map, except that map produces exactly one output element for each element of the source RDD, while flatMap can produce multiple elements per input element; all of them together form the new RDD. Example: for each element x in the source RDD, generate x elements (the numbers from 1 to x).
val b = a.flatMap(x => 1 to x)
For each element of a, count up from 1 in steps of 1 until the element's value is reached, producing a list. For example: for the element 1, the list is 1; for the element 2, the list is 1, 2; and so on.
For example:
scala> val a = sc.parallelize(1 to 4, 2)
1. Generate 4 lists:
1
1, 2
1, 2, 3
1, 2, 3, 4
2. Merge the 4 lists:
1, 1, 2, 1, 2, 3, 1, 2, 3, 4
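The two steps above (generate the per-element lists, then merge them) can be sketched with plain Scala collections, which follow the same map/flatMap semantics as RDDs; this is a local sketch that runs without Spark, not the RDD code itself:

```scala
object FlatMapSteps extends App {
  val a = List(1, 2, 3, 4)

  // Step 1: map generates one list per element, giving a nested structure
  val lists = a.map(x => (1 to x).toList)
  // List(List(1), List(1, 2), List(1, 2, 3), List(1, 2, 3, 4))

  // Step 2: flatten merges the four lists into one
  val merged = lists.flatten
  // List(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)

  // flatMap does both steps in one call
  assert(merged == a.flatMap(x => 1 to x))
  println(merged)
}
```

This also shows why flatMap is often described as "map followed by flatten".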
scala> val a = sc.parallelize(1 to 4, 2)
scala> val b = a.flatMap(x => 1 to x)
scala> b.collect
res12: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)
The full REPL session (Spark INFO log lines trimmed for readability):

scala> val a = sc.parallelize(1 to 4, 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[73] at parallelize at <console>:22
scala> a.collect
res56: Array[Int] = Array(1, 2, 3, 4)
scala> val b = a.flatMap(x => 1 to x)
b: org.apache.spark.rdd.RDD[Int] = FlatMappedRDD[74] at flatMap at <console>:24
scala> b.collect
res57: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4)
scala> val a = sc.parallelize(1 to 2, 2)
a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[75] at parallelize at <console>:22
scala> a.collect
res58: Array[Int] = Array(1, 2)
scala> val b = a.flatMap(x => 1 to x)
b: org.apache.spark.rdd.RDD[Int] = FlatMappedRDD[76] at flatMap at <console>:24
scala> b.collect
res59: Array[Int] = Array(1, 1, 2)
Spark: the difference between map and flatMap
http://blog.csdn.net/u013361361/article/details/44463307
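The map-vs-flatMap contrast is commonly illustrated with word splitting; the sketch below uses local Scala collections rather than RDDs, and the sample sentences are hypothetical:

```scala
object MapVsFlatMap extends App {
  val lines = List("hello world", "spark flatMap")

  // map: one Array[String] per line, so the result stays nested
  val nested = lines.map(_.split(" "))
  // List(Array("hello", "world"), Array("spark", "flatMap"))

  // flatMap: the per-line arrays are concatenated into a single flat list
  val words = lines.flatMap(_.split(" "))
  // List("hello", "world", "spark", "flatMap")

  println(nested.map(_.toList))
  println(words)
}
```

With an RDD of text lines, the same contrast applies: rdd.map(_.split(" ")) yields an RDD of arrays, while rdd.flatMap(_.split(" ")) yields an RDD of individual words.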