Spark RDD Transformation and Action Operations

1. Start spark-shell, using test.txt in the root directory as the example file
scala> sc
res30: org.apache.spark.SparkContext = org.apache.spark.SparkContext@68fda8

scala> val file = sc.textFile("test.txt")
15/12/09 14:00:40 INFO MemoryStore: ensureFreeSpace(191856) called with curMem=1362582, maxMem=277877882
15/12/09 14:00:40 INFO MemoryStore: Block broadcast_28 stored as values in memory (estimated size 187.4 KB, free 263.5 MB)
15/12/09 14:00:40 INFO BlockManagerInfo: Removed broadcast_27_piece0 on 10.28.23.201:51294 in memory (size: 1778.0 B, free: 264.9 MB)
15/12/09 14:00:40 INFO BlockManagerInfo: Removed broadcast_27_piece0 on 10.28.23.202:50706 in memory (size: 1778.0 B, free: 264.9 MB)
15/12/09 14:00:40 INFO BlockManagerInfo: Removed broadcast_27_piece0 on 10.28.23.201:60179 in memory (size: 1778.0 B, free: 264.9 MB)
15/12/09 14:00:40 INFO MemoryStore: ensureFreeSpace(19750) called with curMem=1549548, maxMem=277877882
15/12/09 14:00:40 INFO MemoryStore: Block broadcast_28_piece0 stored as bytes in memory (estimated size 19.3 KB, free 263.5 MB)
15/12/09 14:00:40 INFO BlockManagerInfo: Removed broadcast_25_piece0 on 10.28.23.201:60179 in memory (size: 1901.0 B, free: 264.9 MB)
15/12/09 14:00:40 INFO BlockManagerInfo: Removed broadcast_25_piece0 on 10.28.23.203:57813 in memory (size: 1901.0 B, free: 264.9 MB)
15/12/09 14:00:40 INFO BlockManagerInfo: Removed broadcast_24_piece0 on 10.28.23.201:60179 in memory (size: 1901.0 B, free: 264.9 MB)
15/12/09 14:00:41 INFO BlockManagerInfo: Removed broadcast_24_piece0 on 10.28.23.202:50706 in memory (size: 1901.0 B, free: 264.9 MB)
15/12/09 14:00:41 INFO BlockManagerInfo: Added broadcast_28_piece0 in memory on 10.28.23.201:60179 (size: 19.3 KB, free: 264.9 MB)
15/12/09 14:00:41 INFO SparkContext: Created broadcast 28 from textFile at <console>:21
file: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[21] at textFile at <console>:21


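2. The file's contents are as follows (toArray is deprecated, hence the warning; collect returns the same array)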
scala> file.toArray
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
15/12/09 14:02:50 INFO FileInputFormat: Total input paths to process : 1
15/12/09 14:02:50 INFO SparkContext: Starting job: toArray at <console>:24
15/12/09 14:02:50 INFO DAGScheduler: Got job 21 (toArray at <console>:24) with 2 output partitions (allowLocal=false)
15/12/09 14:02:50 INFO DAGScheduler: Final stage: ResultStage 21(toArray at <console>:24)
15/12/09 14:02:50 INFO DAGScheduler: Parents of final stage: List()
15/12/09 14:02:50 INFO DAGScheduler: Missing parents: List()
15/12/09 14:02:50 INFO DAGScheduler: Submitting ResultStage 21 (MapPartitionsRDD[21] at textFile at <console>:21), which has no missing parents
15/12/09 14:02:50 INFO MemoryStore: ensureFreeSpace(3112) called with curMem=1558888, maxMem=277877882
15/12/09 14:02:50 INFO MemoryStore: Block broadcast_29 stored as values in memory (estimated size 3.0 KB, free 263.5 MB)
15/12/09 14:02:50 INFO MemoryStore: ensureFreeSpace(1778) called with curMem=1562000, maxMem=277877882
15/12/09 14:02:50 INFO MemoryStore: Block broadcast_29_piece0 stored as bytes in memory (estimated size 1778.0 B, free 263.5 MB)
15/12/09 14:02:50 INFO BlockManagerInfo: Added broadcast_29_piece0 in memory on 10.28.23.201:60179 (size: 1778.0 B, free: 264.9 MB)
15/12/09 14:02:50 INFO SparkContext: Created broadcast 29 from broadcast at DAGScheduler.scala:874
15/12/09 14:02:50 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 21 (MapPartitionsRDD[21] at textFile at <console>:21)
15/12/09 14:02:50 INFO TaskSchedulerImpl: Adding task set 21.0 with 2 tasks
15/12/09 14:02:50 INFO TaskSetManager: Starting task 0.0 in stage 21.0 (TID 33, 10.28.23.202, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:02:50 INFO TaskSetManager: Starting task 1.0 in stage 21.0 (TID 34, 10.28.23.201, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:02:50 INFO BlockManagerInfo: Added broadcast_29_piece0 in memory on 10.28.23.202:50706 (size: 1778.0 B, free: 264.9 MB)
15/12/09 14:02:50 INFO BlockManagerInfo: Added broadcast_28_piece0 in memory on 10.28.23.202:50706 (size: 19.3 KB, free: 264.9 MB)
15/12/09 14:02:50 INFO BlockManagerInfo: Added broadcast_29_piece0 in memory on 10.28.23.201:51294 (size: 1778.0 B, free: 264.9 MB)
15/12/09 14:02:50 INFO BlockManagerInfo: Added broadcast_28_piece0 in memory on 10.28.23.201:51294 (size: 19.3 KB, free: 264.9 MB)
15/12/09 14:02:50 INFO TaskSetManager: Finished task 0.0 in stage 21.0 (TID 33) in 91 ms on 10.28.23.202 (1/2)
15/12/09 14:02:50 INFO TaskSetManager: Finished task 1.0 in stage 21.0 (TID 34) in 247 ms on 10.28.23.201 (2/2)
15/12/09 14:02:50 INFO TaskSchedulerImpl: Removed TaskSet 21.0, whose tasks have all completed, from pool 
15/12/09 14:02:50 INFO DAGScheduler: ResultStage 21 (toArray at <console>:24) finished in 0.248 s
15/12/09 14:02:50 INFO DAGScheduler: Job 21 finished: toArray at <console>:24, took 0.290135 s
res31: Array[String] = Array(demo test can you see the best, walking in the sun, hello world, hello spark)


filter: find the lines that contain "hello"
scala> val filterLine = file.filter(_.contains("hello"))
filterLine: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[22] at filter at <console>:23
scala> filterLine.toArray
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
15/12/09 14:04:40 INFO SparkContext: Starting job: toArray at <console>:26
15/12/09 14:04:40 INFO DAGScheduler: Got job 22 (toArray at <console>:26) with 2 output partitions (allowLocal=false)
15/12/09 14:04:40 INFO DAGScheduler: Final stage: ResultStage 22(toArray at <console>:26)
15/12/09 14:04:40 INFO DAGScheduler: Parents of final stage: List()
15/12/09 14:04:40 INFO DAGScheduler: Missing parents: List()
15/12/09 14:04:40 INFO DAGScheduler: Submitting ResultStage 22 (MapPartitionsRDD[22] at filter at <console>:23), which has no missing parents
15/12/09 14:04:40 INFO MemoryStore: ensureFreeSpace(3336) called with curMem=1563778, maxMem=277877882
15/12/09 14:04:40 INFO MemoryStore: Block broadcast_30 stored as values in memory (estimated size 3.3 KB, free 263.5 MB)
15/12/09 14:04:40 INFO MemoryStore: ensureFreeSpace(1889) called with curMem=1567114, maxMem=277877882
15/12/09 14:04:40 INFO MemoryStore: Block broadcast_30_piece0 stored as bytes in memory (estimated size 1889.0 B, free 263.5 MB)
15/12/09 14:04:40 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on 10.28.23.201:60179 (size: 1889.0 B, free: 264.9 MB)
15/12/09 14:04:40 INFO SparkContext: Created broadcast 30 from broadcast at DAGScheduler.scala:874
15/12/09 14:04:40 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 22 (MapPartitionsRDD[22] at filter at <console>:23)
15/12/09 14:04:40 INFO TaskSchedulerImpl: Adding task set 22.0 with 2 tasks
15/12/09 14:04:40 INFO TaskSetManager: Starting task 0.0 in stage 22.0 (TID 35, 10.28.23.203, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:04:40 INFO TaskSetManager: Starting task 1.0 in stage 22.0 (TID 36, 10.28.23.202, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:04:40 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on 10.28.23.203:57813 (size: 1889.0 B, free: 264.9 MB)
15/12/09 14:04:40 INFO BlockManagerInfo: Added broadcast_30_piece0 in memory on 10.28.23.202:50706 (size: 1889.0 B, free: 264.9 MB)
15/12/09 14:04:40 INFO TaskSetManager: Finished task 1.0 in stage 22.0 (TID 36) in 67 ms on 10.28.23.202 (1/2)
15/12/09 14:04:40 INFO BlockManagerInfo: Added broadcast_28_piece0 in memory on 10.28.23.203:57813 (size: 19.3 KB, free: 264.9 MB)
15/12/09 14:04:40 INFO TaskSetManager: Finished task 0.0 in stage 22.0 (TID 35) in 144 ms on 10.28.23.203 (2/2)
15/12/09 14:04:40 INFO TaskSchedulerImpl: Removed TaskSet 22.0, whose tasks have all completed, from pool 
15/12/09 14:04:40 INFO DAGScheduler: ResultStage 22 (toArray at <console>:26) finished in 0.145 s
15/12/09 14:04:40 INFO DAGScheduler: Job 22 finished: toArray at <console>:26, took 0.176616 s
res32: Array[String] = Array(hello world, hello spark)
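
In the transcript above, filter returns an RDD immediately and no job starts until toArray is called. The following is a minimal local sketch of that transformation/action split, using a plain Scala Iterator rather than Spark. The object name LazyFilterSketch and the predicateCalls counter are illustrative assumptions, and the four strings stand in for the contents of test.txt.

```scala
object LazyFilterSketch {
  // Local stand-in for test.txt: the four lines shown in the transcript.
  val lines = Seq(
    "demo test can you see the best",
    "walking in the sun",
    "hello world",
    "hello spark")

  // Counts how many times the predicate has actually run.
  var predicateCalls = 0

  // Like a transformation: Iterator.filter only records the work to do.
  val pending = lines.iterator.filter { line =>
    predicateCalls += 1
    line.contains("hello")
  }
  val callsBeforeAction = predicateCalls

  // Like an action: toList forces evaluation and returns concrete data.
  val result = pending.toList
  val callsAfterAction = predicateCalls
}
```

Here callsBeforeAction stays at 0 and only becomes 4 once toList runs, which mirrors why the Spark log shows a job only after toArray.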


map: split each filtered line into an array of words
scala> filterLine.map(line=>line.split(" "))
res33: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[23] at map at <console>:26
scala> filterLine.map(line=>line.split(" ")) take 4
15/12/09 14:06:32 INFO SparkContext: Starting job: take at <console>:26
15/12/09 14:06:33 INFO DAGScheduler: Got job 23 (take at <console>:26) with 1 output partitions (allowLocal=true)
15/12/09 14:06:33 INFO DAGScheduler: Final stage: ResultStage 23(take at <console>:26)
15/12/09 14:06:33 INFO DAGScheduler: Parents of final stage: List()
15/12/09 14:06:33 INFO DAGScheduler: Missing parents: List()
15/12/09 14:06:33 INFO DAGScheduler: Submitting ResultStage 23 (MapPartitionsRDD[24] at map at <console>:26), which has no missing parents
15/12/09 14:06:33 INFO MemoryStore: ensureFreeSpace(3528) called with curMem=1569003, maxMem=277877882
15/12/09 14:06:33 INFO MemoryStore: Block broadcast_31 stored as values in memory (estimated size 3.4 KB, free 263.5 MB)
15/12/09 14:06:33 INFO MemoryStore: ensureFreeSpace(1971) called with curMem=1572531, maxMem=277877882
15/12/09 14:06:33 INFO MemoryStore: Block broadcast_31_piece0 stored as bytes in memory (estimated size 1971.0 B, free 263.5 MB)
15/12/09 14:06:33 INFO BlockManagerInfo: Removed broadcast_30_piece0 on 10.28.23.202:50706 in memory (size: 1889.0 B, free: 264.9 MB)
15/12/09 14:06:33 INFO BlockManagerInfo: Removed broadcast_30_piece0 on 10.28.23.201:60179 in memory (size: 1889.0 B, free: 264.9 MB)
15/12/09 14:06:33 INFO BlockManagerInfo: Added broadcast_31_piece0 in memory on 10.28.23.201:60179 (size: 1971.0 B, free: 264.9 MB)
15/12/09 14:06:33 INFO BlockManagerInfo: Removed broadcast_30_piece0 on 10.28.23.203:57813 in memory (size: 1889.0 B, free: 264.9 MB)
15/12/09 14:06:33 INFO SparkContext: Created broadcast 31 from broadcast at DAGScheduler.scala:874
15/12/09 14:06:33 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 23 (MapPartitionsRDD[24] at map at <console>:26)
15/12/09 14:06:33 INFO TaskSchedulerImpl: Adding task set 23.0 with 1 tasks
15/12/09 14:06:33 INFO TaskSetManager: Starting task 0.0 in stage 23.0 (TID 37, 10.28.23.201, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:06:33 INFO BlockManagerInfo: Removed broadcast_29_piece0 on 10.28.23.201:60179 in memory (size: 1778.0 B, free: 264.9 MB)
15/12/09 14:06:33 INFO BlockManagerInfo: Removed broadcast_29_piece0 on 10.28.23.201:51294 in memory (size: 1778.0 B, free: 264.9 MB)
15/12/09 14:06:33 INFO BlockManagerInfo: Removed broadcast_29_piece0 on 10.28.23.202:50706 in memory (size: 1778.0 B, free: 264.9 MB)
15/12/09 14:06:33 INFO BlockManagerInfo: Added broadcast_31_piece0 in memory on 10.28.23.201:51294 (size: 1971.0 B, free: 264.9 MB)
15/12/09 14:06:33 INFO TaskSetManager: Finished task 0.0 in stage 23.0 (TID 37) in 536 ms on 10.28.23.201 (1/1)
15/12/09 14:06:33 INFO DAGScheduler: ResultStage 23 (take at <console>:26) finished in 0.534 s
15/12/09 14:06:33 INFO TaskSchedulerImpl: Removed TaskSet 23.0, whose tasks have all completed, from pool 
15/12/09 14:06:33 INFO DAGScheduler: Job 23 finished: take at <console>:26, took 0.666443 s
15/12/09 14:06:33 INFO SparkContext: Starting job: take at <console>:26
15/12/09 14:06:33 INFO DAGScheduler: Got job 24 (take at <console>:26) with 1 output partitions (allowLocal=true)
15/12/09 14:06:33 INFO DAGScheduler: Final stage: ResultStage 24(take at <console>:26)
15/12/09 14:06:33 INFO DAGScheduler: Parents of final stage: List()
15/12/09 14:06:33 INFO DAGScheduler: Missing parents: List()
15/12/09 14:06:33 INFO DAGScheduler: Submitting ResultStage 24 (MapPartitionsRDD[24] at map at <console>:26), which has no missing parents
15/12/09 14:06:33 INFO MemoryStore: ensureFreeSpace(3528) called with curMem=1564387, maxMem=277877882
15/12/09 14:06:33 INFO MemoryStore: Block broadcast_32 stored as values in memory (estimated size 3.4 KB, free 263.5 MB)
15/12/09 14:06:33 INFO MemoryStore: ensureFreeSpace(1971) called with curMem=1567915, maxMem=277877882
15/12/09 14:06:33 INFO MemoryStore: Block broadcast_32_piece0 stored as bytes in memory (estimated size 1971.0 B, free 263.5 MB)
15/12/09 14:06:33 INFO BlockManagerInfo: Added broadcast_32_piece0 in memory on 10.28.23.201:60179 (size: 1971.0 B, free: 264.9 MB)
15/12/09 14:06:33 INFO SparkContext: Created broadcast 32 from broadcast at DAGScheduler.scala:874
15/12/09 14:06:33 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 24 (MapPartitionsRDD[24] at map at <console>:26)
15/12/09 14:06:33 INFO TaskSchedulerImpl: Adding task set 24.0 with 1 tasks
15/12/09 14:06:33 INFO TaskSetManager: Starting task 0.0 in stage 24.0 (TID 38, 10.28.23.202, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:06:33 INFO BlockManagerInfo: Added broadcast_32_piece0 in memory on 10.28.23.202:50706 (size: 1971.0 B, free: 264.9 MB)
15/12/09 14:06:33 INFO TaskSetManager: Finished task 0.0 in stage 24.0 (TID 38) in 39 ms on 10.28.23.202 (1/1)
15/12/09 14:06:33 INFO TaskSchedulerImpl: Removed TaskSet 24.0, whose tasks have all completed, from pool 
15/12/09 14:06:33 INFO DAGScheduler: ResultStage 24 (take at <console>:26) finished in 0.039 s
15/12/09 14:06:33 INFO DAGScheduler: Job 24 finished: take at <console>:26, took 0.072027 s
res34: Array[Array[String]] = Array(Array(hello, world), Array(hello, spark))
The final output of map is an array of arrays: Array(array1, array2, ...)


flatMap flattens the data
scala> file.flatMap(line=>line.split(" ")) take 20
15/12/09 14:09:07 INFO SparkContext: Starting job: take at <console>:24
15/12/09 14:09:07 INFO DAGScheduler: Got job 26 (take at <console>:24) with 1 output partitions (allowLocal=true)
15/12/09 14:09:07 INFO DAGScheduler: Final stage: ResultStage 26(take at <console>:24)
15/12/09 14:09:07 INFO DAGScheduler: Parents of final stage: List()
15/12/09 14:09:07 INFO DAGScheduler: Missing parents: List()
15/12/09 14:09:07 INFO DAGScheduler: Submitting ResultStage 26 (MapPartitionsRDD[26] at flatMap at <console>:24), which has no missing parents
15/12/09 14:09:07 INFO MemoryStore: ensureFreeSpace(3360) called with curMem=1575165, maxMem=277877882
15/12/09 14:09:07 INFO MemoryStore: Block broadcast_34 stored as values in memory (estimated size 3.3 KB, free 263.5 MB)
15/12/09 14:09:07 INFO MemoryStore: ensureFreeSpace(1923) called with curMem=1578525, maxMem=277877882
15/12/09 14:09:07 INFO MemoryStore: Block broadcast_34_piece0 stored as bytes in memory (estimated size 1923.0 B, free 263.5 MB)
15/12/09 14:09:07 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory on 10.28.23.201:60179 (size: 1923.0 B, free: 264.8 MB)
15/12/09 14:09:07 INFO SparkContext: Created broadcast 34 from broadcast at DAGScheduler.scala:874
15/12/09 14:09:07 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 26 (MapPartitionsRDD[26] at flatMap at <console>:24)
15/12/09 14:09:07 INFO TaskSchedulerImpl: Adding task set 26.0 with 1 tasks
15/12/09 14:09:07 INFO TaskSetManager: Starting task 0.0 in stage 26.0 (TID 40, 10.28.23.202, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:09:07 INFO BlockManagerInfo: Added broadcast_34_piece0 in memory on 10.28.23.202:50706 (size: 1923.0 B, free: 264.9 MB)
15/12/09 14:09:07 INFO TaskSetManager: Finished task 0.0 in stage 26.0 (TID 40) in 64 ms on 10.28.23.202 (1/1)
15/12/09 14:09:07 INFO TaskSchedulerImpl: Removed TaskSet 26.0, whose tasks have all completed, from pool 
15/12/09 14:09:07 INFO DAGScheduler: ResultStage 26 (take at <console>:24) finished in 0.065 s
15/12/09 14:09:07 INFO DAGScheduler: Job 26 finished: take at <console>:24, took 0.113043 s
15/12/09 14:09:07 INFO SparkContext: Starting job: take at <console>:24
15/12/09 14:09:07 INFO DAGScheduler: Got job 27 (take at <console>:24) with 1 output partitions (allowLocal=true)
15/12/09 14:09:07 INFO DAGScheduler: Final stage: ResultStage 27(take at <console>:24)
15/12/09 14:09:07 INFO DAGScheduler: Parents of final stage: List()
15/12/09 14:09:07 INFO DAGScheduler: Missing parents: List()
15/12/09 14:09:07 INFO DAGScheduler: Submitting ResultStage 27 (MapPartitionsRDD[26] at flatMap at <console>:24), which has no missing parents
15/12/09 14:09:07 INFO MemoryStore: ensureFreeSpace(3360) called with curMem=1580448, maxMem=277877882
15/12/09 14:09:07 INFO MemoryStore: Block broadcast_35 stored as values in memory (estimated size 3.3 KB, free 263.5 MB)
15/12/09 14:09:07 INFO MemoryStore: ensureFreeSpace(1923) called with curMem=1583808, maxMem=277877882
15/12/09 14:09:07 INFO MemoryStore: Block broadcast_35_piece0 stored as bytes in memory (estimated size 1923.0 B, free 263.5 MB)
15/12/09 14:09:07 INFO BlockManagerInfo: Added broadcast_35_piece0 in memory on 10.28.23.201:60179 (size: 1923.0 B, free: 264.8 MB)
15/12/09 14:09:07 INFO SparkContext: Created broadcast 35 from broadcast at DAGScheduler.scala:874
15/12/09 14:09:07 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 27 (MapPartitionsRDD[26] at flatMap at <console>:24)
15/12/09 14:09:07 INFO TaskSchedulerImpl: Adding task set 27.0 with 1 tasks
15/12/09 14:09:07 INFO TaskSetManager: Starting task 0.0 in stage 27.0 (TID 41, 10.28.23.201, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:09:08 INFO BlockManagerInfo: Added broadcast_35_piece0 in memory on 10.28.23.201:51294 (size: 1923.0 B, free: 264.9 MB)
15/12/09 14:09:08 INFO TaskSetManager: Finished task 0.0 in stage 27.0 (TID 41) in 312 ms on 10.28.23.201 (1/1)
15/12/09 14:09:08 INFO TaskSchedulerImpl: Removed TaskSet 27.0, whose tasks have all completed, from pool 
15/12/09 14:09:08 INFO DAGScheduler: ResultStage 27 (take at <console>:24) finished in 0.311 s
15/12/09 14:09:08 INFO DAGScheduler: Job 27 finished: take at <console>:24, took 0.324655 s
res36: Array[String] = Array(demo, test, can, you, see, the, best, walking, in, the, sun, hello, world, hello, spark)
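
The difference between the map and flatMap results above can be reproduced on an ordinary Scala collection. This is a local sketch, not Spark code. The object name MapVsFlatMapSketch is an assumption, and only the two "hello" lines are used so that both results stay short.

```scala
object MapVsFlatMapSketch {
  // The two lines that survived the earlier filter step.
  val lines = Seq("hello world", "hello spark")

  // map keeps one output element per input line, so the result is nested.
  val nested: Seq[Array[String]] = lines.map(line => line.split(" "))

  // flatMap concatenates the per-line arrays into a single flat sequence.
  val flat: Seq[String] = lines.flatMap(line => line.split(" "))
}
```

nested holds two arrays of two words each, while flat holds the same four words in one sequence, matching the shapes of res34 and res36.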


union
scala> val unionLine = file.union(file)
unionLine: org.apache.spark.rdd.RDD[String] = UnionRDD[30] at union at <console>:23


scala> unionLine.toArray
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
15/12/09 14:11:43 INFO SparkContext: Starting job: toArray at <console>:26
15/12/09 14:11:43 INFO DAGScheduler: Got job 31 (toArray at <console>:26) with 4 output partitions (allowLocal=false)
15/12/09 14:11:43 INFO DAGScheduler: Final stage: ResultStage 31(toArray at <console>:26)
15/12/09 14:11:43 INFO DAGScheduler: Parents of final stage: List()
15/12/09 14:11:43 INFO DAGScheduler: Missing parents: List()
15/12/09 14:11:43 INFO DAGScheduler: Submitting ResultStage 31 (UnionRDD[30] at union at <console>:23), which has no missing parents
15/12/09 14:11:43 INFO MemoryStore: ensureFreeSpace(3688) called with curMem=1564712, maxMem=277877882
15/12/09 14:11:43 INFO MemoryStore: Block broadcast_39 stored as values in memory (estimated size 3.6 KB, free 263.5 MB)
15/12/09 14:11:43 INFO MemoryStore: ensureFreeSpace(2158) called with curMem=1568400, maxMem=277877882
15/12/09 14:11:43 INFO MemoryStore: Block broadcast_39_piece0 stored as bytes in memory (estimated size 2.1 KB, free 263.5 MB)
15/12/09 14:11:43 INFO BlockManagerInfo: Added broadcast_39_piece0 in memory on 10.28.23.201:60179 (size: 2.1 KB, free: 264.9 MB)
15/12/09 14:11:43 INFO SparkContext: Created broadcast 39 from broadcast at DAGScheduler.scala:874
15/12/09 14:11:43 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 31 (UnionRDD[30] at union at <console>:23)
15/12/09 14:11:43 INFO TaskSchedulerImpl: Adding task set 31.0 with 4 tasks
15/12/09 14:11:43 INFO TaskSetManager: Starting task 0.0 in stage 31.0 (TID 50, 10.28.23.203, PROCESS_LOCAL, 1506 bytes)
15/12/09 14:11:43 INFO TaskSetManager: Starting task 1.0 in stage 31.0 (TID 51, 10.28.23.202, PROCESS_LOCAL, 1506 bytes)
15/12/09 14:11:43 INFO TaskSetManager: Starting task 2.0 in stage 31.0 (TID 52, 10.28.23.201, PROCESS_LOCAL, 1506 bytes)
15/12/09 14:11:43 INFO TaskSetManager: Starting task 3.0 in stage 31.0 (TID 53, 10.28.23.203, PROCESS_LOCAL, 1506 bytes)
15/12/09 14:11:43 INFO BlockManagerInfo: Added broadcast_39_piece0 in memory on 10.28.23.202:50706 (size: 2.1 KB, free: 264.9 MB)
15/12/09 14:11:43 INFO BlockManagerInfo: Added broadcast_39_piece0 in memory on 10.28.23.203:57813 (size: 2.1 KB, free: 264.9 MB)
15/12/09 14:11:43 INFO TaskSetManager: Finished task 1.0 in stage 31.0 (TID 51) in 47 ms on 10.28.23.202 (1/4)
15/12/09 14:11:43 INFO BlockManagerInfo: Added broadcast_39_piece0 in memory on 10.28.23.201:51294 (size: 2.1 KB, free: 264.9 MB)
15/12/09 14:11:43 INFO TaskSetManager: Finished task 0.0 in stage 31.0 (TID 50) in 70 ms on 10.28.23.203 (2/4)
15/12/09 14:11:43 INFO TaskSetManager: Finished task 3.0 in stage 31.0 (TID 53) in 69 ms on 10.28.23.203 (3/4)
15/12/09 14:11:43 INFO TaskSetManager: Finished task 2.0 in stage 31.0 (TID 52) in 89 ms on 10.28.23.201 (4/4)
15/12/09 14:11:43 INFO TaskSchedulerImpl: Removed TaskSet 31.0, whose tasks have all completed, from pool 
15/12/09 14:11:43 INFO DAGScheduler: ResultStage 31 (toArray at <console>:26) finished in 0.091 s
15/12/09 14:11:43 INFO DAGScheduler: Job 31 finished: toArray at <console>:26, took 0.109727 s

res43: Array[String] = Array(demo test can you see the best, walking in the sun, hello world, hello spark, demo test can you see the best, walking in the sun, hello world, hello spark)
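
As res43 shows, union concatenates the two inputs and keeps duplicates, so every line appears twice. A local sketch of the same behavior with plain Scala collections follows. The object name UnionSketch is an assumption, and ++ is used here only as the closest collection analogue of RDD union.

```scala
object UnionSketch {
  // Local stand-in for test.txt.
  val lines = Seq(
    "demo test can you see the best",
    "walking in the sun",
    "hello world",
    "hello spark")

  // Concatenation keeps duplicates, like the union result in the transcript.
  val unioned = lines ++ lines

  // A set-style union needs an explicit distinct afterwards.
  val deduplicated = unioned.distinct
}
```

unioned has eight elements, and deduplicated is back to the original four lines.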


flatMap and map: build (word, 1) pairs
scala> val wordcount = file.flatMap(line=>line.split(" ")).map(word=>(word,1))
wordcount: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[42] at map at <console>:23


scala> wordcount.toArray
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
15/12/09 14:19:21 INFO SparkContext: Starting job: toArray at <console>:26
15/12/09 14:19:21 INFO DAGScheduler: Got job 36 (toArray at <console>:26) with 2 output partitions (allowLocal=false)
15/12/09 14:19:21 INFO DAGScheduler: Final stage: ResultStage 36(toArray at <console>:26)
15/12/09 14:19:21 INFO DAGScheduler: Parents of final stage: List()
15/12/09 14:19:21 INFO DAGScheduler: Missing parents: List()
15/12/09 14:19:21 INFO DAGScheduler: Submitting ResultStage 36 (MapPartitionsRDD[42] at map at <console>:23), which has no missing parents
15/12/09 14:19:21 INFO MemoryStore: ensureFreeSpace(3488) called with curMem=1564303, maxMem=277877882
15/12/09 14:19:21 INFO MemoryStore: Block broadcast_44 stored as values in memory (estimated size 3.4 KB, free 263.5 MB)
15/12/09 14:19:21 INFO MemoryStore: ensureFreeSpace(1927) called with curMem=1567791, maxMem=277877882
15/12/09 14:19:21 INFO MemoryStore: Block broadcast_44_piece0 stored as bytes in memory (estimated size 1927.0 B, free 263.5 MB)
15/12/09 14:19:21 INFO BlockManagerInfo: Added broadcast_44_piece0 in memory on 10.28.23.201:60179 (size: 1927.0 B, free: 264.9 MB)
15/12/09 14:19:21 INFO SparkContext: Created broadcast 44 from broadcast at DAGScheduler.scala:874
15/12/09 14:19:21 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 36 (MapPartitionsRDD[42] at map at <console>:23)
15/12/09 14:19:21 INFO TaskSchedulerImpl: Adding task set 36.0 with 2 tasks
15/12/09 14:19:21 INFO TaskSetManager: Starting task 0.0 in stage 36.0 (TID 62, 10.28.23.201, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:19:21 INFO TaskSetManager: Starting task 1.0 in stage 36.0 (TID 63, 10.28.23.203, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:19:21 INFO BlockManagerInfo: Added broadcast_44_piece0 in memory on 10.28.23.201:51294 (size: 1927.0 B, free: 264.9 MB)
15/12/09 14:19:21 INFO BlockManagerInfo: Added broadcast_44_piece0 in memory on 10.28.23.203:57813 (size: 1927.0 B, free: 264.9 MB)
15/12/09 14:19:21 INFO TaskSetManager: Finished task 1.0 in stage 36.0 (TID 63) in 107 ms on 10.28.23.203 (1/2)
15/12/09 14:19:21 INFO TaskSetManager: Finished task 0.0 in stage 36.0 (TID 62) in 133 ms on 10.28.23.201 (2/2)
15/12/09 14:19:21 INFO TaskSchedulerImpl: Removed TaskSet 36.0, whose tasks have all completed, from pool 
15/12/09 14:19:21 INFO DAGScheduler: ResultStage 36 (toArray at <console>:26) finished in 0.141 s
15/12/09 14:19:21 INFO DAGScheduler: Job 36 finished: toArray at <console>:26, took 0.198194 s
res53: Array[(String, Int)] = Array((demo,1), (test,1), (can,1), (you,1), (see,1), (the,1), (best,1), (walking,1), (in,1), (the,1), (sun,1), (hello,1), (world,1), (hello,1), (spark,1))


groupByKey
scala> wordcount.groupByKey().toArray
warning: there were 1 deprecation warning(s); re-run with -deprecation for details
15/12/09 14:20:32 INFO SparkContext: Starting job: toArray at <console>:28
15/12/09 14:20:32 INFO DAGScheduler: Registering RDD 42 (map at <console>:23)
15/12/09 14:20:32 INFO DAGScheduler: Got job 37 (toArray at <console>:28) with 2 output partitions (allowLocal=false)
15/12/09 14:20:32 INFO DAGScheduler: Final stage: ResultStage 38(toArray at <console>:28)
15/12/09 14:20:32 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 37)
15/12/09 14:20:32 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 37)
15/12/09 14:20:32 INFO DAGScheduler: Submitting ShuffleMapStage 37 (MapPartitionsRDD[42] at map at <console>:23), which has no missing parents
15/12/09 14:20:32 INFO MemoryStore: ensureFreeSpace(4864) called with curMem=1569718, maxMem=277877882
15/12/09 14:20:32 INFO MemoryStore: Block broadcast_45 stored as values in memory (estimated size 4.8 KB, free 263.5 MB)
15/12/09 14:20:32 INFO MemoryStore: ensureFreeSpace(2564) called with curMem=1574582, maxMem=277877882
15/12/09 14:20:32 INFO MemoryStore: Block broadcast_45_piece0 stored as bytes in memory (estimated size 2.5 KB, free 263.5 MB)
15/12/09 14:20:32 INFO BlockManagerInfo: Added broadcast_45_piece0 in memory on 10.28.23.201:60179 (size: 2.5 KB, free: 264.8 MB)
15/12/09 14:20:32 INFO SparkContext: Created broadcast 45 from broadcast at DAGScheduler.scala:874
15/12/09 14:20:32 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 37 (MapPartitionsRDD[42] at map at <console>:23)
15/12/09 14:20:32 INFO TaskSchedulerImpl: Adding task set 37.0 with 2 tasks
15/12/09 14:20:32 INFO TaskSetManager: Starting task 0.0 in stage 37.0 (TID 64, 10.28.23.202, PROCESS_LOCAL, 1386 bytes)
15/12/09 14:20:32 INFO TaskSetManager: Starting task 1.0 in stage 37.0 (TID 65, 10.28.23.201, PROCESS_LOCAL, 1386 bytes)
15/12/09 14:20:32 INFO BlockManagerInfo: Added broadcast_45_piece0 in memory on 10.28.23.202:50706 (size: 2.5 KB, free: 264.9 MB)
15/12/09 14:20:32 INFO BlockManagerInfo: Added broadcast_45_piece0 in memory on 10.28.23.201:51294 (size: 2.5 KB, free: 264.9 MB)
15/12/09 14:20:32 INFO TaskSetManager: Finished task 0.0 in stage 37.0 (TID 64) in 274 ms on 10.28.23.202 (1/2)
15/12/09 14:20:34 INFO TaskSetManager: Finished task 1.0 in stage 37.0 (TID 65) in 1690 ms on 10.28.23.201 (2/2)
15/12/09 14:20:34 INFO TaskSchedulerImpl: Removed TaskSet 37.0, whose tasks have all completed, from pool 
15/12/09 14:20:34 INFO DAGScheduler: ShuffleMapStage 37 (map at <console>:23) finished in 1.691 s
15/12/09 14:20:34 INFO DAGScheduler: looking for newly runnable stages
15/12/09 14:20:34 INFO DAGScheduler: running: Set()
15/12/09 14:20:34 INFO DAGScheduler: waiting: Set(ResultStage 38)
15/12/09 14:20:34 INFO DAGScheduler: failed: Set()
15/12/09 14:20:34 INFO DAGScheduler: Missing parents for ResultStage 38: List()
15/12/09 14:20:34 INFO DAGScheduler: Submitting ResultStage 38 (ShuffledRDD[43] at groupByKey at <console>:25), which is now runnable
15/12/09 14:20:34 INFO MemoryStore: ensureFreeSpace(5384) called with curMem=1577146, maxMem=277877882
15/12/09 14:20:34 INFO MemoryStore: Block broadcast_46 stored as values in memory (estimated size 5.3 KB, free 263.5 MB)
15/12/09 14:20:34 INFO MemoryStore: ensureFreeSpace(2765) called with curMem=1582530, maxMem=277877882
15/12/09 14:20:34 INFO MemoryStore: Block broadcast_46_piece0 stored as bytes in memory (estimated size 2.7 KB, free 263.5 MB)
15/12/09 14:20:34 INFO BlockManagerInfo: Added broadcast_46_piece0 in memory on 10.28.23.201:60179 (size: 2.7 KB, free: 264.8 MB)
15/12/09 14:20:34 INFO SparkContext: Created broadcast 46 from broadcast at DAGScheduler.scala:874
15/12/09 14:20:34 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 38 (ShuffledRDD[43] at groupByKey at <console>:25)
15/12/09 14:20:34 INFO TaskSchedulerImpl: Adding task set 38.0 with 2 tasks
15/12/09 14:20:34 INFO TaskSetManager: Starting task 0.0 in stage 38.0 (TID 66, 10.28.23.203, PROCESS_LOCAL, 1165 bytes)
15/12/09 14:20:34 INFO TaskSetManager: Starting task 1.0 in stage 38.0 (TID 67, 10.28.23.202, PROCESS_LOCAL, 1165 bytes)
15/12/09 14:20:34 INFO BlockManagerInfo: Added broadcast_46_piece0 in memory on 10.28.23.202:50706 (size: 2.7 KB, free: 264.9 MB)
15/12/09 14:20:34 INFO BlockManagerInfo: Added broadcast_46_piece0 in memory on 10.28.23.203:57813 (size: 2.7 KB, free: 264.9 MB)
15/12/09 14:20:34 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 10.28.23.202:53700
15/12/09 14:20:34 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 0 is 162 bytes
15/12/09 14:20:34 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 0 to 10.28.23.203:56943
15/12/09 14:20:34 INFO TaskSetManager: Finished task 1.0 in stage 38.0 (TID 67) in 552 ms on 10.28.23.202 (1/2)
15/12/09 14:20:34 INFO TaskSetManager: Finished task 0.0 in stage 38.0 (TID 66) in 558 ms on 10.28.23.203 (2/2)
15/12/09 14:20:34 INFO TaskSchedulerImpl: Removed TaskSet 38.0, whose tasks have all completed, from pool 
15/12/09 14:20:34 INFO DAGScheduler: ResultStage 38 (toArray at <console>:28) finished in 0.558 s
15/12/09 14:20:34 INFO DAGScheduler: Job 37 finished: toArray at <console>:28, took 2.518733 s
res54: Array[(String, Iterable[Int])] = Array((can,CompactBuffer(1)), (best,CompactBuffer(1)), (hello,CompactBuffer(1, 1)), (sun,CompactBuffer(1)), (test,CompactBuffer(1)), (world,CompactBuffer(1)), (walking,CompactBuffer(1)), (spark,CompactBuffer(1)), (you,CompactBuffer(1)), (demo,CompactBuffer(1)), (in,CompactBuffer(1)), (see,CompactBuffer(1)), (the,CompactBuffer(1, 1)))
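
groupByKey collects all the 1s for each word, as in (hello,CompactBuffer(1, 1)) above. Summing each group then gives a word count. The sketch below mimics this on a local collection with groupBy. The object name WordCountSketch is an assumption, and the code is not the Spark API. In Spark itself, reduceByKey(_ + _) on the pair RDD is the more common way to get the sums.

```scala
object WordCountSketch {
  // Local stand-in for test.txt.
  val lines = Seq(
    "demo test can you see the best",
    "walking in the sun",
    "hello world",
    "hello spark")

  // Same shape as the wordcount RDD: one (word, 1) pair per word.
  val pairs: Seq[(String, Int)] = lines.flatMap(line => line.split(" ")).map(word => (word, 1))

  // Local analogue of groupByKey: each word maps to its list of 1s.
  val grouped: Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2)) }

  // Summing each group yields the count per word.
  val counts: Map[String, Int] =
    grouped.map { case (word, ones) => (word, ones.sum) }
}
```

With this input, "hello" and "the" each map to two 1s and therefore a count of 2, while the other eleven words have a count of 1.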


distinct
scala> wordcount.count()
15/12/09 14:29:52 INFO SparkContext: Starting job: count at <console>:26
15/12/09 14:29:52 INFO DAGScheduler: Got job 48 (count at <console>:26) with 2 output partitions (allowLocal=false)
15/12/09 14:29:52 INFO DAGScheduler: Final stage: ResultStage 58(count at <console>:26)
15/12/09 14:29:52 INFO DAGScheduler: Parents of final stage: List()
15/12/09 14:29:52 INFO DAGScheduler: Missing parents: List()
15/12/09 14:29:52 INFO DAGScheduler: Submitting ResultStage 58 (MapPartitionsRDD[42] at map at <console>:23), which has no missing parents
15/12/09 14:29:52 INFO MemoryStore: ensureFreeSpace(3344) called with curMem=1579816, maxMem=277877882
15/12/09 14:29:52 INFO MemoryStore: Block broadcast_65 stored as values in memory (estimated size 3.3 KB, free 263.5 MB)
15/12/09 14:29:52 INFO MemoryStore: ensureFreeSpace(1893) called with curMem=1583160, maxMem=277877882
15/12/09 14:29:52 INFO MemoryStore: Block broadcast_65_piece0 stored as bytes in memory (estimated size 1893.0 B, free 263.5 MB)
15/12/09 14:29:52 INFO BlockManagerInfo: Added broadcast_65_piece0 in memory on 10.28.23.201:60179 (size: 1893.0 B, free: 264.8 MB)
15/12/09 14:29:52 INFO SparkContext: Created broadcast 65 from broadcast at DAGScheduler.scala:874
15/12/09 14:29:52 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 58 (MapPartitionsRDD[42] at map at <console>:23)
15/12/09 14:29:52 INFO TaskSchedulerImpl: Adding task set 58.0 with 2 tasks
15/12/09 14:29:52 INFO TaskSetManager: Starting task 0.0 in stage 58.0 (TID 104, 10.28.23.203, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:29:52 INFO TaskSetManager: Starting task 1.0 in stage 58.0 (TID 105, 10.28.23.201, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:29:52 INFO BlockManagerInfo: Added broadcast_65_piece0 in memory on 10.28.23.201:51294 (size: 1893.0 B, free: 264.9 MB)
15/12/09 14:29:52 INFO BlockManagerInfo: Added broadcast_65_piece0 in memory on 10.28.23.203:57813 (size: 1893.0 B, free: 264.9 MB)
15/12/09 14:29:52 INFO TaskSetManager: Finished task 0.0 in stage 58.0 (TID 104) in 67 ms on 10.28.23.203 (1/2)
15/12/09 14:29:52 INFO TaskSetManager: Finished task 1.0 in stage 58.0 (TID 105) in 69 ms on 10.28.23.201 (2/2)
15/12/09 14:29:52 INFO DAGScheduler: ResultStage 58 (count at <console>:26) finished in 0.083 s
15/12/09 14:29:52 INFO TaskSchedulerImpl: Removed TaskSet 58.0, whose tasks have all completed, from pool 
15/12/09 14:29:52 INFO DAGScheduler: Job 48 finished: count at <console>:26, took 0.119783 s
res72: Long = 15

scala> wordcount.distinct.count()
15/12/09 14:30:10 INFO SparkContext: Starting job: count at <console>:26
15/12/09 14:30:10 INFO DAGScheduler: Registering RDD 52 (distinct at <console>:26)
15/12/09 14:30:10 INFO DAGScheduler: Got job 50 (count at <console>:26) with 2 output partitions (allowLocal=false)
15/12/09 14:30:10 INFO DAGScheduler: Final stage: ResultStage 61(count at <console>:26)
15/12/09 14:30:10 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 60)
15/12/09 14:30:10 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 60)
15/12/09 14:30:10 INFO DAGScheduler: Submitting ShuffleMapStage 60 (MapPartitionsRDD[52] at distinct at <console>:26), which has no missing parents
15/12/09 14:30:10 INFO MemoryStore: ensureFreeSpace(4232) called with curMem=1564303, maxMem=277877882
15/12/09 14:30:10 INFO MemoryStore: Block broadcast_67 stored as values in memory (estimated size 4.1 KB, free 263.5 MB)
15/12/09 14:30:10 INFO MemoryStore: ensureFreeSpace(2306) called with curMem=1568535, maxMem=277877882
15/12/09 14:30:10 INFO MemoryStore: Block broadcast_67_piece0 stored as bytes in memory (estimated size 2.3 KB, free 263.5 MB)
15/12/09 14:30:10 INFO BlockManagerInfo: Added broadcast_67_piece0 in memory on 10.28.23.201:60179 (size: 2.3 KB, free: 264.9 MB)
15/12/09 14:30:10 INFO SparkContext: Created broadcast 67 from broadcast at DAGScheduler.scala:874
15/12/09 14:30:10 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 60 (MapPartitionsRDD[52] at distinct at <console>:26)
15/12/09 14:30:10 INFO TaskSchedulerImpl: Adding task set 60.0 with 2 tasks
15/12/09 14:30:10 INFO TaskSetManager: Starting task 0.0 in stage 60.0 (TID 108, 10.28.23.203, PROCESS_LOCAL, 1386 bytes)
15/12/09 14:30:10 INFO TaskSetManager: Starting task 1.0 in stage 60.0 (TID 109, 10.28.23.201, PROCESS_LOCAL, 1386 bytes)
15/12/09 14:30:10 INFO BlockManagerInfo: Added broadcast_67_piece0 in memory on 10.28.23.201:51294 (size: 2.3 KB, free: 264.9 MB)
15/12/09 14:30:10 INFO BlockManagerInfo: Added broadcast_67_piece0 in memory on 10.28.23.203:57813 (size: 2.3 KB, free: 264.9 MB)
15/12/09 14:30:10 INFO TaskSetManager: Finished task 0.0 in stage 60.0 (TID 108) in 75 ms on 10.28.23.203 (1/2)
15/12/09 14:30:10 INFO TaskSetManager: Finished task 1.0 in stage 60.0 (TID 109) in 135 ms on 10.28.23.201 (2/2)
15/12/09 14:30:10 INFO TaskSchedulerImpl: Removed TaskSet 60.0, whose tasks have all completed, from pool 
15/12/09 14:30:10 INFO DAGScheduler: ShuffleMapStage 60 (distinct at <console>:26) finished in 0.135 s
15/12/09 14:30:10 INFO DAGScheduler: looking for newly runnable stages
15/12/09 14:30:10 INFO DAGScheduler: running: Set()
15/12/09 14:30:10 INFO DAGScheduler: waiting: Set(ResultStage 61)
15/12/09 14:30:10 INFO DAGScheduler: failed: Set()
15/12/09 14:30:10 INFO DAGScheduler: Missing parents for ResultStage 61: List()
15/12/09 14:30:10 INFO DAGScheduler: Submitting ResultStage 61 (MapPartitionsRDD[54] at distinct at <console>:26), which is now runnable
15/12/09 14:30:10 INFO MemoryStore: ensureFreeSpace(2584) called with curMem=1570841, maxMem=277877882
15/12/09 14:30:10 INFO MemoryStore: Block broadcast_68 stored as values in memory (estimated size 2.5 KB, free 263.5 MB)
15/12/09 14:30:10 INFO MemoryStore: ensureFreeSpace(1530) called with curMem=1573425, maxMem=277877882
15/12/09 14:30:10 INFO MemoryStore: Block broadcast_68_piece0 stored as bytes in memory (estimated size 1530.0 B, free 263.5 MB)
15/12/09 14:30:10 INFO BlockManagerInfo: Added broadcast_68_piece0 in memory on 10.28.23.201:60179 (size: 1530.0 B, free: 264.8 MB)
15/12/09 14:30:10 INFO SparkContext: Created broadcast 68 from broadcast at DAGScheduler.scala:874
15/12/09 14:30:10 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 61 (MapPartitionsRDD[54] at distinct at <console>:26)
15/12/09 14:30:10 INFO TaskSchedulerImpl: Adding task set 61.0 with 2 tasks
15/12/09 14:30:10 INFO TaskSetManager: Starting task 0.0 in stage 61.0 (TID 110, 10.28.23.201, PROCESS_LOCAL, 1165 bytes)
15/12/09 14:30:10 INFO TaskSetManager: Starting task 1.0 in stage 61.0 (TID 111, 10.28.23.203, PROCESS_LOCAL, 1165 bytes)
15/12/09 14:30:10 INFO BlockManagerInfo: Added broadcast_68_piece0 in memory on 10.28.23.201:51294 (size: 1530.0 B, free: 264.9 MB)
15/12/09 14:30:10 INFO BlockManagerInfo: Added broadcast_68_piece0 in memory on 10.28.23.203:57813 (size: 1530.0 B, free: 264.9 MB)
15/12/09 14:30:10 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 9 to 10.28.23.201:44784
15/12/09 14:30:10 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 9 is 163 bytes
15/12/09 14:30:10 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 9 to 10.28.23.203:56943
15/12/09 14:30:10 INFO TaskSetManager: Finished task 1.0 in stage 61.0 (TID 111) in 101 ms on 10.28.23.203 (1/2)
15/12/09 14:30:10 INFO TaskSetManager: Finished task 0.0 in stage 61.0 (TID 110) in 103 ms on 10.28.23.201 (2/2)
15/12/09 14:30:10 INFO TaskSchedulerImpl: Removed TaskSet 61.0, whose tasks have all completed, from pool 
15/12/09 14:30:10 INFO DAGScheduler: ResultStage 61 (count at <console>:26) finished in 0.103 s
15/12/09 14:30:10 INFO DAGScheduler: Job 50 finished: count at <console>:26, took 0.267361 s
res74: Long = 13
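
The two counts differ because wordcount holds 15 pairs and, judging by the sorted output further down, (hello,1) and (the,1) each appear twice, so distinct leaves 13. The extra ShuffleMapStage in the log comes from distinct itself: equal elements have to be brought together before duplicates can be dropped. Below is a minimal sketch, assuming it is pasted into spark-shell, where sc is predefined. The sample pairs and the names pairs, total, unique and viaReduce are made up for illustration. The map/reduceByKey form mirrors how RDD.distinct is written in the Spark source and is shown only to make the shuffle visible.

```scala
// Assumes spark-shell, where `sc` (the SparkContext) already exists.
// The sample data and all value names below are illustrative only.
val pairs = sc.parallelize(Seq(
  ("hello", 1), ("world", 1), ("hello", 1),
  ("the", 1), ("spark", 1), ("the", 1)), 2)

// count is an action: it returns the number of elements, duplicates included.
val total = pairs.count()

// distinct is a transformation that removes duplicate elements; it needs a shuffle.
val unique = pairs.distinct.count()

// Roughly what distinct does internally: key by the element, keep one value per key.
val viaReduce = pairs.map(x => (x, null)).reduceByKey((x, _) => x).map(_._1).count()

println(s"total=$total unique=$unique viaReduce=$viaReduce")
```

With this sample data, total is 6 while unique and viaReduce are both 4, the same kind of gap as 15 versus 13 above.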


sortByKey
scala> wordcount.sortByKey(true).take(20)
15/12/09 14:31:41 INFO SparkContext: Starting job: sortByKey at <console>:26
15/12/09 14:31:41 INFO DAGScheduler: Got job 53 (sortByKey at <console>:26) with 2 output partitions (allowLocal=false)
15/12/09 14:31:41 INFO DAGScheduler: Final stage: ResultStage 64(sortByKey at <console>:26)
15/12/09 14:31:41 INFO DAGScheduler: Parents of final stage: List()
15/12/09 14:31:41 INFO DAGScheduler: Missing parents: List()
15/12/09 14:31:41 INFO DAGScheduler: Submitting ResultStage 64 (MapPartitionsRDD[56] at sortByKey at <console>:26), which has no missing parents
15/12/09 14:31:41 INFO MemoryStore: ensureFreeSpace(3984) called with curMem=1564303, maxMem=277877882
15/12/09 14:31:41 INFO MemoryStore: Block broadcast_71 stored as values in memory (estimated size 3.9 KB, free 263.5 MB)
15/12/09 14:31:41 INFO MemoryStore: ensureFreeSpace(2127) called with curMem=1568287, maxMem=277877882
15/12/09 14:31:41 INFO MemoryStore: Block broadcast_71_piece0 stored as bytes in memory (estimated size 2.1 KB, free 263.5 MB)
15/12/09 14:31:41 INFO BlockManagerInfo: Added broadcast_71_piece0 in memory on 10.28.23.201:60179 (size: 2.1 KB, free: 264.9 MB)
15/12/09 14:31:41 INFO SparkContext: Created broadcast 71 from broadcast at DAGScheduler.scala:874
15/12/09 14:31:41 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 64 (MapPartitionsRDD[56] at sortByKey at <console>:26)
15/12/09 14:31:41 INFO TaskSchedulerImpl: Adding task set 64.0 with 2 tasks
15/12/09 14:31:41 INFO TaskSetManager: Starting task 0.0 in stage 64.0 (TID 116, 10.28.23.202, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:31:41 INFO TaskSetManager: Starting task 1.0 in stage 64.0 (TID 117, 10.28.23.203, PROCESS_LOCAL, 1397 bytes)
15/12/09 14:31:42 INFO BlockManagerInfo: Added broadcast_71_piece0 in memory on 10.28.23.202:50706 (size: 2.1 KB, free: 264.9 MB)
15/12/09 14:31:42 INFO BlockManagerInfo: Added broadcast_71_piece0 in memory on 10.28.23.203:57813 (size: 2.1 KB, free: 264.9 MB)
15/12/09 14:31:42 INFO TaskSetManager: Finished task 0.0 in stage 64.0 (TID 116) in 86 ms on 10.28.23.202 (1/2)
15/12/09 14:31:42 INFO TaskSetManager: Finished task 1.0 in stage 64.0 (TID 117) in 83 ms on 10.28.23.203 (2/2)
15/12/09 14:31:42 INFO TaskSchedulerImpl: Removed TaskSet 64.0, whose tasks have all completed, from pool 
15/12/09 14:31:42 INFO DAGScheduler: ResultStage 64 (sortByKey at <console>:26) finished in 0.087 s
15/12/09 14:31:42 INFO DAGScheduler: Job 53 finished: sortByKey at <console>:26, took 0.124303 s
15/12/09 14:31:42 INFO SparkContext: Starting job: take at <console>:26
15/12/09 14:31:42 INFO DAGScheduler: Registering RDD 42 (map at <console>:23)
15/12/09 14:31:42 INFO DAGScheduler: Got job 54 (take at <console>:26) with 1 output partitions (allowLocal=true)
15/12/09 14:31:42 INFO DAGScheduler: Final stage: ResultStage 66(take at <console>:26)
15/12/09 14:31:42 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 65)
15/12/09 14:31:42 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 65)
15/12/09 14:31:42 INFO DAGScheduler: Submitting ShuffleMapStage 65 (MapPartitionsRDD[42] at map at <console>:23), which has no missing parents
15/12/09 14:31:42 INFO MemoryStore: ensureFreeSpace(4288) called with curMem=1570414, maxMem=277877882
15/12/09 14:31:42 INFO MemoryStore: Block broadcast_72 stored as values in memory (estimated size 4.2 KB, free 263.5 MB)
15/12/09 14:31:42 INFO MemoryStore: ensureFreeSpace(2366) called with curMem=1574702, maxMem=277877882
15/12/09 14:31:42 INFO MemoryStore: Block broadcast_72_piece0 stored as bytes in memory (estimated size 2.3 KB, free 263.5 MB)
15/12/09 14:31:42 INFO BlockManagerInfo: Added broadcast_72_piece0 in memory on 10.28.23.201:60179 (size: 2.3 KB, free: 264.8 MB)
15/12/09 14:31:42 INFO SparkContext: Created broadcast 72 from broadcast at DAGScheduler.scala:874
15/12/09 14:31:42 INFO DAGScheduler: Submitting 2 missing tasks from ShuffleMapStage 65 (MapPartitionsRDD[42] at map at <console>:23)
15/12/09 14:31:42 INFO TaskSchedulerImpl: Adding task set 65.0 with 2 tasks
15/12/09 14:31:42 INFO TaskSetManager: Starting task 0.0 in stage 65.0 (TID 118, 10.28.23.202, PROCESS_LOCAL, 1386 bytes)
15/12/09 14:31:42 INFO TaskSetManager: Starting task 1.0 in stage 65.0 (TID 119, 10.28.23.203, PROCESS_LOCAL, 1386 bytes)
15/12/09 14:31:42 INFO BlockManagerInfo: Added broadcast_72_piece0 in memory on 10.28.23.203:57813 (size: 2.3 KB, free: 264.9 MB)
15/12/09 14:31:42 INFO BlockManagerInfo: Added broadcast_72_piece0 in memory on 10.28.23.202:50706 (size: 2.3 KB, free: 264.9 MB)
15/12/09 14:31:42 INFO TaskSetManager: Finished task 1.0 in stage 65.0 (TID 119) in 61 ms on 10.28.23.203 (1/2)
15/12/09 14:31:42 INFO TaskSetManager: Finished task 0.0 in stage 65.0 (TID 118) in 67 ms on 10.28.23.202 (2/2)
15/12/09 14:31:42 INFO TaskSchedulerImpl: Removed TaskSet 65.0, whose tasks have all completed, from pool 
15/12/09 14:31:42 INFO DAGScheduler: ShuffleMapStage 65 (map at <console>:23) finished in 0.066 s
15/12/09 14:31:42 INFO DAGScheduler: looking for newly runnable stages
15/12/09 14:31:42 INFO DAGScheduler: running: Set()
15/12/09 14:31:42 INFO DAGScheduler: waiting: Set(ResultStage 66)
15/12/09 14:31:42 INFO DAGScheduler: failed: Set()
15/12/09 14:31:42 INFO DAGScheduler: Missing parents for ResultStage 66: List()
15/12/09 14:31:42 INFO DAGScheduler: Submitting ResultStage 66 (ShuffledRDD[57] at sortByKey at <console>:26), which is now runnable
15/12/09 14:31:42 INFO MemoryStore: ensureFreeSpace(2528) called with curMem=1577068, maxMem=277877882
15/12/09 14:31:42 INFO MemoryStore: Block broadcast_73 stored as values in memory (estimated size 2.5 KB, free 263.5 MB)
15/12/09 14:31:42 INFO MemoryStore: ensureFreeSpace(1512) called with curMem=1579596, maxMem=277877882
15/12/09 14:31:42 INFO MemoryStore: Block broadcast_73_piece0 stored as bytes in memory (estimated size 1512.0 B, free 263.5 MB)
15/12/09 14:31:42 INFO BlockManagerInfo: Added broadcast_73_piece0 in memory on 10.28.23.201:60179 (size: 1512.0 B, free: 264.8 MB)
15/12/09 14:31:42 INFO SparkContext: Created broadcast 73 from broadcast at DAGScheduler.scala:874
15/12/09 14:31:42 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 66 (ShuffledRDD[57] at sortByKey at <console>:26)
15/12/09 14:31:42 INFO TaskSchedulerImpl: Adding task set 66.0 with 1 tasks
15/12/09 14:31:42 INFO TaskSetManager: Starting task 0.0 in stage 66.0 (TID 120, 10.28.23.201, PROCESS_LOCAL, 1165 bytes)
15/12/09 14:31:42 INFO BlockManagerInfo: Added broadcast_73_piece0 in memory on 10.28.23.201:51294 (size: 1512.0 B, free: 264.9 MB)
15/12/09 14:31:42 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 10 to 10.28.23.201:44784
15/12/09 14:31:42 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 10 is 163 bytes
15/12/09 14:31:42 INFO TaskSetManager: Finished task 0.0 in stage 66.0 (TID 120) in 122 ms on 10.28.23.201 (1/1)
15/12/09 14:31:42 INFO TaskSchedulerImpl: Removed TaskSet 66.0, whose tasks have all completed, from pool 
15/12/09 14:31:42 INFO DAGScheduler: ResultStage 66 (take at <console>:26) finished in 0.123 s
15/12/09 14:31:42 INFO DAGScheduler: Job 54 finished: take at <console>:26, took 0.218343 s
15/12/09 14:31:42 INFO SparkContext: Starting job: take at <console>:26
15/12/09 14:31:42 INFO DAGScheduler: Got job 55 (take at <console>:26) with 1 output partitions (allowLocal=true)
15/12/09 14:31:42 INFO DAGScheduler: Final stage: ResultStage 68(take at <console>:26)
15/12/09 14:31:42 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 67)
15/12/09 14:31:42 INFO DAGScheduler: Missing parents: List()
15/12/09 14:31:42 INFO DAGScheduler: Submitting ResultStage 68 (ShuffledRDD[57] at sortByKey at <console>:26), which has no missing parents
15/12/09 14:31:42 INFO MemoryStore: ensureFreeSpace(2528) called with curMem=1581108, maxMem=277877882
15/12/09 14:31:42 INFO MemoryStore: Block broadcast_74 stored as values in memory (estimated size 2.5 KB, free 263.5 MB)
15/12/09 14:31:42 INFO MemoryStore: ensureFreeSpace(1512) called with curMem=1583636, maxMem=277877882
15/12/09 14:31:42 INFO MemoryStore: Block broadcast_74_piece0 stored as bytes in memory (estimated size 1512.0 B, free 263.5 MB)
15/12/09 14:31:42 INFO BlockManagerInfo: Added broadcast_74_piece0 in memory on 10.28.23.201:60179 (size: 1512.0 B, free: 264.8 MB)
15/12/09 14:31:42 INFO SparkContext: Created broadcast 74 from broadcast at DAGScheduler.scala:874
15/12/09 14:31:42 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 68 (ShuffledRDD[57] at sortByKey at <console>:26)
15/12/09 14:31:42 INFO TaskSchedulerImpl: Adding task set 68.0 with 1 tasks
15/12/09 14:31:42 INFO TaskSetManager: Starting task 0.0 in stage 68.0 (TID 121, 10.28.23.202, PROCESS_LOCAL, 1165 bytes)
15/12/09 14:31:42 INFO BlockManagerInfo: Added broadcast_74_piece0 in memory on 10.28.23.202:50706 (size: 1512.0 B, free: 264.9 MB)
15/12/09 14:31:42 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 10 to 10.28.23.202:53700
15/12/09 14:31:42 INFO TaskSetManager: Finished task 0.0 in stage 68.0 (TID 121) in 55 ms on 10.28.23.202 (1/1)
15/12/09 14:31:42 INFO TaskSchedulerImpl: Removed TaskSet 68.0, whose tasks have all completed, from pool 
15/12/09 14:31:42 INFO DAGScheduler: ResultStage 68 (take at <console>:26) finished in 0.055 s
15/12/09 14:31:42 INFO DAGScheduler: Job 55 finished: take at <console>:26, took 0.070686 s
res77: Array[(String, Int)] = Array((best,1), (can,1), (demo,1), (hello,1), (hello,1), (in,1), (see,1), (spark,1), (sun,1), (test,1), (the,1), (the,1), (walking,1), (world,1), (you,1))
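
Two things in the log are worth noting. Although sortByKey is a transformation, it starts a job of its own (Job 53), because the range partitioner samples the keys to choose partition boundaries. take(20) then runs more than one job (54 and 55), because it reads partitions incrementally until it has enough elements or runs out of data. The sketch below again assumes spark-shell with sc predefined. The sample data and the names words, ascending, descending and firstTwo are made up for illustration.

```scala
// Assumes spark-shell, where `sc` already exists; data and names are illustrative.
val words = sc.parallelize(Seq(("the", 1), ("hello", 1), ("world", 1), ("best", 1)), 2)

// sortByKey(true) sorts by key ascending (the default); false sorts descending.
val ascending = words.sortByKey(true).collect()
val descending = words.sortByKey(false).collect()

// take(n) returns at most n elements from the front of the sorted RDD.
val firstTwo = words.sortByKey().take(2)

println(ascending.mkString(", "))
println(descending.mkString(", "))
println(firstTwo.mkString(", "))
```

Asking for more elements than the RDD holds is not an error: take(20) on the 15-element wordcount simply returns all 15 pairs, as in the result above.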










