Getting Started with Spark: Computing the Median


The data is as follows:

1 2 3 4 5 6 8 9 11 12 13 15 18 20 22 23 25 27 29

The code is as follows:

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.control.Breaks._

/**
 * Created by xuyao on 15-7-24.
 * Computes the median of data that is stored in a distributed fashion.
 * Idea: partition the values into K buckets, count how many values fall into
 * each bucket, and count the total number of values. From the bucket counts
 * and the total, we can tell which bucket the median falls in and its offset
 * within that bucket, then take it out.
 */
object Median {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Median")
    val sc = new SparkContext(conf)
    // textFile reads strings, so each token must be converted to Int
    val data = sc.textFile("data").flatMap(x => x.split(' ')).map(x => x.toInt)
    // Bucket each value by value / 4 (bucket width 4 -- small, since this data set is small)
    val mappeddata = data.map(x => (x / 4, x)).sortByKey()
    // p_count holds the number of values in each bucket
    val p_count = data.map(x => (x / 4, 1)).reduceByKey(_ + _).sortByKey()
    p_count.foreach(println)
    // p_count is an RDD, which does not support Map-style lookups,
    // so convert it to a Scala collection with collectAsMap
    val scala_p_count = p_count.collectAsMap()
    // look up a value by its key
    println(scala_p_count(0))
    // sum_count is the total number of values; count() would instead
    // return the number of (key, value) pairs
    val sum_count = p_count.map(x => x._2).sum().toInt
    println(sum_count)
    var temp = 0  // running count up to and including the median's bucket
    var temp2 = 0 // count of all buckets before the median's bucket
    var index = 0 // index of the bucket containing the median
    var mid = 0
    if (sum_count % 2 != 0) {
      mid = sum_count / 2 + 1 // 1-based rank of the median in the whole data set
    } else {
      mid = sum_count / 2 // for an even count, this takes the lower of the two middle values
    }
    val pcount = p_count.count()
    breakable {
      // note: this loop assumes the bucket keys form the contiguous range 0 until pcount
      for (i <- 0 to pcount.toInt - 1) {
        temp = temp + scala_p_count(i)
        temp2 = temp - scala_p_count(i)
        if (temp >= mid) {
          index = i
          break
        }
      }
    }
    println(mid + " " + index + " " + temp + " " + temp2)
    // offset of the median within its bucket
    val offset = mid - temp2
    // takeOrdered sorts ascending (by key first) and returns the first n elements
    val result = mappeddata.filter(x => x._1 == index).takeOrdered(offset)
    println(result(offset - 1)._2)
    sc.stop()
  }
}
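The same bucketed-median idea can be sketched in plain Scala, without Spark, which makes the bookkeeping (bucket counts, median rank, in-bucket offset) easier to follow. This is a minimal local sketch, not the program above; `bucketMedian` and its `width` parameter are names introduced here for illustration.

```scala
object BucketMedianSketch {
  // Returns the median of `data` by bucketing values into buckets of the given
  // width, locating the bucket that contains the median rank, and then picking
  // the element at the remaining offset inside that (sorted) bucket.
  def bucketMedian(data: Array[Int], width: Int): Int = {
    // Count how many values fall into each bucket (key = value / width).
    val counts = data.groupBy(_ / width).map { case (k, vs) => (k, vs.length) }
    val total = data.length
    // 1-based rank of the median; for an even count, take the lower middle value.
    val mid = if (total % 2 != 0) total / 2 + 1 else total / 2

    // Walk buckets in key order, accumulating counts until rank `mid` is reached.
    var seen = 0    // values in all buckets before the median's bucket
    var bucket = -1 // key of the bucket containing the median
    for (k <- counts.keys.toSeq.sorted if bucket < 0) {
      if (seen + counts(k) >= mid) bucket = k else seen += counts(k)
    }

    // Offset of the median inside its bucket, then pick it from the sorted bucket.
    val offset = mid - seen
    data.filter(_ / width == bucket).sorted.apply(offset - 1)
  }

  def main(args: Array[String]): Unit = {
    val data = Array(1, 2, 3, 4, 5, 6, 8, 9, 11, 12, 13, 15, 18, 20, 22, 23, 25, 27, 29)
    println(bucketMedian(data, 4)) // prints 12
  }
}
```

For the 19 values above, the median rank is 10; buckets 0..2 hold 9 values, so the median is the 1st element of bucket 3 (values 12, 13, 15), i.e. 12, which matches what the Spark job prints.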

The output is as follows:

Running locally (Spark 1.4.0, `spark.master=local`, launched from IntelliJ IDEA) interleaves verbose INFO/WARN logging with the program's output; the lines printed by the program itself are:

(0,3)
(1,3)
(2,3)
(3,3)
(4,1)
(5,3)
(6,2)
(7,1)
3
19
10 3 12 9
12

Process finished with exit code 0