Understanding Spark's execution process through wordcount


1. The code and the interactive shell's responses:
(The RDD is Spark's core abstraction: all computation revolves around RDDs. You create an RDD and then apply operations to it.
These operations fall into two main categories:
Transformation
[a computation on one RDD produces a new RDD, for example the flatMap, map, and reduceByKey calls in the example below]

Action
[returns a result to the driver program, marking the end of the RDD computation, for example the collect call in wordcount])
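
Before walking through the shell session line by line, here is the whole wordcount pipeline in one place, annotated by operation type. This is only a sketch of the same session, assuming a spark-shell where sc is already defined and a text file named "input" is available:

val textFile   = sc.textFile("input")                       // creates an RDD (still lazy)
val words      = textFile.flatMap(line => line.split(" "))  // Transformation
val wordPairs  = words.map(word => (word, 1))               // Transformation
val wordCounts = wordPairs.reduceByKey((a, b) => a + b)     // Transformation
wordCounts.collect()                                        // Action: triggers the actual job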

(1) Create an RDD from the text file "input" (which, as the log later shows, resolves to a path on HDFS); each line becomes one record, but no file read is executed yet.
scala> val textFile=sc.textFile("input")
(The textFile method of the SparkContext, i.e. the sc object, takes a text file path as input, creates a new RDD, and returns it to the read-only variable textFile.)

16/07/08 01:20:32 WARN util.SizeEstimator: Failed to check whether UseCompressedOops is set; assuming yes
16/07/08 01:20:32 INFO storage.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 24.2 KB, free 24.2 KB)
16/07/08 01:20:32 INFO storage.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 5.4 KB, free 29.6 KB)
16/07/08 01:20:32 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on localhost:49890 (size: 5.4 KB, free: 517.4 MB)
16/07/08 01:20:32 INFO spark.SparkContext: Created broadcast 0 from textFile at <console>:21
textFile: org.apache.spark.rdd.RDD[String] = input MapPartitionsRDD[1] at textFile at <console>:21
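
The laziness is easy to check: applying textFile to a path that does not exist still returns an RDD without complaint, because nothing is read until an action runs. A small sketch, where "no-such-file" is a made-up, non-existent path:

val missing = sc.textFile("no-such-file")   // returns an RDD immediately, no error yet
// missing.count()                          // only this action would submit a job, and the
//                                          // missing input path is typically reported only then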

(2) Split each line into words on spaces; again, nothing is executed immediately.
scala> val words=textFile.flatMap(line=>line.split(" "))
(flatMap is an RDD transformation: it calls the given function on every member of the current RDD and returns a new RDD. For each input, the function receives one RDD member and returns a collection, whose members are then expanded into the result, so one input can produce many outputs.)
(line => line.split(" ") is Scala's shorthand for an anonymous function: it takes a single parameter named line, the body splits line on spaces, and the result of split is the return value.)

words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:23
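
To see what "expanded" means, compare flatMap with map on a tiny in-memory RDD (sc.parallelize builds an RDD from a local collection; the two sample lines here are made up):

val lines = sc.parallelize(Seq("a b", "c"))
lines.map(line => line.split(" ")).collect()      // roughly Array(Array(a, b), Array(c)) -- nested
lines.flatMap(line => line.split(" ")).collect()  // roughly Array(a, b, c) -- members expanded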

(3) Attach the count 1 to each word; still not executed immediately.
scala> val wordPairs=words.map(word => (word,1))
(map takes a function as its argument; it walks over all members of the RDD, calls the function with each member, and the outputs become the members of the new RDD. Here map(word => (word, 1)) turns each word into a (word, 1) pair.)

wordPairs: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[3] at map at <console>:25
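
The pairs produced here are ordinary Scala tuples, and the first element will later act as the key for reduceByKey. On a tiny made-up word list:

sc.parallelize(Seq("spark", "hadoop", "spark")).map(w => (w, 1)).collect()
// roughly: Array((spark,1), (hadoop,1), (spark,1))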

(4) Aggregate identical words, adding up their counts to get each word's total; still not executed immediately.
scala> val wordCounts = wordPairs.reduceByKey((a,b) => a + b)
(reduceByKey((a, b) => a + b) groups the pairs by key and adds the counts together.)

16/07/08 01:24:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/08 01:24:09 WARN snappy.LoadSnappy: Snappy native library not loaded
16/07/08 01:24:10 INFO mapred.FileInputFormat: Total input paths to process : 1
wordCounts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:27
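
Continuing the tiny made-up example, reduceByKey groups the (word, 1) pairs by key and folds the counts together with the supplied function:

sc.parallelize(Seq("spark", "hadoop", "spark"))
  .map(w => (w, 1))
  .reduceByKey((a, b) => a + b)
  .collect()
// roughly: Array((hadoop,1), (spark,2)) -- the order after the shuffle may vary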

(5) collect is an Action.
scala> wordCounts.collect()
(collect pulls the RDD's data, distributed across the cluster's nodes, back to the local driver program; the interactive shell then prints the content automatically.)

16/07/08 01:24:48 INFO spark.SparkContext: Starting job: collect at <console>:30
16/07/08 01:24:48 INFO scheduler.DAGScheduler: Registering RDD 3 (map at <console>:25)
16/07/08 01:24:48 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:30) with 1 output partitions
16/07/08 01:24:48 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (collect at <console>:30)
16/07/08 01:24:48 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
16/07/08 01:24:48 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
16/07/08 01:24:49 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at <console>:25), which has no missing parents
16/07/08 01:24:49 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.1 KB, free 33.7 KB)
16/07/08 01:24:49 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 36.0 KB)
16/07/08 01:24:49 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:49890 (size: 2.3 KB, free: 517.4 MB)
16/07/08 01:24:49 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/07/08 01:24:49 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at <console>:25)
16/07/08 01:24:49 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/07/08 01:24:50 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2135 bytes)
16/07/08 01:24:50 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/07/08 01:24:50 INFO rdd.HadoopRDD: Input split: hdfs://192.168.147.129:9000/user/root/input:0+1366
16/07/08 01:24:51 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2253 bytes result sent to driver
16/07/08 01:24:51 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1227 ms on localhost (1/1)
16/07/08 01:24:51 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/07/08 01:24:51 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (map at <console>:25) finished in 1.440 s
16/07/08 01:24:51 INFO scheduler.DAGScheduler: looking for newly runnable stages
16/07/08 01:24:51 INFO scheduler.DAGScheduler: running: Set()
16/07/08 01:24:51 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
16/07/08 01:24:51 INFO scheduler.DAGScheduler: failed: Set()
16/07/08 01:24:51 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (ShuffledRDD[4] at reduceByKey at <console>:27), which has no missing parents
16/07/08 01:24:51 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 2.6 KB, free 38.6 KB)
16/07/08 01:24:51 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 1588.0 B, free 40.1 KB)
16/07/08 01:24:51 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:49890 (size: 1588.0 B, free: 517.4 MB)
16/07/08 01:24:51 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
16/07/08 01:24:51 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (ShuffledRDD[4] at reduceByKey at <console>:27)
16/07/08 01:24:51 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
16/07/08 01:24:51 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, partition 0,NODE_LOCAL, 1894 bytes)
16/07/08 01:24:51 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
16/07/08 01:24:51 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
16/07/08 01:24:52 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 49 ms
16/07/08 01:24:52 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 4131 bytes result sent to driver
16/07/08 01:24:52 INFO scheduler.DAGScheduler: ResultStage 1 (collect at <console>:30) finished in 0.610 s
16/07/08 01:24:52 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 614 ms on localhost (1/1)
16/07/08 01:24:52 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
16/07/08 01:24:52 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:30, took 3.726771 s
res1: Array[(String, Int)] = Array((Hadoop,1), (Commodity,1), (For,1), (this,3), (country,1), (under,1), (it,1), (The,4), (Jetty,1), (Software,2), (Technology,1), (<http://www.wassenaar.org/>,1), (have,1), (http://wiki.apache.org/hadoop/,1), (BIS,1), (classified,1), (This,1), (following,1), (which,2), (security,1), (See,1), (encryption,3), (Number,1), (export,1), (reside,1), (for,3), ((BIS),,1), (any,1), (at:,2), (software,2), (makes,1), (algorithms.,1), (re-export,2), (latest,1), (your,1), (SSL,1), (the,8), (Administration,1), (includes,2), (import,,2), (provides,1), (Unrestricted,1), (country's,1), (if,1), (740.13),1), (Commerce,,1), (country,,1), (software.,2), (concerning,1), (laws,,1), (source,1), (possession,,2), (Apache,1), (our,2), (written,1), (as,1), (License,1), (regulations,...
scala> 16/07/08 01:51:45 INFO spark.ContextCleaner: Cleaned accumulator 1
16/07/08 01:51:45 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on localhost:49890 in memory (size: 1588.0 B, free: 517.4 MB)
16/07/08 01:51:45 INFO spark.ContextCleaner: Cleaned accumulator 2
16/07/08 01:51:45 INFO storage.BlockManagerInfo: Removed broadcast_1_piece0 on localhost:49890 in memory (size: 2.3 KB, free: 517.4 MB)
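
Note that collect pulls the entire result back to the driver, which is fine for a small word list but can be risky for large RDDs. Two standard alternatives, sketched here with a made-up output path:

wordCounts.take(10)                      // fetch only the first 10 pairs to the driver
wordCounts.saveAsTextFile("wc-output")   // write the result out instead of collecting it
                                         // ("wc-output" is a hypothetical path)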

As the log above shows, the DAGScheduler splits the collect job into two stages: ShuffleMapStage 0 (up to and including the map) and ResultStage 1 (the reduceByKey and collect), separated by the shuffle that reduceByKey introduces. So Spark's most fundamental abstraction is the Resilient Distributed Dataset (RDD): an RDD can be created from an HDFS file, or derived from another RDD through a transformation.

2. Passing functions as parameters
Spark relies heavily on parameters that are functions; for example, the common RDD operations map and reduce both take a function as their argument (a small reduce sketch appears at the end of this section).
There are generally two forms:
(1) An anonymous function, suitable for short pieces of code; all of the function arguments in the wordcount example above are of this kind.
(2) A static method of a singleton object; for example, the flatMap call in wordcount can be written as:
scala> object MyFunctions{
     | def lineSplit(line:String):Array[String]={
     | line.split(" ")
     | }
     | }
defined module MyFunctions

scala> val words =textFile.flatMap(MyFunctions.lineSplit)
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[5] at flatMap at <console>:25

scala>
This is equivalent to:
scala> val words = textFile.flatMap(line => line.split(" "))
words: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at flatMap at <console>:23
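
As a further small illustration of the anonymous-function form, the reduce action mentioned above also takes a two-argument function (a sketch, runnable in the same shell):

sc.parallelize(1 to 100).reduce((a, b) => a + b)   // sums 1..100, i.e. 5050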
