Spark Source Code Reading (2): How a Spark Application Runs

Source: Internet. Editor: 程序博客网. Date: 2024/05/24 01:42

Code version: Spark 2.2.0

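This article describes how a Spark application runs. The process breaks down into three parts: (1) creating the SparkConf; (2) creating the SparkContext; (3) executing the job.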

Suppose we have written a word-count program in Scala that counts the words in a file:

package com.spark.myapp

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("spark://master:7077")
    val sc = new SparkContext(conf)
    // Split each line into words, pair each word with 1, then sum the counts per word.
    sc.textFile("README.md").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}

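Once the code is compiled and packaged into a jar, submit it to a standalone cluster by running the following command from the Spark installation directory on the submitting machine: spark-submit --class com.spark.myapp.WordCount --master spark://master:7077 /home/xx/myapps/wordcount.jar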

For how spark-submit ends up running the application code, see the earlier article "spark-submit execution process". The sections below take the three parts in order: SparkConf creation, SparkContext creation, and job execution.

1. SparkConf creation

SparkConf holds the configuration parameters of a Spark application. Here is the class's own documentation:

Configuration for a Spark application. Used to set various Spark parameters as key-value pairs. Most of the time, you would create a SparkConf object with `new SparkConf()`, which will load values from any `spark.*` Java system properties set in your application as well. In this case, parameters you set directly on the `SparkConf` object take priority over system properties.

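The key point: new SparkConf() reads the Spark-related parameters from the Java system properties. All parameters are key-value pairs, and any value you set through SparkConf's set methods overrides the value that was loaded.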
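The precedence rule can be illustrated without Spark itself. The following is a minimal sketch in plain Scala: `MiniConf` is a hypothetical stand-in written for this article, not Spark's SparkConf, and it only models "load the `spark.*` entries first, then let explicit set calls win".

```scala
import scala.collection.mutable

// Hypothetical stand-in for illustration only; this is not Spark's SparkConf.
class MiniConf(defaults: Map[String, String]) {
  private val settings = mutable.Map.empty[String, String]

  // Load only the "spark.*" entries from the supplied properties.
  settings ++= defaults.filter { case (key, _) => key.startsWith("spark.") }

  // An explicit set call overwrites whatever was loaded; returning this allows chaining.
  def set(key: String, value: String): MiniConf = {
    settings(key) = value
    this
  }

  def get(key: String): Option[String] = settings.get(key)
}
```

To mimic the real behaviour more closely, `sys.props.toMap` could be passed as `defaults`, so that entries supplied as JVM system properties are picked up and can then be overridden with `set`.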

Commonly used setters:

(1) Set the master URL

def setMaster(master: String): SparkConf

(2) Set the application name, which is shown in the Spark web UI

def setAppName(name: String): SparkConf

(3) Set the jars to distribute to the cluster

def setJars(jars: Seq[String]): SparkConf

(4) Set an environment variable for the executors

def setExecutorEnv(variable: String, value: String): SparkConf

(5) Set the Spark installation directory (Spark home)

def setSparkHome(home: String): SparkConf

2. SparkContext creation

SparkContext is a central object in Spark development: it mediates between the application layer and the underlying APIs. Its constructor takes the SparkConf described above.

Only one SparkContext may be active per JVM. You must `stop()` the active SparkContext before creating a new one.
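The "one active context per JVM" rule can be pictured with a small guard. This is only a sketch under assumptions: `SingleContextSketch` and `MiniContext` are hypothetical names, and Spark's own bookkeeping is more involved than a single atomic slot.

```scala
import java.util.concurrent.atomic.AtomicReference

object SingleContextSketch {
  // Holds the currently active context, or null when there is none.
  private val active = new AtomicReference[MiniContext](null)

  // Hypothetical context class, for illustration only.
  class MiniContext(val name: String) {
    // Claim the single "active" slot atomically; fail if another context holds it.
    if (!active.compareAndSet(null, this)) {
      throw new IllegalStateException("another context is still active; call stop() on it first")
    }

    // Release the slot, but only if this instance is the one holding it.
    def stop(): Unit = {
      active.compareAndSet(this, null)
      ()
    }
  }
}
```

Creating a second `MiniContext` while the first is still active throws; after `stop()` on the first, a new one can be created.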

Its key members are SparkEnv, schedulerBackend, taskScheduler, and dagScheduler.

(1) Create SparkEnv

// Create the Spark execution environment (cache, map output tracker, etc)
_env = createSparkEnv(_conf, isLocal, listenerBus)
SparkEnv.set(_env)

(2) Create schedulerBackend and taskScheduler

// Create and start the scheduler
val (sched, ts) = SparkContext.createTaskScheduler(this, master, deployMode)
_schedulerBackend = sched
_taskScheduler = ts
// start TaskScheduler after taskScheduler sets DAGScheduler reference in DAGScheduler's constructor
_taskScheduler.start()

createTaskScheduler returns the schedulerBackend and taskScheduler that match the master parameter passed in, much like a factory pattern.

master match {
  // Case bodies are omitted here.
  case "local"
  case LOCAL_N_REGEX(threads)
  case LOCAL_N_FAILURES_REGEX(threads, maxFailures)
  case SPARK_REGEX(sparkUrl) // standalone mode takes this branch
  case LOCAL_CLUSTER_REGEX(numSlaves, coresPerSlave, memoryPerSlave)
  case masterUrl // other cluster managers such as YARN or Mesos take this branch
}
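The dispatch on the master string can be sketched in plain Scala as follows. This is an illustration rather than Spark's code: the object name, the simplified regular expressions, and the returned description strings are assumptions made for the example, whereas the real method returns a scheduler backend and a task scheduler.

```scala
object MasterDispatchSketch {
  // Simplified patterns written for this sketch; Spark's own definitions may differ.
  private val LocalN = """local\[([0-9]+|\*)\]""".r
  private val LocalNFailures = """local\[([0-9]+|\*)\s*,\s*([0-9]+)\]""".r
  private val SparkUrl = """spark://(.*)""".r

  // A regex used as a pattern must match the whole string, so the order of cases is safe.
  def describe(master: String): String = master match {
    case "local" => "local mode, 1 thread"
    case LocalN(threads) => s"local mode, threads=$threads"
    case LocalNFailures(threads, maxFailures) =>
      s"local mode, threads=$threads, maxFailures=$maxFailures"
    case SparkUrl(hostAndPort) => s"standalone cluster at $hostAndPort"
    case other => s"external cluster manager for $other"
  }
}
```

With the master URL from the WordCount example, `describe("spark://master:7077")` falls into the standalone branch, while a value such as `"yarn"` reaches the final catch-all case.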

(3) Create dagScheduler

_dagScheduler = new DAGScheduler(this)

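3. Job execution

Spark job execution is built on the RDD (Resilient Distributed Dataset). Spark operators are applied to RDDs and produce new RDDs, until the final result is written to the screen, to a file, to memory, and so on.

sc.textFile("README.md").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_+_).collect().foreach(println)

Step 1: sc.textFile("README.md")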

def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

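hadoopFile returns a HadoopRDD. The default minPartitions hint is defaultMinPartitions, that is min(defaultParallelism, 2), so a small file typically ends up with 2 partitions. The map operation that follows produces a new MapPartitionsRDD, as the code below shows. For the details, see the earlier article "Code analysis of Spark file reading and writing". An RDD is created in one of two ways: from input data (a file system, a database, or a collection in the driver program), or by transforming a parent RDD into a new RDD.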

/**
 * Return a new RDD by applying a function to all elements of this RDD.
 */
def map[U: ClassTag](f: T => U): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))
}

Step 2: flatMap(_.split(" "))

/**
 * Return a new RDD by first applying a function to all elements of this RDD, and then
 * flattening the results.
 */
def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))
}

After each line is split on spaces, the words are flattened into a single collection. The result is again a MapPartitionsRDD.
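In the source shown above, both map and flatMap are expressed as a function over one partition's iterator. The sketch below models that idea with plain Scala collections. It is an assumption-laden illustration: a "dataset" here is just a `Vector` of in-memory partitions, and `MapPartitionsSketch` is a hypothetical name, not a Spark class.

```scala
object MapPartitionsSketch {
  // Apply f to each partition's iterator, passing the partition index along.
  def mapPartitions[T, U](parts: Vector[Vector[T]])(
      f: (Int, Iterator[T]) => Iterator[U]): Vector[Vector[U]] =
    parts.zipWithIndex.map { case (part, pid) => f(pid, part.iterator).toVector }

  // map expressed through the per-partition function: iter.map(f).
  def map[T, U](parts: Vector[Vector[T]])(f: T => U): Vector[Vector[U]] =
    mapPartitions[T, U](parts)((_, iter) => iter.map(f))

  // flatMap expressed the same way: iter.flatMap(f).
  def flatMap[T, U](parts: Vector[Vector[T]])(f: T => Iterable[U]): Vector[Vector[U]] =
    mapPartitions[T, U](parts)((_, iter) => iter.flatMap(f))
}
```

The partition boundaries are preserved by both operations, which is why neither of them needs a shuffle.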

Step 3: map((_, 1)) pairs each element with a count of 1 and returns a MapPartitionsRDD.

For example, given the sentence "hello world", step 2 yields (hello, world) and step 3 yields ((hello, 1), (world, 1)).
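The same two steps can be tried on an ordinary Scala collection, which has flatMap and map methods of the same shape. This is a minimal sketch with hypothetical helper names, not Spark code.

```scala
object StepsSketch {
  // Step 2 analogue: split every line on spaces and flatten into one sequence of words.
  def splitWords(lines: Seq[String]): Seq[String] = lines.flatMap(_.split(" "))

  // Step 3 analogue: pair every word with an initial count of 1.
  def pairWithOne(words: Seq[String]): Seq[(String, Int)] = words.map((_, 1))
}
```

Calling `StepsSketch.pairWithOne(StepsSketch.splitWords(Seq("hello world")))` reproduces the two pairs from the example sentence.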

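Step 4: reduceByKey(_+_)

This function is not defined in the RDD class but in PairRDDFunctions. The code shows that it becomes available through an implicit conversion on RDD: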

/**
 * Defines implicit functions that provide extra functionalities on RDDs of specific types.
 *
 * For example, [[RDD.rddToPairRDDFunctions]] converts an RDD into a [[PairRDDFunctions]] for
 * key-value-pair RDDs, and enabling extra functionalities such as `PairRDDFunctions.reduceByKey`.
 */
implicit def rddToPairRDDFunctions[K, V](rdd: RDD[(K, V)])
  (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null): PairRDDFunctions[K, V] = {
  new PairRDDFunctions(rdd)
}

def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {
  reduceByKey(defaultPartitioner(self), func)
}

def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {
  combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)
}
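The implicit-conversion mechanism can be reproduced on a plain `Seq`. In this sketch `PairSeqFunctions` and `seqToPairSeqFunctions` are hypothetical names that play the roles of PairRDDFunctions and rddToPairRDDFunctions; the grouping is done locally with `groupBy`, which is an assumption of the example and not how Spark combines values.

```scala
import scala.language.implicitConversions

object EnrichSketch {
  // Hypothetical wrapper that adds reduceByKey to sequences of key-value pairs.
  class PairSeqFunctions[K, V](self: Seq[(K, V)]) {
    def reduceByKey(func: (V, V) => V): Map[K, V] =
      self.groupBy(_._1).map { case (key, kvs) => (key, kvs.map(_._2).reduce(func)) }
  }

  // The conversion applies only when the element type is a pair,
  // so reduceByKey is not offered on, say, Seq[String].
  implicit def seqToPairSeqFunctions[K, V](seq: Seq[(K, V)]): PairSeqFunctions[K, V] =
    new PairSeqFunctions(seq)
}
```

After `import EnrichSketch._`, an expression such as `Seq(("hello", 1), ("world", 1), ("hello", 1)).reduceByKey(_ + _)` compiles because the compiler inserts the conversion, just as it does for an RDD of pairs.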

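Once the values for each key have been combined, the data has to be repartitioned so that all values for a given key land in the same partition. Unless a parent RDD already has a partitioner, the partitioner returned by defaultPartitioner is a HashPartitioner.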
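Hash partitioning itself is a small idea: take the key's hashCode modulo the number of partitions and keep the result non-negative. The sketch below shows that idea in plain Scala; the object and method names are hypothetical, and it is not a copy of Spark's HashPartitioner.

```scala
object HashPartitionSketch {
  // Map a key to a partition index in [0, numPartitions) using its hashCode.
  // The adjustment keeps the index non-negative when hashCode is negative.
  def partitionFor(key: Any, numPartitions: Int): Int = {
    require(numPartitions > 0, "numPartitions must be positive")
    if (key == null) 0
    else {
      val raw = key.hashCode % numPartitions
      if (raw < 0) raw + numPartitions else raw
    }
  }

  // Group key-value pairs by target partition, mimicking what a shuffle has to achieve.
  def shuffle[K, V](pairs: Seq[(K, V)], numPartitions: Int): Map[Int, Seq[(K, V)]] =
    pairs.groupBy { case (key, _) => partitionFor(key, numPartitions) }
}
```

Because the partition index depends only on the key, every pair with the same key is routed to the same partition, which is what allows the per-key reduction to finish locally.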

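Step 5: collect().foreach(println)

These two are covered together, because this is where the chain reaches an action; everything before this point is a transformation. collect is an action. In this particular chain, foreach is called on the Array that collect() returns, so it runs on the driver as an ordinary Scala collection method; RDD also defines a foreach action of its own, whose source is shown below alongside collect. Transformations are evaluated lazily: nothing is computed until an action is called, at which point a job is submitted and its tasks run on the executors.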
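Lazy evaluation can be observed with a plain Scala `Iterator`, whose `map` is also deferred until something consumes it. This is only an analogy under assumptions: an RDD records its lineage rather than wrapping an iterator in the driver, and `LazySketch` with its counter is a hypothetical helper.

```scala
object LazySketch {
  // Counts how many elements the "transformation" has actually processed.
  var evaluated = 0

  // Build the pipeline; Iterator.map is lazy, so nothing runs when this returns.
  def pipeline(words: Seq[String]): Iterator[(String, Int)] =
    words.iterator.map { word =>
      evaluated += 1
      (word, 1)
    }
}
```

The counter stays at zero after `pipeline` returns and only advances when the iterator is consumed, for example by `toList`, which plays the role of the action here.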

def collect(): Array[T] = withScope {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

def foreach(f: T => Unit): Unit = withScope {
  val cleanF = sc.clean(f)
  sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))
}

Both rely on SparkContext's runJob. Here is its implementation:

def runJob[T, U: ClassTag](rdd: RDD[T], func: Iterator[T] => U): Array[U] = {
  runJob(rdd, func, 0 until rdd.partitions.length)
}

The overload that is ultimately called:

/**
 * Run a function on a given set of partitions in an RDD and pass the results to the given
 * handler function. This is the main entry point for all actions in Spark.
 *
 * @param rdd target RDD to run tasks on
 * @param func a function to run on each partition of the RDD
 * @param partitions set of partitions to run on; some jobs may not want to compute on all
 * partitions of the target RDD, e.g. for operations like `first()`
 * @param resultHandler callback to pass each result to
 */
def runJob[T, U: ClassTag](
    rdd: RDD[T],
    func: (TaskContext, Iterator[T]) => U,
    partitions: Seq[Int],
    resultHandler: (Int, U) => Unit): Unit = {
  if (stopped.get()) {
    throw new IllegalStateException("SparkContext has been shutdown")
  }
  val callSite = getCallSite
  val cleanedFunc = clean(func)
  logInfo("Starting job: " + callSite.shortForm)
  if (conf.getBoolean("spark.logLineage", false)) {
    logInfo("RDD's recursive dependencies:\n" + rdd.toDebugString)
  }
  dagScheduler.runJob(rdd, cleanedFunc, partitions, callSite, resultHandler, localProperties.get)
  progressBar.foreach(_.finishAll())
  rdd.doCheckpoint()
}
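The shape of runJob (a function per partition, a list of partition indices, and a result handler) can be modelled in a few lines of plain Scala. The sketch below is a simplification under stated assumptions: partitions are in-memory `Vector`s, everything runs sequentially in one process instead of going through a scheduler, and `RunJobSketch` is a hypothetical name.

```scala
import scala.collection.mutable

object RunJobSketch {
  // Hypothetical model: a dataset is a sequence of partitions held in memory.
  type Partitions[T] = Vector[Vector[T]]

  // Run func on each requested partition and pass (partitionIndex, result) to the handler.
  def runJob[T, U](
      data: Partitions[T],
      func: Iterator[T] => U,
      partitions: Seq[Int],
      resultHandler: (Int, U) => Unit): Unit =
    partitions.foreach { p => resultHandler(p, func(data(p).iterator)) }

  // collect() analogue: gather one result per partition, then concatenate in partition order.
  def collect[T](data: Partitions[T]): Vector[T] = {
    val results = mutable.Map.empty[Int, Vector[T]]
    runJob[T, Vector[T]](data, _.toVector, data.indices, (i, r) => { results(i) = r })
    data.indices.toVector.flatMap(i => results(i))
  }
}
```

Passing only some indices in `partitions` computes only those partitions, which corresponds to the remark in the doc comment about operations like `first()` not needing every partition.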
