Spark 2.0.x Source Code Deep Dive: Spark Submit


WeChat: 519292115

Email: taosiyuan163@163.com


Please respect the original author; reposting is prohibited!!


Spark is currently one of the hottest frameworks in big data. It efficiently handles offline batch processing, real-time computation, machine learning and other workloads, and reading its source code will deepen your understanding of the framework.

In this series I will walk through the core components of Spark 2.0.x one by one; later chapters will cover SparkContext, SparkEnv, RpcEnv, NettyRpc, BlockManager, OutputTracker, TaskScheduler, DAGScheduler, and more.


The spark-submit script is the entry point the client uses to submit a job. It carries the cluster deploy mode, the degree of parallelism, the number of cores and so on. Below is the core code it triggers.


When we run the submit script, the shell script ultimately executes the following command, which takes us to Spark's entry class: \spark\spark-master\spark-master\core\src\main\scala\org\apache\spark\deploy\SparkSubmit.scala

exec"$SPARK_HOME"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"


Different arguments dispatch to different actions:

override def main(args: Array[String]): Unit = {
  // Parse the arguments passed in by the submit script
  val appArgs = new SparkSubmitArguments(args)
  if (appArgs.verbose) {
    // scalastyle:off println
    printStream.println(appArgs)
    // scalastyle:on println
  }
  // Dispatch to the matching action based on the parsed arguments
  appArgs.action match {
    // Submit the application
    case SparkSubmitAction.SUBMIT => submit(appArgs)
    // Only reachable in standalone/Mesos cluster mode
    case SparkSubmitAction.KILL => kill(appArgs)
    // Only reachable in standalone/Mesos cluster mode
    case SparkSubmitAction.REQUEST_STATUS => requestStatus(appArgs)
  }
}
submit is the critical method; it consists of two main steps:

1. Call prepareSubmitEnvironment:

The core method prepareSubmitEnvironment(args) matches the entry class of the appropriate Spark run mode against the arguments you passed in. It produces the system properties, the relevant classpath, the run mode and so on, which are used shortly afterwards to prepare Spark's launch environment.
2. Call doRunMain:

This step takes the 4-tuple describing the launch environment obtained above and calls the corresponding entry point, which essentially means invoking the main method of the class for that deploy mode.

/**
 * Submit the application using the provided parameters.
 *
 * This runs in two steps. First, we prepare the launch environment by setting up
 * the appropriate classpath, system properties, and application arguments for
 * running the child main class based on the cluster manager and the deploy mode.
 * Second, we use this launch environment to invoke the main method of the child
 * main class.
 */
@tailrec
private def submit(args: SparkSubmitArguments): Unit = {
  // The core call: prepareSubmitEnvironment matches the deploy mode against args
  // and returns a 4-tuple
  val (childArgs, childClasspath, sysProps, childMainClass) = prepareSubmitEnvironment(args)

  def doRunMain(): Unit = {
    if (args.proxyUser != null) {
      val proxyUser = UserGroupInformation.createProxyUser(args.proxyUser,
        UserGroupInformation.getCurrentUser())
      try {
        proxyUser.doAs(new PrivilegedExceptionAction[Unit]() {
          override def run(): Unit = {
            // Invoke runMain with the launch environment obtained above
            runMain(childArgs, childClasspath, sysProps, childMainClass, args.verbose)
          }
        })

The prepareSubmitEnvironment method selects the deploy mode based on the parameters extracted from args (a simplified summary sketch follows these excerpts):

if (deployMode == CLIENT || isYarnCluster) {
  childMainClass = args.mainClass
if (args.isStandaloneCluster) {
  if (args.useRest) {
    childMainClass = "org.apache.spark.deploy.rest.RestSubmissionClient"
    childArgs += (args.primaryResource, args.mainClass)
  } else {
    // In legacy standalone cluster mode, use Client as a wrapper around the user class
    childMainClass = "org.apache.spark.deploy.Client"

if (isYarnCluster) {
  childMainClass = "org.apache.spark.deploy.yarn.Client"

if (isMesosCluster) {
  assert(args.useRest, "Mesos cluster mode is only supported through the REST submission API")
  childMainClass = "org.apache.spark.deploy.rest.RestSubmissionClient"
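To summarize the excerpts above, here is a simplified sketch of which entry class ends up in childMainClass for each combination of deploy mode and cluster manager. This is not the actual Spark code; the helper name and the string-prefix checks are mine.

// Simplified illustration only: the real logic lives in prepareSubmitEnvironment
def childMainClassFor(deployMode: String, master: String,
                      useRest: Boolean, userMainClass: String): String =
  (deployMode, master) match {
    // client mode: the user's own main class is run directly
    case ("client", _) => userMainClass
    // YARN cluster mode: yarn.Client wraps the user class
    case ("cluster", m) if m.startsWith("yarn") =>
      "org.apache.spark.deploy.yarn.Client"
    // standalone cluster mode: REST client by default, legacy Client otherwise
    case ("cluster", m) if m.startsWith("spark://") =>
      if (useRest) "org.apache.spark.deploy.rest.RestSubmissionClient"
      else "org.apache.spark.deploy.Client"
    // Mesos cluster mode: only the REST submission API is supported
    case ("cluster", m) if m.startsWith("mesos://") =>
      "org.apache.spark.deploy.rest.RestSubmissionClient"
    case _ => userMainClass
  }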

The returned 4-tuple is then passed into the run method inside doRunMain, and runMain uses Java reflection to dynamically obtain the real main entry point of the child main class.

var mainClass: Class[_] = null
try {
  mainClass = Utils.classForName(childMainClass)
} catch {
  case e: ClassNotFoundException =>

Internally this is a Class.forName call that takes a class loader argument. In fact, plain Java facilities are used in many places in the source code, including reflection to obtain objects dynamically, for example when creating the Serializer in SparkEnv (JavaSerializer by default, though you can also use the more efficient Kryo, covered in a later chapter), as well as the thread pools and data structures of some core components.

/** Preferred alternative to Class.forName(className) */
def classForName(className: String): Class[_] = {
  // Ultimately implemented with Java reflection
  Class.forName(className, true, getContextOrSparkClassLoader)
  // scalastyle:on classforname
}
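The same reflective-instantiation pattern is what SparkEnv relies on for the Serializer mentioned above. Below is a minimal sketch of that pattern, not the actual SparkEnv code (SparkEnv also tries other constructor signatures); it assumes the serializer class exposes a constructor taking a SparkConf, which both JavaSerializer and KryoSerializer do.

import org.apache.spark.SparkConf
import org.apache.spark.serializer.Serializer

object SerializerReflectionSketch {
  def main(args: Array[String]): Unit = {
    // Fall back to JavaSerializer when spark.serializer is not set, as Spark does
    val conf = new SparkConf()
    val serializerName =
      conf.get("spark.serializer", "org.apache.spark.serializer.JavaSerializer")

    // Prefer the context class loader, mirroring getContextOrSparkClassLoader
    val loader = Option(Thread.currentThread().getContextClassLoader)
      .getOrElse(getClass.getClassLoader)

    // Load the class and call its (SparkConf) constructor reflectively
    val clazz = Class.forName(serializerName, true, loader)
    val serializer = clazz.getConstructor(classOf[SparkConf])
      .newInstance(conf)
      .asInstanceOf[Serializer]

    println(s"Instantiated serializer: ${serializer.getClass.getName}")
  }
}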

Finally, the corresponding main method is invoked:

val mainMethod = mainClass.getMethod("main", new Array[String](0).getClass)
try {
  mainMethod.invoke(null, childArgs.toArray)
} catch {
  case t: Throwable =>
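As a self-contained illustration of this last step, the sketch below (DemoApp, its arguments and values are made up) reproduces the same pattern: resolve a class by name, fetch its static main(Array[String]) method, and invoke it reflectively.

// The application class SparkSubmit would eventually hand control to
object DemoApp {
  def main(args: Array[String]): Unit =
    println(s"DemoApp started with args: ${args.mkString(", ")}")
}

object ReflectiveMainSketch {
  def main(args: Array[String]): Unit = {
    val childMainClass = "DemoApp"                  // what prepareSubmitEnvironment would return
    val childArgs      = Array("--input", "/data")  // what it would pass along

    val mainClass  = Class.forName(childMainClass)
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    // The whole String array is the single argument expected by main
    mainMethod.invoke(null, childArgs.asInstanceOf[AnyRef])
  }
}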