Submitting Spark Applications

Source: Internet. Editor: 程序博客网. Time: 2024/05/23 01:58

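1. Submitting Applications

The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It can use all of Spark's supported cluster managers through a uniform interface, so you don't have to configure your application specially for each cluster manager.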

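2. Bundling Your Application's Dependencies

If your code depends on other projects, you need to package them alongside your application so that the code can be distributed to the cluster. To do this, create an assembly jar (or "uber" jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When building the assembly jar, Spark and Hadoop do not need to be bundled, because the cluster manager provides them at runtime. Once you have the jar, you can submit the application with the bin/spark-submit script. For Python, you can use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files, we recommend packaging them into a .zip or .egg.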
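As an illustration of the Python packaging advice above, here is a minimal sketch that zips a package directory so the archive can be passed to --py-files. The function name bundle_py_files and the example paths are hypothetical, and the sketch assumes a pure-Python package with no compiled extensions.

```python
import zipfile
from pathlib import Path
from typing import List


def bundle_py_files(package_dir: str, zip_path: str) -> List[str]:
    """Zip every .py file under package_dir.

    The package directory itself is kept as the top-level folder inside the
    archive, so `import <package>` works once the zip is on the Python path.
    Returns the archive member names that were written.
    """
    root = Path(package_dir).resolve()
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(root.rglob("*.py")):
            # Store paths relative to the parent, e.g. "mypkg/util.py".
            arcname = path.relative_to(root.parent).as_posix()
            zf.write(path, arcname)
            added.append(arcname)
    return added
```

The resulting archive could then be supplied as, for example, --py-files deps.zip when submitting the main .py file.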

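3. Launching Applications with spark-submit

Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and supports the different cluster managers and deploy modes that Spark supports: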
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

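Some commonly used options:
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: An arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap the whole pair in quotes, e.g. "key=value".
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any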
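The options above can be assembled programmatically. The following is a minimal sketch, not part of Spark itself: build_submit_command is a hypothetical helper that turns the common options into an argument list. Building a list rather than a single string avoids shell quoting problems when a --conf value contains spaces.

```python
from typing import Dict, List, Optional, Sequence


def build_submit_command(
    app: str,
    app_args: Sequence[str] = (),
    main_class: Optional[str] = None,
    master: Optional[str] = None,
    deploy_mode: Optional[str] = None,
    conf: Optional[Dict[str, str]] = None,
    spark_submit: str = "./bin/spark-submit",
) -> List[str]:
    """Return the argv list for a spark-submit invocation."""
    if deploy_mode not in (None, "client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    cmd = [spark_submit]
    if main_class:
        cmd += ["--class", main_class]
    if master:
        cmd += ["--master", master]
    if deploy_mode:
        cmd += ["--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        # Each property is passed as one "key=value" argument.
        cmd += ["--conf", f"{key}={value}"]
    # The application jar (or .py file) comes after all options,
    # followed by the arguments for the application itself.
    cmd.append(app)
    cmd.extend(str(a) for a in app_args)
    return cmd
```

The returned list can be handed to subprocess.run(cmd, check=True), or printed in copy-pasteable form with shlex.join(cmd).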
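A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. the master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster. The input and output of the application are attached to the console. This mode is therefore especially suitable for applications that involve a REPL (e.g. the Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the driver and the executors. Currently, only YARN supports cluster mode for Python applications.
For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
A few of the available options are specific to the cluster manager in use. For example, with a Spark standalone cluster in cluster deploy mode, you can also specify --supervise to make sure the driver is automatically restarted if it fails with a non-zero exit code. To list all options available to spark-submit, run it with --help. Here are a few examples of common options: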
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster (use --deploy-mode client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

4. Master URLs

The master URL passed to Spark can be in one of the following formats:

Master URL - Meaning
- local - Run Spark locally with one worker thread (i.e. no parallelism at all).
- local[K] - Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
- local[*] - Run Spark locally with as many worker threads as logical cores on your machine.
- spark://HOST:PORT - Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
- mesos://HOST:PORT - Connect to the given Mesos cluster. The port must be whichever one your cluster is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://…. To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
- yarn - Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
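To make the formats listed here concrete, the sketch below classifies a master URL string. It is a hypothetical helper (describe_master is not a Spark API) and it deliberately recognizes only the forms listed in this section; it should not be read as how Spark itself parses the value.

```python
import re

# Matches "local", "local[K]" and "local[*]".
_LOCAL = re.compile(r"^local(?:\[(\*|\d+)\])?$")


def describe_master(url: str) -> str:
    """Return a short description of the master URL formats listed above."""
    match = _LOCAL.match(url)
    if match:
        threads = match.group(1)
        if threads is None:
            return "local mode, 1 worker thread"
        if threads == "*":
            return "local mode, one worker thread per logical core"
        return f"local mode, {threads} worker threads"
    if url == "yarn":
        return "YARN cluster (located via HADOOP_CONF_DIR or YARN_CONF_DIR)"
    if url.startswith("spark://"):
        return "Spark standalone cluster"
    if url.startswith("mesos://"):
        return "Mesos cluster"
    raise ValueError(f"unrecognized master URL: {url!r}")
```

For example, describe_master("local[8]") reports local mode with 8 worker threads, while an unknown value such as "localhost" raises ValueError.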

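5. Loading Configuration from a File

The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default, it reads options from conf/spark-defaults.conf in the Spark directory. For more information, see http://spark.apache.org/docs/latest/configuration.html#loading-default-configurations
Loading default Spark configurations this way can remove the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
If you are ever unclear where configuration options are coming from, you can print fine-grained debugging information by running spark-submit with the --verbose option.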
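The precedence order described above can be modeled in a few lines. This is a simplified sketch, not Spark's own loader: parse_defaults and resolve are hypothetical names, and the parser assumes the simple form of spark-defaults.conf in which each line holds a key and a value separated by whitespace.

```python
from typing import Dict, Mapping


def parse_defaults(text: str) -> Dict[str, str]:
    """Parse 'key value' lines; blank lines and # comments are skipped."""
    result = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        result[parts[0]] = parts[1] if len(parts) > 1 else ""
    return result


def resolve(
    defaults_file: Mapping[str, str],
    submit_flags: Mapping[str, str],
    spark_conf: Mapping[str, str],
) -> Dict[str, str]:
    """Merge the three sources, lowest precedence first."""
    merged = dict(defaults_file)   # lowest: the defaults file
    merged.update(submit_flags)    # middle: flags passed to spark-submit
    merged.update(spark_conf)      # highest: values set on SparkConf
    return merged
```

With spark.executor.memory set to 2g in the file, 4g on the command line and 8g on the SparkConf, resolve returns 8g for that key, while a spark.master value set only in the file is kept.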

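6. Advanced Dependency Management

When using spark-submit, the application jar along with any jars included with the --jars option will be automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included on the driver and executor classpaths. Directory expansion does not work with --jars.
Spark uses the following URL schemes to allow different strategies for disseminating jars: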
- file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
- hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
- local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
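The scheme rules in the list above can be summarized as a small lookup. The sketch below is illustrative only: split_jars and distribution_strategy are hypothetical helpers, and the returned strings simply paraphrase the list rather than reflect Spark's internal logic.

```python
from typing import List
from urllib.parse import urlparse


def split_jars(value: str) -> List[str]:
    """Split a comma-separated --jars value into individual entries."""
    return [part.strip() for part in value.split(",") if part.strip()]


def distribution_strategy(uri: str) -> str:
    """Map a jar URI to the distribution strategy listed above."""
    scheme = urlparse(uri).scheme
    if scheme in ("", "file"):
        # Plain paths and file:/ URIs.
        return "served by the driver's HTTP file server"
    if scheme in ("hdfs", "http", "https", "ftp"):
        return "pulled down from the URI"
    if scheme == "local":
        return "expected to exist locally on each worker node"
    raise ValueError(f"unsupported scheme in {uri!r}")
```

For instance, applying distribution_strategy to each entry of split_jars("/opt/a.jar,hdfs://nn/b.jar,local:/opt/c.jar") yields one strategy of each kind.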
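Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and will need to be cleaned up. With YARN, cleanup is handled automatically; with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property.
Users may also include other dependencies by supplying a comma-delimited list of Maven coordinates with --packages. All transitive dependencies are handled when using this option. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the --repositories flag. These options can be used with pyspark, spark-shell, and spark-submit.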
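As a rough illustration of the --packages and --repositories flags, the sketch below builds the corresponding arguments and checks that each coordinate has the groupId:artifactId:version shape. The helper name packages_args and the coordinate and repository values in the usage note are made up for the example.

```python
from typing import List, Sequence


def packages_args(
    coordinates: Sequence[str], repositories: Sequence[str] = ()
) -> List[str]:
    """Build the --packages / --repositories arguments for spark-submit."""
    for coordinate in coordinates:
        parts = coordinate.split(":")
        # Expect exactly groupId:artifactId:version, with no empty field.
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"bad Maven coordinate: {coordinate!r}")
    args = ["--packages", ",".join(coordinates)]
    if repositories:
        args += ["--repositories", ",".join(repositories)]
    return args
```

Calling packages_args(["com.example:mylib_2.12:1.0.0"], ["https://repo.example.com/maven"]) returns a list that can be spliced into a spark-submit, pyspark or spark-shell command line.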
For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.

0 0