Spark Configuration Parameters (English-Chinese Reference)


Source: https://www.oschina.net/translate/spark-configuration?cmp



Spark provides three main locations to configure the system:

  • Environment variables for launching Spark workers, which can be set either in your driver program or in the conf/spark-env.sh script.
  • Java system properties, which control internal configuration parameters and can be set either programmatically (by calling System.setProperty before creating a SparkContext) or through the SPARK_JAVA_OPTS environment variable in spark-env.sh.
  • Logging configuration, which is done through log4j.properties.


Translator's version

Spark provides three main places to configure the system:

  • Environment variables, used when launching Spark workers; they can be set either in your driver program or in the conf/spark-env.sh script.
  • Java system properties, which control internal configuration parameters; they can be set programmatically (by calling System.setProperty before creating the SparkContext) or through the SPARK_JAVA_OPTS environment variable in spark-env.sh.
  • Logging configuration, set through log4j.properties.


Environment Variables

Spark determines how to initialize the JVM on worker nodes, or even on the local node when you run spark-shell, by running the conf/spark-env.sh script in the directory where it is installed. This script does not exist by default in the Git repository, but you can create it by copying conf/spark-env.sh.template. Make sure that you make the copy executable.

Inside spark-env.sh, you must set at least the following two variables (a sketch of such a file follows this list):

  • SCALA_HOME, to point to your Scala installation, or SCALA_LIBRARY_PATH to point to the directory for Scala library JARs (if you install Scala as a Debian or RPM package, there is no SCALA_HOME, but these libraries are in a separate path, typically /usr/share/java; look for scala-library.jar).
  • MESOS_NATIVE_LIBRARY, if you are running on a Mesos cluster.
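
For illustration only, a minimal spark-env.sh might look like the sketch below. The Scala and Mesos paths are placeholders for whatever is installed on your machines, not values taken from the original documentation:

#!/usr/bin/env bash
# Point Spark at the Scala installation (or set SCALA_LIBRARY_PATH instead).
export SCALA_HOME=/usr/local/scala
# Only needed when running on a Mesos cluster.
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so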


Translator's version

Environment Variables

Spark determines how to initialize the JVM on worker nodes, and likewise on the local node when you run spark-shell, by running the conf/spark-env.sh script in the Spark installation directory. This script does not exist by default in the Git repository, but you can create it by copying conf/spark-env.sh.template. Make sure the copy is executable.

In spark-env.sh, you must set at least the following two variables:

  • SCALA_HOME: points to your Scala installation; alternatively, use SCALA_LIBRARY_PATH to point to the directory holding the Scala library JARs (if you installed Scala as a Debian or RPM package, there is no SCALA_HOME, but the libraries live in a separate path, usually /usr/share/java; look for scala-library.jar).
  • MESOS_NATIVE_LIBRARY: set this if you are running on a Mesos cluster.
In addition, there are four other variables that control execution. These should be set in the environment that launches the job's driver program instead of spark-env.sh, because they will be automatically propagated to workers. Setting these per-job instead of in spark-env.sh ensures that different jobs can have different settings for these variables (see the example after the list below).


  • SPARK_JAVA_OPTS, to add JVM options. This includes any system properties that you'd like to pass with -D.
  • SPARK_CLASSPATH, to add elements to Spark's classpath.
  • SPARK_LIBRARY_PATH, to add search directories for native libraries.
  • SPARK_MEM, to set the amount of memory used per node. This should be in the same format as the JVM's -Xmx option, e.g. 300m or 1g. Note that this option will soon be deprecated in favor of the spark.executor.memory system property, so we recommend using that in new code.
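
For example, one way to apply these per job is to export them in the shell session that launches the driver program. The specific values and paths below are placeholders, not recommendations from the original documentation:

export SPARK_MEM=2g
export SPARK_JAVA_OPTS="-verbose:gc -Dspark.cores.max=5"
export SPARK_CLASSPATH=/path/to/extra-deps.jar
# Launch the driver program from this same shell so it inherits these settings.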

Beware that if you do set these variables in spark-env.sh, they will override the values set by user programs, which is undesirable; if you prefer, you can choose to have spark-env.sh set them only if the user program hasn't, as follows:

if [ -z "$SPARK_JAVA_OPTS" ] ; then
  SPARK_JAVA_OPTS="-verbose:gc"
fi
Translator's version

In addition, there are four more variables that control execution. They should be set in the environment that launches the job's driver program, not in spark-env.sh, because they are automatically propagated to the workers. Setting them per job rather than in spark-env.sh lets different jobs use different values for these variables.

  • SPARK_JAVA_OPTS: adds JVM options, including any system properties you want to pass with -D.
  • SPARK_CLASSPATH: adds entries to Spark's classpath.
  • SPARK_LIBRARY_PATH: adds search paths for native libraries.
  • SPARK_MEM: sets the amount of memory used per node, in the same format as the JVM's -Xmx option, e.g. 300m or 1g. Note that this option will soon be deprecated in favor of the spark.executor.memory system property, so we recommend using that property in new code.

Note that if you set these variables in spark-env.sh, they will override the values set by user programs, which is usually undesirable. If you prefer, you can have spark-env.sh set them only when the user program has not, as follows:

if [ -z "$SPARK_JAVA_OPTS" ] ; then
  SPARK_JAVA_OPTS="-verbose:gc"
fi


System Properties

To set a system property for configuring Spark, you need to either pass it with a -D flag to the JVM (for example java -Dspark.cores.max=5 MyProgram) or call System.setProperty in your code before creating your Spark context, as follows:

System.setProperty("spark.cores.max", "5")
val sc = new SparkContext(...)


Most of the configurable system properties control internal settings that have reasonable default values. However, there are at least five properties that you will commonly want to control:

  • spark.executor.memory (default: 512m): Amount of memory to use per executor process, in the same format as JVM memory strings (e.g. 512m, 2g).
  • spark.serializer (default: spark.JavaSerializer): Class to use for serializing objects that will be sent over the network or need to be cached in serialized form. The default of Java serialization works with any Serializable Java object but is quite slow, so we recommend using spark.KryoSerializer and configuring Kryo serialization when speed is necessary. Can be any subclass of spark.Serializer.
  • spark.kryo.registrator (default: none): If you use Kryo serialization, set this class to register your custom classes with Kryo. You need to set it to a class that extends spark.KryoRegistrator. See the tuning guide for more details.
  • spark.local.dir (default: /tmp): Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored on disk. This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories.
  • spark.cores.max (default: infinite): When running on a standalone deploy cluster or a Mesos cluster in "coarse-grained" sharing mode, how many CPU cores to request at most. The default will use all available cores.
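
As a hedged illustration of wiring several of these together, the sketch below registers custom classes with Kryo and sets the related properties before creating the context. The names mypackage, MyClass, MyRegistrator and ConfigExample are hypothetical placeholders; the property names and the spark.KryoRegistrator interface come from the table above.

package mypackage

import com.esotericsoftware.kryo.Kryo
import spark.{KryoRegistrator, SparkContext}

// Hypothetical user class that will be shipped over the network.
case class MyClass(id: Int, name: String)

// Registers the custom class with Kryo; referenced via spark.kryo.registrator.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass])
  }
}

object ConfigExample {
  def main(args: Array[String]) {
    // Set the properties before the SparkContext is created.
    System.setProperty("spark.executor.memory", "2g")
    System.setProperty("spark.serializer", "spark.KryoSerializer")
    System.setProperty("spark.kryo.registrator", "mypackage.MyRegistrator")
    System.setProperty("spark.local.dir", "/mnt/spark1,/mnt/spark2")
    val sc = new SparkContext("local", "ConfigExample")
    // ... build and run RDD operations with sc ...
  }
}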


Translator's version

System Properties

To set a Spark system property, either pass a -D flag to the JVM (for example, java -Dspark.cores.max=5 MyProgram) or call System.setProperty in your code before creating the Spark context, like this:

System.setProperty("spark.cores.max", "5")
val sc = new SparkContext(...)

Most configurable system properties control internal settings and have reasonable default values. However, there are at least five properties you will commonly want to set yourself:

  • spark.executor.memory (default: 512m): Amount of memory each executor process may use, in the same string format as JVM memory settings (e.g. 512m, 2g).
  • spark.serializer (default: spark.JavaSerializer): Class used to serialize objects that are sent over the network or cached in serialized form. By default, Java serialization handles any object implementing Serializable, but it is quite slow, so when speed matters we recommend using spark.KryoSerializer and configuring Kryo serialization. Can be any subclass of spark.Serializer.
  • spark.kryo.registrator (default: none): If you use Kryo serialization, set this to a class that registers your custom classes with Kryo. The class must extend spark.KryoRegistrator. See the tuning guide for more details.
  • spark.local.dir (default: /tmp): Spark's scratch directory, holding map output files and RDDs stored on disk. It should be on a fast, local disk, and can be a comma-separated list of multiple directories.
  • spark.cores.max (default: infinite): When running on a standalone deploy cluster or on a Mesos cluster in "coarse-grained" sharing mode, the maximum number of CPU cores to request. By default all available cores are used.


Apart from these, the following properties are also available, and may be useful in some situations:

  • spark.mesos.coarse (default: false): If set to "true", runs over Mesos clusters in "coarse-grained" sharing mode, where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per Spark task. This gives lower-latency scheduling for short queries, but leaves resources in use for the whole duration of the Spark job.
  • spark.default.parallelism (default: 8): Default number of tasks to use for distributed shuffle operations (groupByKey, reduceByKey, etc.) when not set by the user.
  • spark.storage.memoryFraction (default: 0.66): Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old" generation of objects in the JVM, which by default is given 2/3 of the heap, but you can increase it if you configure your own old generation size.
  • spark.ui.port (default: random): Port for your application's dashboard, which shows memory usage of each RDD.
  • spark.shuffle.compress (default: true): Whether to compress map output files. Generally a good idea.
  • spark.broadcast.compress (default: true): Whether to compress broadcast variables before sending them. Generally a good idea.
  • spark.rdd.compress (default: false): Whether to compress serialized RDD partitions (e.g. for StorageLevel.MEMORY_ONLY_SER). Can save substantial space at the cost of some extra CPU time.
  • spark.reducer.maxMbInFlight (default: 48): Maximum size (in megabytes) of map outputs to fetch simultaneously from each reduce task. Since each output requires us to create a buffer to receive it, this represents a fixed memory overhead per reduce task, so keep it small unless you have a large amount of memory.
  • spark.closure.serializer (default: spark.JavaSerializer): Serializer class to use for closures. Generally Java is fine unless your distributed functions (e.g. map functions) reference large objects in the driver program.
  • spark.kryoserializer.buffer.mb (default: 32): Maximum object size to allow within Kryo (the library needs to create a buffer at least as large as the largest single object you'll serialize). Increase this if you get a "buffer limit exceeded" exception inside Kryo. Note that there will be one buffer per core on each worker.
  • spark.broadcast.factory (default: spark.broadcast.HttpBroadcastFactory): Which broadcast implementation to use.
  • spark.locality.wait (default: 3000): Number of milliseconds to wait to launch a data-local task before giving up and launching it in a non-data-local location. You should increase this if your tasks are long and you are seeing poor data locality, but the default generally works well.
  • spark.worker.timeout (default: 60): Number of seconds after which the standalone deploy master considers a worker lost if it receives no heartbeats.
  • spark.akka.frameSize (default: 10): Maximum message size to allow in "control plane" communication (for serialized tasks and task results), in MB. Increase this if your tasks need to send back large results to the driver (e.g. using collect() on a large dataset).
  • spark.akka.threads (default: 4): Number of actor threads to use for communication. Can be useful to increase on large clusters when the driver has a lot of CPU cores.
  • spark.akka.timeout (default: 20): Communication timeout between Spark nodes, in seconds.
  • spark.driver.host (default: local hostname): Hostname or IP address for the driver to listen on.
  • spark.driver.port (default: random): Port for the driver to listen on.
  • spark.cleaner.ttl (default: disabled): Duration (seconds) for which Spark will remember any metadata (stages generated, tasks generated, etc.). Periodic cleanups will ensure that metadata older than this duration is forgotten. This is useful for running Spark for many hours / days (for example, running 24/7 in the case of Spark Streaming applications). Note that any RDD that persists in memory for more than this duration will be cleared as well.
  • spark.streaming.blockInterval (default: 200): Duration (milliseconds) over which to batch new objects coming from network receivers.
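
As a hedged sketch, a few of these properties could be set the same way as the earlier System.setProperty example, before the context is created. The master URL, job name, and all values below are illustrative placeholders, not recommendations:

import spark.SparkContext

// Run on Mesos in coarse-grained mode, with more shuffle tasks and hourly metadata cleanup.
System.setProperty("spark.mesos.coarse", "true")
System.setProperty("spark.default.parallelism", "16")
System.setProperty("spark.cleaner.ttl", "3600")
val sc = new SparkContext("mesos://master:5050", "PropertiesExample")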

Configuring Logging

Spark uses log4j for logging. You can configure it by adding a log4j.properties file in the conf directory. One way to start is to copy the existing log4j.properties.template located there.
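
As an illustration only, a minimal conf/log4j.properties could route everything to the console as sketched below; the appender setup and layout pattern here are an assumption, not the contents of the bundled template:

# Log everything at INFO level to the console.
log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n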

Translator's version

Apart from the five above, here are some more properties that you may need to configure in certain situations:

  • spark.mesos.coarse (default: false): If set to "true", Spark runs on the Mesos cluster in "coarse-grained" sharing mode, acquiring one long-lived Mesos task on each machine instead of one Mesos task per Spark task. This gives lower scheduling latency for short queries, but keeps the resources occupied for the entire duration of the Spark job.
  • spark.default.parallelism (default: 8): The default number of tasks used for distributed shuffle operations (groupByKey, reduceByKey, and so on) when the user does not specify one.
  • spark.storage.memoryFraction (default: 0.66): The fraction of the Java heap used for Spark's memory cache. It should not exceed the size of the JVM's old generation, which by default is 2/3 of the heap, but you can raise the fraction if you configure a larger old generation yourself.
  • spark.ui.port (default: random): The port of your application's dashboard, which shows the memory usage of each RDD.
  • spark.shuffle.compress (default: true): Whether to compress map output files; usually a good idea.
  • spark.broadcast.compress (default: true): Whether to compress broadcast variables before sending them; usually a good idea.
  • spark.rdd.compress (default: false): Whether to compress serialized RDD partitions (for example, StorageLevel.MEMORY_ONLY_SER). This can save a lot of space at the cost of some extra CPU time.
  • spark.reducer.maxMbInFlight (default: 48): The maximum size (in megabytes) of map output fetched simultaneously by each reduce task. Because a buffer must be created to receive each output, this is a fixed memory overhead per reduce task, so keep it small unless you have a large amount of memory.
  • spark.closure.serializer (default: spark.JavaSerializer): The serializer class used for closures. Java is usually sufficient, unless the distributed functions in your driver program (such as map functions) reference large objects.
  • spark.kryoserializer.buffer.mb (default: 32): The maximum object size allowed within Kryo (the library needs to create a buffer at least as large as the largest single object you serialize). If you see a "buffer limit exceeded" exception inside Kryo, increase this value. Note that there is one buffer per core on each worker.
  • spark.broadcast.factory (default: spark.broadcast.HttpBroadcastFactory): Which broadcast implementation to use.
  • spark.locality.wait (default: 3000): How many milliseconds to wait to launch a data-local task before giving up and launching it at a non-data-local location. If your tasks are long-running and you see poor data locality, increase this value; the default usually works well.
  • spark.worker.timeout (default: 60): The number of seconds after which the standalone deploy master considers a worker lost if no heartbeat has been received.
  • spark.akka.frameSize (default: 10): The maximum message size, in MB, for "control plane" communication (serialized tasks and task results). Increase it if your tasks need to send large results back to the driver (for example, calling collect() on a large dataset).
  • spark.akka.threads (default: 4): The number of actor threads used for communication. On large clusters it can help to increase this when the driver has many CPU cores.
  • spark.akka.timeout (default: 20): The communication timeout between Spark nodes, in seconds.
  • spark.driver.host (default: local hostname): The hostname or IP address the driver listens on.
  • spark.driver.port (default: random): The port the driver listens on.
  • spark.cleaner.ttl (default: disabled): How long, in seconds, Spark remembers any metadata (stages generated, tasks generated, and so on). Periodic cleanups ensure that metadata older than this duration is forgotten. This is useful when Spark runs for many hours or days (for example, Spark Streaming applications running 24/7). Note that any RDD persisted in memory longer than this duration will be cleared as well.
  • spark.streaming.blockInterval (default: 200): The interval, in milliseconds, over which objects arriving from network receivers are batched.

Configuring Logging

Spark uses log4j for logging. You can configure it by adding a log4j.properties file to the conf directory. A good way to start is to copy the log4j.properties.template that already exists there and rename it to log4j.properties.
