Spark - Spark Configuration


Source: Spark Configuration (official Spark documentation)

Spark Properties

There are three ways to set Spark properties:
  1. SparkConf
  2. bin/spark-submit
  3. the conf/spark-defaults.conf file
Precedence: SparkConf > flags passed to spark-submit or spark-shell > spark-defaults.conf file. The final configuration is a merge of all three.
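
As a minimal sketch of these precedence rules (the property and values below are only illustrative), a value set directly on SparkConf wins over the same key passed on the command line or read from spark-defaults.conf:

import org.apache.spark.{SparkConf, SparkContext}

// Suppose spark-defaults.conf sets spark.executor.memory to 2g and
// spark-submit passes --conf spark.executor.memory=4g; the value set
// programmatically on SparkConf still takes precedence.
val conf = new SparkConf().set("spark.executor.memory", "8g")
val sc = new SparkContext(conf)
println(sc.getConf.get("spark.executor.memory"))   // prints 8g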


Spark properties let you configure each application differently; for example, to run in local mode with 2 threads:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("CountingSheep")
val sc = new SparkContext(conf)

Time and byte properties should be specified with a unit, for example:

Durations:
  • 25ms (milliseconds)
  • 5s (seconds)
  • 10m or 10min (minutes)
  • 3h (hours)
  • 5d (days)
  • 1y (years)

Byte sizes:
  • 1b (bytes)
  • 1k or 1kb (kibibytes = 1024 bytes)
  • 1m or 1mb (mebibytes = 1024 kibibytes)
  • 1g or 1gb (gibibytes = 1024 mebibytes)
  • 1t or 1tb (tebibytes = 1024 gibibytes)
  • 1p or 1pb (pebibytes = 1024 tebibytes)
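
As a small sketch of using these suffixes (the chosen properties and values are only illustrative), any duration or size property accepts them:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.memory", "4g")     // byte size: 4 gibibytes
  .set("spark.network.timeout", "120s")   // duration: 120 seconds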

Dynamically Loading Spark Properties

You can create an empty conf:

val sc = new SparkContext(new SparkConf())

and supply the properties at runtime:

./bin/spark-submit \
  --name "My app" \
  --master local[4] \
  --conf spark.eventLog.enabled=false \
  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  myApp.jar

bin/spark-submit also reads configuration options from conf/spark-defaults.conf, in which each line consists of a key and a value separated by whitespace, for example:

spark.master            spark://5.6.7.8:7077
spark.executor.memory   4g
spark.eventLog.enabled  true
spark.serializer        org.apache.spark.serializer.KryoSerializer


Viewing Spark Properties

Check the “Environment” tab of the web UI at http://<driver>:4040 to verify that the properties you submitted were applied as intended.
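
The effective configuration can also be inspected programmatically; as a minimal sketch (assuming the SparkContext sc from the earlier examples), note that only explicitly set values will appear:

// Print every explicitly configured property as key=value.
sc.getConf.getAll.foreach { case (key, value) => println(s"$key=$value") }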

Available Properties

Most of the properties that control internal settings have reasonable default values. Some of the most common options to set are:

Application Properties

  • spark.driver.maxResultSize: limits the total size of the serialized results of Spark actions (e.g. collect) sent to the driver; jobs will be aborted if the total size exceeds this limit (see the sketch after this list).
  • spark.memory.fraction (default 0.6): Fraction of (heap space - 300MB) used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records. Leaving this at the default value is recommended. For more detail, including important information about correctly tuning JVM garbage collection when increasing this value, see this description.
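
A minimal sketch of setting both of these application properties on a SparkConf (the values are only illustrative):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("CountingSheep")
  .set("spark.driver.maxResultSize", "2g")   // abort jobs whose collected results exceed 2g
  .set("spark.memory.fraction", "0.6")       // fraction of (heap - 300MB) for execution and storage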

Inheriting Hadoop Cluster Configuration

If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that should be included on Spark’s classpath:

  • hdfs-site.xml, which provides default behaviors for the HDFS client.
  • core-site.xml, which sets the default filesystem name.

The location of these configuration files varies across CDH and HDP versions, but a common location is inside of /etc/hadoop/conf. Some tools, such as Cloudera Manager, create configurations on-the-fly, but offer a mechanism to download copies of them.

To make these files visible to Spark, set HADOOP_CONF_DIR in $SPARK_HOME/conf/spark-env.sh to a location containing the configuration files.
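
For example, a line like the following in spark-env.sh points Spark at the common location mentioned above (adjust the path to match your distribution):

export HADOOP_CONF_DIR=/etc/hadoop/conf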