Spark: improving execution efficiency by setting spark.default.parallelism appropriately
Source: Internet  Editor: 程序博客网  Date: 2024/05/19 20:38
Spark has the concept of a partition (the same concept as a slice; the official docs clarified this in Spark 1.2), and in general each partition corresponds to one task. In my tests, when spark.default.parallelism was not set, the number of partitions Spark computed was enormous and completely out of proportion to my cores. On two machines (8 cores × 2, 6 GB × 2), Spark computed about 28,000 partitions, i.e. roughly the same number of tasks, each finishing in a few milliseconds or even a fraction of a millisecond, and the job ran very slowly. After I set spark.default.parallelism, the task count dropped to 10, and one run of the computation went from minutes down to about 20 seconds.
The parameter can be set in the $SPARK_HOME/conf/spark-defaults.conf configuration file.
e.g.

spark.master                    spark://master:7077
spark.default.parallelism       10
spark.driver.memory             2g
spark.serializer                org.apache.spark.serializer.KryoSerializer
spark.sql.shuffle.partitions    50
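The same settings can also be supplied per job on the spark-submit command line with --conf, without touching the config file. This is a sketch of the equivalent invocation; the master URL matches the example above, and your_app.jar is a placeholder for your application:

```shell
# Per-job equivalents of the spark-defaults.conf entries above.
# Adjust the master URL and application jar for your own cluster.
spark-submit \
  --master spark://master:7077 \
  --conf spark.default.parallelism=10 \
  --conf spark.driver.memory=2g \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.sql.shuffle.partitions=50 \
  your_app.jar
```

Values set programmatically on SparkConf take precedence over --conf flags, which in turn take precedence over spark-defaults.conf.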
Below are the relevant descriptions from the official documentation:
from:http://spark.apache.org/docs/latest/configuration.html
spark.default.parallelism

Default: For distributed shuffle operations like reduceByKey and join, the largest number of partitions in a parent RDD. For operations like parallelize with no parent RDDs, it depends on the cluster manager:
- Local mode: number of cores on the local machine
- Mesos fine grained mode: 8
- Others: total number of cores on all executor nodes or 2, whichever is larger

Meaning: Default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set by user.

from: http://spark.apache.org/docs/latest/tuning.html
Level of Parallelism

Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of "map" tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc.), and for distributed "reduce" operations, such as groupByKey and reduceByKey, it uses the largest parent RDD's number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.