Understanding treeReduce() in Spark
Source: Internet · Editor: 程序博客网 · Date: 2024/06/02 07:28
There is a fundamental difference between the two: reduceByKey is only available on key-value pair RDDs, while treeReduce generalizes the reduce operation to any RDD. reduceByKey is used in the implementation of treeReduce, but the two are otherwise unrelated.

reduceByKey performs a reduction per key, resulting in a new RDD; it is not an "action" in the RDD sense but a transformation that returns a ShuffledRDD. It is equivalent to groupByKey followed by a map that does the per-key reduction (see the discussion of why using groupByKey is inefficient).

treeReduce, on the other hand, is a generalization of the reduce function, inspired by AllReduce. It is an "action" in the Spark sense, returning the result to the driver. As explained in the link posted in the question, after the local reduce operations, plain reduce performs the rest of the computation on the driver, which can be very burdensome (especially in machine learning, where the reduce function can produce large vectors or matrices). treeReduce instead performs the reduction in parallel using reduceByKey: it creates a key-value pair RDD on the fly, with the keys determined by the depth of the tree (check the implementation here).

So, to answer the first two questions: you have to use reduceByKey for word count, since you want a per-word count, and treeReduce is not appropriate there. The other two questions are not related to this topic.
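The distinction can be sketched in plain Python, with no Spark required; `reduce_by_key` below is a hypothetical stand-in that mimics the semantics of the RDD operation, not Spark's actual implementation:

```python
from functools import reduce
from collections import defaultdict

def reduce_by_key(pairs, f):
    """Simulate RDD.reduceByKey: reduce values per key, returning a new
    collection (a transformation in Spark terms, not an action)."""
    buckets = defaultdict(list)
    for k, v in pairs:
        buckets[k].append(v)
    return {k: reduce(f, vs) for k, vs in buckets.items()}

words = ["spark", "tree", "spark", "reduce", "spark"]
pairs = [(w, 1) for w in words]

# Word count needs reduceByKey: one result per key.
counts = reduce_by_key(pairs, lambda a, b: a + b)
# counts == {"spark": 3, "tree": 1, "reduce": 1}

# reduce/treeReduce collapse the whole dataset to a single value (an action),
# so per-word counts cannot be recovered from it.
total = reduce(lambda a, b: a + b, (v for _, v in pairs))
# total == 5
```

This is why treeReduce cannot replace reduceByKey for word count: its result type is a single value, not a keyed collection.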
Standard reduce takes a wrapped version of the function and uses it in mapPartitions. After that, the partial results are collected and reduced locally on the driver. If the number of partitions is large and/or the function you use is expensive, this places a significant load on a single machine.

The first phase of treeReduce is much the same as above, but afterwards the partial results are merged in parallel, and only the final aggregation is performed on the driver.

depth is the suggested depth of the tree. Since the depth of a node in a tree is defined as the number of edges between the root and the node, it should give you more or less the expected pattern, although it looks like the distributed aggregation can be stopped early in some cases.

It is worth noting that what you get with treeReduce is not a binary tree. The number of partitions is adjusted at each level, and most likely more than two partitions will be merged at once.

Compared to standard reduce, the tree-based version performs reduceByKey on each iteration, which means a lot of data shuffling. If the number of partitions is relatively small, plain reduce will be much cheaper. If you suspect that the final phase of reduce is a bottleneck, the tree* version could be worth trying.
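The two phases described above can be sketched in plain Python. This is a simplified model of the pattern, not Spark's actual code: real treeReduce takes a `depth` parameter and shuffles data between levels, whereas here `scale` (a name chosen for this sketch) controls how many partial results are merged per round:

```python
from functools import reduce

def tree_reduce(partitions, f, scale=2):
    """Sketch of treeReduce: first reduce each partition locally (the phase
    shared with plain reduce), then merge the partial results in rounds of
    `scale` until one value remains, instead of sending every partial
    straight to the driver."""
    # Phase 1: local per-partition reduction (mapPartitions in Spark).
    partials = [reduce(f, p) for p in partitions if p]
    # Phase 2: merge partials level by level; note this is not a binary
    # tree -- more than two partials can be merged at once.
    while len(partials) > 1:
        partials = [reduce(f, partials[i:i + scale])
                    for i in range(0, len(partials), scale)]
    return partials[0]

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(tree_reduce(parts, lambda a, b: a + b))  # 36
```

With plain reduce, all four partials (3, 7, 11, 15) would be combined on the driver; here only the final merge of the last level happens there. In real Spark, each merge level costs a shuffle, which is why the tree version only pays off with many partitions or expensive merge functions.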