Spark Source Code Reading, Part 4: sortByKey
More code at: https://github.com/xubo245/SparkLearning
Environment for this source-reading series: spark-2.0.1 (GitHub snapshot downloaded 2016-11-03)
1. Understanding
1.1 Motivation
sortByKey comes up all the time in Spark programs, for example when sorting the result of a word count. You can either call sortBy directly, or map first and then call sortByKey; sortBy itself is implemented on top of sortByKey, as the sketch below shows.
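A minimal sketch of both routes (my own illustration, not from this post's repository; the input path is a placeholder and an existing SparkContext sc is assumed):

val counts = sc.textFile("input.txt")              // hypothetical input path
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Route 1: sortBy with a key function, here the count, descending.
val byCount = counts.sortBy(_._2, ascending = false)

// Route 2: map to (count, word) pairs first, then sortByKey.
val byCountToo = counts.map { case (w, c) => (c, w) }.sortByKey(ascending = false)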
1.2 Source code
sortByKey: org.apache.spark.rdd.OrderedRDDFunctions#sortByKey
/**
 * Sort the RDD by key, so that each partition contains a sorted range of the elements. Calling
 * `collect` or `save` on the resulting RDD will return or output an ordered list of records
 * (in the `save` case, they will be written to multiple `part-X` files in the filesystem, in
 * order of the keys).
 */
// TODO: this currently doesn't work on P other than Tuple2!
def sortByKey(ascending: Boolean = true, numPartitions: Int = self.partitions.length)
    : RDD[(K, V)] = self.withScope {
  val part = new RangePartitioner(numPartitions, self, ascending)
  new ShuffledRDD[K, V, V](self, part)
    .setKeyOrdering(if (ascending) ordering else ordering.reverse)
}
sortBy: org.apache.spark.rdd.RDD#sortBy
/**
 * Return this RDD sorted by the given key function.
 */
def sortBy[K](
    f: (T) => K,
    ascending: Boolean = true,
    numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T] = withScope {
  this.keyBy[K](f)
    .sortByKey(ascending, numPartitions)
    .values
}
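Since sortBy simply expands to keyBy + sortByKey + values, the two forms below produce the same result (a quick check of my own, assuming an existing SparkContext sc):

val words = sc.parallelize(Seq("shuffle", "rdd", "spark"))
val viaSortBy = words.sortBy(_.length).collect()
val expanded  = words.keyBy(_.length).sortByKey().values.collect()
assert(viaSortBy.sameElements(expanded))  // sortBy == keyBy + sortByKey + values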
1.3 Analysis
1.3.1 sortByKey and its partitioner, RangePartitioner
sortByKey uses a RangePartitioner, which was examined briefly in the earlier post in this series, "spark源码解读1之Partitioner". Compared with hash partitioning, a RangePartitioner goes a long way toward avoiding skewed data distribution across partitions.
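To watch the partitioner in isolation, here is a small hand-written snippet (not from the Spark sources; the names are mine, and an existing SparkContext sc is assumed):

import org.apache.spark.RangePartitioner

val pairs = sc.parallelize(Seq((17, "a"), (3, "b"), (99, "c"), (42, "d")), 2)
val part = new RangePartitioner(2, pairs)   // samples pairs to compute rangeBounds
println(part.numPartitions)                 // rangeBounds.length + 1
Seq(3, 17, 42, 99).foreach { k =>
  println(s"key $k -> partition ${part.getPartition(k)}")  // contiguous key ranges
}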
RangePartitioner sorts the candidate bounds in determineBounds using scala.collection.SeqLike#sorted, which in turn calls java.util.Arrays#sort(T[], java.util.Comparator), i.e. TimSort for object arrays.
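A quick illustration of that call (my own example, using Java 8's Comparator.naturalOrder): sorting an object array through java.util.Arrays#sort with a Comparator dispatches to TimSort:

val arr = Array(Integer.valueOf(5), Integer.valueOf(1), Integer.valueOf(3))
java.util.Arrays.sort(arr, java.util.Comparator.naturalOrder[Integer]())
println(arr.mkString(", "))  // 1, 3, 5 -- sorted by TimSort under the hood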
1.3.2 ShuffledRDD
sortByKey constructs and returns a new ShuffledRDD; this ShuffledRDD is the sorted result.
1.3.2.1 The partitions_ field
The partition metadata of the new ShuffledRDD lives in the partitions_ field, inherited from the parent class RDD. The final method partitions populates partitions_ by calling getPartitions, then walks the result with zipWithIndex to verify each partition's index.
Source of partitions:
final def partitions: Array[Partition] = {
  checkpointRDD.map(_.partitions).getOrElse {
    if (partitions_ == null) {
      partitions_ = getPartitions
      partitions_.zipWithIndex.foreach { case (partition, index) =>
        require(partition.index == index,
          s"partitions($index).partition == ${partition.index}, but it should equal $index")
      }
    }
    partitions_
  }
}
1.3.2.2 The getPartitions method
ShuffledRDD overrides getPartitions:
override def getPartitions: Array[Partition] = {
  Array.tabulate[Partition](part.numPartitions)(i => new ShuffledRDDPartition(i))
}
Here part.numPartitions is a property of the RangePartitioner passed in at 1.3.1:
def numPartitions: Int = rangeBounds.length + 1
rangeBounds itself holds the partition boundaries, chosen with reservoir sampling (a simplified sketch of that algorithm follows the snapshot below). With k bounds there are k + 1 partitions; the two bounds in the debugger snapshot below therefore give three partitions.
Debugger snapshot: rangeBounds = {int[2]@5390}, with rangeBounds(0) = -841013005 and rangeBounds(1) = 632826677.
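For reference, a minimal standalone sketch of reservoir sampling; this simplified version is my own, not the Spark implementation (Spark's lives in org.apache.spark.util.random.SamplingUtils):

import scala.reflect.ClassTag
import scala.util.Random

// Pick k items uniformly at random from a stream of unknown length in one pass.
def reservoirSample[T: ClassTag](input: Iterator[T], k: Int): Array[T] = {
  val reservoir = new Array[T](k)
  var i = 0
  while (i < k && input.hasNext) {   // fill the reservoir with the first k items
    reservoir(i) = input.next()
    i += 1
  }
  val rand = new Random()
  while (input.hasNext) {            // each later item replaces a slot with prob k/(i+1)
    val item = input.next()
    val j = rand.nextInt(i + 1)
    if (j < k) reservoir(j) = item
    i += 1
  }
  if (i < k) reservoir.take(i) else reservoir  // stream shorter than k: trim
}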
After getPartitions runs, the number of partitions of the RDD and their indices are fixed.
Only when an action is executed, e.g. top(k) followed by printing, is the data actually computed and laid out into the partitions described by partitions_.
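A quick way to see this (my own snippet; note that calling sortByKey does trigger a small sampling job up front so the RangePartitioner can compute rangeBounds, but the shuffle itself waits for an action):

val sorted = sc.parallelize(Seq((3, "c"), (1, "a"), (2, "b"))).sortByKey()
println(sorted.partitions.length)  // partition metadata exists, no shuffle yet
sorted.top(2).foreach(println)     // action: now the data is shuffled and sorted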
A copy of the debug data:
"WrappedArray$ofRef" size = 333values = {WrappedArray$ofRef@5956} "WrappedArray$ofRef" size = 333 0 = {Tuple2$mcII$sp@7560} "(-1813557161,-1212512531)" 1 = {Tuple2$mcII$sp@7561} "(-1144323740,933490971)" 2 = {Tuple2$mcII$sp@7562} "(-12508600,-329995331)" 3 = {Tuple2$mcII$sp@7563} "(-1570574142,-743284380)" 5 = {Tuple2$mcII$sp@7565} "(-532362478,1106605038)" 4 = {Tuple2$mcII$sp@7564} "(249668146,-1487774671)" 6 = {Tuple2$mcII$sp@7566} "(-146176592,666226908)"
The code used for local debugging:
test("large array") { val rand = new scala.util.Random() val pairArr = Array.fill(1000) { (rand.nextInt(), rand.nextInt()) } val pairs = sc.parallelize(pairArr, 3) val sorted = pairs.sortByKey() sorted.count() sorted.top(3).foreach(println) assert(sorted.partitions.size === 3) assert(sorted.collect() === pairArr.sortBy(_._1)) }
The parent of this ShuffledRDD, reached through the RDD dependency chain, is the ParallelCollectionRDD created by parallelize, whose partitions_ holds three ParallelCollectionPartitions; ParallelCollectionRDD overrides getPartitions as follows:
override def getPartitions: Array[Partition] = {
  val slices = ParallelCollectionRDD.slice(data, numSlices).toArray
  slices.indices.map(i => new ParallelCollectionPartition(id, i, slices(i))).toArray
}
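ParallelCollectionRDD.slice is private to Spark, but its effect is easy to observe with glom(), which gathers each partition into an array (my own snippet; the exact split depends on the slicing rules):

sc.parallelize(1 to 10, 3).glom().collect().foreach(a => println(a.mkString(",")))
// Typically prints one line per slice: 1,2,3 / 4,5,6 / 7,8,9,10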
I have not found where the records inside each partition actually get sorted, so that logic is still unclear; presumably the key ordering set via setKeyOrdering is applied when the shuffled data is read back, but tracing that requires further study of the RDD, DAG, and Stage code.
2. Code
Using sortByKey: org.apache.spark.rdd.SortingSuite
test("sortByKey") { val pairs = sc.parallelize(Array((1, 0), (2, 0), (0, 0), (3, 0)), 2) assert(pairs.sortByKey().collect() === Array((0, 0), (1, 0), (2, 0), (3, 0))) }
sortBy: org.apache.spark.rdd.RDDSuite
test("sortBy") { val data = sc.parallelize(Seq("5|50|A", "4|60|C", "6|40|B")) val col1 = Array("4|60|C", "5|50|A", "6|40|B") val col2 = Array("6|40|B", "5|50|A", "4|60|C") val col3 = Array("5|50|A", "6|40|B", "4|60|C") assert(data.sortBy(_.split("\|")(0)).collect() === col1) assert(data.sortBy(_.split("\|")(1)).collect() === col2) assert(data.sortBy(_.split("\|")(2)).collect() === col3) }
3. Findings
3.1 TimSort remains to be studied.
3.2 RangePartitioner only determines numPartitions and getPartition(key: Any); where the records inside each partition get sorted was not found.
References
【1】 http://spark.apache.org/
【2】 http://spark.apache.org/docs/1.5.2/programming-guide.html
【3】 https://github.com/xubo245/SparkLearning
【4】 Book: 《深入理解spark核心思想与源码分析》
【5】 Book: 《spark核心源码分析和开发实战》
【6】 http://blog.csdn.net/u014393917/article/details/50602047