Spark-partitioner



@(spark)[partitioner]

Partitioner

/**
 * An object that defines how the elements in a key-value pair RDD are partitioned by key.
 * Maps each key to a partition ID, from 0 to `numPartitions - 1`.
 */
abstract class Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}
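To make the contract concrete, here is a minimal sketch of a custom partitioner. It assumes a local stand-in for the abstract class quoted above so that it runs without Spark on the classpath; `PartitionerSketch` and `FirstCharPartitioner` are hypothetical names invented for illustration, not part of Spark.

```scala
object PartitionerSketch {
  // Local stand-in mirroring the abstract class quoted above (an assumption
  // made so the sketch compiles without a Spark dependency).
  abstract class Partitioner extends Serializable {
    def numPartitions: Int
    def getPartition(key: Any): Int
  }

  // Hypothetical partitioner: routes string keys by their first character.
  class FirstCharPartitioner(parts: Int) extends Partitioner {
    require(parts > 0, s"Number of partitions ($parts) must be positive.")

    override def numPartitions: Int = parts

    // Must always return a value in 0 until numPartitions.
    override def getPartition(key: Any): Int = key match {
      case s: String if s.nonEmpty => s.charAt(0).toInt % parts // char codes are non-negative
      case _                       => 0 // null, empty, or non-string keys
    }

    // Two partitioners that route keys identically should compare equal.
    override def equals(other: Any): Boolean = other match {
      case p: FirstCharPartitioner => p.numPartitions == numPartitions
      case _                       => false
    }

    override def hashCode: Int = numPartitions
  }
}
```

With Spark itself, a class like this would extend `org.apache.spark.Partitioner` instead of the stand-in. Overriding `equals` matters there because Spark compares partitioners to decide whether two RDDs are already partitioned the same way.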

HashPartitioner

/**
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
 * Java's `Object.hashCode`.
 *
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
 * produce an unexpected or incorrect result.
 */
class HashPartitioner(partitions: Int) extends Partitioner {
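The class body is cut off above, so the following is only a sketch of hash-based partitioning as the doc comment describes it, not a copy of Spark's code. It assumes the partition ID is the key's `hashCode` reduced modulo the partition count, shifted to be non-negative, with `null` keys sent to partition 0. `HashPartitionSketch` is a hypothetical name.

```scala
object HashPartitionSketch {
  // Scala's % can return a negative result for a negative left operand,
  // so shift the remainder back into 0 until mod.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val rawMod = x % mod
    rawMod + (if (rawMod < 0) mod else 0)
  }

  // Assumed behaviour: null keys go to partition 0, everything else by hashCode.
  def getPartition(key: Any, numPartitions: Int): Int = key match {
    case null => 0
    case _    => nonNegativeMod(key.hashCode, numPartitions)
  }
}
```

This also shows why array keys are a problem: `Array(1, 2, 3).hashCode` depends on the object's identity, so two arrays with the same contents can land in different partitions. One possible workaround is to convert the key to a content-hashed collection first, for example `Array(1, 2, 3).toList`, whose `hashCode` depends only on its elements.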

RangePartitioner

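This is in fact the partitioner used for sort-based operations:
1. Take a sample to get the approximate distribution of the data.
2. For each key, determine its partition from the sample above.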

/**
 * A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly
 * equal ranges. The ranges are determined by sampling the content of the RDD passed in.
 *
 * Note that the actual number of partitions created by the RangePartitioner might not be the same
 * as the `partitions` parameter, in the case where the number of sampled records is less than
 * the value of `partitions`.
 */
class RangePartitioner[K : Ordering : ClassTag, V](
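The two steps above can be sketched in plain Scala. This is a deliberately simplified illustration and not Spark's implementation: it assumes the sample is already collected into a local `Seq`, picks evenly spaced elements of the sorted sample as range bounds, and finds a key's partition by scanning those bounds. `RangePartitionSketch` and its method names are hypothetical.

```scala
object RangePartitionSketch {
  // Step 1: derive up to (partitions - 1) ascending, distinct upper bounds
  // from a sample of the keys.
  def rangeBounds[K: Ordering](sample: Seq[K], partitions: Int): Vector[K] = {
    require(partitions > 0, "partitions must be positive")
    val sorted = sample.sorted.toVector
    if (sorted.isEmpty) Vector.empty
    else
      (1 until partitions)
        .map(i => sorted(math.min(i * sorted.length / partitions, sorted.length - 1)))
        .distinct
        .toVector
  }

  // Step 2: a key belongs to the first range whose upper bound is >= the key,
  // which equals the number of bounds strictly smaller than the key.
  def getPartition[K: Ordering](key: K, bounds: Vector[K]): Int = {
    val ord = implicitly[Ordering[K]]
    bounds.count(b => ord.gt(key, b))
  }

  // n bounds split the key space into n + 1 ranges.
  def numPartitions[K](bounds: Vector[K]): Int = bounds.length + 1
}
```

With the sample `Seq(5, 1, 9, 3, 7, 2, 8, 4, 6, 10)` and three requested partitions, the bounds come out as `Vector(4, 7)`, so keys up to 4 go to partition 0, keys 5 to 7 go to partition 1, and larger keys go to partition 2. With a one-element sample and four requested partitions, only one distinct bound survives and the sketch yields two partitions, which mirrors the note in the doc comment that the actual count can differ from `partitions` when the sample is small.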