Spark Custom Partitioning
Contents

I. Requirement
II. Code
III. Sample Data
IV. Results
V. The Three Partitioning Approaches
  1. Default partitioning (in fact HashPartitioner)
  2. HashPartitioner
  3. RangePartitioner
I. Requirement

To guard against heavy data skew, we define a custom Partitioner. In the map stage we emit tuples of (Int, String) and hash the Int by taking it modulo the number of partitions, so records are spread evenly across the partitions. A later evolution of the approach: customize the map key itself, making the key a random number drawn from a fixed range.
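As a hedged sketch of that later evolution (the salt range of 10 and the two-stage aggregation below are my assumptions, not code from this article): prefix each key with a random number so a single hot key is spread across several partitions, aggregate per salted key, then strip the salt and aggregate again.

import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("saltingSketch"))
val pairs = sc.parallelize(Seq(("hot", 1), ("hot", 1), ("hot", 1), ("cold", 1)))

// Spread "hot" over up to 10 salted keys; salt range is illustrative
val salted = pairs.map { case (k, v) => (s"${Random.nextInt(10)}_$k", v) }
val result = salted.reduceByKey(_ + _)                  // stage 1: partial sums per salted key
  .map { case (sk, v) => (sk.split("_", 2)(1), v) }     // strip the salt prefix
  .reduceByKey(_ + _)                                   // stage 2: final sums per original key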
II. Code

Two classes: defineSparkPartition.scala and UsedefineSparkPartition.scala.
Notes:
(1) Do not use the flatMap() method here.
(2) Only key-value RDDs carry a partitioner; for a non-key-value RDD the partitioner is None (see the sketch right after these notes).
(3) Partition IDs within an RDD range from 0 to numPartitions - 1; the ID determines which partition a given key belongs to.
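A minimal sketch of note (2), assuming an existing SparkContext named sc (this example is mine, not from the original article):

val nums = sc.parallelize(1 to 10)
println(nums.partitioner)        // None: not a key-value RDD

val kv = nums.map(n => (n, n * n))
println(kv.partitioner)          // still None until a partitioner is applied

val byHash = kv.partitionBy(new org.apache.spark.HashPartitioner(4))
println(byHash.partitioner)      // Some(org.apache.spark.HashPartitioner@4)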
import org.apache.spark.Partitioner

/**
 * Created by yuhui
 */
class defineSparkPartition(numParts: Int) extends Partitioner {

  /**
   * Returns the number of partitions to create.
   */
  override def numPartitions: Int = numParts

  /**
   * Computes a partition ID for the given key; the result must fall
   * in the range 0 to numPartitions - 1.
   */
  override def getPartition(key: Any): Int = {
    val domain = new java.net.URL(key.toString).getHost()
    domain match {
      case "blog.csdn.net"  => 1 % numPartitions
      case "news.cctv.com"  => 2 % numPartitions
      case "news.china.com" => 3 % numPartitions
      case _                => 4 % numPartitions
    }
  }

  /**
   * Standard Java equality. Users must implement it because Spark
   * internally compares whether two RDDs are partitioned the same way.
   */
  override def equals(other: Any): Boolean = other match {
    case mypartition: defineSparkPartition => mypartition.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by yuhui
 */
object UsedefineSparkPartition {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("UsedefineSparkPartition")
    val sc = new SparkContext(conf)

    // Read the local input file
    val lines = sc.textFile("D:/word.txt")
    // Note: the RDD must be key-value for partitionBy to apply
    val splitMap = lines.map(line => (line.split(",")(0), line.split(",")(1)))

    // Partition with the custom partitioner and save to a local directory
    splitMap.partitionBy(new defineSparkPartition(4)).saveAsTextFile("D:/partrion/test")
    sc.stop()
  }
}
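To sanity-check the placement without writing files, a small sketch of my own (not in the original article) can be dropped into the same main method; it prints every record together with its partition index:

splitMap.partitionBy(new defineSparkPartition(4))
  .mapPartitionsWithIndex { (idx, iter) => iter.map(pair => s"partition $idx -> $pair") }
  .collect()
  .foreach(println)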
III. Sample Data
http://blog.csdn.net/silentwolfyh/article/details/76993419,blog.csdn.net
http://blog.csdn.net/silentwolfyh/article/details/76860369,blog.csdn.net
http://blog.csdn.net/silentwolfyh/article/details/77571596,blog.csdn.net
http://blog.csdn.net/silentwolfyh/article/details/77188905,blog.csdn.net
http://news.cctv.com/2017/09/18/ARTIEX7bcZI2cYUqrsEC2DLf170918.shtml,news.cctv.com
http://news.cctv.com/2017/09/18/ARTI4McIqsaFV6115br9eiRJ170918.shtml,news.cctv.com
http://news.cctv.com/2017/09/18/ARTfdabrnntvV6115br9eiRJ170918.shtml,news.cctv.com
http://news.china.com/domestic/945/20170919/31463894.html,news.china.com
http://news.china.com/domestic/945/20170919/31464711.html,news.china.com
http://news.china.com/domestic/945/20170919/31464711.html,news.china.com
https://www.baidu.com/,www.baidu.com
http://news.163.com/17/0918/22/CULBLQUT0001899N.html,news.163.com
http://news.163.com/17/0919/06/CUM7EVQI0001899N.html,news.163.com
http://news.163.com/17/0919/03/CULRN5180001875P.html,news.163.com
IV. Results
part-00000
(https://www.baidu.com/,www.baidu.com)
(http://news.163.com/17/0918/22/CULBLQUT0001899N.html,news.163.com)
(http://news.163.com/17/0919/06/CUM7EVQI0001899N.html,news.163.com)
(http://news.163.com/17/0919/03/CULRN5180001875P.html,news.163.com)
part-00001
(http://blog.csdn.net/silentwolfyh/article/details/76993419,blog.csdn.net)
(http://blog.csdn.net/silentwolfyh/article/details/76860369,blog.csdn.net)
(http://blog.csdn.net/silentwolfyh/article/details/77571596,blog.csdn.net)
(http://blog.csdn.net/silentwolfyh/article/details/77188905,blog.csdn.net)
part-00002
(http://news.cctv.com/2017/09/18/ARTIEX7bcZI2cYUqrsEC2DLf170918.shtml,news.cctv.com)
(http://news.cctv.com/2017/09/18/ARTI4McIqsaFV6115br9eiRJ170918.shtml,news.cctv.com)
(http://news.cctv.com/2017/09/18/ARTfdabrnntvV6115br9eiRJ170918.shtml,news.cctv.com)
part-00003
(http://news.china.com/domestic/945/20170919/31463894.html,news.china.com)
(http://news.china.com/domestic/945/20170919/31464711.html,news.china.com)
(http://news.china.com/domestic/945/20170919/31464711.html,news.china.com)
V. The Three Partitioning Approaches

1. Default partitioning (in fact HashPartitioner)
defaultPartitioner.scala
/**
 * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
 *
 * If any of the RDDs already has a partitioner, choose that one.
 *
 * Otherwise, we use a default HashPartitioner. For the number of partitions, if
 * spark.default.parallelism is set, then we'll use the value from SparkContext
 * defaultParallelism, otherwise we'll use the max number of upstream partitions.
 *
 * Unless spark.default.parallelism is set, the number of partitions will be the
 * same as the number of partitions in the largest upstream RDD, as this should
 * be least likely to cause out-of-memory errors.
 *
 * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
 */
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
  for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) {
    return r.partitioner.get
  }
  if (rdd.context.conf.contains("spark.default.parallelism")) {
    new HashPartitioner(rdd.context.defaultParallelism)
  } else {
    new HashPartitioner(bySize.head.partitions.size)
  }
}
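In effect, a shuffle operator such as reduceByKey picks up a HashPartitioner automatically when none of its parents has one. A minimal illustration of my own, assuming an existing SparkContext sc:

val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1))).reduceByKey(_ + _)
println(counts.partitioner)       // Some(org.apache.spark.HashPartitioner@...)
println(counts.getNumPartitions)  // spark.default.parallelism if set, else the upstream partition count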
2. HashPartitioner

How HashPartitioner works: for a given key, compute its hashCode and take it modulo the number of partitions. If the remainder is negative, add the number of partitions to it; the resulting value is the partition ID the key belongs to. The implementation:
/**
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
 * Java's `Object.hashCode`.
 *
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
 * produce an unexpected or incorrect result.
 */
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner => h.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}
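The negative-remainder handling mentioned above is done by Utils.nonNegativeMod; its logic is essentially the following (restated standalone here for illustration):

def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)  // shift a negative remainder into [0, mod)
}

nonNegativeMod(7, 4)   // 3
nonNegativeMod(-7, 4)  // -7 % 4 == -3 in Scala, so the result is -3 + 4 = 1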
3. RangePartitioner

Drawback of HashPartitioner: it can leave the amount of data per partition uneven; in the extreme case a few partitions end up holding all of the RDD's data.

Advantage of RangePartitioner: it tries to keep the amount of data in each partition roughly equal, and the partitions themselves are ordered relative to each other: every element in one partition is smaller (or larger) than every element in another. Ordering within a partition, however, is not guaranteed. Put simply, it maps keys that fall within a given range to one particular partition.

In the implementation, the algorithm that determines the range boundaries is the crucial part; it lives in rangeBounds. The code is as follows:
/**
 * A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly
 * equal ranges. The ranges are determined by sampling the content of the RDD passed in.
 *
 * Note that the actual number of partitions created by the RangePartitioner might not be the same
 * as the `partitions` parameter, in the case where the number of sampled records is less than
 * the value of `partitions`.
 */
class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true)
  extends Partitioner {

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      val sampleSize = math.min(20.0 * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.size).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.size).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, partitions)
      }
    }
  }

  def numPartitions: Int = rangeBounds.length + 1

  private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition - 1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }

  override def equals(other: Any): Boolean = other match {
    case r: RangePartitioner[_, _] =>
      r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
    case _ =>
      false
  }

  override def hashCode(): Int = {
    val prime = 31
    var result = 1
    var i = 0
    while (i < rangeBounds.length) {
      result = prime * result + rangeBounds(i).hashCode
      i += 1
    }
    result = prime * result + ascending.hashCode
    result
  }

  @throws(classOf[IOException])
  private def writeObject(out: ObjectOutputStream): Unit = Utils.tryOrIOException {
    val sfactory = SparkEnv.get.serializer
    sfactory match {
      case js: JavaSerializer => out.defaultWriteObject()
      case _ =>
        out.writeBoolean(ascending)
        out.writeObject(ordering)
        out.writeObject(binarySearch)

        val ser = sfactory.newInstance()
        Utils.serializeViaNestedStream(out, ser) { stream =>
          stream.writeObject(scala.reflect.classTag[Array[K]])
          stream.writeObject(rangeBounds)
        }
    }
  }

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = Utils.tryOrIOException {
    val sfactory = SparkEnv.get.serializer
    sfactory match {
      case js: JavaSerializer => in.defaultReadObject()
      case _ =>
        ascending = in.readBoolean()
        ordering = in.readObject().asInstanceOf[Ordering[K]]
        binarySearch = in.readObject().asInstanceOf[(Array[K], K) => Int]

        val ser = sfactory.newInstance()
        Utils.deserializeViaNestedStream(in, ser) { ds =>
          implicit val classTag = ds.readObject[ClassTag[Array[K]]]()
          rangeBounds = ds.readObject[Array[K]]()
        }
    }
  }
}
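A short usage sketch of my own (not from the original article), assuming an existing SparkContext sc; this is also the partitioner that sortByKey relies on:

import org.apache.spark.RangePartitioner

val kv = sc.parallelize(Seq((15, "o"), (3, "c"), (42, "x"), (8, "h"), (27, "q")))
val ranged = kv.partitionBy(new RangePartitioner(3, kv))  // bounds come from sampling kv's keys
// Partitions now hold contiguous key ranges and are ordered relative to each
// other, though elements inside a single partition are not sorted.
ranged.mapPartitionsWithIndex { (idx, iter) => iter.map(pair => s"p$idx: $pair") }
  .collect()
  .foreach(println)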