Spark Custom Partitioning
Contents

I. Requirement
II. Code
III. Sample Data
IV. Results
V. The Three Partitioning Approaches
  1. Default partitioning (in fact HashPartitioner)
  2. HashPartitioner
  3. RangePartitioner
I. Requirement

To guard against heavy data skew, we define a custom Partitioner. In the map stage we emit tuples of (Int, String) and hash the Int by taking it modulo the number of partitions, so records are spread evenly across the partitions. A later evolution of the approach: customize the map key itself, making the key a random number drawn from a fixed range.
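As a hedged sketch of that later evolution (the salt range of 10 and the two-stage aggregation below are my assumptions, not code from this article): prefix each key with a random number so a single hot key is spread across several partitions, aggregate per salted key, then strip the salt and aggregate again.

import scala.util.Random
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("saltingSketch"))
val pairs = sc.parallelize(Seq(("hot", 1), ("hot", 1), ("hot", 1), ("cold", 1)))

// Spread "hot" over up to 10 salted keys; salt range is illustrative
val salted = pairs.map { case (k, v) => (s"${Random.nextInt(10)}_$k", v) }
val result = salted.reduceByKey(_ + _)                  // stage 1: partial sums per salted key
  .map { case (sk, v) => (sk.split("_", 2)(1), v) }     // strip the salt prefix
  .reduceByKey(_ + _)                                   // stage 2: final sums per original key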
II. Code

Two classes: defineSparkPartition.scala and UsedefineSparkPartition.scala.
Notes:
(1) Do not use the flatMap() method here.
(2) Only key-value RDDs carry a partitioner; for a non-key-value RDD the partitioner is None (see the sketch right after these notes).
(3) Partition IDs within an RDD range from 0 to numPartitions - 1; the ID determines which partition a given key belongs to.
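A minimal sketch of note (2), assuming an existing SparkContext named sc (this example is mine, not from the original article):

val nums = sc.parallelize(1 to 10)
println(nums.partitioner)        // None: not a key-value RDD

val kv = nums.map(n => (n, n * n))
println(kv.partitioner)          // still None until a partitioner is applied

val byHash = kv.partitionBy(new org.apache.spark.HashPartitioner(4))
println(byHash.partitioner)      // Some(org.apache.spark.HashPartitioner@4)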
import org.apache.spark.Partitioner

/**
 * Created by yuhui
 */
class defineSparkPartition(numParts: Int) extends Partitioner {

  /**
   * Returns the number of partitions to create.
   */
  override def numPartitions: Int = numParts

  /**
   * Computes a partition ID for the given key; the result must fall
   * in the range 0 to numPartitions - 1.
   */
  override def getPartition(key: Any): Int = {
    val domain = new java.net.URL(key.toString).getHost()
    domain match {
      case "blog.csdn.net"  => 1 % numPartitions
      case "news.cctv.com"  => 2 % numPartitions
      case "news.china.com" => 3 % numPartitions
      case _                => 4 % numPartitions
    }
  }

  /**
   * Standard Java equality. Users must implement it because Spark
   * internally compares whether two RDDs are partitioned the same way.
   */
  override def equals(other: Any): Boolean = other match {
    case mypartition: defineSparkPartition => mypartition.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by yuhui
 */
object UsedefineSparkPartition {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("UsedefineSparkPartition")
    val sc = new SparkContext(conf)

    // Read the local input file
    val lines = sc.textFile("D:/word.txt")
    // Note: the RDD must be key-value for partitionBy to apply
    val splitMap = lines.map(line => (line.split(",")(0), line.split(",")(1)))

    // Partition with the custom partitioner and save to a local directory
    splitMap.partitionBy(new defineSparkPartition(4)).saveAsTextFile("D:/partrion/test")
    sc.stop()
  }
}
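To sanity-check the placement without writing files, a small sketch of my own (not in the original article) can be dropped into the same main method; it prints every record together with its partition index:

splitMap.partitionBy(new defineSparkPartition(4))
  .mapPartitionsWithIndex { (idx, iter) => iter.map(pair => s"partition $idx -> $pair") }
  .collect()
  .foreach(println)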
III. Sample Data
http://blog.csdn.net/silentwolfyh/article/details/76993419,blog.csdn.net
http://blog.csdn.net/silentwolfyh/article/details/76860369,blog.csdn.net
http://blog.csdn.net/silentwolfyh/article/details/77571596,blog.csdn.net
http://blog.csdn.net/silentwolfyh/article/details/77188905,blog.csdn.net
http://news.cctv.com/2017/09/18/ARTIEX7bcZI2cYUqrsEC2DLf170918.shtml,news.cctv.com
http://news.cctv.com/2017/09/18/ARTI4McIqsaFV6115br9eiRJ170918.shtml,news.cctv.com
http://news.cctv.com/2017/09/18/ARTfdabrnntvV6115br9eiRJ170918.shtml,news.cctv.com
http://news.china.com/domestic/945/20170919/31463894.html,news.china.com
http://news.china.com/domestic/945/20170919/31464711.html,news.china.com
http://news.china.com/domestic/945/20170919/31464711.html,news.china.com
https://www.baidu.com/,www.baidu.com
http://news.163.com/17/0918/22/CULBLQUT0001899N.html,news.163.com
http://news.163.com/17/0919/06/CUM7EVQI0001899N.html,news.163.com
http://news.163.com/17/0919/03/CULRN5180001875P.html,news.163.com
IV. Results
part-00000
(https://www.baidu.com/,www.baidu.com)
(http://news.163.com/17/0918/22/CULBLQUT0001899N.html,news.163.com)
(http://news.163.com/17/0919/06/CUM7EVQI0001899N.html,news.163.com)
(http://news.163.com/17/0919/03/CULRN5180001875P.html,news.163.com)
part-00001
(http://blog.csdn.net/silentwolfyh/article/details/76993419,blog.csdn.net)
(http://blog.csdn.net/silentwolfyh/article/details/76860369,blog.csdn.net)
(http://blog.csdn.net/silentwolfyh/article/details/77571596,blog.csdn.net)
(http://blog.csdn.net/silentwolfyh/article/details/77188905,blog.csdn.net)
part-00002
(http://news.cctv.com/2017/09/18/ARTIEX7bcZI2cYUqrsEC2DLf170918.shtml,news.cctv.com)
(http://news.cctv.com/2017/09/18/ARTI4McIqsaFV6115br9eiRJ170918.shtml,news.cctv.com)
(http://news.cctv.com/2017/09/18/ARTfdabrnntvV6115br9eiRJ170918.shtml,news.cctv.com)
part-00003
(http://news.china.com/domestic/945/20170919/31463894.html,news.china.com)
(http://news.china.com/domestic/945/20170919/31464711.html,news.china.com)
(http://news.china.com/domestic/945/20170919/31464711.html,news.china.com)
V. The Three Partitioning Approaches

1. Default partitioning (in fact HashPartitioner)
defaultPartitioner.scala
/**
 * Choose a partitioner to use for a cogroup-like operation between a number of RDDs.
 *
 * If any of the RDDs already has a partitioner, choose that one.
 *
 * Otherwise, we use a default HashPartitioner. For the number of partitions, if
 * spark.default.parallelism is set, then we'll use the value from SparkContext
 * defaultParallelism, otherwise we'll use the max number of upstream partitions.
 *
 * Unless spark.default.parallelism is set, the number of partitions will be the
 * same as the number of partitions in the largest upstream RDD, as this should
 * be least likely to cause out-of-memory errors.
 *
 * We use two method parameters (rdd, others) to enforce callers passing at least 1 RDD.
 */
def defaultPartitioner(rdd: RDD[_], others: RDD[_]*): Partitioner = {
  val bySize = (Seq(rdd) ++ others).sortBy(_.partitions.size).reverse
  for (r <- bySize if r.partitioner.isDefined && r.partitioner.get.numPartitions > 0) {
    return r.partitioner.get
  }
  if (rdd.context.conf.contains("spark.default.parallelism")) {
    new HashPartitioner(rdd.context.defaultParallelism)
  } else {
    new HashPartitioner(bySize.head.partitions.size)
  }
}
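In effect, a shuffle operator such as reduceByKey picks up a HashPartitioner automatically when none of its parents has one. A minimal illustration of my own, assuming an existing SparkContext sc:

val counts = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1))).reduceByKey(_ + _)
println(counts.partitioner)       // Some(org.apache.spark.HashPartitioner@...)
println(counts.getNumPartitions)  // spark.default.parallelism if set, else the upstream partition count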
2. HashPartitioner

How HashPartitioner works: for a given key, compute its hashCode and take it modulo the number of partitions. If the remainder is negative, add the number of partitions to it; the resulting value is the partition ID the key belongs to. The implementation:
/**
 * A [[org.apache.spark.Partitioner]] that implements hash-based partitioning using
 * Java's `Object.hashCode`.
 *
 * Java arrays have hashCodes that are based on the arrays' identities rather than their contents,
 * so attempting to partition an RDD[Array[_]] or RDD[(Array[_], _)] using a HashPartitioner will
 * produce an unexpected or incorrect result.
 */
class HashPartitioner(partitions: Int) extends Partitioner {
  require(partitions >= 0, s"Number of partitions ($partitions) cannot be negative.")

  def numPartitions: Int = partitions

  def getPartition(key: Any): Int = key match {
    case null => 0
    case _ => Utils.nonNegativeMod(key.hashCode, numPartitions)
  }

  override def equals(other: Any): Boolean = other match {
    case h: HashPartitioner => h.numPartitions == numPartitions
    case _ => false
  }

  override def hashCode: Int = numPartitions
}
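The negative-remainder handling mentioned above is done by Utils.nonNegativeMod; its logic is essentially the following (restated standalone here for illustration):

def nonNegativeMod(x: Int, mod: Int): Int = {
  val rawMod = x % mod
  rawMod + (if (rawMod < 0) mod else 0)  // shift a negative remainder into [0, mod)
}

nonNegativeMod(7, 4)   // 3
nonNegativeMod(-7, 4)  // -7 % 4 == -3 in Scala, so the result is -3 + 4 = 1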
3. RangePartitioner

Drawback of HashPartitioner: it can leave the amount of data per partition uneven; in the extreme case a few partitions end up holding all of the RDD's data.

Advantage of RangePartitioner: it tries to keep the amount of data in each partition roughly equal, and the partitions themselves are ordered relative to each other: every element in one partition is smaller (or larger) than every element in another. Ordering within a partition, however, is not guaranteed. Put simply, it maps keys that fall within a given range to one particular partition.

In the implementation, the algorithm that determines the range boundaries is the crucial part; it lives in rangeBounds. The code is as follows:
/**
 * A [[org.apache.spark.Partitioner]] that partitions sortable records by range into roughly
 * equal ranges. The ranges are determined by sampling the content of the RDD passed in.
 *
 * Note that the actual number of partitions created by the RangePartitioner might not be the same
 * as the `partitions` parameter, in the case where the number of sampled records is less than
 * the value of `partitions`.
 */
class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true)
  extends Partitioner {

  // We allow partitions = 0, which happens when sorting an empty RDD under the default settings.
  require(partitions >= 0, s"Number of partitions cannot be negative but found $partitions.")

  private var ordering = implicitly[Ordering[K]]

  // An array of upper bounds for the first (partitions - 1) partitions
  private var rangeBounds: Array[K] = {
    if (partitions <= 1) {
      Array.empty
    } else {
      // This is the sample size we need to have roughly balanced output partitions, capped at 1M.
      val sampleSize = math.min(20.0 * partitions, 1e6)
      // Assume the input partitions are roughly balanced and over-sample a little bit.
      val sampleSizePerPartition = math.ceil(3.0 * sampleSize / rdd.partitions.size).toInt
      val (numItems, sketched) = RangePartitioner.sketch(rdd.map(_._1), sampleSizePerPartition)
      if (numItems == 0L) {
        Array.empty
      } else {
        // If a partition contains much more than the average number of items, we re-sample from it
        // to ensure that enough items are collected from that partition.
        val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
        val candidates = ArrayBuffer.empty[(K, Float)]
        val imbalancedPartitions = mutable.Set.empty[Int]
        sketched.foreach { case (idx, n, sample) =>
          if (fraction * n > sampleSizePerPartition) {
            imbalancedPartitions += idx
          } else {
            // The weight is 1 over the sampling probability.
            val weight = (n.toDouble / sample.size).toFloat
            for (key <- sample) {
              candidates += ((key, weight))
            }
          }
        }
        if (imbalancedPartitions.nonEmpty) {
          // Re-sample imbalanced partitions with the desired sampling probability.
          val imbalanced = new PartitionPruningRDD(rdd.map(_._1), imbalancedPartitions.contains)
          val seed = byteswap32(-rdd.id - 1)
          val reSampled = imbalanced.sample(withReplacement = false, fraction, seed).collect()
          val weight = (1.0 / fraction).toFloat
          candidates ++= reSampled.map(x => (x, weight))
        }
        RangePartitioner.determineBounds(candidates, partitions)
      }
    }
  }

  def numPartitions: Int = rangeBounds.length + 1

  private var binarySearch: ((Array[K], K) => Int) = CollectionsUtils.makeBinarySearch[K]

  def getPartition(key: Any): Int = {
    val k = key.asInstanceOf[K]
    var partition = 0
    if (rangeBounds.length <= 128) {
      // If we have less than 128 partitions naive search
      while (partition < rangeBounds.length && ordering.gt(k, rangeBounds(partition))) {
        partition += 1
      }
    } else {
      // Determine which binary search method to use only once.
      partition = binarySearch(rangeBounds, k)
      // binarySearch either returns the match location or -[insertion point]-1
      if (partition < 0) {
        partition = -partition - 1
      }
      if (partition > rangeBounds.length) {
        partition = rangeBounds.length
      }
    }
    if (ascending) {
      partition
    } else {
      rangeBounds.length - partition
    }
  }

  override def equals(other: Any): Boolean = other match {
    case r: RangePartitioner[_, _] =>
      r.rangeBounds.sameElements(rangeBounds) && r.ascending == ascending
    case _ =>
      false
  }

  override def hashCode(): Int = {
    val prime = 31
    var result = 1
    var i = 0
    while (i < rangeBounds.length) {
      result = prime * result + rangeBounds(i).hashCode
      i += 1
    }
    result = prime * result + ascending.hashCode
    result
  }

  @throws(classOf[IOException])
  private def writeObject(out: ObjectOutputStream): Unit = Utils.tryOrIOException {
    val sfactory = SparkEnv.get.serializer
    sfactory match {
      case js: JavaSerializer => out.defaultWriteObject()
      case _ =>
        out.writeBoolean(ascending)
        out.writeObject(ordering)
        out.writeObject(binarySearch)

        val ser = sfactory.newInstance()
        Utils.serializeViaNestedStream(out, ser) { stream =>
          stream.writeObject(scala.reflect.classTag[Array[K]])
          stream.writeObject(rangeBounds)
        }
    }
  }

  @throws(classOf[IOException])
  private def readObject(in: ObjectInputStream): Unit = Utils.tryOrIOException {
    val sfactory = SparkEnv.get.serializer
    sfactory match {
      case js: JavaSerializer => in.defaultReadObject()
      case _ =>
        ascending = in.readBoolean()
        ordering = in.readObject().asInstanceOf[Ordering[K]]
        binarySearch = in.readObject().asInstanceOf[(Array[K], K) => Int]

        val ser = sfactory.newInstance()
        Utils.deserializeViaNestedStream(in, ser) { ds =>
          implicit val classTag = ds.readObject[ClassTag[Array[K]]]()
          rangeBounds = ds.readObject[Array[K]]()
        }
    }
  }
}
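A short usage sketch of my own (not from the original article), assuming an existing SparkContext sc; this is also the partitioner that sortByKey relies on:

import org.apache.spark.RangePartitioner

val kv = sc.parallelize(Seq((15, "o"), (3, "c"), (42, "x"), (8, "h"), (27, "q")))
val ranged = kv.partitionBy(new RangePartitioner(3, kv))  // bounds come from sampling kv's keys
// Partitions now hold contiguous key ranges and are ordered relative to each
// other, though elements inside a single partition are not sorted.
ranged.mapPartitionsWithIndex { (idx, iter) => iter.map(pair => s"p$idx: $pair") }
  .collect()
  .foreach(println)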