Spark Operators [08]: combineByKey Explained
combineByKey
Aggregating data is straightforward when it lives in one place; doing it over a distributed dataset takes more care. This post introduces combineByKey, the primitive that the other key-based aggregation operators are built on, so it is well worth understanding. See the Spark API docs for reference.
Better still, load the Spark source into IntelliJ IDEA and read the implementation directly.
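To see why combineByKey is the "ancestor" operator: reduceByKey(f), for example, is just combineByKey with createCombiner as the identity and both merge functions equal to f. Below is a minimal sketch of that equivalence on plain Scala collections (no Spark needed); the helper name reduceByKey here is our own illustration, not the Spark API:

```scala
object ReduceViaCombine {
  // reduceByKey(f) expressed in combineByKey terms:
  // createCombiner = identity, mergeValue = f, mergeCombiners = f.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.get(k) match {
        case None    => acc + (k -> v)        // createCombiner: V => V (identity)
        case Some(c) => acc + (k -> f(c, v))  // mergeValue / mergeCombiners: f
      }
    }

  def main(args: Array[String]): Unit = {
    // Sums per key: Map(a -> 3, b -> 3)
    println(reduceByKey(Seq(("a", 1), ("a", 2), ("b", 3)))(_ + _))
  }
}
```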
Source
```scala
/**
 * @see [[combineByKeyWithClassTag]]
 *
 * The actual implementation lives in combineByKeyWithClassTag
 */
def combineByKey[C](
    createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)] = self.withScope {
  combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,
    partitioner, mapSideCombine, serializer)(null)
}
```
The three public overloads:

```scala
def combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C): RDD[(K, C)]

def combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, numPartitions: Int): RDD[(K, C)]

def combineByKey[C](createCombiner: (V) => C, mergeValue: (C, V) => C, mergeCombiners: (C, C) => C, partitioner: Partitioner, mapSideCombine: Boolean = true, serializer: Serializer = null): RDD[(K, C)]
```
This function transforms an RDD[K, V] into an RDD[K, C]; the types V and C may be the same or different.

Parameters:
- createCombiner: the combiner-creation function. Called for the first value of each key within a partition, it turns that V into the initial accumulator C.
- mergeValue: the value-merging function. Called for each subsequent value of a key within a partition, it folds a V into an existing C, producing a C.
- mergeCombiners: the combiner-merging function. It merges two partial C values (from different partitions) into one C.
- numPartitions: the number of partitions in the result RDD; if not specified, the existing partitioning is kept.
- partitioner: the partitioner to use; defaults to HashPartitioner.
- mapSideCombine: whether to combine on the map side, analogous to a combiner in MapReduce; defaults to true.
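To make the division of labor between the three functions concrete, here is a minimal sketch that simulates combineByKey's semantics on plain Scala collections, with partitions modeled as nested sequences. The helper simulateCombineByKey is our own illustration, not part of the Spark API:

```scala
object CombineByKeyDemo {
  // Simulate combineByKey over pre-split "partitions" of (K, V) pairs.
  def simulateCombineByKey[K, V, C](
      partitions: Seq[Seq[(K, V)]],
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C): Map[K, C] = {
    // Within each partition: the first value of a key goes through
    // createCombiner; every later value goes through mergeValue.
    val perPartition: Seq[Map[K, C]] = partitions.map { part =>
      part.foldLeft(Map.empty[K, C]) { case (acc, (k, v)) =>
        acc.get(k) match {
          case None    => acc + (k -> createCombiner(v))
          case Some(c) => acc + (k -> mergeValue(c, v))
        }
      }
    }
    // Across partitions: partial combiners are merged with mergeCombiners.
    perPartition.flatten.groupBy(_._1).map { case (k, kcs) =>
      k -> kcs.map(_._2).reduce(mergeCombiners)
    }
  }

  def main(args: Array[String]): Unit = {
    val partitions = Seq(
      Seq(("A", 98f), ("B", 75f)),  // partition 0
      Seq(("A", 88f), ("B", 78f))   // partition 1
    )
    val sums = simulateCombineByKey[String, Float, (Float, Int)](
      partitions,
      v => (v, 1),                                    // createCombiner
      (c, v) => (c._1 + v, c._2 + 1),                 // mergeValue
      (c1, c2) => (c1._1 + c2._1, c1._2 + c2._2)      // mergeCombiners
    )
    println(sums.map { case (k, (s, n)) => (k, s / n) })  // averages per key
  }
}
```

The same (sum, count) accumulator pattern drives the Spark examples below.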
Examples
Scala example
Let's work through an example that computes each student's average score.
The case class ScoreDetail stores a student's name, a subject, and the score for that subject.
```scala
// 1. Case class ScoreDetail: holds a student's name, subject, and score
case class ScoreDetail(studentName: String, subject: String, score: Float)

/** Compute each student's average score */
def avgScore(): Unit = {
  val conf = new SparkConf().setAppName("avgScore").setMaster("local[2]")
  val sc = new SparkContext(conf)

  // 2.1 Build the list of score records
  val scoreDetail = List(
    ScoreDetail("A", "Math", 98), ScoreDetail("A", "English", 88),
    ScoreDetail("B", "Math", 75), ScoreDetail("B", "English", 78),
    ScoreDetail("C", "Math", 90), ScoreDetail("C", "English", 80),
    ScoreDetail("D", "Math", 91), ScoreDetail("D", "English", 80)
  )

  // 2.2 Key each record by student name: (name, ScoreDetail)
  val studentDetail = for { x <- scoreDetail } yield (x.studentName, x)

  // 2.3 Parallelize, hash-partition into 3 partitions, and cache
  val studentDetailRdd =
    sc.parallelize(studentDetail).partitionBy(new HashPartitioner(3)).cache()

  val avgscoreRdd = studentDetailRdd.combineByKey(
    // 1. createCombiner: V (a ScoreDetail) => C, the initial (score, 1)
    (x: ScoreDetail) => (x.score, 1),
    // 2. mergeValue: (C, V) => C, i.e. ((sum, count), detail) => (sum + score, count + 1)
    (acc: (Float, Int), x: ScoreDetail) => (acc._1 + x.score, acc._2 + 1),
    // 3. mergeCombiners: (C, C) => C, merging partial results across partitions
    (acc1: (Float, Int), acc2: (Float, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)
  ).map(x => (x._1, x._2._1 / x._2._2)) // (name, (sum, count)) => (name, average)

  avgscoreRdd.foreach(println)
}
```
Output:
(C,85.0)
(B,76.5)
(A,93.0)
(D,85.5)
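As a sanity check, the same averages can be reproduced with plain Scala collections, independent of Spark:

```scala
object AvgScoreCheck {
  def main(args: Array[String]): Unit = {
    val scores = Seq(
      ("A", 98f), ("A", 88f), ("B", 75f), ("B", 78f),
      ("C", 90f), ("C", 80f), ("D", 91f), ("D", 80f)
    )
    // Group by student, then average each group's scores
    val avgs = scores.groupBy(_._1).map { case (name, xs) =>
      name -> xs.map(_._2).sum / xs.size
    }
    // Matches the Spark output: A -> 93.0, B -> 76.5, C -> 85.0, D -> 85.5
    avgs.toSeq.sortBy(_._1).foreach(println)
  }
}
```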
Java example
1. The ScoreDetail003 class holds a student's name, subject, and score.
```java
public class ScoreDetail003 implements Serializable {
    String name;
    String subject;
    int score;

    public ScoreDetail003(String name, String subject, int score) {
        this.name = name;
        this.subject = subject;
        this.score = score;
    }
}
```
2. Compute the averages.
```java
public static void avgScore() {
    SparkConf conf = new SparkConf().setAppName("avgScore").setMaster("local");
    JavaSparkContext sc = new JavaSparkContext(conf);

    ArrayList<ScoreDetail003> scoreDetail = new ArrayList<ScoreDetail003>();
    scoreDetail.add(new ScoreDetail003("A", "Math", 98));
    scoreDetail.add(new ScoreDetail003("A", "English", 88));
    scoreDetail.add(new ScoreDetail003("B", "Math", 75));
    scoreDetail.add(new ScoreDetail003("B", "English", 78));
    scoreDetail.add(new ScoreDetail003("C", "Math", 90));
    scoreDetail.add(new ScoreDetail003("C", "English", 80));
    scoreDetail.add(new ScoreDetail003("D", "Math", 91));
    scoreDetail.add(new ScoreDetail003("D", "English", 80));

    JavaRDD<ScoreDetail003> scoreDetailRdd = sc.parallelize(scoreDetail);
    JavaPairRDD<String, ScoreDetail003> pairRDD = scoreDetailRdd.mapToPair(
            detail -> new Tuple2<String, ScoreDetail003>(detail.name, detail));

    // 1. createCombiner: turns a V (a ScoreDetail003) into the initial C, i.e. (score, 1)
    Function<ScoreDetail003, Tuple2<Float, Integer>> createCombiner =
            new Function<ScoreDetail003, Tuple2<Float, Integer>>() {
        @Override
        public Tuple2<Float, Integer> call(ScoreDetail003 v1) throws Exception {
            return new Tuple2<Float, Integer>((float) v1.score, 1);
        }
    };

    // 2. mergeValue: folds a V into a C, i.e. ((sum, count), detail) -> (sum + score, count + 1)
    Function2<Tuple2<Float, Integer>, ScoreDetail003, Tuple2<Float, Integer>> mergeValue =
            new Function2<Tuple2<Float, Integer>, ScoreDetail003, Tuple2<Float, Integer>>() {
        @Override
        public Tuple2<Float, Integer> call(Tuple2<Float, Integer> v1, ScoreDetail003 v2) throws Exception {
            return new Tuple2<Float, Integer>(v1._1() + v2.score, v1._2() + 1);
        }
    };

    // 3. mergeCombiners: merges two partial C values from different partitions
    Function2<Tuple2<Float, Integer>, Tuple2<Float, Integer>, Tuple2<Float, Integer>> mergeCombiners =
            new Function2<Tuple2<Float, Integer>, Tuple2<Float, Integer>, Tuple2<Float, Integer>>() {
        @Override
        public Tuple2<Float, Integer> call(Tuple2<Float, Integer> v1, Tuple2<Float, Integer> v2) throws Exception {
            return new Tuple2<Float, Integer>(v1._1() + v2._1(), v1._2() + v2._2());
        }
    };

    // 4. combineByKey, then map (name, (sum, count)) to (name, average)
    JavaPairRDD<String, Float> res = pairRDD
            .combineByKey(createCombiner, mergeValue, mergeCombiners, 2)
            .mapToPair(x -> new Tuple2<String, Float>(x._1(), x._2()._1() / x._2()._2()));

    // 5. Print the results
    res.foreach(x -> System.out.println(x));
    sc.close();
}
```