Spark operators [07]: reduce, reduceByKey, count, countByKey

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 06:32

The operators reduce, reduceByKey, count, and countByKey fall into two categories:

Actions: reduce, count, countByKey
Transformations: reduceByKey


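1. reduce

reduce(func) is an operation on a JavaRDD (an RDD in Scala).
It aggregates the elements of the RDD using the function func, which takes two arguments and returns one. The function should be commutative and associative so that it can be computed correctly in parallel. The string concatenation used in the examples below is associative but not commutative, so the order of the pieces in the result can vary from run to run.

Scala version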

scala> val rdd1 = sc.parallelize(List("a","b","b","c"))
scala> val res = rdd1.reduce(_+"-"+_)
res: String = b-c-a-b

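Java version

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function2;

// sc is assumed to be an existing JavaSparkContext
JavaRDD<String> rdd1 = sc.parallelize(Arrays.asList("a", "b", "b", "c"));
// Join the elements pairwise with "-"
String res = rdd1.reduce(new Function2<String, String, String>() {
    @Override
    public String call(String v1, String v2) throws Exception {
        return v1 + "-" + v2;
    }
});
System.out.println(res);
// b-c-a-b (the order may differ between runs)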
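
Why func should be commutative and associative can be illustrated without Spark. The following is only a plain-Java sketch: the class PartitionedReduceSketch and its reduce helper are hypothetical names, and the nested lists merely stand in for partitions. It reduces each "partition" on its own and then merges the partial results in the order given, which is roughly what happens when partial results arrive in different orders.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BinaryOperator;

// Plain-Java sketch (no Spark): each inner list stands in for one partition.
class PartitionedReduceSketch {

    // Reduce every non-empty partition locally, then merge the partial
    // results in the order the partitions were given.
    // Assumes at least one partition is non-empty.
    static <T> T reduce(List<List<T>> partitions, BinaryOperator<T> func) {
        List<T> partials = new ArrayList<>();
        for (List<T> partition : partitions) {
            if (partition.isEmpty()) {
                continue;
            }
            T acc = partition.get(0);
            for (T item : partition.subList(1, partition.size())) {
                acc = func.apply(acc, item);
            }
            partials.add(acc);
        }
        T result = partials.get(0);
        for (T partial : partials.subList(1, partials.size())) {
            result = func.apply(result, partial);
        }
        return result;
    }

    public static void main(String[] args) {
        BinaryOperator<String> join = (x, y) -> x + "-" + y;

        // Same elements, partitions merged in two different orders.
        List<List<String>> order1 = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("b", "c"));
        List<List<String>> order2 = Arrays.asList(
                Arrays.asList("b", "c"), Arrays.asList("a", "b"));

        System.out.println(reduce(order1, join)); // a-b-b-c
        System.out.println(reduce(order2, join)); // b-c-a-b

        // Addition is commutative and associative, so the order does not matter.
        List<List<Integer>> nums1 = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3, 4));
        List<List<Integer>> nums2 = Arrays.asList(
                Arrays.asList(3, 4), Arrays.asList(1, 2));
        System.out.println(reduce(nums1, Integer::sum)); // 10
        System.out.println(reduce(nums2, Integer::sum)); // 10
    }
}
```

With the non-commutative join function the two merge orders give different strings, while the sum is 10 either way. This is why a result such as b-c-a-b is possible for the input a, b, b, c.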

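2. reduceByKey

reduceByKey(func, [numTasks]) is an operation on a JavaPairRDD.
For an RDD of (K, V) pairs, it aggregates the values of each key K using the given reduce function func and returns an RDD of (K, V) pairs. The optional second argument sets the number of reduce tasks.

Scala version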

val scoreList = Array(Tuple2("class1", 90), Tuple2("class1", 60), Tuple2("class2", 60), Tuple2("class2", 50))
val scoreRdd = sc.parallelize(scoreList)
// Sum the scores for each class
val resRdd = scoreRdd.reduceByKey(_ + _)
resRdd.foreach(res => println(res._1 + ":" + res._2))
// -----------------
// class1:150
// class2:110

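Java version

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

List<Tuple2<String, Integer>> scoreList = Arrays.asList(
        new Tuple2<String, Integer>("class1", 90),
        new Tuple2<String, Integer>("class2", 60),
        new Tuple2<String, Integer>("class1", 60),
        new Tuple2<String, Integer>("class2", 50));
// Parallelize the collection into a JavaPairRDD; note the use of parallelizePairs
// (sc is assumed to be an existing JavaSparkContext)
JavaPairRDD<String, Integer> scoreRdd = sc.parallelizePairs(scoreList);
// Sum the scores for each key
JavaPairRDD<String, Integer> resRdd = scoreRdd.reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer v1, Integer v2) throws Exception {
        return v1 + v2;
    }
});
// Print the result
resRdd.foreach(new VoidFunction<Tuple2<String, Integer>>() {
    public void call(Tuple2<String, Integer> tuple2) throws Exception {
        System.out.println(tuple2._1() + ":" + tuple2._2());
    }
});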
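
The per-key aggregation that reduceByKey performs can be pictured with ordinary Java collections. This is a minimal sketch of the semantics only, not of how Spark executes it: ReduceByKeySketch and its reduceByKey helper are hypothetical names, Map.Entry stands in for Tuple2, and everything runs in one thread with no partitions or shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Plain-Java sketch (no Spark) of "combine all values that share a key".
class ReduceByKeySketch {

    static <K, V> Map<K, V> reduceByKey(List<Map.Entry<K, V>> pairs, BinaryOperator<V> func) {
        // LinkedHashMap keeps keys in first-seen order so the printout is stable.
        Map<K, V> result = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            // First value for a key is stored as-is; later values are
            // combined with the running result via func.
            result.merge(pair.getKey(), pair.getValue(), func);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> scores = Arrays.asList(
                new SimpleEntry<String, Integer>("class1", 90),
                new SimpleEntry<String, Integer>("class2", 60),
                new SimpleEntry<String, Integer>("class1", 60),
                new SimpleEntry<String, Integer>("class2", 50));

        System.out.println(reduceByKey(scores, Integer::sum)); // {class1=150, class2=110}
    }
}
```

The totals match the Spark examples above: 90 + 60 = 150 for class1 and 60 + 50 = 110 for class2.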

3. count

count() returns the number of elements in the RDD.

Scala version

scala> val rdd1 = sc.parallelize(List("a","b","b","c"))
scala> val res = rdd1.count
res: Long = 4

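Java version

import java.util.Arrays;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

List<Tuple2<String, Integer>> scoreList = Arrays.asList(
        new Tuple2<String, Integer>("class1", 90),
        new Tuple2<String, Integer>("class2", 60),
        new Tuple2<String, Integer>("class1", 60),
        new Tuple2<String, Integer>("class2", 50));
// Parallelize the collection into a JavaPairRDD; note the use of parallelizePairs
// (sc is assumed to be an existing JavaSparkContext)
JavaPairRDD<String, Integer> scoreRdd = sc.parallelizePairs(scoreList);
Long count = scoreRdd.count();
// 4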

4. countByKey

countByKey() is available only on RDDs of (K, V) pairs. It returns a Map of (K, Long) holding the number of elements for each key.

Scala version

scala> val scoreList = Array(Tuple2("class1", 90), Tuple2("class1", 60), Tuple2("class2", 60), Tuple2("class2", 50))
scala> val scoreRdd = sc.parallelize(scoreList)
scala> val res = scoreRdd.countByKey
res: scala.collection.Map[String,Long] = Map(class2 -> 2, class1 -> 2)

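Java version

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

List<Tuple2<String, Integer>> scoreList = Arrays.asList(
        new Tuple2<String, Integer>("class1", 90),
        new Tuple2<String, Integer>("class2", 60),
        new Tuple2<String, Integer>("class1", 60),
        new Tuple2<String, Integer>("class2", 50));
// sc is assumed to be an existing JavaSparkContext
JavaPairRDD<String, Integer> scoreRdd = sc.parallelizePairs(scoreList);
// Count how many elements each key has; the result is returned to the driver
Map<String, Long> res = scoreRdd.countByKey();
System.out.println(res);
// {class1=2, class2=2}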
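
Conceptually, countByKey ignores the values and counts one per element for each key. The plain-Java sketch below illustrates only that idea: CountByKeySketch and its countByKey helper are hypothetical names, and Map.Entry again stands in for Tuple2.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (no Spark): count the elements that share each key.
class CountByKeySketch {

    static <K, V> Map<K, Long> countByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, Long> counts = new LinkedHashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            // The value is ignored; every element contributes 1 to its key.
            counts.merge(pair.getKey(), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> scores = Arrays.asList(
                new SimpleEntry<String, Integer>("class1", 90),
                new SimpleEntry<String, Integer>("class2", 60),
                new SimpleEntry<String, Integer>("class1", 60),
                new SimpleEntry<String, Integer>("class2", 50));

        System.out.println(countByKey(scores)); // {class1=2, class2=2}
    }
}
```

Unlike reduceByKey, which yields another distributed RDD, countByKey is an action that hands an ordinary Map back to the driver, so it is best suited to data with a modest number of distinct keys.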