A Simple Example of join and cogroup in Spark

Source: Internet  Published: Multimedia Data Mining  Editor: 程序博客网  Time: 2024/06/05 04:29

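1.join

join combines two collections by key: for every key that appears in both collections, each value from one is paired with each value from the other.

Tuple collection A: (1,"Spark"),(2,"Tachyon"),(3,"Hadoop")
Tuple collection B: (1,100),(2,95),(3,65)
Result of A join B: (1,("Spark",100)),(3,("Hadoop",65)),(2,("Tachyon",95))   (the order of the result is not guaranteed)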
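The pairing rule can be sketched without a cluster. The block below is a minimal plain-Java illustration of inner-join semantics, not Spark's implementation; the `JoinSketch` class and its `Pair` helper are hypothetical names introduced only for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JoinSketch {
    // Hypothetical stand-in for a (key, value) tuple; not a Spark type.
    static final class Pair<V> {
        final int key;
        final V value;

        Pair(int key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    // Inner join: emit one row for every pair of entries whose keys match.
    // Keys that appear in only one collection produce no row.
    static List<String> join(List<Pair<String>> a, List<Pair<Integer>> b) {
        List<String> out = new ArrayList<>();
        for (Pair<String> x : a) {
            for (Pair<Integer> y : b) {
                if (x.key == y.key) {
                    out.add("(" + x.key + ",(" + x.value + "," + y.value + "))");
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Pair<String>> a = Arrays.asList(
                new Pair<>(1, "Spark"), new Pair<>(2, "Tachyon"), new Pair<>(3, "Hadoop"));
        List<Pair<Integer>> b = Arrays.asList(
                new Pair<>(1, 100), new Pair<>(2, 95), new Pair<>(3, 65));
        for (String row : join(a, b)) {
            System.out.println(row);
        }
    }
}
```

This sketch emits rows in the order of collection A, whereas Spark makes no ordering promise. In Spark's Java API the corresponding operation on two pair RDDs is `JavaPairRDD.join`.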

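2.cogroup

cogroup works as follows: given two collections of tuples, A and B, it first groups the values in A that share a key, then groups the values in B that share a key, and finally performs a "join" on the two grouped collections. Unlike join, a key that appears in only one collection still shows up in the result, paired with an empty group for the other side.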

Sample code:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class CoGroup {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("spark WordCount!").setMaster("local");
        JavaSparkContext sContext = new JavaSparkContext(conf);

        // (id, name) pairs: key 2 appears twice, key 4 has no matching score
        List<Tuple2<Integer, String>> namesList = Arrays.asList(
                new Tuple2<Integer, String>(1, "Spark"),
                new Tuple2<Integer, String>(3, "Tachyon"),
                new Tuple2<Integer, String>(4, "Sqoop"),
                new Tuple2<Integer, String>(2, "Hadoop"),
                new Tuple2<Integer, String>(2, "Hadoop2"));

        // (id, score) pairs: keys 2 and 3 each appear twice
        List<Tuple2<Integer, Integer>> scoresList = Arrays.asList(
                new Tuple2<Integer, Integer>(1, 100),
                new Tuple2<Integer, Integer>(3, 70),
                new Tuple2<Integer, Integer>(3, 77),
                new Tuple2<Integer, Integer>(2, 90),
                new Tuple2<Integer, Integer>(2, 80));

        JavaPairRDD<Integer, String> names = sContext.parallelizePairs(namesList);
        JavaPairRDD<Integer, Integer> scores = sContext.parallelizePairs(scoresList);

        /**
         * <Integer> JavaPairRDD<Integer, Tuple2<Iterable<String>, Iterable<Integer>>>
         * org.apache.spark.api.java.JavaPairRDD.cogroup(JavaPairRDD<Integer, Integer> other)
         */
        JavaPairRDD<Integer, Tuple2<Iterable<String>, Iterable<Integer>>> nameScores = names.cogroup(scores);

        nameScores.foreach(new VoidFunction<Tuple2<Integer, Tuple2<Iterable<String>, Iterable<Integer>>>>() {
            private static final long serialVersionUID = 1L;

            // Running row counter held by this function object
            int i = 1;

            @Override
            public void call(Tuple2<Integer, Tuple2<Iterable<String>, Iterable<Integer>>> t) throws Exception {
                // t._1 is the key, t._2._1 the grouped names, t._2._2 the grouped scores
                String string = "ID:" + t._1 + " , " + "Name:" + t._2._1 + " , " + "Score:" + t._2._2;
                string += "     count:" + i;
                System.out.println(string);
                i++;
            }
        });

        sContext.close();
    }
}
Sample result:

ID:4 , Name:[Sqoop] , Score:[]     count:1
ID:1 , Name:[Spark] , Score:[100]     count:2
ID:3 , Name:[Tachyon] , Score:[70, 77]     count:3
ID:2 , Name:[Hadoop, Hadoop2] , Score:[90, 80]     count:4
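To make the grouping step explicit, the same semantics can be sketched in plain Java without Spark. This is only an illustration under stated assumptions, not how Spark implements cogroup: `CoGroupSketch`, `Pair`, and `Groups` are hypothetical names, and a `TreeMap` is used purely so that the printed order is deterministic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CoGroupSketch {
    // Hypothetical stand-in for a (key, value) tuple; not a Spark type.
    static final class Pair<V> {
        final int key;
        final V value;

        Pair(int key, V value) {
            this.key = key;
            this.value = value;
        }
    }

    // Hypothetical result row: all names and all scores that share one key.
    static final class Groups {
        final List<String> names = new ArrayList<>();
        final List<Integer> scores = new ArrayList<>();
    }

    // Group both inputs by key. A key present on only one side keeps
    // an empty list for the other side, as with key 4 in the sample data.
    static Map<Integer, Groups> cogroup(List<Pair<String>> names, List<Pair<Integer>> scores) {
        Map<Integer, Groups> out = new TreeMap<>();
        for (Pair<String> p : names) {
            out.computeIfAbsent(p.key, k -> new Groups()).names.add(p.value);
        }
        for (Pair<Integer> p : scores) {
            out.computeIfAbsent(p.key, k -> new Groups()).scores.add(p.value);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Pair<String>> names = Arrays.asList(
                new Pair<>(1, "Spark"), new Pair<>(3, "Tachyon"), new Pair<>(4, "Sqoop"),
                new Pair<>(2, "Hadoop"), new Pair<>(2, "Hadoop2"));
        List<Pair<Integer>> scores = Arrays.asList(
                new Pair<>(1, 100), new Pair<>(3, 70), new Pair<>(3, 77),
                new Pair<>(2, 90), new Pair<>(2, 80));
        for (Map.Entry<Integer, Groups> e : cogroup(names, scores).entrySet()) {
            System.out.println("ID:" + e.getKey() + " , Name:" + e.getValue().names
                    + " , Score:" + e.getValue().scores);
        }
    }
}
```

Run on the sample data, the sketch produces the same four groups as the Spark job, sorted by key rather than in Spark's arbitrary order, and without the count column.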
