Spark-ML-01: A First Try at Analyzing Offline Product Data with Spark

Source: Internet. Published: 罗马城市 知乎. Editor: 程序博客网. Time: 2024/05/01 09:31

Task

A dataset of online product purchase records, about 40 MB, in the following format:

    Jack,iphone cover,9,99
    Jack,iphone cover,9,99
    Jack,iphone cover,9,99
    Jack,iphone cover,9,99

Compute the following statistics:
1. Total number of purchases
2. Number of unique customers
3. Total revenue
4. Best-selling product
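To make the four statistics concrete before moving to Spark, here is a minimal plain-Java sketch that computes them on a tiny in-memory list. It is only an illustration under stated assumptions: the class name `PurchaseStatsSketch` and the sample rows are made up, and the third comma-separated field is assumed to be the amount, since that is the field the Spark code in this post parses as a number.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class PurchaseStatsSketch {

    // 1. Total number of purchases: one record per purchase.
    static long numPurchases(List<String> lines) {
        return lines.size();
    }

    // 2. Number of unique customers: field 0 is the user name.
    static long uniqueUsers(List<String> lines) {
        return lines.stream().map(l -> l.split(",")[0]).distinct().count();
    }

    // 3. Total revenue: field 2 is ASSUMED to be the amount.
    static double totalRevenue(List<String> lines) {
        return lines.stream()
                .mapToDouble(l -> Double.parseDouble(l.split(",")[2]))
                .sum();
    }

    // Purchase count per product: field 1 is the product name.
    static Map<String, Long> purchasesPerProduct(List<String> lines) {
        return lines.stream()
                .collect(Collectors.groupingBy(l -> l.split(",")[1], Collectors.counting()));
    }

    // 4. Best-selling product: the product with the largest purchase count.
    static String mostPopular(List<String> lines) {
        return purchasesPerProduct(lines).entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElseThrow(() -> new IllegalArgumentException("no records"));
    }

    public static void main(String[] args) {
        // Hypothetical sample rows, not taken from the real dataset.
        List<String> sample = new ArrayList<>();
        sample.add("Jack,iphone cover,9,99");
        sample.add("Jack,headphones,5,49");
        sample.add("Bob,iphone cover,9,99");

        System.out.println("Total purchases: " + numPurchases(sample));
        System.out.println("Unique users: " + uniqueUsers(sample));
        System.out.println("Total revenue: " + totalRevenue(sample));
        System.out.println("Most popular product: " + mostPopular(sample));
    }
}
```

Each helper corresponds to one RDD operation in the Spark version: `count`, `distinct().count()`, `mapToDouble(...).sum()`, and a per-key count followed by picking the largest value.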

Code

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    /**
     * @author jinhang
     */
    public class JavaApp {
        public static void main(String[] args) {
            SparkConf sparkConf = new SparkConf().setAppName("ShopInfoAnalysis").setMaster("local[*]");
            JavaSparkContext sc = new JavaSparkContext(sparkConf);
            // Split each line into fields: s[0] = user, s[1] = product, s[2] = amount
            JavaRDD<String[]> data = sc.textFile("data/UserPurchaseHistory.csv").map(s -> s.split(","));

            // Statistics
            long numPurchases = data.count();
            long uniqueUsers = data.map(s -> s[0]).distinct().count();
            double totalRevenue = data.mapToDouble(s -> Double.parseDouble(s[2])).sum();

            // Count purchases per product
            JavaPairRDD<String, Integer> product = data.mapToPair(s -> new Tuple2<>(s[1], 1));
            // Copy the collected result into a mutable list so it can be sorted on the driver
            List<Tuple2<String, Integer>> pairs =
                    new ArrayList<>(product.reduceByKey((x, y) -> x + y).collect());
            // Sort by purchase count, descending, so the best seller comes first
            pairs.sort((a, b) -> Integer.compare(b._2(), a._2()));
            System.out.println(pairs);
            String mostPopular = pairs.get(0)._1();
            int purchases = pairs.get(0)._2();

            System.out.println("Total purchases: " + numPurchases);
            System.out.println("Unique users: " + uniqueUsers);
            System.out.println("Total revenue: " + totalRevenue);
            System.out.println(String.format("Most popular product: %s with %d purchases",
                    mostPopular, purchases));
            sc.stop();
        }
    }
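A note on ordering: `sortByKey` on a pair RDD orders the pairs by key, which here is the product name, so the best seller has to be chosen by the count instead. The plain-Java sketch below illustrates the difference on a small list of (product, count) pairs. The class name `TopProductSketch` and the counts are hypothetical, and `Map.Entry` stands in for Spark's `Tuple2` so that the example runs without Spark.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

class TopProductSketch {

    // The pair with the highest count: this is the best seller.
    static Map.Entry<String, Integer> topByCount(List<Map.Entry<String, Integer>> pairs) {
        return Collections.max(pairs, Map.Entry.comparingByValue());
    }

    // The last pair after sorting by product name: what a key-ordered
    // sort followed by "take the last element" would return.
    static Map.Entry<String, Integer> lastByKey(List<Map.Entry<String, Integer>> pairs) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(pairs);
        sorted.sort(Map.Entry.comparingByKey());
        return sorted.get(sorted.size() - 1);
    }

    public static void main(String[] args) {
        // Hypothetical per-product purchase counts.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new SimpleEntry<>("iphone cover", 3));
        pairs.add(new SimpleEntry<>("headphones", 1));
        pairs.add(new SimpleEntry<>("tablet", 2));

        System.out.println("Top by count: " + topByCount(pairs));
        System.out.println("Last by key:  " + lastByKey(pairs));
    }
}
```

With these counts, the count-based choice is `iphone cover` (3 purchases), while the name-ordered list ends with `tablet`, which is not the best seller.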

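A handful of simple RDD transformations and actions is enough to handle this kind of big-data task, and a Java implementation is convenient to combine with existing Hadoop code.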
