Learning Spark 5: Spark Basics Summary

Source: Internet. Editor: 程序博客网. Time: 2024/06/06 10:57

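This post continues the previous one in the series on learning Spark.

This time we combine the Spark basics covered so far to solve a practical problem.

Problem description

Suppose we have a large amount of data like the following. The first field is the id, the second is the type (only 01 or 02), and the third is the month (only July or August, i.e. 2015-07 or 2015-08). In the actual file the fields are tab-separated, since the code in this post splits each line on tabs.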

1 01 2015-07
2 01 2015-07
2 01 2015-07
2 02 2015-08
2 02 2015-08
3 02 2015-08
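
As a minimal sketch of how one such line could be parsed, assuming tab-separated fields (as the `split("\t")` in this post's code suggests): the `Record` case class and the `parseLine` helper are hypothetical names, not part of the original code. Returning an `Option` lets malformed lines be skipped instead of failing with an index error.

```scala
// Hypothetical record type for one input line (not from the original post).
final case class Record(id: String, tpe: String, month: String)

// Parse one tab-separated line; None if it does not have exactly three fields.
def parseLine(line: String): Option[Record] =
  line.split("\t", -1) match {
    case Array(id, tpe, month) => Some(Record(id, tpe, month))
    case _                     => None
  }

val lines = Seq("2\t01\t2015-07", "3\t02\t2015-08", "malformed line")

// Keep only the lines that parsed; the malformed one is dropped.
val records = lines.flatMap(l => parseLine(l).toList)
records.foreach(println)
```

This can be pasted into the Scala REPL or run as a script. In a Spark job the same helper could be used as `trans.flatMap(l => parseLine(l).toList)`, again only as an illustration.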

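There is also a second dataset that contains only the ids, one id per line.

The metrics we need to compute are as follows (they must be computed for every id, and an id with no data gets 0):

1. The number of type-01 records and the number of type-02 records in August

2. The month-over-month change of the August total relative to the July total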
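
To make the two metrics concrete, here is a small worked example in plain Scala collections (no Spark) on the six sample rows above. The id list is an assumption, since the second dataset is not shown; id "4" is added only to illustrate the "no data means 0" rule. The month-over-month change is taken as (August − July) / July, with 0 when July has no rows, which is the convention the Spark code in this post uses.

```scala
// Sample rows from the problem statement: (id, type, month).
val rows = Seq(
  ("1", "01", "2015-07"),
  ("2", "01", "2015-07"),
  ("2", "01", "2015-07"),
  ("2", "02", "2015-08"),
  ("2", "02", "2015-08"),
  ("3", "02", "2015-08")
)

// Assumed id list; id "4" has no rows and should come out as all zeros.
val ids = Seq("1", "2", "3", "4")

// Count the rows for one id that satisfy a predicate (0 when there are none).
def count(id: String)(p: ((String, String, String)) => Boolean): Int =
  rows.count(r => r._1 == id && p(r))

// Per id: (id, August type-01 count, August type-02 count, month-over-month change).
val result = ids.map { id =>
  val aug01 = count(id)(r => r._2 == "01" && r._3 == "2015-08")
  val aug02 = count(id)(r => r._2 == "02" && r._3 == "2015-08")
  val jul   = count(id)(_._3 == "2015-07")
  val aug   = aug01 + aug02 // the type is only ever 01 or 02
  // 0 when July has no rows, otherwise (August - July) / July.
  val mom   = if (jul == 0) 0.0 else (aug - jul).toDouble / jul
  (id, aug01, aug02, mom)
}

result.foreach(println)
```

For the sample rows this prints `(1,0,0,-1.0)`, `(2,0,2,0.0)`, `(3,0,1,0.0)` and `(4,0,0,0.0)`.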

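Solution

Given data like this, the first instinct is to use Hive, which makes the job simple.

Metric 1 (roughly): select i.id, t.type, coalesce(t.cnt, 0) as cnt from id i left outer join (select b.id, b.type, count(*) as cnt from bow b where b.data='2015-08' group by b.id, b.type) t on i.id = t.id

Metric 2 is similar. Hive can handle both, but in this post we implement them with Spark.

Code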

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._ // pair-RDD implicits; only needed before Spark 1.3

    val conf = new SparkConf()
      .setAppName("hebei")
      .set("spark.executor.memory", "4g")
    val sc = new SparkContext(conf)

    // Transaction records, one per line: id \t type \t month
    val trans = sc.textFile("/xxx/hebei/trans/hebei_trans")
    // Full id list, keyed by id with a placeholder value of 0
    val pid = sc.textFile("/xxx/hebei/pid/pid").map((_, 0))

    // August records as (id, type, month)
    val trans8 = trans.map { tran =>
      val field = tran.split("\t")
      (field(0), field(1), field(2))
    }.filter(_._3.equals("2015-08"))

    // July records as (id, type, month)
    val trans7 = trans.map { tran =>
      val field = tran.split("\t")
      (field(0), field(1), field(2))
    }.filter(_._3.equals("2015-07"))

    // July total per id; ids without July records get 0
    val rs107 = pid.leftOuterJoin(
      trans7.map(_._1)
        .map((_, 1))
        .reduceByKey(_ + _)
    ).map { line =>
      if (line._2._2 == None) (line._1, 0)
      else (line._1, line._2._1 + line._2._2.get)
    }

    // August total per id; ids without August records get 0
    val rs108 = pid.leftOuterJoin(
      trans8.map(_._1)
        .map((_, 1))
        .reduceByKey(_ + _)
    ).map { line =>
      if (line._2._2 == None) (line._1, 0)
      else (line._1, line._2._1 + line._2._2.get)
    }

    // Metric 2: month-over-month change, (August - July) / July; 0 when July is 0
    val rs1 = rs107.join(rs108).map { line =>
      if (line._2._1 == 0) (line._1, 0.toDouble)
      else (line._1, ((line._2._2.toDouble - line._2._1.toDouble) / line._2._1.toDouble))
    }

    // Metric 1a: number of type-01 records in August per id
    val rs2 = pid.leftOuterJoin(
      trans8.filter(_._2.equals("01"))
        .map(_._1).map((_, 1))
        .reduceByKey(_ + _)
    ).map { line =>
      if (line._2._2 == None) (line._1, 0)
      else (line._1, line._2._1 + line._2._2.get)
    }

    // Metric 1b: number of type-02 records in August per id
    val rs3 = pid.leftOuterJoin(
      trans8.filter(_._2.equals("02"))
        .map(_._1).map((_, 1))
        .reduceByKey(_ + _)
    ).map { line =>
      if (line._2._2 == None) (line._1, 0)
      else (line._1, line._2._1 + line._2._2.get)
    }

    // One output line per id: id \t month-over-month change \t type-01 count \t type-02 count
    val rs = rs1.join(rs2).join(rs3).map { line =>
      line._1 + "\t" + line._2._1._1 + "\t" + line._2._1._2 + "\t" + line._2._2
    }
    rs.saveAsTextFile("/xxx/hebei/rs/rs")

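Code explanation:

The code simply mimics Hive-style join operations. Each metric is a leftOuterJoin of the id list against per-id counts, with unmatched ids set to 0, and the final joins combine the three results into one tab-separated line per id. The code is simple enough to study on your own.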
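
The "left outer join against per-id counts, default to 0" pattern appears four times in the code, so it could be factored into one function. The following is only a sketch: `countPerId` is a hypothetical helper name, the sample data is made up for illustration, and the local master setting assumes Spark is on the classpath (inside spark-shell, reuse the existing `sc` instead of creating a new context).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical helper: for every id in `pid`, the number of times it occurs
// in `ids`, with 0 for ids that never occur (left outer join + default).
def countPerId(pid: RDD[(String, Int)], ids: RDD[String]): RDD[(String, Int)] =
  pid.leftOuterJoin(ids.map((_, 1)).reduceByKey(_ + _))
    .mapValues { case (_, cnt) => cnt.getOrElse(0) }

// Local context and made-up sample data, for illustration only.
val sc = new SparkContext(
  new SparkConf().setAppName("countPerId-sketch").setMaster("local[*]"))

val pid = sc.parallelize(Seq("1", "2", "3")).map((_, 0))
val trans8 = sc.parallelize(Seq(
  ("2", "02", "2015-08"),
  ("2", "02", "2015-08"),
  ("3", "02", "2015-08")
))

// August type-02 count per id, sorted by id for stable output.
val collected = countPerId(pid, trans8.filter(_._2 == "02").map(_._1))
  .collect()
  .sortBy(_._1)

collected.foreach(println)
sc.stop()
```

With this sample data the output is `(1,0)`, `(2,2)`, `(3,1)`. Using `getOrElse(0)` replaces the explicit `== None` check and `.get` call without changing the result.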

0 0