Spark Scala: Secondary Sort and Summation
Implement the following MapReduce-style logic with a custom sort. The input data looks like this:
product_no   lac_id  moment  start_time                     user_id  county_id  staytime  city_id
13429100031  22554   8       2013-03-11 08:55:19.151754088  571      571        282       571
13429100082  22540   8       2013-03-11 08:58:20.152622488  571      571        270       571
13429100082  22691   8       2013-03-11 08:56:37.149593624  571      571        103       571
13429100087  22705   8       2013-03-11 08:56:51.139539816  571      571        220       571
13429100087  22540   8       2013-03-11 08:55:45.150276800  571      571        66        571
13429100082  22540   8       2013-03-11 08:55:38.140225200  571      571        133       571
13429100140  26642   9       2013-03-11 09:02:19.151754088  571      571        18        571
13429100082  22691   8       2013-03-11 08:57:32.151754088  571      571        287       571
13429100189  22558   8       2013-03-11 08:56:24.139539816  571      571        48        571
13429100349  22503   8       2013-03-11 08:54:30.152622440  571      571        211       571
product_no: the user's phone number;
lac_id: the cell tower (location area) the user is on;
start_time: when the user arrived at this tower;
staytime: how long the user stayed at this tower.
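As a quick illustration of the record layout, here is a minimal, standalone parsing sketch (hypothetical code, not from the original post; it assumes the fields arrive tab-separated in the order listed above):

```scala
object ParseSketch {
  // Field names follow the table header in the sample data.
  case class Record(productNo: String, lacId: String, moment: String,
                    startTime: String, userId: String, countyId: String,
                    staytime: Int, cityId: String)

  // Split one tab-separated line into its eight fields.
  def parse(line: String): Record = {
    val p = line.split("\t")
    Record(p(0), p(1), p(2), p(3), p(4), p(5), p(6).toInt, p(7))
  }

  def main(args: Array[String]): Unit = {
    val line = "13429100031\t22554\t8\t2013-03-11 08:55:19.151754088\t571\t571\t282\t571"
    println(parse(line))
  }
}
```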
Requirement:
From lac_id and start_time we know where the user was at a given moment, and from staytime we know how long the user lingered at each tower. Following the user's trajectory, merge the staytime of consecutive records at the same tower.
The final result should list, for each user and in time order, the dwell time at each tower.
An example of the expected output:
13429100031  22554  8  2013-03-11 08:55:19.151754088  571  571  282  571
13429100082  22540  8  2013-03-11 08:58:20.152622488  571  571  270  571
13429100082  22691  8  2013-03-11 08:56:37.149593624  571  571  370  571
13429100082  22540  8  2013-03-11 08:55:38.140225200  571  571  133  571
13429100087  22705  8  2013-03-11 08:56:51.139539816  571  571  220  571
13429100087  22540  8  2013-03-11 08:55:45.150276800  571  571  66   571
13429100140  26642  9  2013-03-11 09:02:19.151754088  571  571  18   571
13429100189  22558  8  2013-03-11 08:56:24.139539816  571  571  48   571
13429100349  22503  8  2013-03-11 08:54:30.152622440  571  571  211  571
Looking at the expected result: the first column (product_no) is sorted ascending, and the fourth column (start_time) descending within each user. So we first extract these two columns as a composite key and sort with a custom comparator.
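Before looking at the full Spark job, the key idea can be shown on a plain local collection. This is a minimal sketch (illustrative names, not the original code) that sorts (product_no, start_time) pairs ascending on the first component and descending on the second:

```scala
object SortSketch {
  // Ascending on product_no, descending on start_time. Because the
  // timestamps are zero-padded "yyyy-MM-dd HH:mm:ss" strings, plain
  // lexicographic comparison already orders them chronologically.
  def secondarySort(rows: Seq[(String, String)]): Seq[(String, String)] =
    rows.sorted(Ordering.Tuple2(Ordering.String, Ordering.String.reverse))

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      ("13429100082", "2013-03-11 08:56:37"),
      ("13429100031", "2013-03-11 08:55:19"),
      ("13429100082", "2013-03-11 08:58:20"))
    secondarySort(rows).foreach(println)
  }
}
```

On a Spark RDD the same composite ordering is what the SecondarySort key class below provides to sortByKey.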
import com.datascience.test.{Track, SecondarySort}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Created by on 2017/11/13.
 */
object FindTrack {

  // Parse one tab-separated line into a (SecondarySort, Track) pair.
  def parse(line: String) = {
    val pieces = line.split("\t")
    val product_no = pieces(0)
    val lac_id = pieces(1)
    val moment = pieces(2)
    val start_time = pieces(3)
    val user_id = pieces(4)
    val county_id = pieces(5)
    val staytime = pieces(6).toInt
    val city_id = pieces(7)
    val se = new SecondarySort(product_no, start_time)
    val track = new Track(product_no, lac_id, moment, start_time, user_id, county_id, staytime, city_id)
    (se, track)
  }

  // Skip the header line.
  def isHeader(line: String): Boolean = line.contains("product_no")

  // Character-by-character string comparison (assumes equal-length strings).
  def compTo(one: String, another: String): Int = {
    val len = one.length - 1
    val v1 = one.toCharArray
    val v2 = another.toCharArray
    for (i <- 0 to len) {
      val c1 = v1(i)
      val c2 = v2(i)
      if (c1 != c2) return c1 - c2
    }
    0
  }

  // Merge two tracks at the same tower: keep the earlier start time, sum the stay times.
  def add(x: Track, y: Track): Track = {
    if (compTo(x.startTime, y.startTime) < 0)
      new Track(x.productNo, x.lacId, x.moment, x.startTime, x.userId, x.countyId, x.staytime + y.staytime, x.cityId)
    else
      new Track(y.productNo, y.lacId, y.moment, y.startTime, y.userId, y.countyId, x.staytime + y.staytime, y.cityId)
  }

  // Take the first track of each key's group.
  def get(x: (SecondarySort, Iterable[Track])): Track = x._2.head

  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("FindTrack"))
    val base = "/user/ds/"
    val rawData = sc.textFile(base + "track.txt")
    val last = rawData
      .filter(x => !isHeader(x))
      .map(parse)
      .groupByKey()
      .sortByKey(true)
      .collect()
      .map(get)
      .reduceLeft { (x, y) =>
        if (x.productNo == y.productNo && x.lacId == y.lacId) add(x, y)
        else {
          println(x)
          y
        }
      }
    println(last) // without this, the final merged track was silently dropped
    sc.stop()
  }
}
The secondary-sort comparator class:
/**
 * Created by on 2017/11/13.
 */
class SecondarySort(val first: String, val second: String)
    extends Ordered[SecondarySort] with Serializable {

  // Character-by-character comparison (assumes equal-length strings).
  def compTo(one: String, another: String): Int = {
    val len = one.length - 1
    val v1 = one.toCharArray
    val v2 = another.toCharArray
    for (i <- 0 to len) {
      val c1 = v1(i)
      val c2 = v2(i)
      if (c1 != c2) return c1 - c2
    }
    0
  }

  // Ascending on the first field, descending on the second.
  override def compare(that: SecondarySort): Int = {
    val minus = compTo(this.first, that.first)
    if (minus != 0) return minus
    -compTo(this.second, that.second)
  }

  override def equals(obj: Any): Boolean = {
    if (!obj.isInstanceOf[SecondarySort]) return false
    val obj2 = obj.asInstanceOf[SecondarySort]
    (this.first == obj2.first) && (this.second == obj2.second)
  }

  override def toString: String = first + " " + second

  override def hashCode: Int = this.first.hashCode() + this.second.hashCode()
}
/**
 * Created by on 2017/11/13.
 */
class Track extends java.io.Serializable {
  var productNo: String = ""
  var lacId: String = ""
  var moment: String = ""
  var startTime: String = ""
  var userId: String = ""
  var countyId: String = ""
  var staytime: Int = 0
  var cityId: String = ""

  def this(_productNo: String, _lacId: String, _moment: String, _startTime: String,
           _userId: String, _countyId: String, _staytime: Int, _cityId: String) {
    this()
    this.productNo = _productNo
    this.lacId = _lacId
    this.moment = _moment
    this.startTime = _startTime
    this.userId = _userId
    this.countyId = _countyId
    this.staytime = _staytime
    this.cityId = _cityId
  }

  override def toString: String =
    productNo + " " + lacId + " " + moment + " " + startTime + " " + userId + " " +
      countyId + " " + staytime + " " + cityId
}
Result: the secondary sort works as intended (screenshot omitted).
The final addition step still has a problem; if anyone has a good suggestion, please leave me a comment. Much appreciated.
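One hedged suggestion for that addition step (a sketch under my own assumptions, not the author's code): after the secondary sort, collapse runs of consecutive records that share the same (productNo, lacId) with a fold, which naturally keeps every group, including the last one, which a reduceLeft that only prints on a key change drops:

```scala
object MergeSketch {
  // Simplified record for illustration: (productNo, lacId, startTime, staytime)
  type Rec = (String, String, String, Int)

  // Collapse consecutive records at the same tower for the same user:
  // keep the earlier start time, sum the stay times.
  def mergeRuns(sorted: Seq[Rec]): Seq[Rec] =
    sorted.foldLeft(List.empty[Rec]) {
      case (prev :: rest, cur) if prev._1 == cur._1 && prev._2 == cur._2 =>
        val start = if (prev._3 < cur._3) prev._3 else cur._3
        (prev._1, prev._2, start, prev._4 + cur._4) :: rest
      case (acc, cur) => cur :: acc
    }.reverse

  def main(args: Array[String]): Unit = {
    // Records already sorted: product_no ascending, start_time descending.
    val sorted = Seq(
      ("13429100082", "22691", "08:57:32", 287),
      ("13429100082", "22691", "08:56:37", 103),
      ("13429100082", "22540", "08:55:38", 133))
    mergeRuns(sorted).foreach(println)
  }
}
```

Because the fold prepends and reverses at the end, every run is emitted exactly once; on a real RDD the same idea could run per-partition after repartitionAndSortWithinPartitions so that each user's records stay together.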