Secondary Sort and Summation in Spark with Scala

Source: Internet. Editor: 程序博客网. Time: 2024/05/16 03:14

Implement the following logic with a custom MR (MapReduce-style) job.

product_no   lac_id  moment  start_time                     user_id  county_id  staytime  city_id
13429100031  22554   8       2013-03-11 08:55:19.151754088  571      571        282       571
13429100082  22540   8       2013-03-11 08:58:20.152622488  571      571        270       571
13429100082  22691   8       2013-03-11 08:56:37.149593624  571      571        103       571
13429100087  22705   8       2013-03-11 08:56:51.139539816  571      571        220       571
13429100087  22540   8       2013-03-11 08:55:45.150276800  571      571        66        571
13429100082  22540   8       2013-03-11 08:55:38.140225200  571      571        133       571
13429100140  26642   9       2013-03-11 09:02:19.151754088  571      571        18        571
13429100082  22691   8       2013-03-11 08:57:32.151754088  571      571        287       571
13429100189  22558   8       2013-03-11 08:56:24.139539816  571      571        48        571
13429100349  22503   8       2013-03-11 08:54:30.152622440  571      571        211       571


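Field descriptions:
product_no: the user's mobile phone number;
lac_id: the base station the user is attached to;
start_time: the time the user arrived at this base station;
staytime: how long the user stayed at this base station.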


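Requirements:
lac_id and start_time give the user's location at a given moment, and staytime gives how long the user stayed at each base station. Following the user's track, merge the staytime of consecutive records at the same base station.
The final result is, for every user, the stay duration at each base station, ordered by time.
Example of the expected output: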

13429100031  22554   8       2013-03-11 08:55:19.151754088  571      571        282       571
13429100082  22540   8       2013-03-11 08:58:20.152622488  571      571        270       571
13429100082  22691   8       2013-03-11 08:56:37.149593624  571      571        390       571
13429100082  22540   8       2013-03-11 08:55:38.140225200  571      571        133       571
13429100087  22705   8       2013-03-11 08:56:51.139539816  571      571        220       571
13429100087  22540   8       2013-03-11 08:55:45.150276800  571      571        66        571
13429100140  26642   9       2013-03-11 09:02:19.151754088  571      571        18        571
13429100189  22558   8       2013-03-11 08:56:24.139539816  571      571        48        571
13429100349  22503   8       2013-03-11 08:54:30.152622440  571      571        211       571


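Analysis of the result above:
The first column (product_no) is in ascending order, and within each user the fourth column (start_time) is in descending order. So the first step is to extract these two columns as the key and define a custom ordering on it.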
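
The ordering described above can be tried on plain Scala collections before it is wired into Spark. The following is only a minimal sketch: the `Key` case class and the `KeyOrderingSketch` object are hypothetical names introduced for illustration, and it assumes every start_time string uses the same fixed-width format, so that comparing the strings gives the same result as comparing the times.

```scala
// Hypothetical key holding only the two columns used for sorting.
final case class Key(productNo: String, startTime: String)

object KeyOrderingSketch {
  // Ascending by productNo, then descending by startTime.
  // Assumes startTime is a fixed-width "yyyy-MM-dd HH:mm:ss.nnnnnnnnn" string,
  // so string order equals chronological order.
  implicit val keyOrdering: Ordering[Key] = new Ordering[Key] {
    def compare(a: Key, b: Key): Int = {
      val byUser = a.productNo.compareTo(b.productNo)
      if (byUser != 0) byUser else b.startTime.compareTo(a.startTime)
    }
  }

  // Three rows taken from the sample data, deliberately out of order.
  val sample: List[Key] = List(
    Key("13429100082", "2013-03-11 08:56:37.149593624"),
    Key("13429100031", "2013-03-11 08:55:19.151754088"),
    Key("13429100082", "2013-03-11 08:58:20.152622488")
  )

  // Uses the implicit ordering defined above.
  val sorted: List[Key] = sample.sorted
}
```

With these three rows, `KeyOrderingSketch.sorted` puts user 13429100031 first, followed by the two records of user 13429100082 with the later one (08:58:20) ahead of the earlier one (08:56:37).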


import com.datascience.test.{Track, SecondarySort}
import org.apache.spark.{SparkConf, SparkContext}

/**
  * Created by on 2017/11/13.
  */
object FindTrack {

  // Parse one tab-separated line into a (sort key, record) pair.
  def parse(line: String): (SecondarySort, Track) = {
    val pieces = line.split("\t")
    val product_no = pieces(0)
    val lac_id = pieces(1)
    val moment = pieces(2)
    val start_time = pieces(3)
    val user_id = pieces(4)
    val county_id = pieces(5)
    val staytime = pieces(6).toInt
    val city_id = pieces(7)
    val se = new SecondarySort(product_no, start_time)
    val track = new Track(product_no, lac_id, moment, start_time, user_id, county_id, staytime, city_id)
    (se, track)
  }

  def isHeader(line: String): Boolean = line.contains("product_no")

  // Lexicographic comparison; String.compareTo also copes with strings of different lengths.
  def compTo(one: String, another: String): Int = one.compareTo(another)

  // Merge two records: keep the fields of the earlier one and sum the stay times.
  def add(x: Track, y: Track): Track = {
    if (compTo(x.startTime, y.startTime) < 0) {
      new Track(x.productNo, x.lacId, x.moment, x.startTime, x.userId, x.countyId, x.staytime + y.staytime, x.cityId)
    } else {
      new Track(y.productNo, y.lacId, y.moment, y.startTime, y.userId, y.countyId, x.staytime + y.staytime, y.cityId)
    }
  }

  // Take the first record of each key group.
  def get(x: (SecondarySort, Iterable[Track])): Track = x._2.head

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("FindTrack"))
    val base = "/user/ds/"
    val rawData = sc.textFile(base + "track.txt")
    val mds = rawData
      .filter(x => !isHeader(x))
      .map(x => parse(x))
      .groupByKey()
      .sortByKey(true)
      .collect()
      .map(x => get(x))
      .reduceLeft { (x, y) =>
        // Same user and same station as the pending record: merge them.
        // Otherwise print the pending record and continue with the new one.
        if (x.productNo == y.productNo && x.lacId == y.lacId)
          add(x, y)
        else {
          println(x)
          y
        }
      }
    // reduceLeft returns the last pending record, which is never printed inside the fold.
    println(mds)
  }
}


The secondary-sort comparison class

/**
  * Created by on 2017/11/13.
  */
class SecondarySort(val first: String, val second: String) extends Ordered[SecondarySort] with Serializable {

  // Lexicographic comparison; String.compareTo also copes with strings of different lengths.
  def compTo(one: String, another: String): Int = one.compareTo(another)

  // Ascending on first (product_no), descending on second (start_time).
  override def compare(that: SecondarySort): Int = {
    val minus = compTo(this.first, that.first)
    if (minus != 0) minus
    else -compTo(this.second, that.second)
  }

  override def equals(obj: Any): Boolean = obj match {
    case that: SecondarySort => this.first == that.first && this.second == that.second
    case _ => false
  }

  override def toString: String = first + " " + second

  override def hashCode: Int = this.first.hashCode() + this.second.hashCode()
}



/**
  * Created by on 2017/11/13.
  */
class Track extends java.io.Serializable {
  var productNo: String = ""
  var lacId: String = ""
  var moment: String = ""
  var startTime: String = ""
  var userId: String = ""
  var countyId: String = ""
  var staytime: Int = 0
  var cityId: String = ""

  def this(_productNo: String, _lacId: String, _moment: String, _startTime: String,
           _userId: String, _countyId: String, _staytime: Int, _cityId: String) = {
    this()
    this.productNo = _productNo
    this.lacId = _lacId
    this.moment = _moment
    this.startTime = _startTime
    this.userId = _userId
    this.countyId = _countyId
    this.staytime = _staytime
    this.cityId = _cityId
  }

  override def toString: String = {
    productNo + " " + lacId + " " + moment + " " + startTime + "  " + userId + "  " + countyId + "  " + staytime + "  " + cityId
  }
}

Result screenshot: the secondary sort works.



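The final addition step still has some problems. If you have any good suggestions, please leave me a comment; I would greatly appreciate it.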
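
One possible way to approach the addition step is sketched here; it is a suggestion, not a tested fix for the job above. The idea is to sort one user's records by time and fold over them, adding staytime into the previous record whenever the station is unchanged. The `Rec` case class, the `MergeSketch` object and the `mergeConsecutive` helper are hypothetical names introduced for illustration. The example runs on plain Scala collections and again assumes fixed-width start_time strings.

```scala
// Hypothetical, trimmed-down record: only the fields needed for merging.
final case class Rec(productNo: String, lacId: String, startTime: String, staytime: Int)

object MergeSketch {
  // Merge runs of time-consecutive records that share a lacId.
  // A merged record keeps the earliest startTime and the summed staytime.
  def mergeConsecutive(records: Seq[Rec]): List[Rec] = {
    // Fixed-width timestamps sort correctly as strings.
    val ascending = records.sortBy(_.startTime)
    ascending.foldLeft(List.empty[Rec]) {
      // Same station as the most recent record: add the stay time to it.
      case (prev :: done, cur) if prev.lacId == cur.lacId =>
        prev.copy(staytime = prev.staytime + cur.staytime) :: done
      // Different station (or first record): start a new run.
      case (acc, cur) => cur :: acc
    }
    // Prepending during the fold leaves the result in descending time order.
  }

  // The four sample rows of user 13429100082.
  val user82: Seq[Rec] = Seq(
    Rec("13429100082", "22540", "2013-03-11 08:58:20.152622488", 270),
    Rec("13429100082", "22691", "2013-03-11 08:56:37.149593624", 103),
    Rec("13429100082", "22540", "2013-03-11 08:55:38.140225200", 133),
    Rec("13429100082", "22691", "2013-03-11 08:57:32.151754088", 287)
  )

  val merged: List[Rec] = mergeConsecutive(user82)
}
```

For these four rows, `MergeSketch.merged` holds three records in descending time order: station 22540 with staytime 270, station 22691 with staytime 390 (103 + 287, starting at 08:56:37), and station 22540 with staytime 133. Because the fold returns its accumulator, the last run is not lost. In a Spark job, a function like this could presumably be applied to each user's records after grouping by product_no, which would also avoid folding over the whole collected data set on the driver.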