Spark: Secondary Sort (Scala)

Source: Internet. Edited by: 程序博客网. Time: 2024/05/21 18:33

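Two ways to implement secondary sort in Spark (Scala):
(1) Group the records by key, then sort each key's values manually according to the desired rule.
(2) Define a custom data type that extends the Ordered trait (and Serializable) and implements the compare method.
(This approach is similar to how secondary sort is implemented in Hadoop MapReduce.)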

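// Sort by `first` ascending; for equal `first`, sort by `second` descending.
case class MySecType(first: String, second: Int) extends Ordered[MySecType] with Serializable {
  override def compare(that: MySecType): Int = {
    if (this.first != that.first)
      this.first.compareTo(that.first)
    else
      // Integer.compare avoids the overflow that `that.second - this.second` can hit
      Integer.compare(that.second, this.second)
  }
}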

The custom data type and its comparison rule (used by the second method).
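To see the comparison rule in isolation, here is a minimal sketch in plain Scala, without Spark. The object name CompareDemo and the four sample records are hypothetical (the values are borrowed from the results listed in this post), and the sketch uses Integer.compare for the second field instead of subtraction so that extreme values cannot overflow.

```scala
// Ordering rule for this sketch: `first` ascending, then `second` descending.
case class MySecType(first: String, second: Int) extends Ordered[MySecType] {
  override def compare(that: MySecType): Int =
    if (this.first != that.first) this.first.compareTo(that.first)
    else Integer.compare(that.second, this.second)
}

object CompareDemo {
  // Hypothetical sample records, deliberately out of order.
  val records: List[MySecType] = List(
    MySecType("bb", 32),
    MySecType("aa", 78),
    MySecType("bb", 98),
    MySecType("aa", 97)
  )

  // `sorted` uses the Ordering that Scala derives from Ordered[MySecType].
  val sortedRecords: List[MySecType] = records.sorted

  def main(args: Array[String]): Unit = sortedRecords.foreach(println)
}
```

Records with the same first field stay together, and within each group the larger second value comes first.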

import org.apache.spark.{SparkConf, SparkContext}

object SecondarySort {
  def main(args: Array[String]): Unit = {
    val sparkConf: SparkConf = new SparkConf()
      .setAppName("MySparkWork")
      .setMaster("local[2]")
    //.setMaster("spark://192.168.147.100:7077")
    val sc: SparkContext = new SparkContext(sparkConf)

    //--------------------Read data from HDFS-----------------
    // Each line is expected to hold a key and an integer separated by one space.
    val secdata = sc.textFile("hdfs://192.168.147.100:8020/user/hadoop/secondsortdata")
    secdata.cache() // the RDD is read twice, once per method

    //----------The first method to secondary sort---------
    // Group the values by key, then sort each group in descending order.
    val result = secdata.map {
      line => {
        val arr = line.split(" ")
        (arr(0), arr(1).toInt)
      }
    }.groupByKey()
      .map(tuple => {
        val key = tuple._1
        val iter = tuple._2
        val sortValues = iter.toList.sorted.reverse
        (key, sortValues)
      })
    // groupByKey does not order the keys; add .sortByKey() if key order matters.
    result.foreach(println)

    //----------The second method to secondary sort--------
    // Wrap each record in MySecType and let sortBy use its compare method.
    val myresult = secdata.map(
      line => {
        val arr = line.split(" ")
        val mytype = MySecType(arr(0), arr(1).toInt)
        mytype
      }
    ).sortBy(x => x, true, 1) // ascending, collected into a single partition
    myresult.foreach(println)

    sc.stop()
  }
}

The first method produces the following result:
(aa,List(97, 80, 78, 69))
(bb,List(98, 97, 92, 78, 34, 32, 23))
(cc,List(98, 87, 86, 85, 72))
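The group-then-sort idea behind the first method can be tried without a cluster. The sketch below uses ordinary Scala collections rather than an RDD; the object name GroupSortDemo and the sample lines are hypothetical, and only the "key value" line format is taken from the Spark job in this post.

```scala
object GroupSortDemo {
  // Hypothetical input lines in the "key value" format the Spark job parses.
  val lines: List[String] =
    List("aa 78", "bb 98", "aa 97", "cc 85", "bb 32", "aa 69", "aa 80")

  // Parse each line, group by key, then sort each group's values descending.
  val grouped: List[(String, List[Int])] =
    lines
      .map { line =>
        val arr = line.split(" ")
        (arr(0), arr(1).toInt)
      }
      .groupBy(_._1)
      .map { case (key, pairs) =>
        (key, pairs.map(_._2).sorted(Ordering[Int].reverse))
      }
      .toList
      .sortBy(_._1) // order the keys too, so the output is deterministic

  def main(args: Array[String]): Unit = grouped.foreach(println)
}
```

Note that this approach holds all values of one key in memory at once, which is also true of groupByKey followed by toList in the Spark version.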

The second method produces the following result:
MySecType(aa,97)
MySecType(aa,80)
MySecType(aa,78)
MySecType(aa,69)
MySecType(bb,98)
MySecType(bb,97)
MySecType(bb,92)
MySecType(bb,78)
MySecType(bb,34)
MySecType(bb,32)
MySecType(bb,23)
MySecType(cc,98)
MySecType(cc,87)
MySecType(cc,86)
MySecType(cc,85)
MySecType(cc,72)
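The same order (key ascending, value descending) can also be expressed without a custom class, by composing an Ordering for plain tuples. This is a variation that the original code does not use; the object name TupleOrderingDemo and the sample pairs are hypothetical, and the sketch runs on plain Scala collections.

```scala
object TupleOrderingDemo {
  // Hypothetical (key, value) pairs in the same shape as the parsed input.
  val pairs: List[(String, Int)] =
    List(("bb", 32), ("aa", 78), ("bb", 98), ("aa", 97))

  // Key ascending, value descending, built from the standard orderings.
  val keyAscValueDesc: Ordering[(String, Int)] =
    Ordering.Tuple2(Ordering.String, Ordering.Int.reverse)

  val sortedPairs: List[(String, Int)] = pairs.sorted(keyAscValueDesc)

  def main(args: Array[String]): Unit = sortedPairs.foreach(println)
}
```

Since RDD.sortBy takes its Ordering as an implicit parameter, an ordering like this could presumably be supplied there as well, though that has not been tested against the job in this post.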

These are two ways to implement secondary sort in Spark; comments and corrections are welcome.
