How to Filter Duplicate Objects in Spark

Source: Internet | Editor: 程序博客网 | Time: 2024/06/06 04:41

The data looks like this:

hello   world
hello   spark
hello   hive
hello   world
hello   spark
hello   hive

What we ultimately need is only

hello   world
hello   spark
hello   hive

these three lines, with the duplicates discarded. There are two ways to do this.

First: call distinct as soon as the program has loaded the text into a line RDD, which filters the duplicates directly. For example:

lineRdd.distinct()

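However, this approach is really only suitable for testing, because real data rarely has just two columns. The most common requirement is to decide whether two records are the same based on the values of a few of their columns, and to drop the duplicates when those values match.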
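To make the "compare only a few columns" idea concrete, here is a minimal sketch in plain Scala, without Spark. It assumes tab-separated lines in which every line contains all of the chosen columns; the helper name `dedupByColumns` is hypothetical. On an RDD the same idea would be a map to (key, record) followed by reduceByKey, which is what the second approach does.

```scala
import scala.collection.mutable

object DedupByColumns {
  // Hypothetical helper: keep the first line seen for each combination of
  // values in the chosen (0-based) key columns. Assumes every line has
  // at least as many tab-separated fields as the largest index in keyCols.
  def dedupByColumns(lines: Seq[String], keyCols: Seq[Int]): Seq[String] = {
    // LinkedHashMap keeps insertion order, so the output order is stable.
    val seen = mutable.LinkedHashMap.empty[Seq[String], String]
    for (line <- lines) {
      val fields = line.split("\t")
      val key = keyCols.map(i => fields(i))
      if (!seen.contains(key)) seen(key) = line
    }
    seen.values.toList
  }
}
```

Calling `DedupByColumns.dedupByColumns(lines, Seq(0, 1))` treats two lines as duplicates when their first two columns match, regardless of what the remaining columns contain.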

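Second: filter out the duplicates when the data is wrapped in objects. For this I used reduceByKey(), because in my tests distinct only deduplicated numeric and string values and did not deduplicate my custom objects. distinct itself is also implemented on top of reduceByKey. Let's look at the source first: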

/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)
}

/**
 * Return a new RDD containing the distinct elements in this RDD.
 */
def distinct(): RDD[T] = withScope {
  distinct(partitions.length)
}
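The three steps in that source (pair each element with null, reduce per key, keep only the keys) can be imitated on a local collection. The following is a rough sketch in plain Scala, not Spark code: `LocalDistinct` and its local `reduceByKey` are hypothetical stand-ins built on `groupBy`. It is only meant to show that deduplication falls out of grouping by key, which in turn relies on the key's `equals` and `hashCode`.

```scala
object LocalDistinct {
  // Local stand-in for reduceByKey: merge all values that share a key.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // Same three steps as the distinct source above:
  // pair each element with null, reduce per key, keep only the keys.
  // A Set is returned because the order of the result is not guaranteed.
  def distinct[T](xs: Seq[T]): Set[T] =
    reduceByKey(xs.map(x => (x, null)))((x, _) => x).keySet
}
```

Because each element becomes a key, equal elements collapse into a single entry during the reduce step, and the null values are simply discarded at the end.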

For a detailed walkthrough of the source, see this article: http://blog.csdn.net/u014393917/article/details/50602431

Below is how I filtered objects, following the approach used in the source:

Code:

import org.apache.spark.{SparkConf, SparkContext}

// Enclosing object added so the snippet compiles; the name is arbitrary.
object MapTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("MapTest")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    // Register the custom type to be serialized.
    conf.registerKryoClasses(Array(classOf[StringPojo]))
    val context = new SparkContext(conf)
    val lineRdd = context.textFile("E:\\sourceData\\maptest.txt")
    // lineRdd.distinct()

    // Wrap each tab-separated line in a StringPojo.
    val wordsRdd = lineRdd.map { line =>
      val words = line.split("\t")
      new StringPojo(words(0), words(1))
    }

    // Count how many times each (name + secondName) key occurs.
    val pairRdd1 = wordsRdd.map(pojo => (pojo.name + pojo.secondName, 1))
    pairRdd1.reduceByKey(_ + _).foreach(println)

    // Deduplicate: key each object by its fields and keep one object per key.
    val pairRdd = wordsRdd.map(pojo => (pojo.name + pojo.secondName, pojo))
    pairRdd.reduceByKey((x, y) => x).map(_._2).foreach(println)

    // pairRdd.distinct().foreach(println)
    // val distRdd = wordsRdd.distinct()
    // distRdd.foreach(println)
  }
}

Entity class:

class StringPojo(val name: String, val secondName: String) {
  override def toString: String = this.name + "|" + this.secondName

  // Note: only hashCode is overridden here; equals is not.
  override def hashCode(): Int = {
    // println("name:" + this.secondName + ",hashCode:" + this.secondName.hashCode)
    this.secondName.hashCode
  }
}

Results:

(helloworld,2)
(hellospark,2)
(hellohive,2)

hello|spark
hello|world
hello|hive

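That is the whole process of filtering duplicate objects out of an RDD. I still have one question, though: a String and my custom class are both object types, so why can strings be deduplicated by calling distinct directly while my custom objects cannot? At first I thought it was because hashCode had not been overridden, but even after overriding it the duplicates were still not removed. I still believe custom objects can be deduplicated this way, and I have probably missed something.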
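One likely explanation, offered as a sketch rather than a verified diagnosis of the job above: hash-based deduplication needs both `hashCode` and `equals`. String defines both, while StringPojo overrides only `hashCode`, so two instances with the same fields land in the same bucket but are still compared by reference and count as different keys. The classes `HashOnlyPojo` and `EqualsPojo` below are hypothetical, and the demo uses the plain Scala collection `distinct` instead of an RDD so that it runs without Spark.

```scala
// Overrides only hashCode, like the StringPojo above: two instances with the
// same fields share a hash code but are still not equal to each other.
class HashOnlyPojo(val name: String, val secondName: String) {
  override def hashCode(): Int = secondName.hashCode
}

// Hypothetical variant that overrides both equals and hashCode consistently.
class EqualsPojo(val name: String, val secondName: String) {
  override def equals(other: Any): Boolean = other match {
    case that: EqualsPojo => name == that.name && secondName == that.secondName
    case _                => false
  }
  override def hashCode(): Int = (name, secondName).hashCode
  override def toString: String = name + "|" + secondName
}

object EqualsDemo {
  // Number of elements left after hash-based deduplication.
  def countDistinct[T](xs: Seq[T]): Int = xs.distinct.size
}
```

Declaring the class as a case class would generate a matching `equals` and `hashCode` automatically. With either change, `wordsRdd.distinct()` should be able to treat equal objects as duplicates, although this has not been rerun against the original job here.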

0 0