A simple implementation of kNN on Spark

Source: Internet | Editor: 程序博客网 | Date: 2024/06/03 09:34

Spark is an excellent distributed computing framework. The kNN in this post is based on the Euclidean distance formula. Below is a simple implementation; there may well be problems, and I hope readers will point them out.

    import org.apache.log4j.{Level, Logger}
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.rdd.RDD

    // Quiet the noisy loggers
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)

    val conf = new SparkConf().setAppName("knn")
    conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val k: Int = 6
    val path = "hdfs://master:9000/knn.txt"

    // Each line: "x y label", whitespace-separated
    val data = sc.textFile(path).map { line =>
      val pair = line.split("\\s+")
      (pair(0).toDouble, pair(1).toDouble, pair(2))
    }

    // 70/30 split into training and test sets
    val total: Array[RDD[(Double, Double, String)]] = data.randomSplit(Array(0.7, 0.3))
    val train = total(0).cache()
    val test = total(1).cache()
    train.count()
    test.count()

    // Ship the (small) training set and k to every executor
    val bcTrainSet = sc.broadcast(train.collect())
    val bck = sc.broadcast(k)

    val resultSet = test.map { line =>
      val x = line._1
      val y = line._2
      val trainDatas = bcTrainSet.value

      // Euclidean distance from the test point to every training point
      val set = scala.collection.mutable.ArrayBuffer.empty[(Double, String)]
      trainDatas.foreach { e =>
        val tx = e._1
        val ty = e._2
        val distance = math.sqrt(math.pow(x - tx, 2) + math.pow(y - ty, 2))
        set += ((distance, e._3))
      }

      // Sort by distance and tally the labels of the k nearest neighbors
      val list = set.sortBy(_._1)
      val categoryCountMap = scala.collection.mutable.Map.empty[String, Int]
      val k = bck.value
      for (i <- 0 until k) {
        val category = list(i)._2
        val count = categoryCountMap.getOrElse(category, 0) + 1
        categoryCountMap += (category -> count)
      }

      // Majority vote decides the predicted label
      val (rCategory, _) = categoryCountMap.maxBy(_._2)
      (x, y, rCategory)
    }

    resultSet.repartition(1).saveAsTextFile("hdfs://master:9000/knn/result")
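The core distance-and-vote logic can be exercised locally without a cluster. The following sketch mirrors the per-point computation above; the object name `KnnLocal` and the toy training data are my own, for illustration only:

```scala
// A minimal local sketch of the same kNN logic, without Spark.
object KnnLocal {
  /** Classify point (x, y) by majority vote among the k nearest neighbors. */
  def classify(x: Double, y: Double,
               train: Seq[(Double, Double, String)], k: Int): String = {
    val neighbors = train
      .map { case (tx, ty, label) =>
        // Same Euclidean distance as in the Spark version
        (math.sqrt(math.pow(x - tx, 2) + math.pow(y - ty, 2)), label)
      }
      .sortBy(_._1)
      .take(k)
    // Majority vote over the k nearest labels
    neighbors.groupBy(_._2).maxBy(_._2.size)._1
  }

  def main(args: Array[String]): Unit = {
    val train = Seq(
      (1.0, 1.0, "A"), (1.2, 0.8, "A"), (0.9, 1.1, "A"),
      (5.0, 5.0, "B"), (5.2, 4.9, "B"), (4.8, 5.1, "B")
    )
    println(classify(1.1, 1.0, train, k = 3)) // nearest three are all "A"
  }
}
```

This also shows the expected input shape: two numeric coordinates plus a string label per point, matching the `"x y label"` lines the Spark job reads.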

The above is the simplest possible implementation. A weighted variant is also possible: for example, when tallying votes, weight each neighbor's vote by the inverse of its distance, so that closer neighbors contribute more to the final count.
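A minimal sketch of that inverse-distance weighting, assuming the input is the (distance, label) pairs of the k nearest neighbors; the `WeightedVote` name and the `eps` guard against zero distance are illustrative assumptions:

```scala
// Inverse-distance weighted voting: each neighbor votes with weight
// 1 / (distance + eps) instead of a flat count of 1.
object WeightedVote {
  /** neighbors: (distance, label) pairs for the k nearest training points. */
  def vote(neighbors: Seq[(Double, String)]): String = {
    val eps = 1e-9 // guard against division by zero when distance == 0
    neighbors
      .groupBy(_._2)
      .map { case (label, ns) =>
        (label, ns.map { case (d, _) => 1.0 / (d + eps) }.sum)
      }
      .maxBy(_._2)
      ._1
  }

  def main(args: Array[String]): Unit = {
    // One very close "A" vs two distant "B"s: a plain count picks "B",
    // but inverse-distance weighting picks "A".
    val neighbors = Seq((0.1, "A"), (2.0, "B"), (2.5, "B"))
    println(vote(neighbors))
  }
}
```

Plugging this in place of the `categoryCountMap` loop above would only change the tallying step; the distance computation and sorting stay the same.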
