Basic Concepts of Data Statistics in MLlib

Note: the file kimi.txt contains the following five values, one number per line: 1 2 3 4 5
1. Computing the Mean and Variance of a Dataset
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.{SparkConf, SparkContext}

object testVector {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("testVector")
    val sc = new SparkContext(conf)
    // Parse each line of kimi.txt into a dense vector (one number per line here)
    val rdd = sc.textFile("kimi.txt")
      .map(_.split(' ').map(_.toDouble))
      .map(line => Vectors.dense(line))
    val summary = Statistics.colStats(rdd)
    println(summary.mean)     // column means
    println(summary.variance) // column variances (this is the variance, not the standard deviation)
  }
}
Program output (mean, then variance):
[3.0]
[2.5]
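As a quick sanity check, the same two statistics can be computed by hand from the five values. A minimal plain-Scala sketch, no Spark needed (the object name VerifyStats is illustrative, not from the original program):

object VerifyStats {
  def main(args: Array[String]): Unit = {
    val xs = Array(1.0, 2.0, 3.0, 4.0, 5.0) // the values in kimi.txt
    val mean = xs.sum / xs.length           // (1+2+3+4+5)/5 = 3.0
    // colStats reports the unbiased sample variance (divides by n-1)
    val variance = xs.map(x => math.pow(x - mean, 2)).sum / (xs.length - 1) // 10/4 = 2.5
    println(mean)     // 3.0
    println(variance) // 2.5
  }
}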
2. Distance Calculation
1. Manhattan distance (normL1): the sum of the absolute axis distances between two points in a standard coordinate system; for a single vector, the sum of the absolute values of its components.
2. Euclidean distance (normL2): the true straight-line distance between two points in m-dimensional space, or the natural length of a vector (its distance from the origin).

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.{SparkConf, SparkContext}

object testVector {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("testVector")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("kimi.txt")
      .map(_.split(' ').map(_.toDouble))
      .map(line => Vectors.dense(line))
    val summary = Statistics.colStats(rdd)
    println(summary.normL1) // L1 norm of each column: sum of absolute values
    println(summary.normL2) // L2 norm of each column: Euclidean length
  }
}
Program output (normL1, then normL2):
[15.0]
[7.416198487095663]
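These two numbers can be checked directly: with the five one-element rows of kimi.txt, each column statistic reduces to a computation over the values 1 through 5. A minimal plain-Scala sketch, no Spark needed (the object name VerifyNorms is illustrative):

object VerifyNorms {
  def main(args: Array[String]): Unit = {
    val xs = Array(1.0, 2.0, 3.0, 4.0, 5.0)       // the values in kimi.txt
    val normL1 = xs.map(math.abs).sum              // 1+2+3+4+5 = 15.0
    val normL2 = math.sqrt(xs.map(x => x * x).sum) // sqrt(55) ≈ 7.4162
    println(normL1) // 15.0
    println(normL2) // 7.416198487095663
  }
}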
3. Correlation Coefficients
Contents of x.txt and y.txt, respectively:
1 2 3 4 5
2 4 6 8 10
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.{SparkConf, SparkContext}

object testVector {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("testVector")
    val sc = new SparkContext(conf)
    // Read each file as a flat RDD[Double], one element per number
    val rddX = sc.textFile("x.txt").flatMap(_.split(' ').map(_.toDouble))
    val rddY = sc.textFile("y.txt").flatMap(_.split(' ').map(_.toDouble))
    val correlation: Double = Statistics.corr(rddX, rddY) // Pearson (the default): 1.0
    println(correlation)
    val correlation2: Double = Statistics.corr(rddX, rddY, "spearman") // Spearman: 1.0000000000000009
    println(correlation2)
  }
}
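Because y.txt is exactly 2 times x.txt, both coefficients come out as 1 (the Spearman value differs from 1.0 only by floating-point noise). A hand computation of the Pearson coefficient confirms this; a minimal plain-Scala sketch, no Spark needed (the object name VerifyPearson is illustrative):

object VerifyPearson {
  def main(args: Array[String]): Unit = {
    val x = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val y = x.map(_ * 2) // y.txt is exactly 2 * x.txt
    val mx = x.sum / x.length
    val my = y.sum / y.length
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    println(cov / (sx * sy)) // 1.0: a perfect positive linear relationship
  }
}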
Computing the correlation matrix of a single dataset:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.{SparkConf, SparkContext}

object testVector {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("testVector")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("x.txt")
      .map(_.split(' ').map(_.toDouble))
      .map(line => Vectors.dense(line))
    // Pairwise Spearman correlations between the columns of the dataset
    println(Statistics.corr(rdd, "spearman"))
  }
}

Program output:
1.0                 0.9999999999999998  0.9999999999999998  ... (5 total)
0.9999999999999998  1.0                 0.9999999999999998  ...
0.9999999999999998  0.9999999999999998  1.0                 ...
0.9999999999999998  0.9999999999999998  0.9999999999999998  ...
0.9999999999999998  0.9999999999999998  0.9999999999999998  ...
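Each entry of this 5×5 matrix is the Spearman correlation between one pair of columns of the input; the off-diagonal values of 0.9999999999999998 are 1.0 up to floating-point rounding, meaning every pair of columns is perfectly rank-correlated (which suggests that for this run, x.txt held several rows whose columns share the same rank order).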