Spark MLlib 1.6 -- Basic Statistics

Source: Internet. Editor: 程序博客网. Time: 2024/05/21 17:57

· Summary statistics

· Correlations

· Stratified sampling

· Hypothesis testing

· Streaming Significance Testing

· Random data generation

· Kernel density estimation

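2.1 Summary statistics

The Statistics class provides column summary statistics for an RDD[Vector] through the colStats function.

colStats() returns an instance of MultivariateStatisticalSummary, which holds the column-wise maximum, minimum, mean, variance, number of nonzeros, and L1 norm.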

Scala MultivariateStatisticalSummary API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary

import org.apache.spark.mllib.linalg.Vector

import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

import org.apache.spark.rdd.RDD

val observations: RDD[Vector] = ... // an RDD of Vectors

// Compute column summary statistics.

val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)

println(summary.mean) // a dense vector containing the mean value for each column

println(summary.variance) // column-wise variance

println(summary.numNonzeros) // number of nonzeros in each column
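To make the column-wise quantities concrete, here is a minimal plain-Scala sketch that computes the mean, variance, and nonzero count of each column for a small in-memory data set. It is an illustration only, not how colStats is implemented. The names ColumnStatsSketch, ColumnSummary, and summarize are hypothetical, and the sketch assumes the unbiased n - 1 denominator for the variance; check the API documentation for the convention colStats follows.

```scala
object ColumnStatsSketch {
  // Hypothetical container for the per-column results of this sketch.
  final case class ColumnSummary(
      mean: Array[Double],
      variance: Array[Double],
      numNonzeros: Array[Double])

  // Each element of `rows` is one observation; all rows must have the same length.
  def summarize(rows: Seq[Array[Double]]): ColumnSummary = {
    require(rows.nonEmpty, "need at least one row")
    val n = rows.length
    val numCols = rows.head.length
    require(rows.forall(_.length == numCols), "all rows must have the same length")

    // Transpose: one sequence of values per column.
    val columns = (0 until numCols).map(j => rows.map(_(j)))
    val means = columns.map(_.sum / n)
    // Assumption: unbiased sample variance (n - 1); defined as 0.0 for a single row.
    val variances = columns.zip(means).map { case (col, m) =>
      if (n > 1) col.map(x => (x - m) * (x - m)).sum / (n - 1) else 0.0
    }
    val nonzeros = columns.map(_.count(_ != 0.0).toDouble)
    ColumnSummary(means.toArray, variances.toArray, nonzeros.toArray)
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq(Array(1.0, 0.0), Array(2.0, 4.0), Array(3.0, 0.0))
    val s = summarize(rows)
    println(s.mean.mkString(", "))
    println(s.variance.mkString(", "))
    println(s.numNonzeros.mkString(", "))
  }
}
```

For the three rows in main, the first column has mean 2 and variance 1, and the second column has a single nonzero entry.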

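2.2 Correlations

Calculating the correlation between two series of data is a common operation in statistics. spark.mllib can compute pairwise correlations among many series, and currently supports Pearson's and Spearman's correlation. The result type depends on the input: for two RDD[Double]s the result is a single Double, and for an RDD[Vector] the result is the correlation Matrix between its columns.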

Scala Statistics API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.stat.Statistics

import org.apache.spark.SparkContext

import org.apache.spark.mllib.linalg._

import org.apache.spark.mllib.stat.Statistics

import org.apache.spark.rdd.RDD

val sc: SparkContext = ...

val seriesX: RDD[Double] = ... // a series

val seriesY: RDD[Double] = ... // must have the same number of partitions and cardinality as seriesX

// compute the correlation using Pearson's method. Enter "spearman" for Spearman's method. If a

// method is not specified, Pearson's method will be used by default.

val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")

val data: RDD[Vector] = ... // note that each Vector is a row and not a column

// calculate the correlation matrix using Pearson's method. Use "spearman" for Spearman's method.

// If a method is not specified, Pearson's method will be used by default.

val correlMatrix: Matrix = Statistics.corr(data, "pearson")
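As a reminder of what the "pearson" method measures, the following plain-Scala sketch computes the Pearson correlation coefficient of two local sequences. It is only an illustration under the textbook definition, not Spark's distributed implementation; PearsonSketch and pearson are hypothetical names.

```scala
object PearsonSketch {
  // Pearson correlation coefficient of two equally long series.
  // Returns NaN when either series is constant (zero variance).
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.nonEmpty && xs.length == ys.length,
      "series must be non-empty and of equal length")
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    // Sums of cross-deviations and squared deviations; the 1/n factors cancel.
    val cov = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => (x - meanX) * (x - meanX)).sum
    val varY = ys.map(y => (y - meanY) * (y - meanY)).sum
    cov / math.sqrt(varX * varY)
  }

  def main(args: Array[String]): Unit = {
    val x = Seq(1.0, 2.0, 3.0, 4.0)
    println(pearson(x, Seq(2.0, 4.0, 6.0, 8.0))) // perfectly linear, increasing
    println(pearson(x, Seq(8.0, 6.0, 4.0, 2.0))) // perfectly linear, decreasing
  }
}
```

Spearman's method applies the same formula to the ranks of the values instead of the values themselves.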

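2.3 Stratified sampling

spark.mllib provides two stratified sampling methods on RDDs of key-value pairs: sampleByKey and sampleByKeyExact. In stratified sampling, the keys can be thought of as labels and the values as attributes. For example, the key can be man or woman, or a document ID, and the corresponding value can be the person's age or the list of words in the document. The sampleByKey method flips a coin for each observation to decide whether it is sampled, so it needs only one pass over the data and yields an expected sample size. sampleByKeyExact requires significantly more resources than the per-stratum simple random sampling used by sampleByKey, but provides the exact sample size with 99.99% confidence. sampleByKeyExact is currently not supported in Python.

sampleByKeyExact() samples exactly ceil(f_k * n_k) items for every key k in the key set K, where f_k is the desired fraction for key k and n_k is the number of key-value pairs with key k. Sampling without replacement (withReplacement = false, so a sampled item cannot appear twice) requires one additional pass over the RDD to guarantee the sample size, whereas sampling with replacement requires two additional passes.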

import org.apache.spark.SparkContext

import org.apache.spark.SparkContext._

import org.apache.spark.rdd.PairRDDFunctions

val sc: SparkContext = ...

val data = ... // an RDD[(K, V)] of any key value pairs

val fractions: Map[K, Double] = ... // specify the exact fraction desired from each key

// Get an approximate sample from each stratum

val approxSample = data.sampleByKey(withReplacement = false, fractions)

// Get an exact sample from each stratum

val exactSample = data.sampleByKeyExact(withReplacement = false, fractions)
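The coin-flip behaviour of sampleByKey can be pictured with a small plain-Scala sketch: each pair is kept independently with the probability assigned to its key, so the size of each stratum's sample is only correct in expectation. This is an illustration on a local Seq, not Spark's implementation. SampleByKeySketch and bernoulliSampleByKey are hypothetical names, and treating keys that are missing from the map as fraction 0.0 is a choice made for this sketch.

```scala
import scala.util.Random

object SampleByKeySketch {
  // One pass over the data: keep each pair independently with the probability
  // assigned to its key. Keys missing from `fractions` are never sampled here.
  def bernoulliSampleByKey[K, V](
      data: Seq[(K, V)],
      fractions: Map[K, Double],
      seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    // nextDouble() lies in [0, 1), so fraction 1.0 keeps everything and 0.0 keeps nothing.
    data.filter { case (key, _) => rng.nextDouble() < fractions.getOrElse(key, 0.0) }
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(("male", 31), ("female", 27), ("male", 45), ("female", 52), ("male", 19))
    val fractions = Map("male" -> 0.5, "female" -> 1.0)
    println(bernoulliSampleByKey(data, fractions, seed = 42L))
  }
}
```

Because every decision is independent, the number of "male" pairs returned can differ from run to run when the seed changes, which is exactly the gap that sampleByKeyExact closes at extra cost.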

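2.4 Hypothesis testing

Hypothesis testing is used in statistics to determine whether a result is statistically significant, that is, whether it could have occurred by chance. spark.mllib currently supports Pearson's chi-squared tests for goodness of fit and for independence. The input type determines which test is run: the goodness of fit test requires a Vector as input, whereas the independence test requires a Matrix.

spark.mllib also accepts an RDD[LabeledPoint] as input, which enables feature selection via chi-squared independence tests.

Statistics provides methods to run Pearson's chi-squared tests. The following example shows how to run the tests and read their results.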

import org.apache.spark.SparkContext

import org.apache.spark.mllib.linalg._

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.mllib.stat.Statistics

import org.apache.spark.mllib.stat.test.ChiSqTestResult

import org.apache.spark.rdd.RDD

val sc: SparkContext = ...

val vec: Vector = ... // a vector composed of the frequencies of events

// compute the goodness of fit. If a second vector to test against is not supplied as a parameter,

// the test runs against a uniform distribution.

val goodnessOfFitTestResult = Statistics.chiSqTest(vec)

println(goodnessOfFitTestResult) // summary of the test including the p-value, degrees of freedom,

// test statistic, the method used, and the null hypothesis.

val mat: Matrix = ... // a contingency matrix

// conduct Pearson's independence test on the input contingency matrix

val independenceTestResult = Statistics.chiSqTest(mat)

println(independenceTestResult) // summary of the test including the p-value, degrees of freedom...

val obs: RDD[LabeledPoint] = ... // (feature, label) pairs.

// The contingency table is constructed from the raw (feature, label) pairs and used to conduct

// the independence test. Returns an array containing the ChiSquaredTestResult for every feature

// against the label.

val featureTestResults: Array[ChiSqTestResult] = Statistics.chiSqTest(obs)

var i = 1

featureTestResults.foreach { result =>
  println(s"Column $i:\n$result")
  i += 1
} // summary of the test
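The test statistic behind the goodness of fit test can be computed by hand for small inputs. The sketch below, in plain Scala, evaluates Pearson's chi-squared statistic for observed counts against a uniform expectation, which is the default case described in the comments above. It is an illustration of the formula only and does not compute a p-value; ChiSquaredSketch and goodnessOfFitUniform are hypothetical names.

```scala
object ChiSquaredSketch {
  // Pearson's chi-squared statistic for observed counts against a uniform expectation.
  // Returns (statistic, degrees of freedom); the p-value is not computed here.
  def goodnessOfFitUniform(observed: Seq[Double]): (Double, Int) = {
    require(observed.length > 1, "need at least two categories")
    require(observed.sum > 0.0, "total count must be positive")
    // Under the uniform null hypothesis every category expects the same count.
    val expected = observed.sum / observed.length
    val statistic = observed.map(o => (o - expected) * (o - expected) / expected).sum
    (statistic, observed.length - 1)
  }

  def main(args: Array[String]): Unit = {
    val (statistic, df) = goodnessOfFitUniform(Seq(10.0, 20.0, 30.0))
    println(s"statistic = $statistic, degrees of freedom = $df")
  }
}
```

For the counts 10, 20, 30 the expected count per category is 20, which gives a statistic of 100/20 + 0 + 100/20 = 10 with 2 degrees of freedom.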

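Statistics also provides a 1-sample, 2-sided Kolmogorov-Smirnov (KS) test for equality of probability distributions. By supplying either the name of a theoretical distribution together with its parameters, or a function that computes the cumulative distribution function of a given theoretical distribution, the user can test the null hypothesis that the sample is drawn from that distribution. As a special case, when testing against the normal distribution without supplying its parameters, the test uses the standard normal distribution.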

Scala Statistics API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.stat.Statistics

import org.apache.spark.mllib.stat.Statistics

import org.apache.spark.rdd.RDD

val data: RDD[Double] = ... // an RDD of sample data

// run a KS test for the sample versus a standard normal distribution

val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0, 1)

println(testResult) // summary of the test including the p-value, test statistic,

// and null hypothesis

// if our p-value indicates significance, we can reject the null hypothesis

// perform a KS test using a cumulative distribution function of our making

val myCDF: Double => Double = ...

val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
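The KS test statistic is the largest vertical distance between the empirical CDF of the sample and the theoretical CDF. The plain-Scala sketch below computes that distance for a local sample and a user-supplied CDF, in the spirit of the myCDF variant above. It is an illustration of the statistic only (no p-value), not Spark's implementation; KolmogorovSmirnovSketch and ksStatistic are hypothetical names.

```scala
object KolmogorovSmirnovSketch {
  // One-sample KS statistic: the largest gap between the empirical CDF of the
  // sample and the theoretical CDF, checked just before and at each sorted point.
  def ksStatistic(sample: Seq[Double], cdf: Double => Double): Double = {
    require(sample.nonEmpty, "sample must not be empty")
    val n = sample.length.toDouble
    sample.sorted.zipWithIndex.map { case (x, i) =>
      val f = cdf(x)
      // The empirical CDF jumps from i / n to (i + 1) / n at the i-th sorted point.
      math.max(f - i / n, (i + 1) / n - f)
    }.max
  }

  def main(args: Array[String]): Unit = {
    // CDF of the uniform distribution on [0, 1]
    val uniformCdf: Double => Double = x => math.min(1.0, math.max(0.0, x))
    println(ksStatistic(Seq(0.7, 0.1, 0.4), uniformCdf))
  }
}
```

For the sample 0.1, 0.4, 0.7 against the uniform distribution on [0, 1], the largest gap occurs at 0.7, where the empirical CDF reaches 1 while the theoretical CDF is 0.7, so the statistic is about 0.3.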

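2.4.1 Streaming significance testing

spark.mllib provides online implementations of some tests to support use cases such as A/B testing. These tests run on a Spark Streaming DStream[(Boolean, Double)], where the first element of each tuple indicates the control group (false) or the treatment group (true), and the second element is the observed value.

Streaming significance testing supports the following two parameters:

1 peacePeriod: the number of initial data points from the stream to ignore after start-up.

2 windowSize: the number of past batches used for each hypothesis test. Setting it to 0 performs cumulative processing over all prior batches.

StreamingTest provides streaming hypothesis testing.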

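import org.apache.spark.mllib.stat.test.{BinarySample, StreamingTest}

// Assumes ssc is an existing StreamingContext and dataDir is a directory of
// text files whose lines have the form "label,value".
val data = ssc.textFileStream(dataDir).map(line => line.split(",") match {
  case Array(label, value) => BinarySample(label.toBoolean, value.toDouble)
})

val streamingTest = new StreamingTest()
  .setPeacePeriod(0)
  .setWindowSize(0)
  .setTestMethod("welch")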

val out = streamingTest.registerStream(data)

out.print()

The complete example code is in: examples/src/main/scala/org/apache/spark/examples/mllib/StreamingTestExample.scala
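The "welch" test method used above refers to Welch's t-test, which compares the means of two groups without assuming equal variances. As a rough offline picture of what is being tested, the plain-Scala sketch below computes Welch's t statistic for a control group and a treatment group held in memory. It is not the streaming implementation and does not compute degrees of freedom or a p-value; WelchSketch and welchT are hypothetical names.

```scala
object WelchSketch {
  private def mean(xs: Seq[Double]): Double = xs.sum / xs.length

  // Unbiased sample variance (n - 1 denominator).
  private def variance(xs: Seq[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / (xs.length - 1)
  }

  // Welch's t statistic for two groups that may have unequal variances.
  // Positive values mean the treatment mean exceeds the control mean.
  def welchT(control: Seq[Double], treatment: Seq[Double]): Double = {
    require(control.length > 1 && treatment.length > 1,
      "each group needs at least two values")
    val standardError = math.sqrt(
      variance(control) / control.length + variance(treatment) / treatment.length)
    (mean(treatment) - mean(control)) / standardError
  }

  def main(args: Array[String]): Unit = {
    val control = Seq(1.0, 2.0, 3.0)
    val treatment = Seq(2.0, 4.0, 6.0)
    println(welchT(control, treatment))
  }
}
```

In the example the means are 2 and 4 and the sample variances are 1 and 4, so the statistic is 2 / sqrt(1/3 + 4/3), roughly 1.549.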

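2.5 Random data generation

Random data generation is useful for randomized algorithms, prototyping, and performance testing. spark.mllib can generate random RDDs whose values are i.i.d. draws from a given distribution: uniform, standard normal, or Poisson.

RandomRDDs provides factory methods that generate random double RDDs and random vector RDDs. The following example generates a random double RDD whose values follow the standard normal distribution N(0, 1), then shifts and scales it to follow N(1, 4).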

Scala RandomRDD API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs

import org.apache.spark.SparkContext

import org.apache.spark.mllib.random.RandomRDDs._

val sc: SparkContext = ...

// Generate a random double RDD that contains 1 million i.i.d. values drawn from the

// standard normal distribution `N(0, 1)`, evenly distributed in 10 partitions.

val u = normalRDD(sc, 1000000L, 10)

// Apply a transform to get a random double RDD following `N(1, 4)`.

val v = u.map(x => 1.0 + 2.0 * x)
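The factor 2.0 in the transform can look inconsistent with the 4 in N(1, 4). Assuming the notation N(mean, variance), which is how the comment appears to be written, the short derivation below shows why scaling by 2 yields a variance of 4:

```latex
X \sim N(0, 1), \quad Y = 1 + 2X
\;\Longrightarrow\;
\mathbb{E}[Y] = 1 + 2\,\mathbb{E}[X] = 1, \qquad
\operatorname{Var}(Y) = 2^{2}\,\operatorname{Var}(X) = 4, \qquad
Y \sim N(1, 4).
```

In general, mapping x to a + b * x turns N(0, 1) into a normal distribution with mean a and standard deviation |b|.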

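2.6 Kernel density estimation

Kernel density estimation is a technique for visualizing empirical probability distributions without assuming that the observed samples come from any particular distribution. It estimates the probability density function (PDF) of a random variable, evaluated at a given set of points. The estimate at a particular point is the mean of the PDFs of normal distributions centered on each of the samples.

KernelDensity provides methods to compute kernel density estimates from an RDD of samples, as in the following example:

Scala KernelDensity API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity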

import org.apache.spark.mllib.stat.KernelDensity

import org.apache.spark.rdd.RDD

val data: RDD[Double] = ... // an RDD of sample data

// Construct the density estimator with the sample data and a standard deviation for the Gaussian

// kernels

val kd = new KernelDensity()
  .setSample(data)
  .setBandwidth(3.0)

// Find density estimates for the given values

val densities = kd.estimate(Array(-1.0, 2.0, 5.0))
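The "mean of normal PDFs centered on the samples" can be written out directly for a small local sample. The plain-Scala sketch below does so, treating the bandwidth as the standard deviation of each Gaussian kernel, as the comment in the example describes. It is a minimal illustration of the formula, not Spark's distributed implementation; KernelDensitySketch and its estimate method are hypothetical names.

```scala
object KernelDensitySketch {
  // Density of N(0, 1) at z.
  private def standardNormalPdf(z: Double): Double =
    math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.Pi)

  // Gaussian kernel density estimate at each evaluation point: the mean of normal
  // PDFs centered on the samples, each with standard deviation `bandwidth`.
  def estimate(samples: Seq[Double], bandwidth: Double, points: Seq[Double]): Seq[Double] = {
    require(samples.nonEmpty && bandwidth > 0.0, "need samples and a positive bandwidth")
    points.map { p =>
      val kernelSum = samples.map(s => standardNormalPdf((p - s) / bandwidth)).sum
      kernelSum / (samples.length * bandwidth)
    }
  }

  def main(args: Array[String]): Unit = {
    val samples = Seq(-2.0, 0.0, 1.0, 4.0)
    println(estimate(samples, bandwidth = 3.0, points = Seq(-1.0, 2.0, 5.0)))
  }
}
```

A larger bandwidth spreads each sample's contribution more widely and gives a smoother estimate; a smaller one follows the individual samples more closely.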

0 0