Spark MLlib 1.6: Classification and Regression

Source: Internet | Published by: 画框图软件 | Editor: 程序博客网 | Time: 2024/06/04 19:08

· Linear models

· classification (SVMs, logistic regression)

· linear regression (least squares, Lasso, ridge)

· Decision trees

· Ensembles of decision trees

· random forests

· gradient-boosted trees

· Naive Bayes

· Isotonic regression

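spark.mllib supports the following ML problems: binary classification, multiclass classification, and regression analysis.

The table below lists the supported algorithms for each problem type: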

Problem Type: Supported Methods

Binary Classification: linear SVMs, logistic regression, decision trees, random forests, gradient-boosted trees, naive Bayes

Multiclass Classification: logistic regression, decision trees, random forests, naive Bayes

Regression: linear least squares, Lasso, ridge regression, decision trees, random forests, gradient-boosted trees, isotonic regression

3.1 Linear models (spark.mllib)

· Mathematical formulation

o Loss functions

o Regularizers

o Optimization

· Classification

o Linear Support Vector Machines (SVMs)

o Logistic regression

· Regression

o Linear least squares, Lasso, and ridge regression

o Streaming linear regression

· Implementation (developer)

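3.1.1 Mathematical formulation

Many standard machine learning methods can be formulated as a convex optimization problem, i.e. the task of finding a minimizer of a convex function f that depends on a vector w with d entries (called the weight vector). Formally, the problem is

\min_{w \in \mathbb{R}^d} f(w), where the objective function f has the form:

f(w) := \lambda R(w) + \frac{1}{n} \sum_{i=1}^{n} L(w; x_i, y_i)    (1)

Here the vectors x_i \in \mathbb{R}^d are the training examples, for 1 \le i \le n, and y_i \in \mathbb{R} are their corresponding labels, which are the values to be predicted. A method is called linear if

L(w; x, y) can be expressed as a function of w^T x and y. The linear methods in spark.mllib that fall into this category are discussed below.

The objective function f has two parts: the regularizer, which controls the complexity of the model, and the loss, which measures the error of the model on the training data. The loss function L(w; ·, ·) is typically convex in w. The regularization parameter \lambda \ge 0 (regParam in the code) defines the trade-off between the two goals of minimizing the training error and minimizing the model complexity (to avoid overfitting).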

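3.1.1.1 Loss functions

The loss functions supported by spark.mllib, together with their gradients or sub-gradients, are the hinge loss, the logistic loss, and the squared loss.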
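For reference, the three losses can be written out as follows. This is a sketch reconstructed from the loss definitions used later in this section, assuming labels y \in \{-1, +1\} for the two classification losses; the gradients or sub-gradients are taken with respect to w.

```latex
\begin{array}{lll}
\text{loss} & L(w; x, y) & \text{gradient or sub-gradient} \\
\text{hinge} & \max\{0,\ 1 - y\, w^T x\} &
  \begin{cases} -y\, x & \text{if } y\, w^T x < 1 \\ 0 & \text{otherwise} \end{cases} \\
\text{logistic} & \log\left(1 + \exp(-y\, w^T x)\right) &
  -y \left(1 - \frac{1}{1 + \exp(-y\, w^T x)}\right) x \\
\text{squared} & \frac{1}{2} \left(w^T x - y\right)^2 & \left(w^T x - y\right) x
\end{array}
```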

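3.1.1.2 Regularizers

The purpose of the regularizer is to encourage simple models and to avoid overfitting. spark.mllib supports the following regularizers: none, L2, L1, and elastic net.

In their sub-gradients, sign(w) is the vector whose entries are the signs (±1) of the corresponding entries of w.

L2-regularized problems are generally easier to solve than L1-regularized ones, because the L2 regularizer is smooth while the L1 regularizer is not. However, L1 regularization promotes sparsity in the weights, which leads to smaller and more interpretable models and is useful for feature selection. Elastic net is a combination of L1 and L2 regularization. Training a model without any regularization is not recommended, especially when the number of training examples is small.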
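The regularizers discussed above can be summarized as follows. This is a sketch, not a quotation from the library; the mixing parameter \alpha \in [0, 1] of the elastic net is notation introduced here for illustration.

```latex
\begin{array}{lll}
\text{regularizer} & R(w) & \text{gradient or sub-gradient} \\
\text{none} & 0 & 0 \\
\text{L2} & \frac{1}{2} \|w\|_2^2 & w \\
\text{L1} & \|w\|_1 & \mathrm{sign}(w) \\
\text{elastic net} & \alpha \|w\|_1 + (1 - \alpha)\, \frac{1}{2} \|w\|_2^2 &
  \alpha\, \mathrm{sign}(w) + (1 - \alpha)\, w
\end{array}
```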

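3.1.1.3 Optimization

Under the hood, linear methods use convex optimization methods to optimize the objective function. spark.mllib uses two methods, SGD and L-BFGS (see the optimization section). Currently, most algorithm APIs support stochastic gradient descent (SGD), and a few support L-BFGS.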

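3.1.2 Classification

Classification aims to divide items into categories. The most common case is binary classification, where there are two categories (positive and negative). If there are more than two categories, the problem is called multiclass classification. spark.mllib supports two linear methods for classification: linear Support Vector Machines (SVMs) and logistic regression. Linear SVMs support only binary classification, while logistic regression supports both binary and multiclass classification. Both methods support L1 and L2 regularization. In MLlib the training set is an RDD[LabeledPoint], and labels are class indices 0, 1, 2, …. Note that in the mathematical formulation a binary label is written as +1 (positive) or -1 (negative); in spark.mllib the negative class is represented by 0 rather than -1, to be consistent with multiclass labeling.

3.1.2.1 Linear Support Vector Machines (SVMs)

The linear SVM is a standard method for large-scale classification tasks. It is a linear method as described in equation (1) above, with the loss function given by the hinge loss:

L(w; x, y) := \max\{0, 1 - y w^T x\}

By default, linear SVMs are trained with L2 regularization. L1 regularization is also supported; in that case the problem becomes a linear program.

The linear SVM algorithm outputs an SVM model. Given a new data point, denoted by x, the model makes its prediction based on the value of w^T x. By default, if w^T x \ge 0 the outcome is positive, and negative otherwise.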
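The hinge loss and the default decision rule can be sketched in plain Scala, without Spark. This is only an illustration: the object name `LinearSvmSketch` and its helper functions are hypothetical and are not part of the spark.mllib API.

```scala
// Illustrative sketch only; LinearSvmSketch is a hypothetical name, not a Spark class.
object LinearSvmSketch {
  /** Dot product w^T x of two vectors of equal length. */
  def dot(w: Array[Double], x: Array[Double]): Double = {
    require(w.length == x.length, "w and x must have the same dimension")
    w.zip(x).map { case (wi, xi) => wi * xi }.sum
  }

  /** Hinge loss max{0, 1 - y w^T x}; here y is +1.0 or -1.0 as in the formulation. */
  def hingeLoss(w: Array[Double], x: Array[Double], y: Double): Double =
    math.max(0.0, 1.0 - y * dot(w, x))

  /** Default rule: positive (1.0) if w^T x >= 0, negative (0.0) otherwise. */
  def predict(w: Array[Double], x: Array[Double]): Double =
    if (dot(w, x) >= 0.0) 1.0 else 0.0
}
```

For example, with w = (0.5, -1.0) and x = (1.0, 0.0) the margin w^T x is 0.5, so the point is predicted positive, and its hinge loss for a true label of +1 is 0.5 because the margin is still smaller than 1.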

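Example:

The following example shows how to load a sample dataset, split it into training and test sets, train the model, and evaluate its predictions on the test set using the area under the ROC curve.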

Scala SVMWithSGD API : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMWithSGD

Scala SVMModel API :

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.SVMModel

import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).

val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)

val training = splits(0).cache()

val test = splits(1)

// Run training algorithm to build the model

val numIterations = 100

val model = SVMWithSGD.train(training, numIterations)

// Clear the default threshold.

model.clearThreshold()

// Compute raw scores on the test set.

val scoreAndLabels = test.map { point =>

val score = model.predict(point.features)

(score, point.label)

}

// Get evaluation metrics.

val metrics = new BinaryClassificationMetrics(scoreAndLabels)

val auROC = metrics.areaUnderROC()

println("Area under ROC = " + auROC)

// Save and load model

model.save(sc, "myModelPath")

val sameModel = SVMModel.load(sc, "myModelPath")

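The SVMWithSGD.train() method by default performs L2 regularization with the regularization parameter set to 1.0. To change these defaults, create a new SVMWithSGD instance and configure it through its setter methods. The other spark.mllib algorithms can be customized in the same way. For example, the following code produces an L1-regularized SVM with the regularization parameter set to 0.1 and runs the training algorithm for 200 iterations.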

import org.apache.spark.mllib.optimization.L1Updater

val svmAlg = new SVMWithSGD()

svmAlg.optimizer.

setNumIterations(200).

setRegParam(0.1).

setUpdater(new L1Updater)

val modelL1 = svmAlg.run(training)

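3.1.2.2 Logistic regression

Logistic regression is widely used to predict a binary response. It is a linear method as described in equation (1), with the loss function given by the logistic loss:

L(w; x, y) := \log(1 + \exp(-y w^T x))

For binary classification problems, the algorithm outputs a binary logistic regression model. Given a new data point, denoted by x, the model makes its prediction by applying the logistic function

f(z) = \frac{1}{1 + e^{-z}}

where z = w^T x. By default, if f(w^T x) > 0.5 the outcome is positive, and negative otherwise. Unlike linear SVMs, the raw output f(z) of the logistic regression model has a probabilistic interpretation: it is the probability that x is positive.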
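The prediction rule above can be sketched in plain Scala as follows. The object `LogisticSketch` and its function names are hypothetical and chosen for illustration; they are not the spark.mllib implementation.

```scala
// Illustrative sketch only; LogisticSketch is a hypothetical name, not a Spark class.
object LogisticSketch {
  /** Logistic function f(z) = 1 / (1 + e^{-z}). */
  def logistic(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  /** Probability that x is positive: f(w^T x). */
  def probability(w: Array[Double], x: Array[Double]): Double = {
    require(w.length == x.length, "w and x must have the same dimension")
    logistic(w.zip(x).map { case (wi, xi) => wi * xi }.sum)
  }

  /** Positive (1.0) if f(w^T x) exceeds the threshold, negative (0.0) otherwise. */
  def predict(w: Array[Double], x: Array[Double], threshold: Double = 0.5): Double =
    if (probability(w, x) > threshold) 1.0 else 0.0
}
```

Because the comparison is strict, a point with w^T x = 0 has probability exactly 0.5 and is assigned to the negative class under the default threshold.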

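Binary logistic regression can be generalized to multinomial logistic regression to handle multiclass classification problems. For example, for K possible outcomes, one outcome is chosen as a "pivot" and the other K-1 outcomes are each regressed against the pivot. In spark.mllib the pivot is class 0. For details, see The Elements of Statistical Learning: http://statweb.stanford.edu/~tibs/ElemStatLearn/

For multiclass classification problems, the algorithm outputs a multinomial logistic regression model that contains K-1 binary logistic regression models regressed against class 0. Given a new data point, the K-1 models are evaluated and the class with the largest probability is chosen as the prediction.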
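A minimal sketch of the pivot-based prediction, in plain Scala. It assumes that `weights(k - 1)` holds the weight vector of class k (for k = 1, …, K-1) regressed against pivot class 0; this layout and the name `MultinomialLogisticSketch` are assumptions made for illustration and do not describe how spark.mllib stores its model.

```scala
// Illustrative sketch only; the weight layout is an assumption, not the Spark model format.
object MultinomialLogisticSketch {
  /** Margins per class: 0 for the pivot class 0, w_k^T x for class k = 1..K-1. */
  def margins(weights: Array[Array[Double]], x: Array[Double]): Array[Double] =
    0.0 +: weights.map(w => w.zip(x).map { case (wi, xi) => wi * xi }.sum)

  /** Class probabilities: exp(margin_k) divided by the sum over all classes. */
  def probabilities(weights: Array[Array[Double]], x: Array[Double]): Array[Double] = {
    val exps = margins(weights, x).map(m => math.exp(m))
    val total = exps.sum
    exps.map(_ / total)
  }

  /** Index of the class with the largest probability (equivalently, largest margin). */
  def predict(weights: Array[Array[Double]], x: Array[Double]): Int = {
    val m = margins(weights, x)
    m.indexOf(m.max)
  }
}
```

Since the exponential is monotonic, comparing margins is enough to pick the most probable class; the probabilities are only needed when a calibrated score is wanted.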

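Two algorithms are implemented for solving logistic regression: mini-batch gradient descent and L-BFGS. L-BFGS is recommended over mini-batch gradient descent because it converges faster.

Example:

The following example shows how to load a sample multiclass dataset, split it into training and test sets, and fit a logistic regression model with LogisticRegressionWithLBFGS. The model is then evaluated against the test dataset.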

Scala LogisticRegressionWithLBFGS API : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

Scala LogisticRegressionModel API :

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel

import org.apache.spark.SparkContext

import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel}

import org.apache.spark.mllib.evaluation.MulticlassMetrics

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).

val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)

val training = splits(0).cache()

val test = splits(1)

// Run training algorithm to build the model

val model = new LogisticRegressionWithLBFGS()

.setNumClasses(10)

.run(training)

// Compute raw scores on the test set.

val predictionAndLabels = test.map { case LabeledPoint(label, features) =>

val prediction = model.predict(features)

(prediction, label)

}

// Get evaluation metrics.

val metrics = new MulticlassMetrics(predictionAndLabels)

val precision = metrics.precision

println("Precision = " + precision)

// Save and load model

model.save(sc, "myModelPath")

val sameModel = LogisticRegressionModel.load(sc, "myModelPath")

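3.1.3 Regression

3.1.3.1 Linear least squares, Lasso, and ridge regression

Linear least squares is the most common formulation for regression problems. It is a linear method as described in equation (1), with the loss function given by the squared loss:

L(w; x, y) := \frac{1}{2} (w^T x - y)^2

Different regularizers give different related regression methods: ordinary least squares (linear least squares) uses no regularization; ridge regression uses L2 regularization; Lasso uses L1 regularization. For all of these models, the training error \frac{1}{n} \sum_{i=1}^{n} (w^T x_i - y_i)^2 is known as the mean squared error.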
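The relationship between the three methods can be sketched in plain Scala: the same squared loss is combined with a different regularizer R(w) in objective (1). The names below (`LeastSquaresSketch`, `objective`, `l1`, `l2`, `none`) are hypothetical and used only for illustration.

```scala
// Illustrative sketch only; these names are hypothetical, not spark.mllib classes.
object LeastSquaresSketch {
  type Example = (Array[Double], Double) // (features x, label y)

  /** Linear prediction w^T x. */
  def predict(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum

  /** Mean squared error: (1/n) * sum of (w^T x_i - y_i)^2. */
  def meanSquaredError(w: Array[Double], data: Seq[Example]): Double =
    data.map { case (x, y) => math.pow(predict(w, x) - y, 2) }.sum / data.size

  /** Objective (1): regParam * R(w) + average squared loss (with the 1/2 factor). */
  def objective(w: Array[Double], data: Seq[Example], regParam: Double,
                regularizer: Array[Double] => Double): Double = {
    val avgLoss = data.map { case (x, y) => 0.5 * math.pow(predict(w, x) - y, 2) }.sum / data.size
    regParam * regularizer(w) + avgLoss
  }

  val none: Array[Double] => Double = _ => 0.0                          // ordinary least squares
  val l2: Array[Double] => Double = w => 0.5 * w.map(v => v * v).sum    // ridge regression
  val l1: Array[Double] => Double = w => w.map(v => math.abs(v)).sum    // Lasso
}
```

Passing `none`, `l2`, or `l1` as the regularizer gives the objectives of ordinary least squares, ridge regression, and Lasso respectively.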

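Example:

The following example shows how to load training data and parse it into an RDD of LabeledPoint. It then uses LinearRegressionWithSGD to build a linear model that predicts label values, and finally computes the mean squared error to evaluate goodness of fit.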

Scala LinearRegressionWithSGD API : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD

Scala LinearRegressionModel API : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.regression.LinearRegressionModel

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.mllib.regression.LinearRegressionModel

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

import org.apache.spark.mllib.linalg.Vectors

// Load and parse the data

val data = sc.textFile("data/mllib/ridge-data/lpsa.data")

val parsedData = data.map { line =>

val parts = line.split(',')

LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))

}.cache()

// Building the model

val numIterations = 100

val model = LinearRegressionWithSGD.train(parsedData, numIterations)

// Evaluate model on training examples and compute training error

val valuesAndPreds = parsedData.map { point =>

val prediction = model.predict(point.features)

(point.label, prediction)

}

val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()

println("training Mean Squared Error = " + MSE)

// Save and load model

model.save(sc, "myModelPath")

val sameModel = LinearRegressionModel.load(sc, "myModelPath")

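RidgeRegressionWithSGD and LassoWithSGD can be used in the same way as LinearRegressionWithSGD.

To run the code above as an application, follow the Self-Contained Applications section of the Spark quick-start guide (https://spark.apache.org/docs/latest/quick-start.html#self-contained-applications).

3.1.3.2 Streaming linear regression

When data arrive as a stream, it is useful to fit the regression model online, updating the model parameters as each new batch arrives. spark.mllib currently supports streaming linear regression with ordinary least squares. The fitting is the same as in the offline case, except that it is performed on each batch of data, so the model is continually updated to reflect the data in the stream.

Example

The following example shows how to create a training stream and a test stream from text files, parse the streams as labeled points, fit a linear regression model online on the first stream, and make predictions on the second stream.

First, import the classes needed for parsing the input data and creating the model.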

import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

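Then create the input streams for training and test data. We assume a StreamingContext ssc has already been created; see the Spark Streaming Programming Guide (https://spark.apache.org/docs/latest/streaming-programming-guide.html#initializing).

In this example both the training and the test data are labeled points; in practice the test data will usually be unlabeled vectors.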

val trainingData = ssc.textFileStream("/training/data/dir").map(LabeledPoint.parse).cache()

val testData = ssc.textFileStream("/testing/data/dir").map(LabeledPoint.parse)

Create the model with its weights initialized to 0.

val numFeatures = 3

val model = new StreamingLinearRegressionWithSGD()

.setInitialWeights(Vectors.zeros(numFeatures))

Register the training and test streams, start the job, and print the predictions.

model.trainOn(trainingData)

model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()

ssc.awaitTermination()

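Text files with data can now be saved to the training and testing directories. Each line should be a data point formatted as (y,[x1,x2,x3]), where y is the label and x1, x2, x3 are the features. Whenever a text file is placed in /training/data/dir the model is updated, and whenever a text file is placed in /testing/data/dir predictions are produced for it. The more data is fed to the training directory, the better the predictions become.

3.1.4 Implementation (developer)

Behind the scenes, spark.mllib implements a simple distributed version of stochastic gradient descent (SGD), building on the underlying gradient descent primitive. All provided algorithms take a regularization parameter (regParam) as input, along with various SGD parameters (stepSize, numIterations, miniBatchFraction). For each of them, all three possible regularizations are supported (none, L1, or L2).

For logistic regression, the L-BFGS version is implemented in the LogisticRegressionWithLBFGS class. This version supports both binary and multinomial logistic regression, while the SGD version supports only binary logistic regression. However, the L-BFGS version does not support L1 regularization, whereas the SGD version does. When L1 regularization is not required, the L-BFGS version is strongly recommended, because it converges faster and more accurately than SGD by approximating the inverse Hessian matrix with a quasi-Newton method.

The algorithms are all implemented in Scala: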

· SVMWithSGD

· LogisticRegressionWithLBFGS

· LogisticRegressionWithSGD

· LinearRegressionWithSGD

· RidgeRegressionWithSGD

· LassoWithSGD

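Python calls the Scala implementation through PythonMLLibAPI.

3.2 Naive Bayes

Naive Bayes is a simple multiclass classification algorithm that assumes independence between every pair of features, and it can be trained very efficiently. In a single pass over the training data it computes the conditional probability distribution of each feature given the label. To make a prediction, it applies Bayes' theorem to compute the conditional probability of each label given the observation and assigns the observation to the label with the largest probability.

spark.mllib supports multinomial naive Bayes and Bernoulli naive Bayes. These models are typically used for document classification. In that context, each observation is a document and each feature represents a term, whose value is the frequency of the term in the document (multinomial naive Bayes) or a 0/1 indicator of whether the term occurs in the document (Bernoulli naive Bayes). Feature values must be nonnegative. The model type is selected with the optional parameter "multinomial" or "bernoulli", with "multinomial" as the default. Additive smoothing is controlled by the parameter λ (default 1.0). In document classification the input feature vectors are usually sparse, and supplying sparse vectors saves memory and network I/O. Since the training data is used only once, it does not need to be cached.

3.2.1 Example

NaiveBayes implements multinomial naive Bayes. It takes an RDD[LabeledPoint], an optional smoothing parameter lambda, and an optional model type as input, and outputs a NaiveBayesModel that can be used for evaluation and prediction.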

Scala NaiveBayes API : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayes

Scala NaiveBayesModel API :

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayesModel

import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}

import org.apache.spark.mllib.linalg.Vectors

import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("data/mllib/sample_naive_bayes_data.txt")

val parsedData = data.map { line =>

val parts = line.split(',')

LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))

}

// Split data into training (60%) and test (40%).

val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)

val training = splits(0)

val test = splits(1)

val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))

val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()

// Save and load model

model.save(sc, "target/tmp/myNaiveBayesModel")

val sameModel = NaiveBayesModel.load(sc, "target/tmp/myNaiveBayesModel")

3.3 Decision trees

· Basic algorithm

· Node impurity and information gain

· Split candidates

· Stopping rule

· Usage tips

· Problem specification parameters

· Stopping criteria

· Tunable parameters

· Caching and checkpointing

· Scaling

· Examples

· Classification

· Regression

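Decision trees are popular methods for the machine learning tasks of classification and regression. They are widely used for the following reasons:

1 They handle categorical features and are easy to interpret

2 They extend naturally to the multiclass setting

3 They do not require feature scaling

4 They can capture non-linearities

5 They can capture feature interactions

Tree ensemble algorithms such as random forests and boosting are among the top performers for classification and regression tasks.

spark.mllib supports decision trees for binary and multiclass classification and for regression, using both continuous and categorical features. The implementation partitions the data by rows, which allows distributed training with millions of instances.

Random forests and gradient-boosted trees are described in the Ensembles guide (https://spark.apache.org/docs/latest/mllib-ensembles.html)

3.3.1 Basic algorithm

The decision tree is a greedy algorithm that performs a recursive binary partitioning of the feature space. The tree predicts the same label for every instance that falls in the same leaf partition. Each partition is chosen greedily by selecting the best split from a set of possible splits, where the best split is the one that maximizes the information gain at the tree node. In other words, the split chosen at each tree node is \arg\max_{s} IG(D, s), where IG(D, s) is the information gain obtained when the split s is applied to the dataset D.

3.3.1.1 Node impurity and information gain

The node impurity is a measure of the homogeneity of the labels at the node. The current implementation provides two impurity measures for classification (Gini impurity and entropy) and one impurity measure for regression (variance).

The information gain is the difference between the impurity of the parent node and the weighted sum of the impurities of the two child nodes. Assume that a split s partitions the dataset D of size N into two datasets D_1 and D_2 of sizes N_1 and N_2, respectively.

The information gain is:

IG(D, s) = Impurity(D) - \frac{N_1}{N} Impurity(D_1) - \frac{N_2}{N} Impurity(D_2)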
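The information gain formula can be checked with a small plain-Scala sketch. It uses the Gini impurity, computed as the sum over classes of p_c (1 - p_c); the object and function names are hypothetical and are not the spark.mllib implementation.

```scala
// Illustrative sketch only; InformationGainSketch is a hypothetical name, not a Spark class.
object InformationGainSketch {
  /** Gini impurity of a set of class labels: sum over classes of p_c * (1 - p_c). */
  def gini(labels: Seq[Int]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map { group =>
      val p = group.size / n
      p * (1.0 - p)
    }.sum
  }

  /** IG(D, s) = Impurity(D) - N1/N * Impurity(D1) - N2/N * Impurity(D2). */
  def informationGain(parent: Seq[Int], left: Seq[Int], right: Seq[Int],
                      impurity: Seq[Int] => Double = gini): Double = {
    val n = parent.size.toDouble
    impurity(parent) - left.size / n * impurity(left) - right.size / n * impurity(right)
  }
}
```

A split that separates the labels (0, 0, 1, 1) into (0, 0) and (1, 1) removes all impurity and gains the full parent impurity of 0.5, whereas a split into (0, 1) and (0, 1) leaves the children as mixed as the parent and gains nothing.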

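3.3.1.2 Split candidates

3.3.1.2.1 Continuous features

For small datasets in single-machine implementations, the split candidates for a continuous feature are typically the unique values of that feature. Some implementations sort the feature values and then use the ordered unique values as split candidates to speed up the tree computation.

Sorting feature values is expensive for large distributed datasets. This implementation therefore computes an approximate set of split candidates by performing a quantile calculation over a sampled fraction of the data. The ordered splits create "bins", and the maximum number of such bins can be specified with the maxBins parameter.

Note that the number of bins cannot be greater than the number of instances N (a rare scenario, since the default maxBins value is 32). The tree algorithm automatically reduces the number of bins if this condition is not satisfied.

3.3.1.2.2 Categorical features

For a categorical feature with M possible values (categories), there are 2^{M-1} - 1 possible split candidates. For binary classification and regression, the number of split candidates can be reduced to M-1 by ordering the categories by their average label. For example, consider a binary classification problem with one categorical feature with three categories A, B and C, whose proportions of label 1 are 0.2, 0.6 and 0.4 respectively. The categories are then ordered as A, C, B, and the two split candidates are A | C, B and A, C | B, where | denotes the split.

In multiclass classification, all 2^{M-1} - 1 possible splits are used whenever possible. When 2^{M-1} - 1 is greater than the maxBins parameter, a heuristic similar to the one for binary classification and regression is used: the M categories are ordered by impurity, and the resulting M-1 split candidates are considered.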
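The ordering trick for binary classification and regression can be sketched in plain Scala using the A, B, C example above. The name `CategoricalSplitSketch` and its input format (a map from category to average label) are assumptions made for illustration.

```scala
// Illustrative sketch only; CategoricalSplitSketch is a hypothetical name, not a Spark class.
object CategoricalSplitSketch {
  /**
   * Orders the categories by average label and returns the M - 1 candidate
   * splits, each as a pair (categories on the left, categories on the right).
   */
  def splitCandidates(avgLabel: Map[String, Double]): Seq[(Seq[String], Seq[String])] = {
    val ordered = avgLabel.toSeq.sortBy(_._2).map(_._1)
    (1 until ordered.size).map(i => ordered.splitAt(i))
  }
}
```

Called with `Map("A" -> 0.2, "B" -> 0.6, "C" -> 0.4)`, it yields the two candidates A | C, B and A, C | B instead of the 2^{3-1} - 1 = 3 unordered splits.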

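3.3.1.3 Stopping rule

The recursive tree construction stops at a node when any one of the following conditions is met:

1 The node depth is equal to the maxDepth training parameter.

2 No split candidate leads to an information gain greater than minInfoGain.

3 No split candidate produces child nodes that each have at least minInstancesPerNode training instances.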

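3.3.2 Usage tips

This section discusses the various decision tree parameters. New users should mainly consider the "Problem specification parameters" section and the maxDepth parameter.

3.3.2.1 Problem specification parameters

These parameters describe the problem and the dataset. They should be specified and do not require tuning.

1) algo: Classification or Regression

2) numClasses: the number of classes (for classification only)

3) categoricalFeaturesInfo: specifies which features are categorical and how many categorical values each of those features can take. It is given as a map whose keys are feature indices (starting from 0) and whose values are the number of categories of that feature (category values also start from 0). Any feature not in this map is treated as continuous.

i> For example, Map(0 -> 2, 4 -> 10) specifies that feature 0 is binary (taking values 0 or 1) and that feature 4 has 10 categories (values 0, 1, …, 9). Note that both feature indices and category values are 0-based.

ii> Note that categoricalFeaturesInfo does not have to be specified. The algorithm will still run, but for performance reasons it is better to designate categorical features properly.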

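3.3.2.2 Stopping criteria

These parameters determine when the tree stops building (adding new nodes). Tune them carefully, validating on held-out data, to avoid overfitting.

maxDepth: Maximum depth of the tree. Deeper trees are more expressive, but they are also more costly to train and more likely to overfit.

minInstancesPerNode: For a node to be split further, each of its children must receive at least this number of training instances. This is commonly used with random forests, since their trees are often trained deeper than individual trees.

minInfoGain: For a node to be split further, the split must improve the information gain by at least this much.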

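3.3.2.3 Tunable parameters

The following parameters may be tuned.

maxBins: Number of bins used when discretizing continuous features.

i> Increasing maxBins allows the algorithm to consider more split candidates and make finer-grained split decisions, but it also increases computation and communication.

ii> This value must be at least the maximum number of categories of any categorical feature.

maxMemoryInMB: Amount of memory to be used for collecting sufficient statistics.

i> The default value is conservatively chosen to be 256 MB so that the decision tree algorithm works in most scenarios.

Increasing maxMemoryInMB can lead to faster training (if the memory is available) by allowing fewer passes over the data. However, there may be diminishing returns as maxMemoryInMB grows, since the amount of communication on each iteration can be proportional to maxMemoryInMB.

Implementation details: For faster processing, the decision tree algorithm collects statistics about groups of nodes to split, rather than one node at a time. The number of nodes that can be handled in one group is determined by the memory requirements. The maxMemoryInMB parameter specifies the memory limit, in megabytes, that each worker can use for these statistics.

subsamplingRate: Fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (random forests and gradient-boosted trees), where subsampling the original data can be useful. For a single decision tree it is less useful, since the number of training instances is generally not the main constraint.

impurity: Impurity measure used to choose between candidate splits. It must match the algo parameter.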

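3.3.2.4 Caching and checkpointing

MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles. When maxDepth is large, it can be useful to turn on node ID caching and checkpointing. These parameters are also useful for RandomForest when numTrees is large.

useNodeIdCache: If this is set to true, the algorithm avoids passing the current model (tree or trees) to executors on each iteration.

This can be useful with deep trees (speeding up computation on workers) and for large random forests (reducing communication on each iteration).

Implementation details: By default, the algorithm communicates the current model to the executors so that they can match training instances with tree nodes. When this setting is turned on, the algorithm caches this information instead.

Node ID caching generates a sequence of RDDs (one per iteration). This long lineage can cause performance problems, but checkpointing intermediate RDDs can alleviate them. Note that checkpointing is only applicable when useNodeIdCache is set to true.

checkpointDir: Directory for checkpointing node ID cache RDDs.

checkpointInterval: Frequency for checkpointing node ID cache RDDs. Setting this too low causes extra overhead from writing to HDFS; setting it too high can cause problems if executors fail and the RDD has to be recomputed.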

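3.3.3 Scaling

Computation scales approximately linearly in the number of training instances, the number of features, and the maxBins parameter. Communication scales approximately linearly in the number of features and in maxBins.

The algorithm reads both sparse and dense data. However, it is not optimized for sparse input.

3.3.4 Examples

3.3.4.1 Classification

The example below shows how to load a LIBSVM data file, parse it as an RDD[LabeledPoint], and perform classification using a decision tree with Gini impurity and a maximum tree depth of 5. The test error is computed to measure the accuracy of the algorithm.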

Scala DecisionTree API : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree

Scala DecisionTreeModel API : https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.tree.model.DecisionTreeModel

import org.apache.spark.mllib.tree.DecisionTree

import org.apache.spark.mllib.tree.model.DecisionTreeModel

import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split the data into training and test sets (30% held out for testing)

val splits = data.randomSplit(Array(0.7, 0.3))

val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.

// Empty categoricalFeaturesInfo indicates all features are continuous.

val numClasses = 2

val categoricalFeaturesInfo = Map[Int, Int]()

val impurity = "gini"

val maxDepth = 5

val maxBins = 32

val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,

impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error

val labelAndPreds = testData.map { point =>

val prediction = model.predict(point.features)

(point.label, prediction)

}

val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()

println("Test Error = " + testErr)

println("Learned classification tree model:\n" + model.toDebugString)

// Save and load model

model.save(sc, "target/tmp/myDecisionTreeClassificationModel")

val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeClassificationModel")

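For the full example, see "examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeClassificationExample.scala"

3.3.4.2 Regression

The example below shows how to load a LIBSVM data file, parse it as an RDD[LabeledPoint], and perform regression using a decision tree with variance as the impurity measure and a maximum tree depth of 5. The mean squared error (MSE) is computed to evaluate goodness of fit.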

Scala DecisionTree API :

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree

Scala DecisionTreeModel API :

https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.tree.model.DecisionTreeModel

import org.apache.spark.mllib.tree.DecisionTree

import org.apache.spark.mllib.tree.model.DecisionTreeModel

import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split the data into training and test sets (30% held out for testing)

val splits = data.randomSplit(Array(0.7, 0.3))

val (trainingData, testData) = (splits(0), splits(1))

// Train a DecisionTree model.

// Empty categoricalFeaturesInfo indicates all features are continuous.

val categoricalFeaturesInfo = Map[Int, Int]()

val impurity = "variance"

val maxDepth = 5

val maxBins = 32

val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity,

maxDepth, maxBins)

// Evaluate model on test instances and compute test error

val labelsAndPredictions = testData.map { point =>

val prediction = model.predict(point.features)

(point.label, prediction)

}

val testMSE = labelsAndPredictions.map{ case (v, p) => math.pow(v - p, 2) }.mean()

println("Test Mean Squared Error = " + testMSE)

println("Learned regression tree model:\n" + model.toDebugString)

// Save and load model

model.save(sc, "target/tmp/myDecisionTreeRegressionModel")

val sameModel = DecisionTreeModel.load(sc, "target/tmp/myDecisionTreeRegressionModel")

For the full example, see "examples/src/main/scala/org/apache/spark/examples/mllib/DecisionTreeRegressionExample.scala"

3.4 Ensembles

· Gradient-Boosted Trees vs. Random Forests

· Random Forests

· Basic algorithm

o Training

o Prediction

· Usage tips

· Examples

o Classification

o Regression

· Gradient-Boosted Trees (GBTs)

· Basic algorithm

o Losses

· Usage tips

o Validation while training

· Examples

o Classification

o Regression

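An ensemble method is a learning algorithm that builds a model composed of a set of other base models. spark.mllib supports two major ensemble algorithms: gradient-boosted trees (GradientBoostedTrees) and random forests.

3.4.1 Gradient-Boosted Trees vs. Random Forests

Gradient-boosted trees (GBTs) and random forests are both algorithms for learning ensembles of trees, but their training processes are different:

1 GBTs train one tree at a time, while random forests can train multiple trees in parallel, so for the same number of trees GBTs can take longer to train.

On the other hand, it is often reasonable to use smaller (shallower) trees with GBTs than with random forests, and training smaller trees takes less time.

2 Random forests are less prone to overfitting. Training more trees in a random forest reduces the likelihood of overfitting, but training more trees with GBTs increases it. (In statistical terms, random forests reduce variance by using more trees, whereas GBTs reduce bias by using more trees.)

In short, both algorithms can be effective, and the choice should be based on the particular problem.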

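3.4.2 Random forests

Random forests are ensembles of decision trees. They are one of the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass setting, do not require feature scaling, and can capture non-linearities and feature interactions.

spark.mllib supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features. Random forests are implemented on top of the existing decision tree implementation, so please read the decision tree section for more information.

3.4.2.1 Basic algorithm

Random forests train a set of decision trees separately, so the training can be done in parallel. The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions from all trees reduces the variance of the predictions and improves performance on test data.

3.4.2.1.1 Training

The randomness injected into the training process includes:

1) Subsampling the original dataset on each iteration to get a different training set (a.k.a. bootstrapping)

2) Considering a different random subset of features to split on at each tree node

Apart from these randomizations, each decision tree is trained in the same way as an individual decision tree.

3.4.2.1.2 Prediction

To make a prediction on a new instance, a random forest must aggregate the predictions from its set of decision trees. This aggregation is done differently for classification and regression.

Classification: majority vote. Each tree's prediction is counted as a vote for one class, and the label is predicted to be the class that receives the most votes.

Regression: averaging. Each tree predicts a real value, and the label is predicted to be the average of the tree predictions.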
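The two aggregation rules can be sketched in plain Scala, given the per-tree predictions for one instance. The object name `ForestAggregationSketch` is hypothetical, and how ties are broken in the majority vote is left unspecified in this sketch.

```scala
// Illustrative sketch only; ForestAggregationSketch is a hypothetical name, not a Spark class.
object ForestAggregationSketch {
  /** Classification: the class predicted by the largest number of trees. */
  def majorityVote(treePredictions: Seq[Double]): Double = {
    require(treePredictions.nonEmpty, "at least one tree prediction is required")
    treePredictions.groupBy(identity).maxBy { case (_, votes) => votes.size }._1
  }

  /** Regression: the average of the real-valued tree predictions. */
  def average(treePredictions: Seq[Double]): Double = {
    require(treePredictions.nonEmpty, "at least one tree prediction is required")
    treePredictions.sum / treePredictions.size
  }
}
```

For three trees voting (1.0, 0.0, 1.0) the majority vote is class 1.0, and for three regression trees predicting (1.0, 2.0, 3.0) the forest predicts 2.0.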

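3.4.2.2 Usage tips

The random forest parameters are described below. Decision tree parameters that are covered in the decision tree section are omitted.

1> numTrees: Number of trees in the forest

i) Increasing the number of trees decreases the variance of the predictions and improves test-time accuracy

ii) Training time increases roughly linearly in the number of trees

2> maxDepth: Maximum depth of each tree in the forest

i) Increasing the depth makes the model more expressive and powerful, but deep trees take longer to train and are more prone to overfitting.

ii) In general, it is acceptable to train deeper trees in a random forest than in a single decision tree, because a single tree is more likely to overfit than a forest.

The next two parameters generally do not require tuning, but they can be tuned to speed up training.

3> subsamplingRate: The size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training considerably.

4> featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node, specified as a fraction or a function of the total number of features. Decreasing this number speeds up training, but can hurt accuracy if it is too low.

3.4.2.3 Examples

3.4.2.3.1 Classification

The example below shows how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and perform classification using a random forest. The test error is computed to measure the accuracy of the algorithm.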

RandomForest Scala Docs API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.tree.RandomForest

RandomForestModel Scala Docs API:

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel

import org.apache.spark.mllib.tree.RandomForest

import org.apache.spark.mllib.tree.model.RandomForestModel

import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split the data into training and test sets (30% held out for testing)

val splits = data.randomSplit(Array(0.7, 0.3))

val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.

// Empty categoricalFeaturesInfo indicates all features are continuous.

val numClasses = 2

val categoricalFeaturesInfo = Map[Int, Int]()

val numTrees = 3 // Use more in practice.

val featureSubsetStrategy = "auto" // Let the algorithm choose.

val impurity = "gini"

val maxDepth = 4

val maxBins = 32

val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,

numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error

val labelAndPreds = testData.map { point =>

val prediction = model.predict(point.features)

(point.label, prediction)

}

val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()

println("Test Error = " + testErr)

println("Learned classification forest model:\n" + model.toDebugString)

// Save and load model

model.save(sc, "target/tmp/myRandomForestClassificationModel")

val sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")

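For the full example, see "examples/src/main/scala/org/apache/spark/examples/mllib/RandomForestClassificationExample.scala"

3.4.2.3.2 Regression

The example below shows how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and perform regression using a random forest. The mean squared error (MSE) on the test set is computed to evaluate goodness of fit.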

RandomForest Scala Docs API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.tree.RandomForest

RandomForestModel Scala Docs API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel

import org.apache.spark.mllib.tree.RandomForest

import org.apache.spark.mllib.tree.model.RandomForestModel

import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split the data into training and test sets (30% held out for testing)

val splits = data.randomSplit(Array(0.7, 0.3))

val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.

// Empty categoricalFeaturesInfo indicates all features are continuous.

val numClasses = 2

val categoricalFeaturesInfo = Map[Int, Int]()

val numTrees = 3 // Use more in practice.

val featureSubsetStrategy = "auto" // Let the algorithm choose.

val impurity = "variance"

val maxDepth = 4

val maxBins = 32

val model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo,

numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error

val labelsAndPredictions = testData.map { point =>

val prediction = model.predict(point.features)

(point.label, prediction)

}

val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()

println("Test Mean Squared Error = " + testMSE)

println("Learned regression forest model:\n" + model.toDebugString)

// Save and load model

model.save(sc, "target/tmp/myRandomForestRegressionModel")

val sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestRegressionModel")

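For the full example, see "examples/src/main/scala/org/apache/spark/examples/mllib/RandomForestRegressionExample.scala"

3.4.3 Gradient-Boosted Trees (GBTs)

Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. Like decision trees, GBTs handle categorical features, do not require feature scaling, and can capture non-linearities and feature interactions.

spark.mllib supports GBTs for binary classification and for regression, using both continuous and categorical features. GBTs are implemented on top of the existing decision tree implementation; please see the decision tree section for more information.

Note:

GBTs do not yet support multiclass classification. For multiclass problems, use decision trees or random forests.

3.4.3.1 Basic algorithm

Gradient boosting iteratively trains a sequence of decision trees. On each iteration, the algorithm uses the current ensemble to predict the label of each training instance and compares the prediction with the true label. The dataset is then re-labeled to put more emphasis on the training instances with poor predictions, so that in the next iteration the new decision tree helps correct the previous mistakes.

The specific mechanism for re-labeling instances is defined by a loss function (discussed below). With each iteration, GBTs further reduce this loss function on the training data. (Translator's note: to make the loss decrease quickly and the algorithm converge fast, the algorithm follows the gradient of the loss.)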

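3.4.3.1.1 Losses

GBTs in spark.mllib currently support three losses: log loss (for classification), squared error (for regression), and absolute error (for regression). Note that each loss is applicable to either classification or regression, not both.

Notation:

N is the number of instances

y_i is the label of instance i

x_i is the feature vector of instance i

F(x_i) is the model's predicted label for instance i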
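Written in the notation above, the three losses are as follows. This is a reconstruction from the Spark MLlib ensembles guide, with labels y_i \in \{-1, +1\} assumed for the log loss.

```latex
\begin{array}{lll}
\text{loss} & \text{task} & \text{formula} \\
\text{log loss} & \text{classification} &
  2 \sum_{i=1}^{N} \log\left(1 + \exp(-2\, y_i\, F(x_i))\right) \\
\text{squared error} & \text{regression} &
  \sum_{i=1}^{N} \left(y_i - F(x_i)\right)^2 \\
\text{absolute error} & \text{regression} &
  \sum_{i=1}^{N} \left|y_i - F(x_i)\right|
\end{array}
```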

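3.4.3.2 Usage tips

The GBT parameters are described below. Decision tree parameters are omitted here; see the decision tree section for those.

I) loss: The loss function used by the algorithm. Choose a loss appropriate to the task (classification or regression). Different losses can give significantly different models.

II) numIterations: The number of training iterations. Each iteration produces one tree, so this is also the number of trees in the ensemble. Increasing this number makes the model more expressive and improves accuracy on the training data, but test-time accuracy may suffer if it is too large.

III) learningRate: This parameter should not need to be tuned. If the algorithm behaves unstably, decreasing this value may improve stability.

IV) algo: The task to solve (classification vs. regression)

3.4.3.2.1 Validation while training

Gradient boosting can overfit when trained with too many trees. To prevent overfitting, it is useful to validate while training. The method runWithValidation is provided for this purpose. It takes two arguments: the first is the training dataset (an RDD) and the second is the validation dataset.

Training stops when the improvement in the validation error is no larger than a given tolerance (the validationTol parameter of BoostingStrategy). In practice, the validation error decreases initially and later increases. The validation error may not change monotonically, so users are advised to set a sufficiently large tolerance and to examine the validation curve with evaluateEachIteration (which returns the loss at each iteration) in order to choose the number of iterations.

3.4.3.3 Examples

3.4.3.3.1 Classification

The example below shows how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and perform classification using gradient-boosted trees with log loss. The test error is computed to measure the accuracy of the algorithm.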

GradientBoostedTrees Scala DOCS API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees

GradientBoostedTreesModel Scala DOCS API :

http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

import org.apache.spark.mllib.tree.GradientBoostedTrees

import org.apache.spark.mllib.tree.configuration.BoostingStrategy

import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split the data into training and test sets (30% held out for testing)

val splits = data.randomSplit(Array(0.7, 0.3))

val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.

// The defaultParams for Classification use LogLoss by default.

val boostingStrategy = BoostingStrategy.defaultParams("Classification")

boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.

boostingStrategy.treeStrategy.numClasses = 2

boostingStrategy.treeStrategy.maxDepth = 5

// Empty categoricalFeaturesInfo indicates all features are continuous.

boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute test error

val labelAndPreds = testData.map { point =>

val prediction = model.predict(point.features)

(point.label, prediction)

}

val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()

println("Test Error = " + testErr)

println("Learned classification GBT model:\n" + model.toDebugString)

// Save and load model

model.save(sc, "target/tmp/myGradientBoostingClassificationModel")

val sameModel = GradientBoostedTreesModel.load(sc,

"target/tmp/myGradientBoostingClassificationModel")

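For the full example, see "examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostingClassificationExample.scala"

3.4.3.3.2 Regression

The example below shows how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and perform regression using gradient-boosted trees with squared error as the loss. The mean squared error (MSE) on the test set is computed to evaluate goodness of fit.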

GradientBoostedTrees Scala Docs API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees

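GradientBoostedTreesModel Scala Docs API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.tree.model.GradientBoostedTreesModel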

import org.apache.spark.mllib.tree.GradientBoostedTrees

import org.apache.spark.mllib.tree.configuration.BoostingStrategy

import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel

import org.apache.spark.mllib.util.MLUtils

// Load and parse the data file.

val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split the data into training and test sets (30% held out for testing)

val splits = data.randomSplit(Array(0.7, 0.3))

val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.

// The defaultParams for Regression use SquaredError by default.

val boostingStrategy = BoostingStrategy.defaultParams("Regression")

boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.

boostingStrategy.treeStrategy.maxDepth = 5

// Empty categoricalFeaturesInfo indicates all features are continuous.

boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()

val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute test error

val labelsAndPredictions = testData.map { point =>

val prediction = model.predict(point.features)

(point.label, prediction)

}

val testMSE = labelsAndPredictions.map{ case(v, p) => math.pow((v - p), 2)}.mean()

println("Test Mean Squared Error = " + testMSE)

println("Learned regression GBT model:\n" + model.toDebugString)

// Save and load model

model.save(sc, "target/tmp/myGradientBoostingRegressionModel")

val sameModel = GradientBoostedTreesModel.load(sc,

"target/tmp/myGradientBoostingRegressionModel")

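For the full example, see "examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostingRegressionExample.scala"

3.5 Isotonic regression

Isotonic regression is defined as follows: given a finite set of real numbers Y = y_1, y_2, …, y_n (y_i \in \mathbb{R}) representing observed responses, and X = x_1, x_2, …, x_n the unknown response values to be fitted, find the values that minimize the following function subject to the order x_1 \le x_2 \le … \le x_n:

f(x) = \sum_{i=1}^{n} w_i (y_i - x_i)^2

Here the w_i are positive weights. The resulting function is called the isotonic regression, and it is unique. The problem can be viewed as a least squares problem under an order restriction, so the isotonic regression is the monotonic function that best fits the original data points.

spark.mllib supports the pool adjacent violators algorithm (PAVA), using an approach that parallelizes isotonic regression. The training input is an RDD of tuples of three double values that represent the label, the feature, and the weight, in this order. In addition, the IsotonicRegression algorithm has one optional parameter, isotonic, which defaults to true. When it is true the fitted function is isotonic (monotonically increasing); when it is false the fitted function is antitonic (monotonically decreasing).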
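The core idea of pool adjacent violators can be sketched sequentially in plain Scala. This is a single-machine illustration for the increasing case only, assuming the labels are already ordered by feature; it is not the parallel implementation used by spark.mllib, and the name `PavaSketch` is hypothetical.

```scala
// Illustrative sequential sketch only; not the parallel spark.mllib implementation.
object PavaSketch {
  /**
   * Weighted pool-adjacent-violators for labels y (already ordered by feature)
   * with positive weights w. Returns the non-decreasing fitted values.
   */
  def fit(y: Array[Double], w: Array[Double]): Array[Double] = {
    require(y.length == w.length, "y and w must have the same length")
    // Stack of blocks, most recent first: (weighted mean, total weight, point count).
    var blocks = List.empty[(Double, Double, Int)]
    for (i <- y.indices) {
      var cur = (y(i), w(i), 1)
      // Pool with the preceding block while the order constraint is violated.
      while (blocks.nonEmpty && blocks.head._1 > cur._1) {
        val (mean, weight, count) = blocks.head
        val total = weight + cur._2
        cur = ((mean * weight + cur._1 * cur._2) / total, total, count + cur._3)
        blocks = blocks.tail
      }
      blocks = cur :: blocks
    }
    // Expand each block back to one fitted value per original point.
    blocks.reverse.flatMap { case (mean, _, count) => List.fill(count)(mean) }.toArray
  }
}
```

With unit weights, the labels (1, 3, 2, 4) are fitted as (1, 2.5, 2.5, 4): the violating pair 3, 2 is pooled into its weighted mean.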

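Training returns an IsotonicRegressionModel, which can be used to predict labels for both known and unknown features. The result of isotonic regression is treated as a piecewise linear function, so the rules for prediction are:

1) If the prediction input exactly matches a training feature, the associated prediction is returned. If there are multiple predictions for the same feature, one of them is returned; which one is undefined.

2) If the prediction input is lower (or higher) than all training features, the prediction for the lowest (or highest) feature is returned. If there are multiple predictions for that feature, the lowest (or highest) one is returned.

3) If the prediction input falls between two training features, the prediction is treated as a piecewise linear function and the value is interpolated from the predictions of the two closest features. If there are multiple values for the same feature, the same rules as in the previous point are used.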
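The three prediction rules can be sketched in plain Scala. The sketch assumes strictly increasing training features (so the case of several predictions for one feature does not arise); the names `IsotonicPredictSketch`, `boundaries`, and `predictions` are chosen here for illustration.

```scala
// Illustrative sketch only; assumes strictly increasing boundaries.
object IsotonicPredictSketch {
  /**
   * boundaries: training features in strictly increasing order.
   * predictions: fitted labels at those features.
   */
  def predict(boundaries: Array[Double], predictions: Array[Double], x: Double): Double = {
    require(boundaries.nonEmpty && boundaries.length == predictions.length)
    val idx = java.util.Arrays.binarySearch(boundaries, x)
    if (idx >= 0) {
      predictions(idx)                       // rule 1: exact match
    } else {
      val insert = -idx - 1                  // index of the first boundary greater than x
      if (insert == 0) predictions.head      // rule 2: below all training features
      else if (insert == boundaries.length) predictions.last // rule 2: above all
      else {                                 // rule 3: linear interpolation
        val (x0, x1) = (boundaries(insert - 1), boundaries(insert))
        val (y0, y1) = (predictions(insert - 1), predictions(insert))
        y0 + (y1 - y0) * (x - x0) / (x1 - x0)
      }
    }
  }
}
```

For boundaries (1, 2, 4) with predictions (10, 20, 40), an input of 3 lies halfway between the features 2 and 4 and is predicted as 30, while inputs of 0 and 5 are clamped to 10 and 40.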

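3.5.1 Example

Each line of the input file has the format

label,feature, for example 4710.28,500.00.

The data read from this file are split into a training set and a test set. The model is trained on the training set, and the mean squared error (MSE) between the predicted and the real labels is computed on the test set.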

IsotonicRegression Scala Docs API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.regression.IsotonicRegression

IsotonicRegressionModel Scala Docs API : http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.mllib.regression.IsotonicRegressionModel

import org.apache.spark.mllib.regression.{IsotonicRegression, IsotonicRegressionModel}

val data = sc.textFile("data/mllib/sample_isotonic_regression_data.txt")

// Create label, feature, weight tuples from input data with weight set to default value 1.0.

val parsedData = data.map { line =>

  val parts = line.split(',').map(_.toDouble)

  (parts(0), parts(1), 1.0)

}

// Split data into training (60%) and test (40%) sets.

val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)

val training = splits(0)

val test = splits(1)

// Create isotonic regression model from training data.

// Isotonic parameter defaults to true so it is only shown for demonstration

val model = new IsotonicRegression().setIsotonic(true).run(training)

// Create tuples of predicted and real labels.

val predictionAndLabel = test.map { point =>

  val predictedLabel = model.predict(point._2)

  (predictedLabel, point._1)

}

// Calculate mean squared error between predicted and real labels.

val meanSquaredError = predictionAndLabel.map { case (p, l) => math.pow((p - l), 2) }.mean()

println("Mean Squared Error = " + meanSquaredError)

// Save and load model

model.save(sc, "target/tmp/myIsotonicRegressionModel")

val sameModel = IsotonicRegressionModel.load(sc, "target/tmp/myIsotonicRegressionModel")

For the full example, see "examples/src/main/scala/org/apache/spark/examples/mllib/IsotonicRegressionExample.scala"

0 0