Spark MLlib Algorithms



Official documentation

Mathematical formulation

Loss functions

  1. hinge loss
  2. logistic loss
  3. squared loss
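
In MLlib's formulation, for a weight vector $w$, feature vector $x$, and label $y \in \{-1, +1\}$, these losses are:

hinge:    $L(w; x, y) = \max\{0,\; 1 - y\,w^{T}x\}$
logistic: $L(w; x, y) = \log\bigl(1 + \exp(-y\,w^{T}x)\bigr)$
squared:  $L(w; x, y) = \tfrac{1}{2}(w^{T}x - y)^{2}$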

Regularizers

  1. L1
  2. L2
  3. elastic net
  4. zero (unregularized)
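
MLlib minimizes the regularized objective $f(w) = \lambda\,R(w) + \frac{1}{n}\sum_{i=1}^{n} L(w; x_i, y_i)$, where the regularizer $R(w)$ is one of:

L1:          $R(w) = \|w\|_1$
L2:          $R(w) = \tfrac{1}{2}\|w\|_2^2$
elastic net: $R(w) = \alpha\|w\|_1 + (1-\alpha)\tfrac{1}{2}\|w\|_2^2$
zero:        $R(w) = 0$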

Optimization

Spark optimizes its linear methods with one of two algorithms: SGD (stochastic gradient descent) and L-BFGS (a quasi-Newton method).

The LIBSVM data format

Label 1:value 2:value ….

Label: the class identifier, such as the 1 / -1 used by train.model in the previous section. You can choose the values freely, e.g. -10, 0, 15. For regression, however, the label is the target value and must be the actual number.

Value: the data to train on; from a classification perspective these are the feature values. Fields are separated by spaces.

For example: -15 1:0.708 2:1056 3:-0.3333

Note that if a feature value is 0, the indices before the colons (call them feature indices) need not be consecutive. For example:
-15 1:0.708 3:-0.3333

indicates that the 2nd feature is 0. From a programming standpoint, this sparse encoding reduces memory use and speeds up inner-product computations. Data produced in MATLAB is usually a plain dense matrix without indices, so it is convenient to write a small program to convert it, as in the sketch below.
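
A minimal conversion sketch in Scala (toLibSVMLine is a hypothetical helper introduced here for illustration):

// Convert a dense row (label + features) to a LIBSVM line,
// skipping zero-valued features; indices are 1-based.
def toLibSVMLine(label: Double, features: Array[Double]): String = {
  val pairs = features.zipWithIndex.collect {
    case (v, i) if v != 0.0 => s"${i + 1}:$v"
  }
  (label.toString +: pairs).mkString(" ")
}

// Example: prints "-15.0 1:0.708 3:-0.3333"
println(toLibSVMLine(-15.0, Array(0.708, 0.0, -0.3333)))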

Classification

Linear classification methods include the linear support vector machine (SVM) and logistic regression (LR). The linear SVM supports only binary classification, while LR supports both binary and multiclass classification.

Linear Support Vector Machines (SVMs)

import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "/Users/yuyin/Downloads/software/spark/spark-1.6.2-bin-hadoop1-scala2.11/data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model.
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

// Clear the default threshold.
model.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()
println("Area under ROC = " + auROC)

// Save and load model.
model.save(sc, "target/tmp/scalaSVMWithSGDModel")
val sameModel = SVMModel.load(sc, "target/tmp/scalaSVMWithSGDModel")

Adding a regularizer

import org.apache.spark.mllib.optimization.L1Updater

val svmAlg = new SVMWithSGD()
svmAlg.optimizer
  .setNumIterations(200)
  .setRegParam(0.1)
  .setUpdater(new L1Updater)
val modelL1 = svmAlg.run(training)
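
By default, SVMWithSGD.train() performs L2 regularization with the given regularization parameter; substituting L1Updater as above yields an L1-regularized (and typically sparser) solution instead.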

Logistic regression

import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model.
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)

// Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

// Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val accuracy = metrics.accuracy
println(s"Accuracy = $accuracy")

// Save and load model.
model.save(sc, "target/tmp/scalaLogisticRegressionWithLBFGSModel")
val sameModel = LogisticRegressionModel.load(sc,
  "target/tmp/scalaLogisticRegressionWithLBFGSModel")

Regression

Linear least squares, Lasso, and ridge regression

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Load and parse the data.
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Building the model.
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error.
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
println("training Mean Squared Error = " + MSE)

// Save and load model.
model.save(sc, "target/tmp/scalaLinearRegressionWithSGDModel")
val sameModel = LinearRegressionModel.load(sc, "target/tmp/scalaLinearRegressionWithSGDModel")
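
The extremely small step size above compensates for the unscaled features in lpsa.data. A common alternative (a sketch using spark.mllib's StandardScaler; scaledData is a name introduced here) is to standardize the features first and then train on scaledData with a more ordinary step size:

import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.regression.LabeledPoint

// Standardize features to zero mean and unit variance.
val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(parsedData.map(_.features))
val scaledData = parsedData
  .map(p => LabeledPoint(p.label, scaler.transform(p.features)))
  .cache()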

RidgeRegressionWithSGD and LassoWithSGD can be used in the same way as LinearRegressionWithSGD.
spark.mllib implements a simple distributed version of stochastic gradient descent (SGD).
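
Conceptually, each iteration samples a mini-batch of the RDD, sums the per-example gradients across the cluster, and updates the weights on the driver. The following is a conceptual sketch only (not MLlib's internal code), shown for squared loss, whose gradient at a point is $(w^{T}x - y)\,x$:

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.regression.LabeledPoint

// One distributed mini-batch SGD step: sample a fraction of the data,
// aggregate per-example gradients with treeAggregate, then step.
def sgdStep(data: RDD[LabeledPoint], w: Array[Double],
            stepSize: Double, miniBatchFraction: Double, seed: Long): Array[Double] = {
  val zero = (Array.fill(w.length)(0.0), 0L)
  val (gradSum, count) = data
    .sample(withReplacement = false, miniBatchFraction, seed)
    .treeAggregate(zero)(
      (acc, p) => {
        val x = p.features.toArray
        val err = x.zip(w).map { case (xi, wi) => xi * wi }.sum - p.label
        (acc._1.zip(x).map { case (gi, xi) => gi + err * xi }, acc._2 + 1L)
      },
      (a, b) => (a._1.zip(b._1).map { case (g1, g2) => g1 + g2 }, a._2 + b._2)
    )
  // Average the gradient over the mini-batch and take one step.
  w.zip(gradSum).map { case (wi, gi) => wi - stepSize * gi / math.max(count, 1L) }
}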

Algorithms are all implemented in Scala:
1. SVMWithSGD
2. LogisticRegressionWithLBFGS
3. LogisticRegressionWithSGD
4. LinearRegressionWithSGD
5. RidgeRegressionWithSGD
6. LassoWithSGD

Choosing a logistic regression implementation

The L-BFGS version supports both binary and multinomial logistic regression, while the SGD version supports only binary classification. The L-BFGS version does not support L1 regularization, but the SGD version does. When L1 regularization is not required, the L-BFGS version is strongly recommended: by approximating the inverse Hessian matrix with quasi-Newton updates, it converges faster and more accurately than SGD, as shown in the sketch below.
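
A sketch of the choice (Spark 1.x spark.mllib API, reusing the training RDD from the earlier examples):

import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionWithSGD}
import org.apache.spark.mllib.optimization.L1Updater

// Preferred when no L1 penalty is needed: quasi-Newton, faster convergence.
val lbfgsModel = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)

// SGD variant when L1 (sparsity-inducing) regularization is required.
val sgdAlg = new LogisticRegressionWithSGD()
sgdAlg.optimizer
  .setNumIterations(200)
  .setRegParam(0.1)
  .setUpdater(new L1Updater)
val sgdModel = sgdAlg.run(training)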
Official documentation

Streaming linear regression

The regression model is trained online, updating its weights as new data arrives in the stream.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val trainingData = ssc.textFileStream(args(0)).map(LabeledPoint.parse).cache()
val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)

val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()
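
The snippet assumes an already-created StreamingContext ssc and directory paths passed in args. A hypothetical setup (names chosen here, not part of the original example):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical setup for the ssc used above: 1-second micro-batches.
val conf = new SparkConf().setAppName("StreamingLinearRegression")
val ssc = new StreamingContext(conf, Seconds(1))

Each new file dropped into the monitored directories must contain one point per line in the (y,[x1,x2,x3]) format that LabeledPoint.parse expects.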

Evaluation metrics

Official documentation

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")

// Split data into training (60%) and test (40%).
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
training.cache()

// Run training algorithm to build the model.
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)

// Clear the prediction threshold so the model will return probabilities.
model.clearThreshold

// Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

// Instantiate metrics object.
val metrics = new BinaryClassificationMetrics(predictionAndLabels)

// Precision by threshold.
val precision = metrics.precisionByThreshold
precision.foreach { case (t, p) =>
  println(s"Threshold: $t, Precision: $p")
}

// Recall by threshold.
val recall = metrics.recallByThreshold
recall.foreach { case (t, r) =>
  println(s"Threshold: $t, Recall: $r")
}

// Precision-Recall curve.
val PRC = metrics.pr

// F-measure.
val f1Score = metrics.fMeasureByThreshold
f1Score.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = 1")
}

val beta = 0.5
val fScore = metrics.fMeasureByThreshold(beta)
fScore.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = 0.5")
}

// AUPRC.
val auPRC = metrics.areaUnderPR
println("Area under precision-recall curve = " + auPRC)

// Compute thresholds used in ROC and PR curves.
val thresholds = precision.map(_._1)

// ROC curve.
val roc = metrics.roc

// AUROC.
val auROC = metrics.areaUnderROC
println("Area under ROC = " + auROC)
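
A common follow-up (a sketch reusing f1Score and model from above) is to pick the threshold that maximizes the F-measure and set it back on the model as its decision cutoff:

// Select the threshold with the highest F1 score and make it the cutoff.
val bestThreshold = f1Score.reduce((a, b) => if (a._2 > b._2) a else b)._1
model.setThreshold(bestThreshold)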