Spark MLlib Algorithms
Source: Internet  Editor: 程序博客网  Time: 2024/05/16 07:58
Official documentation
Mathematical formulation
Loss functions
- hinge loss
- logistic loss
- squared loss
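For reference, the three losses can be written out as below. This is a sketch in the notation commonly used for linear methods, where w is the weight vector, x the feature vector and y the label; the symbols are assumptions, since the original only names the losses.

```latex
\begin{aligned}
\text{hinge loss:}    &\quad L(\mathbf{w}; \mathbf{x}, y) = \max\{0,\ 1 - y\,\mathbf{w}^{T}\mathbf{x}\}, && y \in \{-1, +1\} \\
\text{logistic loss:} &\quad L(\mathbf{w}; \mathbf{x}, y) = \log\bigl(1 + \exp(-y\,\mathbf{w}^{T}\mathbf{x})\bigr), && y \in \{-1, +1\} \\
\text{squared loss:}  &\quad L(\mathbf{w}; \mathbf{x}, y) = \tfrac{1}{2}\,(\mathbf{w}^{T}\mathbf{x} - y)^{2}, && y \in \mathbb{R}
\end{aligned}
```

Hinge loss corresponds to linear SVMs, logistic loss to logistic regression, and squared loss to linear least squares.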
Regularizers
- L1
- L2
- elastic net
- zero (unregularized)
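As a sketch of how the regularizer enters training: the objective is a weighted sum of a regularizer R(w) and the average loss over n training examples. The symbols below (lambda for the regularization parameter, alpha for the elastic-net mixing weight, L for the per-example loss function) are assumed notation, not taken from the original.

```latex
\begin{aligned}
f(\mathbf{w}) &= \lambda\, R(\mathbf{w}) + \frac{1}{n} \sum_{i=1}^{n} L(\mathbf{w}; \mathbf{x}_i, y_i) \\[4pt]
\text{L1:} &\quad R(\mathbf{w}) = \|\mathbf{w}\|_{1} \\
\text{L2:} &\quad R(\mathbf{w}) = \tfrac{1}{2}\,\|\mathbf{w}\|_{2}^{2} \\
\text{elastic net:} &\quad R(\mathbf{w}) = \alpha\,\|\mathbf{w}\|_{1} + (1 - \alpha)\,\tfrac{1}{2}\,\|\mathbf{w}\|_{2}^{2}, \qquad \alpha \in [0, 1] \\
\text{zero:} &\quad R(\mathbf{w}) = 0
\end{aligned}
```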
Optimization
Spark uses two optimization methods: SGD (stochastic gradient descent) and L-BFGS (a limited-memory quasi-Newton method).
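To make the gradient-descent idea concrete, here is a minimal plain-Scala sketch for the squared loss. It is not MLlib's implementation: the object and method names (SgdSketch, squaredLossGradient, fit) are hypothetical, and it uses the full batch in every iteration (the equivalent of a mini-batch fraction of 1) so that the result is deterministic. The step size shrinks as stepSize / sqrt(t), which is the schedule the spark.mllib SGD updaters use.

```scala
object SgdSketch {
  /** Gradient of the squared loss 1/2 * (w.x - y)^2 with respect to w. */
  def squaredLossGradient(w: Array[Double], x: Array[Double], y: Double): Array[Double] = {
    val residual = w.zip(x).map { case (wi, xi) => wi * xi }.sum - y
    x.map(_ * residual)
  }

  /** Gradient descent on the average squared loss, step size stepSize / sqrt(t). */
  def fit(data: Seq[(Array[Double], Double)], numIterations: Int, stepSize: Double): Array[Double] = {
    var w = Array.fill(data.head._1.length)(0.0)
    for (t <- 1 to numIterations) {
      // Average the per-example gradients over the whole data set.
      val grad = data
        .map { case (x, y) => squaredLossGradient(w, x, y) }
        .reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
        .map(_ / data.size)
      // Decreasing step size for iteration t.
      val gamma = stepSize / math.sqrt(t.toDouble)
      w = w.zip(grad).map { case (wi, gi) => wi - gamma * gi }
    }
    w
  }
}
```

For example, SgdSketch.fit(Seq((Array(1.0), 2.0), (Array(2.0), 4.0), (Array(3.0), 6.0)), 200, 0.1) should approach the weight 2.0, since the data follow y = 2x.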
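The libSVM data format
Label 1:value 2:value ...
Label: the class identifier, for example 1 and -1. In the libSVM format itself you can choose the values freely, such as -10, 0, 15. Note, however, that spark.mllib expects class labels 0 and 1 for binary classification (and 0, 1, ..., k-1 for multiclass). For regression, the label is the target value, so it must be the real observed value.
Value: the data to train on, i.e. the feature values from a classification point of view. Entries are separated by spaces.
For example: -15 1:0.708 2:1056 3:-0.3333
Note that a feature whose value is 0 can be omitted, so the numbers in front of the colons (call them indices) do not have to be consecutive. For example:
-15 1:0.708 3:-0.3333
This means the 2nd feature is 0. From a programming point of view, this reduces memory usage and speeds up inner-product computations. Data produced in MATLAB is usually a regular dense matrix without indices, so it is convenient to write a small program to convert it.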
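The conversion mentioned above can be sketched in plain Scala as follows. The object and method names (LibSvmFormat, toLibSvmLine, parseLine) are hypothetical helpers for illustration, not part of Spark or libSVM. Indices are written 1-based, and zero-valued features are skipped.

```scala
object LibSvmFormat {
  /** Format one dense row as a libSVM line, omitting zero features (1-based indices). */
  def toLibSvmLine(label: Double, features: Array[Double]): String = {
    val entries = features.zipWithIndex.collect {
      case (v, i) if v != 0.0 => s"${i + 1}:$v"
    }
    (label.toString +: entries).mkString(" ")
  }

  /** Parse one libSVM line into (label, index -> value); missing indices mean 0. */
  def parseLine(line: String): (Double, Map[Int, Double]) = {
    val tokens = line.trim.split("\\s+")
    val label = tokens.head.toDouble
    val features = tokens.tail.map { token =>
      val Array(index, value) = token.split(":")
      index.toInt -> value.toDouble
    }.toMap
    (label, features)
  }
}
```

When loading real data into Spark, MLUtils.loadLibSVMFile (used in the examples in this article) already does the parsing, so a helper like this is mainly useful for producing the files in the first place.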
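Classification
Linear classification covers linear support vector machines (SVMs) and logistic regression (LR). Linear SVMs support only binary classification, while LR supports both binary and multiclass classification.
Linear Support Vector Machines (SVMs)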
import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "/Users/yuyin/Downloads/software/spark/spark-1.6.2-bin-hadoop1-scala2.11/data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model
val numIterations = 100
val model = SVMWithSGD.train(training, numIterations)

// Clear the default threshold.
model.clearThreshold()

// Compute raw scores on the test set.
val scoreAndLabels = test.map { point =>
  val score = model.predict(point.features)
  (score, point.label)
}

// Get evaluation metrics.
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auROC = metrics.areaUnderROC()

println("Area under ROC = " + auROC)

// Save and load model
model.save(sc, "target/tmp/scalaSVMWithSGDModel")
val sameModel = SVMModel.load(sc, "target/tmp/scalaSVMWithSGDModel")
Adding a regularization term (here L1, with regularization parameter 0.1 and 200 iterations)
import org.apache.spark.mllib.optimization.L1Updater

val svmAlg = new SVMWithSGD()
svmAlg.optimizer
  .setNumIterations(200)
  .setRegParam(0.1)
  .setUpdater(new L1Updater)
val modelL1 = svmAlg.run(training)
Logistic regression
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.evaluation.MulticlassMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Split data into training (60%) and test (40%).
val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0).cache()
val test = splits(1)

// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(training)

// Compute raw scores on the test set.
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

// Get evaluation metrics.
val metrics = new MulticlassMetrics(predictionAndLabels)
val accuracy = metrics.accuracy
println(s"Accuracy = $accuracy")

// Save and load model
model.save(sc, "target/tmp/scalaLogisticRegressionWithLBFGSModel")
val sameModel = LogisticRegressionModel.load(sc, "target/tmp/scalaLogisticRegressionWithLBFGSModel")
Regression
Linear least squares, Lasso, and ridge regression
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionModel
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

// Load and parse the data
val data = sc.textFile("data/mllib/ridge-data/lpsa.data")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}.cache()

// Building the model
val numIterations = 100
val stepSize = 0.00000001
val model = LinearRegressionWithSGD.train(parsedData, numIterations, stepSize)

// Evaluate model on training examples and compute training error
val valuesAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val MSE = valuesAndPreds.map { case (v, p) => math.pow((v - p), 2) }.mean()
println("training Mean Squared Error = " + MSE)

// Save and load model
model.save(sc, "target/tmp/scalaLinearRegressionWithSGDModel")
val sameModel = LinearRegressionModel.load(sc, "target/tmp/scalaLinearRegressionWithSGDModel")
RidgeRegressionWithSGD and LassoWithSGD can be used in the same way as LinearRegressionWithSGD.
spark.mllib implements a simple distributed version of stochastic gradient descent (SGD).
Algorithms are all implemented in Scala:
1. SVMWithSGD
2. LogisticRegressionWithLBFGS
3. LogisticRegressionWithSGD
4. LinearRegressionWithSGD
5. RidgeRegressionWithSGD
6. LassoWithSGD
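Choosing a logistic regression implementation
The L-BFGS version supports both binary and multinomial logistic regression, while the SGD version supports only binary classification. The L-BFGS version does not support L1 regularization, but the SGD version does. When L1 regularization is not needed, the L-BFGS version is strongly recommended: by using a quasi-Newton method to approximate the inverse Hessian matrix, it converges faster and more accurately than SGD.
Official documentation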
Streaming linear regression
Online regression model
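import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

// ssc is assumed to be an existing StreamingContext;
// args(0) and args(1) are the directories watched for training and test data.
val trainingData = ssc.textFileStream(args(0)).map(LabeledPoint.parse).cache()
val testData = ssc.textFileStream(args(1)).map(LabeledPoint.parse)

// The initial weight vector must match the number of features in the data.
val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

// Update the model on each training batch and print predictions for the test stream.
model.trainOn(trainingData)
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()

ssc.start()
ssc.awaitTermination()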
Evaluation metrics
Official documentation
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils

// Load training data in LIBSVM format
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_binary_classification_data.txt")

// Split data into training (60%) and test (40%)
val Array(training, test) = data.randomSplit(Array(0.6, 0.4), seed = 11L)
training.cache()

// Run training algorithm to build the model
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(training)

// Clear the prediction threshold so the model will return probabilities
model.clearThreshold

// Compute raw scores on the test set
val predictionAndLabels = test.map { case LabeledPoint(label, features) =>
  val prediction = model.predict(features)
  (prediction, label)
}

// Instantiate metrics object
val metrics = new BinaryClassificationMetrics(predictionAndLabels)

// Precision by threshold
val precision = metrics.precisionByThreshold
precision.foreach { case (t, p) =>
  println(s"Threshold: $t, Precision: $p")
}

// Recall by threshold
val recall = metrics.recallByThreshold
recall.foreach { case (t, r) =>
  println(s"Threshold: $t, Recall: $r")
}

// Precision-Recall Curve
val PRC = metrics.pr

// F-measure
val f1Score = metrics.fMeasureByThreshold
f1Score.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = 1")
}

val beta = 0.5
val fScore = metrics.fMeasureByThreshold(beta)
// Print the beta = 0.5 scores (fScore), not the beta = 1 scores again
fScore.foreach { case (t, f) =>
  println(s"Threshold: $t, F-score: $f, Beta = 0.5")
}

// AUPRC
val auPRC = metrics.areaUnderPR
println("Area under precision-recall curve = " + auPRC)

// Compute thresholds used in ROC and PR curves
val thresholds = precision.map(_._1)

// ROC Curve
val roc = metrics.roc

// AUROC
val auROC = metrics.areaUnderROC
println("Area under ROC = " + auROC)
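To show what the per-threshold numbers above mean, here is a small plain-Scala sketch that computes precision, recall and the F-measure at a single threshold from (score, label) pairs. It is an illustration only, not Spark's implementation: the object name ThresholdMetrics and its method are hypothetical, and it assumes that a score greater than or equal to the threshold counts as a positive prediction and that positives carry the label 1.0.

```scala
object ThresholdMetrics {
  /** Returns (precision, recall, F-beta) when scores >= threshold are predicted positive. */
  def at(scoreAndLabels: Seq[(Double, Double)],
         threshold: Double,
         beta: Double = 1.0): (Double, Double, Double) = {
    val predictedPositive = scoreAndLabels.filter(_._1 >= threshold)
    val truePositives = predictedPositive.count(_._2 == 1.0).toDouble
    val actualPositives = scoreAndLabels.count(_._2 == 1.0).toDouble

    // Precision: share of predicted positives that are truly positive.
    val precision = if (predictedPositive.isEmpty) 0.0 else truePositives / predictedPositive.size
    // Recall: share of actual positives that were predicted positive.
    val recall = if (actualPositives == 0.0) 0.0 else truePositives / actualPositives

    // F-beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta < 1 weights precision more.
    val b2 = beta * beta
    val denominator = b2 * precision + recall
    val fBeta = if (denominator == 0.0) 0.0 else (1 + b2) * precision * recall / denominator
    (precision, recall, fBeta)
  }
}
```

Calling ThresholdMetrics.at(pairs, 0.7) gives the beta = 1 case, and ThresholdMetrics.at(pairs, 0.7, beta = 0.5) the precision-weighted variant that the Spark example prints as "Beta = 0.5".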