Machine Learning with Spark MLlib (Part 2)
Published: 2024/04/20 12:28
The previous post introduced the basics of Spark MLlib; starting with this one, we move into practice. Since the focus here is on applying Spark MLlib, I will not walk through the derivations of the algorithms in detail; I plan to write those up later as dedicated posts. This post covers the supervised learning algorithms in Spark MLlib: Logistic Regression, Naive Bayes, SVM (Support Vector Machine), Decision Tree, and Linear Regression.
It is worth noting that although Spark MLlib already provides interfaces for the common algorithms, after reading its source code you may find that the performance or stability falls short of what your own implementation could achieve; in that case (or for other reasons) you can implement these algorithms yourself.
1. Supervised Learning
First, let's look at the definition of supervised learning. Wikipedia defines it as:
"Supervised learning is the machine learning task of inferring a function from labeled training data."
Put simply, supervised learning finds a rule (i.e. a function) in data. Finding the rule in a sequence of numbers is something we did back in middle school. For example, given 1, 2, 4, 8, ..., the rule is 2^x, where x is an integer from 0 to infinity.
Among the supervised learning algorithms covered in this post, Logistic Regression, Naive Bayes, SVM (Support Vector Machine), and Decision Tree are classification algorithms. Stated formally, classification assigns a sample to one of a set of known classes based on its features or attributes; stated informally, it is simply sorting things into categories.
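As a toy illustration of "inferring a function from labeled data", here is a minimal standalone sketch (the object and method names are my own, not from MLlib) that checks the 2^x rule against the sequence above:

```scala
object RuleDemo {
  // The hypothesised function: f(x) = 2^x
  def f(x: Int): Double = math.pow(2, x)

  // Check that the inferred rule reproduces every observed term,
  // treating the position in the sequence as x
  def fits(observed: Seq[Double]): Boolean =
    observed.zipWithIndex.forall { case (y, x) => f(x) == y }

  def main(args: Array[String]): Unit = {
    println(fits(Seq(1.0, 2.0, 4.0, 8.0))) // true: the rule explains the data
  }
}
```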
2. Linear Regression
Linear regression is a statistical method for modeling the quantitative dependence between two or more variables. Spark MLlib provides two interfaces for it: LinearRegressionWithSGD and LassoWithSGD. LassoWithSGD can be seen as a strengthened version of LinearRegression: by adding L1 regularization it shrinks the coefficients, which helps when there are more features than samples, i.e. when the input matrix X is not of full rank. Here we take LinearRegressionWithSGD as the example; the code is as follows:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LinearRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkContext, SparkConf}

object LinearRegression {
  def main(args: Array[String]): Unit = {
    val length = args.length
    if (length != 2 && length != 3) {
      System.err.println("Usage: <input file> <iteration number> <step size(optional)>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Iteration number
    val iteration = args(1).toInt
    // Step size, default value is 0.01
    val stepSize = if (length == 3) args(2).toDouble else 0.01
    // Parse the data into LabeledPoint
    val parseData = data.map { line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(_.toDouble)))
    }
    // Train the model
    val model = LinearRegressionWithSGD.train(parseData, iteration, stepSize)
    // Check its coefficients
    val weight = model.weights
    println(weight)
    sc.stop()
  }
}
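The parser above assumes each input line has the form `label:f1 f2 f3 ...` (this format is my reading of the `split(":")` and `split(" ")` calls; the post does not document it). A minimal standalone sketch of that parsing step, outside Spark:

```scala
object ParseDemo {
  // Split "label:f1 f2 ..." into (label, feature array),
  // mirroring the map step inside the Spark job above
  def parseLine(line: String): (Double, Array[Double]) = {
    val elem = line.split(":")
    (elem(0).toDouble, elem(1).split(" ").map(_.toDouble))
  }

  def main(args: Array[String]): Unit = {
    val (label, features) = parseLine("2.5:1.0 0.5 3.0")
    println(s"label=$label features=${features.mkString(",")}")
  }
}
```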
3. Logistic Regression
Logistic regression is mainly used for binary classification, e.g. spam or not; its y value is 0 or 1. For a detailed introduction to the algorithm, see my other post, "Logistic Regression notes". Spark MLlib provides two interfaces for it: LogisticRegressionWithSGD and LogisticRegressionWithLBFGS. LogisticRegressionWithLBFGS is preferred because it removes the need to tune a step size. The two are used in much the same way, but to make the step size explicit, the example below uses LogisticRegressionWithSGD:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object LogisticRegression {
  def main(args: Array[String]): Unit = {
    val length = args.length
    if (length != 2 && length != 3) {
      System.err.println("Usage: <input file> <iteration number> <step size(optional)>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Iteration number
    val iteration = args(1).toInt
    // Step size, default value is 0.01
    val stepSize = if (length == 3) args(2).toDouble else 0.01
    // Parse the data into LabeledPoint
    val parseData = data.map { line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(_.toDouble)))
    }
    // Train a model
    val model = LogisticRegressionWithSGD.train(parseData, iteration, stepSize)
    // Check its coefficients
    val weight = model.weights
    println(weight)
    sc.stop()
  }
}
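Since LogisticRegressionWithLBFGS is the preferred interface (no step size to tune), here is a minimal sketch of the same job using it. This is my own adaptation of the example above, not code from the post, and it has not been run against a live cluster:

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}

object LogisticRegressionLBFGS {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf())
    // Same "label:f1 f2 ..." input format as the SGD example
    val parseData = sc.textFile(args(0)).map { line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(_.toDouble)))
    }
    // No step size argument: L-BFGS chooses its own step internally
    val model = new LogisticRegressionWithLBFGS()
      .setNumClasses(2)
      .run(parseData)
    println(model.weights)
    sc.stop()
  }
}
```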
4. Naive Bayes
Naive Bayes is a classification method based on Bayes' theorem and the assumption that features are conditionally independent. For a detailed introduction to the algorithm, see my other post, "Naive Bayes notes". In Spark MLlib the interface is simply called NaiveBayes; the code is as follows:
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkContext, SparkConf}

object NaiveBayesDemo {
  def main(args: Array[String]): Unit = {
    val length = args.length
    if (length != 1 && length != 2) {
      System.err.println("Usage: <input file> <lambda(optional)>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Smoothing parameter lambda, default value is 1.0
    val lambda = if (length == 2) args(1).toDouble else 1.0
    // Parse the data into LabeledPoint
    val parseData = data.map { line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(_.toDouble)))
    }
    // Split the data half and half into training and test datasets
    val splits = parseData.randomSplit(Array(0.5, 0.5), seed = 11L)
    val training = splits(0)
    val test = splits(1)
    // Train a model on the training dataset
    val model = NaiveBayes.train(training, lambda)
    // Predict the labels of the test dataset and measure accuracy
    val prediction = test.map(p => (model.predict(p.features), p.label))
    val accuracy = prediction.filter(x => x._1 == x._2).count().toDouble / test.count()
    println(accuracy)
    sc.stop()
  }
}
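The `lambda` argument is the additive (Laplace) smoothing term. A minimal standalone sketch of what it does to a class-conditional estimate, using toy counts of my own (not data from the post):

```scala
object SmoothingDemo {
  // P(feature | class) with additive smoothing:
  // (count + lambda) / (total + lambda * numFeatures)
  def smoothed(count: Double, total: Double, numFeatures: Int, lambda: Double): Double =
    (count + lambda) / (total + lambda * numFeatures)

  def main(args: Array[String]): Unit = {
    // A feature never seen with this class gets probability 0 without smoothing
    println(smoothed(0, 10, 5, 0.0)) // 0.0
    // With lambda = 1 it gets a small non-zero probability instead
    println(smoothed(0, 10, 5, 1.0)) // 1/15 ≈ 0.0667
  }
}
```

Without smoothing, a single unseen feature zeroes out the whole product of likelihoods, which is why a non-zero default such as 1.0 is used.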
5. SVM (Support Vector Machine)
An SVM finds a hyperplane that separates the data into two classes (+1 and -1 in the classical formulation; note that MLlib's interface expects 0/1 labels); the points closest to the separating hyperplane are called support vectors. In Spark MLlib the interface is SVMWithSGD; the code is as follows:
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkContext, SparkConf}

object SVM {
  def main(args: Array[String]): Unit = {
    if (args.length != 2) {
      System.err.println("Usage: <input file> <iteration number>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    val data = sc.textFile(args(0))
    // Iteration number
    val iteration = args(1).toInt
    // Parse the data into LabeledPoint
    val parseData = data.map { line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(_.toDouble)))
    }
    // Train a model
    val model = SVMWithSGD.train(parseData, iteration)
    // Check its coefficients
    val weight = model.weights
    println(weight)
    sc.stop()
  }
}
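A trained linear SVM classifies a point by which side of the hyperplane it falls on, i.e. by the sign of w·x against a threshold (0 by default). A minimal standalone sketch of that decision rule, with toy weights of my own choosing:

```scala
object DecisionDemo {
  // Linear decision function: non-negative margin -> label 1, else label 0
  // (mirroring MLlib's 0/1 label convention)
  def predict(w: Array[Double], x: Array[Double]): Int = {
    val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum
    if (margin >= 0) 1 else 0
  }

  def main(args: Array[String]): Unit = {
    val w = Array(1.0, -1.0) // a toy separating hyperplane
    println(predict(w, Array(3.0, 1.0))) // 1: positive side
    println(predict(w, Array(1.0, 3.0))) // 0: negative side
  }
}
```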
6. Decision Tree (to be written)
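Until that section is filled in, here is a minimal sketch of what the MLlib call would look like, based on the `DecisionTree.trainClassifier` API; the parameter values (gini impurity, depth 5, 32 bins) are placeholders of my own, not the post's, and the code has not been run against a live cluster:

```scala
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}

object DecisionTreeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf())
    // Same "label:f1 f2 ..." input format as the earlier examples
    val parseData = sc.textFile(args(0)).map { line =>
      val elem = line.split(":")
      LabeledPoint(elem(0).toDouble, Vectors.dense(elem(1).split(" ").map(_.toDouble)))
    }
    // 2 classes, no categorical features, gini impurity,
    // maxDepth = 5, maxBins = 32 (placeholder values)
    val model = DecisionTree.trainClassifier(
      parseData, 2, Map[Int, Int](), "gini", 5, 32)
    println(model.toDebugString)
    sc.stop()
  }
}
```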