Cross-Validation: Principles and Spark MLlib Usage Examples (Scala/Java/Python)
Source: Internet · Editor: 程序博客网 · Date: 2024/05/31 06:22
Cross-Validation
Core idea:
CrossValidator splits the dataset into several folds, each of which is used in turn for training and testing. For example, with k = 3, CrossValidator produces 3 (training, test) dataset pairs, each pair using 2/3 of the data for training and 1/3 for testing. For a given ParamMap, CrossValidator computes the average of the evaluation metric over the models fitted on the 3 different (training, test) pairs. Once the best ParamMap has been identified, CrossValidator finally re-fits the Estimator on the entire dataset using that best ParamMap.
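The splitting scheme described above can be sketched in plain Python, without Spark. The helper name `k_fold_pairs` is hypothetical, not part of MLlib; it only illustrates how k folds yield k (training, test) pairs:

```python
def k_fold_pairs(data, k):
    """Split data into k folds and yield (train, test) pairs.

    Each pair holds out one fold as the test set and uses the
    remaining k-1 folds together as the training set.
    """
    folds = [data[i::k] for i in range(k)]  # round-robin split into k folds
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# With k=3 and 12 items, each pair trains on 8 items and tests on 4,
# matching the 2/3 train / 1/3 test split described above.
pairs = list(k_fold_pairs(list(range(12)), 3))
for train, test in pairs:
    print(len(train), len(test))  # 8 4
```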
Example:
Note that cross-validation over a grid of parameters is expensive. In the example below, the grid has 3 values for hashingTF.numFeatures and 2 values for lr.regParam, and CrossValidator uses 2-fold cross-validation. This yields (3 × 2) × 2 = 12 different models to be trained. In realistic settings it is common to search over many more parameters and to use more folds (k = 3 and k = 10 are commonly used), so the cost of CrossValidator can be very high. Nevertheless, compared with heuristic hand-tuning, cross-validation remains one of the most useful parameter-selection methods available.
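The model count above is simple arithmetic, which can be checked directly. A minimal plain-Python sketch, purely illustrative:

```python
# Number of models CrossValidator must fit during the search:
# (number of grid points) x (number of folds).
# CrossValidator also performs one final refit on the full data afterwards.
num_features_values = 3   # hashingTF.numFeatures: 10, 100, 1000
reg_param_values = 2      # lr.regParam: 0.1, 0.01
folds = 2

models_trained = num_features_values * reg_param_values * folds
print(models_trained)  # 12
```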
Scala:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.Row

// Prepare training data from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0),
  (4L, "b spark who", 1.0),
  (5L, "g d a y", 0.0),
  (6L, "spark fly", 1.0),
  (7L, "was mapreduce", 0.0),
  (8L, "e spark program", 1.0),
  (9L, "a e c l", 0.0),
  (10L, "spark compile", 1.0),
  (11L, "hadoop software", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))

// We use a ParamGridBuilder to construct a grid of parameters to search over.
// With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
// this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

// We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
// This will allow us to jointly choose parameters for all Pipeline stages.
// A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
// Note that the evaluator here is a BinaryClassificationEvaluator and its default metric
// is areaUnderROC.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2)  // Use 3+ in practice

// Run cross-validation, and choose the best set of parameters.
val cvModel = cv.fit(training)

// Prepare test documents, which are unlabeled (id, text) tuples.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// Make predictions on test documents. cvModel uses the best model found (lrModel).
cvModel.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }

Java:
import java.util.Arrays;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator;
import org.apache.spark.ml.feature.HashingTF;
import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.ml.param.ParamMap;
import org.apache.spark.ml.tuning.CrossValidator;
import org.apache.spark.ml.tuning.CrossValidatorModel;
import org.apache.spark.ml.tuning.ParamGridBuilder;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Prepare training documents, which are labeled.
Dataset<Row> training = spark.createDataFrame(Arrays.asList(
  new JavaLabeledDocument(0L, "a b c d e spark", 1.0),
  new JavaLabeledDocument(1L, "b d", 0.0),
  new JavaLabeledDocument(2L, "spark f g h", 1.0),
  new JavaLabeledDocument(3L, "hadoop mapreduce", 0.0),
  new JavaLabeledDocument(4L, "b spark who", 1.0),
  new JavaLabeledDocument(5L, "g d a y", 0.0),
  new JavaLabeledDocument(6L, "spark fly", 1.0),
  new JavaLabeledDocument(7L, "was mapreduce", 0.0),
  new JavaLabeledDocument(8L, "e spark program", 1.0),
  new JavaLabeledDocument(9L, "a e c l", 0.0),
  new JavaLabeledDocument(10L, "spark compile", 1.0),
  new JavaLabeledDocument(11L, "hadoop software", 0.0)
), JavaLabeledDocument.class);

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
Tokenizer tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words");
HashingTF hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol())
  .setOutputCol("features");
LogisticRegression lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01);
Pipeline pipeline = new Pipeline()
  .setStages(new PipelineStage[] {tokenizer, hashingTF, lr});

// We use a ParamGridBuilder to construct a grid of parameters to search over.
// With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
// this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
ParamMap[] paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures(), new int[] {10, 100, 1000})
  .addGrid(lr.regParam(), new double[] {0.1, 0.01})
  .build();

// We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
// This will allow us to jointly choose parameters for all Pipeline stages.
// A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
// Note that the evaluator here is a BinaryClassificationEvaluator and its default metric
// is areaUnderROC.
CrossValidator cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2);  // Use 3+ in practice

// Run cross-validation, and choose the best set of parameters.
CrossValidatorModel cvModel = cv.fit(training);

// Prepare test documents, which are unlabeled.
Dataset<Row> test = spark.createDataFrame(Arrays.asList(
  new JavaDocument(4L, "spark i j k"),
  new JavaDocument(5L, "l m n"),
  new JavaDocument(6L, "mapreduce spark"),
  new JavaDocument(7L, "apache hadoop")
), JavaDocument.class);

// Make predictions on test documents. cvModel uses the best model found (lrModel).
Dataset<Row> predictions = cvModel.transform(test);
for (Row r : predictions.select("id", "text", "probability", "prediction").collectAsList()) {
  System.out.println("(" + r.get(0) + ", " + r.get(1) + ") --> prob=" + r.get(2)
    + ", prediction=" + r.get(3));
}

Python:
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

# Prepare test documents, which are unlabeled.
test = spark.createDataFrame([
    (4, "spark i j k"),
    (5, "l m n"),
    (6, "mapreduce spark"),
    (7, "apache hadoop")
], ["id", "text"])

# Make predictions on test documents. cvModel uses the best model found (lrModel).
prediction = cvModel.transform(test)
selected = prediction.select("id", "text", "probability", "prediction")
for row in selected.collect():
    print(row)
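Under the hood, CrossValidator averages the Evaluator's metric (here areaUnderROC) across the folds for each candidate ParamMap and keeps the best one. A plain-Python sketch of that selection logic; the metric values below are invented purely for illustration:

```python
# Hypothetical per-fold metric values for each parameter setting;
# in CrossValidator these numbers would come from the Evaluator.
fold_metrics = {
    ("numFeatures=10",  "regParam=0.1"):  [0.71, 0.69],
    ("numFeatures=100", "regParam=0.1"):  [0.80, 0.78],
    ("numFeatures=100", "regParam=0.01"): [0.83, 0.85],
}

def best_params(metrics):
    """Return the parameter map with the highest average fold metric."""
    averages = {params: sum(ms) / len(ms) for params, ms in metrics.items()}
    return max(averages, key=averages.get)

print(best_params(fold_metrics))  # ('numFeatures=100', 'regParam=0.01')
```

The chosen ParamMap is then used for the final refit on the full training data, which is the model `cvModel.transform` applies above.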