Spark study notes: a Kaggle-style machine learning classification task on Spark
1. Download the data and write it to HDFS
miaofu@master:~$ hadoop fs -ls /user/miaofu/covtype
-rw-r--r--   2 miaofu supergroup   75169317 2016-09-17 23:20 /user/miaofu/covtype
2. Start the Spark cluster
miaofu@master:~/spark-1.6.2-bin-hadoop2.6$ jps
6649 ResourceManager
10821 Worker
2434 NameNode
2680 DataNode
2938 SecondaryNameNode
31714 SparkSubmit
10705 Master
32000 Jps
6786 NodeManager
3. Enter the Spark shell
miaofu@master:~/spark-1.6.2-bin-hadoop2.6$ bin/spark-shell --master spark://master:7077
16/09/19 13:19:18 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.7.0_95)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/09/19 13:19:30 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/09/19 13:19:37 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/09/19 13:19:37 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/09/19 13:19:40 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
SQL context available as sqlContext.
4. Read the data and do a quick inspection
scala> val rawData = sc.textFile("hdfs:///user/miaofu/covtype")
rawData: org.apache.spark.rdd.RDD[String] = hdfs:///user/miaofu/covtype MapPartitionsRDD[1] at textFile at <console>:27

scala> rawData.count()
res1: Long = 581012

scala> val line = rawData.take(4)(1)
line: String = 2590,56,2,212,-6,390,220,235,151,6225,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,5

scala> val values = line.split(",").map(_.toDouble)
values: Array[Double] = Array(2590.0, 56.0, 2.0, 212.0, -6.0, 390.0, 220.0, 235.0, 151.0, 6225.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 5.0)

scala> values.init
res5: Array[Double] = Array(2590.0, 56.0, 2.0, 212.0, -6.0, 390.0, 220.0, 235.0, 151.0, 6225.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)

scala> values.last
res6: Double = 5.0
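Each covtype record is comma-separated: 54 feature values followed by the class label (cover type 1-7) in the last column, which is what the `values.init` / `values.last` split above relies on. Before modeling it is also worth checking the class distribution; on the RDD that would be `rawData.map(_.split(",").last).countByValue()`. A plain-Scala sketch of the same per-line logic (the `parseLine` helper and the three-line sample are made up for illustration, not from the original session):

```scala
// Split one covtype-style line into (features, label):
// everything except the last value is a feature, the last value is the label.
def parseLine(line: String): (Array[Double], Double) = {
  val values = line.split(",").map(_.toDouble)
  (values.init, values.last)
}

// Count how often each label occurs -- the same logic countByValue
// applies to an RDD, here on a local collection.
def labelCounts(lines: Seq[String]): Map[Double, Int] =
  lines.map(l => parseLine(l)._2).groupBy(identity).mapValues(_.size).toMap

// Tiny made-up sample, not real covtype rows:
val sample = Seq("1,2,3,5", "4,5,6,5", "7,8,9,2")
println(labelCounts(sample))
```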
5. Build the training, validation, and test sets
scala> import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg._

scala> import org.apache.spark.mllib.regression._
import org.apache.spark.mllib.regression._

scala> val data1 = rawData.map { line =>
     |   line.split(",").map(_.toDouble)
     | }
data1: org.apache.spark.rdd.RDD[Array[Double]] = MapPartitionsRDD[4] at map at <console>:37

scala> val data2 = data1.map { v =>
     |   LabeledPoint(v.last - 1, Vectors.dense(v.init))
     | }
data2: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[5] at map at <console>:39

scala> val Array(trainData, cvData, testData) =
     |   data2.randomSplit(Array(0.8, 0.1, 0.1))
trainData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[9] at randomSplit at <console>:63
cvData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[10] at randomSplit at <console>:63
testData: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[11] at randomSplit at <console>:63

scala> trainData.cache()
res27: trainData.type = MapPartitionsRDD[9] at randomSplit at <console>:63

scala> cvData.cache()
res28: cvData.type = MapPartitionsRDD[10] at randomSplit at <console>:63

scala> testData.cache()
res29: testData.type = MapPartitionsRDD[11] at randomSplit at <console>:63
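Note the `v.last - 1` in the `LabeledPoint` construction: MLlib's `DecisionTree.trainClassifier` expects class labels in the range 0 to numClasses-1, while the covtype labels run from 1 to 7. A plain-Scala sketch of that per-record transformation (the helper name and the short sample record are made up for illustration):

```scala
// Turn one parsed covtype record into a (label, features) pair:
// the raw label 1..7 is shifted down to 0..6 for MLlib's classifier.
def toLabeledPair(values: Array[Double]): (Double, Array[Double]) =
  (values.last - 1, values.init)

// Made-up short record with raw label 5 (real records have 54 features).
val v = Array(2590.0, 56.0, 2.0, 5.0)
val (label, features) = toLabeledPair(v)
```

In the real pipeline this pair becomes `LabeledPoint(label, Vectors.dense(features))`, exactly as in the shell session above.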
6. Define an evaluation function

scala> import org.apache.spark.mllib.evaluation._
import org.apache.spark.mllib.evaluation._

scala> import org.apache.spark.mllib.tree._
import org.apache.spark.mllib.tree._

scala> import org.apache.spark.mllib.tree.model._
import org.apache.spark.mllib.tree.model._

scala> import org.apache.spark.rdd._
import org.apache.spark.rdd._

scala> def getMetrics(model: DecisionTreeModel, data: RDD[LabeledPoint]): MulticlassMetrics = {
     |   val predictionsAndLabels = data.map(e =>
     |     (model.predict(e.features), e.label)
     |   )
     |   new MulticlassMetrics(predictionsAndLabels)
     | }
getMetrics: (model: org.apache.spark.mllib.tree.model.DecisionTreeModel, data: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint])org.apache.spark.mllib.evaluation.MulticlassMetrics
7. Model training and testing
scala> val model = DecisionTree.trainClassifier(
     |   trainData, 7, Map[Int,Int](), "gini", 4, 100)
model: org.apache.spark.mllib.tree.model.DecisionTreeModel = DecisionTreeModel classifier of depth 4 with 31 nodes

scala> val metrics = getMetrics(model, cvData)
metrics: org.apache.spark.mllib.evaluation.MulticlassMetrics = org.apache.spark.mllib.evaluation.MulticlassMetrics@2c181731

scala> metrics.confusionMatrix
res31: org.apache.spark.mllib.linalg.Matrix =
14260.0  6593.0   7.0     0.0    0.0   0.0  340.0
5485.0   22277.0  483.0   20.0   3.0   0.0  38.0
0.0      443.0    3042.0  82.0   0.0   0.0  0.0
0.0      0.0      169.0   104.0  0.0   0.0  0.0
0.0      864.0    27.0    0.0    14.0  0.0  0.0
0.0      440.0    1168.0  100.0  0.0   0.0  0.0
1101.0   26.0     0.0     0.0    0.0   0.0  927.0

scala> metrics.precision
res32: Double = 0.7002568389843655
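The overall accuracy reported by `metrics.precision` can be recovered directly from the confusion matrix: sum the diagonal (correctly classified examples) and divide by the total number of validation examples. A quick plain-Scala check using the matrix values printed above:

```scala
// Confusion matrix from the session above (rows = actual class, cols = predicted).
val cm: Array[Array[Double]] = Array(
  Array(14260, 6593,  7,    0,   0,  0, 340),
  Array(5485,  22277, 483,  20,  3,  0, 38),
  Array(0,     443,   3042, 82,  0,  0, 0),
  Array(0,     0,     169,  104, 0,  0, 0),
  Array(0,     864,   27,   0,   14, 0, 0),
  Array(0,     440,   1168, 100, 0,  0, 0),
  Array(1101,  26,    0,    0,   0,  0, 927)
).map(_.map(_.toDouble))

val correct  = (0 until 7).map(i => cm(i)(i)).sum  // diagonal: 40624 correct
val total    = cm.map(_.sum).sum                   // 58013 validation examples
val accuracy = correct / total                     // ~0.7003, matching metrics.precision
```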
scala> metrics.recall
res33: Double = 0.7002568389843655
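A depth-4 tree with roughly 0.70 accuracy is only a baseline. A natural next step, not part of the original session, is to grid-search the three hyperparameters passed to `trainClassifier` (impurity, maximum depth, maximum bins) and pick the combination with the best validation accuracy. A sketch, where the candidate values are illustrative choices rather than recommendations:

```scala
// Candidate hyperparameter combinations. For each one you would train on
// trainData and evaluate on cvData with the Spark 1.6 API used above, e.g.:
//   val model = DecisionTree.trainClassifier(
//     trainData, 7, Map[Int,Int](), impurity, depth, bins)
//   val accuracy = getMetrics(model, cvData).precision
val grid: Seq[(String, Int, Int)] =
  for {
    impurity <- Seq("gini", "entropy")
    depth    <- Seq(4, 10, 20)
    bins     <- Seq(40, 300)
  } yield (impurity, depth, bins)
// 2 * 3 * 2 = 12 combinations; after choosing the best on cvData,
// report the final accuracy once on the held-out testData.
```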
8. Web UI