[Error] Spark: scala.MatchError (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
Source: Internet  Editor: 程序博客网  Date: 2024/05/22 08:14
Scenario:
Multiclass classification
Failing code:
/** Map each document's words to a term-frequency vector */
val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("rawFeatures")
  .setNumFeatures(500)
  .transform(DF_classAndDoc)

/** Compute the inverse document frequency */
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val rescaled = idf
  .fit(hashingTF)       // compute the IDF for each term
  .transform(hashingTF) // convert term-frequency vectors to TF-IDF vectors

/** Convert the DataFrame into an RDD for model training */
val labelAndFeaturesRDD = rescaled.select($"label", $"features").rdd.map {
  case Row(label: String, features: Vector) =>
    LabeledPoint(label.toDouble, features) // features.toDense
}
labelAndFeaturesRDD
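Explanation:
LabeledPoint() here comes from mllib, while the code above uses the ML package of spark-2.1.0, where the IDF output is of type org.apache.spark.ml.linalg.Vector. The types therefore do not match, which surfaces as the scala.MatchError in the title.
Spark 2 and Spark 1 are not compatible in this respect: the same code was tested on spark-1.6.3 and runs without error.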
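If the mllib RDD API is still needed on Spark 2.x, one possible fix is to match on the ML vector type and convert it explicitly. The following is a minimal sketch, not the original author's code: the object name `TfIdfToLabeledPoint`, the helper `toLabeledPoints`, and the sample rows standing in for `DF_classAndDoc` are assumptions made for illustration. It relies on `Vectors.fromML`, which is available in `org.apache.spark.mllib.linalg` from Spark 2.0 onward.

```scala
import org.apache.spark.ml.feature.{HashingTF, IDF}
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

object TfIdfToLabeledPoint {

  // Hypothetical helper: expects a String "label" column and an ML-vector "features" column.
  def toLabeledPoints(rescaled: DataFrame): RDD[LabeledPoint] =
    rescaled.select("label", "features").rdd.map {
      // Match on the ML vector type that IDF actually produces in Spark 2.x ...
      case Row(label: String, features: MLVector) =>
        // ... and convert it to the mllib vector that mllib's LabeledPoint expects.
        LabeledPoint(label.toDouble, OldVectors.fromML(features))
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("tfidf-to-labeledpoint")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample rows standing in for DF_classAndDoc (assumption).
    val docs = Seq(
      ("0", Seq("spark", "sql", "row")),
      ("1", Seq("hadoop", "mapreduce")),
      ("2", Seq("scala", "match", "error"))
    ).toDF("label", "words")

    val tf = new HashingTF()
      .setInputCol("words")
      .setOutputCol("rawFeatures")
      .setNumFeatures(500)
      .transform(docs)
    val rescaled = new IDF()
      .setInputCol("rawFeatures")
      .setOutputCol("features")
      .fit(tf)
      .transform(tf)

    toLabeledPoints(rescaled).collect().foreach(println)
    spark.stop()
  }
}
```

If the rest of the job can stay in the DataFrame-based API, another option is to import `org.apache.spark.ml.feature.LabeledPoint` instead, which takes the ML vector directly and needs no conversion.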
Solution:
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.linalg.{Vector => mlV}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.Row

// Prepare training data from a list of (id, text, label) tuples.
val training = spark.createDataFrame(Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0),
  (4L, "b spark who", 1.0),
  (5L, "g d a y", 0.0),
  (6L, "spark fly", 1.0),
  (7L, "was mapreduce", 0.0),
  (8L, "e spark program", 1.0),
  (9L, "a e c l", 0.0),
  (10L, "spark compile", 1.0),
  (11L, "hadoop software", 0.0)
)).toDF("id", "text", "label")

// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
// ML replacement for the mllib LogisticRegressionWithLBFGS().setNumClasses(5); .setMaxIter(10) is optional.
val lr = new LogisticRegression().setFamily("multinomial")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Grid of hyperparameters to search over during cross-validation.
val paramGrid = new ParamGridBuilder()
  .addGrid(hashingTF.numFeatures, Array(10, 100, 1000))
  .addGrid(lr.regParam, Array(0.1, 0.01))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new MulticlassClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(2) // Use 3+ in practice

val cvModel = cv.fit(training)

// Unlabeled test documents; the id column is kept because it is selected again after transform.
val test = spark.createDataFrame(Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (3L, "hadoop mapreduce"),
  (7L, "apache hadoop")
)).toDF("id", "text")

// The probability column is an org.apache.spark.ml.linalg.Vector, so match on mlV.
cvModel.transform(test)
  .select("id", "text", "probability", "prediction")
  .collect()
  .foreach { case Row(id: Long, text: String, prob: mlV, prediction: Double) =>
    println(s"($id, $text) --> prob=$prob, prediction=$prediction")
  }