Spark SQL: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
I've recently been working on a recommender-system project on Spark 2.0, which needs to produce a recommendation list for each userid, so I used a UDF. When the UDF executed per row, it failed with:
org.apache.spark.SparkException: Failed to execute user defined function($anonfun$calcMaxSimilarity$2$1: (string, array<string>) => array<struct<_1:string,_2:float>>)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.Tuple2
at com.allyes.awise.eng.rec.personal.cf.ItemBasedRec$$anonfun$calcMaxSimilarity$2$1$$anonfun$apply$4$$anonfun$10.apply(ItemBasedRec.scala:153)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.allyes.awise.eng.rec.personal.cf.ItemBasedRec$$anonfun$calcMaxSimilarity$2$1$$anonfun$apply$4.apply(ItemBasedRec.scala:153)
at com.allyes.awise.eng.rec.personal.cf.ItemBasedRec$$anonfun$calcMaxSimilarity$2$1$$anonfun$apply$4.apply(ItemBasedRec.scala:113)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:221)
at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:428)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at com.allyes.awise.eng.rec.personal.cf.ItemBasedRec$$anonfun$calcMaxSimilarity$2$1.apply(ItemBasedRec.scala:113)
at com.allyes.awise.eng.rec.personal.cf.ItemBasedRec$$anonfun$calcMaxSimilarity$2$1.apply(ItemBasedRec.scala:99)
Here the UDF runs over one DataFrame, and itemFactor is another DataFrame. Since a DataFrame cannot be used inside another DataFrame's operations, the only option was to collect itemDF to the driver and then broadcast it to the executors. The problem was in that collect-to-driver step.
The original code:
val itemFeatures = joinedDf.filter(_ != null).rdd
  .map((r: Row) => {
    (r.getString(0), (r.getAs[Seq[Float]](1), r.getAs[Seq[(String, Float)]](2)))
  })
  .collect().toMap
When reading individual columns from a Row, each one has to be fetched by its type (String, Seq, and so on). I had been fetching column 2 as
r.getAs[Seq[(String,Float)]](2)
and it kept failing with the error above. I could see the Seq's values, but any transformation of the Seq threw, and so did printing it.
In the end, thanks to a colleague who spent half a day helping me dig into it, the cause turned out to be this: when reading columns from a Row, you have to use the type Spark actually stores. Simple types like String or Seq[String] can be read directly, but a struct<_1:string,_2:float> column is not stored as a Scala Tuple2 at all: each element is itself a Row (a GenericRowWithSchema). getAs[T] performs no conversion, only a cast, so r.getAs[Seq[(String,Float)]](2) appears to return the values, but the moment they are used as tuples inside the DataFrame operation, the ClassCastException above fires. (That's my understanding, at least; corrections welcome.)
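The "values are there but every use throws" symptom is classic JVM type erasure: the cast is unchecked and only fails once an element is actually consumed as the claimed type. A minimal pure-Scala sketch of the same effect (no Spark involved; `values` and `wrong` are made-up names for illustration):

```scala
// Element types are erased at runtime, so casting a Seq to the wrong
// element type "succeeds" -- exactly like getAs[Seq[(String, Float)]]
// on a column of Rows.
val values: Seq[Any] = Seq(("a", 1.0f), ("b", 2.0f))
val wrong = values.asInstanceOf[Seq[(Int, Int)]] // no exception here

// The ClassCastException fires only when an element is used as claimed:
val failed =
  try { val n: Int = wrong.head._1; false } // forces the unbox -> throws
  catch { case _: ClassCastException => true }
println(failed) // true
```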
The correct version:
// Convert DataFrame data to broadcast base data.
// Fields: id, features
val itemFeatures = joinedDf.filter(_ != null).rdd
  .map((r: Row) => {
    val sims: Seq[(String, Float)] = r.getAs[Seq[Row]](2).map(x => {
      (x.getString(0), x.getFloat(1))
    })
    (r.getString(0), (r.getAs[Seq[Float]](1), sims))
  })
  .collect().toMap
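For completeness, the collected map would then be shipped to the executors via a broadcast variable and looked up inside the UDF, roughly like this (a sketch only: `spark`, `recommend`, and the scoring body are placeholders, not code from this project; the UDF signature mirrors the (string, array<string>) => array<struct<_1:string,_2:float>> one in the error above):

```scala
import org.apache.spark.sql.functions.udf

// One read-only copy of the map per executor, instead of re-serializing
// the whole map with every task's closure.
val itemFeaturesBc = spark.sparkContext.broadcast(itemFeatures)

// Hypothetical UDF: for each item in the user's history, pull its
// precomputed (item, similarity) list out of the broadcast map.
// Real code would score, rank and truncate the result.
val recommend = udf { (userId: String, items: Seq[String]) =>
  items.flatMap(item => itemFeaturesBc.value.get(item).toSeq.flatMap(_._2))
}
```

Returning Seq[(String, Float)] from the UDF is what produces the array<struct<_1:string,_2:float>> column type seen in the stack trace.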
I'll dig into the root cause more next week. It's Friday, time to head home.....