Analyzing and fixing the "Unable to find encoder for type stored in a Dataset" error in Spark 2.0 DataFrame map operations


When performing a map operation on a DataFrame, we often run into the "Unable to find encoder for type stored in a Dataset" error, which reads as follows:

error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. resDf_upd.map(row => {
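The error typically appears when a lambda passed to map over a DataFrame (i.e. a Dataset[Row]) returns a type for which no implicit Encoder is in scope. A minimal sketch of both a working and a failing case (the column names and values here are hypothetical, chosen only for illustration):

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder()
  .appName("encoder-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(("Justin", 19)).toDF("name", "age")

// Works: String is a primitive type, so an implicit Encoder[String]
// is provided by spark.implicits._
val names = df.map(row => row.getString(0))

// Fails to compile with the error above: there is no implicit Encoder[Row]
// val rows = df.map(row => Row(row.getString(0), row.getInt(1) + 1))
```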


Consulting the official Spark documentation, we find the following description of Dataset:
Dataset is Spark SQL's strongly-typed API for working with structured data, i.e. records with a known schema.
Datasets are lazy and structured query expressions are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computational query required to produce the data (for a given Spark SQL session).
A Dataset is a result of executing a query expression against data storage like files, Hive tables or JDBC databases. The structured query expression can be described by a SQL query, a Column-based SQL expression or a Scala/Java lambda function. And that is why Dataset operations are available in three variants.


This tells us that operating on a Dataset requires a corresponding encoder. Here is the example from the official documentation:
// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
// Array(Map("name" -> "Justin", "age" -> 19))
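As the error message itself notes, Product types (case classes) are supported simply by importing spark.implicits._, so instead of defining a Kryo encoder one can often map into a case class and let Spark derive the encoder. A sketch, assuming the same teenagersDF as in the official example (the Teenager class is hypothetical):

```scala
case class Teenager(name: String, age: Int)

// With spark.implicits._ in scope, an Encoder[Teenager] is derived
// automatically, so no explicit kryo/ExpressionEncoder is needed
val teenagers = teenagersDF.map { row =>
  Teenager(row.getAs[String]("name"), row.getAs[Int]("age"))
}
```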


In other words, before calling map we must first define an Encoder, which adds a fair amount of extra work.
Fortunately, to keep things simple, Dataset provides a conversion to an RDD. We only need to change dataframe.map to dataframe.rdd.map. Quite handy, isn't it?
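The workaround can be sketched as follows (resDf_upd is the DataFrame named in the error message; its schema and the row transformation here are hypothetical):

```scala
import org.apache.spark.sql.Row

// Dropping to the RDD API side-steps the Encoder requirement:
// RDD[Row] is serialized with Java/Kryo serialization rather than
// Spark SQL encoders, so no implicit Encoder[Row] is needed
val resultRdd = resDf_upd.rdd.map { row =>
  Row(row.getString(0), row.getInt(1) + 1)
}

// If a DataFrame is needed afterwards, re-apply a schema
val resultDf = spark.createDataFrame(resultRdd, resDf_upd.schema)
```

Note that this convenience has a cost: once the data leaves the Dataset API, the Catalyst optimizer and Tungsten's compact binary row format no longer apply to the mapped stage, so staying within Datasets (with an explicit or derived encoder) is generally preferable for performance-sensitive code.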