Analyzing and fixing the "Unable to find encoder for type stored in a Dataset" error in Spark 2.0 DataFrame map operations


When performing a map operation on a DataFrame, we often run into the "Unable to find encoder for type stored in a Dataset" error, which reads as follows:

error: Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases. resDf_upd.map(row => {
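The error typically appears when a lambda passed to map over a DataFrame (i.e. a Dataset[Row]) returns a type for which no implicit Encoder is in scope. A minimal sketch of both a working and a failing case (the column names and values here are hypothetical, chosen only for illustration):

```scala
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder()
  .appName("encoder-demo")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(("Justin", 19)).toDF("name", "age")

// Works: String is a primitive type, so an implicit Encoder[String]
// is provided by spark.implicits._
val names = df.map(row => row.getString(0))

// Fails to compile with the error above: there is no implicit Encoder[Row]
// val rows = df.map(row => Row(row.getString(0), row.getInt(1) + 1))
```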


Consulting the official Spark documentation, we find the following description of Dataset:
Dataset is Spark SQL's strongly-typed API for working with structured data, i.e. records with a known schema.
Datasets are lazy and structured query expressions are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computational query required to produce the data (for a given Spark SQL session).
A Dataset is a result of executing a query expression against data storage like files, Hive tables or JDBC databases. The structured query expression can be described by a SQL query, a Column-based SQL expression or a Scala/Java lambda function. And that is why Dataset operations are available in three variants.


This tells us that operating on a Dataset requires a corresponding encoder. Here is the example from the official documentation:
// No pre-defined encoders for Dataset[Map[K,V]], define explicitly
implicit val mapEncoder = org.apache.spark.sql.Encoders.kryo[Map[String, Any]]
// Primitive types and case classes can be also defined as
// implicit val stringIntMapEncoder: Encoder[Map[String, Any]] = ExpressionEncoder()
// row.getValuesMap[T] retrieves multiple columns at once into a Map[String, T]
teenagersDF.map(teenager => teenager.getValuesMap[Any](List("name", "age"))).collect()
// Array(Map("name" -> "Justin", "age" -> 19))
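As the error message itself notes, Product types (case classes) are supported simply by importing spark.implicits._, so instead of defining a Kryo encoder one can often map into a case class and let Spark derive the encoder. A sketch, assuming the same teenagersDF as in the official example (the Teenager class is hypothetical):

```scala
case class Teenager(name: String, age: Int)

// With spark.implicits._ in scope, an Encoder[Teenager] is derived
// automatically, so no explicit kryo/ExpressionEncoder is needed
val teenagers = teenagersDF.map { row =>
  Teenager(row.getAs[String]("name"), row.getAs[Int]("age"))
}
```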


In other words, before calling map we must first define an Encoder, which adds a fair amount of extra work.
Fortunately, to keep things simple, Dataset provides a conversion to an RDD. We only need to change dataframe.map to dataframe.rdd.map. Quite handy, isn't it?
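The workaround can be sketched as follows (resDf_upd is the DataFrame named in the error message; its schema and the row transformation here are hypothetical):

```scala
import org.apache.spark.sql.Row

// Dropping to the RDD API side-steps the Encoder requirement:
// RDD[Row] is serialized with Java/Kryo serialization rather than
// Spark SQL encoders, so no implicit Encoder[Row] is needed
val resultRdd = resDf_upd.rdd.map { row =>
  Row(row.getString(0), row.getInt(1) + 1)
}

// If a DataFrame is needed afterwards, re-apply a schema
val resultDf = spark.createDataFrame(resultRdd, resDf_upd.schema)
```

Note that this convenience has a cost: once the data leaves the Dataset API, the Catalyst optimizer and Tungsten's compact binary row format no longer apply to the mapped stage, so staying within Datasets (with an explicit or derived encoder) is generally preferable for performance-sensitive code.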