Failed to merge incompatible data types StringType and BinaryType


Loading Parquet files with Spark 1.4.0 fails with the following error:

org.apache.spark.SparkException: Failed to merge incompatible schemas StructType(StructField(ip,StringType,true), StructField(log_time,StringType,true), StructField(pos_type,StringType,true), StructField(pos_value,StringType,true), StructField(user_id,StringType,true), StructField(device_id,StringType,true), StructField(cookie_id,StringType,true), StructField(from_source,IntegerType,true), StructField(platform,StringType,true), StructField(version,StringType,true), StructField(channel,StringType,true), StructField(c_detail,StringType,true), StructField(user_role,StringType,true), StructField(user_type,StringType,true), StructField(school,StringType,true), StructField(child,StringType,true), StructField(list_version,StringType,true), StructField(tags,StringType,true), StructField(url,StringType,true), StructField(refer,StringType,true), StructField(deal_id,StringType,true), StructField(deal_n,IntegerType,true), StructField(deal_x,IntegerType,true), StructField(deal_y,IntegerType,true), StructField(deal_source_type,StringType,true), StructField(deal_exposure_time,StringType,true), StructField(exposure_num,StringType,true), StructField(img_version,StringType,true), StructField(screen_version,StringType,true), StructField(page,IntegerType,true), StructField(deal_show_type,StringType,true), StructField(log_time_stamp,LongType,true), StructField(deal_exposure_time_stamp,LongType,true)) and StructType(StructField(ip,BinaryType,true), StructField(log_time,BinaryType,true), StructField(pos_type,BinaryType,true), StructField(pos_value,BinaryType,true), StructField(user_id,BinaryType,true), StructField(device_id,BinaryType,true), StructField(cookie_id,BinaryType,true), StructField(from_source,IntegerType,true), StructField(platform,BinaryType,true), StructField(version,BinaryType,true), StructField(channel,BinaryType,true), StructField(c_detail,BinaryType,true), StructField(user_role,BinaryType,true), StructField(user_type,BinaryType,true), 
StructField(school,BinaryType,true), StructField(child,BinaryType,true), StructField(list_version,BinaryType,true), StructField(tags,BinaryType,true), StructField(url,BinaryType,true), StructField(refer,BinaryType,true), StructField(deal_id,BinaryType,true), StructField(deal_n,IntegerType,true), StructField(deal_x,IntegerType,true), StructField(deal_y,IntegerType,true), StructField(deal_source_type,BinaryType,true), StructField(deal_exposure_time,BinaryType,true), StructField(exposure_num,BinaryType,true), StructField(img_version,BinaryType,true), StructField(screen_version,BinaryType,true), StructField(page,IntegerType,true), StructField(deal_show_type,BinaryType,true), StructField(log_time_stamp,LongType,true), StructField(deal_exposure_time_stamp,LongType,true))
        at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$2.apply(newParquet.scala:531)
        at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$2.apply(newParquet.scala:529)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
        at scala.collection.IndexedSeqOptimized$class.reduceLeft(IndexedSeqOptimized.scala:68)
        at scala.collection.mutable.ArrayBuffer.reduceLeft(ArrayBuffer.scala:47)
        at scala.collection.TraversableOnce$class.reduceLeftOption(TraversableOnce.scala:190)
        at scala.collection.AbstractTraversable.reduceLeftOption(Traversable.scala:105)
        at scala.collection.TraversableOnce$class.reduceOption(TraversableOnce.scala:197)
        at scala.collection.AbstractTraversable.reduceOption(Traversable.scala:105)
        at org.apache.spark.sql.parquet.ParquetRelation2$.readSchema(newParquet.scala:529)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.org$apache$spark$sql$parquet$ParquetRelation2$MetadataCache$$readSchema(newParquet.scala:434)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$11.apply(newParquet.scala:369)
        at scala.Option.orElse(Option.scala:257)
        at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache.refresh(newParquet.scala:369)
        at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache$lzycompute(newParquet.scala:126)
        at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$metadataCache(newParquet.scala:124)
        at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165)
        at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$dataSchema$1.apply(newParquet.scala:165)
        at scala.Option.getOrElse(Option.scala:120)
        at org.apache.spark.sql.parquet.ParquetRelation2.dataSchema(newParquet.scala:165)
        at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:506)
        at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:505)
        at org.apache.spark.sql.sources.LogicalRelation.<init>(LogicalRelation.scala:30)
        at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:438)
        at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:264)
        at com.zhe800.toona.lr.computation.QianBai$.main(QianBai.scala:817)
        at com.zhe800.toona.lr.computation.QianBai.main(QianBai.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:665)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:170)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:193)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:112)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: Failed to merge incompatible data types StringType and BinaryType
        at org.apache.spark.sql.types.StructType$.merge(StructType.scala:265)
        at org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$4.apply(StructType.scala:239)
        at org.apache.spark.sql.types.StructType$$anonfun$merge$1$$anonfun$apply$4.apply(StructType.scala:237)
        at scala.Option.map(Option.scala:145)
        at org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:237)
        at org.apache.spark.sql.types.StructType$$anonfun$merge$1.apply(StructType.scala:233)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at org.apache.spark.sql.types.StructType$.merge(StructType.scala:233)
        at org.apache.spark.sql.types.StructType.merge(StructType.scala:191)
        at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$readSchema$2.apply(newParquet.scala:530)
        ... 36 more

This was code copied directly from someone else — same environment, same code, yet it errors, which was baffling. The stack trace shows what is actually wrong: when Spark merges the schemas of the Parquet files, one set of files declares these columns as StringType while another declares them as BinaryType (Parquet stores strings as BINARY, and not all writers add the UTF8 annotation that marks them as strings). The Spark official documentation explains this: http://spark.apache.org/docs/latest/sql-programming-guide.html

It states:

> `spark.sql.parquet.binaryAsString` (default: `false`) — Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do not differentiate between binary data and strings when writing out the Parquet schema. This flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.

So adding the following to the code:

sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

solves the problem.
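For context, here is a minimal sketch of where the setting goes in a Spark 1.4-era job using the old `SQLContext` API (the app name and input path are illustrative, not from the original code). The key point is that the flag must be set before the first Parquet read, since the schema is resolved and cached when the relation is created:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object BinaryAsStringDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("binaryAsString-demo"))
    val sqlContext = new SQLContext(sc)

    // Set BEFORE reading: BINARY columns written without the UTF8 annotation
    // are then interpreted as StringType, so they merge cleanly with files
    // whose schema already says StringType.
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

    val df = sqlContext.read.parquet("/path/to/parquet")  // hypothetical path
    df.printSchema()
  }
}
```

The same flag can also be passed at submit time without touching the code, e.g. `spark-submit --conf spark.sql.parquet.binaryAsString=true ...`.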
