Creating a DataFrame from a SparkSQL schema


Creating a DataFrame from a case class

Converting an RDD to a DataFrame via a case class is the approach we use most often; we then call DF.registerTempTable to turn the DataFrame into a table for SQL operations. In early versions (1.4.1), however, case classes had a limitation: they could not have more than 22 fields (tuples had the same limit). So when a record has many fields, a case class is an inconvenient way to create a DataFrame.
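For reference, the case-class route looks roughly like the following sketch (a minimal example against the Spark 1.4.x API, assuming a comma-separated file with two columns, name and age; the names Person and people.txt follow the official examples and are illustrative):

```scala
import org.apache.spark.sql.SQLContext

// The case class defines the schema implicitly (limited to 22 fields in 1.4.x).
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc) // sc is an existing SparkContext
import sqlContext.implicits._       // brings .toDF() into scope

val peopleDF = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt)) // fields converted to the declared types
  .toDF()

peopleDF.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 20").collect()
```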

Creating a DataFrame from a SparkSQL schema

Spark provides another way to create a DataFrame: building a schema explicitly. The following is the example from the official documentation:

// Source: http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#inferring-the-schema-using-reflection
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Import Row.
import org.apache.spark.sql.Row

// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrames as a table.
peopleDataFrame.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index or by field name.
results.map(t => "Name: " + t(0)).collect().foreach(println)

When I carried the example above over into my own code, I got an error:

failure: Lost task 17.3 in stage 0.0 (TID 52, zdh7en): java.lang.ClassCastException:
java.lang.Integer cannot be cast to org.apache.spark.sql.types.UTF8String
  at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$1$$anonfun$apply$48.apply(Cast.scala:354)
  at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:111)

The gist of this is a type mismatch. Looking at my code, I saw that I had copied the following line over without changing it; my record has fields of several different types, but this line declares every field as StringType:

val schema = StructType(schemaString.split(",").map(fieldName => StructField(fieldName, StringType, true)))

So it should be written like this instead:

val schema = StructType(Array(
  StructField("x1", IntegerType, true),
  StructField("x2", IntegerType, true),
  StructField("x3", StringType, true),
  StructField("x4", DecimalType(18, 8), true)))

Note that you cannot write a bare DecimalType: the precision and scale must be given in parentheses, e.g. DecimalType(18, 8). One more point: if your code has already created an hc (HiveContext) object, there is no need to create a separate sqlContext, since HiveContext extends SQLContext.
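The Row values must also match the declared types, or the same ClassCastException can surface at runtime. Putting the pieces together, a sketch (assuming a comma-separated input file with the four columns x1..x4 above; the path data.txt and the hc variable name are illustrative):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("x1", IntegerType, true),
  StructField("x2", IntegerType, true),
  StructField("x3", StringType, true),
  StructField("x4", DecimalType(18, 8), true)))

// Convert each field to the type declared in the schema --
// leaving every field as a String is what triggers the ClassCastException.
val rowRDD = sc.textFile("data.txt") // illustrative path
  .map(_.split(","))
  .map(p => Row(p(0).trim.toInt, p(1).trim.toInt, p(2),
                new java.math.BigDecimal(p(3).trim)))

val df = hc.createDataFrame(rowRDD, schema) // hc: an existing HiveContext
df.registerTempTable("t")
```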
