Creating a DataFrame from a SparkSQL schema


Creating a DataFrame from a case class

Converting an RDD to a DataFrame via a case class is the approach we use most often; we then call DF.registerTempTable to turn the DataFrame into a table for SQL operations. In early versions (1.4.1), however, case classes had a limitation: they could not have more than 22 fields (tuples had the same limit). So when a record has many fields, a case class is an inconvenient way to create a DataFrame.
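For reference, the case-class route looks roughly like the following sketch (a minimal example against the Spark 1.4.x API, assuming a comma-separated file with two columns, name and age; the names Person and people.txt follow the official examples and are illustrative):

```scala
import org.apache.spark.sql.SQLContext

// The case class defines the schema implicitly (limited to 22 fields in 1.4.x).
case class Person(name: String, age: Int)

val sqlContext = new SQLContext(sc) // sc is an existing SparkContext
import sqlContext.implicits._       // brings .toDF() into scope

val peopleDF = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt)) // fields converted to the declared types
  .toDF()

peopleDF.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 20").collect()
```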

Creating a DataFrame from a SparkSQL schema

Spark provides another way to create a DataFrame: building a schema explicitly. The following is the example from the official documentation:

// Source: http://spark.apache.org/docs/1.4.1/sql-programming-guide.html#inferring-the-schema-using-reflection
// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Import Row.
import org.apache.spark.sql.Row

// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrames as a table.
peopleDataFrame.registerTempTable("people")

// SQL statements can be run by using the sql methods provided by sqlContext.
val results = sqlContext.sql("SELECT name FROM people")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by field index or by field name.
results.map(t => "Name: " + t(0)).collect().foreach(println)

When I carried the example above over into my own code, I got an error:

failure: Lost task 17.3 in stage 0.0 (TID 52, zdh7en): java.lang.ClassCastException:
java.lang.Integer cannot be cast to org.apache.spark.sql.types.UTF8String
  at org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToDouble$1$$anonfun$apply$48.apply(Cast.scala:354)
  at org.apache.spark.sql.catalyst.expressions.Cast.org$apache$spark$sql$catalyst$expressions$Cast$$buildCast(Cast.scala:111)

The gist of this is a type mismatch. Looking at my code, I saw that I had copied the following line over without changing it; my record has fields of several different types, but this line declares every field as StringType:

val schema = StructType(schemaString.split(",").map(fieldName => StructField(fieldName, StringType, true)))

So it should be written like this instead:

val schema = StructType(Array(
  StructField("x1", IntegerType, true),
  StructField("x2", IntegerType, true),
  StructField("x3", StringType, true),
  StructField("x4", DecimalType(18, 8), true)))

Note that you cannot write a bare DecimalType: the precision and scale must be given in parentheses, e.g. DecimalType(18, 8). One more point: if your code has already created an hc (HiveContext) object, there is no need to create a separate sqlContext, since HiveContext extends SQLContext.
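The Row values must also match the declared types, or the same ClassCastException can surface at runtime. Putting the pieces together, a sketch (assuming a comma-separated input file with the four columns x1..x4 above; the path data.txt and the hc variable name are illustrative):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("x1", IntegerType, true),
  StructField("x2", IntegerType, true),
  StructField("x3", StringType, true),
  StructField("x4", DecimalType(18, 8), true)))

// Convert each field to the type declared in the schema --
// leaving every field as a String is what triggers the ClassCastException.
val rowRDD = sc.textFile("data.txt") // illustrative path
  .map(_.split(","))
  .map(p => Row(p(0).trim.toInt, p(1).trim.toInt, p(2),
                new java.math.BigDecimal(p(3).trim)))

val df = hc.createDataFrame(rowRDD, schema) // hc: an existing HiveContext
df.registerTempTable("t")
```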
