SparkSQL (Part 1)


Spark supports two ways of converting RDDs into SchemaRDDs (renamed DataFrame as of Spark 1.3). The first uses reflection to infer the schema of an RDD that contains objects of a specific type. The second is a programmatic interface that lets you construct a schema and then apply it to an existing RDD.

1. Inferring the Schema Using Reflection

Spark SQL's Scala interface supports automatically converting an RDD that contains case classes into a SchemaRDD. The case class defines the schema of the table: the parameter names of the case class are read via reflection and become the column names. The RDD can then be implicitly converted into a SchemaRDD and registered as a table.
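The example below expects a comma-separated input file with one person per line, in the order firstName,lastName,age. A minimal sample (the names and ages here are purely illustrative):

John,Smith,28
Jane,Doe,35
Bob,Brown,24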
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

object UseCaseClass {
  case class Person(firstName: String, lastName: String, age: Int)

  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      System.err.println("Usage: <data path>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // Create a plain SQLContext on top of the SparkContext
    val sqlContext = new SQLContext(sc)
    // Import the implicits that enable the .toDF() conversion
    import sqlContext.implicits._
    val data = sc.textFile(args(0))
    // Parse each comma-separated line into a Person
    val personRDD = data.map(_.split(",")).map(person => Person(person(0), person(1), person(2).toInt))
    // Convert the personRDD into the personDF DataFrame
    val personDF = personRDD.toDF()
    // Register the personDF as a table
    personDF.registerTempTable("person")
    // Run a SQL query against it
    val people = sqlContext.sql("SELECT * FROM person WHERE age < 30")
    people.collect().foreach(println)
    sc.stop()
  }
}

2. Programmatically Specifying the Schema

When a case class cannot be defined ahead of time (for example, when the structure of a record is encoded in a string), the case class approach no longer works. In that case, a SchemaRDD can be created in three steps:
  1. Create an RDD of Rows from the original RDD;
  2. Use StructType and StructField to create a schema, represented by a StructType, that matches the structure of the Row objects in the RDD from step 1;
  3. Apply the schema to the RDD of Rows to obtain a DataFrame.
First, take a look at the definitions of StructType and StructField:
StructType(fields: Array[StructField])
StructField(name: String, dataType: DataType, nullable: Boolean = true, metadata: Metadata = Metadata.empty)
  • name: the name of the field;
  • dataType: the data type of the field; the primitive types include:
IntegerType, FloatType, BooleanType, ShortType, LongType, ByteType, DoubleType, StringType
  • nullable: whether the field may be null; defaults to true;
  • metadata: the metadata of the field. Metadata wraps a Map[String, Any] and can carry metadata of any type (see the sketch after this list).
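As a small illustration of the metadata parameter (a sketch not taken from the original post; the "comment" key and its value are assumptions), a Metadata object can be built with MetadataBuilder and attached to a StructField:

import org.apache.spark.sql.types.{IntegerType, Metadata, MetadataBuilder, StructField}

// Build a Metadata object; the key and value here are illustrative
val ageMeta: Metadata = new MetadataBuilder()
  .putString("comment", "age in whole years")
  .build()

// Attach the metadata to a field definition; it travels along with the schema
val ageField = StructField("age", IntegerType, nullable = true, metadata = ageMeta)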
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType, StructField}

object SpecifySchema {
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      System.err.println("Usage: <data path>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // Create a plain SQLContext on top of the SparkContext
    val sqlContext = new SQLContext(sc)
    val data = sc.textFile(args(0))
    // Convert the RDD of Array[String] into an RDD of Row objects
    val personRow = data.map(_.split(",")).map(person => Row(person(0), person(1), person(2).toInt))
    // Create the schema using StructType and StructField objects.
    // A StructField takes the field name, field type, and nullability.
    val schema = StructType(
      Array(
        StructField("firstName", StringType, true),
        StructField("lastName", StringType, true),
        StructField("age", IntegerType, true)
      )
    )
    // Apply the schema to create the personDF DataFrame
    val personDF = sqlContext.createDataFrame(personRow, schema)
    // Register personDF as a table
    personDF.registerTempTable("person")
    // Run a SQL query against it
    val people = sqlContext.sql("SELECT * FROM person WHERE age < 30")
    people.collect().foreach(println)
    sc.stop()
  }
}
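Both programs are meant to be packaged into a jar and launched with spark-submit; a typical local invocation (the jar name and data path are illustrative) might look like:

spark-submit --master local[*] --class SpecifySchema sparksql-examples.jar /data/person.csv

Note that registerTempTable is the Spark 1.x API in use when this post was written; from Spark 2.0 onward it is deprecated in favor of createOrReplaceTempView.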

