SparkSQL (Part 1)


Spark supports two ways of converting RDDs into SchemaRDDs (renamed DataFrame as of Spark 1.3). The first uses reflection to infer the schema of an RDD that contains objects of a specific type. The second is a programmatic interface that lets you construct a schema and then apply it to an existing RDD.

1. Inferring the Schema Using Reflection

Spark SQL's Scala interface supports automatically converting an RDD that contains case classes into a SchemaRDD. The case class defines the schema of the table: the parameter names of the case class are read via reflection and become the column names. The RDD can then be implicitly converted into a SchemaRDD and registered as a table.
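The example below expects a comma-separated input file with one person per line, in the order firstName,lastName,age. A minimal sample (the names and ages here are purely illustrative):

John,Smith,28
Jane,Doe,35
Bob,Brown,24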
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

object UseCaseClass {
  case class Person(firstName: String, lastName: String, age: Int)

  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      System.err.println("Usage: <data path>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // Create a plain SQLContext on top of the SparkContext
    val sqlContext = new SQLContext(sc)
    // Import the implicits that enable the .toDF() conversion
    import sqlContext.implicits._
    val data = sc.textFile(args(0))
    // Parse each comma-separated line into a Person
    val personRDD = data.map(_.split(",")).map(person => Person(person(0), person(1), person(2).toInt))
    // Convert the personRDD into the personDF DataFrame
    val personDF = personRDD.toDF()
    // Register the personDF as a table
    personDF.registerTempTable("person")
    // Run a SQL query against it
    val people = sqlContext.sql("SELECT * FROM person WHERE age < 30")
    people.collect().foreach(println)
    sc.stop()
  }
}

2. Programmatically Specifying the Schema

When a case class cannot be defined ahead of time (for example, when the structure of a record is encoded in a string), the case class approach no longer works. In that case, a SchemaRDD can be created in three steps:
  1. Create an RDD of Rows from the original RDD;
  2. Use StructType and StructField to create a schema, represented by a StructType, that matches the structure of the Row objects in the RDD from step 1;
  3. Apply the schema to the RDD of Rows to obtain a DataFrame.
First, take a look at the definitions of StructType and StructField:
StructType(fields: Array[StructField])
StructField(name: String, dataType: DataType, nullable: Boolean = true, metadata: Metadata = Metadata.empty)
  • name: the name of the field;
  • dataType: the data type of the field; the primitive types include:
IntegerType, FloatType, BooleanType, ShortType, LongType, ByteType, DoubleType, StringType
  • nullable: whether the field may be null; defaults to true;
  • metadata: the metadata of the field. Metadata wraps a Map[String, Any] and can carry metadata of any type (see the sketch after this list).
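As a small illustration of the metadata parameter (a sketch not taken from the original post; the "comment" key and its value are assumptions), a Metadata object can be built with MetadataBuilder and attached to a StructField:

import org.apache.spark.sql.types.{IntegerType, Metadata, MetadataBuilder, StructField}

// Build a Metadata object; the key and value here are illustrative
val ageMeta: Metadata = new MetadataBuilder()
  .putString("comment", "age in whole years")
  .build()

// Attach the metadata to a field definition; it travels along with the schema
val ageField = StructField("age", IntegerType, nullable = true, metadata = ageMeta)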
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType, StructField}

object SpecifySchema {
  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      System.err.println("Usage: <data path>")
      System.exit(1)
    }
    val conf = new SparkConf()
    val sc = new SparkContext(conf)
    // Create a plain SQLContext on top of the SparkContext
    val sqlContext = new SQLContext(sc)
    val data = sc.textFile(args(0))
    // Convert the RDD of Array[String] into an RDD of Row objects
    val personRow = data.map(_.split(",")).map(person => Row(person(0), person(1), person(2).toInt))
    // Create the schema using StructType and StructField objects.
    // A StructField takes the field name, field type, and nullability.
    val schema = StructType(
      Array(
        StructField("firstName", StringType, true),
        StructField("lastName", StringType, true),
        StructField("age", IntegerType, true)
      )
    )
    // Apply the schema to create the personDF DataFrame
    val personDF = sqlContext.createDataFrame(personRow, schema)
    // Register personDF as a table
    personDF.registerTempTable("person")
    // Run a SQL query against it
    val people = sqlContext.sql("SELECT * FROM person WHERE age < 30")
    people.collect().foreach(println)
    sc.stop()
  }
}
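Both programs are meant to be packaged into a jar and launched with spark-submit; a typical local invocation (the jar name and data path are illustrative) might look like:

spark-submit --master local[*] --class SpecifySchema sparksql-examples.jar /data/person.csv

Note that registerTempTable is the Spark 1.x API in use when this post was written; from Spark 2.0 onward it is deprecated in favor of createOrReplaceTempView.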

