Spark SQL


DataFrame

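A DataFrame is a distributed dataset built on top of RDDs, similar to a two-dimensional table in a traditional database. It is an RDD that carries schema information, and it serves mainly as a high-level abstraction over structured data.
Difference between DataFrame and RDD: a DataFrame carries schema metadata, so every column of the two-dimensional table it represents has a name and a type. This lets Spark SQL see more of the data's structure and apply targeted optimizations, both to the data source behind the DataFrame and to the transformations applied on top of it, which considerably improves runtime efficiency. An RDD knows nothing about the internal structure of its elements, so Spark Core can only perform simple, generic pipeline optimizations at the stage level.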
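
To make the schema difference concrete, here is a minimal sketch. It assumes the Spark 1.x style API used in the rest of this post; the `Person` case class, the object name and the sample values are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type, used only for this illustration.
case class Person(name: String, age: Int)

object SchemaDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("SchemaDemo").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // The RDD only knows it holds Person objects; their fields are opaque to Spark Core.
    val rdd = sc.parallelize(Seq(Person("libai", 30), Person("dufu", 25)))

    // The DataFrame carries a name and a type for every column.
    val df = rdd.toDF()
    df.printSchema()

    // With column-level information, Spark SQL can plan this filter and projection itself.
    df.filter(df("age") > 26).select("name").show()

    sc.stop()
  }
}
```

`printSchema()` lists the `name` and `age` columns with their types, which is exactly the information a plain RDD does not expose.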

Spark DataFrame

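Spark DataFrame:
1. A distributed dataset.
2. Similar to a table in a relational database, or a sheet in Excel.
3. Offers a rich set of operations, similar to the operators on an RDD.
4. A DataFrame can be registered as a table and then queried with SQL.
5. Can be created in many ways:
an existing RDD
structured data files
JSON datasets
Hive tables
external databases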
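
As a small sketch of points 4 and 5, the following builds a DataFrame from a JSON dataset, registers it as a table and queries it with SQL. The JSON records, the table name `poets` and the object name are assumptions made for this example; the JSON is held in memory so that no file path is needed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object JsonTableDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("JsonTableDemo").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Made-up JSON records, one JSON object per RDD element.
    val jsonRDD = sc.parallelize(Seq(
      """{"name": "libai", "city": "chengdu"}""",
      """{"name": "dufu", "city": "luoyang"}"""))

    // Spark SQL infers the schema from the JSON records.
    val df = sqlContext.read.json(jsonRDD)
    df.printSchema()

    // Register the DataFrame as a table and query it with SQL.
    df.registerTempTable("poets")
    sqlContext.sql("SELECT name FROM poets WHERE city = 'chengdu'").collect().foreach(println)

    sc.stop()
  }
}
```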

Notes:

(1) To use a Hive data source, put hive-site.xml in Spark's conf directory, for example: scp conf/hive-site.xml root@node22:/usr/hadoop/spark/conf

(2) Files on HDFS are accessed through port 8020.
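
A rough sketch of note (2): the port goes into the `hdfs://` URI passed to `sc.textFile`. The host `namenode-host`, the file path and the helper `hdfsUri` are placeholders invented for this example, and the read itself only works against a reachable HDFS cluster.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object HdfsPathDemo {
  // Builds an hdfs:// URI; 8020 is the port from the note above.
  def hdfsUri(host: String, path: String, port: Int = 8020): String =
    s"hdfs://$host:$port$path"

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("HdfsPathDemo").setMaster("local")
    val sc = new SparkContext(conf)

    // "namenode-host" and the file path are placeholders, not values from this article.
    val lines = sc.textFile(hdfsUri("namenode-host", "/data/people.txt"))
    lines.take(5).foreach(println)

    sc.stop()
  }
}
```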

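Interoperating between DataFrame and RDD

The Scala interface of Spark SQL supports converting an RDD into a DataFrame. A case class defines the schema of the table: the names of the case class parameters are read via reflection and become the column names. Case classes can also be nested or contain complex types such as sequences or arrays. Such an RDD can be implicitly converted to a DataFrame and then registered as a table, which can be used in subsequent SQL statements.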
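
The nested case classes and sequence fields mentioned above could look roughly like this. The `Address` and `Poet` classes, their fields and the sample values are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical case classes: Poet nests Address and contains a sequence.
case class Address(city: String, zip: String)
case class Poet(name: String, address: Address, works: Seq[String])

object NestedReflectionDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("NestedReflectionDemo").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Enables the implicit RDD-to-DataFrame conversion (toDF).
    import sqlContext.implicits._

    val poets = sc.parallelize(Seq(
      Poet("libai", Address("chengdu", "000001"), Seq("work-a", "work-b")),
      Poet("dufu", Address("luoyang", "000002"), Seq("work-c"))))

    // Parameter names become column names; nested types become struct and array columns.
    val df = poets.toDF()
    df.printSchema()

    // Nested fields can be addressed with dot notation once the table is registered.
    df.registerTempTable("poets")
    sqlContext.sql("SELECT name, address.city FROM poets").collect().foreach(println)

    sc.stop()
  }
}
```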

UDF

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object UDF {

  // Function used as the UDF: returns the length of a string.
  val myUDF = (str: String) => {
    str.length
  }

  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("UDF").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val names = Array("libai", "dufu", "baijuyi", "wangchanlin", "hezhizang")
    val namesRDD = sc.parallelize(names, 4)
    // Wrap each name in a Row and attach a one-column schema.
    val namesRowRDD = namesRDD.map(name => Row(name))
    val structType = StructType(Array(StructField("name", StringType, true)))
    val namesDF = sqlContext.createDataFrame(namesRowRDD, structType)

    namesDF.registerTempTable("names")
    // Alternatively, register an inline function:
    // sqlContext.udf.register("strLength", (str: String) => str.length)
    sqlContext.udf.register("strLength", myUDF)

    sqlContext.sql("select name, strLength(name) from names").collect().foreach(println)
  }
}
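
Two variations on the example above, as a sketch: a UDF with two arguments, and the same length function used through the DataFrame API rather than SQL. The names `longerThan` and `UDFVariants` are made up here; `functions.udf` is part of Spark SQL's Scala API.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

object UDFVariants {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UDFVariants").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val namesDF = sc.parallelize(Seq("libai", "dufu", "baijuyi")).toDF("name")
    namesDF.registerTempTable("names")

    // A two-argument UDF (hypothetical name), callable from SQL.
    sqlContext.udf.register("longerThan", (str: String, n: Int) => str.length > n)
    sqlContext.sql("select name from names where longerThan(name, 4)").collect().foreach(println)

    // The same length function wrapped with functions.udf for the DataFrame API.
    val strLength = udf((str: String) => str.length)
    namesDF.select(namesDF("name"), strLength(namesDF("name")).as("len")).show()

    sc.stop()
  }
}
```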

Creating a DataFrame

1. Reflection:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

case class Age(id: Int, age: Int)

object SSQL02 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Age").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    // Brings toDF() into scope for RDDs of case classes.
    import sqlContext.implicits._
    val lines = sc.textFile("D:\\***")
    // Parse each tab-separated line into an Age and convert the RDD to a DataFrame.
    val df = lines.map(_.split("\\t")).map(line => Age(line(0).trim.toInt, line(1).trim.toInt)).toDF()
    df.registerTempTable("Age")
    sqlContext.sql("select * from Age").collect().foreach(println)
    val allAge1 = sqlContext.sql("select * from Age")
    // Access columns by position.
    allAge1.map(word => "id : " + word(0) + " age: " + word(1)).collect().foreach(println)
    // Access columns by name, with an explicit type.
    allAge1.map(word => "id : " + word.getAs[Int]("id") + " age : " + word.getAs[Int]("age")).collect().foreach(println)
  }
}

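2. Specifying the schema programmatically:

(Note: Row(p(0), p(1).trim) must not add toInt here, because every field in the schema is declared as StringType.)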

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
// Import Row.
import org.apache.spark.sql.Row
// Import Spark SQL data types.
import org.apache.spark.sql.types.{StringType, StructField, StructType}

object SSQL03 {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("ssql03").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val people = sc.textFile("D:\\***")
    val schemaString = "name age"
    // Generate the schema based on the string of schema.
    val schema =
      StructType(
        schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))
    // Convert records of the RDD (people) to Rows.
    val rowRDD = people.map(_.split("\\t")).map(p => Row(p(0), p(1).trim))
    // Apply the schema to the RDD.
    val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    // Register the DataFrame as a table.
    peopleDataFrame.registerTempTable("people")
    // SQL statements can be run by using the sql methods provided by sqlContext.
    val results = sqlContext.sql("SELECT * FROM people")
    // Going through .rdd also works on Spark versions where DataFrame.map needs an encoder.
    results.rdd.map(t => "Name: " + t(0) + " Age: " + t(1)).collect().foreach(println)
  }
}
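
If the age column should be numeric, one option is to declare it as IntegerType in the schema; in that case toInt is needed when building each Row, because the Row values have to match the declared types. The sketch below assumes made-up in-memory sample lines instead of the file used above, and the object name is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object TypedSchemaDemo {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TypedSchemaDemo").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // Made-up tab-separated sample lines standing in for the input file.
    val people = sc.parallelize(Seq("libai\t30", "dufu\t25"))

    // Each field gets its own type instead of StringType everywhere.
    val schema = StructType(Array(
      StructField("name", StringType, nullable = true),
      StructField("age", IntegerType, nullable = true)))

    // The Row values must match the declared types, so age is converted with toInt.
    val rowRDD = people.map(_.split("\t")).map(p => Row(p(0), p(1).trim.toInt))

    val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)
    peopleDataFrame.registerTempTable("people")

    // Numeric comparison works because age is an integer column.
    sqlContext.sql("SELECT name FROM people WHERE age > 26").collect().foreach(println)

    sc.stop()
  }
}
```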

Problems

1. Spark SQL: No TypeTag available for xxxx

The case class must be defined above the object, at the top level of the file.
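
A minimal layout that follows this advice might look like the sketch below. The `Score` case class and the object name are hypothetical; the error is commonly reported when the case class is declared inside the method that calls toDF().

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Defined at the top level, above the object, so the compiler can provide a TypeTag for it.
case class Score(id: Int, value: Int)

object TypeTagDemo {
  def main(args: Array[String]): Unit = {
    // Declaring `case class Score(...)` here, inside main, is the layout
    // that typically leads to "No TypeTag available for Score".
    val conf = new SparkConf().setAppName("TypeTagDemo").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Made-up sample values.
    val df = sc.parallelize(Seq(Score(1, 90), Score(2, 75))).toDF()
    df.show()

    sc.stop()
  }
}
```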
