Spark SQL Key Points
Source: Internet · Editor: 程序博客网 · Time: 2024/05/26 05:52
Spark SQL lets Spark execute relational queries expressed in SQL, HiveQL, or Scala. The core of this module is a new type of RDD: the SchemaRDD.
def main(args: Array[String]) {
  // case class Customer(name: String, age: Int, gender: String, address: String)
  // Suppress logging
  Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
  val sparkConf = new SparkConf().setAppName("customers")
  val sc = new SparkContext(sparkConf)
  val sqlContext = new SQLContext(sc)
  val schema = StructType(
    StructField("name", StringType, false) ::
    StructField("age", IntegerType, true) :: Nil)

  val r = sc.textFile(args(0))
  val people = r.map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
  val dataFrame = sqlContext.createDataFrame(people, schema)
  dataFrame.printSchema

  dataFrame.registerTempTable("people")
  sqlContext.sql("select * from people where age < 25").collect.foreach(println)
}
def main(args: Array[String]) {
  // Suppress logging
  Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
  val sparkConf = new SparkConf().setAppName("customers")
  val sc = new SparkContext(sparkConf)
  val sqlContext = new SQLContext(sc)

  // The schema is encoded in a string
  val schemaString = "name age"

  // Generate the schema based on the string of schema
  val schema = StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

  val people = sc.textFile(args(0))
  // Convert records of the RDD (people) to Rows.
  val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))
  val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
  dataFrame.printSchema

  dataFrame.registerTempTable("people")
  sqlContext.sql("select * from people where age < 25").collect.foreach(println)
}
def main(args: Array[String]) {
  // Suppress logging
  Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
  Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
  val sparkConf = new SparkConf().setAppName("customers")
  val sc = new SparkContext(sparkConf)
  val sqlContext = new SQLContext(sc)

  // A JSON dataset is pointed to by path.
  // The path can be either a single text file or a directory storing text files.
  val path = "xrli/people.json"
  // Create a SchemaRDD from the file(s) pointed to by path
  val people = sqlContext.jsonFile(path)

  // The inferred schema can be visualized using the printSchema() method.
  people.printSchema()

  // Register this SchemaRDD as a table.
  people.registerTempTable("people")

  // SQL statements can be run by using the sql methods provided by sqlContext.
  val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

  val anotherPeopleRDD = sc.parallelize(
    """{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}""" :: Nil)
  val anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
  anotherPeople.printSchema()
  anotherPeople.registerTempTable("anotherPeople")
  sqlContext.sql("SELECT name FROM anotherPeople")
}
val sparkConf = new SparkConf().setAppName("customers")
val sc = new SparkContext(sparkConf)
val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)
sqlContext.sql("CREATE TABLE IF NOT EXISTS SparkHive (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'xrli/kv1.txt' INTO TABLE SparkHive")

// Queries are expressed in HiveQL
sqlContext.sql("FROM SparkHive SELECT key, value").collect().foreach(println)
1. Use reflection to infer the schema of an RDD that contains a specific object type. When you already know the schema while writing your Spark program, this reflection-based approach makes the code more concise and works well.
For example:

case class Person(name: String, age: Int)
// with import sqlContext.createSchemaRDD in scope (Spark 1.0-1.2),
// an RDD of case classes can be registered as a table directly
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
people.registerTempTable("people")
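Under the hood, the reflection approach derives column names and types from the case class's fields. The column-name part can be illustrated with plain Scala 2.13, no Spark required (`ReflectSchema` and `columnNames` are names made up for this sketch):

```scala
case class Person(name: String, age: Int)

object ReflectSchema {
  // Case-class fields are visible at runtime; Spark SQL uses this kind of
  // information to derive the schema's column names and types.
  def columnNames(p: Product): List[String] = p.productElementNames.toList

  def main(args: Array[String]): Unit =
    println(columnNames(Person("John", 15)))
}
```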
2. Use a programmatic interface that lets you construct a schema and then apply it to an existing RDD. While this approach is more verbose, it lets you construct SchemaRDDs even when the columns and their types are not known until runtime.
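The essence of the programmatic approach is turning a schema description that only exists at runtime into a list of field objects. A minimal sketch in plain Scala, with a tuple standing in for Spark's StructField (`SchemaFromString` is a made-up name):

```scala
object SchemaFromString {
  // Stand-in for Spark's StructField: (name, typeName, nullable)
  type Field = (String, String, Boolean)

  // Build the field list from a schema string decided at runtime,
  // mirroring schemaString.split(" ").map(StructField(_, StringType, true))
  def fields(schemaString: String): Array[Field] =
    schemaString.split(" ").map(name => (name, "string", true))

  def main(args: Array[String]): Unit =
    fields("name age").foreach(println)
}
```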
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import SparkContext._
import org.apache.log4j.{Level, Logger}
object SparkSQL {
}
The input file looks like this:
John,15
HanMM,20
Lixurui,27
Shanxin,22
The output is the rows with age < 25.
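The filter the query applies can be checked without a cluster by running the same transformations on a local collection; this sketch inlines the input file as a List (`LocalFilter` is a made-up name):

```scala
object LocalFilter {
  // The input file, one line per record
  val lines = List("John,15", "HanMM,20", "Lixurui,27", "Shanxin,22")

  // Same parse step as the job: split each line into (name, age)
  val people: List[(String, Int)] =
    lines.map(_.split(",")).map(p => (p(0), p(1).trim.toInt))

  // Equivalent of: select * from people where age < 25
  val under25: List[(String, Int)] = people.filter(_._2 < 25)

  def main(args: Array[String]): Unit = under25.foreach(println)
}
```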
The following way of writing it is perhaps clearer:
object SparkSQL {
}
Spark SQL can automatically infer the schema of a JSON dataset and load it as a SchemaRDD (replaced by DataFrame in newer releases). The conversion can be done in two ways:
- jsonFile: loads data from a directory of JSON files, where each line of each file is a JSON object
- jsonRDD: loads data from an existing RDD whose every element is a string containing a JSON object
Note that what jsonFile reads is not a typical JSON file: each line must stand on its own and contain a valid JSON object. As a consequence, a regular multi-line JSON file will usually fail to load.
For example, people.json looks like this:
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
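A quick local sanity check for the one-object-per-line format can be sketched in plain Scala (a naive heuristic, not a full JSON parser; `JsonLinesCheck` is a made-up name and the file contents are inlined):

```scala
object JsonLinesCheck {
  // Valid input for jsonFile: each line is a complete JSON object
  val good = List(
    """{"name":"Michael"}""",
    """{"name":"Andy", "age":30}""",
    """{"name":"Justin", "age":19}""")

  // A pretty-printed (multi-line) object, which jsonFile would reject
  val bad = List("{", "  \"name\":\"Michael\"", "}")

  // Naive check: every line should itself look like a complete JSON object
  def looksLikeJsonLines(lines: List[String]): Boolean =
    lines.forall(l => l.trim.startsWith("{") && l.trim.endsWith("}"))

  def main(args: Array[String]): Unit = {
    println(looksLikeJsonLines(good))
    println(looksLikeJsonLines(bad))
  }
}
```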
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import SparkContext._
import org.apache.log4j.{Level, Logger}
object SparkJSON {
  // root
  //  |-- age: integer (nullable = true)
  //  |-- name: string (nullable = true)

  // Alternatively, a SchemaRDD can be created for a JSON dataset represented by
  // an RDD[String] storing one JSON object per string.
}
Result:
Spark SQL can also interoperate with Hive. The interop requires some manual configuration first; see http://lxw1234.com/archives/2015/06/294.htm
After that it is ready to use, for example in code:
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import SparkContext._
import org.apache.log4j.{Level, Logger}
class SparkSQLHive {
}