Getting Started with Spark SQL

Source: Internet  Editor: 程序博客网  Time: 2024/05/02 07:26

1. Prerequisites
Start Hadoop and Spark.
2. Enter spark-shell

bin/spark-shell --master spark://c1:7077 --executor-memory 2g

3. SQL operations

The text file customers.txt contains the following:

100, John Smith, Austin, TX, 78727
200, Joe Johnson, Dallas, TX, 75201
300, Bob Jones, Houston, TX, 77028
400, Andy Davis, San Antonio, TX, 78227
500, James Williams, Austin, TX, 78727

Writing SQL directly:

// Specify the schema programmatically

// Create a SQLContext from the existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD; the actual path on HDFS is /user/root/data/customers.txt
val rddCustomers = sc.textFile("data/customers.txt")

// Encode the schema as a string
val schemaString = "customer_id name city state zip_code"

// Import the Spark SQL data types and Row
import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Build the schema object from the schema string
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert each record of the RDD (rddCustomers) to a Row, trimming the space after each comma
val rowRDD = rddCustomers.map(_.split(",")).map(p => Row(p(0).trim, p(1).trim, p(2).trim, p(3).trim, p(4).trim))

// Apply the schema to the RDD
val dfCustomers = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrame as a table
dfCustomers.registerTempTable("customers")

// Run a SQL statement with the sql method of the sqlContext object
val custNames = sqlContext.sql("SELECT name FROM customers")

// The result of a SQL query is a DataFrame, which supports all the usual RDD operations.
// The columns of each result row can be accessed by position.
custNames.map(t => "Name: " + t(0)).collect().foreach(println)

// Run a second query, this time ordered by zip code
val customersByCity = sqlContext.sql("SELECT name,zip_code FROM customers ORDER BY zip_code")
customersByCity.map(t => t(0) + "," + t(1)).collect().foreach(println)
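The sample file puts a space after every comma, so a bare split(",") leaves a leading space on every field except the first. Below is a minimal sketch in plain Scala (no Spark needed) of splitting and trimming one record, then pairing the values with the field names from the schema string. The names CustomerLineParsing, parseLine and toRecord are hypothetical, introduced only for this illustration.

```scala
// Plain Scala, no Spark required: mirrors the split step applied to each line of customers.txt.
object CustomerLineParsing {
  // Field names taken from the schema string used in the example.
  val fieldNames: Array[String] = "customer_id name city state zip_code".split(" ")

  // Split one comma-separated record and strip the padding around each field.
  def parseLine(line: String): Array[String] = line.split(",").map(_.trim)

  // Pair each field name with its value, much as the schema names the columns of a Row.
  def toRecord(line: String): Map[String, String] =
    fieldNames.zip(parseLine(line)).toMap
}
```

For example, CustomerLineParsing.toRecord("200, Joe Johnson, Dallas, TX, 75201")("city") gives "Dallas" with no leading space, whereas the untrimmed split would give " Dallas".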

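The DataFrame approach:
Load the customer data from the text file and create a DataFrame from the dataset. Then call DataFrame functions to run specific data selection queries.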

// First create a SQLContext from the existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Import the implicits that convert an RDD to a DataFrame
import sqlContext.implicits._

// Define a case class that represents a customer
case class Customer(customer_id: Int, name: String, city: String, state: String, zip_code: String)

// Create a DataFrame of Customer objects from the text file, trimming the space after each comma
val dfCustomers = sc.textFile("data/customers.txt").map(_.split(",")).map(p => Customer(p(0).trim.toInt, p(1).trim, p(2).trim, p(3).trim, p(4).trim)).toDF()

// Register the DataFrame as a table
dfCustomers.registerTempTable("customers")

// Show the contents of the DataFrame
dfCustomers.show()

// Print the DataFrame schema
dfCustomers.printSchema()

// Select the customer name column
dfCustomers.select("name").show()

// Select the customer name and city columns
dfCustomers.select("name", "city").show()

// Select a customer by id
dfCustomers.filter(dfCustomers("customer_id").equalTo(500)).show()

// Count customers by zip code
dfCustomers.groupBy("zip_code").count().show()
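To see what the filter and groupBy calls should return for the five sample records, the same logic can be sketched with plain Scala collections. This is only an illustration that runs without a cluster, not how Spark executes the query; the object name CustomerQueriesSketch and the values byId and countByZip are hypothetical.

```scala
// Plain Scala collections, no Spark required.
object CustomerQueriesSketch {
  case class Customer(customer_id: Int, name: String, city: String, state: String, zip_code: String)

  // The five sample records from customers.txt.
  val lines: Seq[String] = Seq(
    "100, John Smith, Austin, TX, 78727",
    "200, Joe Johnson, Dallas, TX, 75201",
    "300, Bob Jones, Houston, TX, 77028",
    "400, Andy Davis, San Antonio, TX, 78227",
    "500, James Williams, Austin, TX, 78727"
  )

  // Split each line on commas, trim the fields, and build a Customer.
  val customers: Seq[Customer] = lines.map(_.split(",").map(_.trim)).map { p =>
    Customer(p(0).toInt, p(1), p(2), p(3), p(4))
  }

  // Counterpart of filter(dfCustomers("customer_id").equalTo(500)).
  val byId: Seq[Customer] = customers.filter(_.customer_id == 500)

  // Counterpart of groupBy("zip_code").count().
  val countByZip: Map[String, Int] =
    customers.groupBy(_.zip_code).map { case (zip, cs) => zip -> cs.size }
}
```

Here byId holds the single record for James Williams, and countByZip has four entries, with zip code 78727 counted twice because both Austin customers share it.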

Integrating Hive
Copy hive-site.xml into Spark's conf directory:

cp ${HIVE_HOME}/conf/hive-site.xml ${SPARK_HOME}/conf/hive-site.xml

Start spark-shell with the MySQL JDBC driver on the driver classpath:

bin/spark-shell --master spark://c1:7077 --executor-memory 2g --driver-class-path /usr/local/hive/lib/mysql-connector-java-5.1.35.jar

val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc)

// sal is a table that already exists in Hive
sqlContext.sql("select * from sal").collect().foreach(println)

0 0