Big Data Processing with Apache Spark -- Part 2: Spark SQL

Source: Internet | Published by: 朴素贝叶斯算法 | Editor: 程序博客网 | Time: 2024/04/28 21:37

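In the previous article of this series, we learned what Apache Spark is and how it helps with big data processing and analytics.
Spark SQL, part of the Apache Spark big data framework, is used for structured data processing and lets us run SQL queries on Spark data. We can perform ETL on data in different formats (such as JSON, Parquet, or a database) and then run ad hoc queries.
In this second part of the series, we look at the Spark SQL library and how it can be used to run SQL queries against data stored in batch files, JSON data sets, or Hive tables.
Spark 1.3, released in March 2015, was the latest version of the framework when this article was written. Before this release, the Spark SQL module was in "Alpha" status; it has now graduated to a regular release component. The release includes the following new features: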

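DataFrame
The new release provides a programming abstraction called the DataFrame, which can act as a distributed SQL query engine.
Data Sources
With the addition of the data sources API, Spark SQL makes it easier to compute over structured data stored in a wide variety of formats, including Parquet, JSON, and Apache Avro.
JDBC Server
The built-in JDBC server makes it easy to connect to structured data stored in relational database tables and to perform big data analytics with traditional BI tools.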

Spark SQL Components

The two main components when using Spark SQL are the DataFrame and the SQLContext.
Let's look at the DataFrame first.

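DataFrame

A DataFrame is a distributed collection of data organized into named columns. It is based on the data frame concept in the R language and is similar to a table in a relational database.
In earlier versions of the Spark SQL API it was called SchemaRDD; it has since been renamed to DataFrame.
A DataFrame can be converted to an RDD by calling its rdd method, which returns the content of the DataFrame as an RDD of Row objects.
DataFrames can be created from different data sources, such as: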

Existing RDDs
Structured data files
JSON data sets
External databases
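
As a minimal sketch of the first source in the list (an existing RDD), and of the rdd method going the other way, the snippet below builds a small DataFrame and converts it back to an RDD. It assumes it is run in spark-shell on Spark 1.3, where the SparkContext `sc` is already defined; the `City` case class, the `sqlCtx` name, and the sample values are made up for illustration.

```scala
// Assumes spark-shell (Spark 1.3), where `sc` is already defined.
import org.apache.spark.sql.SQLContext

val sqlCtx = new SQLContext(sc)
import sqlCtx.implicits._

// Hypothetical record type and sample data, for illustration only.
case class City(name: String, state: String)
val cityRDD = sc.parallelize(Seq(City("Austin", "TX"), City("Dallas", "TX")))

// RDD of case-class objects -> DataFrame (column names come from the field names).
val cityDF = cityRDD.toDF()

// DataFrame -> RDD of Row objects; fields are read back by position.
val cityNames = cityDF.rdd.map(row => row.getString(0)).collect()
```

Here `cityDF.rdd` yields generic Row objects rather than `City` instances, which is why the name is read back by position.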

The Spark SQL and DataFrame APIs are available in the following programming languages:

Scala
https://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.sql.package
Java
https://spark.apache.org/docs/1.3.0/api/java/index.html?org/apache/spark/sql/api/java/package-summary.html
Python
https://spark.apache.org/docs/1.3.0/api/python/pyspark.sql.html

The Spark SQL code examples in this article are written for the Spark Scala shell.

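SQLContext

Spark SQL provides SQLContext to encapsulate all relational functionality in Spark. You create a SQLContext from an existing SparkContext, like the one used in the earlier examples. The following snippet shows how to create a SQLContext object.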

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

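There is also HiveContext, which provides a superset of the functionality provided by SQLContext. It can be used to write queries with the HiveQL parser and to read data from Hive tables.

Note that you don't need an existing Hive environment to use HiveContext in Spark programs.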
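
A minimal sketch of creating a HiveContext, assuming spark-shell on a Spark 1.3 build that includes Hive support. The `SHOW TABLES` statement is only an illustrative query; with no existing Hive installation configured, the table list would typically be empty.

```scala
// Assumes spark-shell on a Spark 1.3 build that includes Hive support.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// HiveContext is a subclass of SQLContext, so it can be used wherever
// a SQLContext is expected.
val asSqlContext: SQLContext = hiveContext

// HiveQL statements go through the same sql method.
val tables = hiveContext.sql("SHOW TABLES")
tables.show()
```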

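JDBC Data Source

Other Spark SQL features include data sources, among them the JDBC data source.
The JDBC data source can read data from relational databases through the JDBC API. It is preferred over JdbcRDD because it returns its results as a DataFrame, which can be processed with Spark SQL or joined with other data sources.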
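
The following is a hedged sketch of what loading a table through the JDBC data source might look like with the Spark 1.3 API. The helper names `jdbcOptions` and `loadTable` are hypothetical, as are the URL and table name in the commented usage; a JDBC driver for the target database is assumed to be on the classpath.

```scala
// Assumes Spark 1.3 and a JDBC driver for the target database on the classpath.
import org.apache.spark.sql.{DataFrame, SQLContext}

// Build the options understood by the "jdbc" data source
// (hypothetical helper, for illustration only).
def jdbcOptions(url: String, table: String): Map[String, String] =
  Map("url" -> url, "dbtable" -> table)

// Load one database table as a DataFrame (hypothetical helper).
def loadTable(sqlContext: SQLContext, url: String, table: String): DataFrame =
  sqlContext.load("jdbc", jdbcOptions(url, table))

// Hypothetical usage; the URL and table name are placeholders:
// val dfOrders = loadTable(sqlContext, "jdbc:postgresql://localhost/sales", "public.orders")
// dfOrders.registerTempTable("orders")
```

Because the result is an ordinary DataFrame, it can be registered as a temporary table and queried or joined like any other DataFrame.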

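Sample Spark SQL Application

In the previous article, we learned how to install the Spark framework on a local machine, how to launch it, and how to interact with it through the Spark Scala shell. To install the latest version of Spark, download it from the Spark website.
For the examples in this article, we use the same Spark shell to run the Spark SQL code. The examples are for a Windows environment.
To make sure the Spark shell has enough memory, pass the --driver-memory command-line argument when running spark-shell, as shown below.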

spark-shell.cmd --driver-memory 1G

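Spark SQL Application

Once the Spark shell has launched, you can run data analytics queries with the Spark SQL API.
In the first example, we load customer data from a text file and create a DataFrame object from that data set. We can then query the data by calling DataFrame methods.
First, let's look at the contents of the text file customers.txt: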

100, John Smith, Austin, TX, 78727
200, Joe Johnson, Dallas, TX, 75201
300, Bob Jones, Houston, TX, 77028
400, Andy Davis, San Antonio, TX, 78227
500, James Williams, Austin, TX, 78727
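
To make the record layout concrete, here is a small sketch in plain Scala (no Spark required) of how one line of this file could be parsed. The `CustomerRecord` class and `parseCustomer` function are hypothetical names used only for this illustration. Each comma in the file is followed by a space, so the fields are trimmed after splitting.

```scala
// Plain Scala, no Spark required. Names here are illustrative only.
case class CustomerRecord(customerId: Int, name: String, city: String,
                          state: String, zipCode: String)

// Split on commas and trim each field, since every comma in the file
// is followed by a space.
def parseCustomer(line: String): CustomerRecord = {
  val fields = line.split(",").map(_.trim)
  require(fields.length == 5, s"expected 5 fields but found ${fields.length}: $line")
  CustomerRecord(fields(0).toInt, fields(1), fields(2), fields(3), fields(4))
}

val first = parseCustomer("100, John Smith, Austin, TX, 78727")
// first == CustomerRecord(100, "John Smith", "Austin", "TX", "78727")
```

The zip code is kept as a string because it is an identifier rather than a quantity.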

The following snippet shows the Spark SQL commands you can run in the Spark shell console.

// Create the SQLContext first from the existing SparkContext
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Import statement to implicitly convert an RDD to a DataFrame
import sqlContext.implicits._

// Create a custom class to represent the Customer
case class Customer(customer_id: Int, name: String, city: String, state: String, zip_code: String)

// Create a DataFrame of Customer objects from the dataset text file.
// Fields are trimmed because each comma in the file is followed by a space.
val dfCustomers = sc.textFile("data/customers.txt").
  map(_.split(",").map(_.trim)).
  map(p => Customer(p(0).toInt, p(1), p(2), p(3), p(4))).
  toDF()

// Register DataFrame as a table.
dfCustomers.registerTempTable("customers")

// Display the content of DataFrame
dfCustomers.show()

// Print the DF schema
dfCustomers.printSchema()

// Select customer name column
dfCustomers.select("name").show()

// Select customer name and city columns
dfCustomers.select("name", "city").show()

// Select a customer by id
dfCustomers.filter(dfCustomers("customer_id").equalTo(500)).show()

// Count the customers by zip code
dfCustomers.groupBy("zip_code").count().show()

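In the example above, the schema is inferred using reflection. We can also specify the schema of a data set programmatically. This is useful when the custom classes cannot be defined ahead of time, for example because the structure of the data is encoded in a string.
The following code example shows how to specify the schema using the StructType, StringType, and StructField classes: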

//
// Programmatically Specifying the Schema
//

// Create SQLContext from the existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD
val rddCustomers = sc.textFile("data/customers.txt")

// The schema is encoded in a string
val schemaString = "customer_id name city state zip_code"

// Import Spark SQL data types and Row.
import org.apache.spark.sql._
import org.apache.spark.sql.types._

// Generate the schema from the schema string; every column is a nullable string.
val schema = StructType(schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (rddCustomers) to Rows, trimming the space after each comma.
val rowRDD = rddCustomers.map(_.split(",")).map(p => Row(p(0).trim, p(1).trim, p(2).trim, p(3).trim, p(4).trim))

// Apply the schema to the RDD.
val dfCustomers = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrame as a table.
dfCustomers.registerTempTable("customers")

// SQL statements can be run by using the sql method provided by sqlContext.
val custNames = sqlContext.sql("SELECT name FROM customers")

// The results of SQL queries are DataFrames and support all the normal RDD operations.
// The columns of a row in the result can be accessed by ordinal.
custNames.map(t => "Name: " + t(0)).collect().foreach(println)

// Select customer names and zip codes, ordered by zip code.
val customersByCity = sqlContext.sql("SELECT name,zip_code FROM customers ORDER BY zip_code")

// Again, access the columns of each result row by ordinal.
customersByCity.map(t => t(0) + "," + t(1)).collect().foreach(println)
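
As a variation on programmatic schemas, the sketch below declares customer_id as an integer column instead of treating every column as a string. It is only an illustration under stated assumptions: it runs in spark-shell on Spark 1.3, uses inline sample rows in place of the text file, and the names `typedSqlContext`, `typedSchema`, and `typed_customers` are hypothetical.

```scala
// Assumes spark-shell (Spark 1.3), where `sc` is already defined.
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val typedSqlContext = new SQLContext(sc)

// customer_id is declared as a non-nullable integer; the rest stay strings.
val typedSchema = StructType(Seq(
  StructField("customer_id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("city", StringType, nullable = true)
))

// Inline sample rows standing in for the parsed text file.
val typedRows = sc.parallelize(Seq(
  Row(100, "John Smith", "Austin"),
  Row(200, "Joe Johnson", "Dallas"),
  Row(500, "James Williams", "Austin")
))

val dfTyped = typedSqlContext.createDataFrame(typedRows, typedSchema)
dfTyped.registerTempTable("typed_customers")

// customer_id is now compared as an integer.
val bigIds = typedSqlContext.
  sql("SELECT name FROM typed_customers WHERE customer_id >= 500").
  map(_.getString(0)).
  collect()
```

The values placed in each Row have to match the declared column types, so customer_id is passed as an Int here rather than as a string.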

You can also load data from other data sources such as JSON data files, Hive tables, or relational database tables using the JDBC data source.
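
For instance, JSON data might be loaded roughly as follows. This is a sketch assuming spark-shell on Spark 1.3; the JSON records are made-up samples mirroring customers.txt, and they are supplied as an in-memory RDD of strings through jsonRDD so that the snippet does not depend on a file. The names `jsonSqlContext` and `json_customers` are hypothetical.

```scala
// Assumes spark-shell (Spark 1.3), where `sc` is already defined.
import org.apache.spark.sql.SQLContext

val jsonSqlContext = new SQLContext(sc)

// Hypothetical JSON records, one object per string.
val jsonLines = sc.parallelize(Seq(
  """{"customer_id": 100, "name": "John Smith", "city": "Austin"}""",
  """{"customer_id": 200, "name": "Joe Johnson", "city": "Dallas"}"""
))

// The schema (field names and types) is inferred from the JSON itself.
val dfJsonCustomers = jsonSqlContext.jsonRDD(jsonLines)
dfJsonCustomers.printSchema()

// Once registered, the data is queried with the same SQL syntax as before.
dfJsonCustomers.registerTempTable("json_customers")
val austinNames = jsonSqlContext.
  sql("SELECT name FROM json_customers WHERE city = 'Austin'").
  map(_.getString(0)).
  collect()
```

For a file containing one JSON object per line, the jsonFile method of SQLContext takes a path and returns a DataFrame in the same way.
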
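As you can see, Spark SQL provides a nice SQL interface for interacting with data loaded from diverse data sources, using the SQL query syntax we are all familiar with. This is especially useful for non-technical project members such as data analysts and DBAs.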

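Conclusions

In this article, we looked at how Apache Spark SQL provides a SQL interface that lets us interact with Spark data using familiar SQL query syntax. Spark SQL is a powerful library that non-technical team members, such as business and data analysts, can use to run data analytics.
In the next article, we will look at how to use Spark Streaming to process real-time or streaming data. This library is an important part of the overall data processing and management lifecycle in any organization, because streaming data processing gives us real-time insight into our systems. This is critical for use cases such as fraud detection, online trading systems, and event processing solutions.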

Original article: https://www.infoq.com/articles/apache-spark-sql

1 0