Spark SQL 2.0 New Features


1 - SparkSession

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// For implicit conversions like converting RDDs to DataFrames
import spark.implicits._
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/SparkSQLExample.scala" in the Spark repo.

SparkSession in Spark 2.0 provides built-in support for Hive features, including the ability to write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. To use these features, you do not need to have an existing Hive setup.
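Hive support is switched on through the builder's enableHiveSupport() method. Below is a minimal sketch, not taken from the Spark repo example above; the warehouse path and the src table name are placeholders used only for illustration.

import org.apache.spark.sql.SparkSession

// A sketch of building a Hive-enabled session; the warehouse path is illustrative.
val warehouseLocation = "spark-warehouse"

val hiveSpark = SparkSession
  .builder()
  .appName("Spark SQL Hive example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

// With Hive support enabled, HiveQL can be issued directly.
// `src` is a hypothetical Hive table used only for illustration.
hiveSpark.sql("SELECT key, value FROM src").show()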

2 - Global Temporary View

Temporary views in Spark SQL are session-scoped and will disappear if the session that creates them terminates. If you want a temporary view that is shared among all sessions and kept alive until the Spark application terminates, you can create a global temporary view. Global temporary views are tied to a system-preserved database global_temp, and you must use the qualified name to refer to them, e.g. SELECT * FROM global_temp.view1.
// Register the DataFrame as a global temporary view
df.createGlobalTempView("people")

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
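A small addition that is not part of the example above: once the view is no longer needed, it can also be dropped explicitly through the catalog (available on Spark versions that ship global temporary views).

// Assumes the `people` global temporary view registered above.
// Returns true if the view existed and was dropped.
spark.catalog.dropGlobalTempView("people")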

3 - Untyped User-Defined Aggregate Functions

Users have to extend the UserDefinedAggregateFunction abstract class to implement a custom untyped aggregate function. For example, a user-defined average can look like:

import org.apache.spark.sql.expressions.MutableAggregationBuffer
import org.apache.spark.sql.expressions.UserDefinedAggregateFunction
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession

object MyAverage extends UserDefinedAggregateFunction {
  // Data types of input arguments of this aggregate function
  def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
  // Data types of values in the aggregation buffer
  def bufferSchema: StructType = {
    StructType(StructField("sum", LongType) :: StructField("count", LongType) :: Nil)
  }
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output on the identical input
  def deterministic: Boolean = true
  // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
  // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
  // the opportunity to update its values. Note that arrays and maps inside the buffer are still
  // immutable.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the given aggregation buffer `buffer` with new input data from `input`
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}

// Register the function to access it
spark.udf.register("myAverage", MyAverage)

val df = spark.read.json("examples/src/main/resources/employees.json")
df.createOrReplaceTempView("employees")
df.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

val result = spark.sql("SELECT myAverage(salary) as average_salary FROM employees")
result.show()
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+
Find full example code at "examples/src/main/scala/org/apache/spark/examples/sql/UserDefinedUntypedAggregation.scala" in the Spark repo.
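Besides the SQL statement shown in the example, the registered function can also be invoked through the DataFrame API. A minimal sketch, reusing the df and the "myAverage" registration from the example above:

import org.apache.spark.sql.functions.callUDF

// Call the registered UDAF without writing SQL text; this computes the same
// 3750.0 average as the SQL query above.
df.select(callUDF("myAverage", df("salary")).as("average_salary")).show()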

4 - Type-Safe User-Defined Aggregate Functions

User-defined aggregations for strongly typed Datasets revolve around the Aggregator abstract class. For example, a type-safe user-defined average can look like:

import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.SparkSession

case class Employee(name: String, salary: Long)
case class Average(var sum: Long, var count: Long)

object MyAverage extends Aggregator[Employee, Average, Double] {
  // A zero value for this aggregation. Should satisfy the property that any b + zero = b
  def zero: Average = Average(0L, 0L)
  // Combine two values to produce a new value. For performance, the function may modify `buffer`
  // and return it instead of constructing a new object
  def reduce(buffer: Average, employee: Employee): Average = {
    buffer.sum += employee.salary
    buffer.count += 1
    buffer
  }
  // Merge two intermediate values
  def merge(b1: Average, b2: Average): Average = {
    b1.sum += b2.sum
    b1.count += b2.count
    b1
  }
  // Transform the output of the reduction
  def finish(reduction: Average): Double = reduction.sum.toDouble / reduction.count
  // Specifies the Encoder for the intermediate value type
  def bufferEncoder: Encoder[Average] = Encoders.product
  // Specifies the Encoder for the final output value type
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

val ds = spark.read.json("examples/src/main/resources/employees.json").as[Employee]
ds.show()
// +-------+------+
// |   name|salary|
// +-------+------+
// |Michael|  3000|
// |   Andy|  4500|
// | Justin|  3500|
// |  Berta|  4000|
// +-------+------+

// Convert the function to a `TypedColumn` and give it a name
val averageSalary = MyAverage.toColumn.name("average_salary")
val result = ds.select(averageSalary)
result.show()
// +--------------+
// |average_salary|
// +--------------+
// |        3750.0|
// +--------------+
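The same Aggregator can also be applied per group through the typed API. A minimal sketch, reusing the ds Dataset and MyAverage from the example above; the grouping key (name) is chosen purely for illustration, so each group here happens to contain a single employee.

// Group the typed Dataset by key and apply the Aggregator to each group.
// Assumes `import spark.implicits._` is in scope (also needed for `.as[Employee]` above).
val perName = ds
  .groupByKey(_.name)
  .agg(MyAverage.toColumn.name("average_salary"))
perName.show()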
