Spark SQL

Concepts

  • Datasets

Conceptually similar to records in a relational database: a Dataset is a distributed collection of data. It can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, and so on); see the sketch after this list.

  • DataFrames

Conceptually similar to a table in a relational database. A DataFrame can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs.
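
A minimal sketch of the Dataset idea above (assuming the SparkSession `spark` created in section 1 below; the data values are illustrative): build a Dataset from in-memory JVM objects, then apply typed functional transformations.

import java.util.Arrays;
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Build a Dataset from in-memory JVM objects
Dataset<String> names = spark.createDataset(
  Arrays.asList("Michael", "Andy", "Justin"), Encoders.STRING());

// Functional transformations: keep names longer than 4 characters,
// then map each remaining name to its length
Dataset<Integer> lengths = names
  .filter((FilterFunction<String>) name -> name.length() > 4)
  .map((MapFunction<String, Integer>) String::length, Encoders.INT());

lengths.show();
// Michael -> 7, Justin -> 6 (Andy is filtered out)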

In Practice

1,Starting Point: SparkSession

  • The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder():
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate();

2,Creating DataFrames

  • With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");

// Displays the content of the DataFrame to stdout
df.show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

3,Untyped Dataset Operations (aka DataFrame Operations)

import static org.apache.spark.sql.functions.col;

// Print the schema in a tree format
df.printSchema();
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show();
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+

// Select everybody, but increment the age by 1
df.select(col("name"), col("age").plus(1)).show();
// +-------+---------+
// |   name|(age + 1)|
// +-------+---------+
// |Michael|     null|
// |   Andy|       31|
// | Justin|       20|
// +-------+---------+

// Select people older than 21
df.filter(col("age").gt(21)).show();
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

// Count people by age
df.groupBy("age").count().show();
// +----+-----+
// | age|count|
// +----+-----+
// |  19|    1|
// |null|    1|
// |  30|    1|
// +----+-----+

4,Running SQL Queries Programmatically

  • The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a Dataset<Row>.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people");

Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
sqlDF.show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

5,Global Temporary View

  • Temporary views are session-scoped and disappear when the session that created them ends. A global temporary view is shared across all sessions and stays alive until the Spark application terminates; it is registered under the system-preserved database global_temp.

// Register the DataFrame as a global temporary view
df.createGlobalTempView("people");

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

6,Creating Datasets

  • Datasets are similar to RDDs, but instead of Java serialization or Kryo they use a specialized Encoder to serialize the objects for processing or for transmission over the network.

import java.util.Collections;
import java.io.Serializable;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

public static class Person implements Serializable {
  private String name;
  private int age;

  public String getName() {
    return name;
  }

  public void setName(String name) {
    this.name = name;
  }

  public int getAge() {
    return age;
  }

  public void setAge(int age) {
    this.age = age;
  }
}

// Create an instance of a Bean class
Person person = new Person();
person.setName("Andy");
person.setAge(32);

// Encoders are created for Java beans
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> javaBeanDS = spark.createDataset(
  Collections.singletonList(person),
  personEncoder);
javaBeanDS.show();
// +---+----+
// |age|name|
// +---+----+
// | 32|Andy|
// +---+----+

7,Interoperating with RDDs

  • Inferring the Schema Using Reflection
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Create an RDD of Person objects from a text file
JavaRDD<Person> peopleRDD = spark.read()
  .textFile("examples/src/main/resources/people.txt")
  .javaRDD()
  .map(line -> {
    String[] parts = line.split(",");
    Person person = new Person();
    person.setName(parts[0]);
    person.setAge(Integer.parseInt(parts[1].trim()));
    return person;
  });

// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> peopleDF = spark.createDataFrame(peopleRDD, Person.class);

// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people");
  • Programmatically Specifying the Schema
import java.util.ArrayList;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Create an RDD
JavaRDD<String> peopleRDD = spark.sparkContext()
  .textFile("examples/src/main/resources/people.txt", 1)
  .toJavaRDD();

// The schema is encoded in a string
String schemaString = "name age";

// Generate the schema based on the string of schema
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
  StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
  fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);

// Convert records of the RDD (people) to Rows
JavaRDD<Row> rowRDD = peopleRDD.map((Function<String, Row>) record -> {
  String[] attributes = record.split(",");
  return RowFactory.create(attributes[0], attributes[1].trim());
});

// Apply the schema to the RDD
Dataset<Row> peopleDataFrame = spark.createDataFrame(rowRDD, schema);

// Creates a temporary view using the DataFrame
peopleDataFrame.createOrReplaceTempView("people");

// SQL can be run over a temporary view created using DataFrames
Dataset<Row> results = spark.sql("SELECT name FROM people");

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
Dataset<String> namesDS = results.map(
  (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
  Encoders.STRING());
namesDS.show();
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+

8,Aggregations

The built-in DataFrame functions provide common aggregations such as count(), countDistinct(), avg(), max(), and min().
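
A short sketch of these built-in aggregations, reusing the `df` DataFrame loaded from people.json in section 2 (an assumption; any DataFrame with name and age columns works):

import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.count;
import static org.apache.spark.sql.functions.countDistinct;
import static org.apache.spark.sql.functions.max;
import static org.apache.spark.sql.functions.min;

// Aggregate over the whole DataFrame: non-null name count, distinct names,
// and the average/max/min age (nulls are ignored by avg/max/min)
df.agg(
  count("name"),
  countDistinct("name"),
  avg("age"),
  max("age"),
  min("age")
).show();
// With the sample people.json data: 3 names, 3 distinct names,
// avg(age) = 24.5, max(age) = 30, min(age) = 19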
