Spark SQL
Source: Internet | Editor: 程序博客网 | Time: 2024/06/15 12:30
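Concepts
- Datasets
A Dataset is roughly analogous to the records of a relational database: it is a distributed collection of data. A Dataset can be constructed from JVM objects and then manipulated with functional transformations (map, flatMap, filter, etc.).
- DataFrames
A DataFrame is roughly analogous to a table in a relational database. DataFrames can be constructed from a wide range of sources, for example structured data files, tables in Hive, external databases, or existing RDDs. In the Java API a DataFrame is represented as Dataset<Row>.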
In Practice
1,Starting Point: SparkSession
- The entry point into all functionality in Spark is the SparkSession class. To create a basic SparkSession, just use SparkSession.builder():
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.some.config.option", "some-value")
  .getOrCreate();
2,Creating DataFrames
- With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

Dataset<Row> df = spark.read().json("examples/src/main/resources/people.json");

// Displays the content of the DataFrame to stdout
df.show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
3,Untyped Dataset Operations (aka DataFrame Operations)
import static org.apache.spark.sql.functions.col;

// Print the schema in a tree format
df.printSchema();
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)

// Select only the "name" column
df.select("name").show();
// +-------+
// |   name|
// +-------+
// |Michael|
// |   Andy|
// | Justin|
// +-------+

// Select everybody, but increment the age by 1
df.select(col("name"), col("age").plus(1)).show();
// +-------+---------+
// |   name|(age + 1)|
// +-------+---------+
// |Michael|     null|
// |   Andy|       31|
// | Justin|       20|
// +-------+---------+

// Select people older than 21
df.filter(col("age").gt(21)).show();
// +---+----+
// |age|name|
// +---+----+
// | 30|Andy|
// +---+----+

// Count people by age
df.groupBy("age").count().show();
// +----+-----+
// | age|count|
// +----+-----+
// |  19|    1|
// |null|    1|
// |  30|    1|
// +----+-----+
4,Running SQL Queries Programmatically
- The sql function on a SparkSession lets applications run SQL queries programmatically and returns the result as a Dataset<Row>.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people");

Dataset<Row> sqlDF = spark.sql("SELECT * FROM people");
sqlDF.show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
Global Temporary View
// Register the DataFrame as a global temporary view
df.createGlobalTempView("people");

// Global temporary view is tied to a system preserved database `global_temp`
spark.sql("SELECT * FROM global_temp.people").show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+

// Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show();
// +----+-------+
// | age|   name|
// +----+-------+
// |null|Michael|
// |  30|   Andy|
// |  19| Justin|
// +----+-------+
Creating Datasets
import java.util.Arrays;
import java.util.Collections;
import java.io.Serializable;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

public static class Person implements Serializable {
  private String name;
  private int age;

  public String getName() {
    return name;
  }

  public void setName(String name) {
    this.name = name;
  }

  public int getAge() {
    return age;
  }

  public void setAge(int age) {
    this.age = age;
  }
}

// Create an instance of a Bean class
Person person = new Person();
person.setName("Andy");
person.setAge(32);

// Encoders are created for Java beans
Encoder<Person> personEncoder = Encoders.bean(Person.class);
Dataset<Person> javaBeanDS = spark.createDataset(
  Collections.singletonList(person),
  personEncoder
);
javaBeanDS.show();
// +---+----+
// |age|name|
// +---+----+
// | 32|Andy|
// +---+----+
Interoperating with RDDs
- Inferring the schema using reflection
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;

// Create an RDD of Person objects from a text file
JavaRDD<Person> peopleRDD = spark.read()
  .textFile("examples/src/main/resources/people.txt")
  .javaRDD()
  .map(line -> {
    String[] parts = line.split(",");
    Person person = new Person();
    person.setName(parts[0]);
    person.setAge(Integer.parseInt(parts[1].trim()));
    return person;
  });

// Apply a schema to an RDD of JavaBeans to get a DataFrame
Dataset<Row> peopleDF = spark.createDataFrame(peopleRDD, Person.class);

// Register the DataFrame as a temporary view
peopleDF.createOrReplaceTempView("people");
- Programmatically specifying the schema
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

// Create an RDD
JavaRDD<String> peopleRDD = spark.sparkContext()
  .textFile("examples/src/main/resources/people.txt", 1)
  .toJavaRDD();

// The schema is encoded in a string
String schemaString = "name age";

// Generate the schema based on the string of schema
// (every field, including age, is declared as a nullable string here)
List<StructField> fields = new ArrayList<>();
for (String fieldName : schemaString.split(" ")) {
  StructField field = DataTypes.createStructField(fieldName, DataTypes.StringType, true);
  fields.add(field);
}
StructType schema = DataTypes.createStructType(fields);

// Convert records of the RDD (people) to Rows
JavaRDD<Row> rowRDD = peopleRDD.map((Function<String, Row>) record -> {
  String[] attributes = record.split(",");
  return RowFactory.create(attributes[0], attributes[1].trim());
});

// Apply the schema to the RDD
Dataset<Row> peopleDataFrame = spark.createDataFrame(rowRDD, schema);

// Creates a temporary view using the DataFrame
peopleDataFrame.createOrReplaceTempView("people");

// SQL can be run over a temporary view created using DataFrames
Dataset<Row> results = spark.sql("SELECT name FROM people");

// The results of SQL queries are DataFrames and support all the normal RDD operations
// The columns of a row in the result can be accessed by field index or by field name
Dataset<String> namesDS = results.map(
  (MapFunction<Row, String>) row -> "Name: " + row.getString(0),
  Encoders.STRING());
namesDS.show();
// +-------------+
// |        value|
// +-------------+
// |Name: Michael|
// |   Name: Andy|
// | Name: Justin|
// +-------------+
Aggregations
The built-in DataFrame functions provide common aggregations such as count(), countDistinct(), avg(), max(), min(), etc.
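One detail worth keeping in mind with these aggregates is that, like their SQL counterparts, they skip null values, and the people.json data used earlier has a null age for Michael. In Spark itself the calls would look something like `df.agg(avg("age"), max("age"))` with the functions imported from `org.apache.spark.sql.functions`. The block below is not Spark code: it is a minimal plain-Java sketch that emulates the null-skipping behaviour on the same three ages so it can be run without a cluster. The class name `AggregateSketch` and its helper methods are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Plain-Java emulation (no Spark dependency) of how SQL-style aggregates treat nulls.
// AggregateSketch and its methods are hypothetical names used only for illustration.
final class AggregateSketch {

    // Ages from the people.json example: Michael's age is missing (null).
    static final List<Long> AGES = Arrays.asList((Long) null, 30L, 19L);

    // Emulates count(age): counts only the non-null values.
    static long count(List<Long> values) {
        return values.stream().filter(Objects::nonNull).count();
    }

    // Emulates countDistinct(age): counts distinct non-null values.
    static long countDistinct(List<Long> values) {
        return values.stream().filter(Objects::nonNull).distinct().count();
    }

    // Emulates avg(age): mean of the non-null values.
    // Returns NaN when there are none (Spark would return null instead).
    static double avg(List<Long> values) {
        return values.stream()
                .filter(Objects::nonNull)
                .mapToLong(Long::longValue)
                .average()
                .orElse(Double.NaN);
    }

    // Emulates max(age) over the non-null values; throws if there are none.
    static long max(List<Long> values) {
        return values.stream()
                .filter(Objects::nonNull)
                .mapToLong(Long::longValue)
                .max()
                .getAsLong();
    }

    // Emulates min(age) over the non-null values; throws if there are none.
    static long min(List<Long> values) {
        return values.stream()
                .filter(Objects::nonNull)
                .mapToLong(Long::longValue)
                .min()
                .getAsLong();
    }

    public static void main(String[] args) {
        System.out.println("count         = " + count(AGES));         // 2
        System.out.println("countDistinct = " + countDistinct(AGES)); // 2
        System.out.println("avg           = " + avg(AGES));           // 24.5
        System.out.println("max           = " + max(AGES));           // 30
        System.out.println("min           = " + min(AGES));           // 19
    }
}
```

Note that this differs from the `groupBy("age").count()` output shown earlier, where the null age forms its own group and its row is counted: grouping counts rows, while `count` on a column counts non-null values.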