Spark SQL and DataFrame Guide (selected notes)
Source: Internet · Editor: 程序博客网 · Date: 2024/05/02
Reference link:
sql-programming-guide
A few key points, translated:
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as a distributed SQL query engine.
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as structured data files, tables in Hive, external databases, or existing RDDs.
With a SQLContext, applications can create DataFrames from an existing RDD, from a Hive table, or from data sources.
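As a sketch of the RDD route (Spark 1.3-era API, matching the SQLContext calls used later in this post; the Person case class and its sample rows are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type -- not part of the original post.
case class Person(name: String, age: Int)

object RddToDataFrame {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("RddToDataFrame").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._ // enables rdd.toDF()

    // Build a DataFrame from an existing RDD of case classes;
    // column names and types are inferred from the case class fields.
    val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 25))).toDF()
    people.printSchema()
  }
}
```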
DataFrames provide a domain-specific language for structured data manipulation in Scala, Java, and Python.
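A minimal sketch of that DSL in Scala (the `people` DataFrame and its columns are assumptions for illustration, not data from the post):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object DataFrameDsl {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("DataFrameDsl").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical sample data.
    val people = sc.parallelize(Seq(("alice", 30), ("bob", 17))).toDF("name", "age")

    people.select("name").show()              // project a single column
    people.filter(people("age") > 21).show()  // keep rows with age > 21
    people.groupBy("age").count().show()      // count rows per age
  }
}
```

Each operation returns a new DataFrame, so they compose the same way SQL clauses do.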
The sql function on a SQLContext enables applications to run SQL queries programmatically and returns the result as a DataFrame.
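A short sketch of the programmatic SQL path (Spark 1.3-era API; the table name and sample rows are made up):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object SqlQuerySketch {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("SqlQuerySketch").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("alice", 30), ("bob", 25))).toDF("name", "age")
    df.registerTempTable("people") // expose the DataFrame to SQL by name

    // sql() returns another DataFrame, so the result composes with
    // further DataFrame operations or actions like collect().
    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 18")
    adults.collect().foreach(println)
  }
}
```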
Parquet is a columnar format that is supported by many other data processing systems. Spark SQL provides support for both reading and writing Parquet files, and automatically preserves the schema of the original data.
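A sketch of the Parquet round trip (Spark 1.3-era API, matching `sqlContext.parquetFile` used in the test program below; the DataFrame contents and the /tmp path are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetRoundTrip {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetRoundTrip").setMaster("local"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = sc.parallelize(Seq(("alice", 30), ("bob", 25))).toDF("name", "age")

    // Write: column names and types are stored in the Parquet metadata.
    df.saveAsParquetFile("/tmp/people.parquet")

    // Read back: the schema is recovered from the file, nothing to declare.
    val restored = sqlContext.parquetFile("/tmp/people.parquet")
    restored.printSchema()
  }
}
```

Note that `saveAsParquetFile`/`parquetFile` were later superseded by `df.write.parquet` and `sqlContext.read.parquet`.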
Next comes a test program against the AMPCamp sample data, as follows:
import org.apache.spark.{SparkConf, SparkContext}

object TestDataFrameAndSql {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SparkPi").setMaster("local")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

    val wikiData = sqlContext.parquetFile("/home/hadoop/AMPCamp/ampcamp/data/wiki_parquet")

    val count = wikiData.count()
    println("count is " + count) // count is 39365

    wikiData.registerTempTable("wikiData")

    val countResult = sqlContext.sql("SELECT COUNT(*) FROM wikiData").collect()
    println("countResult is " + countResult) // countResult is [Lorg.apache.spark.sql.Row;@7fdacda0

    val sqlCount = countResult.head.getLong(0)
    println("sqlCount is " + sqlCount) // sqlCount is 39365

    sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10").collect().foreach(println)
    // [Waacstats,2003]
    // [Cydebot,949]
    // [BattyBot,939]
    // [Yobot,890]
    // [Addbot,853]
    // [Monkbot,668]
    // [ChrisGualtieri,438]
    // [RjwilmsiBot,387]
    // [OccultZone,377]
    // [ClueBot NG,353]

    sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE text LIKE '%california%' GROUP BY username ORDER BY cnt DESC LIMIT 10").collect().foreach(println)
    // [,179]
    // [BattyBot,42]
    // [Waacstats,37]
    // [RjwilmsiBot,30]
    // [Monkbot,26]
    // [Yobot,22]
    // [Bender235,20]
    // [Cydebot,19]
    // [ClueBot NG,16]
    // [Bgwhite,11]
  }
}
Here wiki_parquet is a directory with the following layout:
[hadoop@localhost wiki_parquet]$ ls -rlth
total 239M
-rwxrwxrwx. 1 hadoop hadoop 24M Jun 19 2014 part-r-4.parquet
-rwxrwxrwx. 1 hadoop hadoop 25M Jun 19 2014 part-r-2.parquet
-rwxrwxrwx. 1 hadoop hadoop 24M Jun 19 2014 part-r-1.parquet
-rwxrwxrwx. 1 hadoop hadoop 24M Jun 19 2014 part-r-3.parquet
-rwxrwxrwx. 1 hadoop hadoop 24M Jun 19 2014 part-r-5.parquet
-rwxrwxrwx. 1 hadoop hadoop 24M Jun 19 2014 part-r-6.parquet
-rwxrwxrwx. 1 hadoop hadoop 25M Jun 19 2014 part-r-7.parquet
-rwxrwxrwx. 1 hadoop hadoop 24M Jun 19 2014 part-r-8.parquet
-rwxrwxrwx. 1 hadoop hadoop 25M Jun 19 2014 part-r-9.parquet
-rwxrwxrwx. 1 hadoop hadoop 0 Jun 19 2014 _SUCCESS
-rwxrwxrwx. 1 hadoop hadoop 25M Jun 19 2014 part-r-10.parquet
-rwxrwxrwx. 1 hadoop hadoop 3.1K Jun 19 2014 _metadata
The output of each query is shown in the comments above.