SparkSQL DataFrame Study Notes


(1) Introduction

Yesterday, while processing records from xywy.com (寻医问药网), I had to deal with non-relational data. I used to handle non-relational data with MongoDB, but (a) my MongoDB setup was not distributed, so the amount of data it could process was capped, and (b) it did not connect well to external programs: to work on its data you had to install a third-party driver package and shuttle reads and writes back and forth, which was very inconvenient. Below we look at the pros and cons of the SparkSQL DataFrame abstraction for this kind of data.

(2) Reference Material

Essentially the only resources introducing DataFrame online are the official programming guide (http://spark.apache.org/docs/latest/sql-programming-guide.html) and the API docs (http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset).

(3) The Official API Description

Below is the introduction from the official API docs:

A Dataset is a strongly typed collection of domain-specific objects that can be transformed in parallel using functional or relational operations. Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row.

Operations available on Datasets are divided into transformations and actions. Transformations are the ones that produce new Datasets, and actions are the ones that trigger computation and return results. Example transformations include map, filter, select, and aggregate (groupBy). Example actions include count, show, or writing data out to file systems.

Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. To explore the logical plan as well as optimized physical plan, use the explain function.
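
To make the laziness concrete, here is a minimal sketch (assuming Spark 2.x with a SparkSession named spark; the HDFS path is the same file used in the REPL session further below):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("lazy-demo").getOrCreate()

// Reading JSON registers the source (schema inference may scan the file once),
// but select() is a pure transformation: it only extends the logical plan.
val in = spark.read.json("hdfs://master:8020/user/miaofu/healthcare/2016-05-01/content.txt")
val times = in.select("question.time")

// Actions are what trigger the optimizer and an actual job.
times.show()            // first 20 rows
println(times.count())  // a second job

// explain(true) prints the parsed, analyzed, optimized, and physical plans.
times.explain(true)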

Seen from this description, a DataFrame is a special kind of RDD: RDD[Row] plus a Schema. And precisely because Spark knows the internal structure of the data, it can apply further optimizations on top of what a plain RDD allows.
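
Both halves are directly visible on the API; a sketch continuing with the `in` DataFrame from above:

// The "RDD[Row]" half: an untyped RDD of generic Row objects.
val rows: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = in.rdd

// The "+ Schema" half: a StructType describing every field, which is what
// gives the optimizer something to reason about.
in.schema.printTreeString()   // the same tree that in.printSchema() prints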

To efficiently support domain-specific objects, an Encoder is required. The encoder maps the domain specific type T to Spark's internal type system. For example, given a class Person with two fields, name (string) and age (int), an encoder is used to tell Spark to generate code at runtime to serialize the Person object into a binary structure. This binary structure often has a much lower memory footprint and is optimized for efficiency in data processing (e.g. in a columnar format). To understand the internal binary representation for data, use the schema function.

Earlier, in the 《数据关联》 project, the NanStatCounter class I defined had to extend Serializable, and here I finally found the reason: Spark has to serialize the objects in an RDD to ship them around the cluster, which with plain RDDs goes through Java serialization (hence the Serializable requirement). For Datasets, the Encoder goes a step further and turns objects into the binary format described above, which occupies less memory and enables further efficiency optimizations later in processing.
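
A minimal sketch of the Person example from the quoted docs (Person is the documentation's hypothetical class, not something from my project; assuming the Spark 2.x session from above):

import org.apache.spark.sql.{Dataset, Encoder, Encoders}
import spark.implicits._  // implicit Encoders for case classes, primitives, etc.

// The two-field class from the quoted passage.
case class Person(name: String, age: Int)

// toDS() uses the implicit encoder to lay each Person out in Spark's compact
// internal binary row format rather than as a plain JVM object.
val people: Dataset[Person] = Seq(Person("Ann", 30), Person("Bob", 25)).toDS()
people.schema.printTreeString()  // inspect the internal representation

// An encoder can also be summoned explicitly:
val personEncoder: Encoder[Person] = Encoders.product[Person]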


(4) Some Interesting APIs

apply: last time, analyzing the xywy.com data was awkward because the records are nested; fortunately, the apply API's scaladoc points the way:

def apply(colName: String): Column
  Selects column based on the column name and return it as a Column. Note that the column name can also reference to a nested column like a.b.
  Since 2.0.0
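
Distilled, the variants that parse and resolve are these (a sketch against the DataFrame `in` loaded in the session below, which also records the attempts that fail):

// apply gives a Column; a struct Column can itself be indexed by field name.
val t1 = in("question")("time")

// apply also resolves a dotted name directly, as the scaladoc promises.
val t2 = in("question.time")

// select understands the same dotted names and returns a one-column DataFrame.
val t3 = in.select("question.time")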




scala> in
res0: org.apache.spark.sql.DataFrame = [answer: array<struct<answer_text:string,answer_time:string,mainpage:string,major:string,name:string,position:string>>, question: struct<question:string,time:string>, userinfo: struct<age:string,gender:string,name:string>]

scala> in.printSchema()
root
 |-- answer: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- answer_text: string (nullable = true)
 |    |    |-- answer_time: string (nullable = true)
 |    |    |-- mainpage: string (nullable = true)
 |    |    |-- major: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- position: string (nullable = true)
 |-- question: struct (nullable = true)
 |    |-- question: string (nullable = true)
 |    |-- time: string (nullable = true)
 |-- userinfo: struct (nullable = true)
 |    |-- age: string (nullable = true)
 |    |-- gender: string (nullable = true)
 |    |-- name: string (nullable = true)

scala> in("question"."time")
<console>:1: error: identifier expected but string literal found.
       in("question"."time")
                     ^

scala> in("question").("time")
<console>:1: error: identifier expected but '(' found.
       in("question").("time")
                      ^

scala> in("question")
res2: org.apache.spark.sql.Column = question

scala> in("question")("time")
res3: org.apache.spark.sql.Column = question[time]

scala> in("question.time")
res4: org.apache.spark.sql.Column = question.time AS time#3

scala> in.select("question.time")
res5: org.apache.spark.sql.DataFrame = [time: string]

scala> in.select("question.time").show()
+-------------------+
|               time|
+-------------------+
|2016-05-01 23:57:16|
|2016-05-01 23:59:56|
|2016-05-01 23:59:52|
|2016-05-01 23:59:50|
|2016-05-01 23:59:48|
|2016-05-01 23:59:47|
|2016-05-01 23:59:45|
|2016-05-01 23:59:39|
|2016-05-01 23:59:32|
|2016-05-01 23:59:20|
|2016-05-01 23:59:15|
|2016-05-01 23:59:12|
|2016-05-01 23:59:09|
|2016-05-01 23:59:06|
|2016-05-01 23:59:03|
|2016-05-01 23:59:03|
|2016-05-01 23:58:58|
|2016-05-01 23:58:49|
|2016-05-01 23:58:47|
|2016-05-01 23:58:38|
+-------------------+
only showing top 20 rows

scala> val user = in.select("userinfo")
user: org.apache.spark.sql.DataFrame = [userinfo: struct<age:string,gender:string,name:string>]

scala> user.show()
+------------------+
|          userinfo|
+------------------+
| [22,男,会员96129648]|
|[24岁,女,会员96129643]|
|[24岁,男,会员96129642]|
|[28岁,女,会员96129631]|
|[18岁,男,会员96129641]|
|[55岁,男,会员96129640]|
|[28岁,女,会员96129567]|
|[18岁,女,会员96129636]|
|[22岁,女,会员96129635]|
|[55岁,男,会员96118403]|
|[21岁,女,会员96129633]|
|[21岁,男,会员96129619]|
|[32岁,男,会员96129632]|
|[44岁,女,会员96129522]|
|[16岁,男,会员96129630]|
|[23岁,女,会员96129628]|
|[21岁,女,会员96129507]|
|[15岁,男,会员96129625]|
|[34岁,女,会员96129624]|
|[26岁,男,会员96128857]|
+------------------+
only showing top 20 rows


scala> user.explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias('userinfo)]
+- Relation[answer#0,question#1,userinfo#2] JSONRelation

== Analyzed Logical Plan ==
userinfo: struct<age:string,gender:string,name:string>
Project [userinfo#2]
+- Relation[answer#0,question#1,userinfo#2] JSONRelation

== Optimized Logical Plan ==
Project [userinfo#2]
+- Relation[answer#0,question#1,userinfo#2] JSONRelation

== Physical Plan ==
Scan JSONRelation[userinfo#2] InputPaths: hdfs://master:8020/user/miaofu/healthcare/2016-05-01/content.txt
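
Note what the optimizer bought us here: of the three top-level columns (answer, question, userinfo), the physical plan scans only userinfo#2, i.e. the projection was pushed all the way down into the JSON scan. This column pruning is exactly the kind of optimization that schema knowledge enables and that a plain RDD cannot get.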


