Spark: running some simple Spark SQL statements in spark-shell --12


Spark SQL -- simple statements
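The session below assumes a small comma-separated people.txt already sitting on HDFS at /datatnt/people.txt. The file itself is never shown in the log, but the 32-byte input split and the "Name: Justin" result further down match the sample file that ships with Spark (examples/src/main/resources/people.txt), so its contents are presumably:

Michael, 29
Andy, 30
Justin, 19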

scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext: org.apache.spark.sql.SQLContext = org.apache.spark.sql.SQLContext@7a0f926c

scala> import sqlContext._
import sqlContext._

scala> case class Person(name: String, age: Int)
defined class Person

scala> val people = sc.textFile("/datatnt/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim.toInt))
...(MemoryStore/BlockManager INFO output omitted)...
15/03/18 21:07:55 INFO spark.SparkContext: Created broadcast 0 from textFile at <console>:19
people: org.apache.spark.rdd.RDD[Person] = MappedRDD[3] at map at <console>:19

scala> people.registerAsTable("people")
warning: there were 1 deprecation warning(s); re-run with -deprecation for details

scala> people.toDebugString
15/03/18 21:15:07 INFO mapred.FileInputFormat: Total input paths to process : 1
res2: String =
(1) MappedRDD[3] at map at <console>:19 []
 |  MappedRDD[2] at map at <console>:19 []
 |  /datatnt/people.txt MappedRDD[1] at textFile at <console>:19 []
 |  /datatnt/people.txt HadoopRDD[0] at textFile at <console>:19 []

scala> val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 and age <= 19")
teenagers: org.apache.spark.sql.SchemaRDD = SchemaRDD[8] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Project [name#2]
 Filter ((age#3 >= 13) && (age#3 <= 19))
  PhysicalRDD [name#2,age#3], MapPartitionsRDD[6] at mapPartitions at ExistingRDD.scala:36

scala> teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
15/03/18 21:23:35 INFO spark.SparkContext: Starting job: collect at <console>:20
...(scheduler/storage INFO output omitted)...
15/03/18 21:23:43 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:20, took 8.170127 s
Name: Justin
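The deprecation warning emitted by registerAsTable above is because the method was renamed in Spark 1.1. The non-deprecated equivalent on this version is:

// registerTempTable replaces the deprecated registerAsTable (Spark >= 1.1)
people.registerTempTable("people")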
scala> val teenagers_dsl = people.where('age > 10).where('age < 19).select('name)   // column names take a leading quote (Scala Symbol syntax)
teenagers_dsl: org.apache.spark.sql.SchemaRDD = SchemaRDD[16] at RDD at SchemaRDD.scala:108
== Query Plan ==
== Physical Plan ==
Project [name#4]
 Filter ((age#5 > 10) && (age#5 < 19))
  PhysicalRDD [name#4,age#5], MapPartitionsRDD[12] at mapPartitions at ExistingRDD.scala:36

scala> teenagers_dsl.map(t => "Name: " + t(0)).collect().foreach(println)
15/03/18 21:35:59 INFO spark.SparkContext: Starting job: collect at <console>:24
...(scheduler/storage INFO output omitted)...
15/03/18 21:36:02 INFO scheduler.DAGScheduler: Job 2 finished: collect at <console>:24, took 2.235891 s
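Note that the DSL version above filters on 10 < age < 19 (exclusive bounds), which is not quite the same predicate as the earlier SQL query's age >= 13 and age <= 19. A closer DSL equivalent, sketched in the same chained-where style the session already uses, would be:

// same inclusive 13..19 range as the SQL query
val teenagers_dsl2 = people.where('age >= 13).where('age <= 19).select('name)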
scala> people.saveAsParquetFile("/outputtnt/people.parquet")   // the output contains both the table's schema and its rows
...(BlockManager/ContextCleaner INFO output omitted)...
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
15/03/18 21:46:24 INFO spark.SparkContext: Starting job: runJob at ParquetTableOperations.scala:325
...(scheduler/storage INFO output omitted)...
15/03/18 21:46:28 INFO codec.CodecConfig: Compression: GZIP
15/03/18 21:46:28 INFO hadoop.ParquetOutputFormat: Parquet block size to 134217728
15/03/18 21:46:28 INFO hadoop.ParquetOutputFormat: Parquet page size to 1048576
15/03/18 21:46:28 INFO hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
15/03/18 21:46:28 INFO hadoop.ParquetOutputFormat: Dictionary is on
15/03/18 21:46:28 INFO hadoop.ParquetOutputFormat: Validation is off
15/03/18 21:46:28 INFO hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
...(compressor/BlockManager/ContextCleaner INFO output omitted)...
15/03/18 21:47:31 INFO hadoop.ColumnChunkPageWriteStore: written 87B for [name] BINARY: 3 values, 35B raw, 51B comp, 1 pages, encodings: [PLAIN, RLE, BIT_PACKED]
15/03/18 21:47:31 INFO hadoop.ColumnChunkPageWriteStore: written 62B for [age] INT32: 3 values, 12B raw, 29B comp, 1 pages, encodings: [PLAIN, BIT_PACKED]
15/03/18 21:47:32 INFO output.FileOutputCommitter: Saved output of task 'attempt_201503182146_0023_r_000000_3' to hdfs://localhost:9000/outputtnt/people.parquet/_temporary/0/task_201503182146_0023_r_000000
15/03/18 21:47:33 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 3.0 (TID 3) in 66536 ms on localhost (1/1)
15/03/18 21:47:33 INFO scheduler.DAGScheduler: Stage 3 (runJob at ParquetTableOperations.scala:325) finished in 66.595 s
15/03/18 21:47:33 INFO scheduler.DAGScheduler: Job 3 finished: runJob at ParquetTableOperations.scala:325, took 68.124391 s
15/03/18 21:47:34 INFO hadoop.ParquetFileReader: Initiating action with parallelism: 5
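The Compression: GZIP line in the write log reflects this Spark version's default Parquet codec. If you would rather trade file size for faster writes, the codec can be switched before saving through the spark.sql.parquet.compression.codec setting:

// switch the Parquet codec; valid values include "gzip", "snappy", "lzo", "uncompressed"
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")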
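A natural follow-up, not part of the original session, is to read the Parquet file back and query it. Because the schema was saved with the data, no case class is needed this time; a minimal sketch against the same SQLContext:

// load the Parquet file back as a SchemaRDD; the schema travels with the file
val parquetPeople = sqlContext.parquetFile("/outputtnt/people.parquet")
parquetPeople.registerTempTable("parquetPeople")
sqlContext.sql("SELECT name FROM parquetPeople WHERE age <= 19").collect().foreach(println)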

