Spark Study Notes (1): SparkSQL's registerAsTable vs. registerTempTable


While working through a SparkSQL tutorial today, I followed the sample code and used the registerAsTable function to register a table.
Tutorial source:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Person(name: String, age: Int)
val people = sc.textFile("File:/home/hadoop/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim().toInt)).toDF()
people.registerAsTable("people")
val teenagers = sqlContext.sql("select name,age from people")
teenagers.map { t => t(0) + " " + t(1) } collect() foreach { println }

But when execution reached people.registerAsTable("people"), the shell reported the following error:
(Screenshot: error raised by registerAsTable)
After digging through related articles, I finally found the cause and the fix, so I am recording it here for future reference.
Cause analysis:
(1) Different function name:
The tutorial was written against a pre-1.3 version of Spark, while I am running Spark 1.6, so the corresponding API has changed: since Spark 1.3, tables are registered with registerTempTable. Looking this function up under the DataFrame entry of the Spark 1.6.1 documentation gives the following description:
(Screenshot: scaladoc description of registerTempTable)
The description confirms that this function replaced registerAsTable as of Spark 1.3.
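For reference, the documentation entry in that screenshot corresponds roughly to the following (paraphrased from the Spark 1.6 DataFrame scaladoc, not copied verbatim):

```scala
// DataFrame method available since Spark 1.3: registers this DataFrame
// as a temporary table with the given name. The table's lifetime is
// tied to the SQLContext that created the DataFrame.
def registerTempTable(tableName: String): Unit
```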
With the cause identified, I changed the call to registerTempTable and ran again; the source now reads:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Person(name: String, age: Int)
val people = sc.textFile("File:/home/hadoop/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim().toInt)).toDF()
people.registerTempTable("people")
val teenagers = sqlContext.sql("select name,age from people")
teenagers.map { t => t(0) + " " + t(1) } collect() foreach { println }

But when execution reached people.registerTempTable("people"), the shell again complained that the function could not be found.
Continuing to investigate: if the function cannot be found, perhaps the import is wrong, so I tried changing the imported package.
(2) Different import:
Checking the packages shows that the new API requires import sqlContext.implicits._ rather than the tutorial's import sqlContext._, since the former is what brings the needed implicit conversions (such as toDF()) into scope. Updated source:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(name: String, age: Int)
val people = sc.textFile("File:/home/hadoop/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim().toInt)).toDF()
people.registerTempTable("people")
val teenagers = sqlContext.sql("select name,age from people")
teenagers.map { t => t(0) + " " + t(1) } collect() foreach { println }

This time the program ran without errors, producing the following output:

scala> teenagers.map { t => t(0)+" "+t(1) } collect() foreach { println }
16/04/17 08:20:03 INFO mapred.FileInputFormat: Total input paths to process : 1
16/04/17 08:20:04 INFO spark.SparkContext: Starting job: collect at <console>:35
16/04/17 08:20:04 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:35) with 1 output partitions
16/04/17 08:20:04 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (collect at <console>:35)
16/04/17 08:20:04 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/04/17 08:20:04 INFO scheduler.DAGScheduler: Missing parents: List()
16/04/17 08:20:04 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[7] at map at <console>:35), which has no missing parents
16/04/17 08:20:04 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.4 KB, free 89.4 KB)
16/04/17 08:20:04 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.8 KB, free 93.2 KB)
16/04/17 08:20:04 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:34132 (size: 3.8 KB, free: 517.4 MB)
16/04/17 08:20:04 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/04/17 08:20:04 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[7] at map at <console>:35)
16/04/17 08:20:04 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/04/17 08:20:04 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2139 bytes)
16/04/17 08:20:04 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/04/17 08:20:04 INFO rdd.HadoopRDD: Input split: file:/home/hadoop/examples/people.txt:0+32
16/04/17 08:20:04 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/04/17 08:20:04 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/04/17 08:20:04 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/04/17 08:20:04 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/04/17 08:20:04 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/04/17 08:20:05 INFO codegen.GenerateUnsafeProjection: Code generated in 348.055303 ms
16/04/17 08:20:05 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2224 bytes result sent to driver
16/04/17 08:20:05 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 788 ms on localhost (1/1)
16/04/17 08:20:05 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/04/17 08:20:05 INFO scheduler.DAGScheduler: ResultStage 0 (collect at <console>:35) finished in 0.851 s
16/04/17 08:20:05 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:35, took 1.190281 s
Michael 29
Andy 30
Justin 19

And with that, the problem is solved.
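One forward-looking note: if you later move to Spark 2.x, this API shifts once more. There, registerTempTable is deprecated in favor of createOrReplaceTempView, and SQLContext is subsumed by SparkSession. A minimal sketch of the same example on Spark 2.x (same file path and case class as above; written from the 2.x API, not run as part of this note) might look like:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession replaces SQLContext as the entry point in Spark 2.x.
val spark = SparkSession.builder().appName("people").getOrCreate()
import spark.implicits._ // brings toDF() into scope, as sqlContext.implicits._ did

case class Person(name: String, age: Int)

val people = spark.sparkContext
  .textFile("file:/home/hadoop/examples/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

// createOrReplaceTempView replaces the deprecated registerTempTable.
people.createOrReplaceTempView("people")
spark.sql("select name, age from people").collect().foreach(println)
```

The renamed method makes the semantics explicit: it creates (or overwrites) a session-scoped view rather than a persistent table.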
