Spark Study Notes (1): SparkSQL's registerAsTable vs. registerTempTable


While working through a SparkSQL tutorial today, I followed the sample code and used the registerAsTable function to register a table.
Tutorial source:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Person(name: String, age: Int)
val people = sc.textFile("File:/home/hadoop/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim().toInt)).toDF()
people.registerAsTable("people")
val teenagers = sqlContext.sql("select name,age from people")
teenagers.map { t => t(0) + " " + t(1) } collect() foreach { println }

But when execution reached people.registerAsTable("people"), the shell reported the following error:
(Screenshot: error raised by registerAsTable)
After digging through related articles, I finally found the cause and the fix, so I am recording it here for future reference.
Cause analysis:
(1) Different function name:
The tutorial was written against a pre-1.3 version of Spark, while I am running Spark 1.6, so the corresponding API has changed: since Spark 1.3, tables are registered with registerTempTable. Looking this function up under the DataFrame entry of the Spark 1.6.1 documentation gives the following description:
(Screenshot: scaladoc description of registerTempTable)
The description confirms that this function replaced registerAsTable as of Spark 1.3.
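For reference, the documentation entry in that screenshot corresponds roughly to the following (paraphrased from the Spark 1.6 DataFrame scaladoc, not copied verbatim):

```scala
// DataFrame method available since Spark 1.3: registers this DataFrame
// as a temporary table with the given name. The table's lifetime is
// tied to the SQLContext that created the DataFrame.
def registerTempTable(tableName: String): Unit
```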
With the cause identified, I changed the call to registerTempTable and ran again; the source now reads:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._
case class Person(name: String, age: Int)
val people = sc.textFile("File:/home/hadoop/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim().toInt)).toDF()
people.registerTempTable("people")
val teenagers = sqlContext.sql("select name,age from people")
teenagers.map { t => t(0) + " " + t(1) } collect() foreach { println }

But when execution reached people.registerTempTable("people"), the shell again complained that the function could not be found.
Continuing to investigate: if the function cannot be found, perhaps the import is wrong, so I tried changing the imported package.
(2) Different import:
Checking the packages shows that the new API requires import sqlContext.implicits._ rather than the tutorial's import sqlContext._, since the former is what brings the needed implicit conversions (such as toDF()) into scope. Updated source:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Person(name: String, age: Int)
val people = sc.textFile("File:/home/hadoop/examples/people.txt").map(_.split(",")).map(p => Person(p(0), p(1).trim().toInt)).toDF()
people.registerTempTable("people")
val teenagers = sqlContext.sql("select name,age from people")
teenagers.map { t => t(0) + " " + t(1) } collect() foreach { println }

This time the program ran without errors, producing the following output:

scala> teenagers.map { t => t(0)+" "+t(1) } collect() foreach { println }
16/04/17 08:20:03 INFO mapred.FileInputFormat: Total input paths to process : 1
16/04/17 08:20:04 INFO spark.SparkContext: Starting job: collect at <console>:35
16/04/17 08:20:04 INFO scheduler.DAGScheduler: Got job 0 (collect at <console>:35) with 1 output partitions
16/04/17 08:20:04 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (collect at <console>:35)
16/04/17 08:20:04 INFO scheduler.DAGScheduler: Parents of final stage: List()
16/04/17 08:20:04 INFO scheduler.DAGScheduler: Missing parents: List()
16/04/17 08:20:04 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[7] at map at <console>:35), which has no missing parents
16/04/17 08:20:04 INFO storage.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 7.4 KB, free 89.4 KB)
16/04/17 08:20:04 INFO storage.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 3.8 KB, free 93.2 KB)
16/04/17 08:20:04 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on localhost:34132 (size: 3.8 KB, free: 517.4 MB)
16/04/17 08:20:04 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
16/04/17 08:20:04 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[7] at map at <console>:35)
16/04/17 08:20:04 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
16/04/17 08:20:04 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, partition 0,PROCESS_LOCAL, 2139 bytes)
16/04/17 08:20:04 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
16/04/17 08:20:04 INFO rdd.HadoopRDD: Input split: file:/home/hadoop/examples/people.txt:0+32
16/04/17 08:20:04 INFO Configuration.deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
16/04/17 08:20:04 INFO Configuration.deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
16/04/17 08:20:04 INFO Configuration.deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
16/04/17 08:20:04 INFO Configuration.deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
16/04/17 08:20:04 INFO Configuration.deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
16/04/17 08:20:05 INFO codegen.GenerateUnsafeProjection: Code generated in 348.055303 ms
16/04/17 08:20:05 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2224 bytes result sent to driver
16/04/17 08:20:05 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 788 ms on localhost (1/1)
16/04/17 08:20:05 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/04/17 08:20:05 INFO scheduler.DAGScheduler: ResultStage 0 (collect at <console>:35) finished in 0.851 s
16/04/17 08:20:05 INFO scheduler.DAGScheduler: Job 0 finished: collect at <console>:35, took 1.190281 s
Michael 29
Andy 30
Justin 19

And with that, the problem is solved.
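One forward-looking note: if you later move to Spark 2.x, this API shifts once more. There, registerTempTable is deprecated in favor of createOrReplaceTempView, and SQLContext is subsumed by SparkSession. A minimal sketch of the same example on Spark 2.x (same file path and case class as above; written from the 2.x API, not run as part of this note) might look like:

```scala
import org.apache.spark.sql.SparkSession

// SparkSession replaces SQLContext as the entry point in Spark 2.x.
val spark = SparkSession.builder().appName("people").getOrCreate()
import spark.implicits._ // brings toDF() into scope, as sqlContext.implicits._ did

case class Person(name: String, age: Int)

val people = spark.sparkContext
  .textFile("file:/home/hadoop/examples/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()

// createOrReplaceTempView replaces the deprecated registerTempTable.
people.createOrReplaceTempView("people")
spark.sql("select name, age from people").collect().foreach(println)
```

The renamed method makes the semantics explicit: it creates (or overwrites) a session-scoped view rather than a persistent table.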
