Implementing Multi-Table Join Queries with Spark SQL

Source: Internet · Editor: 程序博客网 · Published: 2024/05/21 12:00

1. Prepare the data (JSON format is used as the example)

The student table:

{"sid":"S001","sname":"zhangsan","age":"12","gender":"female"}
{"sid":"S002","sname":"lisi","age":"13","gender":"male"}
{"sid":"S003","sname":"wangwu","age":"14","gender":"male"}
{"sid":"S004","sname":"zhaoliu","age":"15","gender":"female"}

The course table:

{"cid":"C001","cname":"football"}
{"cid":"C002","cname":"music"}
{"cid":"C003","cname":"art"}

The student_course table (the join table linking students to courses):

{"id":"1","sid":"S001","cid":"C001"}
{"id":"2","sid":"S002","cid":"C001"}
{"id":"3","sid":"S002","cid":"C002"}
{"id":"4","sid":"S003","cid":"C003"}
{"id":"5","sid":"S003","cid":"C001"}
{"id":"6","sid":"S004","cid":"C003"}
{"id":"7","sid":"S004","cid":"C002"}

2. Start the Spark shell

Command: ./spark-shell --master local

Tables created this way are persistent tables. The triple double quotes around each statement let the SQL string contain embedded double quotes without escaping. The table metadata is stored in Spark's built-in metastore (a Derby database).

sqlContext.sql("""create table student_course using org.apache.spark.sql.json options(path "hdfs://192.168.1.110:9000/data/sparksql/student_course.json")""")

sqlContext.sql("""create table course using org.apache.spark.sql.json options(path "hdfs://192.168.1.110:9000/data/sparksql/course.json")""")

sqlContext.sql("""create table student using org.apache.spark.sql.json options(path "hdfs://192.168.1.110:9000/data/sparksql/student.json")""")
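If you do not need the tables to persist in the metastore, a session-scoped alternative (a sketch using the same Spark 1.x sqlContext API and the same HDFS paths as the CREATE TABLE statements above) is to load each JSON file as a DataFrame and register it as a temporary table:

```scala
// Load each JSON file into a DataFrame and register it as a
// temporary table. Temporary tables live only for this session;
// nothing is written to the Derby metastore.
val student = sqlContext.read.json("hdfs://192.168.1.110:9000/data/sparksql/student.json")
val course  = sqlContext.read.json("hdfs://192.168.1.110:9000/data/sparksql/course.json")
val enrollments = sqlContext.read.json("hdfs://192.168.1.110:9000/data/sparksql/student_course.json")

student.registerTempTable("student")
course.registerTempTable("course")
enrollments.registerTempTable("student_course")
```

After registration the same `sqlContext.sql(...)` queries shown below work unchanged against the temporary tables.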


View the tables:

sqlContext.sql("show tables").show()

Run the join query:

sqlContext.sql("select * from course c,student s,student_course sc where c.cid=sc.cid and s.sid=sc.sid").show()
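The comma-separated FROM clause above expresses an implicit inner join. The same three-way join can be written with explicit JOIN ... ON syntax, or directly through the DataFrame API (a sketch reusing the tables created above; the variable names are illustrative):

```scala
// Equivalent query with explicit ANSI JOIN syntax, which makes the
// join conditions easier to read than a comma-separated FROM list.
sqlContext.sql("""
  SELECT *
  FROM student_course sc
  JOIN student s ON s.sid = sc.sid
  JOIN course  c ON c.cid = sc.cid
""").show()

// The same join expressed with the DataFrame API.
// ("enrollments" avoids shadowing the spark-shell's built-in `sc`.)
val student     = sqlContext.table("student")
val course      = sqlContext.table("course")
val enrollments = sqlContext.table("student_course")

enrollments
  .join(student, enrollments("sid") === student("sid"))
  .join(course, enrollments("cid") === course("cid"))
  .show()
```

All three forms produce the same rows; only the notation differs.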


The query returns the following data:

+----+--------+---+------+--------+----+----+---+----+
| cid|   cname|age|gender|   sname| sid| cid| id| sid|
+----+--------+---+------+--------+----+----+---+----+
|C001|football| 12|female|zhangsan|S001|C001|  1|S001|
|C001|football| 13|  male|    lisi|S002|C001|  2|S002|
|C001|football| 14|  male|  wangwu|S003|C001|  5|S003|
|C002|   music| 13|  male|    lisi|S002|C002|  3|S002|
|C003|     art| 14|  male|  wangwu|S003|C003|  4|S003|
+----+--------+---+------+--------+----+----+---+----+


