Implementing Multi-Table Join Queries with Spark SQL

Source: Internet · Editor: 程序博客网 · Published: 2024/05/21 12:00

1. Prepare the data (JSON format is used as the example)

The student table:

{"sid":"S001","sname":"zhangsan","age":"12","gender":"female"}
{"sid":"S002","sname":"lisi","age":"13","gender":"male"}
{"sid":"S003","sname":"wangwu","age":"14","gender":"male"}
{"sid":"S004","sname":"zhaoliu","age":"15","gender":"female"}

The course table:

{"cid":"C001","cname":"football"}
{"cid":"C002","cname":"music"}
{"cid":"C003","cname":"art"}

The student_course table (the join table linking students to courses):

{"id":"1","sid":"S001","cid":"C001"}
{"id":"2","sid":"S002","cid":"C001"}
{"id":"3","sid":"S002","cid":"C002"}
{"id":"4","sid":"S003","cid":"C003"}
{"id":"5","sid":"S003","cid":"C001"}
{"id":"6","sid":"S004","cid":"C003"}
{"id":"7","sid":"S004","cid":"C002"}

2. Start the Spark shell

Command: ./spark-shell --master local

Tables created this way are persistent tables. The triple double quotes around each statement let the SQL string contain embedded double quotes without escaping. The table metadata is stored in Spark's built-in metastore (a Derby database).

sqlContext.sql("""create table student_course using org.apache.spark.sql.json options(path "hdfs://192.168.1.110:9000/data/sparksql/student_course.json")""")

sqlContext.sql("""create table course using org.apache.spark.sql.json options(path "hdfs://192.168.1.110:9000/data/sparksql/course.json")""")

sqlContext.sql("""create table student using org.apache.spark.sql.json options(path "hdfs://192.168.1.110:9000/data/sparksql/student.json")""")
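If you do not need the tables to persist in the metastore, a session-scoped alternative (a sketch using the same Spark 1.x sqlContext API and the same HDFS paths as the CREATE TABLE statements above) is to load each JSON file as a DataFrame and register it as a temporary table:

```scala
// Load each JSON file into a DataFrame and register it as a
// temporary table. Temporary tables live only for this session;
// nothing is written to the Derby metastore.
val student = sqlContext.read.json("hdfs://192.168.1.110:9000/data/sparksql/student.json")
val course  = sqlContext.read.json("hdfs://192.168.1.110:9000/data/sparksql/course.json")
val enrollments = sqlContext.read.json("hdfs://192.168.1.110:9000/data/sparksql/student_course.json")

student.registerTempTable("student")
course.registerTempTable("course")
enrollments.registerTempTable("student_course")
```

After registration the same `sqlContext.sql(...)` queries shown below work unchanged against the temporary tables.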


View the tables:

sqlContext.sql("show tables").show()

Run the join query:

sqlContext.sql("select * from course c,student s,student_course sc where c.cid=sc.cid and s.sid=sc.sid").show()
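The comma-separated FROM clause above expresses an implicit inner join. The same three-way join can be written with explicit JOIN ... ON syntax, or directly through the DataFrame API (a sketch reusing the tables created above; the variable names are illustrative):

```scala
// Equivalent query with explicit ANSI JOIN syntax, which makes the
// join conditions easier to read than a comma-separated FROM list.
sqlContext.sql("""
  SELECT *
  FROM student_course sc
  JOIN student s ON s.sid = sc.sid
  JOIN course  c ON c.cid = sc.cid
""").show()

// The same join expressed with the DataFrame API.
// ("enrollments" avoids shadowing the spark-shell's built-in `sc`.)
val student     = sqlContext.table("student")
val course      = sqlContext.table("course")
val enrollments = sqlContext.table("student_course")

enrollments
  .join(student, enrollments("sid") === student("sid"))
  .join(course, enrollments("cid") === course("cid"))
  .show()
```

All three forms produce the same rows; only the notation differs.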


The query returns the following data:

+----+--------+---+------+--------+----+----+---+----+
| cid|   cname|age|gender|   sname| sid| cid| id| sid|
+----+--------+---+------+--------+----+----+---+----+
|C001|football| 12|female|zhangsan|S001|C001|  1|S001|
|C001|football| 13|  male|    lisi|S002|C001|  2|S002|
|C001|football| 14|  male|  wangwu|S003|C001|  5|S003|
|C002|   music| 13|  male|    lisi|S002|C002|  3|S002|
|C003|     art| 14|  male|  wangwu|S003|C003|  4|S003|
+----+--------+---+------+--------+----+----+---+----+


