Spark DataFrame joins
Joins on a Spark DataFrame work much like their SQL counterparts: inner join, left join, right join, and full join are all available.
How does the join method select among them? Its signatures are:

def join(right: DataFrame, usingColumns: Seq[String], joinType: String): DataFrame
def join(right: DataFrame, joinExprs: Column, joinType: String): DataFrame

As you can see, the join type is chosen by passing a joinType string. "inner", "left", "right", and "full" correspond to inner, left, right, and full joins respectively; the default is "inner", an inner join:
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person")).show()
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "inner").show()
The result is as follows (the output screenshot was not preserved in this copy).
"left", "left_outer", and "leftouter" all select a left outer join:
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "left").show()
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "left_outer").show()
The result is as follows (output screenshot not preserved).
"right", "right_outer", and "rightouter" all select a right outer join:
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "right").show()
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "right_outer").show()
The result is as follows (output screenshot not preserved).
"full", "outer", "full_outer", and "fullouter" all select a full outer join:
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "full").show()
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "full_outer").show()
personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "outer").show()
The result is as follows (output screenshot not preserved).
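Since the output tables did not survive in this copy, the row counts each join type should produce can be checked with a small plain-Python sketch. This is not Spark code; it re-implements the four join semantics over the same sample data used in the Scala test program (persons 1-4; order 5 references a non-existent person 11, and person 4 has no orders):

```python
# Plain-Python illustration of the four join types on id_person,
# mirroring the sample data from the Scala test program.
persons = [(1, "张三"), (2, "李四"), (3, "王五"), (4, "朱六")]
orders = [(1, 325, 2), (2, 34, 3), (3, 533, 1), (4, 444, 1), (5, 777, 11)]

def join(left, right, how):
    """Join person rows to order rows on id_person, like df.join(..., how)."""
    rows = []
    matched_right = set()
    for pid, name in left:
        hits = [o for o in right if o[2] == pid]
        matched_right.update(o[0] for o in hits)
        if hits:
            rows.extend((pid, name, o[0], o[1]) for o in hits)
        elif how in ("left", "full"):   # unmatched left row -> nulls on the right
            rows.append((pid, name, None, None))
    if how in ("right", "full"):        # unmatched right rows -> nulls on the left
        rows.extend((None, None, o[0], o[1])
                    for o in right if o[0] not in matched_right)
    return rows

for how in ("inner", "left", "right", "full"):
    print(how, len(join(persons, orders, how)))
# prints: inner 4 / left 5 / right 5 / full 6
```

Inner keeps only the four matched pairs; left adds the orphan person 4, right adds the orphan order 5, and full adds both.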
Full Scala test source:
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.sql.SQLContext

case class Persons(id_person: Int, name: String, address: String)
case class Orders(id_order: Int, orderNum: Int, id_person: Int)

object DataFrameTest {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("DataFrameTest")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    val personDataFrame = sqlContext.createDataFrame(List(
      Persons(1, "张三", "深圳"),
      Persons(2, "李四", "成都"),
      Persons(3, "王五", "厦门"),
      Persons(4, "朱六", "杭州")))
    val orderDataFrame = sqlContext.createDataFrame(List(
      Orders(1, 325, 2),
      Orders(2, 34, 3),
      Orders(3, 533, 1),
      Orders(4, 444, 1),
      Orders(5, 777, 11)))

    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person")).show()
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "inner").show()
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "left").show()
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "left_outer").show()
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "right").show()
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "right_outer").show()
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "full").show()
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "full_outer").show()
    personDataFrame.join(orderDataFrame, personDataFrame("id_person") === orderDataFrame("id_person"), "outer").show()
  }
}
How is this implemented? Looking at the SQL module of the Spark source, the string is converted to a JoinType. The JoinType companion object first lowercases the string and strips any underscores (which is why "left_outer" and "leftouter" are equivalent), then pattern-matches on the result to pick the join type. The source also shows a fifth kind besides inner, left, right, and full: LeftSemi. A left semi join returns only the left-side rows that have at least one match on the right, and keeps only the left-side columns (similar to SQL's WHERE EXISTS).
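The lowercase-and-strip-underscores normalization can be mirrored in a few lines of plain Python. This is only a sketch of the same alias-resolution idea, not Spark's actual code:

```python
# Mirrors JoinType.apply: lowercase, strip underscores, then match aliases.
ALIASES = {
    "inner": "Inner",
    "outer": "FullOuter", "full": "FullOuter", "fullouter": "FullOuter",
    "leftouter": "LeftOuter", "left": "LeftOuter",
    "rightouter": "RightOuter", "right": "RightOuter",
    "leftsemi": "LeftSemi",
}

def join_type(typ: str) -> str:
    key = typ.lower().replace("_", "")
    if key not in ALIASES:
        raise ValueError(f"Unsupported join type '{typ}'")
    return ALIASES[key]

print(join_type("LEFT_OUTER"))  # -> LeftOuter, same as "left" or "leftouter"
```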
The JoinType source in Spark:
object JoinType {
  def apply(typ: String): JoinType = typ.toLowerCase.replace("_", "") match {
    case "inner" => Inner
    case "outer" | "full" | "fullouter" => FullOuter
    case "leftouter" | "left" => LeftOuter
    case "rightouter" | "right" => RightOuter
    case "leftsemi" => LeftSemi
    case _ =>
      val supported = Seq(
        "inner",
        "outer", "full", "fullouter",
        "leftouter", "left",
        "rightouter", "right",
        "leftsemi")
      throw new IllegalArgumentException(s"Unsupported join type '$typ'. " +
        "Supported join types include: " + supported.mkString("'", "', '", "'") + ".")
  }
}

sealed abstract class JoinType
case object Inner extends JoinType
case object LeftOuter extends JoinType
case object RightOuter extends JoinType
case object FullOuter extends JoinType
case object LeftSemi extends JoinType
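To make the LeftSemi case concrete, here is a plain-Python sketch (not Spark code) of its semantics on the same sample data: keep only the left rows that have at least one match on the right, and keep only the left-hand columns:

```python
# Left semi join on id_person: an existence filter on the left table.
persons = [(1, "张三"), (2, "李四"), (3, "王五"), (4, "朱六")]
orders = [(1, 325, 2), (2, 34, 3), (3, 533, 1), (4, 444, 1), (5, 777, 11)]

def left_semi(left, right):
    """Return left rows that have a matching order; left columns only."""
    right_keys = {o[2] for o in right}      # id_person values present in orders
    return [p for p in left if p[0] in right_keys]

print(left_semi(persons, orders))  # person 4, who has no orders, is dropped
```

Note that unlike an inner join, a row is emitted at most once even if it matches several orders, and no order columns appear in the result.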
hkl's note: after trying these out, DataFrame joins behave almost exactly like the familiar joins on MySQL tables; once you are clear about the business requirement, the right join type follows naturally. One pitfall for newcomers: the equality condition in a join expression is written with ===, not ==, so don't mix them up. I will update this post as new material comes up.
Source: http://blog.csdn.net/anjingwunai/article/details/51934921