Spark Dataset Operations (4): Other Single-Table Operations
A few odds and ends haven't been covered yet, such as adding columns, dropping columns, and handling NA values, so let's run through them here.
We'll use the same dataset as before:
scala> val df = spark.createDataset(Seq(("aaa", 1, 2), ("bbb", 3, 4), ("ccc", 3, 5), ("bbb", 4, 6))).toDF("key1", "key2", "key3")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]

scala> df.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)

scala> df.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+
Now let's add a column. The new column can be a string or an integer, and it can hold a constant or a transformation of an existing column:
/* Add a new string column key4, initialized to "new_str_col" for every row; note the lit() function */
scala> val df_1 = df.withColumn("key4", lit("new_str_col"))
df_1: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 2 more fields]

scala> df_1.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)
 |-- key4: string (nullable = false)

scala> df_1.show
+----+----+----+-----------+
|key1|key2|key3|       key4|
+----+----+----+-----------+
| aaa|   1|   2|new_str_col|
| bbb|   3|   4|new_str_col|
| ccc|   3|   5|new_str_col|
| bbb|   4|   6|new_str_col|
+----+----+----+-----------+

/* Similarly, add an Int column key5, initialized to 1024 */
scala> val df_2 = df_1.withColumn("key5", lit(1024))
df_2: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 3 more fields]

scala> df_2.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)
 |-- key4: string (nullable = false)
 |-- key5: integer (nullable = false)

scala> df_2.show
+----+----+----+-----------+----+
|key1|key2|key3|       key4|key5|
+----+----+----+-----------+----+
| aaa|   1|   2|new_str_col|1024|
| bbb|   3|   4|new_str_col|1024|
| ccc|   3|   5|new_str_col|1024|
| bbb|   4|   6|new_str_col|1024|
+----+----+----+-----------+----+

/* Now a new column that is not a constant: key6 = key5 * 2 */
scala> val df_3 = df_2.withColumn("key6", $"key5" * 2)
df_3: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 4 more fields]

scala> df_3.show
+----+----+----+-----------+----+----+
|key1|key2|key3|       key4|key5|key6|
+----+----+----+-----------+----+----+
| aaa|   1|   2|new_str_col|1024|2048|
| bbb|   3|   4|new_str_col|1024|2048|
| ccc|   3|   5|new_str_col|1024|2048|
| bbb|   4|   6|new_str_col|1024|2048|
+----+----+----+-----------+----+----+

/* The same thing using the expr() function */
scala> val df_4 = df_2.withColumn("key6", expr("key5 * 4"))
df_4: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 4 more fields]

scala> df_4.show
+----+----+----+-----------+----+----+
|key1|key2|key3|       key4|key5|key6|
+----+----+----+-----------+----+----+
| aaa|   1|   2|new_str_col|1024|4096|
| bbb|   3|   4|new_str_col|1024|4096|
| ccc|   3|   5|new_str_col|1024|4096|
| bbb|   4|   6|new_str_col|1024|4096|
+----+----+----+-----------+----+----+
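For the record, the spark-shell imports spark.implicits._ and org.apache.spark.sql.functions._ automatically, which is why lit(), expr() and the $"..." column syntax work directly above; in a compiled program you need the imports yourself. Here is a minimal self-contained sketch of the same withColumn calls (the object name and the local master are assumptions for illustration only):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{expr, lit}

object WithColumnDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WithColumnDemo")
      .master("local[*]") // assumption: run locally for the demo
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("aaa", 1, 2), ("bbb", 3, 4), ("ccc", 3, 5), ("bbb", 4, 6))
      .toDF("key1", "key2", "key3")

    val result = df
      .withColumn("key4", lit("new_str_col")) // constant string column
      .withColumn("key5", lit(1024))          // constant int column
      .withColumn("key6", $"key5" * 2)        // derived from another column
      .withColumn("key7", expr("key5 * 4"))   // same idea via a SQL expression

    result.show()
    spark.stop()
  }
}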
Dropping columns is simpler: just name the columns to drop.
/* Drop column key5 */
scala> val df_5 = df_4.drop("key5")
df_5: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 3 more fields]

scala> df_4.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)
 |-- key4: string (nullable = false)
 |-- key5: integer (nullable = false)
 |-- key6: integer (nullable = false)

scala> df_5.printSchema
root
 |-- key1: string (nullable = true)
 |-- key2: integer (nullable = false)
 |-- key3: integer (nullable = false)
 |-- key4: string (nullable = false)
 |-- key6: integer (nullable = false)

scala> df_5.show
+----+----+----+-----------+----+
|key1|key2|key3|       key4|key6|
+----+----+----+-----------+----+
| aaa|   1|   2|new_str_col|4096|
| bbb|   3|   4|new_str_col|4096|
| ccc|   3|   5|new_str_col|4096|
| bbb|   4|   6|new_str_col|4096|
+----+----+----+-----------+----+

/* Several columns can be dropped at once: key4 and key6 */
scala> val df_6 = df_5.drop("key4", "key6")
df_6: org.apache.spark.sql.DataFrame = [key1: string, key2: int ... 1 more field]

/* The columns method returns all column names as an array */
scala> df_6.columns
res23: Array[String] = Array(key1, key2, key3)

scala> df_6.show
+----+----+----+
|key1|key2|key3|
+----+----+----+
| aaa|   1|   2|
| bbb|   3|   4|
| ccc|   3|   5|
| bbb|   4|   6|
+----+----+----+
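Since drop("key4", "key6") is really the varargs overload drop(colNames: String*), a list of column names computed at runtime can be passed as well by expanding it with : _*. A small sketch (the toDrop value is hypothetical):

// assumption: df_5 is the DataFrame from the transcript above
val toDrop = Seq("key4", "key6")       // column names chosen at runtime
val df_trimmed = df_5.drop(toDrop: _*) // expand the Seq into varargs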
Finally, a few ways of handling null values and other invalid data.
This calls for a different dataset: a table containing nulls, loaded from a CSV file:
/* The CSV file natest.csv contains:
key1,key2,key3,key4,key5
aaa,1,2,t1,4
bbb,5,3,t2,8
ccc,2,2,,7
,7,3,t1,
bbb,1,5,t3,0
,4,,t1,8
*/
scala> val df = spark.read.option("header", "true").csv("natest.csv")
df: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
|null|   7|   3|  t1|null|
| bbb|   1|   5|  t3|   0|
|null|   4|null|  t1|   8|
+----+----+----+----+----+

/* Replace every null in column key1 with "xxx" */
scala> val df_2 = df.na.fill("xxx", Seq("key1"))
df_2: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df_2.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
| xxx|   7|   3|  t1|null|
| bbb|   1|   5|  t3|   0|
| xxx|   4|null|  t1|   8|
+----+----+----+----+----+

/* Fill several columns of the same type at once: replace every null in key3
   and key5 with "1024". CSV columns are imported as string by default; for
   integer columns the call looks the same -- fill() has an overload for each
   type. */
scala> val df_3 = df.na.fill("1024", Seq("key3", "key5"))
df_3: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df_3.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
|null|   7|   3|  t1|1024|
| bbb|   1|   5|  t3|   0|
|null|   4|1024|  t1|   8|
+----+----+----+----+----+

/* Fill several columns of different types at once by passing a Map. */
scala> val df_3 = df.na.fill(Map("key1" -> "yyy", "key3" -> "1024", "key4" -> "t88", "key5" -> "4096"))
df_3: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df_3.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2| t88|   7|
| yyy|   7|   3|  t1|4096|
| bbb|   1|   5|  t3|   0|
| yyy|   4|1024|  t1|   8|
+----+----+----+----+----+

/* Instead of filling, just filter out the rows that contain null. Here we
   drop the rows where key3 or key5 is null. */
scala> val df_4 = df.na.drop(Seq("key3", "key5"))
df_4: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df_4.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
| bbb|   1|   5|  t3|   0|
+----+----+----+----+----+

/* Drop the rows that have fewer than n non-null values among the given
   columns. Here we drop rows with fewer than 2 non-null values among key1,
   key2 and key3. In the last row, 2 of those 3 columns are null, so it is
   filtered out. */
scala> val df_5 = df.na.drop(2, Seq("key1", "key2", "key3"))
df_5: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
|null|   7|   3|  t1|null|
| bbb|   1|   5|  t3|   0|
|null|   4|null|  t1|   8|
+----+----+----+----+----+

scala> df_5.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
|null|   7|   3|  t1|null|
| bbb|   1|   5|  t3|   0|
+----+----+----+----+----+

/* Same as above; if no column list is given, all columns are used */
scala> val df_6 = df.na.drop(4)
df_6: org.apache.spark.sql.DataFrame = [key1: string, key2: string ... 3 more fields]

scala> df_6.show
+----+----+----+----+----+
|key1|key2|key3|key4|key5|
+----+----+----+----+----+
| aaa|   1|   2|  t1|   4|
| bbb|   5|   3|  t2|   8|
| ccc|   2|   2|null|   7|
| bbb|   1|   5|  t3|   0|
+----+----+----+----+----+
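If you don't want to create the CSV file by hand, the same table can be built inline and the na.fill / na.drop calls applied to it directly. A minimal sketch, assuming a spark-shell session (the explicit type annotation lets the null literals be typed as String):

// build the null-containing table without a CSV file
val rows: Seq[(String, String, String, String, String)] = Seq(
  ("aaa", "1", "2", "t1", "4"),
  ("bbb", "5", "3", "t2", "8"),
  ("ccc", "2", "2", null, "7"),
  (null, "7", "3", "t1", null),
  ("bbb", "1", "5", "t3", "0"),
  (null, "4", null, "t1", "8")
)
val df = rows.toDF("key1", "key2", "key3", "key4", "key5")

// same calls as in the transcript above
val filled  = df.na.fill("xxx", Seq("key1"))             // fill nulls in one column
val dropped = df.na.drop(2, Seq("key1", "key2", "key3")) // keep rows with >= 2 non-nulls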
OK, that's it for this part. Next time: the multi-table operations (joins).