Spark DataFrame/Dataset: how to get reduceByKey-style behavior
```scala
case class Record(ts: Long, id: Int, value: Int)
```

With an RDD, we often use `reduceByKey` to keep, for each key, the record with the latest timestamp:

```scala
def findLatest(records: RDD[Record])(implicit spark: SparkSession) = {
  records.keyBy(_.id).reduceByKey {
    (x, y) => if (x.ts > y.ts) x else y
  }.values
}
```

With a Dataset/DataFrame, the same result can be obtained like this (note that `value` must be spelled out as a column symbol, since `val` is a reserved word in Scala):

```scala
import org.apache.spark.sql.functions._

val newDF = df.groupBy('id)
  .agg(max(struct('ts, 'value)) as 'tmp)
  .select($"id", $"tmp.*")
```

Why does this work? For a struct (or tuple) type, `max` compares field by field from left to right: it orders primarily by the first field, falling through to later fields only on ties.

A more detailed example:

```scala
import org.apache.spark.sql.functions._

val data = Seq(
  ("michael", 1, "event 1"),
  ("michael", 2, "event 2"),
  ("reynold", 1, "event 3"),
  ("reynold", 3, "event 4")
).toDF("user", "time", "event")

val newestEventPerUser = data
  .groupBy('user)
  .agg(max(struct('time, 'event)) as 'event)
  .select($"user", $"event.*") // Unnest the struct into top-level columns.
```

```
scala> newestEventPerUser.show()
+-------+----+-------+
|   user|time|  event|
+-------+----+-------+
|reynold|   3|event 4|
|michael|   2|event 2|
+-------+----+-------+
```

For more complex per-group selection logic, use the typed API with `groupByKey` and `reduceGroups`:

```scala
case class AggregateResultModel(
  id: String,
  mtype: String,
  healthScore: Int,
  mortality: Float,
  reimbursement: Float)

// Assume rawScores has been loaded beforehand from JSON/CSV files.
val groupedResultSet = rawScores.as[AggregateResultModel]
  .groupByKey(item => (item.id, item.mtype))
  .reduceGroups((x, y) => getMinHealthScore(x, y))
  .map(_._2)

// The binary function used in reduceGroups.
def getMinHealthScore(x: AggregateResultModel, y: AggregateResultModel): AggregateResultModel = {
  // Complex logic for deciding which row to keep:
  // lowest healthScore wins; ties broken by highest mortality,
  // then by lowest reimbursement.
  if (x.healthScore > y.healthScore) y
  else if (x.healthScore < y.healthScore) x
  else if (x.mortality < y.mortality) y
  else if (x.mortality > y.mortality) x
  else if (x.reimbursement < y.reimbursement) x
  else y
}
```
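The left-to-right comparison that makes `max(struct(...))` work can be illustrated without Spark at all, using plain Scala tuples (a minimal sketch; the `events` data here is made up for illustration):

```scala
// Scala tuples come with a lexicographic Ordering: elements are compared
// left to right, and later elements only matter on ties. Spark SQL's max()
// applies the same field-by-field rule to a struct, which is why putting
// the timestamp first in struct('ts, 'value) selects the latest record.
val events = Seq((1, "event 1"), (3, "event 4"), (2, "event 2"))

// max uses the tuple Ordering: the largest first element wins,
// and the rest of the tuple rides along with it.
val newest = events.max
println(newest) // (3,event 4)
```

This is also why the order of fields inside the struct matters: the column you want to maximize must come first.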
ref:https://stackoverflow.com/questions/41236804/spark-dataframes-reducing-by-key