Solving the Spark TopN problem with RDDs: group, sort, take the top N
Source: Internet  Editor: 程序博客网  Time: 2024/05/20 18:00
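I ran into this problem while working on the association-rule recommendation task, the first problem of the college student cloud computing skills competition.
The association-rule recommendation algorithm I used was FP-Growth.
I wanted to turn this: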
(2174 84329 114721 149, 150, 0.885480572597137)
(858 878 791 4662, 49, 0.8841607565011821)
(858 878 791 4662, 122, 0.9385342789598109)
(2174 84329 114721 149, 8, 0.8568507157464212)
(858 878 791 4662, 8, 0.9432624113475178)
(2174 84329 114721 149, 122, 0.9038854805725971)
into the following result, grouped by basket and sorted by ascending confidence within each group:
(858 878 791 4662, 49, 0.8841607565011821)
(858 878 791 4662, 122, 0.9385342789598109)
(858 878 791 4662, 8, 0.9432624113475178)
(2174 84329 114721 149, 8, 0.8568507157464212)
(2174 84329 114721 149, 150, 0.885480572597137)
(2174 84329 114721 149, 122, 0.9038854805725971)
The fields are: (item IDs in the user's shopping basket, item ID to recommend to the user, confidence)
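To make the target shape concrete, here is a minimal sketch of the same transformation on a local Scala collection, with no Spark involved. The object name `TargetShape` and the helper `groupedAndSorted` are made up for illustration; the rows are copied from the sample above. The order in which the groups come out of a `Map` is unspecified, so only the order inside each group is meaningful here.

```scala
object TargetShape {
  // (basket item IDs, recommended item ID, confidence), copied from the sample rows
  val rules: Seq[(String, Int, Double)] = Seq(
    ("2174 84329 114721 149", 150, 0.885480572597137),
    ("858 878 791 4662", 49, 0.8841607565011821),
    ("858 878 791 4662", 122, 0.9385342789598109),
    ("2174 84329 114721 149", 8, 0.8568507157464212),
    ("858 878 791 4662", 8, 0.9432624113475178),
    ("2174 84329 114721 149", 122, 0.9038854805725971)
  )

  // Group rows by basket, then sort each group's rows by ascending confidence.
  def groupedAndSorted(
      rows: Seq[(String, Int, Double)]): Map[String, Seq[(String, Int, Double)]] =
    rows.groupBy(_._1).map { case (basket, rs) => basket -> rs.sortBy(_._3) }

  def main(args: Array[String]): Unit =
    // Print the rows group by group (group order is not guaranteed).
    groupedAndSorted(rules).values.flatten.foreach(println)
}
```

The RDD version that follows does the same thing in a distributed setting and additionally truncates each group to its top N entries.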
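The workflow below is a solution I found online. It works well, so I am recording it here, with thanks to the original blogger: it got me past this last small hurdle and let me finish the first problem of the competition.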
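// Step 1 (run in spark-shell, where sc is the SparkContext):
// sample data as (group id, category, score) triples
val rdd1 = sc.parallelize(Seq(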
(0,"cat26",30.9), (0,"cat13",22.1), (0,"cat95",19.6), (0,"cat105",1.3),
(1,"cat67",28.5), (1,"cat4",26.8), (1,"cat13",12.6), (1,"cat23",5.3),
(2,"cat56",39.6), (2,"cat40",29.7), (2,"cat187",27.9), (2,"cat68",9.8),
(3,"cat8",35.6)))
// Step 2: key each row by group id and gather every (category, score) pair per group
val rdd2 = rdd1.map(x => (x._1,(x._2, x._3))).groupByKey()
/*
rdd2.collect
res9: Array[(Int, Iterable[(String, Double)])] = Array((0,CompactBuffer((cat26,30.9), (cat13,22.1), (cat95,19.6), (cat105,1.3))),
(1,CompactBuffer((cat67,28.5), (cat4,26.8), (cat13,12.6), (cat23,5.3))),
(2,CompactBuffer((cat56,39.6), (cat40,29.7), (cat187,27.9), (cat68,9.8))),
(3,CompactBuffer((cat8,35.6))))
*/
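One thing to keep in mind about the step above: `groupByKey` brings every value of a key together before anything is discarded. When groups are large and N is small, a common alternative is to keep only a running top-N list per key. The sketch below shows the two merge functions in plain Scala and simulates two partitions with `foldLeft`; the names `BoundedTopN`, `seqOp` and `combOp` are hypothetical, and this is not what the original solution does.

```scala
object BoundedTopN {
  type Item = (String, Double)

  // Fold one item into a partial result, keeping the n highest scores (ascending).
  def seqOp(n: Int): (List[Item], Item) => List[Item] =
    (acc, item) => (item :: acc).sortBy(_._2).takeRight(n)

  // Merge two partial results, e.g. coming from different partitions.
  def combOp(n: Int): (List[Item], List[Item]) => List[Item] =
    (a, b) => (a ++ b).sortBy(_._2).takeRight(n)

  def main(args: Array[String]): Unit = {
    val group0: List[Item] =
      List(("cat26", 30.9), ("cat13", 22.1), ("cat95", 19.6), ("cat105", 1.3))
    // Pretend the group is split across two partitions.
    val (p1, p2) = group0.splitAt(2)
    val left  = p1.foldLeft(List.empty[Item])(seqOp(3))
    val right = p2.foldLeft(List.empty[Item])(seqOp(3))
    println(combOp(3)(left, right))
  }
}
```

Assuming the keyed RDD from step 2 before grouping (that is, `rdd1.map(x => (x._1, (x._2, x._3)))`), these functions have the shapes that `aggregateByKey(List.empty[Item])(seqOp(3), combOp(3))` expects. Re-sorting on every insert is only reasonable for small N.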
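// Step 3: within each group, sort ascending by score and keep only the N highest
val N_value = 3 // not defined in the original snippet; 3 matches the output shown below
val rdd3 = rdd2.map( x => {
  val i2 = x._2.toBuffer      // copy the group into a mutable buffer
  val i2_2 = i2.sortBy(_._2)  // ascending by score
  // drop the lowest-scoring entries from the front so that N remain
  if (i2_2.length > N_value) i2_2.remove(0, (i2_2.length - N_value))
  (x._1, i2_2.toIterable)
})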
/*
rdd3.collect
res8: Array[(Int, Iterable[(String, Double)])] = Array((0,ArrayBuffer((cat95,19.6), (cat13,22.1), (cat26,30.9))),
(1,ArrayBuffer((cat13,12.6), (cat4,26.8), (cat67,28.5))),
(2,ArrayBuffer((cat187,27.9), (cat40,29.7), (cat56,39.6))),
(3,ArrayBuffer((cat8,35.6))))
*/
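The sort-then-truncate logic of this step can also be written without a mutable buffer. The following is a minimal sketch on plain Scala collections; `TopNPerGroup`, `topN` and `topNDesc` are names invented here, not part of the original solution.

```scala
object TopNPerGroup {
  // Keep the n highest-scoring pairs, ascending by score (same order as rdd3).
  def topN[A](items: Iterable[(A, Double)], n: Int): Seq[(A, Double)] =
    items.toSeq.sortBy(_._2).takeRight(n)

  // Variant: highest score first.
  def topNDesc[A](items: Iterable[(A, Double)], n: Int): Seq[(A, Double)] =
    items.toSeq.sortBy(-_._2).take(n)

  def main(args: Array[String]): Unit = {
    // Group 1 from the sample data
    val group1 = Seq(("cat67", 28.5), ("cat4", 26.8), ("cat13", 12.6), ("cat23", 5.3))
    println(topN(group1, 3))
    println(topNDesc(group1, 3))
  }
}
```

In the pipeline this would presumably be applied as `rdd2.mapValues(vs => TopNPerGroup.topN(vs, 3))`, which leaves the keys untouched and yields the same per-group contents as the buffer-based version.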
// Step 4: flatten each group back into (group id, category, score) triples
val rdd4 = rdd3.flatMap(x => {
  val y = x._2
  for (w <- y) yield (x._1, w._1, w._2)
})
/*
rdd4.collect
res3: Array[(Int, String, Double)] = Array((0,cat95,19.6), (0,cat13,22.1), (0,cat26,30.9),
(1,cat13,12.6), (1,cat4,26.8), (1,cat67,28.5),
(2,cat187,27.9), (2,cat40,29.7), (2,cat56,39.6),
(3,cat8,35.6))
*/