Learning about Spark partitions and sortBy
Goal
When I first started learning Spark, I had no intuitive picture of partitions, and sortBy did not produce the results I expected. The experiments below use small data sets to explore how Spark partitioning and sortBy actually behave.
SparkContext configuration
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName(getAppName).setMaster("local[4]")
val sc = new SparkContext(conf)
The master is set to local[4], which runs Spark in local mode with 4 worker threads to simulate a distributed environment.
Testing partitions
Printing by partition
val rdd = sc.parallelize(1 to 100)
rdd.mapPartitionsWithIndex((idx, iter) => {
  val items = iter.toList  // materialize first: mkString would consume the iterator
  println("partitionIndex" + idx + " " + items.mkString(","))
  items.iterator
}).collect()
The output is:
partitionIndex1 26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50
partitionIndex3 76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
partitionIndex0 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25
partitionIndex2 51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75
The output shows 4 partitions, and the data within each partition is ordered, matching expectations.
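The even 25-element split is no accident: parallelize cuts the input sequence into numSlices contiguous slices by index. A minimal sketch of that slicing logic (plain Scala, no Spark required; the boundary formula mirrors Spark's ParallelCollectionRDD, but this is an illustration, not Spark's actual code path):

```scala
// Sketch: cut a sequence of n elements into k contiguous slices,
// where slice i covers indices [i*n/k, (i+1)*n/k).
def slice[T](data: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  val n = data.length
  (0 until numSlices).map { i =>
    data.slice(i * n / numSlices, (i + 1) * n / numSlices)
  }
}

val parts = slice(1 to 100, 4)
parts.zipWithIndex.foreach { case (p, i) =>
  println(s"partitionIndex$i ${p.mkString(",")}")
}
// partition 0 gets 1..25, partition 1 gets 26..50, and so on
```

This is why partition 0 holds 1-25 and partition 3 holds 76-100 in the output above.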
Printing all the data directly
val rdd = sc.parallelize(1 to 100)
rdd.foreach(i => print(i + ","))
Output:
26,76,51,27,1,2,3,4,5,28,52,77,53,29,6,7,8,30,54,78,55,31,9,10,11,32,56,57,79,58,59,33,12,13,14,15,16,17,34,60,80,61,62,63,64,35,18,19,20,21,36,65,81,66,37,22,23,24,38,67,82,68,83,39,40,41,42,43,44,45,46,47,48,25,49,84,69,85,50,86,70,87,71,88,72,89,73,90,74,91,75,92,93,94,95,96,97,98,99,100,
From this output, the data within each partition is still ordered, but the output as a whole is not. The reason: foreach runs on the executors (here, the local worker threads), not on the driver, and the tasks execute concurrently, so their print output interleaves.
Output after adding collect
val rdd = sc.parallelize(1 to 100)
rdd.collect().foreach(i => print(i + ","))
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,
After collect, the data is gathered back to the driver and printed there in a single thread, so it comes out in order.
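A conceptual sketch of why collect restores the order (plain Scala, not Spark's actual fetch path): the driver concatenates the per-partition results strictly by partition index, regardless of which task finished first.

```scala
// Sketch: collect() assembles per-partition results in partition-index
// order on the driver, so a parallelized sequence comes back in its
// original order even though the tasks ran concurrently.
val partitions = Seq(1 to 25, 26 to 50, 51 to 75, 76 to 100)
val collected = partitions.flatten  // concatenate by partition index
println(collected.mkString(","))
```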
Testing sortBy
Shuffling the data within each partition and printing it
import scala.util.Random

val rdd = sc.parallelize(1 to 100)
val randomRdd = rdd.map(i => i + Random.nextInt(100)) // add a random offset
randomRdd.mapPartitionsWithIndex((idx, iter) => {
  val items = iter.toList  // materialize first: mkString would consume the iterator
  println("partitionIndex" + idx + " " + items.mkString(","))
  items.iterator
}).collect()
Output:
partitionIndex3 123,134,152,126,171,105,99,131,172,183,125,148,178,141,174,94,147,103,101,162,153,192,102,101,167
partitionIndex2 127,129,115,77,140,150,94,124,79,124,116,143,70,86,131,74,142,77,71,153,153,155,124,84,146
partitionIndex1 46,119,69,40,95,84,128,71,51,68,76,131,67,50,103,93,121,46,127,115,109,93,124,75,136
partitionIndex0 37,86,63,98,36,30,90,79,69,28,91,95,16,53,27,56,66,41,29,23,76,78,114,84,32
Adding a random number to each element makes the data within each partition unordered.
Sorting the data with sortBy
val rdd = sc.parallelize(1 to 100)
val randomRdd = rdd.map(i => i + Random.nextInt(100)) // add a random offset
randomRdd.sortBy(i => i, ascending = true).mapPartitionsWithIndex((idx, iter) => {
  val items = iter.toList  // materialize first: mkString would consume the iterator
  println("partitionIndex" + idx + " " + items.mkString(","))
  items.iterator
}).collect()
Output:
partitionIndex2 98,98,98,101,102,105,106,108,108,108,109,110,110,111,118,119,121,122,124,124,126,126
partitionIndex3 128,129,135,138,139,141,141,143,149,151,153,154,158,158,161,161,161,162,164,168,172,173,175,177,179
partitionIndex1 66,66,69,70,71,72,73,75,75,75,77,78,78,79,80,81,84,84,84,86,87,87,88,90,90,92,93,93,95,95,96,97
partitionIndex0 27,29,29,30,32,33,34,39,42,43,44,44,45,46,47,48,56,59,62,62,64
After sortBy, the data within each partition is sorted. Note also that the partitions themselves are range-ordered: every value in partition 0 is smaller than every value in partition 1, and so on (the partition indices merely print out of order).
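That range-ordering is the key property of sortBy: it first range-partitions the data by key and then sorts within each partition. A simplified sketch of the two phases (plain Scala; real Spark picks the partition boundaries by sampling the keys, whereas this sketch uses exact quantiles for clarity):

```scala
// Simplified sketch of sortBy's two phases:
// (1) assign each element to a range bucket using boundary keys,
// (2) sort each bucket independently.
// Spark chooses boundaries by sampling; exact quantiles are used here.
def rangeSort(data: Seq[Int], numPartitions: Int): Seq[Seq[Int]] = {
  val sortedSample = data.sorted
  // boundary i = lower bound (inclusive) of partition i+1
  val bounds = (1 until numPartitions).map(i => sortedSample(i * data.length / numPartitions))
  val buckets = data.groupBy(v => bounds.count(_ <= v)) // partition index = #bounds <= v
  (0 until numPartitions).map(i => buckets.getOrElse(i, Seq.empty).sorted)
}

val shuffled = scala.util.Random.shuffle((1 to 100).toList)
val parts = rangeSort(shuffled, 4)
// Each partition is internally sorted and the partitions are range-ordered,
// so concatenating them in index order yields the fully sorted sequence.
assert(parts.flatten == (1 to 100).toList)
```

Because of this property, reading the sorted RDD's partitions in index order (as collect does) produces a globally sorted sequence.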
Printing all the data directly
val rdd = sc.parallelize(1 to 100)
val randomRdd = rdd.map(i => i + Random.nextInt(100)) // add a random offset
randomRdd.sortBy(i => i, ascending = true).foreach(i => print(i + ","))
Output:
93,127,97,95,98,103,106,97,69,58,152,123,148,119,53,72,103,56,57,86,32,92,82,41,10,70,161,181,132,68,150,70,100,110,102,182,120,152,114,72,104,65,40,48,56,60,84,102,71,183,123,65,68,129,193,85,63,75,55,82,116,117,106,99,145,135,56,142,110,79,69,20,72,87,110,34,16,59,70,76,20,70,87,25,39,120,149,187,108,158,73,142,167,195,140,180,84,89,132,78,
The overall output is again unordered, for the reason explained earlier: foreach prints concurrently on the executors.
Output after adding collect
val rdd = sc.parallelize(1 to 100)
val randomRdd = rdd.map(i => i + Random.nextInt(100)) // add a random offset
randomRdd.sortBy(i => i, ascending = true).collect().foreach(i => print(i + ","))
Output:
2,8,13,13,25,29,32,33,34,37,39,43,46,51,52,53,54,59,59,60,60,62,63,64,64,68,70,70,73,74,77,79,80,84,84,86,87,87,89,90,91,92,92,94,95,96,97,97,98,99,100,100,104,105,105,105,105,108,109,110,111,112,113,113,115,116,116,117,118,120,121,122,125,129,130,132,133,134,135,138,138,144,147,148,149,152,154,154,155,159,161,164,170,171,177,183,184,185,186,192,
This time we get the fully sorted list.
Takeaways
- When learning and testing, collect matters: without it, the observed output may differ from what you expect.
- Verifying behavior on a small data set makes the underlying mechanics much easier to observe.