Spark Java sortByKey Secondary Sort and the "Task not serializable" Exception
Compared with Scala, writing a secondary sort in Java is somewhat more verbose; for reference:
Spark Java secondary sort: http://blog.csdn.net/leen0304/article/details/78280282
Spark Scala secondary sort: http://blog.csdn.net/leen0304/article/details/78280282
Below, sortByKey is used to implement a secondary sort.
To illustrate, here is a simple example: the key consists of two parts, and we sort ascending on the first part of the key and descending on the second part:
```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class SecondarySortByKey implements Serializable {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SecondarySortByKey").setMaster("local");
        JavaSparkContext sc = new JavaSparkContext(conf);
        List<Tuple2<String, Integer>> list = Arrays.asList(
                new Tuple2<String, Integer>("A", 10),
                new Tuple2<String, Integer>("D", 20),
                new Tuple2<String, Integer>("D", 6),
                new Tuple2<String, Integer>("B", 6),
                new Tuple2<String, Integer>("C", 12),
                new Tuple2<String, Integer>("B", 2),
                new Tuple2<String, Integer>("A", 3));
        JavaRDD<Tuple2<String, Integer>> rdd1 = sc.parallelize(list);
        // Build a composite key "firstPart secondPart"; the value is a placeholder
        JavaPairRDD<String, Integer> pairRdd = rdd1.mapToPair(
                x -> new Tuple2<String, Integer>(x._1() + " " + x._2(), 1));
        // Custom comparator: first field ascending, second field descending
        Comparator<String> comparator = new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                String[] oo1 = o1.split(" ");
                String[] oo2 = o2.split(" ");
                if (oo1[0].equals(oo2[0])) {
                    return -Integer.valueOf(oo1[1]).compareTo(Integer.valueOf(oo2[1]));
                } else {
                    return oo1[0].compareTo(oo2[0]);
                }
            }
        };
        JavaPairRDD<String, Integer> res = pairRdd.sortByKey(comparator);
        res.foreach(x -> System.out.println(x._1()));
    }
}
```
The logic of the code above is fine, but running it fails with:
```
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: GCore.SecondarySortByKey$1
Serialization stack:
...
	at org.apache.spark.rdd.RDD.foreach(RDD.scala:916)
	at org.apache.spark.api.java.JavaRDDLike$class.foreach(JavaRDDLike.scala:351)
	at org.apache.spark.api.java.AbstractJavaRDDLike.foreach(JavaRDDLike.scala:45)
	at GCore.SecondarySortByKey.main(SecondarySortByKey.java:52)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at com.intellij.rt.execution.application.AppMain.main(AppMain.java:140)
Caused by: java.io.NotSerializableException: GCore.SecondarySortByKey$1
```
The gist of the exception: Task not serializable.
Looking at the sortByKey source:
```scala
def sortByKey(comp: Comparator[K], ascending: Boolean): JavaPairRDD[K, V] = {
  implicit val ordering = comp  // Allow implicit conversion of Comparator to Ordering.
  fromRDD(new OrderedRDDFunctions[K, V, (K, V)](rdd).sortByKey(ascending))
}
```
The class OrderedRDDFunctions works through an implicit ordering, and the line `implicit val ordering = comp` binds our comparator to it.
That ordering is the default sort rule, so the comp we wrote replaces the default ordering.
Nothing looks wrong yet, but note that OrderedRDDFunctions extends Logging with Serializable. Going back to the error message and scanning for "serializable", then back to the code to inspect the Comparator implementation, the problem becomes clear: the anonymous Comparator class does not extend Serializable. The fix is simply to supply a serializable comparator.
Concretely:
```java
public class SecondaryComparator implements Comparator<String>, Serializable {
    @Override
    public int compare(String o1, String o2) {
        String[] oo1 = o1.split(" ");
        String[] oo2 = o2.split(" ");
        if (oo1[0].equals(oo2[0])) {
            return -Integer.valueOf(oo1[1]).compareTo(Integer.valueOf(oo2[1]));
        } else {
            return oo1[0].compareTo(oo2[0]);
        }
    }
}
```

```java
JavaPairRDD<String, Integer> res = pairRdd.sortByKey(new SecondaryComparator());
```
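The difference between the two comparators can be demonstrated outside Spark with plain Java serialization. The following is a minimal sketch (the class name `SerializableCheck` and the intersection-type lambda are illustrative, not part of the original code): an anonymous Comparator cannot be written to an ObjectOutputStream, while a lambda cast to `Comparator<String> & Serializable` can.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Comparator;

public class SerializableCheck {
    // An anonymous Comparator: its generated class does not implement Serializable.
    static final Comparator<String> PLAIN = new Comparator<String>() {
        @Override
        public int compare(String a, String b) {
            return a.compareTo(b);
        }
    };

    // The same logic as a lambda cast to an intersection type, so the
    // generated class also implements Serializable.
    static final Comparator<String> SERIALIZABLE =
            (Comparator<String> & Serializable) (a, b) -> a.compareTo(b);

    // Returns true if the object survives Java serialization.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            // NotSerializableException is a subclass of IOException
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(PLAIN));        // false
        System.out.println(canSerialize(SERIALIZABLE)); // true
    }
}
```

The intersection cast is an alternative to writing a named class like SecondaryComparator: it keeps the comparator inline while still making it serializable, as long as the lambda captures only serializable values.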
The printed result:
```
A 10
A 3
B 6
B 2
C 12
D 20
D 6
```
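The ordering itself can also be verified locally, without a Spark cluster, by applying the same comparator logic to plain strings. A minimal sketch (the class name `LocalSortCheck` is illustrative; the compare logic is the same as SecondaryComparator above):

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class LocalSortCheck {
    // Same rule as SecondaryComparator: first field ascending, second field descending.
    static class SecondaryComparator implements Comparator<String>, Serializable {
        @Override
        public int compare(String o1, String o2) {
            String[] a = o1.split(" ");
            String[] b = o2.split(" ");
            if (a[0].equals(b[0])) {
                // Descending on the numeric second field
                return Integer.valueOf(b[1]).compareTo(Integer.valueOf(a[1]));
            }
            return a[0].compareTo(b[0]);
        }
    }

    public static void main(String[] args) {
        // The same composite keys the Spark job builds with mapToPair
        List<String> keys = new ArrayList<>(Arrays.asList(
                "A 10", "D 20", "D 6", "B 6", "C 12", "B 2", "A 3"));
        keys.sort(new SecondaryComparator());
        keys.forEach(System.out::println);
    }
}
```

Sorting this list reproduces the output shown above, which confirms that the comparator, not sortByKey itself, defines the secondary-sort order.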
For a deeper analysis of the Spark "Task not serializable" problem, see: http://blog.csdn.net/javastart/article/details/51206715