Submitting Code from Eclipse to Run on a Spark Cluster
Source: Internet | Editor: 程序博客网 | Date: 2024/05/29 10:12
Spark cluster master node: 192.168.168.200
Windows host running Eclipse: 192.168.168.100
Scenario:
We want to verify how code developed in Eclipse actually behaves on the Spark cluster: memory usage, cores, stdout, and whether variables are passed correctly.
In production, code developed in Eclipse is packaged into a jar, copied to the Spark cluster, and launched with spark-submit. Remote debugging is also an option, but either way every test run requires copying the jar over to a cluster machine, which is very inconvenient. Instead, we would like Eclipse itself to simulate spark-submit and submit the program directly, which makes debugging much easier.
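For reference, the production workflow described above corresponds to a spark-submit invocation roughly like the following sketch. The class name, master URL, and jar name come from this article; the resource flags and paths are illustrative assumptions, not values from the original post.

```shell
# Hedged sketch of the production submission this article is avoiding.
# --executor-memory and --total-executor-cores are example values.
spark-submit \
  --class com.spark.test.JavaWordCount \
  --master spark://192.168.168.200:7077 \
  --executor-memory 1g \
  --total-executor-cores 2 \
  TestSpark-0.0.1-jar-with-dependencies.jar
```

Every code change then means re-packaging and re-copying this jar, which is exactly the friction the Eclipse-based approach below removes.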
1. Prepare the words.txt file
words.txt:

```
Hello Hadoop
Hello BigData
Hello Spark
Hello Flume
Hello Kafka
```
Upload it to the HDFS file system, as shown in the figure:
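The upload can be done with the HDFS shell; a minimal sketch, assuming the HDFS client on your machine is configured to reach the cluster's NameNode at 192.168.168.200:9000 (the exact commands are not shown in the original post):

```shell
# Create the target directory and upload the file.
hdfs dfs -mkdir -p /test
hdfs dfs -put words.txt /test/words.txt
# Verify the upload by printing the file back.
hdfs dfs -cat /test/words.txt
```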
2. Create the Spark test class
```java
package com.spark.test;

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class JavaWordCount {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount").setMaster("local[2]");
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        JavaRDD<String> lines = jsc.textFile("hdfs://192.168.168.200:9000/test/words.txt");

        // Split each line into words.
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String line) {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });

        // Map each word to a (word, 1) pair.
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        // Sum the counts for each word.
        JavaPairRDD<String, Integer> wordCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        // Print each (word, count) pair to stdout.
        wordCount.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            public void call(Tuple2<String, Integer> pair) throws Exception {
                System.out.println(pair._1() + ":" + pair._2());
            }
        });

        jsc.close();
    }
}
```
Log output:
Open the Spark web UI at http://192.168.168.200:8080. It shows the Spark master URL: spark://master:7077.
Change

```java
SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount").setMaster("local[2]");
```

to:

```java
SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount").setMaster("spark://192.168.168.200:7077");
```
Running it now fails with an org.apache.spark.SparkException:
```
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, 192.168.168.200): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1441)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:811)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:811)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1667)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1622)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1611)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:632)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1873)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1886)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1899)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1913)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:894)
    at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:892)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
    at org.apache.spark.rdd.RDD.foreach(RDD.scala:892)
    at org.apache.spark.api.java.JavaRDDLike$class.foreach(JavaRDDLike.scala:350)
    at org.apache.spark.api.java.AbstractJavaRDDLike.foreach(JavaRDDLike.scala:45)
    at com.spark.test.JavaWordCount.main(JavaWordCount.java:39)
Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2083)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1261)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1996)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:71)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
    at org.apache.spark.scheduler.Task.run(Task.scala:86)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
```
The fix found online is to tell Spark where the application jar lives: executors on the cluster cannot deserialize our anonymous function classes unless they can load them. First package the program into a jar with mvn install (here, a jar-with-dependencies assembly), then register its path with setJars:

```java
SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount").setMaster("spark://192.168.168.200:7077");
String[] jars = {"I:\\TestSpark\\target\\TestSpark-0.0.1-jar-with-dependencies.jar"};
sparkConf.setJars(jars);
```
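The jar-with-dependencies suffix suggests the build uses the maven-assembly-plugin. The article does not show the pom.xml, so the following fragment is an assumption about a typical configuration that would produce that artifact:

```xml
<!-- Hypothetical pom.xml build fragment; not shown in the original post. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <!-- Bundles the application together with its dependencies. -->
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```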
The final source is:
```java
package com.spark.test;

import java.util.Arrays;
import java.util.Iterator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;

import scala.Tuple2;

public class JavaWordCount {
    public static void main(String[] args) {
        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount").setMaster("spark://192.168.168.200:7077");
        // Ship the assembled jar to the cluster so executors can load our classes.
        String[] jars = {"I:\\TestSpark\\target\\TestSpark-0.0.1-jar-with-dependencies.jar"};
        sparkConf.setJars(jars);
        JavaSparkContext jsc = new JavaSparkContext(sparkConf);

        JavaRDD<String> lines = jsc.textFile("hdfs://192.168.168.200:9000/test/words.txt");

        // Split each line into words.
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            public Iterator<String> call(String line) {
                return Arrays.asList(line.split(" ")).iterator();
            }
        });

        // Map each word to a (word, 1) pair.
        JavaPairRDD<String, Integer> pairs = words.mapToPair(new PairFunction<String, String, Integer>() {
            public Tuple2<String, Integer> call(String word) throws Exception {
                return new Tuple2<String, Integer>(word, 1);
            }
        });

        // Sum the counts for each word.
        JavaPairRDD<String, Integer> wordCount = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer v1, Integer v2) throws Exception {
                return v1 + v2;
            }
        });

        // Print each (word, count) pair to stdout on the executors.
        wordCount.foreach(new VoidFunction<Tuple2<String, Integer>>() {
            public void call(Tuple2<String, Integer> pair) throws Exception {
                System.out.println(pair._1() + ":" + pair._2());
            }
        });

        jsc.close();
    }
}
```
This time the job runs without errors.
Check stdout to verify the counts are correct:
With that, you can conveniently develop and debug your code right from Eclipse!