Running Spark from Eclipse (Standalone and YARN-Client)
Eclipse has a Hadoop plugin that lets you browse files on HDFS, create MapReduce projects, and run them via "Run On Hadoop". Can we likewise run a Spark program directly from Eclipse, submitting it to the cluster in YARN-client mode, or running it in Standalone mode?
The answer is yes. Below I walk through running Spark's word-count example from Eclipse. I am using Hadoop 2.6.2 and Spark 1.5.2.
1. Running in Standalone mode
1.1 Create an ordinary Java project; the code is below.
```java
/*
 * Adapted from Apache Spark's bundled example
 * examples/src/main/java/org/apache/spark/examples/JavaWordCount.java
 * (Apache License, Version 2.0).
 */
package com.frank.spark;

import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public final class JavaWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {

    if (args.length < 1) {
      System.err.println("Usage: JavaWordCount <file>");
      System.exit(1);
    }

    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
    // Point at the Spark master (the Hadoop master node's IP), port 7077.
    sparkConf.setMaster("spark://192.168.0.1:7077");
    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    // Path of the packaged project jar on the Windows machine.
    ctx.addJar("C:\\Users\\Frank\\sparkwordcount.jar");
    JavaRDD<String> lines = ctx.textFile(args[0], 1);

    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String s) {
        return Arrays.asList(SPACE.split(s));
      }
    });

    JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

    JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });

    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<?, ?> tuple : output) {
      System.out.println(tuple._1() + ": " + tuple._2());
    }
    ctx.stop();
  }
}
```
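The flatMap → mapToPair → reduceByKey pipeline above computes plain word counts. As an illustrative aside (not part of the original example), the same logic can be sanity-checked without any cluster using ordinary Java streams:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LocalWordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    // Mirrors the Spark pipeline: flatMap (split lines into words),
    // then mapToPair + reduceByKey (group identical words and count them).
    public static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(SPACE.split(line)))
                .collect(Collectors.groupingBy(Function.identity(),
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = count(Arrays.asList("a b a", "b c"));
        counts.forEach((w, n) -> System.out.println(w + ": " + n));
    }
}
```

This is handy for checking what output to expect in the Eclipse console before submitting to the cluster.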
The code is copied straight from examples/src/main/java/org/apache/spark/examples/JavaWordCount.java in the unpacked Spark distribution, with only two additions: the setMaster(...) call, which points at the Spark master (the Hadoop master node's IP, port 7077), and the ctx.addJar(...) call, which points at the packaged project jar on the Windows machine.
1.2 Add the Spark dependency spark-assembly-1.5.2-hadoop2.6.0.jar to the build path; it is in the lib directory of the unpacked Spark distribution.
1.3 Configure the HDFS path of the file to be counted
Run As->Run Configurations
On the Arguments tab, supply the path of the file to count, which the program reads via args[0] in textFile. The file must live on HDFS, so the IP in the path is again your Hadoop master's IP.
1.4 Now just Run the program. The counts appear in the Eclipse console, and you can also see the submitted application in the Spark web UI.
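The argument is a full HDFS URI: the host is the Hadoop master's IP and the port is the NameNode RPC port (commonly 9000 in tutorial setups). As a hypothetical illustration (the file path below is made up), a tiny check like this distinguishes a well-formed HDFS URI from a local path:

```java
import java.net.URI;

public class ArgCheck {
    // Sanity-check the Run Configurations argument: it should be an
    // hdfs:// URI with an explicit host and port.
    public static boolean looksLikeHdfsPath(String arg) {
        URI uri = URI.create(arg);
        return "hdfs".equals(uri.getScheme()) && uri.getPort() != -1;
    }

    public static void main(String[] args) {
        // e.g. the value typed on the Arguments tab:
        System.out.println(looksLikeHdfsPath("hdfs://192.168.0.1:9000/user/frank/input.txt"));
    }
}
```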
2. Running in YARN-client mode
2.1 The code first
```java
/*
 * Adapted from Apache Spark's bundled example
 * examples/src/main/java/org/apache/spark/examples/JavaWordCount.java
 * (Apache License, Version 2.0).
 */
package com.frank.spark;

import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public final class JavaWordCount {
  private static final Pattern SPACE = Pattern.compile(" ");

  public static void main(String[] args) throws Exception {

    // Use the cluster's username rather than the local Windows one.
    System.setProperty("HADOOP_USER_NAME", "hadoop");

    if (args.length < 1) {
      System.err.println("Usage: JavaWordCount <file>");
      System.exit(1);
    }

    SparkConf sparkConf = new SparkConf().setAppName("JavaWordCountByFrank01");
    // Submit in YARN-client mode.
    sparkConf.setMaster("yarn-client");
    sparkConf.set("spark.yarn.dist.files", "C:\\software\\workspace\\sparkwordcount\\src\\yarn-site.xml");
    // Reuse an assembly jar already uploaded to HDFS instead of re-uploading it on every run.
    sparkConf.set("spark.yarn.jar", "hdfs://192.168.0.1:9000/user/bigdatagfts/spark-assembly-1.5.2-hadoop2.6.0.jar");

    JavaSparkContext ctx = new JavaSparkContext(sparkConf);
    // Path of the packaged project jar on the Windows machine.
    ctx.addJar("C:\\Users\\Frank\\sparkwordcount.jar");
    JavaRDD<String> lines = ctx.textFile(args[0], 1);

    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      @Override
      public Iterable<String> call(String s) {
        return Arrays.asList(SPACE.split(s));
      }
    });

    JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);
      }
    });

    JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer i1, Integer i2) {
        return i1 + i2;
      }
    });

    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<?, ?> tuple : output) {
      System.out.println(tuple._1() + ": " + tuple._2());
    }
    ctx.stop();
  }
}
```
2.2 Walking through the code
The System.setProperty("HADOOP_USER_NAME", "hadoop") call is needed when your Windows username differs from the username on the cluster. My Windows username is Frank while the Hadoop cluster user is hadoop, hence this setting.
setMaster("yarn-client") selects YARN-client mode.
Without spark.yarn.jar, every run uploads spark-assembly-1.5.2-hadoop2.6.0.jar into that application's staging directory on HDFS, which costs several minutes. Instead, upload the assembly jar to a directory on HDFS once and point spark.yarn.jar at it, so it is never re-uploaded from Windows. See https://spark.apache.org/docs/1.5.2/running-on-yarn.html:
spark.yarn.jar :The location of the Spark jar file, in case overriding the default location is desired. By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to a jar on HDFS, for example, set this configuration to "hdfs:///some/path".
The ctx.addJar(...) call points at the packaged project jar on the Windows machine.
2.3 Project configuration
Place the three Hadoop configuration files under src, copied down from the Hadoop cluster (typically core-site.xml, hdfs-site.xml, and yarn-site.xml).
2.4 Configure the HDFS path of the file to be counted
As in section 1.3; the results again appear in the Eclipse console.