Creating, Debugging, and Running a Spark Project with Maven

Source: Internet  Editor: 程序博客网  Time: 2024/06/03 19:54

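There are two ways to set up a Spark project: as a plain Java project or as a Maven project.

Because Maven makes managing JAR dependencies easy, this post builds the Spark project with Maven.

Spark supports four languages: Scala, Java, Python, and R.

Scala runs on the JVM, so it requires a JDK.

It is also the language Spark itself is written in, and the official API documentation covers Scala best.

If you are free to choose, Scala is the best language for developing Spark programs.

Java and Python are also well supported, with complete API documentation in the official docs.

R has only been supported since Spark 1.4, so official resources are limited and online resources are scarce too.

Since I have been working in Java, I use Java here to get a runnable Spark example as quickly as possible.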

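Prerequisites:

1. Maven is installed.

2. Hadoop is installed.

3. Spark is installed.

Basic steps for building a Spark project with Maven:

1. Create a new Maven project.

2. Create a new class named JavaSparkPi.

3. Paste in the code from JavaSparkPi.java, which ships in the unpacked Spark distribution.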

package sparkTest;

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

import java.util.ArrayList;
import java.util.List;

/**
 * Computes an approximation to pi
 * Usage: JavaSparkPi [slices]
 */
public final class JavaSparkPi {

  public static void main(String[] args) throws Exception {
    SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi").setMaster("local");
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);

    int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
    int n = 100000 * slices;
    List<Integer> l = new ArrayList<Integer>(n);
    for (int i = 0; i < n; i++) {
      l.add(i);
    }

    JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);

    int count = dataSet.map(new Function<Integer, Integer>() {
      @Override
      public Integer call(Integer integer) {
        double x = Math.random() * 2 - 1;
        double y = Math.random() * 2 - 1;
        return (x * x + y * y < 1) ? 1 : 0;
      }
    }).reduce(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer integer, Integer integer2) {
        return integer + integer2;
      }
    });

    System.out.println("Pi is roughly " + 4.0 * count / n);

    jsc.stop();
  }
}
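To make it easier to see what the Spark job computes, here is a minimal sketch of the same Monte Carlo estimate in plain Java, with no Spark involved. The class name LocalPi, the method estimatePi, and the fixed seed are assumptions made for illustration; they are not part of the Spark example. The filter step plays the role of the map call above (is the random point inside the unit circle?), and count plays the role of reduce.

```java
import java.util.Random;
import java.util.stream.IntStream;

final class LocalPi {

  /** Estimates pi from n random points drawn in the square [-1, 1] x [-1, 1]. */
  static double estimatePi(int n, long seed) {
    Random random = new Random(seed);
    long inside = IntStream.range(0, n)
        .filter(i -> {
          double x = random.nextDouble() * 2 - 1;
          double y = random.nextDouble() * 2 - 1;
          // Keep the sample only if the point falls inside the unit circle.
          return x * x + y * y < 1;
        })
        .count();
    // The circle covers pi/4 of the square, so scale the hit ratio by 4.
    return 4.0 * inside / n;
  }

  public static void main(String[] args) {
    System.out.println("Pi is roughly " + estimatePi(200000, 42L));
  }
}
```

The Spark version does the same work, but splits the n samples across the requested number of slices.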

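4. Download the spark-core_2.10-1.6.1 JAR and its dependencies. In a Maven project this normally means declaring the dependency org.apache.spark:spark-core_2.10:1.6.1 in pom.xml, so that Maven fetches the JAR together with its transitive dependencies.

5. Import the JARs so that the compile errors in the source disappear.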
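If compile or runtime errors remain after step 5, one quick way to check whether the Spark JARs really are on the classpath is a small probe like the sketch below. The class name ClasspathCheck and the method isOnClasspath are hypothetical helpers; the class names being probed are the ones imported by JavaSparkPi.

```java
final class ClasspathCheck {

  /** Returns true if the named class can be located on the current classpath. */
  static boolean isOnClasspath(String className) {
    try {
      // initialize=false: only locate the class, do not run its static initializers.
      Class.forName(className, false, ClasspathCheck.class.getClassLoader());
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String[] required = {
      "org.apache.spark.SparkConf",
      "org.apache.spark.api.java.JavaSparkContext",
      "org.apache.spark.api.java.function.Function2"
    };
    for (String name : required) {
      System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
    }
  }
}
```

Run from inside the Maven project, every line should report "found"; a "MISSING" line suggests the dependency was not resolved.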

6. Run the program. It fails with the error: A master URL must be set in your configuration.

Fix: call setMaster("local") on the SparkConf. "local" means the program runs locally.

For example: SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi").setMaster("local");
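Hard-coding setMaster("local") is fine for a first run, but values set directly on SparkConf take precedence over what is passed to spark-submit, so the program would keep running locally even when submitted to a cluster. One possible alternative, sketched here under assumptions, is to read the master from the spark.master Java system property and fall back to "local" only when it is absent. The class MasterConfig and the method resolveMaster are hypothetical names, and the sketch uses only the JDK so it runs without Spark.

```java
final class MasterConfig {

  /** Hypothetical helper: master URL from -Dspark.master, or "local" if unset or blank. */
  static String resolveMaster() {
    String master = System.getProperty("spark.master");
    return (master == null || master.trim().isEmpty()) ? "local" : master.trim();
  }

  public static void main(String[] args) {
    // In the Spark program this value would be passed to SparkConf.setMaster(...).
    System.out.println("Using master: " + resolveMaster());
  }
}
```

In JavaSparkPi this would replace the literal, as in setMaster(MasterConfig.resolveMaster()), so that running from the IDE still defaults to local mode.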

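7. Run it again. This time the error is: sparkDriver failed.

Cause: the IP address and port are wrong.

Fix, step one: open spark-env.sh and add SPARK_MASTER_IP=127.0.0.1 and SPARK_LOCAL_IP=127.0.0.1. It is best to use the IP address rather than a name (localhost).

Step two: open the system hosts file with sudo gedit /etc/hosts and add the following.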

192.168.0.115 localhost peter-HP-ENVY-Notebook
255.255.255.255 broadcasthost
127.0.0.1    localhost localhost.localdomain localhost4 localhost4.localdomain4 peter-HP-ENVY-Notebook
::1              localhost localhost.localdomain localhost6 localhost6.localdomain6 peter-HP-ENVY-Notebook

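I have not dug into the exact cause yet. Roughly, the entries mean the following.

The first line maps the machine's IP address to the local machine and its machine name.

The second line is the LAN broadcast host address.

The third line lists the aliases for the local machine.

The fourth line is one I have not fully figured out. ::1 is the IPv6 loopback address, so it appears to be the IPv6 counterpart of the third line.

Note: 127.0.0.1 is the loopback address, another way of referring to the local machine. 192.168.0.115 is the machine's actual IP address on the LAN.

The sparkDriver failed error means the sparkDriver service could not be reached on its port (port 0). As far as I can tell, this was because hosts had no entry for the real IP address; previously it contained only 127.0.0.1 localhost.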
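To check this kind of hostname problem without starting Spark, a small JDK-only probe can show what the machine name resolves to and whether a socket can be opened on port 0 at that address. This is only a diagnostic sketch: the class BindCheck and the method canBind are hypothetical names, and it does not reproduce Spark's own startup logic.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

final class BindCheck {

  /** Returns true if a server socket can be opened on an ephemeral port (port 0) at the address. */
  static boolean canBind(InetAddress address) {
    try (ServerSocket socket = new ServerSocket(0, 1, address)) {
      // Port 0 asks the OS for any free port; a positive local port means the bind worked.
      return socket.getLocalPort() > 0;
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) throws Exception {
    // Resolves the machine name via /etc/hosts or DNS; throws UnknownHostException if it cannot.
    InetAddress local = InetAddress.getLocalHost();
    System.out.println("hostname:          " + local.getHostName());
    System.out.println("resolves to:       " + local.getHostAddress());
    System.out.println("can bind:          " + canBind(local));
    System.out.println("loopback can bind: " + canBind(InetAddress.getLoopbackAddress()));
  }
}
```

If the machine name resolves to an address that no local interface owns, "can bind" reports false while the loopback check still succeeds, which points back at the hosts file.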

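Further debugging:

1. After shutting down Hadoop's HDFS distributed file system with stop-dfs.sh, the sparkPi program still runs successfully.

Analysis: sparkPi does not read any files from HDFS, so it does not need HDFS.

The computation itself is done by Spark, so MapReduce is not needed.

Resource scheduling also works without YARN running. Hadoop programs themselves can run without YARN as well.

2. After stopping Spark's master and worker with stop-all.sh, the sparkPi program still runs fine.

Analysis: this example sets the local run mode, so no master or worker is needed.

If the master were set to a standalone cluster instead, the master and worker would have to be running.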
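The distinction between the two modes comes down to the form of the master URL. The sketch below classifies a few common forms; the class MasterMode and both methods are hypothetical helpers written for illustration, and the address 127.0.0.1:7077 is just an example value (7077 being the usual standalone master port).

```java
final class MasterMode {

  /** Hypothetical helper: true for local-mode URLs such as "local", "local[4]" or "local[*]". */
  static boolean isLocal(String master) {
    return master.equals("local") || master.matches("local\\[.+\\]");
  }

  /** Hypothetical helper: true if the URL points at a standalone master (spark://host:port). */
  static boolean isStandalone(String master) {
    return master.startsWith("spark://");
  }

  public static void main(String[] args) {
    String[] examples = {"local", "local[2]", "local[*]", "spark://127.0.0.1:7077"};
    for (String master : examples) {
      String needs = isLocal(master) ? "no master/worker daemons"
          : isStandalone(master) ? "a running master and worker"
          : "another cluster manager";
      System.out.println(master + " -> needs " + needs);
    }
  }
}
```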

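Additional notes:

1. Files I checked while investigating the sparkDriver failed error.

spark-defaults.conf.template: not configured. I still need to look into what this file does and when it should be set.

slaves.template: not configured. It determines the workers' IP addresses.

spark-env.sh: modified. It determines the master's IP address, along with many other parameters. My settings are as follows.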

# set spark environment
export JAVA_HOME=/usr/lib/jvm/java
export SCALA_HOME=/opt/scala
export SPARK_MASTER_IP=127.0.0.1
export SPARK_LOCAL_IP=127.0.0.1
export SPARK_WORKER_CORES=2
export SPARK_WORKER_MEMORY=1g
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop

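2. How to package the program as a JAR and run it on a Spark cluster. Here is a rough outline.

Use Maven or Eclipse to package the program as a JAR.

Run the JAR from the command line with spark-submit ×××.jar --params, paying attention to how the arguments are passed.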
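Whatever follows the JAR on the spark-submit command line arrives in the program's args array, which JavaSparkPi reads as an optional slice count. The sketch below shows one way that argument could be validated before use. The class SliceArgs and the method parseSlices are hypothetical names, and the behavior differs slightly from the original one-liner: it reads args[0] whenever at least one argument is present and rejects non-numeric or non-positive values.

```java
final class SliceArgs {

  static final int DEFAULT_SLICES = 2;

  /** Returns the slice count from args[0], or the default when no argument is given. */
  static int parseSlices(String[] args) {
    if (args.length == 0) {
      return DEFAULT_SLICES;
    }
    int slices;
    try {
      slices = Integer.parseInt(args[0].trim());
    } catch (NumberFormatException e) {
      throw new IllegalArgumentException("slices must be an integer, got: " + args[0], e);
    }
    if (slices <= 0) {
      throw new IllegalArgumentException("slices must be positive, got: " + slices);
    }
    return slices;
  }

  public static void main(String[] args) {
    System.out.println("slices = " + parseSlices(args));
  }
}
```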

Tip:

The hostname command prints the machine name.

0 0