Running Spark Programs

Source: Internet | Published by: 超人软件官网 | Editor: 程序博客网 | Date: 2024/05/16 17:18

1. Running Spark Programs

1.1. Running your first Spark program

/usr/local/app/spark-2.1.0-bin-hadoop2.6/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://mini1:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
/usr/local/app/spark-2.1.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.1.0.jar \
100

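This example estimates π with the Monte Carlo method. The trailing argument 100 is the number of partitions (slices) the sampling work is split into.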
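To make the Monte Carlo idea concrete, here is a minimal sketch in plain Scala with no Spark involved. It is not the SparkPi source; the object name MonteCarloPi, the sample count, and the seed are assumptions chosen for illustration. Points are drawn uniformly in the square [-1, 1] x [-1, 1], and the fraction that lands inside the unit circle approximates π/4.

```scala
import scala.util.Random

// Hypothetical illustration only: a single-machine version of the idea behind SparkPi.
object MonteCarloPi {
  // Draw `samples` random points in the square [-1, 1] x [-1, 1] and
  // count how many fall inside the unit circle.
  def estimate(samples: Int, seed: Long): Double = {
    val rng = new Random(seed)
    val inside = (1 to samples).count { _ =>
      val x = rng.nextDouble() * 2 - 1
      val y = rng.nextDouble() * 2 - 1
      x * x + y * y <= 1
    }
    // area(circle) / area(square) = pi / 4
    4.0 * inside / samples
  }

  def main(args: Array[String]): Unit = {
    println(f"Pi is roughly ${estimate(1000000, seed = 42L)}%.4f")
  }
}
```

On a cluster the same counting is split across partitions and the per-partition counts are summed, which is why more partitions and more samples give a closer estimate.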

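1.2. Starting the Spark Shell

spark-shell is the interactive shell that ships with Spark. It makes interactive programming easy: you can write Spark programs in Scala directly at its prompt.

1.2.1. Starting spark shell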

/usr/local/app/spark-2.1.0-bin-hadoop2.6/bin/spark-shell \
--master spark://mini1:7077,mini2:7077,mini3:7077 \
--executor-memory 812m \
--total-executor-cores 2

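Parameters:

--master spark://mini1:7077,mini2:7077,mini3:7077 specifies the Master address; when there are several Masters (an HA setup), list them separated by commas

--executor-memory 812m sets the memory available to each executor (812 MB here)

--total-executor-cores 2 limits the application to 2 CPU cores in total across the cluster

Note:

If you start spark shell without specifying a master address, it still starts and runs programs normally, but it is actually running in Spark's local mode: everything runs in a single process on the local machine, with no connection to the cluster.

Spark Shell has already created a SparkContext and bound it to the variable sc. User code that needs a SparkContext can use sc directly.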
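One way to see which mode a context is running in is to inspect the SparkContext itself. The sketch below is a standalone program, not a spark-shell session. It assumes the Spark 2.1.0 dependencies from this tutorial are on the classpath, and the object name LocalModeCheck and the master value local[2] are illustrative choices. Inside spark-shell, the same two properties can be read from the predefined sc.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical illustration: create a local-mode context explicitly and inspect it.
object LocalModeCheck {
  def main(args: Array[String]): Unit = {
    // local[2] runs Spark inside this JVM with 2 worker threads; no cluster is contacted.
    val conf = new SparkConf().setAppName("local-mode-check").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(s"master  = ${sc.master}")   // the master URL this context was started with
      println(s"isLocal = ${sc.isLocal}")  // true when running in local mode
      // A tiny job to confirm the context works.
      println(sc.parallelize(1 to 10).reduce(_ + _))
    } finally {
      sc.stop()
    }
  }
}
```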

1.2.2. Writing a WordCount program in spark shell

1. Start HDFS first

2. Upload a text file to HDFS under hdfs://mini1:9000/wordcount/in/

3. Write the Spark program in Scala in spark shell

sc.textFile("hdfs://mini1:9000/wordcount/in/*").flatMap(_.split(" ")).
  map((_, 1)).reduceByKey(_ + _).saveAsTextFile("hdfs://mini1:9000/wordcount/out")

4. Check the result with the hdfs command

hdfs dfs -ls hdfs://mini1:9000/wordcount/out

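Explanation:

sc is the SparkContext object, the entry point for submitting a Spark program

textFile("hdfs://mini1:9000/wordcount/in/*") reads the data from HDFS

flatMap(_.split(" ")) maps each line to its words, then flattens the result

map((_, 1)) pairs each word with 1 to form a tuple

reduceByKey(_ + _) reduces by key, summing the values

saveAsTextFile("hdfs://mini1:9000/wordcount/out") writes the result to HDFS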
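The stages explained above can be traced on a small in-memory sample using ordinary Scala collections. This is only a sketch of the data flow: the sample lines and the object name WordCountStages are made up, and groupBy followed by a per-key sum stands in for Spark's reduceByKey, which plain collections do not have.

```scala
// Hypothetical illustration: the WordCount pipeline on a local Seq instead of an RDD.
object WordCountStages {
  def main(args: Array[String]): Unit = {
    val lines = Seq("hello you", "hello me", "hello world")

    // flatMap: split every line into words, then flatten into one sequence
    val words = lines.flatMap(_.split(" "))
    println(words)   // List(hello, you, hello, me, hello, world)

    // map: pair every word with the count 1
    val pairs = words.map((_, 1))
    println(pairs)   // List((hello,1), (you,1), (hello,1), (me,1), (hello,1), (world,1))

    // Stand-in for reduceByKey(_ + _): group by word, then sum the 1s per word
    val counts = pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }
    println(counts.toList.sortBy(_._1))   // List((hello,3), (me,1), (world,1), (you,1))
  }
}
```

In Spark the same steps run per partition, and reduceByKey combines the partial sums across partitions before the result is saved.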

1.2.3. Testing automatic master HA failover with spark shell

Check the Spark cluster status

Start spark shell

/usr/local/app/spark-2.1.0-bin-hadoop2.6/bin/spark-shell \
--master spark://mini1:7077,mini2:7077,mini3:7077 \
--executor-memory 812m \
--total-executor-cores 2

val rdd=sc.makeRDD(Array("hello you","hello me","hello world","hehe haha"))

rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at makeRDD at <console>:24

rdd.collect

res2: Array[String] = Array(hello you, hello me, hello world, hehe haha)

Stop the Master on mini1

sbin/stop-master.sh

spark-shell prints reconnection messages

After about 30 seconds, mini3 takes over as the alive master

Running the job again still returns the result

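1.3. Writing the WordCount program in IDEA

spark shell is mostly used for testing and validating programs. In production, you normally write the program in an IDE, package it as a jar, and submit it to the cluster. The most common setup is a Maven project, with Maven managing the jar dependencies.

1.3.1. Creating the project

1. Create a new project

2. Select a Maven project, then click Next

3. Fill in the Maven GAV (groupId, artifactId, version), then click Next

4. Enter the project name, then click Finish

5. Once the Maven project has been created, click Enable Auto-Import

6. Configure Maven's pom.xml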

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.edu360</groupId>
    <artifactId>spark-demo</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.11.8</scala.version>
        <scala.compat.version>2.11</scala.compat.version>
        <spark.version>2.1.0</spark.version>
        <hadoop.version>2.6.0</hadoop.version>
        <encoding>UTF-8</encoding>
        <akka.version>2.4.16</akka.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>com.typesafe.akka</groupId>
            <artifactId>akka-actor_2.11</artifactId>
            <version>${akka.version}</version>
        </dependency>

        <dependency>
            <groupId>com.typesafe.akka</groupId>
            <artifactId>akka-remote_2.11</artifactId>
            <version>${akka.version}</version>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.2</version>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.5.1</version>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>

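7. Create the Scala source directory

Wait a moment for IDEA to finish importing the project

1.3.2. Writing the code

8. Create a new Scala class and choose Object as its kind

9. Write the Spark program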

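package com.edu360

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // args(0) is the input path and args(1) the output path; the master is supplied by spark-submit
    val sc: SparkContext = new SparkContext(new SparkConf().setAppName("wordcount"))
    sc.textFile(args(0))
      .flatMap(_.split("\\s+")) // split each line on whitespace
      .map((_, 1))              // pair each word with a count of 1
      .reduceByKey(_ + _)       // sum the counts per word
      .saveAsTextFile(args(1))
    sc.stop()
  }
}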
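The program above sets no master, so it expects spark-submit to supply one. To try it directly from IDEA before packaging, one option is a variant that falls back to local mode only when no master has been configured. The sketch below is an assumption about how you might do this, not part of the original tutorial. The object name WordCountLocal, the usage message, and the empty-string filter are illustrative additions.

```scala
package com.edu360

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical variant of WordCount for running inside the IDE.
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    require(args.length == 2, "usage: WordCountLocal <input path> <output path>")

    val conf = new SparkConf()
      .setAppName("wordcount-local")
      // Applies only when spark.master is not already set, so a master passed
      // through spark-submit still takes precedence.
      .setIfMissing("spark.master", "local[*]")

    val sc = new SparkContext(conf)
    try {
      sc.textFile(args(0))
        .flatMap(_.split("\\s+"))
        .filter(_.nonEmpty)      // drop empty strings produced by leading whitespace
        .map((_, 1))
        .reduceByKey(_ + _)
        .saveAsTextFile(args(1)) // fails if the output directory already exists
    } finally {
      sc.stop()
    }
  }
}
```

Using setIfMissing instead of setMaster avoids hard-coding local mode into a jar that is later submitted to the cluster.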

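1.3.3. Building the jar

10.1. Package with Maven

Double-click the package command

This creates a target folder containing two jars

The jar whose name starts with original contains only the classes you wrote, without dependencies

The jar without the original prefix contains all dependencies

10.2. Alternatively, build the jar directly in IDEA

Add the module to be packaged as a jar

After it is added, a compile output entry appears

Click make module to compile into the specified directory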

If this setting does not take effect, then … and that is enough

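1.3.4. Uploading the jar to the cluster

11. Choose the successfully built jar and upload it to one of the nodes in the Spark cluster

Because the Spark cluster already ships with the Spark dependencies, uploading the original jar is sufficient

12. Submit the application with spark-submit (mind the argument order: spark-submit options come before the jar, and application arguments come after it)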

/usr/local/app/spark-2.1.0-bin-hadoop2.6/bin/spark-submit \
--class com.edu360.WordCount \
--master spark://mini1:7077 \
--executor-memory 812m \
--total-executor-cores 2 \
/root/original-spark-demo-1.0-SNAPSHOT.jar \
hdfs://mini1:9000/wordcount/in/* \
hdfs://mini1:9000/wordcount/out

Output after submission

Check the result of the program

hdfs dfs -ls /wordcount/out

hdfs dfs -text /wordcount/out/*

1.3.5. spark-submit options

Usage: spark-submit [options] <app jar | python file> [app options]

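Option                        Meaning

--master MASTER_URL           spark://host:port, mesos://host:port, yarn, or local (yarn-cluster and yarn-client are deprecated in Spark 2.x in favor of yarn combined with --deploy-mode)

--deploy-mode DEPLOY_MODE     Where the Driver program runs: client or cluster

--class CLASS_NAME            Main class name, including the package

--name NAME                   Application name

--jars JARS                   Comma-separated list of third-party jars to put on the driver and executor classpaths

--py-files PY_FILES           Comma-separated list of .zip, .egg, and .py files to place on the PYTHONPATH of Python applications

--files FILES                 Comma-separated list of files to place in the working directory of each executor

--properties-file FILE        Path to a file of application properties; defaults to conf/spark-defaults.conf

--driver-memory MEM           Memory for the Driver program

--driver-java-options         Extra Java options to pass to the Driver

--driver-library-path         Library path for the Driver program

--driver-class-path           Class path for the Driver program

--executor-memory MEM         Memory per executor; defaults to 1G

--driver-cores NUM            Number of CPU cores used by the Driver; cluster deploy mode only (Spark standalone or YARN)

--supervise                   Restart the Driver if it fails; Spark standalone or Mesos with cluster deploy mode only

--total-executor-cores NUM    Total number of cores for all executors; Spark standalone and Spark on Mesos only

--executor-cores NUM          Cores per executor; defaults to 1 on YARN, or to all available cores on the worker in standalone mode; Spark standalone and Spark on YARN only

--queue QUEUE_NAME            YARN queue to submit the application to; defaults to the default queue; Spark on YARN only

--num-executors NUM           Number of executors to launch; defaults to 2; Spark on YARN only

--archives ARCHIVES           Comma-separated list of archives to extract into the working directory of each executor; Spark on YARN only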
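Several of these options correspond to Spark configuration properties that can also be set on a SparkConf. The sketch below shows that correspondence for the options used in this tutorial. It only builds and prints a configuration and does not connect to any cluster. The object name SubmitOptionsAsConf is an illustrative assumption, and the values are copied from the spark-submit command above.

```scala
import org.apache.spark.SparkConf

// Hypothetical illustration: spark-submit options expressed as configuration properties.
object SubmitOptionsAsConf {
  def main(args: Array[String]): Unit = {
    // loadDefaults = false: start from an empty configuration instead of
    // picking up spark.* Java system properties.
    val conf = new SparkConf(false)
      .setMaster("spark://mini1:7077")      // --master            -> spark.master
      .setAppName("wordcount")              // --name              -> spark.app.name
      .set("spark.executor.memory", "812m") // --executor-memory
      .set("spark.cores.max", "2")          // --total-executor-cores

    // Print every key=value pair held by this configuration.
    println(conf.toDebugString)
  }
}
```

Passing the values on the spark-submit command line keeps the jar independent of any one cluster, which is why the WordCount program in this tutorial sets only the application name in code.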

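1.3.6. How spark-submit works

For example: spark-submit --class cn.itcast.spark.WordCount

1. The main method of the org.apache.spark.deploy.SparkSubmit class is called

2. The class name cn.itcast.spark.WordCount is passed into the doRunMain method

3. A reference to the Class object is obtained via reflection: mainClass = Utils.classForName(childMainClass)

4. The main method of cn.itcast.spark.WordCount is invoked via reflection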
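Steps 3 and 4 can be sketched in plain Scala using standard Java reflection. This is an illustration of the technique, not Spark's actual code. Greeter is a made-up stand-in for the user's main class, and Class.forName is used here in place of Spark's internal Utils.classForName. Both objects are assumed to be top-level and in the default package, so that the class name is simply Greeter.

```scala
// Hypothetical stand-in for the user's application class.
object Greeter {
  def main(args: Array[String]): Unit =
    println("Greeter.main called with: " + args.mkString(", "))
}

// Hypothetical launcher: load a class by name and call its static main method.
object ReflectiveLaunch {
  def main(args: Array[String]): Unit = {
    // Step 3: resolve the Class object from its fully qualified name.
    val mainClass = Class.forName("Greeter")

    // Step 4: look up main(String[]) and invoke it; the receiver is null
    // because main is static, and the whole array is passed as one argument.
    val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
    mainMethod.invoke(null, Array("in", "out"))
  }
}
```

Because the class is resolved by name at run time, the launcher needs no compile-time knowledge of the application, which is what allows one spark-submit script to start any user jar.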

Download link for the full PDF document:

http://pan.baidu.com/s/1bp8G6NP

0 0