Executing Spark Programs


1. Executing Spark Programs

1.1. Running the first Spark program

/usr/local/app/spark-2.1.0-bin-hadoop2.6/bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master spark://mini1:7077 \
--executor-memory 1G \
--total-executor-cores 2 \
/usr/local/app/spark-2.1.0-bin-hadoop2.6/examples/jars/spark-examples_2.11-2.1.0.jar \
100

 

This example estimates Pi using the Monte Carlo method; the sketch below illustrates the idea.
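A minimal sketch of the Monte Carlo estimate (not the actual SparkPi source; it assumes an existing SparkContext sc, for example inside spark-shell):

// Estimate Pi by sampling random points in the square [-1, 1] x [-1, 1]
// and counting how many fall inside the unit circle.
val slices = 100                              // corresponds to the "100" argument above
val n = 100000L * slices                      // total number of random sample points
val count = sc.parallelize(1L to n, slices).map { _ =>
  val x = math.random * 2 - 1
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0            // 1 if the point falls inside the circle
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * count / n}")  // circle/square area ratio is Pi/4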

1.2. Starting the Spark Shell

spark-shell is the interactive shell that ships with Spark. It is convenient for interactive programming: you can write Spark programs in Scala directly at this command line.

1.2.1. Starting spark-shell

/usr/local/app/spark-2.1.0-bin-hadoop2.6/bin/spark-shell \
--master spark://mini1:7077,mini2:7077,mini3:7077 \
--executor-memory 812m \
--total-executor-cores 2

 

 

Parameter notes:

--master spark://mini1:7077    specifies the Master address; multiple machines can be listed, separated by commas
--executor-memory 2g           specifies the memory available to each executor, here 2 GB
--total-executor-cores 2       specifies the total number of CPU cores the application uses across the cluster, here 2

 

Note:

If spark-shell is started without a master address, it still starts and can run programs normally. In that case Spark runs in local mode: only a single process is started on the local machine and no connection to the cluster is made.
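A quick way to confirm which mode the shell is actually in is to inspect sc.master inside spark-shell (the outputs in the comments are illustrative):

sc.master   // "local[*]" when no --master was given (local mode)
            // "spark://mini1:7077,mini2:7077,mini3:7077" when connected to the cluster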

 

The Spark shell initializes a SparkContext instance as the object sc by default. User code that needs a SparkContext can simply use sc directly, as in the short example below.
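A quick sanity check that sc is ready to use (the results in the comments are what the REPL would typically print):

val nums = sc.parallelize(1 to 100)    // distribute a local collection as an RDD
nums.sum                               // res0: Double = 5050.0
nums.filter(_ % 2 == 0).count          // res1: Long = 50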

 

 

1.2.2. Writing a WordCount program in spark-shell

1. First start HDFS
2. Upload a file to HDFS, e.g. hdfs://node1.edu360.cn:9000/words.txt
3. Write the Spark program in Scala inside spark-shell

sc.textFile("hdfs://mini1:9000/wordcount/in/*").flatMap(_.split(" "))
  .map((_, 1)).reduceByKey(_ + _).saveAsTextFile("hdfs://mini1:9000/wordcount/out")

 

4. Use HDFS commands to inspect the result

hdfs dfs -ls hdfs://mini1:9000/wordcount/out

 

 

Notes:

sc is the SparkContext object, the entry point for submitting Spark programs
textFile("hdfs://mini1:9000/wordcount/in/*") reads the data from HDFS
flatMap(_.split(" ")) maps each line and then flattens the result
map((_, 1)) turns each word into a (word, 1) tuple
reduceByKey(_ + _) reduces by key, summing the values
saveAsTextFile("hdfs://mini1:9000/wordcount/out") writes the result back to HDFS
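To see what each transformation produces, the same chain can be run on a small in-memory collection instead of HDFS (a sketch; the expected contents are shown in the comments):

val lines  = sc.parallelize(Seq("hello you", "hello me"))
val words  = lines.flatMap(_.split(" "))   // hello, you, hello, me
val pairs  = words.map((_, 1))             // (hello,1), (you,1), (hello,1), (me,1)
val counts = pairs.reduceByKey(_ + _)      // (hello,2), (you,1), (me,1)
counts.collect.foreach(println)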

 

1.2.3. Testing automatic Master HA failover from spark-shell

 

Check the current status of the Spark cluster (in the Master web UI).

 

 

 

 

Start spark-shell:

/usr/local/app/spark-2.1.0-bin-hadoop2.6/bin/spark-shell \
--master spark://mini1:7077,mini2:7077,mini3:7077 \
--executor-memory 812m \
--total-executor-cores 2

val rdd = sc.makeRDD(Array("hello you","hello me","hello world","hehe haha"))
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[2] at makeRDD at <console>:24

rdd.collect
res2: Array[String] = Array(hello you, hello me, hello world, hehe haha)

 

Stop the Master process on mini1:

sbin/stop-master.sh

 

spark-shell prints reconnection messages: it has lost the current master and is trying to reconnect.


 

After roughly 30 seconds, mini3 is switched to the ALIVE master.

 

Jobs continue to run and still return results, as in the example below.
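For instance, running another action on the same RDD after the failover still succeeds (illustrative output; the ordering of the pairs may differ):

rdd.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect
// res3: Array[(String, Int)] = Array((hello,3), (you,1), (me,1), (world,1), (hehe,1), (haha,1))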

 

 

1.3. Writing the WordCount program in IDEA

spark-shell is used mostly for testing and verifying code. In production you usually write the program in an IDE, package it into a jar, and submit it to the cluster. The most common setup is to create a Maven project and let Maven manage the jar dependencies.

1.3.1. Creating the project

1. Create a new project

2. Select a Maven project, then click Next

3. Fill in the Maven GAV (groupId, artifactId, version), then click Next

4. Fill in the project name, then click Finish

5. Once the Maven project has been created, click Enable Auto-Import

6. Configure the Maven pom.xml

<?xml version="1.0"encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"
>
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.edu360</groupId>
    <artifactId>spark-demo</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <scala.version>2.11.8</scala.version>
        <scala.compat.version>2.11</scala.compat.version>
        <spark.version>2.1.0</spark.version>
        <hadoop.version>2.6.0</hadoop.version>
        <encoding>UTF-8</encoding>
        <akka.version>2.4.16</akka.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>com.typesafe.akka</groupId>
            <artifactId>akka-actor_2.11</artifactId>
            <version>${akka.version}</version>
        </dependency>

        <dependency>
            <groupId>com.typesafe.akka</groupId>
            <artifactId>akka-remote_2.11</artifactId>
            <version>${akka.version}</version>
        </dependency>
    </dependencies>

    <build>
        <pluginManagement>
            <plugins>
                <plugin>
                    <groupId>net.alchim31.maven</groupId>
                    <artifactId>scala-maven-plugin</artifactId>
                    <version>3.2.2</version>
                </plugin>
                <plugin>
                    <groupId>org.apache.maven.plugins</groupId>
                    <artifactId>maven-compiler-plugin</artifactId>
                    <version>3.5.1</version>
                </plugin>
            </plugins>
        </pluginManagement>
        <plugins>
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                    <execution>
                        <id>scala-test-compile</id>
                        <phase>process-test-resources</phase>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.4.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>

 

 

7. Create the scala source directory (e.g. src/main/scala) and mark it as a source root, then wait a moment for the import to finish.

 

1.3.2. Writing the code

8. Create a new Scala class and choose the kind Object

 

 

 

 

9. Write the Spark program

package com.edu360

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc: SparkContext = new SparkContext(new SparkConf().setAppName("wordcount"))
    sc.textFile(args(0))
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .saveAsTextFile(args(1))
  }
}
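For quick testing inside the IDE before packaging, a common variation is to set the master to local[*] and print the result instead of writing to HDFS (a hypothetical variant, not part of the original listing; the input path is an assumption):

package com.edu360

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local-mode variant for debugging inside the IDE.
object WordCountLocal {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("wordcount-local").setMaster("local[*]")
    val sc = new SparkContext(conf)
    sc.textFile("src/main/resources/words.txt")  // assumed local input file
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .reduceByKey(_ + _)
      .collect()
      .foreach(println)
    sc.stop()
  }
}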

 

1.3.3. Building the jar

10.1. Package with Maven

Double-click the package goal in the Maven panel.

This produces a target folder containing two built jars:

the jar whose name starts with original- contains only your own classes, without any dependencies
the other jar bundles all of the dependencies (it is produced by the shade plugin)

 

 

10.2. You can also build the jar directly with IDEA

Add the module that should be packaged into the jar (in IDEA's artifact settings).

After it has been added, a compile output entry appears.

Click Make Module to compile the jar into the specified output directory.

If the setting does not take effect, re-apply the artifact configuration and rebuild the module.

 

 

 

1.3.4. Uploading the jar to the cluster

11. Take the successfully built jar and upload it to one of the nodes of the Spark cluster.

Because the Spark installation on the cluster already provides the Spark dependencies, it is enough to upload the original- (dependency-free) jar.

 

 

12. Submit the Spark application with the spark-submit command (note the order of the arguments):

/usr/local/app/spark-2.1.0-bin-hadoop2.6/bin/spark-submit \
--class com.edu360.WordCount \
--master spark://mini1:7077 \
--executor-memory 812m \
--total-executor-cores 2 \
/root/original-spark-demo-1.0-SNAPSHOT.jar \
hdfs://mini1:9000/wordcount/in/* \
hdfs://mini1:9000/wordcount/out

 

After submitting, the console prints the job's progress information.

 

 

 

Check the program's output:

hdfs dfs -ls /wordcount/out
hdfs dfs -text /wordcount/out/*

 

 

1.3.5. spark-submit parameter reference

Usage: spark-submit [options] <app jar | python file> [app options]

Parameter                      Meaning
--master MASTER_URL            spark://host:port, mesos://host:port, yarn, yarn-cluster, yarn-client, or local
--deploy-mode DEPLOY_MODE      where the Driver runs: client or cluster
--class CLASS_NAME             name of the main class, including the package
--name NAME                    application name
--jars JARS                    third-party jars the Driver depends on
--py-files PY_FILES            comma-separated list of .zip, .egg and .py files to place on the PYTHONPATH of the Python application
--files FILES                  comma-separated list of files to place in the working directory of every executor
--properties-file FILE         path of the file holding application properties, default conf/spark-defaults.conf
--driver-memory MEM            amount of memory used by the Driver
--driver-java-options          extra Java options passed to the Driver
--driver-library-path          library path for the Driver
--driver-class-path            class path for the Driver
--executor-memory MEM          memory per executor, default 1G
--driver-cores NUM             number of CPU cores used by the Driver (Spark standalone mode only)
--supervise                    whether to restart the Driver after a failure (Spark standalone mode only)
--total-executor-cores NUM     total number of cores used by all executors (Spark standalone and Spark on Mesos modes only)
--executor-cores NUM           number of cores per executor, default 1 (Spark on YARN mode only)
--queue QUEUE_NAME             YARN queue to submit the application to, default queue "default" (Spark on YARN mode only)
--num-executors NUM            number of executors to launch, default 2 (Spark on YARN mode only)
--archives ARCHIVES            archives to be extracted into the working directory of each executor (Spark on YARN mode only)

 

1.3.6. How spark-submit works

For example: spark-submit --class cn.itcast.spark.WordCount

 

1. The main method of the org.apache.spark.deploy.SparkSubmit class is called
2. The doRunMain method receives the argument --class cn.itcast.spark.WordCount
3. A reference to the class is obtained via reflection: mainClass = Utils.classForName(childMainClass)
4. The main method of cn.itcast.spark.WordCount is then invoked via reflection (see the sketch below)
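The reflective call in steps 3 and 4 follows the standard reflection pattern; a simplified sketch (class name and arguments are illustrative; the real logic lives in org.apache.spark.deploy.SparkSubmit):

// Simplified sketch of invoking a user's main class via reflection.
val childMainClass = "cn.itcast.spark.WordCount"
val childArgs: Array[String] = Array("hdfs://mini1:9000/wordcount/in/*", "hdfs://mini1:9000/wordcount/out")

val mainClass  = Class.forName(childMainClass)                        // Utils.classForName(childMainClass)
val mainMethod = mainClass.getMethod("main", classOf[Array[String]])
mainMethod.invoke(null, childArgs)                                    // main is static, so the receiver is null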


Full PDF version of this document:

http://pan.baidu.com/s/1bp8G6NP
