The First Spark Program


In this article, William will build our first Spark program and run it on the Spark cluster set up in the earlier articles.

Java Program

For a Java program, Maven is a good choice for managing dependencies and the various steps of the build and release process.

Generating a Java Project with Maven

mvn archetype:generate -DgroupId=com.spark.example -DartifactId=JavaSparkPi
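Running the command as shown will prompt interactively for an archetype and version. If you prefer a non-interactive run, the standard quickstart archetype can be selected explicitly; the extra flags below follow normal Maven usage and are not part of the original command:

mvn archetype:generate -DgroupId=com.spark.example -DartifactId=JavaSparkPi \
    -DarchetypeArtifactId=maven-archetype-quickstart -DinteractiveMode=false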

Maven creates the project files for us, following the standard directory layout:

JavaSparkPi
|-src
  |-main
    |-java
      |-com
        |-spark
          |-example
            |-App.java
  |-test
    |-java
      |-com
        |-spark
          |-example
            |-AppTest.java
|-pom.xml

Next, we will use the official Spark example JavaSparkPi.java as our sample program, so replace the generated App.java with it:

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.spark.example;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.Function2;

import java.util.ArrayList;
import java.util.List;

/**
 * Computes an approximation of Pi.
 * Usage: JavaSparkPi [slices]
 */
public final class JavaSparkPi {

  public static void main(String[] args) throws Exception {
    // Create the SparkConf object
    SparkConf sparkConf = new SparkConf().setAppName("JavaSparkPi");
    // Create the JavaSparkContext object
    JavaSparkContext jsc = new JavaSparkContext(sparkConf);

    int slices = (args.length == 1) ? Integer.parseInt(args[0]) : 2;
    int n = 100000 * slices;
    List<Integer> l = new ArrayList<Integer>(n);
    for (int i = 0; i < n; i++) {
      l.add(i);
    }

    JavaRDD<Integer> dataSet = jsc.parallelize(l, slices);

    int count = dataSet.map(new Function<Integer, Integer>() {
      @Override
      public Integer call(Integer integer) {
        // Pick a random point in the square [-1, 1] x [-1, 1]
        double x = Math.random() * 2 - 1;
        double y = Math.random() * 2 - 1;
        // Count it if it falls inside the unit circle
        return (x * x + y * y < 1) ? 1 : 0;
      }
    }).reduce(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer integer, Integer integer2) {
        return integer + integer2;
      }
    });

    System.out.println("Pi is roughly " + 4.0 * count / n);

    jsc.stop();
  }
}

This program computes an approximation of Pi. It takes an optional slices argument, which defaults to 2; the larger the value of slices, the more accurate the result, but also the larger the amount of computation.
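For reference, the estimate works because each point is drawn uniformly from the square [-1, 1] x [-1, 1], and the unit circle covers π/4 of that square's area. So count / n ≈ π / 4, and therefore π ≈ 4 × count / n, which is exactly the expression printed at the end of main.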

Configuring the pom Dependencies

JavaSparkPi references classes from the org.apache.spark packages, so the dependency must be declared in the pom file.

Edit JavaSparkPi/pom.xml and add the following to the dependencies section:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
</dependency>
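For context, after this edit the dependencies section of the pom might look roughly as follows; the junit entry is the one generated by the archetype (it also shows up in the dependency tree below):

<dependencies>
  <dependency>
    <groupId>junit</groupId>
    <artifactId>junit</artifactId>
    <version>3.8.1</version>
    <scope>test</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.2.0</version>
  </dependency>
</dependencies>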

We can inspect the dependency tree with the mvn dependency plugin:

mvn dependency:tree -Ddetail=true
[INFO] com.spark.example:JavaSparkPi:jar:1.0-SNAPSHOT
[INFO] +- junit:junit:jar:3.8.1:test
[INFO] \- org.apache.spark:spark-core_2.10:jar:1.2.0:compile
[INFO]    +- com.twitter:chill_2.10:jar:0.5.0:compile
[INFO]    |  \- com.esotericsoftware.kryo:kryo:jar:2.21:compile
[INFO]    |     +- com.esotericsoftware.reflectasm:reflectasm:jar:shaded:1.07:compile
[INFO]    |     +- com.esotericsoftware.minlog:minlog:jar:1.2:compile
[INFO]    |     \- org.objenesis:objenesis:jar:1.2:compile
[INFO]    +- com.twitter:chill-java:jar:0.5.0:compile
[INFO]    +- org.apache.hadoop:hadoop-client:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-common:jar:2.2.0:compile
[INFO]    |  |  +- commons-cli:commons-cli:jar:1.2:compile
[INFO]    |  |  +- org.apache.commons:commons-math:jar:2.1:compile
[INFO]    |  |  +- xmlenc:xmlenc:jar:0.52:compile
[INFO]    |  |  +- commons-io:commons-io:jar:2.1:compile
[INFO]    |  |  +- commons-logging:commons-logging:jar:1.1.1:compile
[INFO]    |  |  +- commons-lang:commons-lang:jar:2.5:compile
[INFO]    |  |  +- commons-configuration:commons-configuration:jar:1.6:compile
[INFO]    |  |  |  +- commons-collections:commons-collections:jar:3.2.1:compile
[INFO]    |  |  |  +- commons-digester:commons-digester:jar:1.8:compile
[INFO]    |  |  |  |  \- commons-beanutils:commons-beanutils:jar:1.7.0:compile
[INFO]    |  |  |  \- commons-beanutils:commons-beanutils-core:jar:1.8.0:compile
[INFO]    |  |  +- org.codehaus.jackson:jackson-core-asl:jar:1.8.8:compile
[INFO]    |  |  +- org.codehaus.jackson:jackson-mapper-asl:jar:1.8.8:compile
[INFO]    |  |  +- org.apache.avro:avro:jar:1.7.4:compile
[INFO]    |  |  +- com.google.protobuf:protobuf-java:jar:2.5.0:compile
[INFO]    |  |  +- org.apache.hadoop:hadoop-auth:jar:2.2.0:compile
[INFO]    |  |  \- org.apache.commons:commons-compress:jar:1.4.1:compile
[INFO]    |  |     \- org.tukaani:xz:jar:1.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-hdfs:jar:2.2.0:compile
[INFO]    |  |  \- org.mortbay.jetty:jetty-util:jar:6.1.26:compile
[INFO]    |  +- org.apache.hadoop:hadoop-mapreduce-client-app:jar:2.2.0:compile
[INFO]    |  |  +- org.apache.hadoop:hadoop-mapreduce-client-common:jar:2.2.0:compile
[INFO]    |  |  |  +- org.apache.hadoop:hadoop-yarn-client:jar:2.2.0:compile
[INFO]    |  |  |  |  +- com.google.inject:guice:jar:3.0:compile
[INFO]    |  |  |  |  |  +- javax.inject:javax.inject:jar:1:compile
[INFO]    |  |  |  |  |  \- aopalliance:aopalliance:jar:1.0:compile
[INFO]    |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-grizzly2:jar:1.9:compile
[INFO]    |  |  |  |  |  +- com.sun.jersey.jersey-test-framework:jersey-test-framework-core:jar:1.9:compile
[INFO]    |  |  |  |  |  |  +- javax.servlet:javax.servlet-api:jar:3.0.1:compile
[INFO]    |  |  |  |  |  |  \- com.sun.jersey:jersey-client:jar:1.9:compile
[INFO]    |  |  |  |  |  \- com.sun.jersey:jersey-grizzly2:jar:1.9:compile
[INFO]    |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http:jar:2.1.2:compile
[INFO]    |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-framework:jar:2.1.2:compile
[INFO]    |  |  |  |  |     |     \- org.glassfish.gmbal:gmbal-api-only:jar:3.0.0-b023:compile
[INFO]    |  |  |  |  |     |        \- org.glassfish.external:management-api:jar:3.0.0-b012:compile
[INFO]    |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-server:jar:2.1.2:compile
[INFO]    |  |  |  |  |     |  \- org.glassfish.grizzly:grizzly-rcm:jar:2.1.2:compile
[INFO]    |  |  |  |  |     +- org.glassfish.grizzly:grizzly-http-servlet:jar:2.1.2:compile
[INFO]    |  |  |  |  |     \- org.glassfish:javax.servlet:jar:3.1:compile
[INFO]    |  |  |  |  +- com.sun.jersey:jersey-server:jar:1.9:compile
[INFO]    |  |  |  |  |  +- asm:asm:jar:3.1:compile
[INFO]    |  |  |  |  |  \- com.sun.jersey:jersey-core:jar:1.9:compile
[INFO]    |  |  |  |  +- com.sun.jersey:jersey-json:jar:1.9:compile
[INFO]    |  |  |  |  |  +- org.codehaus.jettison:jettison:jar:1.1:compile
[INFO]    |  |  |  |  |  |  \- stax:stax-api:jar:1.0.1:compile
[INFO]    |  |  |  |  |  +- com.sun.xml.bind:jaxb-impl:jar:2.2.3-1:compile
[INFO]    |  |  |  |  |  |  \- javax.xml.bind:jaxb-api:jar:2.2.2:compile
[INFO]    |  |  |  |  |  |     \- javax.activation:activation:jar:1.1:compile
[INFO]    |  |  |  |  |  +- org.codehaus.jackson:jackson-jaxrs:jar:1.8.3:compile
[INFO]    |  |  |  |  |  \- org.codehaus.jackson:jackson-xc:jar:1.8.3:compile
[INFO]    |  |  |  |  \- com.sun.jersey.contribs:jersey-guice:jar:1.9:compile
[INFO]    |  |  |  \- org.apache.hadoop:hadoop-yarn-server-common:jar:2.2.0:compile
[INFO]    |  |  \- org.apache.hadoop:hadoop-mapreduce-client-shuffle:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-yarn-api:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-mapreduce-client-core:jar:2.2.0:compile
[INFO]    |  |  \- org.apache.hadoop:hadoop-yarn-common:jar:2.2.0:compile
[INFO]    |  +- org.apache.hadoop:hadoop-mapreduce-client-jobclient:jar:2.2.0:compile
[INFO]    |  \- org.apache.hadoop:hadoop-annotations:jar:2.2.0:compile
[INFO]    +- org.apache.spark:spark-network-common_2.10:jar:1.2.0:compile
[INFO]    +- org.apache.spark:spark-network-shuffle_2.10:jar:1.2.0:compile
[INFO]    +- net.java.dev.jets3t:jets3t:jar:0.7.1:compile
[INFO]    |  +- commons-codec:commons-codec:jar:1.3:compile
[INFO]    |  \- commons-httpclient:commons-httpclient:jar:3.1:compile
[INFO]    +- org.apache.curator:curator-recipes:jar:2.4.0:compile
[INFO]    |  +- org.apache.curator:curator-framework:jar:2.4.0:compile
[INFO]    |  |  \- org.apache.curator:curator-client:jar:2.4.0:compile
[INFO]    |  +- org.apache.zookeeper:zookeeper:jar:3.4.5:compile
[INFO]    |  |  \- jline:jline:jar:0.9.94:compile
[INFO]    |  \- com.google.guava:guava:jar:14.0.1:compile
[INFO]    +- org.eclipse.jetty:jetty-plus:jar:8.1.14.v20131031:compile
[INFO]    |  +- org.eclipse.jetty.orbit:javax.transaction:jar:1.1.1.v201105210645:compile
[INFO]    |  +- org.eclipse.jetty:jetty-webapp:jar:8.1.14.v20131031:compile
[INFO]    |  |  +- org.eclipse.jetty:jetty-xml:jar:8.1.14.v20131031:compile
[INFO]    |  |  \- org.eclipse.jetty:jetty-servlet:jar:8.1.14.v20131031:compile
[INFO]    |  \- org.eclipse.jetty:jetty-jndi:jar:8.1.14.v20131031:compile
[INFO]    |     \- org.eclipse.jetty.orbit:javax.mail.glassfish:jar:1.4.1.v201005082020:compile
[INFO]    |        \- org.eclipse.jetty.orbit:javax.activation:jar:1.1.0.v201105071233:compile
[INFO]    +- org.eclipse.jetty:jetty-security:jar:8.1.14.v20131031:compile
[INFO]    +- org.eclipse.jetty:jetty-util:jar:8.1.14.v20131031:compile
[INFO]    +- org.eclipse.jetty:jetty-server:jar:8.1.14.v20131031:compile
[INFO]    |  +- org.eclipse.jetty.orbit:javax.servlet:jar:3.0.0.v201112011016:compile
[INFO]    |  +- org.eclipse.jetty:jetty-continuation:jar:8.1.14.v20131031:compile
[INFO]    |  \- org.eclipse.jetty:jetty-http:jar:8.1.14.v20131031:compile
[INFO]    |     \- org.eclipse.jetty:jetty-io:jar:8.1.14.v20131031:compile
[INFO]    +- org.apache.commons:commons-lang3:jar:3.3.2:compile
[INFO]    +- org.apache.commons:commons-math3:jar:3.1.1:compile
[INFO]    +- com.google.code.findbugs:jsr305:jar:1.3.9:compile
[INFO]    +- org.slf4j:slf4j-api:jar:1.7.5:compile
[INFO]    +- org.slf4j:jul-to-slf4j:jar:1.7.5:compile
[INFO]    +- org.slf4j:jcl-over-slf4j:jar:1.7.5:compile
[INFO]    +- log4j:log4j:jar:1.2.17:compile
[INFO]    +- org.slf4j:slf4j-log4j12:jar:1.7.5:compile
[INFO]    +- com.ning:compress-lzf:jar:1.0.0:compile
[INFO]    +- org.xerial.snappy:snappy-java:jar:1.1.1.6:compile
[INFO]    +- net.jpountz.lz4:lz4:jar:1.2.0:compile
[INFO]    +- org.roaringbitmap:RoaringBitmap:jar:0.4.5:compile
[INFO]    +- commons-net:commons-net:jar:2.2:compile
[INFO]    +- org.spark-project.akka:akka-remote_2.10:jar:2.3.4-spark:compile
[INFO]    |  +- org.spark-project.akka:akka-actor_2.10:jar:2.3.4-spark:compile
[INFO]    |  |  \- com.typesafe:config:jar:1.2.1:compile
[INFO]    |  +- io.netty:netty:jar:3.8.0.Final:compile
[INFO]    |  +- org.spark-project.protobuf:protobuf-java:jar:2.5.0-spark:compile
[INFO]    |  \- org.uncommons.maths:uncommons-maths:jar:1.2.2a:compile
[INFO]    +- org.spark-project.akka:akka-slf4j_2.10:jar:2.3.4-spark:compile
[INFO]    +- org.scala-lang:scala-library:jar:2.10.4:compile
[INFO]    +- org.json4s:json4s-jackson_2.10:jar:3.2.10:compile
[INFO]    |  +- org.json4s:json4s-core_2.10:jar:3.2.10:compile
[INFO]    |  |  +- org.json4s:json4s-ast_2.10:jar:3.2.10:compile
[INFO]    |  |  +- com.thoughtworks.paranamer:paranamer:jar:2.6:compile
[INFO]    |  |  \- org.scala-lang:scalap:jar:2.10.0:compile
[INFO]    |  |     \- org.scala-lang:scala-compiler:jar:2.10.0:compile
[INFO]    |  |        \- org.scala-lang:scala-reflect:jar:2.10.0:compile
[INFO]    |  \- com.fasterxml.jackson.core:jackson-databind:jar:2.3.1:compile
[INFO]    |     +- com.fasterxml.jackson.core:jackson-annotations:jar:2.3.0:compile
[INFO]    |     \- com.fasterxml.jackson.core:jackson-core:jar:2.3.1:compile
[INFO]    +- org.apache.mesos:mesos:jar:shaded-protobuf:0.18.1:compile
[INFO]    +- io.netty:netty-all:jar:4.0.23.Final:compile
[INFO]    +- com.clearspring.analytics:stream:jar:2.7.0:compile
[INFO]    +- com.codahale.metrics:metrics-core:jar:3.0.0:compile
[INFO]    +- com.codahale.metrics:metrics-jvm:jar:3.0.0:compile
[INFO]    +- com.codahale.metrics:metrics-json:jar:3.0.0:compile
[INFO]    +- com.codahale.metrics:metrics-graphite:jar:3.0.0:compile
[INFO]    +- org.tachyonproject:tachyon-client:jar:0.5.0:compile
[INFO]    |  \- org.tachyonproject:tachyon:jar:0.5.0:compile
[INFO]    +- org.spark-project:pyrolite:jar:2.0.1:compile
[INFO]    +- net.sf.py4j:py4j:jar:0.8.2.1:compile
[INFO]    \- org.spark-project.spark:unused:jar:1.0.0:compile

Configuring maven-shade-plugin

By default, Maven stores all dependency jars in a local repository folder named .m2. When you run the package phase, the resulting jar therefore only references those dependency jars by name; their actual contents are not bundled in.

When the program runs on a cluster, however, things are different: we can hardly guarantee that every dependency jar is present on every worker host. We therefore need to produce a fat jar (an uber JAR, also called an assembly JAR) that contains all of the dependency contents.

This can be done with the maven-shade-plugin. Add the following to the pom:

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <version>2.3</version>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>
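Optionally, the shade plugin can also write a Main-Class entry into the jar's manifest. This is not needed here, since spark-submit is told the main class via --class, but for completeness, a sketch of that extra configuration (placed inside the <plugin> element) might look like this:

<configuration>
  <transformers>
    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
      <mainClass>com.spark.example.JavaSparkPi</mainClass>
    </transformer>
  </transformers>
</configuration>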

Building the JAR with Maven

mvn clean compile package

After a successful run, a JavaSparkPi-1.0-SNAPSHOT.jar file appears in the JavaSparkPi/target folder. It is about 76 MB, which shows that all of the dependency jars have indeed been bundled in.
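As an optional check (not part of the original steps), you can list the archive's contents to confirm that dependency classes were bundled:

jar tf JavaSparkPi/target/JavaSparkPi-1.0-SNAPSHOT.jar | grep org/apache/spark | head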

Here William is only demonstrating how to build an uber JAR. In practice, the spark-core jar is guaranteed to exist on every host in the cluster, so there is no need to bundle it into the jar we produce.

Edit JavaSparkPi/pom.xml and set the scope of the spark-core dependency to provided:

<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.2.0</version>
  <scope>provided</scope>
</dependency>

Run mvn clean compile package again. This time JavaSparkPi-1.0-SNAPSHOT.jar is only about 5 KB and no longer contains the spark-core related dependency jars.

Submitting the Program to the Spark Cluster

./bin/spark-submit \
  --class com.spark.example.JavaSparkPi \
  --master spark://192.168.32.130:7077 \
  $HOME/JavaSparkPi/target/JavaSparkPi-1.0-SNAPSHOT.jar \
  1000
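spark-submit also accepts resource-related options when running against a standalone master; a sketch with illustrative values that are not from the original article:

./bin/spark-submit \
  --class com.spark.example.JavaSparkPi \
  --master spark://192.168.32.130:7077 \
  --executor-memory 1G \
  --total-executor-cores 4 \
  $HOME/JavaSparkPi/target/JavaSparkPi-1.0-SNAPSHOT.jar \
  1000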

Python Program

Python is a scripting language, so deployment is comparatively simple and there is no compilation step. We again use the Pi-approximation example program:

./bin/spark-submit \
  --master spark://192.168.32.130:7077 \
  examples/src/main/python/pi.py \
  1000

Any .zip, .egg, or .py files that the script references can be added with the --py-files option.
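A minimal sketch of that, where deps.zip and helper.py are hypothetical stand-ins for your own dependency files and my_job.py for your own script:

./bin/spark-submit \
  --master spark://192.168.32.130:7077 \
  --py-files deps.zip,helper.py \
  my_job.py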
