Spark Application Development --- Spark Study Notes 6


How do you develop and deploy a Spark application?

First, settle on the environment. I am using incubator-spark-0.8.1-incubating, which corresponds to Scala 2.9.3.

If you build with Maven or sbt, you can use the following GAV coordinates:

groupId = org.apache.spark
artifactId = spark-core_2.9.3
version = 0.8.1-incubating
If you need to access HDFS, also add a dependency on hadoop-client:

groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
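
For an sbt build of your own application, these coordinates translate into a build.sbt along the lines of the sketch below. The project name is a placeholder, and the hadoop-client version must match your cluster; 2.2.0 is used here only because it matches the build commands later in this post.

// build.sbt -- a minimal sketch matching the coordinates above
name := "my-spark-app"                  // placeholder project name

scalaVersion := "2.9.3"

libraryDependencies ++= Seq(
  "org.apache.spark"  % "spark-core_2.9.3" % "0.8.1-incubating",
  // hadoop-client must match the HDFS version of your cluster
  "org.apache.hadoop" % "hadoop-client"    % "2.2.0"
)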
For an sbt build of Spark itself, you can run sbt/sbt assembly to package the Spark source; the resulting jar ends up at assembly/target/scala-2.9.3/spark-assembly-0.8.1-incubating-hadoop*.jar.

After packaging, add this jar to your project's classpath.

Spark uses hadoop-client to communicate with HDFS and other Hadoop storage systems. Because the HDFS protocol has changed across several Hadoop releases, you must specify the version you are linking against. By default, Spark links against Hadoop 1.0.4; at build time you can choose a different version with the SPARK_HADOOP_VERSION variable:

SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly
If you want to run on YARN, add one more variable, SPARK_YARN=true:

SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
Finally, import SparkContext in your program and the Spark classes are ready to use:


import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
new SparkContext(master, appName, [sparkHome], [jars])

To create a Spark application, you first declare a SparkContext object; a sketch follows the parameter list below.

1. master specifies the cluster to connect to: Mesos, YARN, or local.

2. appName is your application's name, which shows up in the cluster's monitoring web UI.

3. sparkHome and jars are needed when deploying the application to a distributed environment.
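
As a concrete sketch, with the master URL, application name, Spark home path, and jar name all illustrative placeholders rather than values from a real cluster:

import org.apache.spark.SparkContext

val sc = new SparkContext(
  "spark://master-host:7077",          // master: the cluster to connect to
  "MyApp",                             // appName: shown in the monitoring web UI
  "/opt/spark",                        // sparkHome: Spark install path on the workers
  Seq("target/my-app-assembly.jar"))   // jars: your application code and dependencies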


When running spark-shell, you can also pass parameters:

$ MASTER=local[4] ./spark-shell
Here the MASTER variable says to start locally, using 4 cores.

On Ubuntu you can check the number of cores with cat /proc/cpuinfo; my machine only has 2...

victor@victor-ubuntu:~/software/spark$ more /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     T6600  @ 2.20GHz
stepping        : 10
microcode       : 0xa07
cpu MHz         : 2200.000
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 0
cpu cores       : 2
apicid          : 0
initial apicid  : 0
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm
bogomips        : 4388.84
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 23
model name      : Intel(R) Core(TM)2 Duo CPU     T6600  @ 2.20GHz
stepping        : 10
microcode       : 0xa07
cpu MHz         : 1200.000
cache size      : 2048 KB
physical id     : 0
siblings        : 2
core id         : 1
cpu cores       : 2
apicid          : 1
initial apicid  : 1
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm
bogomips        : 4388.84
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:
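
If you prefer to check from inside spark-shell instead of /proc/cpuinfo, the standard JVM call below (not part of the original post) reports the same count:

scala> Runtime.getRuntime.availableProcessors()
res0: Int = 2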

You can also add some jars at launch time:

$ MASTER=local[4] ADD_JARS=code.jar ./spark-shell

Master URLs

The master URL passed to Spark can be in one of the following formats:

  • local: Run Spark locally with one worker thread (i.e. no parallelism at all).
  • local[K]: Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
  • spark://HOST:PORT: Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
  • mesos://HOST:PORT: Connect to the given Mesos cluster. The host parameter is the hostname of the Mesos master. The port must be whichever one the master is configured to use, which is 5050 by default.

If no master URL is specified, the spark shell defaults to “local”.

For running on YARN, Spark launches an instance of the standalone deploy cluster within YARN; see running on YARN for details.

Deploying Code on a Cluster

If you want to run your application on a cluster, you will need to specify the two optional parameters to SparkContext to let it find your code:

  • sparkHome: The path at which Spark is installed on your worker machines (it should be the same on all of them).
  • jars: A list of JAR files on the local machine containing your application’s code and any dependencies, which Spark will deploy to all the worker nodes. You’ll need to package your application into a set of JARs using your build system. For example, if you’re using SBT, the sbt-assembly plugin is a good way to make a single JAR with your code and dependencies.

If you run spark-shell on a cluster, you can add JARs to it by specifying the ADD_JARS environment variable before you launch it. This variable should contain a comma-separated list of JARs. For example, ADD_JARS=a.jar,b.jar ./spark-shell will launch a shell with a.jar and b.jar on its classpath. In addition, any new classes you define in the shell will automatically be distributed.


OK, now let's get hands-on and write a Spark application.

Environment: IntelliJ IDEA 12.1.7 with the Scala plugin.

Create a project named wordcount and add the following jars to its classpath: the Scala 2.9.3 library and the Spark 0.8.1 assembly jar built against Hadoop 2.2.0.

1. First, create a Scala object.

Create a SparkContext using local mode. The input source is a copy of README.md, and the output goes to the SAVED directory.

First create the textFile RDD, then use flatMap to split each line on whitespace, turning it into a sequence of words (word1, word2, word3, ...).

Then call map to emit (word, 1) pairs, and finally reduceByKey to count word frequencies.

Note: flatMap, map, and reduceByKey here are all transformations, not actions.

The code:

/**
 * Created with IntelliJ IDEA.
 * User: shengli.victor
 * Date: 4/2/14
 * Time: 11:43 PM
 * To change this template use File | Settings | File Templates.
 */
import org.apache.spark._
import SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "WordCount",
      System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
    val textFile = sc.textFile("README.md")
    val result = textFile.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1)).reduceByKey(_ + _)
    result.saveAsTextFile("SAVED")
  }
}

reduceByKey(func, [numTasks]): When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function. Like in groupByKey, the number of reduce tasks is configurable through an optional second argument.
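
As a quick illustration, a sketch run in spark-shell (which provides sc and the SparkContext._ import):

val pairs = sc.parallelize(Seq(("a", 1), ("b", 1), ("a", 1)))
pairs.reduceByKey(_ + _).collect()   // Array((a,2), (b,1)); ordering may vary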


Here we see the familiar part-00000 and _SUCCESS files, which shows that the local-mode run succeeded.
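
To spot-check the output, one option (a sketch, assuming a spark-shell started in the same working directory as the SAVED folder) is:

// each line of the saved result is a (word,count) pair
sc.textFile("SAVED").take(10).foreach(println)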




If you want to run on YARN, note that the master must be changed from local to yarn-standalone.

Upload README.md to the hdfs://host:port/dw/wordcount/input directory.

The corresponding input directory can then be hdfs://host:port/dw/wordcount/input,

and the output directory can be hdfs://host:port/dw/wordcount/output.
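
The only changes to the program itself are the master string and the two paths; a sketch of the adjusted WordCount body (imports and object wrapper as above, with host:port standing in for your NameNode address):

val sc = new SparkContext("yarn-standalone", "WordCount",
  System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_TEST_JAR")))
val result = sc.textFile("hdfs://host:port/dw/wordcount/input")
  .flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
result.saveAsTextFile("hdfs://host:port/dw/wordcount/output")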

Barring surprises, the results should be the same as in local mode.



Original content. Please credit the source when reposting: http://blog.csdn.net/oopsoom/article/details/22827083
