spark note

Source: Internet · Editor: 程序博客网 · 2024/05/13 02:06

SparkContext:

def createSparkContext(): SparkContext = {
  val master = this.master match {
    case Some(m) => m
    case None =>
      val prop = System.getenv("MASTER")
      if (prop != null) prop else "local"
  }
  sparkContext = new SparkContext(master, "Spark shell")
  sparkContext  // return the newly created context
}
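The fallback order above (explicit setting, then the MASTER environment variable, then "local") can be exercised on its own in plain Scala, with no Spark on the classpath. The helper name resolveMaster below is illustrative, not part of Spark's API:

```scala
// Resolve the master URL: an explicit setting wins, then the MASTER
// environment variable, then the "local" default -- the same order
// used by createSparkContext above.
def resolveMaster(explicit: Option[String]): String =
  explicit match {
    case Some(m) => m
    case None =>
      val prop = System.getenv("MASTER")
      if (prop != null) prop else "local"
  }

println(resolveMaster(Some("spark://host:7077")))  // explicit wins
println(resolveMaster(None))                       // env var or "local"
```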

For a client to establish a connection to the Spark cluster, the SparkContext object
needs some basic information:
- master: the master URL, in one of the following formats:
  - local[n]: for local mode with n worker threads
  - spark://[sparkip]: to point to a Spark standalone cluster
  - mesos://: for a Mesos path if you are running a Mesos cluster
- application name: a human-readable application name
- sparkHome: the path to Spark on the master/worker machines
- jars: the list of JAR files required for your job

Scala
In a Scala program, you can create a SparkContext instance using the following code:
val sparkContext = new SparkContext(master_path, "application name",
  ["optional spark home path"], ["optional list of jars"])
While you can hardcode all of these values, it's better to read them from the
environment with reasonable defaults. This approach provides maximum flexibility
to run the code in a changing environment without having to recompile the code.
Using local as the default value for the master machine makes it easy to launch
your application locally in a test environment. By carefully selecting the defaults,
you can avoid having to over-specify them. An example would be as follows:
import spark.SparkContext
import spark.SparkContext._
import scala.util.Properties

// Read the master and Spark home from the environment, with defaults.
val master = Properties.envOrElse("MASTER", "local")
val sparkHome = Properties.envOrElse("SPARK_HOME", null)
// JARS is assumed to be a comma-separated list; empty if unset.
val myJars = Option(System.getenv("JARS")).map(_.split(",").toSeq).getOrElse(Seq())
val sparkContext = new SparkContext(master, "my app", sparkHome, myJars)


The collect() function is especially useful for testing, in much the same way as the
parallelize() function is. The collect() function only works if your data fits
in memory on a single host; even then, it creates a bottleneck because everything
has to come back to a single machine.
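As a rough sketch of that bottleneck, here is a toy stand-in in plain Scala (not Spark's API): "partitions" model data spread across workers, and collect concatenates all of them back on a single "driver":

```scala
// Toy model: parallelize splits a local collection into partitions,
// collect gathers every partition back into one place.
def parallelize[T](data: Seq[T], numPartitions: Int): Seq[Seq[T]] = {
  val size = math.max(1, math.ceil(data.length.toDouble / numPartitions).toInt)
  data.grouped(size).toSeq
}

def collect[T](partitions: Seq[Seq[T]]): Seq[T] =
  partitions.flatten  // everything lands on one "driver": the memory bottleneck

val partitions = parallelize(1 to 10, 3)
val gathered = collect(partitions)
// gathered now holds all ten elements in a single collection again
```

The round trip is lossless, which is exactly why it is convenient for tests and dangerous for large datasets: the gathered result must fit in one JVM's heap.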
