Spark Programming Guide

Understanding closures

One of the harder things about Spark is understanding the scope and life cycle of variables and methods when executing code across a cluster. RDD operations that modify variables outside of their scope can be a frequent source of confusion. In the example below we’ll look at code that uses foreach() to increment a counter, but similar issues can occur for other operations as well.

Example

Consider the naive RDD element sum below, which may behave differently depending on whether execution is happening within the same JVM. A common example of this is when running Spark in local mode (--master = local[n]) versus deploying a Spark application to a cluster (e.g. via spark-submit to YARN). (Note: in local mode the driver, executors, master and worker all run inside a single JVM.)

var counter = 0
var rdd = sc.parallelize(data)

// Wrong: Don't do this!!
rdd.foreach(x => counter += x)

println("Counter value: " + counter)

Local vs. cluster modes
The behavior of the above code is undefined, and may not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks, each of which is executed by an executor. Prior to execution, Spark computes the task’s closure. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case foreach()). This closure is serialized and sent to each executor.

The variables within the closure sent to each executor are now copies and thus, when counter is referenced within the foreach function, it’s no longer the counter on the driver node. There is still a counter in the memory of the driver node but this is no longer visible to the executors! The executors only see the copy from the serialized closure. Thus, the final value of counter will still be zero since all operations on counter were referencing the value within the serialized closure.

In local mode, in some circumstances the foreach function will actually execute within the same JVM as the driver and will reference the same original counter, and may actually update it.

To ensure well-defined behavior in these sorts of scenarios one should use an Accumulator. Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster. The Accumulators section of this guide discusses these in more detail.

In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that’s just by accident and such code will not behave as expected in distributed mode. Use an Accumulator instead if some global aggregation is needed.
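
For example, the counter above can be rewritten with an accumulator. This is a minimal sketch (assuming sc and data from the snippet above, with numeric elements); the Accumulators section below covers the API in detail:

val counterAcc = sc.longAccumulator("counter")   // safe, cluster-wide counter
val rdd = sc.parallelize(data)

rdd.foreach(x => counterAcc.add(x))              // foreach is an action, so the updates are applied

println("Counter value: " + counterAcc.value)    // read the final value on the driver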

====================================================
Shared Variables
Normally, when a function passed to a Spark operation (such as map or reduce) is executed on a remote cluster node, it works on separate copies of all the variables used in the function. These variables are copied to each machine, and no updates to the variables on the remote machine are propagated back to the driver program. Supporting general, read-write shared variables across tasks would be inefficient. However, Spark does provide two limited types of shared variables for two common usage patterns: broadcast variables and accumulators.

Broadcast Variables

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.

Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.

Broadcast variables are created from a variable v by calling SparkContext.broadcast(v). The broadcast variable is a wrapper around v, and its value can be accessed by calling the value method. The code below shows this:

scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
broadcastVar: org.apache.spark.broadcast.Broadcast[Array[Int]] = Broadcast(0)

scala> broadcastVar.value
res0: Array[Int] = Array(1, 2, 3)

After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).
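
As a sketch of that usage (rdd here is an assumed RDD of integers, not part of the guide's example), the function shipped to the executors reads the data through broadcastVar.value rather than capturing the original array:

// Access the broadcast data via .value inside the function that runs on the executors.
val rdd = sc.parallelize(1 to 10)
val matching = rdd.filter(x => broadcastVar.value.contains(x))
matching.collect()   // Array(1, 2, 3)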

Accumulators
Accumulators are variables that are only “added” to through an associative and commutative operation and can therefore be efficiently supported in parallel. They can be used to implement counters (as in MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers can add support for new types.

As a user, you can create named or unnamed accumulators. As seen in the image below, a named accumulator (in this instance counter) will display in the web UI for the stage that modifies that accumulator. Spark displays the value for each accumulator modified by a task in the “Tasks” table.

A numeric accumulator can be created by calling SparkContext.longAccumulator() or SparkContext.doubleAccumulator() to accumulate values of type Long or Double, respectively. Tasks running on a cluster can then add to it using the add method. However, they cannot read its value. Only the driver program can read the accumulator’s value, using its value method.
The code below shows an accumulator being used to add up the elements of an array:

scala> val accum = sc.longAccumulator("My Accumulator")
accum: org.apache.spark.util.LongAccumulator = LongAccumulator(id: 0, name: Some(My Accumulator), value: 0)

scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum.add(x))

10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s

scala> accum.value
res2: Long = 10

While this code used the built-in support for accumulators of type Long, programmers can also create their own types by subclassing AccumulatorV2. The AccumulatorV2 abstract class has several methods which must be overridden: reset for resetting the accumulator to zero, add for adding another value into the accumulator, and merge for merging another same-type accumulator into this one. Other methods that must be overridden are contained in the API documentation. For example, supposing we had a MyVector class representing mathematical vectors, we could write:
class VectorAccumulatorV2 extends AccumulatorV2[MyVector, MyVector] {

  private val myVector: MyVector = MyVector.createZeroVector

  // Reset this accumulator to its zero value.
  def reset(): Unit = {
    myVector.reset()
  }

  // Add another value into this accumulator.
  def add(v: MyVector): Unit = {
    myVector.add(v)
  }

  // isZero, copy, merge and value must also be overridden; see the
  // AccumulatorV2 API documentation for their signatures.
}

// Then, create an Accumulator of this type:
val myVectorAcc = new VectorAccumulatorV2
// Then, register it into spark context:
sc.register(myVectorAcc, "MyVectorAcc1")
Note that, when programmers define their own type of AccumulatorV2, the resulting type can be different than that of the elements added.
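
As an illustration of that point, here is a minimal sketch (not from the guide; the class name and behaviour are illustrative) of an accumulator whose element type is Long but whose result type is List[Long]:

import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable.ArrayBuffer

// Illustrative accumulator: Long elements are added, a List[Long] is returned.
class ListAccumulator extends AccumulatorV2[Long, List[Long]] {
  private val buffer = ArrayBuffer.empty[Long]

  override def isZero: Boolean = buffer.isEmpty

  override def copy(): AccumulatorV2[Long, List[Long]] = {
    val newAcc = new ListAccumulator
    newAcc.buffer ++= buffer
    newAcc
  }

  override def reset(): Unit = buffer.clear()

  override def add(v: Long): Unit = buffer += v

  override def merge(other: AccumulatorV2[Long, List[Long]]): Unit =
    buffer ++= other.value

  override def value: List[Long] = buffer.toList
}

// Register and use it like any other accumulator:
val listAcc = new ListAccumulator
sc.register(listAcc, "collectedValues")
sc.parallelize(Seq(1L, 2L, 3L)).foreach(x => listAcc.add(x))
// listAcc.value is a List[Long] of the added elements (order not guaranteed)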

For accumulator updates performed inside actions only, Spark guarantees that each task’s update to the accumulator will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware of that each task’s update may be applied more than once if tasks or job stages are re-executed.
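
A hedged sketch of the difference (lines is an assumed RDD[String], not part of the guide's example):

val countedInAction = sc.longAccumulator("counted in action")
val countedInTransformation = sc.longAccumulator("counted in transformation")

// Inside an action: each task's update is applied exactly once, even if the task is restarted.
lines.foreach(line => if (line.trim.isEmpty) countedInAction.add(1))

// Inside a transformation: the update may be applied more than once if the
// stage is re-executed (task failure, speculative execution, recomputation).
val passedThrough = lines.map { line =>
  if (line.trim.isEmpty) countedInTransformation.add(1)
  line
}
passedThrough.count()   // an action is still needed before any update happens
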
Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action. Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like map(). The below code fragment demonstrates this property:

val accum = sc.longAccumulator
data.map { x => accum.add(x); x }
// Here, accum is still 0 because no actions have caused the map operation to be computed.
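
As a follow-up sketch (continuing the fragment above, with data assumed to be an RDD of numbers): running an action on the mapped RDD forces the computation, after which the accumulator holds the sum:

val mapped = data.map { x => accum.add(x); x }
mapped.count()   // an action: the map is now computed and accum is updated
accum.value      // now equal to the sum of the elements of data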
