Review Notes on the Official Spark Documentation: Tuning Spark (Part 7)


Tuning Spark

  • Data Serialization
  • Memory Tuning
    • Memory Management Overview
    • Determining Memory Consumption
    • Tuning Data Structures
    • Serialized RDD Storage
    • Garbage Collection Tuning
  • Other Considerations
    • Level of Parallelism
    • Memory Usage of Reduce Tasks
    • Broadcasting Large Variables
    • Data Locality
  • Summary

Because of the in-memory nature of most Spark computations, Spark programs can be bottlenecked by any resource in the cluster: CPU, network bandwidth, or memory. Most often, if the data fits in memory, the bottleneck is network bandwidth, but sometimes, you also need to do some tuning, such as storing RDDs in serialized form, to decrease memory usage. This guide will cover two main topics: data serialization, which is crucial for good network performance and can also reduce memory use, and memory tuning. We also sketch several smaller topics.

Because most Spark computations happen in memory, a Spark program running on a cluster can be bottlenecked by CPU, network bandwidth, or memory. Usually, if the data fits entirely in memory, network bandwidth is the bottleneck, but sometimes you also need to tune things, for example storing RDDs in serialized form to reduce memory usage. This guide covers two main topics: first, data serialization, which is crucial for network performance and also reduces memory use; second, memory tuning. Several smaller topics are sketched as well.


Data Serialization

Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application. Spark aims to strike a balance between convenience (allowing you to work with any Java type in your operations) and performance. It provides two serialization libraries:

Serialization plays an important role in the performance of any distributed application. If serialization is slow, or the serialized data is large, it will greatly slow down computation. Often this is the first thing you should tune when optimizing a Spark application. Spark aims to strike a balance between convenience (being able to work with any Java type in your operations) and performance, and it provides the following two serialization libraries:

  • Java serialization: By default, Spark serializes objects using Java’s ObjectOutputStream framework, and can work with any class you create that implements java.io.Serializable. You can also control the performance of your serialization more closely by extending java.io.Externalizable. Java serialization is flexible but often quite slow, and leads to large serialized formats for many classes.
      Java serialization: by default Spark uses Java's ObjectOutputStream framework, which works with any class you define that implements java.io.Serializable; you can tune serialization further by extending java.io.Externalizable. The default Java serializer is very flexible, but relatively slow, and its serialized output is large.


  • Kryo serialization: Spark can also use the Kryo library (version 2) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance.
      Kryo serialization: Spark can also use the Kryo library (version 2) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often by as much as 10x), but it does not support all Serializable types, and for best performance you must register in advance the classes you will use in your program (custom classes need registration).

You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk. The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.

You can switch serializers when initializing a job by calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer") on the SparkConf. The configured serializer is used both for shuffling data between worker nodes and for serializing RDDs to disk. The only reason Kryo is not the default is the custom-registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, Spark internally uses the Kryo serializer when shuffling RDDs of simple types, arrays of simple types, or strings.


Spark automatically includes Kryo serializers for the many commonly-used core Scala classes covered in the AllScalaRegistrar from the Twitter chill library.

To register your own custom classes with Kryo, use the registerKryoClasses method.

Spark automatically includes Kryo serializers for the commonly used core Scala classes covered by the AllScalaRegistrar from the Twitter chill library. To register your own classes, use the registerKryoClasses method, as in the example below:

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)

The Kryo documentation describes more advanced registration options, such as adding custom serialization code.

The Kryo documentation describes more advanced registration options, such as adding custom serialization code.
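As a hedged illustration, such advanced registration can be plugged in through the spark.kryo.registrator setting. The sketch below is only illustrative: MyRegistrator and the package name are made up, MyClass1 comes from the example above, and whether you need a hand-written com.esotericsoftware.kryo.Serializer depends on your types; plain kryo.register is usually enough.

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical registrator; register each class you plan to serialize.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyClass1])  // a custom Serializer could be passed as a second argument
  }
}

// Wire it up on the SparkConf from the earlier example (fully-qualified name is hypothetical):
// conf.set("spark.kryo.registrator", "com.example.MyRegistrator")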

If your objects are large, you may also need to increase the spark.kryoserializer.buffer config. This value needs to be large enough to hold the largest object you will serialize.

If your objects are large, you may also need to increase the spark.kryoserializer.buffer setting; the value must be large enough to hold the largest object you will serialize.
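A minimal sketch of raising these buffers on the SparkConf from the earlier example; the concrete sizes are arbitrary illustrations, not recommendations.

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
conf.set("spark.kryoserializer.buffer", "1m")        // initial per-core buffer (example value)
conf.set("spark.kryoserializer.buffer.max", "128m")  // upper bound; must fit your largest object (example value)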

Finally, if you don’t register your custom classes, Kryo will still work, but it will have to store the full class name with each object, which is wasteful.

Finally, if you do not register your custom classes, Kryo will still work, but it will have to store the full class name with every serialized object, which is wasteful.


Memory Tuning

There are three considerations in tuning memory usage: the amount of memory used by your objects (you may want your entire dataset to fit in memory), the cost of accessing those objects, and the overhead of garbage collection (if you have high turnover in terms of objects).

Memory tuning mainly involves three considerations: the total amount of memory used by your objects (you may want your whole dataset to fit in memory), the cost of accessing those objects, and the overhead of garbage collection (if object turnover is high).

By default, Java objects are fast to access, but can easily consume a factor of 2-5x more space than the “raw” data inside their fields. This is due to several reasons:

By default, Java objects are fast to access, but they can easily consume 2-5x more space than the "raw" data in their fields, for the following reasons:

  • Each distinct Java object has an “object header”, which is about 16 bytes and contains information such as a pointer to its class. For an object with very little data in it (say one Int field), this can be bigger than the data.
       Each object has an object header of about 16 bytes containing information such as a pointer to its class; for an object holding very little data (say a single Int field), the header can be larger than the data itself.
  • Java Strings have about 40 bytes of overhead over the raw string data (since they store it in an array of Chars and keep extra data such as the length), and store each character as two bytes due to String’s internal usage of UTF-16 encoding. Thus a 10-character string can easily consume 60 bytes.
    Java Strings carry about 40 bytes of overhead over the raw string data and store each character as two bytes (UTF-16), so a 10-character string can easily take about 60 bytes.
  • Common collection classes, such as HashMap and LinkedList, use linked data structures, where there is a “wrapper” object for each entry (e.g. Map.Entry). This object not only has a header, but also pointers (typically 8 bytes each) to the next object in the list.
    Common collection classes such as HashMap and LinkedList use linked data structures, wrapping each entry (e.g. Map.Entry) in an object that carries not only a header but also pointers (typically 8 bytes each) to the next element.
  • Collections of primitive types often store them as “boxed” objects such as java.lang.Integer.
    Collections of primitive types often store them as "boxed" objects such as java.lang.Integer.

This section will start with an overview of memory management in Spark, then discuss specific strategies the user can take to make more efficient use of memory in his/her application. In particular, we will describe how to determine the memory usage of your objects, and how to improve it – either by changing your data structures, or by storing data in a serialized format. We will then cover tuning Spark’s cache size and the Java garbage collector.

This section starts with an overview of memory management in Spark, then discusses specific strategies users can apply to use memory more efficiently in their applications. In particular, it describes how to determine the memory usage of your objects and how to improve it, either by changing your data structures or by storing data in a serialized format. It then covers tuning Spark's cache size and the Java garbage collector.


Memory Management Overview

Memory usage in Spark largely falls under one of two categories: execution and storage. Execution memory refers to that used for computation in shuffles, joins, sorts and aggregations, while storage memory refers to that used for caching and propagating internal data across the cluster. In Spark, execution and storage share a unified region (M). When no execution memory is used, storage can acquire all the available memory and vice versa. Execution may evict storage if necessary, but only until total storage memory usage falls under a certain threshold (R). In other words, R describes a subregion within M where cached blocks are never evicted. Storage may not evict execution due to complexities in implementation.

Memory usage in Spark falls largely into two categories: execution and storage. Execution memory is used for computation in shuffles, joins, sorts, and aggregations, while storage memory is used for caching and for propagating internal data across the cluster. Execution and storage share a unified region (M). When no execution memory is in use, storage can acquire all the available memory, and vice versa. Execution may evict storage if necessary, but only until total storage usage falls to a certain threshold (R); in other words, R is a sub-region of M in which cached blocks are never evicted by execution. Storage may not evict execution, because of complexities in the implementation.

This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, obviating unnecessary disk spills. Second, applications that do use caching can reserve a minimum storage space (R) where their data blocks are immune to being evicted. Lastly, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring user expertise of how memory is divided internally.

This design ensures several desirable properties. First, applications that do not use caching can use the entire space for execution, avoiding unnecessary spills to disk. Second, applications that do use caching can reserve a minimum storage space (R) whose data blocks are immune to eviction (i.e., cached data will not be forcibly evicted just because execution needs more memory). Finally, this approach provides reasonable out-of-the-box performance for a variety of workloads without requiring users to understand how memory is divided internally (installing Spark and keeping the default execution/storage memory settings already performs reasonably well).


Although there are two relevant configurations, the typical user should not need to adjust them as the default values are applicable to most workloads:

There are two relevant configuration options, but typical users should not need to adjust them, as the default values suit most workloads:

  • spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
     spark.memory.fraction expresses the size of M as a fraction of (JVM heap space - 300MB), with a default of 0.6. The remaining 40% is reserved for user data structures, internal metadata in Spark, and as a safeguard against OOM errors caused by sparse or unusually large records.

  • spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks are immune to being evicted by execution.
     spark.memory.storageFraction expresses R as a fraction of M (default 0.5). Cached blocks inside R are immune to eviction by execution (they will not be evicted just because execution is short on memory).

The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. See the discussion of advanced GC tuning below for details.
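As a hedged illustration, adjusting these two fractions (should you ever need to) would look roughly like the following on the SparkConf from the earlier example; the values shown are just the documented defaults restated explicitly.

conf.set("spark.memory.fraction", "0.6")         // size of the unified region M as a fraction of (heap - 300MB)
conf.set("spark.memory.storageFraction", "0.5")  // portion of M (R) whose cached blocks execution cannot evict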



Determining Memory Consumption

The best way to size the amount of memory consumption a dataset will require is to create an RDD, put it into cache, and look at the “Storage” page in the web UI. The page will tell you how much memory the RDD is occupying.

The best way to size the memory a dataset requires is to create an RDD, cache it, and then check the "Storage" page of the web UI, which shows how much memory the RDD occupies.


To estimate the memory consumption of a particular object, use SizeEstimator’s estimate method. This is useful for experimenting with different data layouts to trim memory usage, as well as determining the amount of space a broadcast variable will occupy on each executor heap.

To estimate the memory consumption of a particular object, use SizeEstimator's estimate method. This is useful for experimenting with different data layouts to trim memory usage, and also for determining how much heap space a broadcast variable will occupy on each executor.
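A minimal sketch of calling SizeEstimator; the lookup-table object built here is purely a made-up example.

import org.apache.spark.util.SizeEstimator

// Hypothetical lookup table whose footprint we want to measure before broadcasting it.
val lookupTable: Map[Int, String] = (1 to 100000).map(i => i -> ("value" + i)).toMap
val estimatedBytes: Long = SizeEstimator.estimate(lookupTable)
println(s"Estimated size: ${estimatedBytes / (1024 * 1024)} MB")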

Tuning Data Structures

The first way to reduce memory consumption is to avoid the Java features that add overhead, such as pointer-based data structures and wrapper objects. There are several ways to do this:

The first way to reduce memory consumption is to avoid the Java features that add overhead, such as pointer-based data structures and wrapper objects (a small sketch follows the list below). There are several ways to do this:

  1. Design your data structures to prefer arrays of objects and primitive types, instead of the standard Java or Scala collection classes (e.g. HashMap). The fastutil library provides convenient collection classes for primitive types that are compatible with the Java standard library.
  2. Avoid nested structures with a lot of small objects and pointers when possible.
  3. Consider using numeric IDs or enumeration objects instead of strings for keys.
  4. If you have less than 32 GB of RAM, set the JVM flag -XX:+UseCompressedOops to make pointers be four bytes instead of eight. You can add these options in spark-env.sh.
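As a hedged illustration of points 1 and 3, the sketch below replaces a String-keyed HashMap with a numeric-ID scheme backed by flat primitive arrays; the field names and sizes are invented for the example.

// Instead of: val userAges = new java.util.HashMap[String, Integer]()  (boxed values, per-entry wrapper objects)
// prefer compact primitive arrays indexed by a numeric user ID:
val maxUsers = 1000000
val ages = new Array[Int](maxUsers)        // primitive ints, no boxing, no per-entry headers
val scores = new Array[Double](maxUsers)   // one flat array per field instead of per-user objects

def recordUser(id: Int, age: Int, score: Double): Unit = {
  ages(id) = age
  scores(id) = score
}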

Serialized RDD Storage

When your objects are still too large to efficiently store despite this tuning, a much simpler way to reduce memory usage is to store them in serialized form, using the serialized StorageLevels in the RDD persistence API, such as MEMORY_ONLY_SER. Spark will then store each RDD partition as one large byte array. The only downside of storing data in serialized form is slower access times, due to having to deserialize each object on the fly. We highly recommend using Kryo if you want to cache data in serialized form, as it leads to much smaller sizes than Java serialization (and certainly than raw Java objects).

If, despite the tuning above, your objects are still too large to store efficiently, a simpler way to reduce memory usage is to store them in serialized form, using a serialized StorageLevel in the RDD persistence API such as MEMORY_ONLY_SER. Spark then stores each RDD partition as one large byte array. The only drawback of serialized storage is slower access, because each object has to be deserialized on the fly. If you need to cache data in serialized form, we strongly recommend Kryo, whose output is much smaller than Java serialization (and certainly smaller than raw Java objects).
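A minimal sketch of caching an RDD in serialized form as discussed above; rdd stands for any RDD you already have.

import org.apache.spark.storage.StorageLevel

// Each partition is kept in memory as a single serialized byte array.
rdd.persist(StorageLevel.MEMORY_ONLY_SER)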


(The notes below are adapted from a translation found online.)

Garbage Collection Tuning

JVM garbage collection can be a problem when you have large “churn” in terms of the RDDs stored by your program. (It is usually not a problem in programs that just read an RDD once and then run many operations on it.) When Java needs to evict old objects to make room for new ones, it will need to trace through all your Java objects and find the unused ones. The main point to remember here is that the cost of garbage collection is proportional to the number of Java objects, so using data structures with fewer objects (e.g. an array of Ints instead of a LinkedList) greatly lowers this cost. An even better method is to persist objects in serialized form, as described above: now there will be only one object (a byte array) per RDD partition. Before trying other techniques, the first thing to try if GC is a problem is to use serialized caching.

  If your program keeps a large "churn" of stored RDDs, JVM garbage collection can become a problem (it usually is not a problem for programs that read an RDD once and then run many operations on it). When Java needs to evict old objects to make room for new ones, it must trace through all Java objects to find the unused ones. The key point to remember is that the cost of garbage collection is proportional to the number of Java objects, so using data structures with fewer objects (e.g. an array of Ints instead of a LinkedList) greatly lowers this cost. An even better method is to persist objects in serialized form, as described above: then there is only one object (a byte array) per RDD partition. If GC is a problem, the first thing to try, before other techniques, is serialized caching.



GC can also be a problem due to interference between your tasks’ working memory (the amount of space needed to run the task) and the RDDs cached on your nodes. We will discuss how to control the space allocated to the RDD cache to mitigate this.

 A task's working memory (the space needed to run the task) and the RDDs cached on the node can interfere with each other, and this interference can also cause GC problems. Below we discuss how to control the space allocated to the RDD cache to mitigate this.

Measuring the Impact of GC

The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and the amount of time spent GC. This can be done by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the Java options. (See the configuration guide for info on passing Java options to Spark jobs.) Next time your Spark job is run, you will see messages printed in the worker’s logs each time a garbage collection occurs. Note these logs will be on your cluster’s worker nodes (in the stdout files in their work directories), not on your driver program.

  The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and how much time it takes. This can be done by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to Spark's Java options (see the configuration guide for how to pass Java options to Spark jobs). The next time your Spark job runs, messages will be printed to the worker logs each time a garbage collection occurs. Note that these logs live on the cluster's worker nodes (in the stdout files in their work directories), not in your driver program.
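A hedged sketch of passing those flags through the executor Java options on the SparkConf from the earlier examples; any additional flags you need would go in the same string.

conf.set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")  // GC log lines appear in each executor's stdout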

Advanced GC Tuning

To further tune garbage collection, we first need to understand some basic information about memory management in the JVM:

  • Java Heap space is divided in to two regions Young and Old. The Young generation is meant to hold short-lived objects while the Old generation is intended for objects with longer lifetimes.

  • The Young generation is further divided into three regions [Eden, Survivor1, Survivor2].

  • A simplified description of the garbage collection procedure: When Eden is full, a minor GC is run on Eden and objects that are alive from Eden and Survivor1 are copied to Survivor2. The Survivor regions are swapped. If an object is old enough or Survivor2 is full, it is moved to Old. Finally when Old is close to full, a full GC is invoked.

The goal of GC tuning in Spark is to ensure that only long-lived RDDs are stored in the Old generation and that the Young generation is sufficiently sized to store short-lived objects. This will help avoid full GCs to collect temporary objects created during task execution. Some steps which may be useful are:

  • Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times before a task completes, it means that there isn’t enough memory available for executing tasks.

  • If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You can set the size of the Eden to be an over-estimate of how much memory each task will need. If the size of Eden is determined to be E, then you can set the size of the Young generation using the option -Xmn=4/3*E. (The scaling up by 4/3 is to account for space used by survivor regions as well.)

  • In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching by lowering spark.memory.fraction; it is better to cache fewer objects than to slow down task execution. Alternatively, consider decreasing the size of the Young generation. This means lowering -Xmn if you’ve set it as above. If not, try changing the value of the JVM’s NewRatio parameter. Many JVMs default this to 2, meaning that the Old generation occupies 2/3 of the heap. It should be large enough such that this fraction exceeds spark.memory.fraction.

  • Try the G1GC garbage collector with -XX:+UseG1GC. It can improve performance in some situations where garbage collection is a bottleneck. Note that with large executor heap sizes, it may be important to increase the G1 region size with -XX:G1HeapRegionSize.

  • As an example, if your task is reading data from HDFS, the amount of memory used by the task can be estimated using the size of the data block read from HDFS. Note that the size of a decompressed block is often 2 or 3 times the size of the block. So if we wish to have 3 or 4 tasks’ worth of working space, and the HDFS block size is 128 MB, we can estimate size of Eden to be 4*3*128MB.

  • Monitor how the frequency and time taken by garbage collection changes with the new settings.

Our experience suggests that the effect of GC tuning depends on your application and the amount of memory available. There are many more tuning options described online, but at a high level, managing how frequently full GC takes place can help in reducing the overhead.

GC tuning flags for executors can be specified by setting spark.executor.extraJavaOptions in a job’s configuration.

To further tune garbage collection, we first need to understand some basics of memory management in the JVM:

  • The Java heap is divided into two regions, Young and Old. The Young generation holds short-lived objects, while the Old generation holds objects with longer lifetimes.
  • The Young generation is further divided into three regions: Eden, Survivor1, and Survivor2.
  • A simplified description of garbage collection: when Eden is full, a minor GC runs on Eden, and objects still alive in Eden and Survivor1 are copied to Survivor2; the Survivor regions are then swapped. If an object is old enough, or Survivor2 is full, it is moved to the Old generation. Finally, when the Old generation is close to full, a full GC is invoked.

  The goal of GC tuning in Spark is to ensure that only long-lived RDDs are stored in the Old generation and that the Young generation is large enough to hold short-lived objects. This helps avoid full GCs to collect the temporary objects created during task execution. Some potentially useful steps:

  • Collect GC stats to check whether garbage collection happens too frequently. If a full GC is invoked multiple times before a task completes, there is not enough memory for executing tasks.
  • In the printed GC stats, if the OldGen is close to full, reduce the memory used for caching by lowering spark.memory.fraction; it is better to cache fewer objects than to slow down task execution.
  • If there are many minor GCs but not many full GCs, allocating more memory to Eden helps. You can set Eden to an over-estimate of how much memory each task needs: if Eden's size is E, set the Young generation with -Xmn=4/3*E (the 4/3 accounts for the space used by the Survivor regions).
  • As an example, if a task reads data from HDFS, its memory usage can be estimated from the size of the block it reads. Note that a decompressed block is often 2-3 times the size of the block. So if we want working space for 3 or 4 tasks and the HDFS block size is 128 MB, we can estimate Eden's size as 4*3*128MB.
  • Monitor how the frequency and duration of garbage collection change with the new settings.

  Our experience suggests that the effect of GC tuning depends on your application and on the amount of memory available. Many more tuning options are described online, but at a high level, controlling how frequently full GC occurs helps reduce the overhead.

GC tuning flags for executors can be specified by setting spark.executor.extraJavaOptions in the job's configuration.
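For instance, trying the G1 collector as suggested above might look like the sketch below; the region size is an arbitrary example value and should be checked against your executor heap size.

conf.set("spark.executor.extraJavaOptions",
         "-XX:+UseG1GC -XX:G1HeapRegionSize=16m")  // example flags; tune for your heap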




Other Considerations

Level of Parallelism

Clusters will not be fully utilized unless you set the level of parallelism for each operation high enough. Spark automatically sets the number of “map” tasks to run on each file according to its size (though you can control it through optional parameters to SparkContext.textFile, etc), and for distributed “reduce” operations, such as groupByKey and reduceByKey, it uses the largest parent RDD’s number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation), or set the config property spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.

 The cluster will not be fully utilized unless the level of parallelism of each operation is high enough. Spark automatically sets the number of "map" tasks for each file according to its size (you can also control this through optional parameters to SparkContext.textFile, etc.); for distributed "reduce" operations such as groupByKey and reduceByKey, it uses the largest parent RDD's number of partitions. You can pass the level of parallelism as a second argument (see the spark.PairRDDFunctions documentation) or set spark.default.parallelism to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster.
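A minimal sketch of both approaches; the core count, partition number, and the pairs RDD are made-up illustrations following the 2-3 tasks-per-core rule of thumb.

// Cluster-wide default, e.g. for a cluster with 100 cores:
conf.set("spark.default.parallelism", "300")

// Or per operation, by passing the number of partitions as a second argument:
val counts = pairs.reduceByKey((a, b) => a + b, 300)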

Memory Usage of Reduce Tasks

Sometimes, you will get an OutOfMemoryError not because your RDDs don’t fit in memory, but because the working set of one of your tasks, such as one of the reduce tasks in groupByKey, was too large. Spark’s shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the level of parallelism, so that each task’s input set is smaller. Spark can efficiently support tasks as short as 200 ms, because it reuses one executor JVM across many tasks and it has a low task launching cost, so you can safely increase the level of parallelism to more than the number of cores in your clusters.

  Sometimes you will hit an OutOfMemoryError not because your RDDs do not fit in memory, but because the working set of one of your tasks, such as a reduce task in groupByKey, is too large. Spark's shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc.) build a hash table inside each task to perform the grouping, and that table can be large. The simplest fix is to increase the level of parallelism so that each task's input is smaller. Spark efficiently supports tasks as short as 200 ms, because it reuses one executor JVM across many tasks and has a low task-launching cost, so you can safely raise the level of parallelism above the number of cores in your cluster.

Broadcasting Large Variables

Using the broadcast functionality available in SparkContext can greatly reduce the size of each serialized task, and the cost of launching a job over a cluster. If your tasks use any large object from the driver program inside of them (e.g. a static lookup table), consider turning it into a broadcast variable. Spark prints the serialized size of each task on the master, so you can look at that to decide whether your tasks are too large; in general tasks larger than about 20 KB are probably worth optimizing.

  Using SparkContext's broadcast functionality can greatly reduce the size of each serialized task and the cost of launching a job on a cluster. If your tasks use a large object from the driver program (e.g. a static lookup table), consider turning it into a broadcast variable. Spark prints the serialized size of each task on the master, so you can check whether your tasks are too large; in general, tasks larger than about 20 KB are probably worth optimizing.
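A minimal sketch of broadcasting a driver-side lookup table instead of capturing it in every task's closure; lookupTable and records are placeholders for your own data.

// Without broadcasting, lookupTable would be re-serialized into every task.
val lookupTable: Map[String, Int] = Map("a" -> 1, "b" -> 2)
val bcLookup = sc.broadcast(lookupTable)

val resolved = records.map(r => bcLookup.value.getOrElse(r, -1))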

Data Locality

Data locality can have a major impact on the performance of Spark jobs. If data and the code that operates on it are together then computation tends to be fast. But if code and data are separated, one must move to the other. Typically it is faster to ship serialized code from place to place than a chunk of data because code size is much smaller than data. Spark builds its scheduling around this general principle of data locality.

Data locality can have a major impact on the performance of Spark jobs. If the data and the code that operates on it are together, computation tends to be fast; if they are separated, one must move to the other. Typically it is faster to ship the serialized code than to move a chunk of data, because code is much smaller than data. Spark builds its scheduling around this principle.

Data locality is how close data is to the code processing it. There are several levels of locality based on the data’s current location. In order from closest to farthest:

Data locality is how close the data is to the code that processes it. There are several locality levels based on the data's current location, ordered from closest to farthest:

  • PROCESS_LOCAL: the data is in the same JVM as the running code. This is the best locality possible.
  • NODE_LOCAL: the data is on the same node, for example in HDFS on that node, or in another executor on the same node. This is a little slower than PROCESS_LOCAL because the data has to travel between processes.
  • NO_PREF: the data is accessed equally quickly from anywhere and has no locality preference.
  • RACK_LOCAL: the data is on the same rack of servers but on a different server, so it needs to be sent over the network, typically through a single switch.
  • ANY: the data is elsewhere on the network and not on the same rack.

Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In situations where there is no unprocessed data on any idle executor, Spark switches to lower locality levels. There are two options: a) wait until a busy CPU frees up to start a task on data on the same server, or b) immediately start a new task in a farther away place that requires moving data there.

Spark prefers to schedule every task at the best possible locality level, but this is not always possible. When there is no unprocessed data on any idle executor, Spark falls back to lower locality levels. There are two options: a) wait for a busy CPU to free up and start the task on the node where the data lives, or b) immediately start a new task somewhere farther away, which requires moving the data there.

What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout expires, it starts moving the data from far away to the free CPU. The wait timeout for fallback between each level can be configured individually or all together in one parameter; see the spark.locality parameters on the configuration page for details. You should increase these settings if your tasks are long and see poor locality, but the default usually works well.

What Spark typically does is wait a short while in the hope that a busy CPU frees up; once that timeout expires, it starts moving the data to an idle node. The wait timeout for falling back between levels can be configured per level or all at once; see the spark.locality parameters on the configuration page for details. If your tasks run long and show poor locality you can increase these settings, but the defaults usually work well.
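A hedged sketch of raising the locality wait on the SparkConf from the earlier examples; 10s is just an example, and the per-level setting shown is an optional refinement.

conf.set("spark.locality.wait", "10s")       // fallback timeout between locality levels (example value)
conf.set("spark.locality.wait.node", "10s")  // optional per-level override for NODE_LOCAL (example value)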



Summary

This has been a short guide to point out the main concerns you should know about when tuning a Spark application – most importantly, data serialization and memory tuning. For most programs, switching to Kryo serialization and persisting data in serialized form will solve most common performance issues. Feel free to ask on the Spark mailing list about other tuning best practices.


 This has been a short guide to the main concerns when tuning a Spark application, most importantly data serialization and memory tuning. For most programs, switching to Kryo serialization and persisting data in serialized form solves the most common performance issues. Feel free to ask on the Spark mailing list about other tuning best practices.










