High Performance Spark

Source: Internet. Editor: 程序博客网. Time: 2024/06/07 06:12

High Performance Spark

Study notes: gitBook

Chapter 2 How Spark Works

Spark is a general-purpose distributed computing framework that runs on top of a distributed storage system and a cluster manager.

Spark Components

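Spark is built on the RDD abstraction: a lazily evaluated, statically typed, distributed collection with transformation functions. The components built on top of it are:

1. Spark SQL

2. Spark ML

3. Spark Streaming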

Spark Model of Parallel Computing: RDDs

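Spark represents datasets as immutable, distributed collections called RDDs. The partitions of an RDD can be spread across different nodes.

RDDs are lazy: the transformations are computed only when the data of the final RDD is actually needed, typically when writing to storage or collecting an aggregated result to the driver.

RDDs can be kept in memory and reused. Because they are immutable, a transformation produces a new RDD instead of modifying the existing one. This suggests reusing RDDs where possible and avoiding unnecessary transformations.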
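A minimal sketch of immutability and reuse, written as a script and assuming spark-core is on the classpath and a local master is acceptable. The data and the names `numbers` and `squares` are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local-mode context, assumed here only so the sketch can run on one machine.
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("rdd-immutability-sketch"))

val numbers = sc.parallelize(1 to 10)

// map returns a new RDD; `numbers` itself is left unchanged.
val squares = numbers.map(n => n * n)

// Mark the intermediate RDD for in-memory reuse so the two actions
// below do not each recompute the map step.
squares.cache()

val total = squares.sum()          // first action: computes and caches
val largest = squares.max()        // second action: can read the cached partitions

val originalTotal = numbers.sum()  // the parent RDD still holds 1 to 10
```

Caching only pays off when an RDD is used by more than one action; for a single pass it just costs memory.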

Lazy Evaluation

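RDDs are fully lazy: no partition is computed until an action is called. An operation on an RDD returns one of the following:

• an RDD (transformations)

• a result returned to the driver (operations like count or collect)

• a result written to an external storage system (such as copyToHadoop)

Actions trigger the scheduler, which builds a DAG based on the dependencies between RDD transformations. In other words, Spark works backward to define the series of steps (the DAG) needed to produce each object in the final distributed dataset (each partition). Using this series of steps (the execution plan), the scheduler computes the missing partitions for each stage.

Note: not all transformations are 100% lazy. For example, sortByKey needs to evaluate the RDD to determine the range of the data, so it is both a transformation and an action.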
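Laziness can be observed directly. The sketch below, a script assuming spark-core on the classpath and local mode, uses an accumulator (the name `mapCalls` is made up) to count how often the map function really runs before and after an action.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lazy-evaluation-sketch"))

// Counts how many times the map function is actually executed.
val mapCalls = sc.longAccumulator("mapCalls")

val doubled = sc.parallelize(1 to 100).map { n =>
  mapCalls.add(1)
  n * 2
}

// Only the transformation has been declared; no task has run yet.
val callsBeforeAction = mapCalls.value.longValue

// count() is an action: it triggers a job that evaluates the map.
val n = doubled.count()
val callsAfterAction = mapCalls.value.longValue
```

Accumulator updates made inside transformations can be applied more than once if a task is retried or the RDD is recomputed, so this is a teaching device rather than a reliable counter.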

Performance and usability advantages of lazy evaluation

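(1) Efficiency: laziness lets Spark combine operations that do not need to communicate with the driver (generally transformations with one-to-one dependencies), which avoids multiple passes over the data. For example:

a map and a filter on the same RDD can be executed together, so the data is accessed only once.

(2) Simpler logic: unlike MapReduce, where the developer has to write code to consolidate mapping operations, laziness lets narrow dependencies be chained and leaves the consolidation to Spark.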

Lazy evaluation and fault tolerance

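Instead of the traditional approach of maintaining a log of updates to track each RDD, every partition of an RDD records its own lineage. The advantage is that lost data can be recovered quickly by recomputing partitions in parallel.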

Lazy evaluation and debugging

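Laziness has consequences for debugging: a Spark program fails only at the point of an action (stack traces point to the action, even though the logical failure may lie in an earlier transformation).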
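A small sketch of this behaviour, assuming spark-core on the classpath and local mode. The faulty division is an invented example; the point is only where the failure becomes visible.

```scala
import scala.util.Try
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lazy-debugging-sketch"))

// The division by zero is only declared here; no exception is thrown yet.
val inverted = sc.parallelize(Seq(4, 2, 0, 1)).map(x => 100 / x)

// The failure surfaces when the action runs: the driver reports a
// SparkException at collect(), with the task's error as its cause.
val outcome = Try(inverted.collect())
```

When reading such a failure, it usually helps to look past the action in the driver stack trace and into the reported cause, which names the transformation that actually threw.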

[TODO] In-Memory Persistence and Memory Management

[TODO] Immutability and the RDD Interface

Types of RDDs

The toDebugString function defined on RDD returns the type of the RDD and a list of its parent RDDs.
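As an illustration, the sketch below prints the lineage of a small word count and also reads the direct parents through `dependencies`. It assumes spark-core on the classpath and local mode; the input strings are made up, and the exact text of the debug string varies between Spark versions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("lineage-sketch"))

val words = sc.parallelize(Seq("a b", "b c", "c a"), 2)
val counts = words.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)

// Human-readable lineage: one line per RDD, starting from the final one.
val lineage = counts.toDebugString
println(lineage)

// The direct parents are also available programmatically.
val parentTypes = counts.dependencies.map(_.rdd.getClass.getSimpleName)
```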

Functions on RDDs: Transformations Versus Actions

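There are two kinds of functions: actions and transformations.

1. Actions: return something that is not an RDD, possibly with a side effect.

2. Transformations: return another RDD.

Every Spark program must contain an action (bringing information to the driver or writing to stable storage). Actions trigger computation.

(1) Actions that bring data to the driver include collect, count, collectAsMap, takeSample, reduce, and take. Note: because driver memory is limited, it is best to avoid collect or takeSample on large data.

(2) Actions that write to storage include saveAsTextFile, saveAsSequenceFile, and saveAsObjectFile. Most actions that save to Hadoop are available only on RDDs of key/value pairs (they are defined in the PairRDDFunctions class and the NewHadoopRDD class). Others, such as saveAsTextFile and saveAsObjectFile, work on any RDD by adding an implicit null key to each record (which is ignored at the saving level).

Note: functions that return nothing (void in Java, Unit in Scala), such as foreach, are also actions and also trigger a job.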
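The distinction shows up in the return types. A minimal sketch, assuming spark-core on the classpath and local mode; the data and the temporary output directory are invented for illustration.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("actions-sketch"))

val pairs: RDD[(String, Int)] = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

// Transformation: returns another RDD, no job runs.
val bumped: RDD[(String, Int)] = pairs.mapValues(_ + 1)

// Actions that bring a result back to the driver.
val howMany: Long = bumped.count()
val firstTwo: Array[(String, Int)] = bumped.take(2)
val asMap: scala.collection.Map[String, Int] = bumped.collectAsMap()
val sumOfValues: Int = bumped.values.reduce(_ + _)

// Action that writes to storage (a fresh temporary directory here).
val outDir = Files.createTempDirectory("spark-actions-sketch").resolve("out").toString
bumped.saveAsTextFile(outDir)

// Action that returns Unit but still triggers a job.
bumped.foreach(_ => ())
```

take, count, and reduce return a bounded amount of data, which is why they are safer for the driver than collect on a large RDD.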

Wide Versus Narrow Dependencies

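Transformations fall into two categories:

1. transformations with narrow dependencies

2. transformations with wide dependencies

narrow dependencies

Each partition of the child RDD depends in a simple, finite way on partitions of the parent RDD, and each partition of the parent RDD is used by at most one child partition.

Typical transformations:

map, filter, mapPartitions, and flatMap

wide dependencies

Child partitions may depend on arbitrary partitions of the parent, and the exact dependencies cannot be fully determined before the parent's data has been computed.

Typical transformations:

groupByKey, reduceByKey, sort, sortByKey

Note:

join functions are a bit more complicated: depending on how the two parent RDDs are partitioned, they can have narrow or wide dependencies.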

narrow dependency:

Each partition is a self-contained piece of data, which is operated on by a task. Therefore, there is an almost one-to-one mapping between the number of partitions in an RDD and the number of tasks. In a typical setting, the number of partitions in a stage (and, in turn, its parallelism) remains the same. This is known as a narrow dependency: a partition gets its data from a single partition in the preceding transformation, leading to pipelined execution. This applies to map(), flatMap(), filter(), union(), and so on. coalesce() with shuffle set to false also results in a narrow dependency even though it takes in multiple partitions.

wide dependency:

A partition reads in records from multiple partitions in the preceding transformation. This applies to all *ByKey operations, such as groupByKey() and reduceByKey(); joining operations, such as join(), cogroup(), and so on; and repartition().
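The dependency type of an RDD can be inspected through its `dependencies` field. The sketch below assumes spark-core on the classpath and local mode, with invented data. Its last case uses reduceByKey rather than join, but it illustrates the same effect mentioned in the note on joins: when the parent is already partitioned by the target partitioner, no new shuffle is needed.

```scala
import org.apache.spark.{HashPartitioner, NarrowDependency, ShuffleDependency}
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("dependency-sketch"))

val pairs = sc.parallelize(1 to 100, 4).map(n => (n % 10, n))

// Narrow: each child partition reads exactly one parent partition.
val mapped = pairs.mapValues(_ * 2)
val mappedIsNarrow = mapped.dependencies.head.isInstanceOf[NarrowDependency[_]]

// Wide: reduceByKey on an unpartitioned parent needs a shuffle.
val reduced = pairs.reduceByKey(_ + _)
val reducedIsShuffle = reduced.dependencies.head.isInstanceOf[ShuffleDependency[_, _, _]]

// coalesce without a shuffle merges partitions through a narrow dependency.
val merged = pairs.coalesce(2, shuffle = false)
val mergedIsNarrow = merged.dependencies.head.isInstanceOf[NarrowDependency[_]]

// Same operation as above, but the parent already uses the target
// partitioner, so the reduce itself becomes narrow.
val partitioner = new HashPartitioner(4)
val prePartitioned = pairs.partitionBy(partitioner)
val reducedAgain = prePartitioned.reduceByKey(partitioner, _ + _)
val reducedAgainIsNarrow = reducedAgain.dependencies.head.isInstanceOf[NarrowDependency[_]]
```

The shuffle has not disappeared in the last case; it was paid once in partitionBy, and later key-based operations with the same partitioner can reuse that layout.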

Spark Job Scheduling

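A Spark application contains a driver process and is scheduled by the cluster manager according to its SparkContext.

driver process: where developers write the high-level Spark logic. It is accompanied by a series of executor processes (distributed across the nodes of the cluster).

The Spark program runs on the driver node and sends instructions to the executors.

Applications

A job is triggered by an action on an RDD.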

The Spark Application

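1. When the SparkContext starts, the Spark application starts: a driver and a series of executors are launched on the cluster.

2. Each executor has its own JVM. An executor cannot span multiple nodes, although one node may host several executors.

3. The SparkContext determines how many resources are allocated to each executor.

4. When a job is launched, each executor involved runs the corresponding tasks to compute the RDD.

5. RDDs are not shared between applications.

When the SparkContext starts, the driver program pings the cluster manager, and the cluster manager launches Spark executors on the worker nodes (the JVMs shown as black boxes in the book's diagram).

RDDs are computed on the executors (the partitions of an RDD are distributed across different nodes).

Each executor can hold multiple partitions, but a partition cannot span multiple executors.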

Default Spark Scheduler

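By default, Spark schedules jobs first in, first out (FIFO). It also supports a fair scheduler, which assigns tasks to concurrent jobs in round-robin fashion so that every job gets a share of the cluster resources. The Spark application launches jobs in the order in which their corresponding actions are called on the SparkContext.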
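Switching between the two policies is a configuration setting. A minimal sketch, assuming spark-core on the classpath; the application name and master are placeholders, and the conf still has to be passed to a SparkContext to take effect.

```scala
import org.apache.spark.SparkConf

// FIFO is the default; FAIR turns on round-robin sharing between the
// jobs of one application.
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("fair-scheduler-sketch")
  .set("spark.scheduler.mode", "FAIR")
```

Fair scheduling only matters when several jobs of the same application are submitted concurrently, for example from different threads of the driver.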

The Anatomy of a Spark Job

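In Spark's lazy evaluation paradigm, a Spark application does no work until the driver program calls an action.

For each action, the Spark scheduler builds a DAG for the job and then launches the job.

A job is divided into stages. Each stage consists of a set of tasks (each representing one unit of parallel computation), and the tasks run on the executors.

Application: may contain multiple jobs, each triggered by an RDD action.

Job: split into multiple stages at wide transformations (shuffle map stages or result stages).

Stage: may contain multiple tasks, one per parallelizable unit of the stage's computation (usually the number of tasks equals the number of partitions).

Task: runs on one partition.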

DAG

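Spark's high-level scheduler (the DAG Scheduler) uses the dependencies between RDDs to build a DAG (Directed Acyclic Graph) of stages for each job.

The DAG is the execution graph of the RDD dependencies within the stages. Its edges are built from the dependencies between partitions in the RDD transformations.

The DAG determines how the job executes. Seen another way, the DAG Scheduler builds the graph of stages for the job, decides the locations where tasks run, and passes this information to the TaskScheduler.

The TaskScheduler is responsible for running the tasks on the cluster. It creates a graph of the dependencies between partitions.

Because the edges of the DAG are built from the dependencies between partitions in RDD transformations, an operation that returns something other than an RDD has no children. In graph theory terms it is a leaf. An arbitrarily complex set of transformations maps to one execution graph, and once an action is called, no further nodes can be added to that graph.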

Jobs

A job is the highest-level element of the execution hierarchy. Each job corresponds to one action, and the action triggers the job.

A job is defined by calling an action.

Stages

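(1) Lazy transformations are executed only when an action is called.

(2) A job is defined and triggered by an action.

(3) An action may involve multiple transformations. The wide transformations are the points at which the job is split into stages; in other words, stages correspond to the shuffle dependencies produced by wide transformations.

(4) Seen another way, a stage is a set of computations (tasks) that can each run on a single executor independently, that is, without communicating with other executors or with the driver.

In other words, a new stage begins wherever network communication between workers is required, for example at the shuffle produced by a wide transformation. The stage boundaries created by shuffles are ShuffleDependencies. groupByKey is one example: it needs data that is spread across different partitions.

(5) Transformations with narrow dependencies can be executed within the same stage.

For example, the flatMap, map, and filter operations in word count run in the same stage.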
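The rule that stages are cut at shuffle dependencies can be sketched by walking an RDD's lineage and counting them. The helper `countShuffles` is hypothetical (it is not part of Spark) and assumes a linear lineage; with branching lineages it would count a shared ancestor more than once. The sketch assumes spark-core on the classpath and local mode.

```scala
import org.apache.spark.{ShuffleDependency, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("stage-boundary-sketch"))

// Hypothetical helper: counts the shuffle dependencies reachable from an RDD.
def countShuffles(rdd: RDD[_]): Int =
  rdd.dependencies.map { dep =>
    val here = if (dep.isInstanceOf[ShuffleDependency[_, _, _]]) 1 else 0
    here + countShuffles(dep.rdd)
  }.sum

// Word count: flatMap, map, and filter are narrow; reduceByKey is wide.
val counts = sc.parallelize(Seq("a b", "b c", "c a"), 2)
  .flatMap(_.split(" "))
  .map(w => (w, 1))
  .filter(_._1.nonEmpty)
  .reduceByKey(_ + _)

val shuffles = countShuffles(counts)
// For a linear lineage, one action on this RDD runs shuffles + 1 stages.
val expectedStages = shuffles + 1
```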

Tasks

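A stage is made up of tasks. A task is the smallest unit in the execution hierarchy, and each task represents one local computation. All tasks in a stage run the same code on different pieces of the data. Each task runs on exactly one executor, but an executor has several slots and can run multiple tasks. The number of tasks in a stage corresponds to the number of partitions of that stage's output RDD.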

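import org.apache.spark.rdd.RDD

def simpleSparkProgram(rdd: RDD[Double]): Long = {
  // stage 1: narrow transformations are pipelined together
  rdd.filter(_ < 1000.0)
    .map(x => (x, x))
    // stage 2: groupByKey shuffles, which starts a new stage
    .groupByKey()
    .map { case (value, groups) => (groups.sum, value) }
    // stage 3: sortByKey shuffles again
    .sortByKey()
    .count()
}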

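Stages are bounded by the shuffle operations (groupByKey, sortByKey). Each stage contains multiple tasks, one for each partition of the RDD produced by the stage's transformations, and the tasks usually run in parallel.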
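The one-task-per-partition relationship can be made visible with mapPartitionsWithIndex, whose function is invoked once per partition. A minimal sketch with invented data, assuming spark-core on the classpath and local mode.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("tasks-per-partition-sketch"))

val data = sc.parallelize(1 to 1000, 4)

// Each task processes one partition; record which partition it saw
// and how many records that partition held.
val recordsPerTask = data
  .mapPartitionsWithIndex { (partitionId, records) =>
    Iterator((partitionId, records.size))
  }
  .collect()

recordsPerTask.foreach { case (id, n) => println(s"partition $id: $n records") }
```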

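1. The cluster cannot necessarily run every task of a stage in parallel.

2. Each executor has a limited number of cores. The number of cores per executor is configured at the application level and is also bounded by the number of physical cores in the cluster.

3. Spark runs no more tasks in parallel than the number of executor cores allocated to the application.

4. The number of tasks that can run in parallel can be calculated from the settings in the Spark conf: total executor cores = cores per executor * number of executors

5. If there are more partitions (tasks) than tasks that can run at once, the remaining tasks must wait until running tasks finish.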
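The arithmetic above can be sketched as follows. The configuration values and the partition count are hypothetical, and the calculation assumes each task needs one core (the default for spark.task.cpus) and that the executor count is fixed rather than dynamically allocated.

```scala
import org.apache.spark.SparkConf

// Hypothetical cluster settings, chosen only for illustration.
val conf = new SparkConf(false)
  .set("spark.executor.cores", "4")
  .set("spark.executor.instances", "5")

val coresPerExecutor = conf.getInt("spark.executor.cores", 1)
val executors = conf.getInt("spark.executor.instances", 1)

// Upper bound on the number of tasks running at the same time.
val maxParallelTasks = coresPerExecutor * executors

// With more partitions than slots, tasks run in successive waves.
val partitions = 50
val waves = math.ceil(partitions.toDouble / maxParallelTasks).toInt

println(s"$maxParallelTasks tasks at a time, $waves waves for $partitions partitions")
```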

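Summary of the Spark execution model:

• Job: the set of RDD transformations needed to compute one final result

• Stage: a segment of work that can run without communicating with the driver; in other words, a group of computations that involves no movement of data between partitions

• Task: the unit of execution within a stage that does the work on one partition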

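Task Parallelism

(1) The number of partitions dictates the number of tasks assigned to each DStream or RDD.
(2) The parallelism of shuffle transformations can be configured. Transformations that can cause a shuffle include the *ByKey operations, join, cogroup, and so on.

spark.default.parallelism sets the parallelism; when it is not set, the default is the number of partitions of the parent RDD.
A second approach is to increase the number of partitions with repartition(). Too few partitions means more frequent GC and more spilling to disk.
Another effective approach is to change the HDFS block size (which amounts to changing the number of partitions of the parent RDD).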
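These knobs can be sketched as follows, assuming spark-core on the classpath and local mode. The partition counts are arbitrary examples. spark.default.parallelism is not set here; it would have to be placed on the SparkConf before the context is created.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[2]").setAppName("parallelism-sketch"))

val pairs = sc.parallelize(1 to 1000, 4).map(n => (n % 10, n))
val initialPartitions = pairs.getNumPartitions

// Shuffle transformations accept an explicit partition count, which
// becomes the number of tasks in the stage after the shuffle.
val reduced = pairs.reduceByKey(_ + _, 16)
val reducedPartitions = reduced.getNumPartitions

// repartition() redistributes an existing RDD with a full shuffle.
val widened = pairs.repartition(8)
val widenedPartitions = widened.getNumPartitions
```

Since repartition() always shuffles, passing the partition count directly to the shuffle transformation avoids an extra pass when one is already needed.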
