Translation of the DAGScheduler Class Documentation Comments

  The high-level scheduling layer that implements stage-oriented scheduling. It computes a DAG of
  stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a
  minimal schedule to run the job. It then submits stages as TaskSets to an underlying
  TaskScheduler implementation that runs them on the cluster. A TaskSet contains fully independent
  tasks that can run right away based on the data that's already on the cluster (e.g. map output
  files from previous stages), though it may fail if this data becomes unavailable.
  
  This is the high-level, stage-oriented scheduling layer: it splits each job into stages, tracks which RDD and stage outputs are materialized, and finds a minimal-cost schedule for the job. It then submits the stages as TaskSets to the underlying TaskScheduler implementation, which runs them on the cluster. A TaskSet holds independent tasks that can run immediately against data already on the cluster (e.g. the map output of a previous stage); if that data becomes unavailable, the tasks will fail.
  
  
 
  Spark stages are created by breaking the RDD graph at shuffle boundaries. RDD operations with
  "narrow" dependencies, like map() and filter(), are pipelined together into one set of tasks
  in each stage, but operations with shuffle dependencies require multiple stages (one to write a
  set of map output files, and another to read those files after a barrier). In the end, every
  stage will have only shuffle dependencies on other stages, and may compute multiple operations
  inside it. The actual pipelining of these operations happens in the RDD.compute() functions of
  various RDDs (MappedRDD, FilteredRDD, etc).
  
  Stages are split at shuffle boundaries. RDD operations with narrow dependencies, such as map() and filter(), are pipelined into one set of tasks (in other words, map and filter are not stage boundaries), whereas a shuffle dependency forces a new stage. After the split, shuffles exist only between stages, never inside a stage (in other words, every shuffle cuts a stage). A single stage may contain several operations (for example several consecutive maps); that pipeline of operations is actually executed when the RDD's compute() is called.
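As an illustration (a sketch; the input path and variable names below are invented), a word count shows where the stages get cut: flatMap and map are narrow dependencies and are pipelined into one ShuffleMapStage, while reduceByKey introduces a shuffle dependency, so collect() runs in a separate ResultStage that reads the shuffled output.

    import org.apache.spark.{SparkConf, SparkContext}

    object StageBoundarySketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("stage-demo").setMaster("local[2]"))

        val lines  = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical input path
        val words  = lines.flatMap(_.split(" "))             // narrow dependency
        val pairs  = words.map(word => (word, 1))            // narrow dependency, pipelined with flatMap
        val counts = pairs.reduceByKey(_ + _)                // shuffle dependency -> new stage

        // collect() is an action: one job, two stages
        //   Stage 0 (ShuffleMapStage): textFile -> flatMap -> map, writes map output files
        //   Stage 1 (ResultStage):     reduceByKey -> collect, reads the shuffled data
        counts.collect().foreach(println)
        sc.stop()
      }
    }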
 
  In addition to coming up with a DAG of stages, the DAGScheduler also determines the preferred
  locations to run each task on, based on the current cache status, and passes these to the
  low-level TaskScheduler. Furthermore, it handles failures due to shuffle output files being
  lost, in which case old stages may need to be resubmitted. Failures *within* a stage that are
  not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task
  a small number of times before cancelling the whole stage.
  
  
  Besides splitting stages, the DAGScheduler also determines the preferred locations to run each task, based on the current cache status, and passes this information to the low-level TaskScheduler. It also handles failures caused by lost shuffle output files; in that case the stage that produced the lost data has to be resubmitted. Failures inside a stage that are not caused by lost shuffle files are handled by the TaskScheduler, which retries each task a small number of times before cancelling the whole stage.
 
 
  When looking through this code, there are several key concepts:
 
   - Jobs (represented by [[ActiveJob]]) are the top-level work items submitted to the scheduler.
     For example, when the user calls an action, like count(), a job will be submitted through
     submitJob. Each Job may require the execution of multiple stages to build intermediate data.
Jobs are the top-level work items submitted to the scheduler; for example, when the user calls an action such as count(), a job is submitted via submitJob. Each job may need multiple stages to build its intermediate data.
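For instance (a minimal sketch, assuming `sc` is an existing SparkContext such as the one provided by spark-shell), each action translates into one job handed to the DAGScheduler, so two actions over the same lineage produce two jobs:

    val data = sc.parallelize(1 to 1000).map(_ * 2)

    val total   = data.count()     // action -> job 1, submitted through submitJob/runJob
    val results = data.collect()   // action -> job 2, a separate job over the same RDD graph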
 
   - Stages ([[Stage]]) are sets of tasks that compute intermediate results in jobs, where each
     task computes the same function on partitions of the same RDD. Stages are separated at shuffle
     boundaries, which introduce a barrier (where we must wait for the previous stage to finish to
     fetch outputs). There are two types of stages: [[ResultStage]], for the final stage that
     executes an action, and [[ShuffleMapStage]], which writes map output files for a shuffle.
     Stages are often shared across multiple jobs, if these jobs reuse the same RDDs.
 
A Stage is a set of tasks that computes intermediate results within a job; every task runs the same function on a different partition of the same RDD. Stages are split at shuffle boundaries, which introduce a barrier (the next stage must wait for the previous stage to finish before fetching its outputs). There are two kinds of stages: ResultStage, the final stage that executes an action, and ShuffleMapStage, whose tasks write the map output files for a shuffle. Stages are often shared by multiple jobs when those jobs reuse the same RDDs.
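A sketch of stage sharing (path and names invented, `sc` assumed to exist): when two jobs reuse the same shuffled RDD, the second job can typically reuse the ShuffleMapStage whose map output files already exist, so only a new ResultStage actually runs.

    val counts = sc.textFile("hdfs:///tmp/input.txt")   // hypothetical path
      .map(line => (line, 1))
      .reduceByKey(_ + _)            // ShuffleMapStage writes map output files

    counts.count()      // job 1: ShuffleMapStage + ResultStage
    counts.collect()    // job 2: the ShuffleMapStage is typically shown as "skipped" because
                        // its map outputs are still available; only a new ResultStage runs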
 
   - Tasks are individual units of work, each sent to one machine.
    A Task is an individual unit of work; each task is sent to a single machine.
 
   - Cache tracking: the DAGScheduler figures out which RDDs are cached to avoid recomputing them
     and likewise remembers which shuffle map stages have already produced output files to avoid
     redoing the map side of a shuffle.
Cache tracking: the DAGScheduler figures out which RDDs are cached so they are not recomputed, and likewise remembers which shuffle map stages have already produced their output files so the map side of a shuffle is not redone.
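A small sketch of what cache tracking buys (path invented, `sc` assumed to exist): once an RDD is persisted and a first action has materialized it, later jobs start their stages from the cached partitions instead of recomputing the lineage above them.

    val parsed = sc.textFile("hdfs:///tmp/events.log")   // hypothetical path
      .map(_.toLowerCase)
      .cache()                                           // mark the RDD for caching

    parsed.count()                                // job 1 computes and caches the partitions
    parsed.filter(_.contains("error")).count()    // job 2 starts from the cached partitions
                                                  // instead of re-reading and re-mapping the file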
 
 
   - Preferred locations: the DAGScheduler also computes where to run each task in a stage based
     on the preferred locations of its underlying RDDs, or the location of cached or shuffle data.
Preferred locations: the DAGScheduler also decides where to run each task in a stage, based on the preferred locations of the underlying RDDs or on the location of cached or shuffle data.
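One place where location preferences are visible from user code (a sketch; the hostnames are made up and `sc` is assumed to exist): SparkContext.makeRDD accepts explicit preferred locations per element, and RDD.preferredLocations reports where a partition would ideally run.

    val rdd = sc.makeRDD(Seq(
      ("block-1", Seq("host-a")),
      ("block-2", Seq("host-b"))
    ))

    // The DAGScheduler combines this kind of preference with cache and shuffle
    // locations when deciding where to launch each task.
    rdd.partitions.foreach { p =>
      println(s"partition ${p.index} prefers ${rdd.preferredLocations(p)}")
    }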
 
 
   - Cleanup: all data structures are cleared when the running jobs that depend on them finish,
     to prevent memory leaks in a long-running application.
Cleanup: all data structures are cleared once the running jobs that depend on them finish, which prevents memory leaks in a long-running application.
 
 
  To recover from failures, the same stage might need to run multiple times, which are called
  "attempts". If the TaskScheduler reports that a task failed because a map output file from a
  previous stage was lost, the DAGScheduler resubmits that lost stage. This is detected through a
  CompletionEvent with FetchFailed, or an ExecutorLost event. The DAGScheduler will wait a small
  amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for any lost
  stage(s) that compute the missing tasks. As part of this process, we might also have to create
  Stage objects for old (finished) stages where we previously cleaned up the Stage object. Since
  tasks from the old attempt of a stage could still be running, care must be taken to map any
  events received in the correct Stage object.
  To recover from failures, the same stage may need to run several times; each run is called an "attempt". If the TaskScheduler reports that a task failed because a map output file from a previous stage was lost, the DAGScheduler resubmits the stage that lost the data. This is detected through a CompletionEvent with FetchFailed, or through an ExecutorLost event. The DAGScheduler waits a short time to see whether other nodes or tasks also fail, then resubmits TaskSets for any lost stage(s) to compute the missing tasks. As part of this process, Stage objects may have to be recreated for old (finished) stages whose Stage objects were already cleaned up. Since tasks from an old attempt of a stage may still be running, care must be taken to route every event to the correct Stage object.
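A deliberately simplified sketch of the recovery flow described above (this is not the actual DAGScheduler code; the type FetchFailed and the method names here are invented for illustration): on a fetch failure, both the map stage that produced the lost output and the stage that observed the failure are marked for resubmission, and each resubmission becomes a new attempt of the stage.

    // Conceptual sketch only -- invented types and names, not Spark's real internals.
    case class FetchFailed(mapStageId: Int, failedStageId: Int)

    class RecoverySketch {
      private val stagesToResubmit = scala.collection.mutable.Set[Int]()

      def handleFetchFailed(event: FetchFailed): Unit = {
        // The parent stage's map output is lost, so it must be recomputed.
        stagesToResubmit += event.mapStageId
        // The stage that saw the failure also reruns its unfinished tasks.
        stagesToResubmit += event.failedStageId
        // The real scheduler delays resubmission briefly so that several failures
        // on the same executor collapse into a single resubmission wave.
      }

      def resubmitLostStages(submit: Int => Unit): Unit = {
        stagesToResubmit.foreach(submit)   // each resubmission is a new "attempt" of the stage
        stagesToResubmit.clear()
      }
    }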
  
 
  Here's a checklist to use when making or reviewing changes to this class:
  A checklist to go through when changing or reviewing this class:
   - All data structures should be cleared when the jobs involving them end to avoid indefinite
     accumulation of state in long-running programs.
All data structures should be cleared once the jobs that use them finish, to avoid indefinite accumulation of state in a long-running program.
   - When adding a new data structure, update `DAGSchedulerSuite.assertDataStructuresEmpty` to
     include the new structure. This will help to catch memory leaks.
When adding a new data structure, update `DAGSchedulerSuite.assertDataStructuresEmpty` to include the new structure; this helps catch memory leaks.
 
 
 
 
 
 
 
 
 
 
 