spark-schedule


Job scheduling is a key component of Spark. Its purpose is to ensure that jobs are dispatched correctly to each data node.

  1. package.scala
    Spark’s scheduling components. This includes the org.apache.spark.scheduler.DAGScheduler and the
    lower-level org.apache.spark.scheduler.TaskScheduler.

  2. org.apache.spark.scheduler.DAGScheduler

    The high-level scheduling layer that implements stage-oriented scheduling.
    It computes a DAG of stages for each job, keeps track of which RDDs and stage outputs are materialized, and finds a minimal schedule to run the job.
    It then submits stages as TaskSets to an underlying TaskScheduler implementation that runs them on the cluster.
    A TaskSet contains fully independent tasks that can run right away based on the data that’s already on the cluster (e.g. map output files from previous stages), though it may fail if this data becomes unavailable.
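
    As a concrete illustration, a single action such as count() submits one job that the DAGScheduler turns into stages and hands to the TaskScheduler as TaskSets. A minimal, runnable sketch (the data and partition count are made up for illustration):

    import org.apache.spark.sql.SparkSession

    object JobSubmissionExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("job-submission-example")
          .master("local[*]")              // local run; a cluster manager works the same way
          .getOrCreate()
        val sc = spark.sparkContext

        // Transformations are lazy: nothing is scheduled yet.
        val rdd = sc.parallelize(1 to 1000, numSlices = 4)
          .map(_ * 2)
          .filter(_ % 3 == 0)

        // count() is an action: it submits a job to the DAGScheduler, which builds the
        // stage DAG (a single ResultStage here, since all dependencies are narrow) and
        // sends the 4 tasks to the TaskScheduler as one TaskSet.
        println(s"count = ${rdd.count()}")

        spark.stop()
      }
    }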

    • Spark stages are created by breaking the RDD graph at shuffle boundaries.
      RDD operations with “narrow” dependencies, like map() and filter(), are pipelined together into one set of tasks in each stage, but operations with shuffle dependencies require multiple stages (one to write a set of map output files, and another to read those files after a barrier); see the sketch after this list.
    • In the end, every stage will have only shuffle dependencies on other stages, and may compute multiple operations inside it.
      The actual pipelining of these operations happens in the RDD.compute() functions of various RDDs.
    • In addition to coming up with a DAG of stages, the DAGScheduler also determines the preferred locations to run each task on, based on the current cache status, and passes these to the low-level TaskScheduler.
    • Furthermore, it handles failures due to shuffle output files being lost, in which case old stages may need to be resubmitted. Failures within a stage that are not caused by shuffle file loss are handled by the TaskScheduler, which will retry each task a small number of times before cancelling the whole stage.
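
    For example, a shuffle operation such as reduceByKey() splits the lineage into a ShuffleMapStage that writes map output files and a ResultStage that reads them; rdd.toDebugString prints the lineage with the stage boundary shown as extra indentation. A small sketch:

    import org.apache.spark.sql.SparkSession

    object StageBoundaryExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("stage-boundary-example")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))

        // map() is a narrow dependency, so it is pipelined into the same stage
        // as the parallelized collection it reads from.
        val pairs = words.map(w => (w, 1))

        // reduceByKey() introduces a ShuffleDependency: the stage that reads the
        // shuffled data can only start after the map output files are written.
        val counts = pairs.reduceByKey(_ + _)

        // The indentation change in toDebugString marks the shuffle boundary.
        println(counts.toDebugString)

        // collect() is the action that actually submits the two-stage job.
        counts.collect().foreach(println)

        spark.stop()
      }
    }
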
    • When looking through this code, there are several key concepts:
      • Jobs (represented by [[ActiveJob]]) are the top-level work items submitted to the scheduler. For example, when the user calls an action, like count(), a job will be submitted through submitJob. Each Job may require the execution of multiple stages to build intermediate data.
      • Stages ([[Stage]]) are sets of tasks that compute intermediate results in jobs, where each task computes the same function on partitions of the same RDD. Stages are separated at shuffle boundaries, which introduce a barrier (where we must wait for the previous stage to finish to fetch outputs).
        There are two types of stages: [[ResultStage]], for the final stage that executes an action, and [[ShuffleMapStage]], which writes map output files for a shuffle.
        Stages are often shared across multiple jobs, if these jobs reuse the same RDDs.
      • Tasks are individual units of work, each sent to one machine.
      • Cache tracking: the DAGScheduler figures out which RDDs are cached to avoid recomputing them and likewise remembers which shuffle map stages have already produced output files to avoid redoing the map side of a shuffle (see the sketch after this list).
      • Preferred locations: the DAGScheduler also computes where to run each task in a stage based on the preferred locations of its underlying RDDs, or the location of cached or shuffle data.
      • Cleanup: all data structures are cleared when the running jobs that depend on them finish, to prevent memory leaks in a long-running application.
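
    As an illustration of cache tracking and preferred locations: once an RDD has been persisted and materialized, later jobs can reuse the cached partitions instead of recomputing them, and RDD.preferredLocations exposes where a partition’s data lives. A small sketch:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object CacheTrackingExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("cache-tracking-example")
          .master("local[*]")
          .getOrCreate()
        val sc = spark.sparkContext

        val base = sc.parallelize(1 to 1000000, numSlices = 8).map(i => (i % 10, i))

        // Persist the RDD: after the first action materializes it, the scheduler's cache
        // tracking lets later jobs read the cached partitions instead of recomputing them.
        base.persist(StorageLevel.MEMORY_ONLY)

        base.count()                    // job 1: computes and caches the partitions
        base.reduceByKey(_ + _).count() // job 2: its map side starts from the cached data

        // Preferred locations for the first partition. For a parallelized local collection
        // this is typically empty; for an RDD backed by HDFS files it lists the hosts
        // holding that split.
        println(base.preferredLocations(base.partitions(0)))

        spark.stop()
      }
    }
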
    To recover from failures, the same stage might need to run multiple times, and these re-runs are called "attempts". If the TaskScheduler reports that a task failed because a map output file from a previous stage was lost, the DAGScheduler resubmits that lost stage. This is detected through a CompletionEvent with FetchFailed, or an ExecutorLost event. The DAGScheduler will wait a small amount of time to see whether other nodes or tasks fail, then resubmit TaskSets for any lost stage(s) that compute the missing tasks. As part of this process, we might also have to create Stage objects for old (finished) stages where we previously cleaned up the Stage object.
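
    The retry behaviour described above is configurable. A hedged sketch (the keys are standard Spark configuration properties; the values shown are only illustrative, and both default to 4):

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    object RetryConfigExample {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("retry-config-example")
          .setMaster("local[*]")
          // How many times the TaskScheduler retries an individual task before the
          // whole stage (and job) is failed. (Task retries apply when running on a
          // cluster; plain local[N] mode runs each task at most once.)
          .set("spark.task.maxFailures", "8")
          // How many consecutive attempts of a stage (e.g. after fetch failures)
          // are allowed before the stage is aborted.
          .set("spark.stage.maxConsecutiveAttempts", "6")

        val spark = SparkSession.builder().config(conf).getOrCreate()
        // ... run jobs as usual; failed tasks and lost stages are retried within these limits ...
        spark.stop()
      }
    }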

    Since tasks from the old attempt of a stage could still be running, care must be taken to map any events received in the correct Stage object.

    Here’s a checklist to use when making or reviewing changes to this class:
    • All data structures should be cleared when the jobs involving them end, to avoid indefinite accumulation of state in long-running programs.
    • When adding a new data structure, update DAGSchedulerSuite.assertDataStructuresEmpty to include the new structure. This will help to catch memory leaks.

TaskScheduler
Low-level task scheduler interface, currently implemented exclusively by [[org.apache.spark.scheduler.TaskSchedulerImpl]].
This interface allows plugging in different task schedulers.
Each TaskScheduler schedules tasks for a single SparkContext.
These schedulers get sets of tasks submitted to them from the DAGScheduler for each stage, and are responsible for sending the tasks to the cluster, running them, retrying if there are failures, and mitigating stragglers.
They return events to the DAGScheduler.
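
Straggler mitigation is done through speculative execution, which is off by default and can be enabled via configuration. A hedged sketch (the threshold values are only illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SpeculationExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("speculation-example")
      .setMaster("local[*]")
      // Relaunch slow ("straggler") tasks on other executors; whichever copy
      // finishes first wins and the other attempt is killed.
      .set("spark.speculation", "true")
      // A task becomes speculatable when it runs spark.speculation.multiplier times
      // longer than the median of the finished tasks, once the fraction given by
      // spark.speculation.quantile of the tasks in its stage have completed.
      .set("spark.speculation.multiplier", "1.5")
      .set("spark.speculation.quantile", "0.75")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    // ... run jobs as usual; TaskSchedulerImpl handles the speculative retries ...
    spark.stop()
  }
}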
