Hadoop-0.20.203.0rc1 Fair Scheduler (FairScheduler) Source Code Analysis


In Hadoop, the scheduler is a pluggable module; users can design their own scheduler to suit the needs of their applications. Three schedulers are commonly seen in Hadoop: the default FIFO scheduler (FIFO Scheduler), the capacity scheduler (Capacity Scheduler), and the fair scheduler (Fair Scheduler). For background, see the article 《Hadoop调度器总结》 (A Summary of Hadoop Schedulers).

In this article we analyze the code related to the fair scheduler (FairScheduler) in hadoop-0.20.203.0rc1.

The Hadoop Scheduler Framework

In Hadoop, the task scheduler (TaskScheduler) is loaded and invoked by the JobTracker. Users specify which scheduler to use through the mapred.jobtracker.taskScheduler property in the configuration file mapred-site.xml. TaskScheduler is, as the name suggests, the task scheduler. In Hadoop, the JobTracker receives jobs submitted by JobClients and, based on the InputFormat's splits and other configuration, generates a number of map and reduce tasks. Then, whenever a TaskTracker reports through its heartbeat that it has free task slots, the JobTracker dispatches tasks to it. Deciding exactly which tasks to hand to that TaskTracker is the job of the task scheduler, i.e. the concrete implementation of the TaskScheduler's assignTasks method.

Task dispatch is carried out jointly by the JobTracker's scheduling framework and the TaskScheduler's concrete scheduling policy. In short:
  • The JobTracker uses a listener mechanism to keep the TaskScheduler informed of job changes; the TaskScheduler then manages the jobs to be scheduled according to its own policy.
  • When the JobTracker needs to dispatch tasks to a TaskTracker, it calls the TaskScheduler's assignTasks() method to obtain the tasks to dispatch.

(1) TaskScheduler

First, let's look at the abstract task scheduler class that Hadoop provides: TaskScheduler. Its source code is as follows:
package org.apache.hadoop.mapred;

import java.io.IOException;
import java.util.Collection;
import java.util.List;

import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;

/**
 * Used by a {@link JobTracker} to schedule {@link Task}s on
 * {@link TaskTracker}s.
 * <p>
 * {@link TaskScheduler}s typically use one or more
 * {@link JobInProgressListener}s to receive notifications about jobs.
 * <p>
 * It is the responsibility of the {@link TaskScheduler}
 * to initialize tasks for a job, by calling {@link JobInProgress#initTasks()}
 * between the job being added (when
 * {@link JobInProgressListener#jobAdded(JobInProgress)} is called)
 * and tasks for that job being assigned (by
 * {@link #assignTasks(TaskTrackerStatus)}).
 * @see EagerTaskInitializationListener
 */
abstract class TaskScheduler implements Configurable {

  protected Configuration conf;
  protected TaskTrackerManager taskTrackerManager;

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  public synchronized void setTaskTrackerManager(
      TaskTrackerManager taskTrackerManager) {
    this.taskTrackerManager = taskTrackerManager;
  }

  /**
   * Lifecycle method to allow the scheduler to start any work in separate
   * threads.
   * @throws IOException
   */
  // Startup hook: load configuration files, initialize state, etc.
  public void start() throws IOException {
    // do nothing
  }

  /**
   * Lifecycle method to allow the scheduler to stop any work it is doing.
   * @throws IOException
   */
  // Shutdown hook
  public void terminate() throws IOException {
    // do nothing
  }

  /**
   * Returns the tasks we'd like the TaskTracker to execute right now.
   *
   * @param taskTracker The TaskTracker for which we're looking for tasks.
   * @return A list of tasks to run on that TaskTracker, possibly empty.
   */
  // The key method: assign tasks to the given TaskTracker
  public abstract List<Task> assignTasks(TaskTrackerStatus taskTracker)
    throws IOException;

  /**
   * Returns a collection of jobs in an order which is specific to
   * the particular scheduler.
   * @param queueName
   * @return
   */
  // Get the list of jobs for the given queue name
  public abstract Collection<JobInProgress> getJobs(String queueName);
}
As the TaskScheduler code shows, the heart of a task scheduler is how it assigns tasks, i.e. the assignTasks() method!
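Pulling the pieces together, a minimal scheduler might register a JobInProgressListener in start() and serve jobs in plain submission order from assignTasks(). The sketch below is not from the Hadoop source; the SimpleFifoScheduler name and its one-task-per-heartbeat policy are illustrative assumptions, built only on the TaskScheduler and TaskTrackerManager APIs shown in this article:

package org.apache.hadoop.mapred;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical minimal scheduler: jobs are tracked through a
// JobInProgressListener and served in plain submission (FIFO) order.
class SimpleFifoScheduler extends TaskScheduler {
  // Jobs the JobTracker has announced, in arrival order.
  private final List<JobInProgress> jobs = new ArrayList<JobInProgress>();

  private final JobInProgressListener listener = new JobInProgressListener() {
    public void jobAdded(JobInProgress job) {
      synchronized (jobs) { jobs.add(job); }
    }
    public void jobRemoved(JobInProgress job) {
      synchronized (jobs) { jobs.remove(job); }
    }
    public void jobUpdated(JobChangeEvent event) {
      // React to run-state changes here if needed.
    }
  };

  @Override
  public void start() throws IOException {
    // Ask the JobTracker (our TaskTrackerManager) to push job changes to us.
    taskTrackerManager.addJobInProgressListener(listener);
  }

  @Override
  public void terminate() throws IOException {
    taskTrackerManager.removeJobInProgressListener(listener);
  }

  @Override
  public List<Task> assignTasks(TaskTrackerStatus tracker) throws IOException {
    ClusterStatus cluster = taskTrackerManager.getClusterStatus();
    List<Task> assigned = new ArrayList<Task>();
    synchronized (jobs) {
      for (JobInProgress job : jobs) {
        if (job.getStatus().getRunState() != JobStatus.RUNNING) continue;
        // Offer at most one map task per heartbeat, in FIFO order.
        Task task = job.obtainNewMapTask(tracker, cluster.getTaskTrackers(),
            taskTrackerManager.getNumberOfUniqueHosts());
        if (task != null) {
          assigned.add(task);
          break;
        }
      }
    }
    return assigned;
  }

  @Override
  public Collection<JobInProgress> getJobs(String queueName) {
    synchronized (jobs) { return new ArrayList<JobInProgress>(jobs); }
  }
}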

(2) JobTracker

JobTracker is the most central component of Hadoop: it monitors the jobs running across the whole cluster and manages and schedules the cluster's resources. Each TaskTracker reports basic information about its machine to the JobTracker via heartbeat, including memory in use, memory remaining, currently running tasks, and the number of free slots. As soon as the JobTracker sees that a TaskTracker has free slots, it calls the scheduler's assignTasks() method to assign tasks to that TaskTracker.

The flow by which the JobTracker drives the TaskScheduler is as follows:
1) Load the configuration files mapred-default.xml and mapred-site.xml.
2) The JobTracker's main() method calls startTracker(JobConf conf), which in startTracker(JobConf conf, String identifier) sets up the scheduler (TaskScheduler).
3) Offer service: accept jobs submitted by JobClients, communicate with TaskTrackers via heartbeat, assign tasks, and so on.
package org.apache.hadoop.mapred;

/*******************************************************
 * JobTracker is the central location for submitting and
 * tracking MR jobs in a network environment.
 *******************************************************/
public class JobTracker implements MRConstants, InterTrackerProtocol,
    JobSubmissionProtocol, TaskTrackerManager, RefreshUserMappingsProtocol,
    RefreshAuthorizationPolicyProtocol, AdminOperationsProtocol,
    JobTrackerMXBean {

  // Load the configuration files
  static {
    Configuration.addDefaultResource("mapred-default.xml");
    Configuration.addDefaultResource("mapred-site.xml");
  }

  ...

  // The task scheduler instance
  private final TaskScheduler taskScheduler;

  ...

  public static JobTracker startTracker(JobConf conf, String identifier)
  throws IOException, InterruptedException {
    DefaultMetricsSystem.initialize("JobTracker");
    JobTracker result = null;
    while (true) {
      try {
        result = new JobTracker(conf, identifier);
        // Hand the scheduler its TaskTrackerManager, which is the JobTracker
        // itself; this shows that JobTracker and TaskScheduler hold
        // references to each other.
        result.taskScheduler.setTaskTrackerManager(result);
        break;
      }
      ...
    }

    ...
    return result;
  }

  ...

  JobTracker(final JobConf conf, String identifier, Clock clock)
  throws IOException, InterruptedException {

    ...

    // Create the task scheduler instance.
    // First, read the TaskScheduler implementation class from the
    // mapred.jobtracker.taskScheduler configuration property; the default
    // is the FIFO scheduler (JobQueueTaskScheduler).
    Class<? extends TaskScheduler> schedulerClass
      = conf.getClass("mapred.jobtracker.taskScheduler",
          JobQueueTaskScheduler.class, TaskScheduler.class);
    // Then instantiate the scheduler via Java reflection.
    taskScheduler = (TaskScheduler) ReflectionUtils.newInstance(schedulerClass, conf);

    ...
  }

  ...

  /**
   * Run forever
   */
  public void offerService() throws InterruptedException, IOException {
    ...
    // Start the scheduler
    taskScheduler.start();

    ...
  }

  ...

  /**
   * The periodic heartbeat mechanism between the {@link TaskTracker} and
   * the {@link JobTracker}.
   *
   * The {@link JobTracker} processes the status information sent by the
   * {@link TaskTracker} and responds with instructions to start/stop
   * tasks or jobs, and also 'reset' instructions during contingencies.
   */
  public synchronized HeartbeatResponse heartbeat(TaskTrackerStatus status,
                                                  boolean restarted,
                                                  boolean initialContact,
                                                  boolean acceptNewTasks,
                                                  short responseId)
    throws IOException {

    ...
    if (recoveryManager.shouldSchedule() && acceptNewTasks && !isBlacklisted) {
      TaskTrackerStatus taskTrackerStatus = getTaskTrackerStatus(trackerName);
      if (taskTrackerStatus == null) {
        LOG.warn("Unknown task tracker polling; ignoring: " + trackerName);
      } else {
        List<Task> tasks = getSetupAndCleanupTasks(taskTrackerStatus);
        if (tasks == null) {
          // Call the TaskScheduler's assignTasks() method to assign tasks
          // to this TaskTracker
          tasks = taskScheduler.assignTasks(taskTrackers.get(trackerName));
        }
        ...
    }

    ...
    return response;
  }

  ...

  void close() throws IOException {
    ...

    if (taskScheduler != null) {
      // Shut down the TaskScheduler
      taskScheduler.terminate();
    }
    ...
  }
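As the constructor above shows, the scheduler implementation is read from the mapred.jobtracker.taskScheduler property. For reference, a minimal mapred-site.xml fragment that switches the JobTracker over to the fair scheduler would look like this (a sketch; verify the class name against your Hadoop build):

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>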


The Fair Scheduler: FairScheduler

The fair scheduler was contributed by Facebook and is suited to clusters shared by multiple users. It organizes jobs into resource pools and divides the cluster's resources fairly among those pools. By default each user gets an independent pool, so every user receives an equal share of the cluster no matter how many jobs they submit. Pools can also be assigned by the user's Unix group or by a jobconf property. Within each pool, capacity is shared among the running jobs using fair sharing. Pools can also be given weights, so that the cluster is shared in unequal proportions.
Beyond fair sharing, the fair scheduler lets a pool be given a guaranteed minimum share, which is useful for ensuring that particular users, groups, or production applications always receive sufficient resources. When a pool contains jobs, it receives at least its minimum share; when a pool does not need its full guaranteed share, the excess is split among the other pools. The main features are:
  • Multiple users and multiple queues are supported.
  • Resources are shared fairly (each job's fair share is determined by its priority).
  • Minimum shares are guaranteed.
  • Preemption after a time threshold is supported.
  • The number of concurrently running jobs can be capped, to keep intermediate data from filling the disks.
For a more detailed analysis of the fair scheduling algorithm itself, see the article 《Hadoop-0.20.2公平调度算法解析》 (an analysis of the Hadoop 0.20.2 fair scheduling algorithm).

FairScheduler Source Code Analysis

1) The MapReduce task types, TaskType: map tasks and reduce tasks

public enum TaskType {
  MAP, REDUCE
}

2) The resource pool: Pool

package org.apache.hadoop.mapred;

import java.util.ArrayList;
import java.util.Collection;

/**
 * A schedulable pool of jobs.
 */
public class Pool {
  /** Name of the default pool, where jobs with no pool parameter go. */
  // The default pool name: "default"
  public static final String DEFAULT_POOL_NAME = "default";

  /** Pool name. */
  private String name;

  /** Jobs in this specific pool; does not include children pools' jobs. */
  private Collection<JobInProgress> jobs = new ArrayList<JobInProgress>();

  public Pool(String name) {
    this.name = name;
  }

  // Get all the jobs in this pool
  public Collection<JobInProgress> getJobs() {
    return jobs;
  }

  // Add a job to this pool
  public void addJob(JobInProgress job) {
    jobs.add(job);
  }

  // Remove a job from this pool
  public void removeJob(JobInProgress job) {
    jobs.remove(job);
  }

  public String getName() {
    return name;
  }

  public boolean isDefaultPool() {
    return Pool.DEFAULT_POOL_NAME.equals(name);
  }
}

3) The pool manager: PoolManager

PoolManager's main job is to load the pool configuration and expose the pool names, the jobs in each pool, and so on. Below we focus on reloadAllocs(), the method that loads the pool configuration file.
/**
 * Maintains a hierarchy of pools.
 */
public class PoolManager {
  ...

  // Map and reduce minimum allocations for each pool
  // The minimum number of map and reduce slots allocated to each pool
  private Map<String, Integer> mapAllocs = new HashMap<String, Integer>();
  private Map<String, Integer> reduceAllocs = new HashMap<String, Integer>();

  // Sharing weights for each pool
  // Each pool's weight, 1.0 by default
  private Map<String, Double> poolWeights = new HashMap<String, Double>();

  // Max concurrent running jobs for each pool and for each user; in addition,
  // for users that have no max specified, we use the userMaxJobsDefault.
  // The maximum number of jobs running concurrently in each pool, and per
  // user; a user with no explicit limit defaults to Integer.MAX_VALUE
  private Map<String, Integer> poolMaxJobs = new HashMap<String, Integer>();
  private Map<String, Integer> userMaxJobs = new HashMap<String, Integer>();
  private int userMaxJobsDefault = Integer.MAX_VALUE;

  // Path to the XML file that configures the pools
  private String allocFile; // Path to XML file containing allocations
  private String poolNameProperty; // Jobconf property to use for determining a
                                   // job's pool name (default: mapred.job.queue.name)

  private Map<String, Pool> pools = new HashMap<String, Pool>();

  private long lastReloadAttempt; // Last time we tried to reload the pools file
  private long lastSuccessfulReload; // Last time we successfully reloaded pools
  private boolean lastReloadAttemptFailed = false;

  public PoolManager(Configuration conf) throws IOException, SAXException,
    ...
    // Reload the pool configuration file allocation.xml and refresh the
    // settings above. This is PoolManager's core function; the file is
    // re-checked roughly every 10 seconds.
    reloadAllocs();
    ...
  }

  ...
The allocation.xml file has the following format:
<?xml version="1.0"?>
<allocations>
  <pool name="POOLA">
    <minMaps>5</minMaps>
    <minReduces>5</minReduces>
    <maxRunningJobs>10</maxRunningJobs>
    <weight>1.0</weight>
  </pool>
  <pool name="POOLB">
    <minMaps>10</minMaps>
    <minReduces>10</minReduces>
    <maxRunningJobs>10</maxRunningJobs>
    <weight>1.0</weight>
  </pool>
  <user name="usera">
    <maxRunningJobs>10</maxRunningJobs>
  </user>
  ...
  <userMaxJobsDefault>5</userMaxJobsDefault>
</allocations>
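Reloading is throttled rather than repeated on every call: reloadAllocsIfNecessary(), which assignTasks() invokes on each heartbeat (see below), only re-reads allocation.xml when enough time has passed and the file has actually changed. Here is a simplified sketch of that pattern, built on the lastReloadAttempt / lastSuccessfulReload fields declared above (the two interval constants are illustrative assumptions):

  // Illustrative constants: how often to check the file, and how long a
  // modification must sit before we trust it (to avoid half-written files).
  private static final long ALLOC_RELOAD_INTERVAL = 10 * 1000;
  private static final long ALLOC_RELOAD_WAIT = 5 * 1000;

  public void reloadAllocsIfNecessary() {
    long time = System.currentTimeMillis();
    if (time > lastReloadAttempt + ALLOC_RELOAD_INTERVAL) {
      lastReloadAttempt = time;
      File file = new File(allocFile);
      long lastModified = file.lastModified();
      // Reload only if the file changed after the last successful reload
      // and has been stable for ALLOC_RELOAD_WAIT milliseconds.
      if (lastModified > lastSuccessfulReload &&
          time > lastModified + ALLOC_RELOAD_WAIT) {
        try {
          reloadAllocs();
          lastSuccessfulReload = time;
          lastReloadAttemptFailed = false;
        } catch (Exception e) {
          // Keep the previous configuration and log the failure once.
          if (!lastReloadAttemptFailed) {
            LOG.error("Failed to reload allocations file " + allocFile, e);
          }
          lastReloadAttemptFailed = true;
        }
      }
    }
  }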

4) Task selection: TaskSelector and DefaultTaskSelector

TaskSelector is an abstract class, a pluggable component used to pick a task from a job to run; it is invoked by the task scheduler (TaskScheduler). Its main methods are listed below: they compute how many speculative map and reduce tasks a given job wants to launch, and pick a map or reduce task from the job to run on a given TaskTracker:
  /**
   * How many speculative map tasks does the given job want to launch?
   * @param job The job to count speculative maps for
   * @return Number of speculative maps that can be launched for job
   */
  public abstract int neededSpeculativeMaps(JobInProgress job);

  /**
   * How many speculative reduce tasks does the given job want to launch?
   * @param job The job to count speculative reduces for
   * @return Number of speculative reduces that can be launched for job
   */
  public abstract int neededSpeculativeReduces(JobInProgress job);

  /**
   * Choose a map task to run from the given job on the given TaskTracker.
   * @param taskTracker {@link TaskTrackerStatus} of machine to run on
   * @param job Job to select a task for
   * @return A {@link Task} to run on the machine, or <code>null</code> if
   *         no map should be launched from this job on the task tracker.
   * @throws IOException
   */
  public abstract Task obtainNewMapTask(TaskTrackerStatus taskTracker,
      JobInProgress job) throws IOException;

  /**
   * Choose a reduce task to run from the given job on the given TaskTracker.
   * @param taskTracker {@link TaskTrackerStatus} of machine to run on
   * @param job Job to select a task for
   * @return A {@link Task} to run on the machine, or <code>null</code> if
   *         no reduce should be launched from this job on the task tracker.
   * @throws IOException
   */
  public abstract Task obtainNewReduceTask(TaskTrackerStatus taskTracker,
      JobInProgress job) throws IOException;

DefaultTaskSelector extends TaskSelector and overrides the four methods above; it simply wraps the JobInProgress.obtainNewMapTask and JobInProgress.obtainNewReduceTask methods.
The implementation is as follows:
/**
 * A {@link TaskSelector} implementation that wraps around the default
 * {@link JobInProgress#obtainNewMapTask(TaskTrackerStatus, int)} and
 * {@link JobInProgress#obtainNewReduceTask(TaskTrackerStatus, int)} methods
 * in {@link JobInProgress}, using the default Hadoop locality and speculative
 * threshold algorithms.
 */
public class DefaultTaskSelector extends TaskSelector {

  @Override
  public int neededSpeculativeMaps(JobInProgress job) {
    int count = 0;
    long time = System.currentTimeMillis();
    double avgProgress = job.getStatus().mapProgress();
    for (TaskInProgress tip: job.maps) {
      if (tip.isRunning() && tip.hasSpeculativeTask(time, avgProgress)) {
        count++;
      }
    }
    return count;
  }

  @Override
  public int neededSpeculativeReduces(JobInProgress job) {
    int count = 0;
    long time = System.currentTimeMillis();
    double avgProgress = job.getStatus().reduceProgress();
    for (TaskInProgress tip: job.reduces) {
      if (tip.isRunning() && tip.hasSpeculativeTask(time, avgProgress)) {
        count++;
      }
    }
    return count;
  }

  @Override
  public Task obtainNewMapTask(TaskTrackerStatus taskTracker, JobInProgress job)
      throws IOException {
    ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
    int numTaskTrackers = clusterStatus.getTaskTrackers();
    return job.obtainNewMapTask(taskTracker, numTaskTrackers,
        taskTrackerManager.getNumberOfUniqueHosts());
  }

  @Override
  public Task obtainNewReduceTask(TaskTrackerStatus taskTracker, JobInProgress job)
      throws IOException {
    ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
    int numTaskTrackers = clusterStatus.getTaskTrackers();
    return job.obtainNewReduceTask(taskTracker, numTaskTrackers,
        taskTrackerManager.getNumberOfUniqueHosts());
  }
}

5) Load management: LoadManager and CapBasedLoadManager

The load manager for TaskTrackers tells the TaskScheduler when new tasks may be launched. LoadManager is an abstract class; its two main methods decide whether a given TaskTracker may start another map or reduce task:
  /**
   * Can a given {@link TaskTracker} run another map task?
   * @param tracker The machine we wish to run a new map on
   * @param totalRunnableMaps Set of running jobs in the cluster
   * @param totalMapSlots The total number of map slots in the cluster
   * @return true if another map can be launched on <code>tracker</code>
   */
  public abstract boolean canAssignMap(TaskTrackerStatus tracker,
      int totalRunnableMaps, int totalMapSlots);

  /**
   * Can a given {@link TaskTracker} run another reduce task?
   * @param tracker The machine we wish to run a new map on
   * @param totalRunnableReduces Set of running jobs in the cluster
   * @param totalReduceSlots The total number of reduce slots in the cluster
   * @return true if another reduce can be launched on <code>tracker</code>
   */
  public abstract boolean canAssignReduce(TaskTrackerStatus tracker,
      int totalRunnableReduces, int totalReduceSlots);
CapBasedLoadManager extends LoadManager and overrides the two methods above:
/**
 * A {@link LoadManager} for use by the {@link FairScheduler} that allocates
 * tasks evenly across nodes up to their per-node maximum, using the default
 * load management algorithm in Hadoop.
 */
public class CapBasedLoadManager extends LoadManager {

  /**
   * Determine how many tasks of a given type we want to run on a TaskTracker.
   * This cap is chosen based on how many tasks of that type are outstanding in
   * total, so that when the cluster is used below capacity, tasks are spread
   * out uniformly across the nodes rather than being clumped up on whichever
   * machines sent out heartbeats earliest.
   */
  int getCap(int totalRunnableTasks, int localMaxTasks, int totalSlots) {
    double load = ((double)totalRunnableTasks) / totalSlots;
    return (int) Math.ceil(localMaxTasks * Math.min(1.0, load));
  }

  @Override
  public boolean canAssignMap(TaskTrackerStatus tracker,
      int totalRunnableMaps, int totalMapSlots) {
    return tracker.countMapTasks() < getCap(totalRunnableMaps,
        tracker.getMaxMapSlots(), totalMapSlots);
  }

  @Override
  public boolean canAssignReduce(TaskTrackerStatus tracker,
      int totalRunnableReduces, int totalReduceSlots) {
    return tracker.countReduceTasks() < getCap(totalRunnableReduces,
        tracker.getMaxReduceSlots(), totalReduceSlots);
  }
}
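To see how getCap() spreads work out, here is a small standalone check (the numbers are invented for illustration). With 40 runnable maps, 100 map slots in the cluster, and 4 map slots on this node, the cluster load is 0.4, so the node is capped at ceil(4 × 0.4) = 2 running maps even though it has 4 slots:

public class GetCapDemo {
  // Same formula as CapBasedLoadManager.getCap().
  static int getCap(int totalRunnableTasks, int localMaxTasks, int totalSlots) {
    double load = ((double) totalRunnableTasks) / totalSlots;
    return (int) Math.ceil(localMaxTasks * Math.min(1.0, load));
  }

  public static void main(String[] args) {
    // Under-loaded cluster: tasks are spread out instead of clumped.
    System.out.println(getCap(40, 4, 100));  // load 0.4 -> cap 2
    // Fully loaded cluster: the node may use all of its slots.
    System.out.println(getCap(150, 4, 100)); // load capped at 1.0 -> cap 4
  }
}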

6) Weight adjustment: WeightAdjuster and NewJobWeightBooster

NewJobWeightBooster implements the WeightAdjuster plugin interface: it multiplies a newly started job's weight by a boost factor for a fixed duration after the job's start time, so that fresh jobs ramp up quickly:

  // The weight adjustment method
  public double adjustWeight(JobInProgress job, TaskType taskType,
      double curWeight) {
    long start = job.getStartTime();
    long now = System.currentTimeMillis();
    if (now - start < duration) {
      return curWeight * factor;
    } else {
      return curWeight;
    }
  }
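For instance, the boost works as follows (factor and duration are configurable; the values below are assumptions chosen for the example):

public class BoostDemo {
  public static void main(String[] args) {
    // factor and duration are config values; 3.0 and 5 minutes are assumptions.
    double factor = 3.0;
    long duration = 5 * 60 * 1000L;
    long jobAge = 2 * 60 * 1000L;   // the job started 2 minutes ago
    double weight = 1.0;
    // Same test as adjustWeight(): boost while the job is still young.
    double adjusted = (jobAge < duration) ? weight * factor : weight;
    System.out.println(adjusted);   // prints 3.0: the new job is boosted
  }
}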

7) FairScheduler

FairScheduler startup: the start() method:
  @Override
  public void start() {
    try {
      // Read the relevant settings from the configuration; defaults are
      // used for anything not explicitly configured.
      Configuration conf = getConf();
      ...

      // Start a background thread that periodically (every UPDATE_INTERVAL)
      // updates each job's deficit
      if (runBackgroundUpdates)
        new UpdateThread().start();

      ...
  }
Next we look at how these values are computed: the UpdateThread started above periodically invokes the update() method.
  /**
   * Recompute the internal variables used by the scheduler - per-job weights,
   * fair shares, deficits, minimum slot allocations, and numbers of running
   * and needed tasks of each type.
   */
  protected void update() {
    // Making more granular locking so that clusterStatus can be fetched
    // from the JobTracker.
    ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
    // Got clusterStatus hence acquiring scheduler lock now
    // Remove non-running jobs
    synchronized (this) {
      List<JobInProgress> toRemove = new ArrayList<JobInProgress>();
      for (JobInProgress job: infos.keySet()) {
        int runState = job.getStatus().getRunState();
        if (runState == JobStatus.SUCCEEDED || runState == JobStatus.FAILED
          || runState == JobStatus.KILLED) {
            toRemove.add(job);
        }
      }
      for (JobInProgress job: toRemove) {
        infos.remove(job);
        poolMgr.removeJob(job);
      }
      // Update running jobs with deficits since last update, and compute new
      // slot allocations, weight, shares and task counts
      long now = clock.getTime();
      long timeDelta = now - lastUpdateTime;
      // Update the time deficits
      updateDeficits(timeDelta);
      // Update the number of running jobs in each pool (and per user)
      updateRunnability();
      // Update the task counts
      updateTaskCounts();
      // Update the job weights
      updateWeights();
      // Update the minimum shares
      updateMinSlots();
      // Update the fair shares
      updateFairShares(clusterStatus);
      lastUpdateTime = now;
    }
  }

a) Updating job time deficits

The time deficit is the gap between the compute time a job would have received under ideal fair sharing and the compute time it actually received. It is updated as:

  mapDeficit    += (mapFairShare    - runningMaps)    * timeDelta
  reduceDeficit += (reduceFairShare - runningReduces) * timeDelta

For example, a job whose map fair share is 10 slots but which is running only 4 map tasks accumulates map deficit at 6 slot-milliseconds per millisecond. The code is as follows:
  private void updateDeficits(long timeDelta) {
    for (JobInfo info: infos.values()) {
      info.mapDeficit +=
        (info.mapFairShare - info.runningMaps) * timeDelta;
      info.reduceDeficit +=
        (info.reduceFairShare - info.runningReduces) * timeDelta;
    }
  }

b) Updating the number of running jobs per pool (and per user)

This counts the jobs currently running in each pool and for each user. The code is as follows:
  private void updateRunnability() {
    // Start by marking everything as not runnable
    for (JobInfo info: infos.values()) {
      info.runnable = false;
    }
    // Create a list of sorted jobs in order of start time and priority
    List<JobInProgress> jobs = new ArrayList<JobInProgress>(infos.keySet());
    // Sort jobs by start time and priority
    Collections.sort(jobs, new FifoJobComparator());
    // Mark jobs as runnable in order of start time and priority, until
    // user or pool limits have been reached.
    Map<String, Integer> userJobs = new HashMap<String, Integer>();
    Map<String, Integer> poolJobs = new HashMap<String, Integer>();
    for (JobInProgress job: jobs) {
      if (job.getStatus().getRunState() == JobStatus.RUNNING) {
        String user = job.getJobConf().getUser();
        String pool = poolMgr.getPoolName(job);
        int userCount = userJobs.containsKey(user) ? userJobs.get(user) : 0;
        int poolCount = poolJobs.containsKey(pool) ? poolJobs.get(pool) : 0;
        if (userCount < poolMgr.getUserMaxJobs(user) &&
            poolCount < poolMgr.getPoolMaxJobs(pool)) {
          infos.get(job).runnable = true;
          userJobs.put(user, userCount + 1);
          poolJobs.put(pool, poolCount + 1);
        }
      }
    }
  }

c) Updating each job's running map/reduce task counts and the number of map/reduce tasks it still needs

  private void updateTaskCounts() {
    for (Map.Entry<JobInProgress, JobInfo> entry: infos.entrySet()) {
      JobInProgress job = entry.getKey();
      JobInfo info = entry.getValue();
      if (job.getStatus().getRunState() != JobStatus.RUNNING)
        continue; // Job is still in PREP state and tasks aren't initialized
      // Count maps
      // Total number of map tasks in this job
      int totalMaps = job.numMapTasks;
      // Number of finished map tasks
      int finishedMaps = 0;
      // Number of running map tasks
      int runningMaps = 0;
      for (TaskInProgress tip :
           job.getTasks(org.apache.hadoop.mapreduce.TaskType.MAP)) {
        if (tip.isComplete()) {
          finishedMaps += 1;
        } else if (tip.isRunning()) {
          runningMaps += tip.getActiveTasks().size();
        }
      }
      info.runningMaps = runningMaps;
      info.neededMaps = (totalMaps - runningMaps - finishedMaps
          + taskSelector.neededSpeculativeMaps(job));
      // Count reduces
      // Total number of reduce tasks in this job
      int totalReduces = job.numReduceTasks;
      // Number of finished reduce tasks
      int finishedReduces = 0;
      // Number of running reduce tasks
      int runningReduces = 0;
      for (TaskInProgress tip :
           job.getTasks(org.apache.hadoop.mapreduce.TaskType.REDUCE)) {
        if (tip.isComplete()) {
          finishedReduces += 1;
        } else if (tip.isRunning()) {
          runningReduces += tip.getActiveTasks().size();
        }
      }
      info.runningReduces = runningReduces;
      // Decide whether this job may launch reduce tasks yet: enough map
      // tasks must have finished first.
      if (enoughMapsFinishedToRunReduces(finishedMaps, totalMaps)) {
        info.neededReduces = (totalReduces - runningReduces - finishedReduces
            + taskSelector.neededSpeculativeReduces(job));
      } else {
        info.neededReduces = 0;
      }
      // If the job was marked as not runnable due to its user or pool having
      // too many active jobs, set the neededMaps/neededReduces to 0. We still
      // count runningMaps/runningReduces however so we can give it a deficit.
      if (!info.runnable) {
        info.neededMaps = 0;
        info.neededReduces = 0;
      }
    }
  }

d) Updating job weights

Both mapWeight and reduceWeight start out as 1.0. They are then normalized within each pool:

  mapWeight    = mapWeight    * (poolWeight / mapWeightSum)
  reduceWeight = reduceWeight * (poolWeight / reduceWeightSum)

so that the weights of a pool's runnable jobs sum to the pool's weight.
  private void updateWeights() {
    // First, calculate raw weights for each job
    for (Map.Entry<JobInProgress, JobInfo> entry: infos.entrySet()) {
      JobInProgress job = entry.getKey();
      JobInfo info = entry.getValue();
      // Compute the raw weight, 1.0 by default
      info.mapWeight = calculateRawWeight(job, TaskType.MAP);
      info.reduceWeight = calculateRawWeight(job, TaskType.REDUCE);
    }
    // Now calculate job weight sums for each pool
    // Compute each pool's mapWeightSum and reduceWeightSum, used below
    // for normalization
    Map<String, Double> mapWeightSums = new HashMap<String, Double>();
    Map<String, Double> reduceWeightSums = new HashMap<String, Double>();
    for (Pool pool: poolMgr.getPools()) {
      double mapWeightSum = 0;
      double reduceWeightSum = 0;
      for (JobInProgress job: pool.getJobs()) {
        if (isRunnable(job)) {
          if (runnableTasks(job, TaskType.MAP) > 0) {
            mapWeightSum += infos.get(job).mapWeight;
          }
          if (runnableTasks(job, TaskType.REDUCE) > 0) {
            reduceWeightSum += infos.get(job).reduceWeight;
          }
        }
      }
      mapWeightSums.put(pool.getName(), mapWeightSum);
      reduceWeightSums.put(pool.getName(), reduceWeightSum);
    }
    // And normalize the weights based on pool sums and pool weights
    // to share fairly across pools (proportional to their weights)
    for (Map.Entry<JobInProgress, JobInfo> entry: infos.entrySet()) {
      JobInProgress job = entry.getKey();
      JobInfo info = entry.getValue();
      String pool = poolMgr.getPoolName(job);
      double poolWeight = poolMgr.getPoolWeight(pool);
      double mapWeightSum = mapWeightSums.get(pool);
      double reduceWeightSum = reduceWeightSums.get(pool);
      if (mapWeightSum == 0)
        info.mapWeight = 0;
      else
        info.mapWeight *= (poolWeight / mapWeightSum);
      if (reduceWeightSum == 0)
        info.reduceWeight = 0;
      else
        info.reduceWeight *= (poolWeight / reduceWeightSum);
    }
  }
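A quick standalone sketch of the normalization step with made-up numbers: a pool of weight 2.0 holding two runnable jobs with raw weights 1.0 and 3.0 ends up with normalized job weights 0.5 and 1.5, which sum to the pool's weight:

public class WeightNormalizationDemo {
  public static void main(String[] args) {
    double poolWeight = 2.0;
    double[] rawWeights = {1.0, 3.0};   // e.g. one boosted new job
    double sum = 0;
    for (double w : rawWeights) sum += w;
    // Same normalization as updateWeights(): weight *= poolWeight / weightSum
    for (double w : rawWeights) {
      System.out.println(w * (poolWeight / sum));  // prints 0.5, then 1.5
    }
  }
}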

e) Updating jobs' minimum shares (minSlots)

Within each pool, the pool's guaranteed slots are distributed to its jobs in proportion to their weights. Slots left over after a pass (because some jobs need fewer slots than their allocation) are redistributed, by weight and then by deficit, among the jobs that still need slots; any slots the pool cannot use at all are left for jobs in other pools.
  private void updateMinSlots() {
    // Clear old minSlots
    for (JobInfo info: infos.values()) {
      info.minMaps = 0;
      info.minReduces = 0;
    }
    // For each pool, distribute its task allocation among jobs in it that need
    // slots. This is a little tricky since some jobs in the pool might not be
    // able to use all the slots, e.g. they might have only a few tasks left.
    // To deal with this, we repeatedly split up the available task slots
    // between the jobs left, give each job min(its alloc, # of slots it needs),
    // and redistribute any slots that are left over between jobs that still
    // need slots on the next pass. If, in total, the jobs in our pool don't
    // need all its allocation, we leave the leftover slots for general use.
    PoolManager poolMgr = getPoolManager();
    for (Pool pool: poolMgr.getPools()) {
      for (final TaskType type: TaskType.values()) {
        Set<JobInProgress> jobs = new HashSet<JobInProgress>(pool.getJobs());
        int slotsLeft = poolMgr.getAllocation(pool.getName(), type);
        // Keep assigning slots until none are left
        while (slotsLeft > 0) {
          // Figure out total weight of jobs that still need slots
          double totalWeight = 0;
          for (Iterator<JobInProgress> it = jobs.iterator(); it.hasNext();) {
            JobInProgress job = it.next();
            // If the job is runnable and can still run more tasks than its
            // current minimum share, add its weight to totalWeight
            if (isRunnable(job) &&
                runnableTasks(job, type) > minTasks(job, type)) {
              totalWeight += weight(job, type);
            } else {
              it.remove();
            }
          }
          if (totalWeight == 0) // No jobs that can use more slots are left
            break;
          // Assign slots to jobs, using the floor of their weight divided by
          // total weight. This ensures that all jobs get some chance to take
          // a slot. Then, if no slots were assigned this way, we do another
          // pass where we use ceil, in case some slots were still left over.
          int oldSlots = slotsLeft; // Copy slotsLeft so we can modify it
          for (JobInProgress job: jobs) {
            double weight = weight(job, type);
            // Compute the share this job may receive
            int share = (int) Math.floor(oldSlots * weight / totalWeight);
            slotsLeft = giveMinSlots(job, type, slotsLeft, share);
          }
          // If slotsLeft did not change in this pass, i.e. no slot was given
          // to any job, hand the remaining slots out to the pool's jobs using
          // ceil instead
          if (slotsLeft == oldSlots) {
            // No tasks were assigned; do another pass using ceil, giving the
            // extra slots to jobs in order of weight then deficit
            List<JobInProgress> sortedJobs = new ArrayList<JobInProgress>(jobs);
            Collections.sort(sortedJobs, new Comparator<JobInProgress>() {
              public int compare(JobInProgress j1, JobInProgress j2) {
                double dif = weight(j2, type) - weight(j1, type);
                if (dif == 0) // Weights are equal, compare by deficit
                  dif = deficit(j2, type) - deficit(j1, type);
                return (int) Math.signum(dif);
              }
            });
            for (JobInProgress job: sortedJobs) {
              double weight = weight(job, type);
              int share = (int) Math.ceil(oldSlots * weight / totalWeight);
              slotsLeft = giveMinSlots(job, type, slotsLeft, share);
            }
            if (slotsLeft > 0) {
              LOG.warn("Had slotsLeft = " + slotsLeft + " after the final "
                  + "loop in updateMinSlots. This probably means some fair "
                  + "scheduler weights are being set to NaN or Infinity.");
            }
            break;
          }
        }
      }
    }
  }

  /**
   * Give up to <code>tasksToGive</code> min slots to a job (potentially fewer
   * if either the job needs fewer slots or there aren't enough slots left).
   * Returns the number of slots left over.
   */
  private int giveMinSlots(JobInProgress job, TaskType type,
      int slotsLeft, int slotsToGive) {
    int runnable = runnableTasks(job, type);
    int curMin = minTasks(job, type);
    // Update the minimum share
    slotsToGive = Math.min(Math.min(slotsLeft, runnable - curMin), slotsToGive);
    slotsLeft -= slotsToGive;
    JobInfo info = infos.get(job);
    if (type == TaskType.MAP)
      info.minMaps += slotsToGive;
    else
      info.minReduces += slotsToGive;
    return slotsLeft;
  }
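The floor-then-ceil trick matters when a pool's allocation is small relative to its job count. With 2 slots and three equally weighted jobs (invented numbers), the floor pass hands out floor(2/3) = 0 to every job, so the ceil pass distributes the slots one at a time until none remain; a standalone sketch:

public class MinSlotsPassDemo {
  public static void main(String[] args) {
    int slotsLeft = 2;
    double[] weights = {1.0, 1.0, 1.0};
    double totalWeight = 3.0;
    int oldSlots = slotsLeft;
    // Pass 1: floor division assigns nothing at all.
    for (double w : weights) {
      int share = (int) Math.floor(oldSlots * w / totalWeight);  // 0
      slotsLeft -= Math.min(share, slotsLeft);
    }
    System.out.println("after floor pass: " + slotsLeft);  // still 2
    // Pass 2: ceil division hands out one slot per job until none are left.
    for (double w : weights) {
      int share = (int) Math.ceil(oldSlots * w / totalWeight);   // 1
      slotsLeft -= Math.min(share, slotsLeft);
    }
    System.out.println("after ceil pass: " + slotsLeft);   // 0
  }
}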

f) Updating fair shares

The idea: compute each job's fair share from its weight and its minimum share. First, the available slots are divided among the jobs in proportion to weight. Any job whose minimum share exceeds its weight-proportional share is given exactly its minimum share and removed from the list, and the division is repeated over the slots left over. This continues until every remaining job's minimum share is at or below its weight-proportional share; at that point every guarantee has been met, and the remaining slots are split among the remaining jobs by weight.
  private void updateFairShares(ClusterStatus clusterStatus) {
    // Clear old fairShares
    for (JobInfo info: infos.values()) {
      info.mapFairShare = 0;
      info.reduceFairShare = 0;
    }
    // Assign new shares, based on weight and minimum share. This is done
    // as follows. First, we split up the available slots between all
    // jobs according to weight. Then if there are any jobs whose minSlots is
    // larger than their fair allocation, we give them their minSlots and
    // remove them from the list, and start again with the amount of slots
    // left over. This continues until all jobs' minSlots are less than their
    // fair allocation, and at this point we know that we've met everyone's
    // guarantee and we've split the excess capacity fairly among jobs left.
    for (TaskType type: TaskType.values()) {
      // Select only jobs that still need this type of task
      HashSet<JobInfo> jobsLeft = new HashSet<JobInfo>();
      for (Entry<JobInProgress, JobInfo> entry: infos.entrySet()) {
        JobInProgress job = entry.getKey();
        JobInfo info = entry.getValue();
        if (isRunnable(job) && runnableTasks(job, type) > 0) {
          jobsLeft.add(info);
        }
      }
      double slotsLeft = getTotalSlots(type, clusterStatus);
      while (!jobsLeft.isEmpty()) {
        double totalWeight = 0;
        for (JobInfo info: jobsLeft) {
          double weight = (type == TaskType.MAP ?
              info.mapWeight : info.reduceWeight);
          totalWeight += weight;
        }
        boolean recomputeSlots = false;
        double oldSlots = slotsLeft; // Copy slotsLeft so we can modify it
        for (Iterator<JobInfo> iter = jobsLeft.iterator(); iter.hasNext();) {
          JobInfo info = iter.next();
          double minSlots = (type == TaskType.MAP ?
              info.minMaps : info.minReduces);
          double weight = (type == TaskType.MAP ?
              info.mapWeight : info.reduceWeight);
          double fairShare = weight / totalWeight * oldSlots;
          if (minSlots > fairShare) {
            // Job needs more slots than its fair share; give it its minSlots,
            // remove it from the list, and set recomputeSlots = true to
            // remember that we must loop again to redistribute unassigned slots
            if (type == TaskType.MAP)
              info.mapFairShare = minSlots;
            else
              info.reduceFairShare = minSlots;
            slotsLeft -= minSlots;
            iter.remove();
            recomputeSlots = true;
          }
        }
        if (!recomputeSlots) {
          // All minimums are met. Give each job its fair share of excess slots.
          for (JobInfo info: jobsLeft) {
            double weight = (type == TaskType.MAP ?
                info.mapWeight : info.reduceWeight);
            double fairShare = weight / totalWeight * oldSlots;
            if (type == TaskType.MAP)
              info.mapFairShare = fairShare;
            else
              info.reduceFairShare = fairShare;
          }
          break;
        }
      }
    }
  }
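The same water-filling idea can be demonstrated outside Hadoop. Below is a minimal standalone sketch over plain arrays (all names are hypothetical); like updateFairShares(), it repeatedly pins any job whose minimum share exceeds its weight-proportional share at that minimum, then splits what remains by weight:

public class FairShareDemo {
  /** Returns each job's fair share given weights, min shares, and total slots. */
  static double[] fairShares(double[] weight, double[] minSlots, double slots) {
    int n = weight.length;
    double[] share = new double[n];
    boolean[] fixed = new boolean[n];       // jobs pinned at their min share
    boolean changed = true;
    while (changed) {
      changed = false;
      double totalWeight = 0;
      for (int i = 0; i < n; i++)
        if (!fixed[i]) totalWeight += weight[i];
      if (totalWeight == 0) break;
      double passSlots = slots;             // snapshot, like oldSlots above
      for (int i = 0; i < n; i++) {
        if (fixed[i]) continue;
        double s = weight[i] / totalWeight * passSlots;
        if (minSlots[i] > s) {
          // Guarantee exceeds the fair split: pin this job at its minimum
          // and redistribute the remaining slots on the next pass.
          share[i] = minSlots[i];
          slots -= minSlots[i];
          fixed[i] = true;
          changed = true;
        } else {
          share[i] = s;                     // provisional until the last pass
        }
      }
    }
    return share;
  }

  public static void main(String[] args) {
    // Three jobs, 100 slots. Job 0 has a large guaranteed minimum.
    double[] shares = fairShares(new double[] {1, 1, 2},
                                 new double[] {60, 0, 0}, 100);
    // Job 0 is pinned at 60; jobs 1 and 2 split the remaining 40 as ~13.3/26.7.
    for (double s : shares) System.out.println(s);
  }
}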
At this point, all of the update operations are done.

Next, let's analyze assignTasks(), the method FairScheduler uses to schedule tasks.

g) Slot assignment

When the cluster has a free slot, it is given to the job with the largest time deficit.
  @Override
  public synchronized List<Task> assignTasks(TaskTracker tracker)
      throws IOException {
    if (!initialized) // Don't try to assign tasks if we haven't yet started up
      return null;

    // Reload allocations file if it hasn't been loaded in a while
    poolMgr.reloadAllocsIfNecessary();

    // Compute total runnable maps and reduces
    int runnableMaps = 0;
    int runnableReduces = 0;
    for (JobInProgress job: infos.keySet()) {
      runnableMaps += runnableTasks(job, TaskType.MAP);
      runnableReduces += runnableTasks(job, TaskType.REDUCE);
    }

    ClusterStatus clusterStatus = taskTrackerManager.getClusterStatus();
    // Compute total map/reduce slots
    // In the future we can precompute this if the Scheduler becomes a
    // listener of tracker join/leave events.
    int totalMapSlots = getTotalSlots(TaskType.MAP, clusterStatus);
    int totalReduceSlots = getTotalSlots(TaskType.REDUCE, clusterStatus);

    // Scan to see whether any job needs to run a map, then a reduce
    ArrayList<Task> tasks = new ArrayList<Task>();
    TaskType[] types = new TaskType[] {TaskType.MAP, TaskType.REDUCE};
    TaskTrackerStatus trackerStatus = tracker.getStatus();
    for (TaskType taskType: types) {
      boolean canAssign = (taskType == TaskType.MAP) ?
          loadMgr.canAssignMap(trackerStatus, runnableMaps, totalMapSlots) :
          loadMgr.canAssignReduce(trackerStatus, runnableReduces, totalReduceSlots);
      if (canAssign) {
        // Figure out the jobs that need this type of task
        List<JobInProgress> candidates = new ArrayList<JobInProgress>();
        for (JobInProgress job: infos.keySet()) {
          if (job.getStatus().getRunState() == JobStatus.RUNNING &&
              neededTasks(job, taskType) > 0) {
            candidates.add(job);
          }
        }
        // Sort jobs by deficit (for Fair Sharing) or submit time (for FIFO)
        Comparator<JobInProgress> comparator = useFifo ?
            new FifoJobComparator() : new DeficitComparator(taskType);
        Collections.sort(candidates, comparator);
        for (JobInProgress job: candidates) {
          Task task = (taskType == TaskType.MAP ?
              taskSelector.obtainNewMapTask(trackerStatus, job) :    // pick a map task
              taskSelector.obtainNewReduceTask(trackerStatus, job)); // pick a reduce task
          if (task != null) {
            // Update the JobInfo for this job so we account for the launched
            // tasks during this update interval and don't try to launch more
            // tasks than the job needed on future heartbeats
            JobInfo info = infos.get(job);
            if (taskType == TaskType.MAP) {
              info.runningMaps++;
              info.neededMaps--;
            } else {
              info.runningReduces++;
              info.neededReduces--;
            }
            tasks.add(task);
            if (!assignMultiple)
              return tasks;
            break;
          }
        }
      }
    }

    // If no tasks were found, return null
    return tasks.isEmpty() ? null : tasks;
  }
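Finally, a toy illustration of the deficit ordering used above (the JobStub type is hypothetical; the real DeficitComparator works on JobInProgress through the scheduler's JobInfo map): the candidates are sorted so that the job with the largest deficit for the requested task type is offered the free slot first.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class DeficitOrderingDemo {
  // Hypothetical stand-in for a job plus its scheduler-side JobInfo.
  static class JobStub {
    final String name;
    final double mapDeficit;
    JobStub(String name, double mapDeficit) {
      this.name = name;
      this.mapDeficit = mapDeficit;
    }
  }

  public static void main(String[] args) {
    List<JobStub> candidates = new ArrayList<JobStub>();
    candidates.add(new JobStub("jobA", 1200.0));
    candidates.add(new JobStub("jobB", 5400.0));
    candidates.add(new JobStub("jobC", 300.0));
    // Largest deficit first, mirroring the DeficitComparator's intent.
    Collections.sort(candidates, new Comparator<JobStub>() {
      public int compare(JobStub j1, JobStub j2) {
        return Double.compare(j2.mapDeficit, j1.mapDeficit);
      }
    });
    for (JobStub j : candidates) {
      System.out.println(j.name);  // prints jobB, jobA, jobC
    }
  }
}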