YARN Source Code Analysis: The ApplicationMaster Allocation Strategy


A conversation with a friend touched on what the ApplicationMaster container's allocation strategy actually is. My impression was that it is placed randomly, while he claimed it is placed according to each node's free resources.
I had never paid attention to this logic while reading the code, so with the question now on the table, let's go look.

Personal site: http://bigdatadecode.club/YARN源码分析之ApplicationMaster分配策略.html

The log of an MR run shows when the AM's container gets allocated:

2017-04-09 03:26:17,113 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1491729774382_0001_000001 State change from SUBMITTED to SCHEDULED
2017-04-09 03:26:17,415 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1491729774382_0001_01_000001 Container Transitioned from NEW to ALLOCATED

The AM container is set up when the appattempt transitions from SUBMITTED to SCHEDULED.
The handler for the SUBMITTED-to-SCHEDULED transition is RMAppAttemptImpl's ScheduleTransition:

public static final class ScheduleTransition
    implements
    MultipleArcTransition<RMAppAttemptImpl, RMAppAttemptEvent, RMAppAttemptState> {
  @Override
  public RMAppAttemptState transition(RMAppAttemptImpl appAttempt,
      RMAppAttemptEvent event) {
    ApplicationSubmissionContext subCtx = appAttempt.submissionContext;
    if (!subCtx.getUnmanagedAM()) {
      // Need reset #containers before create new attempt, because this request
      // will be passed to scheduler, and scheduler will deduct the number after
      // AM container allocated
      // Build the AM container request
      appAttempt.amReq.setNumContainers(1);
      appAttempt.amReq.setPriority(AM_CONTAINER_PRIORITY);
      // A ResourceName of ANY means any machine on any rack
      appAttempt.amReq.setResourceName(ResourceRequest.ANY);
      appAttempt.amReq.setRelaxLocality(true);
      // Hand the request to the scheduler for allocation
      Allocation amContainerAllocation =
          appAttempt.scheduler.allocate(appAttempt.applicationAttemptId,
              Collections.singletonList(appAttempt.amReq),
              EMPTY_CONTAINER_RELEASE_LIST, null, null);
      ...
      return RMAppAttemptState.SCHEDULED;
    } else {
      ...
    }
  }
}

First a container request is built for the AM container. appAttempt.amReq.setResourceName(ResourceRequest.ANY) already suggests that AM placement is not tied to any particular node: the request puts no constraint whatsoever on the ResourceName. Still, let's follow the code to verify.
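As a side note, here is a minimal sketch of what an equivalent ANY request looks like when built through the public ResourceRequest API. The memory/vcore numbers are made up for illustration; in the real code the amReq comes pre-populated from the application submission context:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class AmRequestSketch {
  public static ResourceRequest buildAmRequest() {
    // AM_CONTAINER_PRIORITY is Priority 0 in RMAppAttemptImpl
    Priority amPriority = Priority.newInstance(0);
    // Hypothetical capability: 1536 MB, 1 vcore
    Resource capability = Resource.newInstance(1536, 1);
    // ResourceRequest.ANY ("*") carries no locality constraint at all
    ResourceRequest amReq = ResourceRequest.newInstance(
        amPriority, ResourceRequest.ANY, capability, 1);
    amReq.setRelaxLocality(true);
    return amReq;
  }
}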
Once the request is built, the scheduler allocates resources for it. The default scheduler here is the CapacityScheduler:

// CapacityScheduler.java
public Allocation allocate(ApplicationAttemptId applicationAttemptId,
    List<ResourceRequest> ask, List<ContainerId> release,
    List<String> blacklistAdditions, List<String> blacklistRemovals) {
  FiCaSchedulerApp application = getApplicationAttempt(applicationAttemptId);
  ...
  // Release containers
  releaseContainers(release, application);
  synchronized (application) {
    ...
    if (!ask.isEmpty()) {
      ...
      application.showRequests();
      // Store the requests into this application attempt's request map
      // Update application requests
      application.updateResourceRequests(ask);
      application.showRequests();
    }
    application.updateBlacklist(blacklistAdditions, blacklistRemovals);
    return application.getAllocation(getResourceCalculator(),
        clusterResource, getMinimumResourceCapability());
  }
}

In CapacityScheduler.allocate, application.updateResourceRequests(ask) stores the request into a map, where it sits until a NodeManager heartbeat picks it up.
Here application is a FiCaSchedulerApp object, and a FiCaSchedulerApp corresponds to one application attempt. Its updateResourceRequests:

public synchronized void updateResourceRequests(
    List<ResourceRequest> requests) {
  if (!isStopped) {
    // AppSchedulingInfo.updateResourceRequests
    appSchedulingInfo.updateResourceRequests(requests, false);
  }
}

AppSchedulingInfo records all of an application's resource consumption, including the containers that are running or have already finished for it.
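Internally the pending requests form a two-level table keyed first by priority and then by ResourceName. The following is a minimal sketch of that shape; the field names follow AppSchedulingInfo, but MockRequest and the int priority are simplifications, not the real record types:

import java.util.HashMap;
import java.util.Map;
import java.util.TreeSet;

// Simplified sketch of AppSchedulingInfo's pending-request table.
class PendingRequests {
  // priority -> (resourceName -> request); a ResourceName is a hostname,
  // a rack name, or ResourceRequest.ANY ("*").
  private final Map<Integer, Map<String, MockRequest>> requests = new HashMap<>();
  private final TreeSet<Integer> priorities = new TreeSet<>();

  // Returns the request it replaced, if any -- the "lastRequest" below.
  MockRequest update(int priority, String resourceName, MockRequest request) {
    Map<String, MockRequest> asks =
        requests.computeIfAbsent(priority, p -> new HashMap<>());
    priorities.add(priority);
    return asks.put(resourceName, request); // overwrite, hand back the old one
  }
}

class MockRequest {
  int numContainers;
  MockRequest(int numContainers) { this.numContainers = numContainers; }
}

The real bookkeeping happens in AppSchedulingInfo.updateResourceRequests: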

synchronized public void updateResourceRequests(
    List<ResourceRequest> requests, boolean recoverPreemptedRequest) {
  // Update resource requests
  for (ResourceRequest request : requests) {
    Priority priority = request.getPriority();
    String resourceName = request.getResourceName();
    boolean updatePendingResources = false;
    ResourceRequest lastRequest = null;

    // If the request's ResourceName is ResourceRequest.ANY
    // (is the AM container the only ANY request??? surely not)
    if (resourceName.equals(ResourceRequest.ANY)) {
      ...
      // only set to true for ResourceRequest.ANY??
      updatePendingResources = true;

      // Premature optimization?
      // Assumes that we won't see more than one priority request updated
      // in one call, reasonable assumption... however, it's totally safe
      // to activate same application more than once.
      // Thus we don't need another loop ala the one in decrementOutstanding()
      // which is needed during deactivate.
      if (request.getNumContainers() > 0) {
        activeUsersManager.activateApplication(user, applicationId);
      }
    }

    // this.requests holds all of this application's requests, keyed by
    // priority; check whether there are already requests at this priority
    Map<String, ResourceRequest> asks = this.requests.get(priority);
    // No requests at this priority yet: create a new map
    if (asks == null) {
      asks = new HashMap<String, ResourceRequest>();
      this.requests.put(priority, asks);
      this.priorities.add(priority);
    }
    // asks is not null: look for a request with the same ResourceName
    lastRequest = asks.get(resourceName);

    if (recoverPreemptedRequest && lastRequest != null) {
      // Increment the number of containers to 1, as it is recovering a
      // single container.
      request.setNumContainers(lastRequest.getNumContainers() + 1);
    }

    // The old request is pulled out into lastRequest and the new request
    // replaces it in asks; what happens to lastRequest? (answered below)
    asks.put(resourceName, request);
    if (updatePendingResources) {
      // Similarly, deactivate application?
      if (request.getNumContainers() <= 0) {
        LOG.info("checking for deactivate... ");
        checkForDeactivation();
      }

      int lastRequestContainers = lastRequest != null ? lastRequest
          .getNumContainers() : 0;
      Resource lastRequestCapability = lastRequest != null ? lastRequest
          .getCapability() : Resources.none();
      metrics.incrPendingResources(user, request.getNumContainers(),
          request.getCapability());
      metrics.decrPendingResources(user, lastRequestContainers,
          lastRequestCapability);
    }
  }
}

updateResourceRequests mainly stores the requests into this.requests, where they wait for an NM heartbeat. One point that was fuzzy on first reading: before the map is updated, an existing request with the same ResourceName is pulled out into lastRequest, so what happens to it? Looking at the tail of the method, lastRequest only survives long enough for its container count and capability to be subtracted from the pending-resource metrics; the new request simply overwrites it in the map.


After the requests map has been updated, control returns to CapacityScheduler.allocate, whose return statement calls application.getAllocation to build an Allocation object. This is where container tokens get created, but the containers being tokenized are ones already assigned to an NM, i.e. RMContainers that already exist. So it seems the scheduler, within one allocate call, first records the new request and then hands back the results of earlier requests, pulled out of the newlyAllocatedContainers list.
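In other words, allocate looks like a pull-style API: recording the new ask and collecting the results of earlier asks happen in the same call. Below is a heavily simplified sketch of that two-phase shape, with hypothetical names and strings standing in for real requests and containers; the real logic lives in FiCaSchedulerApp.getAllocation, which also builds the container tokens:

import java.util.ArrayList;
import java.util.List;

// Sketch: the scheduler's allocate() both records new asks and drains
// containers that earlier heartbeat-driven scheduling already produced.
class PullModelSketch {
  private final List<String> pendingAsks = new ArrayList<>();
  private final List<String> newlyAllocatedContainers = new ArrayList<>();

  // Called via scheduler.allocate (for the AM container: by ScheduleTransition)
  List<String> allocate(List<String> asks) {
    pendingAsks.addAll(asks);         // phase 1: record the new requests
    List<String> result = new ArrayList<>(newlyAllocatedContainers);
    newlyAllocatedContainers.clear(); // phase 2: hand back earlier results
    return result;                    // (token creation would happen here)
  }

  // Called from the NM-heartbeat path (allocateContainersToNode and friends)
  void onNodeHeartbeat() {
    if (!pendingAsks.isEmpty()) {
      newlyAllocatedContainers.add("container for " + pendingAsks.remove(0));
    }
  }
}

On this reading, the first allocate call for a brand-new AM request returns an empty allocation, and the container only materializes after some node's heartbeat.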


With the AM container request recorded, it waits for an NM heartbeat to pick it up.
When some NM heartbeats in, the scheduler handles a NODE_UPDATE event:

// CapacityScheduler.handle, NODE_UPDATE event
case NODE_UPDATE:
{
  NodeUpdateSchedulerEvent nodeUpdatedEvent = (NodeUpdateSchedulerEvent)event;
  RMNode node = nodeUpdatedEvent.getRMNode();
  // Update the state of the containers on this node:
  // launch the ones newly assigned to it, transition the finished ones
  nodeUpdate(node);
  if (!scheduleAsynchronously) {
    // Let this node pick up pending container requests
    allocateContainersToNode(getNode(node.getNodeID()));
  }
}

When the NM heartbeat reaches the CapacityScheduler, nodeUpdate(node) first brings the state of the containers already on that node up to date, and then allocateContainersToNode is called to schedule new containers onto it.

private synchronized void allocateContainersToNode(FiCaSchedulerNode node) {
  ...
  // Assign new containers...
  // 1. Check for reserved applications
  // 2. Schedule if there are no reservations
  // If there are reserved containers, fulfill those first
  ...
  // Try to schedule more if there are no reservations to fulfill
  if (node.getReservedContainer() == null) {
    // Check whether this NM still has enough free resources for a container
    if (calculator.computeAvailableContainers(node.getAvailableResource(),
      minimumAllocation) > 0) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Trying to schedule on node: " + node.getNodeName() +
            ", available: " + node.getAvailableResource());
      }
      // Assign containers
      root.assignContainers(clusterResource, node, false);
    }
  } else {
    LOG.info("Skipping scheduling since node " + node.getNodeID() +
        " is reserved by application " +
        node.getReservedContainer().getContainerId().getApplicationAttemptId()
        );
  }
}

When scheduling containers onto this NM, the scheduler first checks whether the node has a reserved container; a reservation, if present, is satisfied first, and only when there is none does it call root.assignContainers.
root is a CSQueue. CSQueue is an interface implemented by the abstract class AbstractCSQueue, which is in turn extended by ParentQueue and LeafQueue; the call here lands in ParentQueue.assignContainers:

public synchronized CSAssignment assignContainers(
    Resource clusterResource, FiCaSchedulerNode node, boolean needToUnreserve) {
  CSAssignment assignment =
      new CSAssignment(Resources.createResource(0, 0), NodeType.NODE_LOCAL);

  // if our queue cannot access this node, just return
  if (!SchedulerUtils.checkQueueAccessToNode(accessibleLabels,
      labelManager.getLabelsOnNode(node.getNodeID()))) {
    return assignment;
  }

  while (canAssign(clusterResource, node)) {
    ...
    boolean localNeedToUnreserve = false;
    Set<String> nodeLabels = labelManager.getLabelsOnNode(node.getNodeID());

    // Are we over maximum-capacity for this queue?
    if (!canAssignToThisQueue(clusterResource, nodeLabels)) {
      // check to see if we could if we unreserve first
      localNeedToUnreserve = assignToQueueIfUnreserve(clusterResource);
      if (!localNeedToUnreserve) {
        break;
      }
    }

    // Schedule
    CSAssignment assignedToChild =
        assignContainersToChildQueues(clusterResource, node,
            localNeedToUnreserve | needToUnreserve);
    assignment.setType(assignedToChild.getType());
    ...

    // Do not assign more than one container if this isn't the root queue
    // or if we've already assigned an off-switch container
    if (!rootQueue || assignment.getType() == NodeType.OFF_SWITCH) {
      if (LOG.isDebugEnabled()) {
        if (rootQueue && assignment.getType() == NodeType.OFF_SWITCH) {
          LOG.debug("Not assigning more than one off-switch container," +
              " assignments so far: " + assignment);
        }
      }
      break;
    }
  }

  return assignment;
}

During assignment the queue first checks whether it may access this NM, including the node's labels; once those checks pass, assignContainersToChildQueues (shown after the sketch below) recurses into the child queues.
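The recursion works because the queues form a composite: a ParentQueue only fans out to its children, while a LeafQueue does the per-application work. A minimal sketch of that structure, with deliberately simplified signatures (the real method returns a CSAssignment, not an int):

import java.util.ArrayList;
import java.util.List;

// Composite sketch of the CSQueue hierarchy (simplified signatures).
interface SchedulableQueue {
  int assignContainers(String node); // returns amount of resources assigned
}

class ParentQueueSketch implements SchedulableQueue {
  private final List<SchedulableQueue> children = new ArrayList<>();
  ParentQueueSketch(List<SchedulableQueue> children) {
    this.children.addAll(children);
  }

  @Override
  public int assignContainers(String node) {
    // Try the most under-served child first, as in assignContainersToChildQueues
    for (SchedulableQueue child : children) {
      int assigned = child.assignContainers(node);
      if (assigned > 0) {
        return assigned; // stop after the first child that assigns something
      }
    }
    return 0;
  }
}

class LeafQueueSketch implements SchedulableQueue {
  @Override
  public int assignContainers(String node) {
    // The real LeafQueue walks its active applications here
    return 1; // pretend one container was assigned
  }
}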

private synchronized CSAssignment assignContainersToChildQueues(Resource cluster,
    FiCaSchedulerNode node, boolean needToUnreserve) {
  ...
  // Try to assign to most 'under-served' sub-queue
  for (Iterator<CSQueue> iter=childQueues.iterator(); iter.hasNext();) {
    CSQueue childQueue = iter.next();
    ...
    assignment = childQueue.assignContainers(cluster, node, needToUnreserve);
    ...
    // If we do assign, remove the queue and re-insert in-order to re-sort
    if (Resources.greaterThan(
            resourceCalculator, cluster,
            assignment.getResource(), Resources.none())) {
      // Remove and re-insert to sort
      iter.remove();
      LOG.info("Re-sorting assigned queue: " + childQueue.getQueuePath() +
          " stats: " + childQueue);
      childQueues.add(childQueue);
      if (LOG.isDebugEnabled()) {
        printChildQueues();
      }
      break;
    }
  }

  return assignment;
}

assignContainersToChildQueues walks the child queues from most to least under-served and calls each child's assignContainers. After a successful assignment the child queue is removed from the collection and re-inserted so that the ordering reflects its new usage.
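The remove-then-re-insert dance is needed because childQueues is a sorted collection (a TreeSet in this version), and sorted collections do not re-sort an element whose sort key mutates in place. A small self-contained illustration of the pattern:

import java.util.Comparator;
import java.util.TreeSet;

class QueueUsage {
  final String name;
  double usedCapacity; // the sort key; it changes after every assignment
  QueueUsage(String name, double used) {
    this.name = name;
    this.usedCapacity = used;
  }
}

public class ResortSketch {
  public static void main(String[] args) {
    TreeSet<QueueUsage> childQueues = new TreeSet<>(
        Comparator.comparingDouble((QueueUsage q) -> q.usedCapacity)
            .thenComparing((QueueUsage q) -> q.name));
    QueueUsage a = new QueueUsage("a", 0.1);
    childQueues.add(a);
    childQueues.add(new QueueUsage("b", 0.5));

    // Mutating the key in place would leave the set mis-ordered;
    // instead, remove first, then mutate, then re-insert:
    childQueues.remove(a);
    a.usedCapacity = 0.9; // queue "a" just received an assignment
    childQueues.add(a);

    System.out.println(childQueues.first().name); // prints "b"
  }
}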
Descending the hierarchy eventually reaches a leaf; LeafQueue.assignContainers looks like this:

public synchronized CSAssignment assignContainers(Resource clusterResource,
    FiCaSchedulerNode node, boolean needToUnreserve) {
  ...
  // if our queue cannot access this node, just return
  if (!SchedulerUtils.checkQueueAccessToNode(accessibleLabels,
      labelManager.getLabelsOnNode(node.getNodeID()))) {
    return NULL_ASSIGNMENT;
  }

  // Check for reserved resources
  RMContainer reservedContainer = node.getReservedContainer();
  if (reservedContainer != null) {
    FiCaSchedulerApp application =
        getApplication(reservedContainer.getApplicationAttemptId());
    synchronized (application) {
      return assignReservedContainer(application, node, reservedContainer,
          clusterResource);
    }
  }

  // Try to assign containers to applications in order
  for (FiCaSchedulerApp application : activeApplications) {
    ...
    // Lock the application attempt
    synchronized (application) {
      // Check if this resource is on the blacklist
      if (SchedulerAppUtils.isBlacklisted(application, node, LOG)) {
        continue;
      }

      // Schedule in priority order
      for (Priority priority : application.getPriorities()) {
        // Why ANY? If the application has no ANY request at this priority,
        // nothing gets assigned? (See the note after this code block.)
        ResourceRequest anyRequest =
            application.getResourceRequest(priority, ResourceRequest.ANY);
        if (null == anyRequest) {
          continue;
        }

        // Required resource
        Resource required = anyRequest.getCapability();

        // Do we need containers at this 'priority'?
        if (application.getTotalRequiredResources(priority) <= 0) {
          continue;
        }
        if (!this.reservationsContinueLooking) {
          if (!needContainers(application, priority, required)) {
            if (LOG.isDebugEnabled()) {
              LOG.debug("doesn't need containers based on reservation algo!");
            }
            continue;
          }
        }

        Set<String> requestedNodeLabels =
            getRequestLabelSetByExpression(anyRequest
                .getNodeLabelExpression());

        // Compute user-limit & set headroom
        // Note: We compute both user-limit & headroom with the highest
        //       priority request as the target.
        //       This works since we never assign lower priority requests
        //       before all higher priority ones are serviced.
        Resource userLimit =
            computeUserLimitAndSetHeadroom(application, clusterResource,
                required, requestedNodeLabels);

        // Check queue max-capacity limit
        if (!canAssignToThisQueue(clusterResource, required,
            labelManager.getLabelsOnNode(node.getNodeID()), application, true)) {
          return NULL_ASSIGNMENT;
        }

        // Check user limit
        if (!assignToUser(clusterResource, application.getUser(), userLimit,
            application, true, requestedNodeLabels)) {
          break;
        }

        // Inform the application it is about to get a scheduling opportunity
        // (delay scheduling; see the sketch after assignContainersOnNode below)
        application.addSchedulingOpportunity(priority);

        // Try to schedule
        CSAssignment assignment =
            assignContainersOnNode(clusterResource, node, application, priority,
                null, needToUnreserve);

        // Did the application skip this node?
        if (assignment.getSkipped()) {
          // Don't count 'skipped nodes' as a scheduling opportunity!
          application.subtractSchedulingOpportunity(priority);
          continue;
        }

        // Did we schedule or reserve a container?
        Resource assigned = assignment.getResource();
        if (Resources.greaterThan(
            resourceCalculator, clusterResource, assigned, Resources.none())) {
          // Book-keeping
          // Note: Update headroom to account for current allocation too...
          allocateResource(clusterResource, application, assigned,
              labelManager.getLabelsOnNode(node.getNodeID()));

          // Don't reset scheduling opportunities for non-local assignments
          // otherwise the app will be delayed for each non-local assignment.
          // This helps apps with many off-cluster requests schedule faster.
          if (assignment.getType() != NodeType.OFF_SWITCH) {
            if (LOG.isDebugEnabled()) {
              LOG.debug("Resetting scheduling opportunities");
            }
            application.resetSchedulingOpportunities(priority);
          }

          // Done
          return assignment;
        } else {
          // Do not assign out of order w.r.t priorities
          break;
        }
      }
    }

    if(LOG.isDebugEnabled()) {
      LOG.debug("post-assignContainers for application "
        + application.getApplicationId());
    }
    application.showRequests();
  }

  return NULL_ASSIGNMENT;
}
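About the "why ANY?" comment above: this looks like a convention of YARN's resource-request model rather than something AM-specific. A locality-aware ask is expanded into requests at every level (host, rack, and ANY), and the ANY-level request carries the authoritative container count, which is why the scheduler treats a missing ANY request at a priority as "nothing to schedule there". A sketch of what a single map-task ask could look like after expansion; hostname, rack, capability, and the priority value are all made up for illustration:

import java.util.Arrays;
import java.util.List;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class LocalityExpansionSketch {
  public static List<ResourceRequest> oneMapTaskAsk() {
    Priority p = Priority.newInstance(20);        // illustrative priority
    Resource cap = Resource.newInstance(1024, 1); // made-up capability
    return Arrays.asList(
        // Host-level ask: the preferred node holding the data split
        ResourceRequest.newInstance(p, "host17.example.com", cap, 1),
        // Rack-level ask for the same container
        ResourceRequest.newInstance(p, "/rack3", cap, 1),
        // ANY-level ask: the total the scheduler actually counts
        ResourceRequest.newInstance(p, ResourceRequest.ANY, cap, 1));
  }
}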

LeafQueue.assignContainers iterates over the queue's running applications and their pending requests; once all the checks above pass, it calls assignContainersOnNode to do the actual placement:

private CSAssignment assignContainersOnNode(Resource clusterResource,
    FiCaSchedulerNode node, FiCaSchedulerApp application, Priority priority,
    RMContainer reservedContainer, boolean needToUnreserve) {

  Resource assigned = Resources.none();

  // Node-local: a request whose ResourceName is this node's hostname
  ResourceRequest nodeLocalResourceRequest =
      application.getResourceRequest(priority, node.getNodeName());
  if (nodeLocalResourceRequest != null) {
    assigned =
        assignNodeLocalContainers(clusterResource, nodeLocalResourceRequest,
            node, application, priority, reservedContainer, needToUnreserve);
    if (Resources.greaterThan(resourceCalculator, clusterResource,
        assigned, Resources.none())) {
      return new CSAssignment(assigned, NodeType.NODE_LOCAL);
    }
  }

  // Rack-local: a request whose ResourceName is this node's rack
  ResourceRequest rackLocalResourceRequest =
      application.getResourceRequest(priority, node.getRackName());
  if (rackLocalResourceRequest != null) {
    if (!rackLocalResourceRequest.getRelaxLocality()) {
      return SKIP_ASSIGNMENT;
    }

    assigned =
        assignRackLocalContainers(clusterResource, rackLocalResourceRequest,
            node, application, priority, reservedContainer, needToUnreserve);
    if (Resources.greaterThan(resourceCalculator, clusterResource,
        assigned, Resources.none())) {
      return new CSAssignment(assigned, NodeType.RACK_LOCAL);
    }
  }

  // Off-switch: a request whose ResourceName is ANY
  ResourceRequest offSwitchResourceRequest =
      application.getResourceRequest(priority, ResourceRequest.ANY);
  if (offSwitchResourceRequest != null) {
    if (!offSwitchResourceRequest.getRelaxLocality()) {
      return SKIP_ASSIGNMENT;
    }

    return new CSAssignment(
        assignOffSwitchContainers(clusterResource, offSwitchResourceRequest,
            node, application, priority, reservedContainer, needToUnreserve),
        NodeType.OFF_SWITCH);
  }

  return SKIP_ASSIGNMENT;
}
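The addSchedulingOpportunity / resetSchedulingOpportunities pair seen in LeafQueue.assignContainers, together with the locality cascade above, is delay scheduling: an application passes up non-local nodes for a while in the hope that a local one heartbeats in, then gradually relaxes to rack-local and finally off-switch. A simplified sketch of the decision that LeafQueue.canAssign makes; the thresholds here are illustrative, while the real ones derive from yarn.scheduler.capacity.node-locality-delay and the cluster/request size:

// Simplified delay-scheduling check, in the spirit of LeafQueue.canAssign.
class DelaySchedulingSketch {
  enum Locality { NODE_LOCAL, RACK_LOCAL, OFF_SWITCH }

  private int missedOpportunities = 0; // ++ for every node considered but skipped
  private final int nodeLocalityDelay; // how long to hold out for locality

  DelaySchedulingSketch(int nodeLocalityDelay) {
    this.nodeLocalityDelay = nodeLocalityDelay;
  }

  boolean canAssign(Locality type) {
    switch (type) {
      case NODE_LOCAL:
        return true; // a local match is always acceptable
      case RACK_LOCAL:
        // relax to rack-local only after enough missed opportunities
        return missedOpportunities >= nodeLocalityDelay;
      default:
        // off-switch waits even longer (the real code scales this with
        // the number of outstanding off-switch requests)
        return missedOpportunities >= 2 * nodeLocalityDelay;
    }
  }

  void skippedNode()     { missedOpportunities++; }
  void assignedLocally() { missedOpportunities = 0; }
}

Since the AM request is ANY-only with relaxLocality on, it never benefits from this waiting and simply takes the off-switch path.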

assignContainersOnNode branches on the locality of the request: node-local first, then rack-local, then off-switch. Since the AM container's ResourceRequest is ANY, only the off-switch path matters here:

private Resource assignOffSwitchContainers(
    Resource clusterResource, ResourceRequest offSwitchResourceRequest,
    FiCaSchedulerNode node, FiCaSchedulerApp application, Priority priority,
    RMContainer reservedContainer, boolean needToUnreserve) {
  if (canAssign(application, priority, node, NodeType.OFF_SWITCH,
      reservedContainer)) {
    return assignContainer(clusterResource, node, application, priority,
        offSwitchResourceRequest, NodeType.OFF_SWITCH, reservedContainer,
        needToUnreserve);
  }

  return Resources.none();
}

assignOffSwitchContainers in turn calls assignContainer; following along:

private Resource assignContainer(Resource clusterResource, FiCaSchedulerNode node,
    FiCaSchedulerApp application, Priority priority,
    ResourceRequest request, NodeType type, RMContainer rmContainer,
    boolean needToUnreserve) {
  ...
  // The resources the container asks for
  Resource capability = request.getCapability();
  // The node's currently available resources
  Resource available = node.getAvailableResource();
  // The node's total resources
  Resource totalResource = node.getTotalResource();

  // Reject the request if it exceeds the node's total capacity
  if (!Resources.fitsIn(capability, totalResource)) {
    LOG.warn("Node : " + node.getNodeID()
        + " does not have sufficient resource for request : " + request
        + " node total capability : " + node.getTotalResource());
    return Resources.none();
  }
  assert Resources.greaterThan(
      resourceCalculator, clusterResource, available, Resources.none());

  // Create the container if necessary
  // (this is where the containerId is generated)
  Container container =
      getContainer(rmContainer, application, node, capability, priority);
  ...

  // getContainer above returns the reserved container if one exists;
  // only when assigning a regular container do we now check whether
  // the free resources are sufficient
  // Can we allocate a container on this node?
  int availableContainers =
      resourceCalculator.computeAvailableContainers(available, capability);
  if (availableContainers > 0) {
    // Allocate...
    ...
    // Inform the application
    RMContainer allocatedContainer =
        application.allocate(type, node, priority, request, container);

    // Does the application need this resource?
    if (allocatedContainer == null) {
      return Resources.none();
    }

    // Inform the node: the container goes into its launchedContainers map
    node.allocateContainer(allocatedContainer);

    LOG.info("assignedContainer" +
        " application attempt=" + application.getApplicationAttemptId() +
        " container=" + container +
        " queue=" + this +
        " clusterResource=" + clusterResource);

    return container.getResource();
  } else {
    // if we are allowed to allocate but this node doesn't have space, reserve it or
    // if this was an already a reserved container, reserve it again
    ...
    return Resources.none();
  }
}

assignContainer first checks whether the requested resources exceed the node's total capacity. If they fit, getContainer checks whether the node already holds a reserved container for this attempt and otherwise calls createContainer, which generates the containerId. With the container in hand, it checks whether the node's free resources are actually sufficient; if so, application.allocate performs the allocation (application again being the FiCaSchedulerApp), and node.allocateContainer records the container in the node's launchedContainers map so it can be shipped back on the node's next heartbeat. The allocate code:

synchronized public RMContainer allocate(NodeType type, FiCaSchedulerNode node,
    Priority priority, ResourceRequest request,
    Container container) {
  ...
  // On the RM side the container is represented as an RMContainer
  // Create RMContainer
  RMContainer rmContainer = new RMContainerImpl(container, this
      .getApplicationAttemptId(), node.getNodeID(),
      appSchedulingInfo.getUser(), this.rmContext);

  // Add it to allContainers list.
  // The new container goes into the newlyAllocatedContainers list,
  // from which the scheduler later pulls it during allocate
  newlyAllocatedContainers.add(rmContainer);
  liveContainers.put(container.getId(), rmContainer);

  // Update consumption and track allocations
  List<ResourceRequest> resourceRequestList = appSchedulingInfo.allocate(
      type, node, priority, request, container);
  Resources.addTo(currentConsumption, container.getResource());

  // Update resource requests related to "request" and store in RMContainer
  ((RMContainerImpl)rmContainer).setResourceRequests(resourceRequestList);

  // Inform the container: fire an event to tell it the allocation is
  // ready, driving the RMContainer state machine
  rmContainer.handle(
      new RMContainerEvent(container.getId(), RMContainerEventType.START));
  ...

  return rmContainer;
}

allocate wraps the container in an RMContainer and appends it to the newlyAllocatedContainers list, from which the scheduler later pulls it and returns it via getAllocation.
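That RMContainerEventType.START event is what drives the RMContainer state machine from NEW to ALLOCATED, i.e. exactly the second log line quoted at the top of this post. A toy model of that single transition, with the state and event sets reduced to the bare minimum:

// Toy model of the RMContainerImpl state machine, reduced to one transition.
class RMContainerSketch {
  enum State { NEW, ALLOCATED }
  enum EventType { START }

  private State state = State.NEW;
  private final String containerId;

  RMContainerSketch(String containerId) { this.containerId = containerId; }

  void handle(EventType event) {
    if (state == State.NEW && event == EventType.START) {
      state = State.ALLOCATED;
      // Mirrors the log: "Container Transitioned from NEW to ALLOCATED"
      System.out.println(containerId
          + " Container Transitioned from NEW to ALLOCATED");
    }
  }
}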

Summary

The broad flow is now clear from the code, though a few details still deserve a careful debug session; I'll update this post afterwards.
My current understanding: within one allocate call the scheduler records the container request and drains newlyAllocatedContainers of any containers that earlier scheduling rounds produced, creating a token for each. When an NM heartbeats in, the matching request is taken from requests and instantiated as an RMContainer, which is placed into newlyAllocatedContainers to be picked up by the next allocate call. As for the original question: since the AM request is ANY, its placement is neither purely random nor "most free resources first"; the AM container lands on the first heartbeating node that passes the queue, user-limit, and label checks and has enough free resources to fit it.
