Hadoop Source Code Analysis: Requesting and Allocating Containers

Source: Internet | Published by: 网络大电影方案 | Editor: 程序博客网 | Time: 2024/06/04 18:36

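This article walks through the source code to show how an application's AM (ApplicationMaster), once it has started on an NM (NodeManager) and registered with the RM (ResourceManager), requests resources (Containers) from the RM and eventually receives them, together with the main workflows inside the RM. The whole process can be viewed as an iterative loop of two phases:

Phase 1: The AM reports its resource demand and picks up the resources already allocated to it.

Phase 2: Each NM reports the running state of its Containers to the RM. If the RM finds idle resources on that node, it performs one round of allocation and stores the allocated resources in the corresponding data structures, where they wait to be picked up by the AM's next heartbeat.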
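The two-phase loop can be illustrated with a toy model. The sketch below is not Hadoop code: the class TwoPhaseAllocationSketch, its methods, and the string container names are hypothetical, and the "allocation" is simply a counter. It only shows that allocation happens on the node heartbeat while delivery happens on the next AM heartbeat.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the two-phase cycle; every name here is hypothetical. */
final class TwoPhaseAllocationSketch {
    private int pendingAsk = 0;                                // requested, not yet allocated
    private final List<String> allocated = new ArrayList<>();  // allocated, not yet picked up

    /** Phase 1: the AM heartbeat reports new demand and collects earlier allocations. */
    synchronized List<String> amHeartbeat(int newAsk) {
        pendingAsk += newAsk;
        List<String> picked = new ArrayList<>(allocated);
        allocated.clear();
        return picked;
    }

    /** Phase 2: a node heartbeat reports free slots; demand is served from them. */
    synchronized void nodeHeartbeat(String node, int freeSlots) {
        int n = Math.min(freeSlots, pendingAsk);
        for (int i = 0; i < n; i++) {
            allocated.add(node + "#" + i);
        }
        pendingAsk -= n;
    }

    public static void main(String[] args) {
        TwoPhaseAllocationSketch rm = new TwoPhaseAllocationSketch();
        System.out.println(rm.amHeartbeat(2)); // prints [] because nothing is allocated yet
        rm.nodeHeartbeat("node1", 1);
        rm.nodeHeartbeat("node2", 4);
        System.out.println(rm.amHeartbeat(0)); // prints [node1#0, node2#0]
    }
}
```

Note that the first AM heartbeat returns nothing even though demand was reported: the request is only satisfied once some node heartbeat arrives, which is why the AM has to keep calling periodically.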

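The detailed steps of Container request and allocation are as follows:

Phase 1:

Step 1: The AM periodically calls the RPC function ApplicationMasterProtocol#allocate to report its resource demand to the RM. The request includes the new resource requests, the list of Containers to release, the nodes to add to the blacklist, and the nodes to remove from the blacklist.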

// RMContainerAllocator.java
protected synchronized void heartbeat() throws Exception {
  scheduleStats.updateAndLogIfChanged("Before Scheduling: ");
  List<Container> allocatedContainers = getResources();
  if (allocatedContainers.size() > 0) {
    scheduledRequests.assign(allocatedContainers);
  }

  int completedMaps = getJob().getCompletedMaps();
  int completedTasks = completedMaps + getJob().getCompletedReduces();
  if ((lastCompletedTasks != completedTasks) ||
      (scheduledRequests.maps.size() > 0)) {
    lastCompletedTasks = completedTasks;
    recalculateReduceSchedule = true;
  }

  if (recalculateReduceSchedule) {
    preemptReducesIfNeeded();
    scheduleReduces(
        getJob().getTotalMaps(), completedMaps,
        scheduledRequests.maps.size(), scheduledRequests.reduces.size(),
        assignedRequests.maps.size(), assignedRequests.reduces.size(),
        mapResourceRequest, reduceResourceRequest,
        pendingReduces.size(),
        maxReduceRampupLimit, reduceSlowStart);
    recalculateReduceSchedule = false;
  }

  scheduleStats.updateAndLogIfChanged("After Scheduling: ");
}

In the heartbeat function of class RMContainerAllocator, the call

List<Container> allocatedContainers = getResources();

obtains the list of Containers. Stepping into getResources:

// RMContainerAllocator.java
private List<Container> getResources() throws Exception {
  int headRoom = getAvailableResources() != null
      ? getAvailableResources().getMemory() : 0; // first time it would be null
  AllocateResponse response;
  /*
   * If contact with RM is lost, the AM will wait MR_AM_TO_RM_WAIT_INTERVAL_MS
   * milliseconds before aborting. During this interval, AM will still try
   * to contact the RM.
   */
  try {
    response = makeRemoteRequest();
    // Reset retry count if no exception occurred.
    retrystartTime = System.currentTimeMillis();
  } catch (Exception e) {
    ...
  }
  ...
}

Stepping into response = makeRemoteRequest();

// RMContainerRequestor.java
protected AllocateResponse makeRemoteRequest() throws IOException {
  ResourceBlacklistRequest blacklistRequest =
      ResourceBlacklistRequest.newInstance(new ArrayList<String>(blacklistAdditions),
          new ArrayList<String>(blacklistRemovals));
  AllocateRequest allocateRequest =
      AllocateRequest.newInstance(lastResponseID,
        super.getApplicationProgress(), new ArrayList<ResourceRequest>(ask),
        new ArrayList<ContainerId>(release), blacklistRequest);
  AllocateResponse allocateResponse;
  try {
    allocateResponse = scheduler.allocate(allocateRequest);
  } catch (YarnException e) {
    throw new IOException(e);
  }
  lastResponseID = allocateResponse.getResponseId();
  availableResources = allocateResponse.getAvailableResources();
  lastClusterNmCount = clusterNmCount;
  clusterNmCount = allocateResponse.getNumClusterNodes();

  if (ask.size() > 0 || release.size() > 0) {
    LOG.info("getResources() for " + applicationId + ":" + " ask="
        + ask.size() + " release= " + release.size() + " newContainers="
        + allocateResponse.getAllocatedContainers().size()
        + " finishedContainers="
        + allocateResponse.getCompletedContainersStatuses().size()
        + " resourcelimit=" + availableResources + " knownNMs="
        + clusterNmCount);
  }

  // The pending sets were copied into the request above, so they can be reset.
  ask.clear();
  release.clear();

  if (blacklistAdditions.size() > 0 || blacklistRemovals.size() > 0) {
    LOG.info("Update the blacklist for " + applicationId +
        ": blacklistAdditions=" + blacklistAdditions.size() +
        " blacklistRemovals=" + blacklistRemovals.size());
  }
  blacklistAdditions.clear();
  blacklistRemovals.clear();
  return allocateResponse;
}

The key line is allocateResponse = scheduler.allocate(allocateRequest);

The field scheduler is declared and initialized in class RMCommunicator:

// RMCommunicator.java
protected ApplicationMasterProtocol scheduler;
...
protected void serviceStart() throws Exception {
  scheduler = createSchedulerProxy();
  JobID id = TypeConverter.fromYarn(this.applicationId);
  JobId jobId = TypeConverter.toYarn(id);
  job = context.getJob(jobId);
  register();
  startAllocatorThread();
  super.serviceStart();
}
...
protected ApplicationMasterProtocol createSchedulerProxy() {
  final Configuration conf = getConfig();

  try {
    return ClientRMProxy.createRMProxy(conf, ApplicationMasterProtocol.class);
  } catch (IOException e) {
    throw new YarnRuntimeException(e);
  }
}

As the code shows, scheduler is a proxy object implementing the RPC protocol ApplicationMasterProtocol, so the protocol's allocate function can be called through it as if it were a local method.
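To make the idea of "a proxy that implements a protocol interface" concrete, here is a minimal sketch built on the JDK's java.lang.reflect.Proxy. It is not Hadoop's RPC engine: the interface AllocateProtocol, the class RecordingHandler, and the string request and response are all hypothetical, and the handler records the call in a list instead of sending it over the network.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

final class ProtocolProxySketch {
    /** Hypothetical stand-in for an RPC protocol interface. */
    interface AllocateProtocol {
        String allocate(String request);
    }

    /** Records each protocol call; a real RPC engine would serialize and send it. */
    static final class RecordingHandler implements InvocationHandler {
        private final List<String> wireLog;

        RecordingHandler(List<String> wireLog) {
            this.wireLog = wireLog;
        }

        @Override
        public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
            if (method.getDeclaringClass() == Object.class) {
                // toString/hashCode/equals are answered by the handler itself.
                return method.invoke(this, args);
            }
            wireLog.add(method.getName() + "(" + args[0] + ")");
            return "response-to-" + args[0];
        }
    }

    /** Creates a proxy object that implements AllocateProtocol. */
    static AllocateProtocol createProxy(List<String> wireLog) {
        return (AllocateProtocol) Proxy.newProxyInstance(
            AllocateProtocol.class.getClassLoader(),
            new Class<?>[] {AllocateProtocol.class},
            new RecordingHandler(wireLog));
    }

    public static void main(String[] args) {
        List<String> wireLog = new ArrayList<>();
        AllocateProtocol scheduler = createProxy(wireLog);
        System.out.println(scheduler.allocate("ask-1")); // prints response-to-ask-1
        System.out.println(wireLog);                     // prints [allocate(ask-1)]
    }
}
```

The caller only sees the interface, which is why the AM code can write scheduler.allocate(...) without knowing that the call crosses a process boundary.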

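The argument allocateRequest passed to scheduler.allocate(allocateRequest) carries the resource requests, the list of Containers to release, the nodes to add to the blacklist, and the nodes to remove from the blacklist: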

// RMContainerRequestor.java
protected AllocateResponse makeRemoteRequest() throws IOException {
  ResourceBlacklistRequest blacklistRequest =
      ResourceBlacklistRequest.newInstance(new ArrayList<String>(blacklistAdditions),
          new ArrayList<String>(blacklistRemovals));
  AllocateRequest allocateRequest =
      AllocateRequest.newInstance(lastResponseID,
        super.getApplicationProgress(), new ArrayList<ResourceRequest>(ask),
        new ArrayList<ContainerId>(release), blacklistRequest);
  AllocateResponse allocateResponse;
  try {
    allocateResponse = scheduler.allocate(allocateRequest);
  } catch (YarnException e) {
    throw new IOException(e);
  }
  ...
}

The resource requests are passed as new ArrayList<ResourceRequest>(ask):

// RMContainerRequestor.java
private final Set<ResourceRequest> ask = new TreeSet<ResourceRequest>(
    new org.apache.hadoop.yarn.api.records.ResourceRequest.ResourceRequestComparator());

The Containers to release are passed as new ArrayList<ContainerId>(release):

// RMContainerRequestor.java
private final Set<ContainerId> release = new TreeSet<ContainerId>();

The nodes to add to the blacklist are passed as new ArrayList<String>(blacklistAdditions):

// RMContainerRequestor.java
private final Set<String> blacklistAdditions = Collections
    .newSetFromMap(new ConcurrentHashMap<String, Boolean>());

The nodes to remove from the blacklist are passed as new ArrayList<String>(blacklistRemovals):

// RMContainerRequestor.java
private final Set<String> blacklistRemovals = Collections
    .newSetFromMap(new ConcurrentHashMap<String, Boolean>());
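In each case the pending set is copied into a fresh ArrayList before it goes into the request, and the set is cleared once the call returns. A minimal sketch of that copy-then-clear pattern follows, assuming plain strings in place of ResourceRequest and ContainerId; the class RequestSnapshotSketch and its Request type are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class RequestSnapshotSketch {
    /** Hypothetical, simplified request holding independent copies of the sets. */
    static final class Request {
        final List<String> ask;
        final List<String> release;

        Request(List<String> ask, List<String> release) {
            this.ask = ask;
            this.release = release;
        }
    }

    private final Set<String> ask = new TreeSet<>();
    private final Set<String> release = new TreeSet<>();

    void addAsk(String resourceRequest) {
        ask.add(resourceRequest);
    }

    void addRelease(String containerId) {
        release.add(containerId);
    }

    /** Copies the pending sets into the request, then resets them for the next heartbeat. */
    Request buildRequest() {
        Request request = new Request(new ArrayList<>(ask), new ArrayList<>(release));
        ask.clear();
        release.clear();
        return request;
    }
}
```

Because the request owns its own lists, clearing the sets afterwards does not alter what was sent, and anything added later simply travels with the following heartbeat.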

Step 2: In the RM, ApplicationMasterService handles requests from the ApplicationMaster. On receiving the allocate request, it sends an RMAppAttemptEventType.STATUS_UPDATE event to RMAppAttemptImpl:

// ApplicationMasterService.java
public AllocateResponse allocate(AllocateRequest request)
    throws YarnException, IOException {
  LOG.info("ApplicationMasterService::allocate, begin");
  ...
  // Send the status update to the appAttempt.
  LOG.info("Send the status update to the appAttempt.");
  this.rmContext.getDispatcher().getEventHandler().handle(
      new RMAppAttemptStatusupdateEvent(appAttemptId, request
          .getProgress()));
  ...
}

When RMAppAttemptImpl receives this event, its state goes from RUNNING to RUNNING (a self-transition):

// RMAppAttemptImpl.java
.addTransition(RMAppAttemptState.RUNNING, RMAppAttemptState.RUNNING,
    RMAppAttemptEventType.STATUS_UPDATE, new StatusUpdateTransition())

and the transition function of class StatusUpdateTransition is invoked:

// RMAppAttemptImpl.java
private static final class StatusUpdateTransition extends
    BaseTransition {
  @Override
  public void transition(RMAppAttemptImpl appAttempt,
      RMAppAttemptEvent event) {

    RMAppAttemptStatusupdateEvent statusUpdateEvent
      = (RMAppAttemptStatusupdateEvent) event;

    // Update progress
    appAttempt.progress = statusUpdateEvent.getProgress();

    // Ping to AMLivelinessMonitor
    appAttempt.rmContext.getAMLivelinessMonitor().receivedPing(
        statusUpdateEvent.getApplicationAttemptId());
  }
}

The function first updates the application's progress, then refreshes the last-update time that AMLivelinessMonitor records for the application attempt.
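A self-transition such as RUNNING to RUNNING may look like a no-op, but the hook attached to it still runs. The sketch below shows the idea with a tiny table-driven state machine. It is not YARN's StateMachineFactory: the class SelfTransitionSketch, its two states and two event types, and the string-keyed table are hypothetical simplifications.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

final class SelfTransitionSketch {
    enum State { RUNNING, FINISHED }
    enum EventType { STATUS_UPDATE, UNREGISTERED }

    /** One table entry: the state to move to and the hook to run on the way. */
    private static final class Arc {
        final State next;
        final Consumer<Float> hook;

        Arc(State next, Consumer<Float> hook) {
            this.next = next;
            this.hook = hook;
        }
    }

    private final Map<String, Arc> table = new HashMap<>();
    State state = State.RUNNING;
    float progress = 0f;

    SelfTransitionSketch() {
        // RUNNING -> RUNNING on STATUS_UPDATE: the state stays, the hook records progress.
        add(State.RUNNING, State.RUNNING, EventType.STATUS_UPDATE, p -> progress = p);
        add(State.RUNNING, State.FINISHED, EventType.UNREGISTERED, p -> { });
    }

    private void add(State from, State to, EventType type, Consumer<Float> hook) {
        table.put(from + "/" + type, new Arc(to, hook));
    }

    /** Looks up the arc for (current state, event type), runs its hook, then moves. */
    void handle(EventType type, float payload) {
        Arc arc = table.get(state + "/" + type);
        if (arc == null) {
            throw new IllegalStateException("Invalid event " + type + " at " + state);
        }
        arc.hook.accept(payload);
        state = arc.next;
    }
}
```

Registering the self-transition explicitly also documents which events are legal in a state: an event with no table entry is rejected instead of being silently ignored.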

The receivedPing function that AMLivelinessMonitor inherits from AbstractLivelinessMonitor looks like this:

// AbstractLivelinessMonitor.java
public synchronized void receivedPing(O ob) {
  // only put for the registered objects
  if (running.containsKey(ob)) {
    running.put(ob, clock.getTime());
  }
}

The field running is a map from each monitored object to the time of its last ping:

// AbstractLivelinessMonitor.java
public abstract class AbstractLivelinessMonitor<O> extends AbstractService {
  ...
  private Map<O, Long> running = new HashMap<O, Long>();
  ...
}
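The purpose of recording ping times is to detect objects that stop pinging. The excerpt does not show that check, so the sketch below is an assumption about how such a map is typically used, not a copy of Hadoop's implementation: LivelinessSketch, register, and expire are hypothetical names, and the current time is passed in explicitly instead of being read from a clock.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class LivelinessSketch<O> {
    private final Map<O, Long> running = new HashMap<>();
    private final long expireIntervalMs;

    LivelinessSketch(long expireIntervalMs) {
        this.expireIntervalMs = expireIntervalMs;
    }

    /** Starts monitoring an object. */
    synchronized void register(O ob, long nowMs) {
        running.put(ob, nowMs);
    }

    /** Refreshes the timestamp, but only for objects that were registered. */
    synchronized void receivedPing(O ob, long nowMs) {
        if (running.containsKey(ob)) {
            running.put(ob, nowMs);
        }
    }

    /** Removes and returns every object whose last ping is older than the interval. */
    synchronized List<O> expire(long nowMs) {
        List<O> expired = new ArrayList<>();
        running.entrySet().removeIf(entry -> {
            boolean stale = nowMs - entry.getValue() > expireIntervalMs;
            if (stale) {
                expired.add(entry.getKey());
            }
            return stale;
        });
        return expired;
    }
}
```

The containsKey guard matters: a ping from an attempt that was never registered, or that has already expired, must not bring it back into the monitored set.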

Step 3: ApplicationMasterService calls CapacityScheduler::allocate (through its rScheduler field) to pass the AM's resource demand on to CapacityScheduler.

// ApplicationMasterService.java
public AllocateResponse allocate(AllocateRequest request)
    throws YarnException, IOException {
  LOG.info("ApplicationMasterService::allocate, begin");
  ...
  // Send new requests to appAttempt.
  LOG.info("Send new requests to appAttempt.");
  Allocation allocation =
      this.rScheduler.allocate(appAttemptId, ask, release,
          blacklistAdditions, blacklistRemovals);
  ...
}

Step 4: CapacityScheduler first reads the release list List<ContainerId> release and, for each entry, sends an RMContainerEventType.RELEASED event to RMContainerImpl in order to kill the running Container.

// CapacityScheduler.java
public Allocation allocate(ApplicationAttemptId applicationAttemptId,
    List<ResourceRequest> ask, List<ContainerId> release,
    List<String> blacklistAdditions, List<String> blacklistRemovals) {
  LOG.info("CapacityScheduler::allocate, begin.");
  ....
  // Release containers
  for (ContainerId releasedContainerId : release) {
    LOG.info("Release containers");
    RMContainer rmContainer = getRMContainer(releasedContainerId);
    if (rmContainer == null) {
      RMAuditLogger.logFailure(application.getUser(),
          AuditConstants.RELEASE_CONTAINER,
          "Unauthorized access or invalid container", "CapacityScheduler",
          "Trying to release container not owned by app or with invalid id",
          application.getApplicationId(), releasedContainerId);
    }
    // A null rmContainer is handled (logged and ignored) inside completedContainer.
    completedContainer(rmContainer,
        SchedulerUtils.createAbnormalContainerStatus(
            releasedContainerId,
            SchedulerUtils.RELEASED_CONTAINER),
        RMContainerEventType.RELEASED);
  }
  ...
}

Stepping into completedContainer:

// CapacityScheduler.java
private synchronized void completedContainer(RMContainer rmContainer,
    ContainerStatus containerStatus, RMContainerEventType event) {
  if (rmContainer == null) {
    LOG.info("Null container completed...");
    return;
  }

  Container container = rmContainer.getContainer();

  // Get the application for the finished container
  FiCaSchedulerApp application =
      getCurrentAttemptForContainer(container.getId());
  ApplicationId appId =
      container.getId().getApplicationAttemptId().getApplicationId();
  if (application == null) {
    LOG.info("Container " + container + " of" + " unknown application "
        + appId + " completed with event " + event);
    return;
  }

  // Get the node on which the container was allocated
  FiCaSchedulerNode node = getNode(container.getNodeId());

  // Inform the queue
  LeafQueue queue = (LeafQueue)application.getQueue();
  queue.completedContainer(clusterResource, application, node,
      rmContainer, containerStatus, event, null);

  LOG.info("Application attempt " + application.getApplicationAttemptId()
      + " released container " + container.getId() + " on node: " + node
      + " with event: " + event);
}

CapacityScheduler::allocate then updates the corresponding data structures with the new resource requests and returns the resources that have already been allocated to the application:

// CapacityScheduler.java
public Allocation allocate(ApplicationAttemptId applicationAttemptId,
    List<ResourceRequest> ask, List<ContainerId> release,
    List<String> blacklistAdditions, List<String> blacklistRemovals) {
  LOG.info("CapacityScheduler::allocate, begin.");
  ...
  FiCaSchedulerApp application = getApplicationAttempt(applicationAttemptId);
  ...
  synchronized (application) {
    if (!ask.isEmpty()) {
      LOG.info("allocate: pre-update" +
          " applicationAttemptId=" + applicationAttemptId +
          " application=" + application);
      application.showRequests();

      // Update application requests
      application.updateResourceRequests(ask);

      LOG.info("allocate: post-update");
      application.showRequests();
    }

    LOG.info("allocate:" +
        " applicationAttemptId=" + applicationAttemptId +
        " #ask=" + ask.size());

    application.updateBlacklist(blacklistAdditions, blacklistRemovals);

    return application.getAllocation(getResourceCalculator(),
        clusterResource, getMinimumResourceCapability());
  }
}

Phase 2:

Step 1: The NM reports the running state of its Containers to the RM through the RPC function ResourceTracker#nodeHeartbeat.

In class NodeManager, the service nodeStatusUpdater is added during initialization:

// NodeManager.java
protected void serviceInit(Configuration conf) throws Exception {
  nodeStatusUpdater =
      createNodeStatusUpdater(context, dispatcher, nodeHealthChecker);

  // StatusUpdater should be added last so that it get started last
  // so that we make sure everything is up before registering with RM.
  addService(nodeStatusUpdater);
}
...
protected NodeStatusUpdater createNodeStatusUpdater(Context context,
    Dispatcher dispatcher, NodeHealthCheckerService healthChecker) {
  return new NodeStatusUpdaterImpl(context, dispatcher, healthChecker,
    metrics);
}

Stepping into the service start of class NodeStatusUpdaterImpl:

// NodeStatusUpdaterImpl.java
protected void serviceStart() throws Exception {

  // NodeManager is the last service to start, so NodeId is available.
  this.nodeId = this.context.getNodeId();
  this.httpPort = this.context.getHttpPort();
  this.nodeManagerVersionId = YarnVersionInfo.getVersion();
  try {
    // Registration has to be in start so that ContainerManager can get the
    // perNM tokens needed to authenticate ContainerTokens.
    this.resourceTracker = getRMClient();
    registerWithRM();
    super.serviceStart();
    startStatusUpdater();
  } catch (Exception e) {
    String errorMessage = "Unexpected error starting NodeStatusUpdater";
    LOG.error(errorMessage, e);
    throw new YarnRuntimeException(e);
  }
}

The call this.resourceTracker = getRMClient(); initializes the proxy for the ResourceTracker protocol:

// NodeStatusUpdaterImpl.java
protected ResourceTracker getRMClient() throws IOException {
  Configuration conf = getConfig();
  return ServerRMProxy.createRMProxy(conf, ResourceTracker.class);
}

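Note that the proxy for the RPC protocol ResourceTracker is created with ServerRMProxy, whereas the proxy for ApplicationMasterProtocol earlier was created with ClientRMProxy.

Both are subclasses of RMProxy (class diagram not included):

ClientRMProxy proxies ApplicationClientProtocol, ApplicationMasterProtocol, and ResourceManagerAdministrationProtocol. It connects YARN clients and AMs to the RM.

ServerRMProxy is used by the NM to connect to the RM. It proxies ResourceTracker.

The proto file of the ResourceTracker protocol is as follows: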

// hadoop-2.5.2-src\hadoop-yarn-project\hadoop-yarn\hadoop-yarn-server\
// hadoop-yarn-server-common\src\main\proto\ResourceTracker.proto
option java_package = "org.apache.hadoop.yarn.proto";
option java_outer_classname = "ResourceTracker";
option java_generic_services = true;
option java_generate_equals_and_hash = true;
package hadoop.yarn;

import "yarn_server_common_service_protos.proto";

service ResourceTrackerService {
  rpc registerNodeManager(RegisterNodeManagerRequestProto) returns (RegisterNodeManagerResponseProto);
  rpc nodeHeartbeat(NodeHeartbeatRequestProto) returns (NodeHeartbeatResponseProto);
}

The implementation class behind this.resourceTracker is ResourceTrackerPBClientImpl.

After this.resourceTracker has been created, serviceStart goes on to call startStatusUpdater():

// NodeStatusUpdaterImpl.java
protected void startStatusUpdater() {

  statusUpdaterRunnable = new Runnable() {
    @Override
    @SuppressWarnings("unchecked")
    public void run() {
      int lastHeartBeatID = 0;
      while (!isStopped) {
        // Send heartbeat
        try {
          NodeHeartbeatResponse response = null;
          NodeStatus nodeStatus = getNodeStatus(lastHeartBeatID);

          NodeHeartbeatRequest request =
              NodeHeartbeatRequest.newInstance(nodeStatus,
                NodeStatusUpdaterImpl.this.context
                  .getContainerTokenSecretManager().getCurrentKey(),
                NodeStatusUpdaterImpl.this.context.getNMTokenSecretManager()
                  .getCurrentKey());

          LOG.info("resourceTracker= " + resourceTracker.toString());
          response = resourceTracker.nodeHeartbeat(request);
          // get next heartbeat interval from response
          nextHeartBeatInterval = response.getNextHeartBeatInterval();
          updateMasterKeys(response);
          ...
        }
      }
      ...
  };
  statusUpdater =
      new Thread(statusUpdaterRunnable, "Node Status Updater");
  statusUpdater.start();
}

This defines a Runnable, statusUpdaterRunnable, whose loop sends the heartbeat:

response = resourceTracker.nodeHeartbeat(request);

Finally, the thread statusUpdater is created from that Runnable and started:

statusUpdater = new Thread(statusUpdaterRunnable, "Node Status Updater");
statusUpdater.start();

Because this.resourceTracker is backed by ResourceTrackerPBClientImpl, the call enters its nodeHeartbeat:

// ResourceTrackerPBClientImpl.java
public NodeHeartbeatResponse nodeHeartbeat(NodeHeartbeatRequest request)
    throws YarnException, IOException {
  NodeHeartbeatRequestProto requestProto = ((NodeHeartbeatRequestPBImpl)request).getProto();
  try {
    return new NodeHeartbeatResponsePBImpl(proxy.nodeHeartbeat(null, requestProto));
  } catch (ServiceException e) {
    RPCUtil.unwrapAndThrowException(e);
    return null;
  }
}

Here proxy is created as follows:

// ResourceTrackerPBClientImpl.java
proxy = (ResourceTrackerPB)RPC.getProxy(
    ResourceTrackerPB.class, clientVersion, addr, conf);

The static method RPC.getProxy constructs the client-side proxy object, and methods on the remote side are then invoked directly through that object.

See also: Hadoop源码解析之RPC协议 (Hadoop Source Code Analysis: The RPC Protocol)

Step 2: In the RM, ResourceTrackerService handles requests from the NM. On receiving a heartbeat, it sends an RMNodeEventType.STATUS_UPDATE event to RMNodeImpl:

// ResourceTrackerService.java
public NodeHeartbeatResponse nodeHeartbeat(NodeHeartbeatRequest request)
    throws YarnException, IOException {

  NodeStatus remoteNodeStatus = request.getNodeStatus();
  /**
   * Here is the node heartbeat sequence...
   * 1. Check if it's a registered node
   * 2. Check if it's a valid (i.e. not excluded) node
   * 3. Check if it's a 'fresh' heartbeat i.e. not duplicate heartbeat
   * 4. Send healthStatus to RMNode
   */

  NodeId nodeId = remoteNodeStatus.getNodeId();

  // 1. Check if it's a registered node
  RMNode rmNode = this.rmContext.getRMNodes().get(nodeId);
  if (rmNode == null) {
    /* node does not exist */
    String message = "Node not found resyncing " + remoteNodeStatus.getNodeId();
    LOG.info(message);
    resync.setDiagnosticsMessage(message);
    return resync;
  }

  // Send ping
  this.nmLivelinessMonitor.receivedPing(nodeId);

  // 2. Check if it's a valid (i.e. not excluded) node
  if (!this.nodesListManager.isValidNode(rmNode.getHostName())) {
    String message =
        "Disallowed NodeManager nodeId: " + nodeId + " hostname: "
            + rmNode.getNodeAddress();
    LOG.info(message);
    shutDown.setDiagnosticsMessage(message);
    this.rmContext.getDispatcher().getEventHandler().handle(
        new RMNodeEvent(nodeId, RMNodeEventType.DECOMMISSION));
    return shutDown;
  }

  // 3. Check if it's a 'fresh' heartbeat i.e. not duplicate heartbeat
  NodeHeartbeatResponse lastNodeHeartbeatResponse = rmNode.getLastNodeHeartBeatResponse();
  if (remoteNodeStatus.getResponseId() + 1 == lastNodeHeartbeatResponse
      .getResponseId()) {
    LOG.info("Received duplicate heartbeat from node "
        + rmNode.getNodeAddress());
    return lastNodeHeartbeatResponse;
  } else if (remoteNodeStatus.getResponseId() + 1 < lastNodeHeartbeatResponse
      .getResponseId()) {
    String message =
        "Too far behind rm response id:"
            + lastNodeHeartbeatResponse.getResponseId() + " nm response id:"
            + remoteNodeStatus.getResponseId();
    LOG.info(message);
    resync.setDiagnosticsMessage(message);
    // TODO: Just sending reboot is not enough. Think more.
    this.rmContext.getDispatcher().getEventHandler().handle(
        new RMNodeEvent(nodeId, RMNodeEventType.REBOOTING));
    return resync;
  }

  // Heartbeat response
  NodeHeartbeatResponse nodeHeartBeatResponse = YarnServerBuilderUtils
      .newNodeHeartbeatResponse(lastNodeHeartbeatResponse.
          getResponseId() + 1, NodeAction.NORMAL, null, null, null, null,
          nextHeartBeatInterval);
  rmNode.updateNodeHeartbeatResponseForCleanup(nodeHeartBeatResponse);

  populateKeys(request, nodeHeartBeatResponse);

  // 4. Send status to RMNode, saving the latest response.
  this.rmContext.getDispatcher().getEventHandler().handle(
      new RMNodeStatusEvent(nodeId, remoteNodeStatus.getNodeHealthStatus(),
          remoteNodeStatus.getContainersStatuses(),
          remoteNodeStatus.getKeepAliveApplications(), nodeHeartBeatResponse));

  return nodeHeartBeatResponse;
}
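The freshness check in step 3 of that function reduces to comparing two integers: the response id echoed by the NM and the id of the last response the RM issued. The sketch below isolates that comparison in the same order as the excerpt. The class HeartbeatIdSketch and the Outcome enum are hypothetical names introduced for illustration.

```java
final class HeartbeatIdSketch {
    enum Outcome { DUPLICATE, RESYNC, NORMAL }

    /** Same comparison order as the excerpt: duplicate first, then "too far behind". */
    static Outcome classify(int nmResponseId, int rmLastResponseId) {
        if (nmResponseId + 1 == rmLastResponseId) {
            // The NM has not seen the last response yet: resend it.
            return Outcome.DUPLICATE;
        } else if (nmResponseId + 1 < rmLastResponseId) {
            // The NM is more than one response behind: ask it to resync.
            return Outcome.RESYNC;
        }
        // Otherwise the heartbeat is fresh and gets a response with id last + 1.
        return Outcome.NORMAL;
    }

    public static void main(String[] args) {
        System.out.println(classify(5, 5)); // prints NORMAL
        System.out.println(classify(4, 5)); // prints DUPLICATE
        System.out.println(classify(2, 5)); // prints RESYNC
    }
}
```

Returning the cached response for a duplicate makes the heartbeat idempotent: if a response is lost on the network, the NM can retry without the RM processing the same status twice.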

On receiving this event, RMNodeImpl updates the running state of each Container and sends a SchedulerEventType.NODE_UPDATE event to CapacityScheduler.

// RMNodeImpl.java
// Transitions from RUNNING state
.addTransition(NodeState.RUNNING,
    EnumSet.of(NodeState.RUNNING, NodeState.UNHEALTHY),
    RMNodeEventType.STATUS_UPDATE, new StatusUpdateWhenHealthyTransition())

public static class StatusUpdateWhenHealthyTransition implements
    MultipleArcTransition<RMNodeImpl, RMNodeEvent, NodeState> {
  @Override
  public NodeState transition(RMNodeImpl rmNode, RMNodeEvent event) {

    RMNodeStatusEvent statusEvent = (RMNodeStatusEvent) event;

    // Switch the last heartbeatresponse.
    rmNode.latestNodeHeartBeatResponse = statusEvent.getLatestResponse();
    ...
      // Process running containers
      if (remoteContainer.getState() == ContainerState.RUNNING) {
        if (!rmNode.justLaunchedContainers.containsKey(containerId)) {
          // Just launched container. RM knows about it the first time.
          rmNode.justLaunchedContainers.put(containerId, remoteContainer);
          newlyLaunchedContainers.add(remoteContainer);
        }
      } else {
        // A finished container
        rmNode.justLaunchedContainers.remove(containerId);
        completedContainers.add(remoteContainer);
      }
    }
    ...
    if (rmNode.nextHeartBeat) {
      rmNode.nextHeartBeat = false;
      rmNode.context.getDispatcher().getEventHandler().handle(
          new NodeUpdateSchedulerEvent(rmNode));
    }

    // Update DTRenewer in secure mode to keep these apps alive. Today this is
    // needed for log-aggregation to finish long after the apps are gone.
    if (UserGroupInformation.isSecurityEnabled()) {
      rmNode.context.getDelegationTokenRenewer().updateKeepAliveApplications(
        statusEvent.getKeepAliveAppIds());
    }

    return NodeState.RUNNING;
  }
}
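The container bookkeeping in this transition sorts every reported Container into one of two lists. A minimal sketch of that classification follows, assuming string container ids and a two-value state enum. The class ContainerReportSketch is hypothetical, and a plain Set replaces the justLaunchedContainers map.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class ContainerReportSketch {
    enum ContainerState { RUNNING, COMPLETE }

    // Containers already reported as running in an earlier heartbeat.
    private final Set<String> justLaunched = new HashSet<>();
    final List<String> newlyLaunched = new ArrayList<>();
    final List<String> completed = new ArrayList<>();

    /** Processes one heartbeat's container report (container id -> state). */
    void process(Map<String, ContainerState> report) {
        newlyLaunched.clear();
        completed.clear();
        for (Map.Entry<String, ContainerState> entry : report.entrySet()) {
            String id = entry.getKey();
            if (entry.getValue() == ContainerState.RUNNING) {
                // Set.add returns true only the first time the id is seen.
                if (justLaunched.add(id)) {
                    newlyLaunched.add(id);
                }
            } else {
                justLaunched.remove(id);
                completed.add(id);
            }
        }
    }
}
```

A Container that keeps running is therefore announced as newly launched exactly once, no matter how many heartbeats report it.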

Step 3: CapacityScheduler handles the SchedulerEventType.NODE_UPDATE event as follows:

// CapacityScheduler.java
public void handle(SchedulerEvent event) {
  switch(event.getType()) {
  ...
  case NODE_UPDATE:
  {
    NodeUpdateSchedulerEvent nodeUpdatedEvent = (NodeUpdateSchedulerEvent)event;
    RMNode node = nodeUpdatedEvent.getRMNode();
    nodeUpdate(node);
    if (!scheduleAsynchronously) {
      allocateContainersToNode(getNode(node.getNodeID()));
    }
  }
  break;
  ...
  }
}

It calls nodeUpdate and, unless asynchronous scheduling is enabled, allocateContainersToNode. The allocated resources are recorded in the corresponding data structures, where they wait for the AM's next heartbeat to pick them up.
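The effect of the scheduleAsynchronously flag can be seen in a small dispatch sketch. This is a hypothetical simplification, not CapacityScheduler itself: NodeUpdateDispatchSketch, its two event types, and the log strings are invented for illustration, and the two calls are only recorded instead of doing any scheduling.

```java
import java.util.ArrayList;
import java.util.List;

final class NodeUpdateDispatchSketch {
    enum EventType { NODE_ADDED, NODE_UPDATE }

    final List<String> log = new ArrayList<>();
    private final boolean scheduleAsynchronously;

    NodeUpdateDispatchSketch(boolean scheduleAsynchronously) {
        this.scheduleAsynchronously = scheduleAsynchronously;
    }

    /** Dispatches on the event type, as the handle function in the excerpt does. */
    void handle(EventType type, String nodeId) {
        switch (type) {
            case NODE_UPDATE:
                log.add("nodeUpdate:" + nodeId);
                if (!scheduleAsynchronously) {
                    // Synchronous mode: allocate as part of handling the heartbeat.
                    log.add("allocate:" + nodeId);
                }
                break;
            default:
                log.add("other:" + type);
                break;
        }
    }
}
```

In synchronous mode each node heartbeat directly drives one allocation attempt on that node, which ties scheduling latency to the heartbeat interval.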

0 0