Translation of the official documentation [how Hadoop YARN works]

Source: Internet  Editor: 程序博客网  Time: 2024/05/23 22:02

It is slow going, but translating this by hand helps it sink in. I may not remember every detail afterwards, but the ideas stick.

Hadoop MapReduce Next Generation - Writing YARN Applications

  • Hadoop MapReduce Next Generation - Writing YARN Applications
    • Purpose
    • Concepts and Flow
    • Interfaces
    • Writing a Simple Yarn Application
      • Writing a simple Client
      • Writing an ApplicationMaster
    • FAQ
      • How can I distribute my application's jars to all of the nodes in the YARN cluster that need it?
      • How do I get the ApplicationMaster's ApplicationAttemptId?
      • My container is being killed by the Node Manager
      • How do I include native libraries?
    • Useful Links

Purpose

How to implement an application that runs on YARN.

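Concepts and Flow

The general concept is that a client submits an application to the YARN ResourceManager.
The client talks to the ResourceManager over the ApplicationClientProtocol: it first acquires a new ApplicationId via ApplicationClientProtocol#getNewApplication, then submits the application via ApplicationClientProtocol#submitApplication. As part of the submitApplication call, the client must give the ResourceManager enough information to launch the application's first container, i.e. the ApplicationMaster. That includes the local files or jars the application needs in order to run, the actual command to execute (with its command-line arguments), and any Unix environment settings. In effect, you describe the Unix process(es) that must be launched for your ApplicationMaster.

The YARN ResourceManager then launches the ApplicationMaster, as specified, on an allocated container. The ApplicationMaster is expected to communicate with the ResourceManager over the ApplicationMasterProtocol. First, the ApplicationMaster registers itself with the ResourceManager. To complete its work, it then requests and receives containers via ApplicationMasterProtocol#allocate. Once a container is allocated, the ApplicationMaster talks to the NodeManager via ContainerManager#startContainer to launch the task on that node. As part of launching the container, the ApplicationMaster must specify a ContainerLaunchContext which, like the ApplicationSubmissionContext, carries launch information such as the command line and the environment. Once the task is complete, the ApplicationMaster notifies the ResourceManager via ApplicationMasterProtocol#finishApplicationMaster.

Meanwhile, the client can monitor the application's status by querying the ResourceManager or, if the ApplicationMaster supports it, by querying the ApplicationMaster directly. If needed, the client can also kill the application via ApplicationClientProtocol#forceKillApplication.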

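Interfaces

  • ApplicationClientProtocol - Client<-->ResourceManager
    The protocol a client uses to talk to the ResourceManager: to launch a new application, check on its status, or kill it.
  • ApplicationMasterProtocol - ApplicationMaster<-->ResourceManager
    The protocol the ApplicationMaster uses to register and unregister itself with the ResourceManager, to request resources, and to report completion.
  • ContainerManager - ApplicationMaster<-->NodeManager
    The protocol the ApplicationMaster uses to talk to a NodeManager: to start or stop containers and to get container status updates.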

Writing a Simple Yarn Application

Writing a simple Client

  • The first step for a client is to connect to the ResourceManager or, more specifically, to the ApplicationsManager (AsM) interface of the ResourceManager.

    ApplicationClientProtocol applicationsManager;
    YarnConfiguration yarnConf = new YarnConfiguration(conf);
    InetSocketAddress rmAddress =
        NetUtils.createSocketAddr(yarnConf.get(
            YarnConfiguration.RM_ADDRESS,
            YarnConfiguration.DEFAULT_RM_ADDRESS));
    LOG.info("Connecting to ResourceManager at " + rmAddress);
    Configuration appsManagerServerConf = new Configuration(conf);
    appsManagerServerConf.setClass(
        YarnConfiguration.YARN_SECURITY_INFO,
        ClientRMSecurityInfo.class, SecurityInfo.class);
    applicationsManager = ((ApplicationClientProtocol) rpc.getProxy(
        ApplicationClientProtocol.class, rmAddress, appsManagerServerConf));

  • Once it holds a handle to the ASM, the client asks the ResourceManager for a new ApplicationId.

    GetNewApplicationRequest request =
        Records.newRecord(GetNewApplicationRequest.class);
    GetNewApplicationResponse response =
        applicationsManager.getNewApplication(request);
    LOG.info("Got new ApplicationId=" + response.getApplicationId());

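  • The response from the ASM also contains information about the cluster, such as its minimum and maximum resource capabilities. You need this to set suitable specifications for the container in which the ApplicationMaster will be launched. See GetNewApplicationResponse for details.
  • The client's main task is to build the ApplicationSubmissionContext, which defines everything the ResourceManager needs to launch the ApplicationMaster:
    • Application info: id, name
    • Queue and priority: the queue the application is submitted to and the priority assigned to it.
    • User: the user submitting the application
    • ContainerLaunchContext: describes the container in which the ApplicationMaster is launched and run. It defines everything needed to run the ApplicationMaster, such as local resources (binaries, jars, files etc.), security tokens, environment settings (CLASSPATH etc.) and the command to execute.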

    // Create a new ApplicationSubmissionContext
    ApplicationSubmissionContext appContext =
        Records.newRecord(ApplicationSubmissionContext.class);
    // Set the ApplicationId
    appContext.setApplicationId(appId);
    // Set the application name
    appContext.setApplicationName(appName);

    // Create a new container launch context for the AM's container
    ContainerLaunchContext amContainer =
        Records.newRecord(ContainerLaunchContext.class);

    // Define the local resources required
    Map<String, LocalResource> localResources =
        new HashMap<String, LocalResource>();
    // Assume the ApplicationMaster jar is available at a known HDFS path
    Path jarPath; // <- known path to the jar file
    FileStatus jarStatus = fs.getFileStatus(jarPath);
    LocalResource amJarRsrc = Records.newRecord(LocalResource.class);
    // Set the type of resource: file or archive
    // (archives are unpacked at the destination by the framework)
    amJarRsrc.setType(LocalResourceType.FILE);
    // Set the visibility of the resource
    amJarRsrc.setVisibility(LocalResourceVisibility.APPLICATION);
    // Set the location of the resource to be copied into the working directory
    amJarRsrc.setResource(ConverterUtils.getYarnUrlFromPath(jarPath));
    // Set the timestamp and length of the file so that the framework can
    // check, after copying, that it is the resource the client intended
    amJarRsrc.setTimestamp(jarStatus.getModificationTime());
    amJarRsrc.setSize(jarStatus.getLen());
    // The framework creates a symlink called AppMaster.jar in the working
    // directory, pointing at the actual file. The ApplicationMaster refers
    // to the jar by this symlink name.
    localResources.put("AppMaster.jar", amJarRsrc);
    // Set the local resources into the launch context
    amContainer.setLocalResources(localResources);

    // Set up the environment needed for the launch context
    Map<String, String> env = new HashMap<String, String>();
    // For example, we could set up the classpath.
    // Assuming our classes or jars sit in the working directory from which
    // the command runs, append "." to the path.
    // By default all Hadoop-specific classpaths are already in $CLASSPATH,
    // so be careful not to overwrite it.
    String classPathEnv = "$CLASSPATH:./*:";
    env.put("CLASSPATH", classPathEnv);
    amContainer.setEnvironment(env);

    // Construct the command to be executed on the launched container
    String command =
        "${JAVA_HOME}" + "/bin/java" +
        " MyAppMaster" +
        " arg1 arg2 arg3" +
        " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout" +
        " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr";

    List<String> commands = new ArrayList<String>();
    commands.add(command);
    // Add additional commands if needed

    // Set the command list into the container spec
    amContainer.setCommands(commands);

    // Define the resource requirements for the container.
    // For now, YARN only supports memory, so we set the memory requirement.
    // If the process uses more than its allocated memory, the framework kills it.
    // The memory asked for should be less than the max capability of the
    // cluster, and every ask should be a multiple of the min capability.
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(amMemory);
    amContainer.setResource(capability);

    // Set the container launch context into the ApplicationSubmissionContext
    appContext.setAMContainerSpec(amContainer);

  • Once setup is complete, the client is ready to submit the application to the ASM.

    // Create the request to send to the ApplicationsManager
    SubmitApplicationRequest appRequest =
        Records.newRecord(SubmitApplicationRequest.class);
    appRequest.setApplicationSubmissionContext(appContext);

    // Submit the application to the ApplicationsManager.
    // The response can be ignored: success returns a valid response object,
    // failure throws an exception.
    applicationsManager.submitApplication(appRequest);
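The launch command in the listing above is just a shell string whose stdout and stderr are redirected into the container's log directory. Here is a minimal sketch of assembling such a string with plain Java. The class `LaunchCommand` and its `build` method are hypothetical names for this illustration, and the literal `<LOG_DIR>` is assumed to be the placeholder behind ApplicationConstants.LOG_DIR_EXPANSION_VAR, which the NodeManager expands at launch time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LaunchCommand {
    // Assumption: stands in for ApplicationConstants.LOG_DIR_EXPANSION_VAR.
    static final String LOG_DIR = "<LOG_DIR>";

    /** Joins the JVM path, main class and arguments, then appends log redirection. */
    static String build(String mainClass, List<String> args) {
        List<String> parts = new ArrayList<>();
        parts.add("${JAVA_HOME}/bin/java");
        parts.add(mainClass);
        parts.addAll(args);
        parts.add("1>" + LOG_DIR + "/stdout");
        parts.add("2>" + LOG_DIR + "/stderr");
        return String.join(" ", parts);
    }

    public static void main(String[] unused) {
        System.out.println(build("MyAppMaster", Arrays.asList("arg1", "arg2", "arg3")));
    }
}
```

Building the command from a list of parts makes it harder to drop a space or a quote than hand-written string concatenation.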

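  • At this point the ResourceManager has accepted the application. In the background it allocates a container with the required specifications and then sets up and launches the ApplicationMaster on that container.
  • The client can track the progress of the actual task in several ways.
    • It can ask the ResourceManager for a report of the application via ApplicationClientProtocol#getApplicationReport.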

    GetApplicationReportRequest reportRequest =
        Records.newRecord(GetApplicationReportRequest.class);
    reportRequest.setApplicationId(appId);
    GetApplicationReportResponse reportResponse =
        applicationsManager.getApplicationReport(reportRequest);
    ApplicationReport report = reportResponse.getApplicationReport();

The ApplicationReport received from the ResourceManager contains the following:

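      • General application information: the ApplicationId, the queue the application was submitted to, the user who submitted it, and its start time.
      • ApplicationMaster details: the host the ApplicationMaster runs on, the rpc port (if any) on which it listens for client requests, and a token the client needs in order to communicate with the ApplicationMaster.
      • Application tracking information: if the application supports progress tracking, it can set a tracking url, available via ApplicationReport#getTrackingUrl, that a client can use to monitor progress.
      • Application status: the state of the application as seen by the ResourceManager is available via ApplicationReport#getYarnApplicationState. Once the application has finished, check ApplicationReport#getFinalApplicationStatus for the actual success or failure of the task itself. On failure, ApplicationReport#getDiagnostics may shed more light on the cause.
    • If the ApplicationMaster supports it, the client can query the ApplicationMaster directly using the host:rpcport obtained from the report.
  • In certain situations, for example when the application is taking too long, the client may want to kill it. The forceKillApplication call of ApplicationClientProtocol lets a client send a kill signal to the ApplicationMaster via the ResourceManager. An ApplicationMaster may also be designed to accept an abort call directly over its rpc port.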

    KillApplicationRequest killRequest =
        Records.newRecord(KillApplicationRequest.class);
    killRequest.setApplicationId(appId);
    applicationsManager.forceKillApplication(killRequest);
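Taken together, the report and kill calls suggest a simple client loop: poll the application state until it is terminal, and kill the application if it runs past a deadline. The following is a rough sketch of that loop in plain Java, with no YARN dependency. `AppMonitor`, `State` and `waitOrKill` are hypothetical names, `State` is a reduced stand-in for YARN's YarnApplicationState, and the `Supplier` and `Runnable` parameters stand in for the getApplicationReport and forceKillApplication calls shown above.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.Iterator;
import java.util.Set;
import java.util.function.Supplier;

final class AppMonitor {
    // Reduced stand-in for YarnApplicationState (assumption for this sketch).
    enum State { ACCEPTED, RUNNING, FINISHED, FAILED, KILLED }

    static final Set<State> TERMINAL =
        EnumSet.of(State.FINISHED, State.FAILED, State.KILLED);

    /**
     * Polls report up to maxPolls times, sleeping sleepMs between polls.
     * Returns the first terminal state seen; if none is seen (or the wait
     * is interrupted), runs the kill action and returns KILLED.
     */
    static State waitOrKill(Supplier<State> report, Runnable kill,
                            int maxPolls, long sleepMs) {
        for (int i = 0; i < maxPolls; i++) {
            State s = report.get();
            if (TERMINAL.contains(s)) {
                return s;
            }
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break; // stop waiting and fall through to the kill
            }
        }
        kill.run();
        return State.KILLED;
    }

    public static void main(String[] args) {
        // A canned sequence of states instead of a real ResourceManager.
        Iterator<State> fake =
            Arrays.asList(State.ACCEPTED, State.RUNNING, State.FINISHED).iterator();
        State end = waitOrKill(fake::next, () -> System.out.println("kill sent"), 10, 100L);
        System.out.println("final state: " + end);
    }
}
```

In a real client the supplier would wrap getApplicationReport and the kill action would wrap forceKillApplication.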

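Writing an ApplicationMaster

  • The ApplicationMaster is the actual owner of the job. It is launched by the ResourceManager, and through the client it is given all the information and resources it needs for the job it oversees.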
  • As the ApplicationMaster is launched within a container that may (likely will) be sharing a physical host with other containers, given the multi-tenancy nature, amongst other issues, it cannot make any assumptions of things like pre-configured ports that it can listen on.
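  • When the ApplicationMaster starts up, several parameters are made available to it through the environment. These include the ContainerId of the ApplicationMaster container, the application submission time, and details about the NodeManager host running the ApplicationMaster. See ApplicationConstants for the parameter names.
  • Every interaction with the ResourceManager requires an ApplicationAttemptId (an application can have several attempts, because attempts may fail). The ApplicationAttemptId can be obtained from the ApplicationMaster's ContainerId. Helper APIs convert the values read from the environment into objects.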

    Map<String, String> envs = System.getenv();
    String containerIdString =
        envs.get(ApplicationConstants.AM_CONTAINER_ID_ENV);
    if (containerIdString == null) {
      // the container id is required
      throw new IllegalArgumentException(
          "ContainerId not set in the environment");
    }
    ContainerId containerId = ConverterUtils.toContainerId(containerIdString);
    ApplicationAttemptId appAttemptID = containerId.getApplicationAttemptId();
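To see why the attempt id can be derived from the container id, it helps to look at the container id string itself. The sketch below assumes the textual layout `container_<clusterTimestamp>_<appNumber>_<attemptNumber>_<containerNumber>`, optionally with an extra epoch field right after the prefix; that layout is an assumption made for this illustration, and `ContainerIdSketch` is a hypothetical class. Real code should keep using ConverterUtils.toContainerId as shown above.

```java
import java.util.Arrays;

final class ContainerIdSketch {
    /**
     * Returns {clusterTimestamp, appNumber, attemptNumber, containerNumber}
     * parsed from the last four underscore-separated fields.
     */
    static long[] parse(String containerId) {
        String[] p = containerId.split("_");
        if (p.length < 5 || !"container".equals(p[0])) {
            throw new IllegalArgumentException("Not a container id: " + containerId);
        }
        int off = p.length - 4; // skips an optional epoch field after the prefix
        long[] out = new long[4];
        for (int i = 0; i < 4; i++) {
            out[i] = Long.parseLong(p[off + i]);
        }
        return out;
    }

    public static void main(String[] args) {
        // Example id made up for this sketch.
        long[] f = parse("container_1410901177871_0001_01_000005");
        System.out.println(Arrays.toString(f));
        System.out.println("attempt number = " + f[2]);
    }
}
```

The first three fields identify the application attempt; only the last field is specific to the container.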

  • After the ApplicationMaster has fully initialized itself, it registers with the ResourceManager via ApplicationMasterProtocol#registerApplicationMaster. The ApplicationMaster always communicates through the Scheduler interface of the ResourceManager.

    // Connect to the Scheduler of the ResourceManager
    YarnConfiguration yarnConf = new YarnConfiguration(conf);
    InetSocketAddress rmAddress =
        NetUtils.createSocketAddr(yarnConf.get(
            YarnConfiguration.RM_SCHEDULER_ADDRESS,
            YarnConfiguration.DEFAULT_RM_SCHEDULER_ADDRESS));
    LOG.info("Connecting to ResourceManager at " + rmAddress);
    ApplicationMasterProtocol resourceManager =
        (ApplicationMasterProtocol) rpc.getProxy(ApplicationMasterProtocol.class, rmAddress, conf);

    // Register the AM with the RM.
    // Set the required info into the registration request:
    // ApplicationAttemptId,
    // host on which the ApplicationMaster is running,
    // rpc port on which the ApplicationMaster accepts client requests,
    // tracking url for the client to track application progress
    RegisterApplicationMasterRequest appMasterRequest =
        Records.newRecord(RegisterApplicationMasterRequest.class);
    appMasterRequest.setApplicationAttemptId(appAttemptID);
    appMasterRequest.setHost(appMasterHostname);
    appMasterRequest.setRpcPort(appMasterRpcPort);
    appMasterRequest.setTrackingUrl(appMasterTrackingUrl);

    // The registration response is useful because it describes the cluster.
    // Like GetNewApplicationResponse on the client side, it carries the min
    // and max resource capabilities of the cluster, which the
    // ApplicationMaster needs when requesting containers.
    RegisterApplicationMasterResponse response =
        resourceManager.registerApplicationMaster(appMasterRequest);

  • The ApplicationMaster must send periodic heartbeats to the ResourceManager to signal that it is alive and running. The ResourceManager's expiry timeout is set by YarnConfiguration.RM_AM_EXPIRY_INTERVAL_MS, with the default given by YarnConfiguration.DEFAULT_RM_AM_EXPIRY_INTERVAL_MS. Calls to ApplicationMasterProtocol#allocate count as heartbeats and also carry progress reports.
  • Based on the task requirements, the ApplicationMaster can ask for a set of containers to run its tasks on. It must use the ResourceRequest class to define the following container specifications:
    • Hostname: If containers are required to be hosted on a particular rack or a specific host. '*' is a special value that implies any host will do.
    • Resource capability: Currently, YARN only supports memory-based resource requirements, so the request should define how much memory is needed. The value is defined in MB and has to be less than the max capability of the cluster and an exact multiple of the min capability. Memory resources correspond to physical memory limits imposed on the task containers.
    • Priority: When asking for several sets of containers, the ApplicationMaster can give each set a different priority. For example, the Map-Reduce ApplicationMaster may assign a higher priority to containers for Map tasks and a lower priority to containers for Reduce tasks.

    // Resource request
    ResourceRequest rsrcRequest = Records.newRecord(ResourceRequest.class);

    // Set the requirements for hosts:
    // whether a particular rack/host is needed.
    // Useful for applications that are sensitive to data locality.
    rsrcRequest.setHostName("*");

    // Set the priority for the request
    Priority pri = Records.newRecord(Priority.class);
    pri.setPriority(requestPriority);
    rsrcRequest.setPriority(pri);

    // Set the resource type requirements.
    // For now only memory is supported, so we only ask for memory.
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(containerMemory);
    rsrcRequest.setCapability(capability);

    // Set the number of containers needed,
    // matching the specifications above
    rsrcRequest.setNumContainers(numContainers);
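The memory rule for a resource request (a value in MB that is a multiple of the cluster minimum and stays within the cluster maximum) is easy to get wrong by hand. One way to apply it before calling setMemory is sketched below. `MemoryRequest.normalize` is a hypothetical helper, not a YARN API, and rounding up rather than down is a choice made for this sketch.

```java
final class MemoryRequest {
    /**
     * Rounds requestedMb up to the next multiple of minMb and rejects
     * requests that would exceed maxMb after rounding.
     */
    static int normalize(int requestedMb, int minMb, int maxMb) {
        if (requestedMb <= 0 || minMb <= 0 || maxMb < minMb) {
            throw new IllegalArgumentException("Invalid sizes");
        }
        int rounded = ((requestedMb + minMb - 1) / minMb) * minMb;
        if (rounded > maxMb) {
            throw new IllegalArgumentException(
                rounded + " MB exceeds the cluster maximum of " + maxMb + " MB");
        }
        return rounded;
    }

    public static void main(String[] args) {
        // Example capabilities; real values come from the registration response.
        System.out.println(normalize(1500, 1024, 8192));
    }
}
```

The minimum and maximum would come from the GetNewApplicationResponse or RegisterApplicationMasterResponse described earlier.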

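  • After defining the container requirements, the ApplicationMaster builds an AllocateRequest to send to the ResourceManager. The AllocateRequest consists of:
    • Requested containers: the container specifications and the number of containers the ApplicationMaster is requesting from the ResourceManager.
    • Released containers: the ApplicationMaster may have requested more containers than it needs, or may decide after failures not to use some that were allocated to it. In those cases it helps the cluster if the ApplicationMaster releases those containers back to the ResourceManager so they can be re-allocated to other applications.
    • ResponseId: the response id that will be sent back in the response from the allocate call.
    • Progress update information: the ApplicationMaster can send its progress (a number between 0 and 1) to the ResourceManager.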

    List<ResourceRequest> requestedContainers;
    List<ContainerId> releasedContainers;
    AllocateRequest req = Records.newRecord(AllocateRequest.class);

    // The response id set here is sent back in the response, so the
    // ApplicationMaster can match the response to this request.
    req.setResponseId(rmRequestID);

    // Set the ApplicationAttemptId
    req.setApplicationAttemptId(appAttemptID);

    // Add the list of containers being asked for
    req.addAllAsks(requestedContainers);

    // If the ApplicationMaster no longer needs certain containers,
    // it releases them back to the ResourceManager
    req.addAllReleases(releasedContainers);

    // Assuming the ApplicationMaster can track its progress
    req.setProgress(currentProgress);

    AllocateResponse allocateResponse = resourceManager.allocate(req);
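Because allocate calls double as heartbeats, an ApplicationMaster typically sends them on a timer whose period is well inside the ResourceManager's expiry interval. The sketch below shows one way to schedule that with the JDK's ScheduledExecutorService. `HeartbeatSketch` is a hypothetical class, the one-tenth ratio is an arbitrary choice for illustration, and the `Runnable` stands in for building and sending an AllocateRequest as shown above.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class HeartbeatSketch {
    /** Picks a period well inside the expiry window (one tenth, by assumption). */
    static long periodMs(long expiryIntervalMs) {
        return Math.max(1L, expiryIntervalMs / 10);
    }

    /** Runs beat at a fixed rate for about runForMs, then stops; returns the beat count. */
    static int run(Runnable beat, long periodMs, long runForMs) {
        ScheduledExecutorService ses = Executors.newSingleThreadScheduledExecutor();
        AtomicInteger count = new AtomicInteger();
        ses.scheduleAtFixedRate(() -> {
            beat.run();              // in a real AM: resourceManager.allocate(req)
            count.incrementAndGet();
        }, 0, periodMs, TimeUnit.MILLISECONDS);
        try {
            Thread.sleep(runForMs);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            ses.shutdownNow();
        }
        return count.get();
    }

    public static void main(String[] args) {
        // 600000 ms is only an example expiry value for this sketch.
        System.out.println("period = " + periodMs(600000L) + " ms");
        System.out.println("beats sent = " + run(() -> { }, 50L, 300L));
    }
}
```

A production ApplicationMaster would keep the scheduler running until the job completes instead of stopping after a fixed time.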

       

  • The AllocateResponse sent back by the ResourceManager provides the following information:
    • Reboot flag: for scenarios where the ApplicationMaster may have fallen out of sync with the ResourceManager.
    • Allocated containers: the containers that have been allocated to the ApplicationMaster.
    • Headroom: Headroom for resources in the cluster. Based on this information and knowing its needs, an ApplicationMaster can make intelligent decisions such as re-prioritizing sub-tasks to take advantage of currently allocated containers, bailing out faster if resources are not becoming available etc.
    • Completed containers: Once an ApplicationMaster triggers the launch of an allocated container, it will receive an update from the ResourceManager when the container completes. The ApplicationMaster can look into the status of the completed container and take appropriate actions such as re-trying a particular sub-task in case of a failure.
    • Number of cluster nodes: the number of hosts available to the cluster.

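Note that containers are not allocated to the ApplicationMaster immediately. That does not mean the ApplicationMaster should keep re-sending the request. Once a request has been sent, the cluster will eventually allocate containers to the ApplicationMaster based on cluster capacity, priorities and the scheduling policy in place. The ApplicationMaster should send a new request only if its original estimate has changed and it needs additional containers.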

   

    // Retrieve the list of allocated containers from the response.
    // On each allocated container, assume we launch the same job.
    List<Container> allocatedContainers = allocateResponse.getAllocatedContainers();
    for (Container allocatedContainer : allocatedContainers) {
      LOG.info("Launching shell command on a new container."
          + ", containerId=" + allocatedContainer.getId()
          + ", containerNode=" + allocatedContainer.getNodeId().getHost()
          + ":" + allocatedContainer.getNodeId().getPort()
          + ", containerNodeURI=" + allocatedContainer.getNodeHttpAddress()
          + ", containerState=" + allocatedContainer.getState()
          + ", containerResourceMemory="
          + allocatedContainer.getResource().getMemory());

      // Launch and start the container on a separate thread to keep the main
      // thread unblocked as all containers may not be allocated at one go.
      LaunchContainerRunnable runnableLaunchContainer =
          new LaunchContainerRunnable(allocatedContainer);
      Thread launchThread = new Thread(runnableLaunchContainer);
      launchThreads.add(launchThread);
      launchThread.start();
    }

    // Check the resources currently available in the cluster
    Resource availableResources = allocateResponse.getAvailableResources();
    // Based on this information, the ApplicationMaster can make decisions

    // Check the completed containers.
    // Assume we keep counts of completed, failed and successful containers.
    List<ContainerStatus> completedContainers =
        allocateResponse.getCompletedContainersStatuses();
    for (ContainerStatus containerStatus : completedContainers) {
      LOG.info("Got container status for containerID= "
          + containerStatus.getContainerId()
          + ", state=" + containerStatus.getState()
          + ", exitStatus=" + containerStatus.getExitStatus()
          + ", diagnostics=" + containerStatus.getDiagnostics());

      int exitStatus = containerStatus.getExitStatus();
      if (0 != exitStatus) {
        // container failed
        // -100 is a special case: the container was aborted
        // or hit some other error
        if (-100 != exitStatus) {
          // the application job on the container returned a non-zero
          // exit code; it counts as completed
          numCompletedContainers.incrementAndGet();
          numFailedContainers.incrementAndGet();
        } else {
          // something else went wrong and the job did not complete;
          // we should retry because the container was lost.
          // Decrement the requested count so that an additional
          // container is asked for in the next allocate call.
          numRequestedContainers.decrementAndGet();
          // we do not need to release the container as that has already
          // been done by the ResourceManager/NodeManager.
        }
      } else {
        // nothing to do: the container completed successfully
        numCompletedContainers.incrementAndGet();
        numSuccessfulContainers.incrementAndGet();
      }
    }
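The nested if/else on the exit status in the listing above boils down to a three-way classification. Pulling it into a small function makes the rule easier to test. This is a minimal sketch using only the two special values that appear in the sample (0 and -100); `ExitStatusSketch` and `Outcome` are hypothetical names.

```java
final class ExitStatusSketch {
    enum Outcome { SUCCEEDED, FAILED, ABORTED }

    // Value taken from the sample code above: the container was aborted or lost.
    static final int ABORTED_EXIT_STATUS = -100;

    /** Maps a container exit status to the action category used in the sample. */
    static Outcome classify(int exitStatus) {
        if (exitStatus == 0) {
            return Outcome.SUCCEEDED;   // count as completed and successful
        }
        if (exitStatus == ABORTED_EXIT_STATUS) {
            return Outcome.ABORTED;     // re-request a container and retry
        }
        return Outcome.FAILED;          // count as completed and failed
    }

    public static void main(String[] args) {
        for (int status : new int[] {0, 1, -100}) {
            System.out.println(status + " -> " + classify(status));
        }
    }
}
```

Only the ABORTED case should trigger a new container request; a FAILED container ran the job to the end and returned an error.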

  • Once a container has been allocated to the ApplicationMaster, it needs to follow a process similar to the one the Client followed, setting up the ContainerLaunchContext for the eventual task that will run on the allocated Container. Once the ContainerLaunchContext is defined, the ApplicationMaster can communicate with the ContainerManager to start its allocated container.

    // Assume this is a container obtained from the AllocateResponse
    Container container;
    // Connect to the ContainerManager on the allocated container's node
    String cmIpPortStr = container.getNodeId().getHost() + ":"
        + container.getNodeId().getPort();
    InetSocketAddress cmAddress = NetUtils.createSocketAddr(cmIpPortStr);
    ContainerManager cm =
        (ContainerManager) rpc.getProxy(ContainerManager.class, cmAddress, conf);

    // Now set up a ContainerLaunchContext
    ContainerLaunchContext ctx =
        Records.newRecord(ContainerLaunchContext.class);

    ctx.setContainerId(container.getId());
    ctx.setResource(container.getResource());

    try {
      ctx.setUser(UserGroupInformation.getCurrentUser().getShortUserName());
    } catch (IOException e) {
      LOG.info(
          "Getting current user failed when trying to launch the container: "
          + e.getMessage());
    }

    // Set the environment
    Map<String, String> unixEnv;
    // Setup the required env.
    // Note that the launched container does not inherit the environment of
    // the ApplicationMaster, so every required setting must be set up
    // again for this container.
    ctx.setEnvironment(unixEnv);

    // Set the local resources
    Map<String, LocalResource> localResources =
        new HashMap<String, LocalResource>();
    // Again, the local resources from the ApplicationMaster are not copied over
    // by default to the allocated container. Thus, it is the responsibility
    // of the ApplicationMaster to set up all the necessary local resources
    // needed by the job that will be executed on the allocated container.

    // Assume that we are executing a shell script on the allocated container
    // and the shell script's location in the filesystem is known to us.
    Path shellScriptPath;
    LocalResource shellRsrc = Records.newRecord(LocalResource.class);
    shellRsrc.setType(LocalResourceType.FILE);
    shellRsrc.setVisibility(LocalResourceVisibility.APPLICATION);
    shellRsrc.setResource(
        ConverterUtils.getYarnUrlFromURI(shellScriptPath.toUri()));
    shellRsrc.setTimestamp(shellScriptPathTimestamp);
    shellRsrc.setSize(shellScriptPathLen);
    localResources.put("MyExecShell.sh", shellRsrc);

    ctx.setLocalResources(localResources);

    // Set the command to be executed on the container
    String command = "/bin/sh ./MyExecShell.sh"
        + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
        + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr";

    List<String> commands = new ArrayList<String>();
    commands.add(command);
    ctx.setCommands(commands);

    // Send the start request to the ContainerManager
    StartContainerRequest startReq = Records.newRecord(StartContainerRequest.class);
    startReq.setContainerLaunchContext(ctx);
    cm.startContainer(startReq);

  • The ApplicationMaster, as mentioned previously, will get updates of completed containers as part of the response from the ApplicationMasterProtocol#allocate calls. It can also monitor its launched containers pro-actively by querying the ContainerManager for the status.

    GetContainerStatusRequest statusReq =
        Records.newRecord(GetContainerStatusRequest.class);
    statusReq.setContainerId(container.getId());
    GetContainerStatusResponse statusResp = cm.getContainerStatus(statusReq);
    LOG.info("Container Status"
        + ", id=" + container.getId()
        + ", status=" + statusResp.getStatus());

FAQ

How can I distribute my application's jars to all of the nodes in the YARN cluster that need it?

You can use the LocalResource to add resources to your application request. This will cause YARN to distribute the resource to the ApplicationMaster node. If the resource is a tgz, zip, or jar, you can have YARN unzip it. Then all you need to do is add the unzipped folder to your classpath. For example, when creating your application request:

   

    File packageFile = new File(packagePath);
    Url packageUrl = ConverterUtils.getYarnUrlFromPath(
        FileContext.getFileContext().makeQualified(new Path(packagePath)));

    packageResource.setResource(packageUrl);
    packageResource.setSize(packageFile.length());
    packageResource.setTimestamp(packageFile.lastModified());
    packageResource.setType(LocalResourceType.ARCHIVE);
    packageResource.setVisibility(LocalResourceVisibility.APPLICATION);

    resource.setMemory(memory);
    containerCtx.setResource(resource);
    containerCtx.setCommands(ImmutableList.of(
        "java -cp './package/*' some.class.to.Run "
        + "1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout "
        + "2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
    containerCtx.setLocalResources(
        Collections.singletonMap("package", packageResource));
    appCtx.setApplicationId(appId);
    appCtx.setUser(user.getShortUserName());
    appCtx.setAMContainerSpec(containerCtx);
    request.setApplicationSubmissionContext(appCtx);
    applicationsManager.submitApplication(request);

As you can see, the setLocalResources command takes a map of names to resources. The name becomes a sym link in your application's cwd, so you can just refer to the artifacts inside by using ./package/*.

Note: Java's classpath (cp) argument is VERY sensitive. Make sure you get the syntax EXACTLY correct.

Once your package is distributed to your ApplicationMaster, you'll need to follow the same process whenever your ApplicationMaster starts a new container (assuming you want the resources to be sent to your container). The code for this is the same. You just need to make sure that you give your ApplicationMaster the package path (either HDFS, or local), so that it can send the resource URL along with the container ctx.

How do I get the ApplicationMaster's ApplicationAttemptId?

The ApplicationAttemptId will be passed to the ApplicationMaster via the environment, and the value from the environment can be converted into an ApplicationAttemptId object via the ConverterUtils helper function.

My container is being killed by the Node Manager

This is likely due to high memory usage exceeding your requested container memory size. There are a number of reasons that can cause this. First, look at the process tree that the node manager dumps when it kills your container. The two things you're interested in are physical memory and virtual memory. If you have exceeded physical memory limits, your app is using too much physical memory. If you're running a Java app, you can use -hprof to look at what is taking up space in the heap. If you have exceeded virtual memory, you may need to increase the value of the cluster-wide configuration variable yarn.nodemanager.vmem-pmem-ratio.
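A quick calculation can show whether a container is near its virtual memory ceiling. The sketch below assumes the NodeManager derives the virtual memory limit by multiplying the container's physical memory allocation by yarn.nodemanager.vmem-pmem-ratio. The ratio of 2.1 is only an example value here, so check your cluster's yarn-site.xml for the real setting. `VmemLimit` is a hypothetical helper.

```java
final class VmemLimit {
    /** Virtual memory ceiling in MB for a container, under the stated assumption. */
    static long virtualLimitMb(long containerMb, double vmemPmemRatio) {
        if (containerMb <= 0 || vmemPmemRatio <= 0) {
            throw new IllegalArgumentException("Sizes and ratio must be positive");
        }
        return (long) (containerMb * vmemPmemRatio);
    }

    public static void main(String[] args) {
        long containerMb = 1024;   // memory requested for the container
        double ratio = 2.1;        // example value of yarn.nodemanager.vmem-pmem-ratio
        System.out.println("virtual memory limit (MB): "
            + virtualLimitMb(containerMb, ratio));
    }
}
```

If the process tree dump shows virtual memory above this figure while physical memory is within bounds, raising the ratio is the relevant fix rather than requesting a larger container.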

How do I include native libraries?

Setting -Djava.library.path on the command line while launching a container can cause native libraries used by Hadoop to not be loaded correctly and can result in errors. It is cleaner to use LD_LIBRARY_PATH instead.


    If you spot any mistakes, please point them out!
    Any suggestions will be appreciated!