Hadoop: The Definitive Guide (4th Edition) Key Points Translation (6): Chapter 4. YARN (1)


Chapter 4. YARN
Apache YARN (Yet Another Resource Negotiator) is Hadoop’s cluster resource management system. YARN was introduced in Hadoop 2 to improve the MapReduce implementation, but it is general enough to support other distributed computing paradigms as well.
YARN provides APIs for requesting and working with cluster resources, but these APIs are not typically used directly by user code. Instead, users write to higher-level APIs provided by distributed computing frameworks, which themselves are built on YARN and hide the resource management details from the user. The situation is illustrated in Figure 4-1, which shows some distributed computing frameworks (MapReduce, Spark, and so on) running as YARN applications on the cluster compute layer (YARN) and the cluster storage layer (HDFS and HBase).
[Figure 4-1. YARN applications]
There is also a layer of applications that build on the frameworks shown in Figure 4-1. Pig, Hive, and Crunch are all examples of processing frameworks that run on MapReduce, Spark, or Tez (or on all three), and don’t interact with YARN directly.
1) Anatomy of a YARN Application Run
a) YARN provides its core services via two types of long-running daemon: a resource manager (one per cluster) to manage the use of resources across the cluster, and node managers running on all the nodes in the cluster to launch and monitor containers. A container executes an application-specific process with a constrained set of resources (memory, CPU, and so on). Depending on how YARN is configured (see YARN), a container may be a Unix process or a Linux cgroup. Figure 4-2 illustrates how YARN runs an application.
b) [Figure 4-2. How YARN runs an application]
c) To run an application on YARN, a client contacts the resource manager and asks it to run an application master process (step 1 in Figure 4-2). The resource manager then finds a node manager that can launch the application master in a container (steps 2a and 2b). Precisely what the application master does once it is running depends on the application. It could simply run a computation in the container it is running in and return the result to the client. Or it could request more containers from the resource manager (step 3), and use them to run a distributed computation (steps 4a and 4b). The latter is what the MapReduce YARN application does, which we'll look at in more detail in Anatomy of a MapReduce Job Run.
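The client side of steps 1 and 2 can be made concrete with YARN's Java client API. Below is a minimal, illustrative sketch, not the book's code; the application name, launch command, and resource sizes are invented placeholders.

    import java.util.Collections;

    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.YarnClient;
    import org.apache.hadoop.yarn.client.api.YarnClientApplication;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.hadoop.yarn.util.Records;

    public class SubmitApp {
      public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Step 1: ask the resource manager for a new application.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();
        appContext.setApplicationName("my-yarn-app"); // placeholder name

        // Steps 2a/2b: describe the container that will run the application master.
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList("/bin/sleep 60")); // placeholder command
        appContext.setAMContainerSpec(amContainer);
        appContext.setResource(Resource.newInstance(512, 1)); // 512 MB, 1 vcore for the AM

        yarnClient.submitApplication(appContext);
      }
    }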
d) Resource Requests
e) YARN has a flexible model for making resource requests. A request for a set of containers can express the amount of computer resources required for each container (memory and CPU), as well as locality constraints for the containers in that request.
f) A YARN application can make resource requests at any time while it is running. For example, an application can make all of its requests up front, or it can take a more dynamic approach whereby it requests more resources dynamically to meet the changing needs of the application.
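As an illustration of both points, a container request built with the AMRMClient API might look like the sketch below; the node and rack names are invented for the example, and an application master may add such requests at any point in its lifetime.

    import org.apache.hadoop.yarn.api.records.Priority;
    import org.apache.hadoop.yarn.api.records.Resource;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;

    public class RequestSketch {
      public static void main(String[] args) {
        // Capability: the memory and CPU required for each container.
        Resource capability = Resource.newInstance(1024, 2); // 1 GB, 2 vcores

        // Locality constraints: preferred nodes and racks (placeholder names).
        String[] nodes = { "node1.example.com" };
        String[] racks = { "/rack1" };

        ContainerRequest request =
            new ContainerRequest(capability, nodes, racks, Priority.newInstance(1));
        // A running application master would pass this to
        // AMRMClient.addContainerRequest(request), either up front or dynamically.
      }
    }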
g) Spark takes the first approach, starting a fixed number of executors on the cluster (see Spark on YARN). MapReduce, on the other hand, has two phases: the map task containers are requested up front, but the reduce task containers are not started until later. Also, if any tasks fail, additional containers will be requested so the failed tasks can be rerun.
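For instance, Spark's fixed executor count is a submission-time setting rather than anything negotiated later; a hedged sketch (the count of 10 is arbitrary, and the master URL form differs across Spark versions):

    import org.apache.spark.SparkConf;

    public class FixedExecutors {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("FixedExecutors")
            .setMaster("yarn") // "yarn-client" or "yarn-cluster" on older Spark releases
            .set("spark.executor.instances", "10"); // all executor containers requested up front
      }
    }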
h) The lifespan of a YARN application can vary dramatically: from a short-lived application of a few seconds to a long-running application that runs for days or even months. Rather than look at how long the application runs for, it’s useful to categorize applications in terms of how they map to the jobs that users run. The simplest case is one application per user job, which is the approach that MapReduce takes.
i) The second model is to run one application per workflow or user session of (possibly unrelated) jobs. This approach can be more efficient than the first, since containers can be reused between jobs, and there is also the potential to cache intermediate data between jobs. Spark is an example that uses this model.
j) The third model is a long-running application that is shared by different users. Such an application often acts in some kind of coordination role.
k) Building YARN Applications
l) Writing a YARN application from scratch is fairly involved, but in many cases is not necessary, as it is often possible to use an existing application that fits the bill.
m) There are a couple of projects that simplify the process of building a YARN application. Apache Slider, mentioned earlier, makes it possible to run existing distributed applications on YARN. Apache Twill is similar to Slider, but in addition provides a simple programming model for developing distributed applications on YARN.
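As a hedged illustration of Twill's programming model, adapted from its documented HelloWorld example (the ZooKeeper connection string is a placeholder, and lifecycle method names have varied across Twill releases):

    import org.apache.hadoop.yarn.conf.YarnConfiguration;
    import org.apache.twill.api.AbstractTwillRunnable;
    import org.apache.twill.api.TwillController;
    import org.apache.twill.api.TwillRunnerService;
    import org.apache.twill.yarn.YarnTwillRunnerService;

    public class HelloWorld {
      // The code to run inside a YARN container, written like a Runnable.
      static class HelloWorldRunnable extends AbstractTwillRunnable {
        @Override
        public void run() {
          System.out.println("Hello from a YARN container");
        }
      }

      public static void main(String[] args) {
        TwillRunnerService runner =
            new YarnTwillRunnerService(new YarnConfiguration(), "zkhost:2181"); // placeholder ZK quorum
        runner.start(); // older Twill releases use startAndWait()
        TwillController controller = runner.prepare(new HelloWorldRunnable()).start();
        // The controller can be used to monitor or terminate the application.
      }
    }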
n) In cases where none of these options are sufficient — such as an application that has complex scheduling requirements — then the distributed shell application that is a part of the YARN project itself serves as an example of how to write a YARN application. It demonstrates how to use YARN’s client APIs to handle communication between the client or application master and the YARN daemons.
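The application-master half of that communication can be sketched with the AMRMClient API; this is an illustrative outline under assumed defaults, not the distributed shell's actual code.

    import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
    import org.apache.hadoop.yarn.api.records.Container;
    import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
    import org.apache.hadoop.yarn.client.api.AMRMClient;
    import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class AppMasterSketch {
      public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Register this application master with the resource manager.
        rmClient.registerApplicationMaster("", 0, "");

        // Heartbeat: report progress (0.0 to 1.0) and collect newly allocated containers.
        AllocateResponse response = rmClient.allocate(0.0f);
        for (Container container : response.getAllocatedContainers()) {
          // Launch work in the container via an NMClient (not shown).
        }

        // Tell the resource manager we finished so it can clean up.
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
        rmClient.stop();
      }
    }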
2) YARN Compared to MapReduce 1
a) In MapReduce 1, the jobtracker takes care of both job scheduling (matching tasks with tasktrackers) and task progress monitoring (keeping track of tasks, restarting failed or slow tasks, and doing task bookkeeping, such as maintaining counter totals). By contrast, in YARN these responsibilities are handled by separate entities: the resource manager and an application master (one for each MapReduce job). The jobtracker is also responsible for storing job history for completed jobs, although it is possible to run a job history server as a separate daemon to take the load off the jobtracker. In YARN, the equivalent role is the timeline server, which stores application history.
b) YARN was designed to address many of the limitations in MapReduce 1. The benefits to using YARN include the following:
c) Scalability
d) YARN can run on larger clusters than MapReduce 1. MapReduce 1 hits scalability bottlenecks in the region of 4,000 nodes and 40,000 tasks, stemming from the fact that the jobtracker has to manage both jobs and tasks. YARN overcomes these limitations by virtue of its split resource manager/application master architecture: it is designed to scale up to 10,000 nodes and 100,000 tasks.
e) Availability
f) High availability (HA) is usually achieved by replicating the state needed for another daemon to take over the work needed to provide the service, in the event of the service daemon failing.
g) With the jobtracker’s responsibilities split between the resource manager and application master in YARN, making the service highly available became a divide-and-conquer problem: provide HA for the resource manager, then for YARN applications (on a per-application basis).
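The resource manager half is largely a matter of configuration; the sketch below sets the standard yarn-site.xml RM HA properties through the Configuration API, with placeholder hostnames and ZooKeeper address.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class RmHaConfig {
      public static void main(String[] args) {
        // These properties normally live in yarn-site.xml.
        Configuration conf = new YarnConfiguration();
        conf.setBoolean("yarn.resourcemanager.ha.enabled", true);
        conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");    // two RM instances
        conf.set("yarn.resourcemanager.hostname.rm1", "master1"); // placeholder hosts
        conf.set("yarn.resourcemanager.hostname.rm2", "master2");
        conf.set("yarn.resourcemanager.zk-address", "zk1:2181");  // ZooKeeper for leader election
      }
    }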
h) Utilization
i) In YARN, a node manager manages a pool of resources, rather than a fixed number of designated slots. MapReduce running on YARN will not hit the situation where a reduce task has to wait because only map slots are available on the cluster, which can happen in MapReduce 1. If the resources to run the task are available, then the application will be eligible for them.
j) Furthermore, resources in YARN are fine grained, so an application can make a request for what it needs, rather than for an indivisible slot, which may be too big (which is wasteful of resources) or too small (which may cause a failure) for the particular task.
k) Multitenancy
l) In some ways, the biggest benefit of YARN is that it opens up Hadoop to other types of distributed application beyond MapReduce. MapReduce is just one YARN application among many.
