Ambari Architecture Design


Original source:

Ambari Architecture: https://issues.apache.org/jira/secure/attachment/12559939/Ambari_Architecture.pdf

1 Design Goals

1.1 Platform Independence

The system must architecturally support any hardware and operating system, e.g.
RHEL, SLES, Ubuntu, Windows, etc. Components which are inherently dependent
on a platform (e.g., components dealing with yum, rpm packages, debian packages,
etc) should be pluggable with well-defined interfaces.

1.2 Pluggable Components

The architecture must not assume specific tools and technologies. Any specific
tools and technologies must be encapsulated by pluggable components. The
architecture will focus on the pluggability of Puppet (the provisioning and
configuration tool of choice) and related components, and of the database used to
persist state. The goal is not to immediately support replacements for Puppet, but
the architecture should be easily extensible to do so in the future.
The pluggability goal doesn’t encompass standardization of inter-component
protocols or interfaces to work with third-party implementations of components.

1.3 Version Management & Upgrade

Ambari components running on various nodes must support multiple versions of
the protocols to support independent upgrade of components. Upgrade of any
component of Ambari must not affect the cluster state.

1.4 Extensibility

The design should support easy addition of new services, components and APIs.
Extensibility also implies ease in modifying any configuration or provisioning steps
for the Hadoop stack. Also, the possibility of supporting Hadoop stacks other than
HDP needs to be taken into account.

1.5 Failure Recovery

The system must be able to recover from any component failure to a consistent
state. The system should try to complete the pending operations after recovery. If
certain errors are unrecoverable, failure should still keep the system in a consistent
state.

1.6 Security

Security implies 1) authentication and role-based authorization of Ambari
users (both API and Web UI), 2) installation, management, and monitoring of the
Hadoop stack secured via Kerberos, and 3) authenticating and encrypting over-the-wire
communication between Ambari components (e.g., Ambari master-agent
communication).

1.7 Error Trace

The design strives to simplify the process of tracing failures. The failures should
be propagated to the user with sufficient details and pointers for analysis.

1.8 Near Real-Time and Intermediate Feedback for Operations

For operations that take a while to complete, the system needs to be able to
provide the user feedback with intermediate progress regarding currently running
tasks, % of operation complete, a reference to an operation log, etc., in a timely
manner (near real-time). In the previous version of Ambari, this was not available
due to Puppet's Master-Agent architecture and its status reporting mechanism.
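
As a concrete illustration of this goal, today's Ambari exposes per-request progress through its REST API, which a client can poll in near real time. The sketch below is a minimal Python example; the server address, credentials, and the `Requests/progress_percent` / `Requests/request_status` fields are assumptions based on recent Ambari versions, not something specified by this design document.

```python
import time
import requests  # third-party HTTP client (pip install requests)

AMBARI = "http://ambari-host:8080/api/v1"   # assumed server address
AUTH = ("admin", "admin")                   # assumed credentials
HEADERS = {"X-Requested-By": "ambari"}

def poll_request(cluster: str, request_id: int, interval: float = 5.0) -> str:
    """Print near real-time progress of a long-running operation until it finishes."""
    url = f"{AMBARI}/clusters/{cluster}/requests/{request_id}"
    while True:
        body = requests.get(url, auth=AUTH, headers=HEADERS).json()["Requests"]
        print(f"{body['request_status']}: {body['progress_percent']:.0f}%")
        if body["request_status"] not in ("PENDING", "IN_PROGRESS"):
            return body["request_status"]
        time.sleep(interval)
```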

2 Terminology

2.1 Service

Service refers to services in the Hadoop stack. HDFS, HBase, and Pig are examples of services. A service may have multiple components (e.g., HDFS has NameNode, Secondary NameNode, DataNode, etc). A service can just be a client library (e.g., Pig does not have any daemon services, but just has a client library).

2.2 Component

A service consists of one or more components. For example, HDFS has 3 components: NameNode, DataNode and Secondary NameNode. Components may be optional. A component may span multiple nodes (e.g., DataNode instances on multiple nodes).

2.3 Node/Host

Node refers to a machine in the cluster. Node and host are used interchangeably in this document.

2.4 Node-Component

Node-component refers to an instance of a component on a particular node. For example, a particular DataNode instance on a particular node is a node-component.

2.5 Operation

An operation refers to a set of changes or actions performed on a cluster to satisfy a user request or to achieve a desirable state change in the cluster. For example, starting of a service is an operation and running a smoke test is an operation. If a user requests to add a new service to the cluster and that includes running a smoke test as well, then the entire set of actions to meet the user request will constitute an operation. An operation can consist of multiple “actions” that are ordered (see below).

2.6 Task

Task is the unit of work that is sent to a node to execute. A task is the work that node has to carry out as part of an action. For example, an “action” can consist of installing a datanode on Node n1 and installing a datanode and a secondary namenode on Node n2. In this case, the “task” for n1 will be to install a datanode and the “tasks” for n2 will be to install both a datanode and a secondary namenode.

2.7 Stage

A stage refers to a set of tasks that are required to complete an operation and are independent of each other; all tasks in the same stage can be run across different nodes in parallel.

2.8 Action

An ‘action’ consists of a task or tasks on a machine or a group of machines. Each action is tracked by an action id, and nodes report status at least at the granularity of the action. An action can be considered a stage under execution. In this document a stage and an action have a one-to-one correspondence unless specified otherwise. An action id will be a bijection of (request-id, stage-id).
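
A minimal sketch of one way such a bijection could be realized (the delimiter and helper names here are illustrative, not Ambari's actual encoding):

```python
def make_action_id(request_id: int, stage_id: int) -> str:
    """Encode (request-id, stage-id) as a single action id."""
    return f"{request_id}-{stage_id}"

def parse_action_id(action_id: str) -> tuple[int, int]:
    """Recover the pair; together the two functions form a bijection."""
    request_id, stage_id = action_id.split("-")
    return int(request_id), int(stage_id)

assert parse_action_id(make_action_id(12, 3)) == (12, 3)
```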

2.9 Stage Plan

An operation typically consists of multiple tasks on various machines and they usually have dependencies requiring them to run in a particular order. Some tasks are required to complete before others can be scheduled. Therefore, the tasks required for an operation can be divided in various stages where each stage must be completed before the next stage, but all the tasks in the same stage can be scheduled in parallel across different nodes.
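
To make the staging idea concrete, here is a minimal Python sketch (not Ambari's actual planner) that layers tasks into stages from a dependency map; every task in a stage has all of its prerequisites in earlier stages, so all tasks within a stage can be scheduled in parallel:

```python
def plan_stages(deps: dict[str, set[str]]) -> list[set[str]]:
    """Group tasks into ordered stages; deps maps task -> prerequisite tasks."""
    done: set[str] = set()
    stages: list[set[str]] = []
    while len(done) < len(deps):
        # every task whose prerequisites are already complete joins this stage
        ready = {t for t, pre in deps.items() if t not in done and pre <= done}
        if not ready:
            raise ValueError("dependency cycle")
        stages.append(ready)
        done |= ready
    return stages

# e.g. install everywhere first, then start the master, then the slaves
print(plan_stages({
    "install": set(),
    "start-namenode": {"install"},
    "start-datanodes": {"start-namenode"},
}))
# -> [{'install'}, {'start-namenode'}, {'start-datanodes'}]
```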

2.10 Manifest

Manifest refers to the definition of a task which is sent to a node for execution. The manifest must completely define the task and must be serializable. Manifest can also be persisted on disk for recovery or record.
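
As an illustration only (the field names below are hypothetical, not Ambari's wire format), a manifest could be a fully self-describing, serializable structure like this:

```python
import json

# a hypothetical, fully self-describing task definition
manifest = {
    "requestId": 12,
    "stageId": 1,
    "node": "n1.example.com",
    "role": "DATANODE",
    "command": "INSTALL",
    "configuration": {"dfs.datanode.data.dir": "/grid/0/hdfs/data"},
}

serialized = json.dumps(manifest)          # sent to the node, or persisted on disk
assert json.loads(serialized) == manifest  # round-trips losslessly for recovery
```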

2.11 Role

A role maps to either a component (e.g., NameNode, DataNode) or an action (e.g., HDFS rebalancing, HBase smoke test, other admin commands, etc.)

3 Ambari Architecture

3.1 High-Level Diagram

The following figure captures the high-level architecture of Ambari.

[Figure: Ambari high-level architecture]

The master module accepts requests from the API and from the Agent Interface and implements the centralized management and monitoring logic of ambari-server, while each agent node is responsible only for collecting and maintaining the state of its own node.

Supplementary notes: Ambari does not adopt a brand-new architecture; it makes full use of several excellent open-source projects and the ideas behind them, combining them cleverly so that it can provide cluster-wide service management, monitoring, and display in a distributed environment. Ambari's architecture follows a client/server (C/S) model, enabling centralized management of the installation, configuration, and deployment of a distributed cluster. Besides ambari-server and ambari-agent, Ambari also provides a polished management and monitoring web UI, ambari-web, whose pages are served by ambari-server. ambari-server exposes a REST API that serves two purposes: first, to provide management and monitoring services to ambari-web; second, to interact with ambari-agent, accepting the heartbeat requests each ambari-agent sends to ambari-server.

3.2 Ambari Server

The following figure describes the design of the Ambari Server.

[Figure: Ambari Server design]

Supplementary notes:
ambari-server is stateful; it maintains its own finite state machine (FSM; identifiers in the source code contain "fsm"). These states are stored in a database (several databases are currently supported, and one can be chosen as desired). The server maintains three kinds of state:

  • Live Cluster State: the current state of the cluster; the state information reported by each node updates this state.
  • Desired State: the state the user wants a node-component to be in, set when the user performs a series of operations in the UI to change the state of certain services; these states have not yet taken effect on the nodes.
  • Action State: an intermediate state that assists the transition from the Live Cluster State to the Desired State.

The Heartbeat Handler module of ambari-server receives heartbeat requests from each agent (containing node state information and the results of executed operations), passes the node state information to the FSM module in the figure to maintain that node's state, and hands the returned operation results to the Action Manager for more detailed processing. The Coordinator module can be regarded as the API Handler: after receiving an operation request from the web UI, it validates the request, the Stage Planner decomposes it into a set of staged operations, and the result is finally handed to the Action Manager for execution.

Thus, as the figure above shows, all state maintained by ambari-server, and every change to it, is recorded in the database; any service-changing operation performed by the user leaves a corresponding record in the database, and the agents learn about the database's change history through their heartbeats.
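
A highly simplified Python sketch of this idea (the state names and methods are illustrative; Ambari's real FSM is far richer, and every transition is persisted to the database):

```python
from dataclasses import dataclass

@dataclass
class NodeComponent:
    """Server-side record for one component instance on one node."""
    live: str = "INIT"      # updated from agent heartbeats (Live Cluster State)
    desired: str = "INIT"   # set by user operations via the API (Desired State)

class ServerState:
    """Tracks live vs. desired state; the gap between them drives actions."""
    def __init__(self) -> None:
        self.components: dict[tuple[str, str], NodeComponent] = {}

    def _get(self, node: str, component: str) -> NodeComponent:
        return self.components.setdefault((node, component), NodeComponent())

    def set_desired(self, node: str, component: str, state: str) -> None:
        self._get(node, component).desired = state   # e.g. user asks to START

    def on_heartbeat(self, node: str, component: str, state: str) -> None:
        self._get(node, component).live = state      # agent reports actual state

    def pending(self):
        """Node-components that have not yet converged to the desired state."""
        return [(key, nc.live, nc.desired)
                for key, nc in self.components.items() if nc.live != nc.desired]
```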

3.3 Ambari Agent

The following figure describes the design of the Ambari Agent.

[Figure: Ambari Agent design]

Supplementary notes:
ambari-agent is stateless. Its main functions are:

  • collect information about the node it runs on and send it, aggregated, to ambari-server via heartbeats;
  • handle the responses returned by ambari-server.

Accordingly, it maintains two queues (see the sketch after this list):

  • MessageQueue: holds node state information (registration information, etc.) and execution results, which are aggregated and sent to ambari-server via heartbeats.
  • ActionQueue: receives the operations returned by ambari-server; an executor then invokes modules such as Puppet or Python scripts in order to carry out the tasks.
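
A minimal sketch of this loop under the above assumptions (the queue names follow the text; the payload shapes and function names are illustrative):

```python
import queue

message_q: "queue.Queue[dict]" = queue.Queue()  # outbound: node state + results
action_q: "queue.Queue[dict]" = queue.Queue()   # inbound: commands from the server

def heartbeat_once(send) -> None:
    """Drain pending messages, send one heartbeat, enqueue returned commands."""
    outgoing = []
    while not message_q.empty():
        outgoing.append(message_q.get_nowait())
    response = send({"node": "n1.example.com", "messages": outgoing})
    for command in response.get("commands", []):
        action_q.put(command)

def execute_once(run) -> None:
    """Take the next command and run it, e.g. via a Puppet or Python executor."""
    command = action_q.get()
    result = run(command)                 # invoke the appropriate executor module
    message_q.put({"result": result})     # reported on the next heartbeat
```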

4 Use Cases

In this section we cover a few basic use cases and describe, at a high level, how requests are served by the system and how the components interact.

4.1 Add Service

Add a new service to an existing cluster. Let's take a specific example of adding the HBase service to an existing cluster that is already running HDFS. The HBase master and slaves will be added to a subset of the existing nodes (no additional nodes). The request will go through the following steps:

  • The request lands on the server via API and a request id is generated and attached to the request. A handler for this API is invoked in the Coordinator.

  • The API Handler implements the steps needed to start a new service on an existing cluster. In this case the steps would be: install all the service components along with the required prerequisites, start the prerequisites and the service components in a specific order, and re-configure Nagios server to add new service monitoring as well.

  • The Coordinator will look up the prerequisites for HBase in the Dependency Tracker. The Dependency Tracker will return the HDFS and ZooKeeper components. The Coordinator will also look up the dependencies for the Nagios server, which will return the HBase client. The Dependency Tracker will also return the required states of the required components. Thus, the Coordinator will know the entire set of components and their required states. The Coordinator will set the desired state for all these components in the DB.

  • During the previous step, the Coordinator may also determine that it requires the user's input to select nodes for ZooKeeper and may return an appropriate response. This depends on the API semantics.

  • The Coordinator will then pass the list of components and their desired states to the Stage Planner. The Stage Planner will return the staged sequence of operations that need to be performed on each node where these components are to be installed/started/modified. The Stage Planner will also generate the manifest (tasks for each individual node for each stage) using the Manifest Generator.

  • Coordinator will pass this ordered list of stages to the Action Manager with the corresponding request id.

  • Action Manager will update the state of each node-component, in the FSM, which will reflect that an operation is in progress. Note that the FSM for each affected node-component is updated. In this step, the FSM may detect an invalid event and throw failure, which will abort the operation and all actions will be marked as failed with an error.

  • Action Manager will create an action id for each operation and add it to the Plan. The Action Manager will pick the first Stage from the plan and add each action in this Stage to the queue for each of the affected nodes. The next Stage will be picked when the first Stage completes. Action Manager will also start a timer for scheduled actions (see the sketch after this list).

  • Heartbeat Handler will receive the response for the actions and notify the Action Manager. Action Manager will send an event to the FSM to update the state. In case of a timeout, the action will be scheduled again or marked as failed. Once all nodes for an action have reached completion (response received or final timeout) the action is considered completed. Once all actions for a Stage are completed the Stage is considered completed.

  • Action completion is also recorded in the database.

  • The Action Manager proceeds to the next Stage and repeats.
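
The scheduling described in the last few steps can be sketched as follows (a simplification with assumed names; real Ambari also persists every state transition and drives the per-node-component FSM):

```python
import time

def run_plan(stages, dispatch, wait_for, timeout=600.0):
    """Execute stages in order; tasks inside a stage run on their nodes in parallel.

    stages:   list of {node: task} dicts, one dict per stage
    dispatch: callable that queues a task on a node's action queue
    wait_for: callable that blocks until the node reports a status, or "TIMEDOUT"
    """
    for stage_id, stage in enumerate(stages):
        deadline = time.monotonic() + timeout
        for node, task in stage.items():     # all tasks of a stage go out together
            dispatch(node, task)
        for node, task in stage.items():
            status = wait_for(node, task, deadline)
            if status != "COMPLETED":
                # a failed or timed-out action aborts the remaining stages
                raise RuntimeError(f"stage {stage_id} failed on {node}: {status}")
```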
