Lustre (to be revised)

Lustre: A Scalable, High-Performance File System

Cluster File Systems, Inc.

 

Abstract:

Today's network-oriented computing environments require high-performance, network-aware file systems that can satisfy both the data storage requirements of individual systems and the data sharing requirements of workgroups and clusters of cooperative systems. The Lustre File System, an open source, high-performance file system from Cluster File Systems, Inc., is a distributed file system that eliminates the performance, availability, and scalability problems that are present in many traditional distributed file systems.  Lustre is a highly modular next generation storage architecture that combines established, open standards, the Linux operating system, and innovative protocols into a reliable, network-neutral data storage and retrieval solution. Lustre provides high I/O throughput in clusters and shared-data environments and also provides independence from the location of data on the physical storage, protection from single points of failure, and fast recovery from cluster reconfiguration and server or network outages.

1. Overview

Network-centric computing environments demand reliable, high-performance storage systems that properly authenticated clients can depend on for data storage and delivery. Simple cooperative computing environments such as enterprise networks typically satisfy these requirements using distributed file systems based on a standard client/server model. Distributed file systems such as NFS and AFS have been successful in a variety of enterprise scenarios but do not satisfy the requirements of today's high-performance computing environments. The Lustre distributed file system provides significant performance and scalability advantages over existing distributed file systems. Lustre leverages the power and flexibility of the Open Source Linux operating system to provide a truly modern POSIX compliant file system that satisfies the requirements of large clusters today, while providing a clear design and extension path for even larger environments tomorrow. The name "Lustre" is an amalgam of the terms "Linux" and "Clusters".

Distributed file systems have well-known advantages. They decouple computational and storage resources, enabling desktop systems to focus on user and application requests while file servers focus on reading, delivering, and writing data. Centralizing storage on file servers facilitates centralized system administration, simplifying operational tasks such as backups, storage expansion, and general storage reconfiguration without requiring desktop downtime or other interruptions in service. Beyond the standard features required by definition in a distributed file system, a more advanced distributed file system such as AFS simplifies data access and usability by providing a consistent view of the distributed file system from all client systems. It also supports redundancy: failover services, in conjunction with redundant storage devices, provide multiple synchronized copies of critical resources. In the event of the failure of any critical resource, the file system automatically provides a replica of the failed entity, which can continue to provide uninterrupted service. This eliminates single points of failure in the distributed file system environment.

Lustre provides significant advantages over the aging distributed file systems that preceded it. These advantages are discussed in more detail throughout this paper, but are highlighted here for convenience. Most importantly, Lustre runs on commodity hardware and uses object-based disks for storage and metadata servers for storing file system metadata. This design provides a substantially more efficient division of labor between computing and storage resources. Replicated, failover Metadata Servers (MDSs) maintain a transactional record of high-level file and file system changes. Distributed Object Storage Targets (OSTs) are responsible for actual file system I/O and for interfacing with storage devices, which will be explained in more detail in the next section. This division of labor and responsibility leads to a truly scalable file system and more reliable recoverability from failure conditions by providing a unique combination of the advantages of journaling and distributed file systems. Lustre supports strong file and metadata locking semantics to maintain total coherency of the file systems even in the presence of concurrent access. File locking is distributed across the storage targets (OSTs) that constitute the file system, with each OST handling locks for the objects that it stores.

Lustre uses an open networking API, the Portals API, made available by Sandia. At the top of the stack is a very sophisticated request processing layer provided by Lustre, resting on top of the Portals protocol stack. At the bottom is a network abstraction layer (NAL) that provides out-of-the-box support for multiple types of networks. Like Lustre's use of Linux, Lustre's use of open, flexible standards makes it easy to integrate new and emerging network and storage technologies. Lustre provides security in the form of authentication, authorization and privacy by leveraging existing security systems. This makes it easy to incorporate Lustre into existing enterprise security environments without requiring changes in Lustre itself. Similarly, Lustre leverages the underlying journaling file systems provided by Linux to enable persistent state recovery, enabling resiliency and recoverability from failed OSTs. Finally, Lustre's configuration and state information is recorded and managed using open standards such as XML and LDAP, making it easy to integrate Lustre management and administration into existing environments and sets of third-party tools.

The remainder of this white paper provides a more detailed analysis of various aspects of the design and implementation of Lustre along with a roadmap for planned Lustre enhancements. The Lustre web site at http://www.lustre.org provides additional documentation on Lustre along with the source code. For more information about Lustre, contact Cluster File Systems, Inc. via email at info@clusterfs.com.

 

2. Lustre Functionality

The Lustre file system provides several abstractions designed to improve both performance and scalability. At the file system level, Lustre treats files as objects that are located through Metadata Servers (MDSs). Metadata servers support all file system namespace operations, such as file lookups, file creation, and file and directory attribute manipulation, directing actual file I/O requests to Object Storage Targets (OSTs), which manage the storage that is physically located on underlying Object-Based Disks (OBDs). Metadata servers keep a transactional record of file system metadata changes and cluster status, and support failover so that the hardware and network outages that affect one metadata server do not affect the operation of the file system itself.

Like other file systems, the Lustre file system has a unique inode for every regular file, directory, symbolic link, and special file. The regular file inodes hold references to objects on OSTs that store the file data instead of references to the actual file data itself. In existing file systems, creating a new file causes the file system to allocate an inode and set some of its basic attributes. In Lustre, creating a new file causes the client to contact a metadata server, which creates an inode for the file and then contacts the OSTs to create objects that will actually hold file data. Metadata for the objects is held in the inode as extended attributes for the file. The objects allocated on OSTs hold the data associated with the file and can be striped across several OSTs in a RAID pattern. Within the OST, data is actually read from and written to underlying storage known as Object-Based Disks (OBDs). Subsequent I/O to the newly created file is done directly between the client and the OST, which interacts with the underlying OBDs to read and write data. The metadata server is only updated when additional namespace changes associated with the new file are required.
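
The striping of a file's objects across several OSTs can be illustrated with a small sketch. It assumes a plain round-robin (RAID-0 style) layout; the paper only says "a RAID pattern", so the layout, the function name `locate`, and the parameters `stripe_size` and `stripe_count` are illustrative assumptions and not Lustre's actual interface.

```python
from typing import NamedTuple


class StripeLocation(NamedTuple):
    ost_index: int      # position of the object in the file's stripe list (0-based)
    object_offset: int  # byte offset inside that object


def locate(file_offset: int, stripe_size: int, stripe_count: int) -> StripeLocation:
    """Map a file byte offset to (object, offset) under round-robin striping."""
    if file_offset < 0 or stripe_size <= 0 or stripe_count <= 0:
        raise ValueError("offset must be >= 0; stripe size and count must be > 0")
    stripe_number = file_offset // stripe_size   # index of the stripe unit in the file
    ost_index = stripe_number % stripe_count     # stripe units rotate over the objects
    row = stripe_number // stripe_count          # full rotations completed so far
    object_offset = row * stripe_size + file_offset % stripe_size
    return StripeLocation(ost_index, object_offset)
```

With a 1 MiB stripe unit and four objects, byte 0 lands in object 0, byte 1 MiB lands at the start of object 1, and byte 4 MiB wraps around to object 0 at offset 1 MiB. A client that knows this mapping can send each I/O request directly to the OST holding the relevant object, without consulting the metadata server.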

 

 

Object Storage Targets handle all of the interaction between client data requests and the underlying physical storage. This storage is generally referred to as Object-Based Disks (OBDs), but is not actually limited to disks because the interaction between the OST and the actual storage device is done through a device driver. The characteristics and capabilities of the device driver mask the specific identity of the underlying storage that is being used. This enables Lustre to leverage existing Linux file systems and storage devices for its underlying storage while providing the flexibility required to integrate new technologies such as smart disks and new types of file systems. For example, Lustre currently provides OBD device drivers that support Lustre data storage within journaling Linux file systems such as ext3, JFS, ReiserFS and XFS. This further increases the reliability and recoverability of Lustre by leveraging the journaling mechanisms already provided in such file systems. Lustre can also be used with specialized 3rd party object storage targets like those provided by BlueArc.

 

 

Lustre's division of actual storage and allocation into OSTs and underlying OBDs facilitates hardware development that can provide additional performance improvements in the form of a new generation of smart disk drives, which provide object-oriented allocation and data management facilities in hardware. Cluster File Systems is actively working with several storage manufacturers to develop integrated OBD support in disk drive hardware. This is analogous to the way in which the SCSI standard pioneered smart devices that offloaded much of the direct hardware interaction into the device's interface and drive controller hardware. Such smart, OBD-aware hardware can provide instant performance improvements to existing Lustre installations and will continue the modern computing trend of offloading device-specific processing to the device itself.

 

Beyond the storage abstraction that they provide, OSTs also provide a flexible model for adding new storage to an existing Lustre file system. New OSTs can easily be brought online and added to the pool of OSTs that a cluster's metadata servers can use for storage. Similarly, new OBDs can easily be added to the pool of underlying storage associated with any OST. Lustre provides a powerful and unique recovery mechanism used when any communication or storage failure occurs. If a server or network interconnect fails, the client incurs a timeout trying to access data. It can then query an LDAP server to obtain information about a replacement server and immediately direct subsequent requests to that server. An 'epoch' number on every storage controller, an incarnation number on the metadata server/cluster, and a generation number associated with connections between clients and other systems form the infrastructure for Lustre recovery, enabling clients and servers to detect restarts and select appropriate and equivalent servers.
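
The client side of this recovery path can be sketched roughly as follows. This is a simplified model under stated assumptions, not Lustre's protocol: a plain dictionary stands in for the LDAP lookup, an exception stands in for the timeout, and only the per-connection generation number is modeled (the epoch and incarnation numbers are omitted). All class and attribute names are hypothetical.

```python
class ServerDown(Exception):
    """Stands in for a request that timed out."""


class Client:
    def __init__(self, directory, servers, target):
        self.directory = directory  # stand-in for LDAP: target -> ordered server names
        self.servers = servers      # server name -> callable(payload), may raise ServerDown
        self.target = target
        self.current = directory[target][0]
        self.generation = 1         # incremented for every new connection

    def request(self, payload):
        try:
            return self.servers[self.current](payload)
        except ServerDown:
            # Timeout: look up a replacement and send this and later requests there.
            for name in self.directory[self.target]:
                if name != self.current:
                    self.current = name
                    self.generation += 1
                    return self.servers[name](payload)
            raise  # no replacement is registered for this target
```

Because `current` is updated before the retry, every later call to `request` goes straight to the replacement server, and the changed generation number lets either side recognize that the connection is a new one.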

When failover OSTs are not available, Lustre adapts automatically. If an OST fails, Lustre raises administrative alarms but otherwise generates errors only when data cannot be accessed. New file creation operations automatically avoid a malfunctioning OST.

 

3. File System Metadata and Metadata Servers

File system metadata is "information about information", which essentially means that metadata is information about the files and directories that make up a file system. This information can simply be information about local files, directories, and associated status information, but can also be information about mount points for other file systems within the current file system, information about symbolic links, and so on. Many modern file systems use metadata journaling to maximize file system consistency. The file system keeps a journal of all changes to file system metadata, and asynchronously updates the file system based on completed changes that have been written to its journal. If a system outage occurs, file system consistency can be quickly restored simply by replaying completed transactions from the metadata journal.
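
Journal replay after an outage can be sketched in a few lines. The record format below (dictionaries with `txn`, `op`, `key`, and `value` fields, plus an explicit `commit` record) is an assumption made for illustration; real journaling file systems use their own on-disk formats.

```python
def replay(state: dict, journal: list) -> dict:
    """Rebuild metadata by applying only transactions that reached their commit record."""
    committed = {record["txn"] for record in journal if record["op"] == "commit"}
    recovered = dict(state)
    for record in journal:
        if record["txn"] not in committed:
            continue  # incomplete transaction: ignore it entirely
        if record["op"] == "set":
            recovered[record["key"]] = record["value"]
        elif record["op"] == "delete":
            recovered.pop(record["key"], None)
    return recovered
```

A transaction that was interrupted before its commit record was written leaves no trace in the recovered state, which is what keeps the metadata consistent without scanning the whole file system.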

 

In Lustre, file system metadata is stored on a metadata server (MDS) and file data is stored in objects on the OSTs. This design divides file system updates into two distinct types of operations: file system metadata updates on the MDS and actual file data updates on the OSTs. File system namespace operations are done on the MDS so that they do not impact the performance of operations that only manipulate actual object (file) data. Once the MDS identifies the storage location of a file, all subsequent file I/O is done between the client and the OSTs. Using metadata servers to manage the file system namespace provides a variety of immediate opportunities for performance optimization. For example, metadata servers can maintain a cache of pre-allocated objects on various OSTs, expediting file creation operations. The scalability of metadata operations on Lustre is further improved through the use of an intent-based locking scheme. For example, when a client wishes to create a file, it requests a lock from an MDS to enable a lookup operation on the parent directory, and also tags this request with the intended operation, namely file creation. If the lock request is granted, the MDS then uses the intention specified in the lock request to modify the directory, creating the requested file and returning a lock on the new file instead of the directory.
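
The intent-based locking exchange can be modeled with a toy metadata server. This is only a sketch of the idea described above: the class, the method name `lock_with_intent`, and the reply dictionary are hypothetical, and real lock management (modes, conflicts, revocation) is left out.

```python
class MetadataServer:
    """Toy MDS holding one namespace as nested dictionaries (hypothetical model)."""

    def __init__(self):
        self.directories = {"/": {}}  # directory path -> {file name: inode number}
        self.next_inode = 1

    def lock_with_intent(self, parent, name, intent):
        """Serve a lookup-lock request on `parent` that carries an intended operation."""
        entries = self.directories[parent]
        if intent == "create":
            if name in entries:
                raise FileExistsError(name)
            # Execute the intent on the server: the directory is modified here,
            # so the client needs no separate create request.
            entries[name] = self.next_inode
            self.next_inode += 1
            path = f"{parent.rstrip('/')}/{name}"
            return {"lock_on": path, "inode": entries[name]}
        # Plain lookup: the lock stays on the directory itself.
        return {"lock_on": parent, "inode": entries.get(name)}
```

The point of the design is that a single request both resolves the parent directory and performs the creation, and the lock handed back covers the new file rather than the whole directory, so other clients can keep working in that directory.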

Divorcing file system metadata operations from actual file data operations improves immediate performance, but also improves long-term aspects of the file system such as recoverability and availability. Actual file I/O is done directly between Object Storage Targets and client systems, eliminating intermediaries. General file system availability is improved by providing a single failover metadata server and by using distributed Object Storage Targets, eliminating any one MDS or OST as a single point of failure. In the event of wide-spread hardware or network outages, the transactional nature of the metadata stored on the metadata servers significantly reduces the time it takes to restore file system consistency by minimizing the chance of losing file system control information such as object storage locations and actual object attribute information.

File system availability and reliability are critical to any computer system, but become even more significant as the number of clients and the amount of managed storage increase. A local file system outage only affects the usability of a single workstation, but an outage of a central resource such as a distributed file system has the potential to affect the usability of hundreds or thousands of client systems that need to access that storage. Lustre's flexibility, reliable and highly available design, and inherent scalability make Lustre well suited for use as a cluster file system today, when cluster clients number in the hundreds or low thousands, and tomorrow, when the number of clients depending on distributed file system resources will only continue to grow.

 

4. Network Independence in Lustre

As mentioned earlier in this paper, Lustre can be used over a wide variety of networks due to its use of an open Network Abstraction Layer. Lustre is currently in use over TCP and Quadrics (QSWNet) networks. Myrinet, Fibre Channel, Stargen and InfiniBand support are under development. Lustre's network-neutrality enables Lustre to instantly take advantage of performance improvements provided by network hardware and protocol improvements offered by new systems. Lustre provides unique support for heterogeneous networks. For example, it is possible to connect some clients over an Ethernet to the MDS and OST servers, and others over a QSW network, all in a single installation.

 

Lustre also provides routers that can route the Lustre protocol between different kinds of supported networks. This is useful for connecting to third-party OSTs that may not support all of the specialized networks available on generic hardware.

 

Lustre provides a very sophisticated request processing layer on top of the Portals protocol stack, originally developed by Sandia but now available to the Open Source community. Below this is the network data movement layer responsible for moving data vectors from one system to another. Beneath this layer, the Portals message passing layer sits on top of the network abstraction layer, which finally defines the interactions between underlying network devices. The Portals stack provides support for high-performance network data transfer, such as Remote Direct Memory Access (RDMA), direct OS-bypass I/O, and scatter-gather I/O (memory and disk) for more efficient bulk data movement.

 

5. Lustre Administration Overview

Lustre's commitment to using open standards such as Linux, the Portals Network Abstraction Layer, and existing Linux journaling file systems such as ext3 for journaling metadata storage is reflected in its commitment to creating open administrative and configuration information. Lustre's configuration information is stored in eXtensible Markup Language (XML) files that conform to a simple Document Type Definition (DTD), which is published in the open Lustre documentation. Maintaining configuration information in standard text files means that it can easily be manipulated using simple tools such as text editors. Maintaining it in a consistent fashion with a published DTD makes it easy to integrate Lustre configuration into third-party and open source administration utilities.
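
Because the configuration is plain XML, third-party tools can read it with any standard XML parser. The sketch below shows the idea using Python's standard library. The element and attribute names (`cluster`, `mds`, `ost`, `name`, `node`) are invented for illustration and are not taken from the published Lustre DTD, which should be consulted for the real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; NOT the real Lustre schema.
CONFIG = """
<cluster name="demo">
  <mds name="mds1" node="node-a"/>
  <ost name="ost1" node="node-b"/>
  <ost name="ost2" node="node-b"/>
</cluster>
"""


def services_by_node(xml_text):
    """Group the configured services by the node that hosts them."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for element in root:
        entry = (element.tag, element.get("name"))
        result.setdefault(element.get("node"), []).append(entry)
    return result
```

An administration tool could use a mapping like this to decide which services to start on each node, which is the kind of integration that a published, text-based format makes straightforward.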

These configuration files can be generated and updated using the lmc (Lustre make configuration) configuration utility. The lmc utility quickly generates initial configuration files, even for very large, complex clusters involving hundreds of OSTs, routers, and clients.

Lustre is integrated with open network data resources and administrative mechanisms such as the Lightweight Directory Access Protocol (LDAP) and the Simple Network Management Protocol (SNMP). The lmc utility can convert LDAP-based configuration to and from XML-based configuration information. The LDAP infrastructure provides redundancy and assists with cluster recovery.

 

To provide enterprise wide monitoring, Lustre exports status and configuration information into the SNMP agent, offering a Lustre MIB to the management stations.

Lustre provides several basic, command-line oriented utilities for initial configuration and administration. The lctl (Lustre control) utility can be used to perform low level Lustre network and device configuration tasks, as well as batch driven tests to check the sanity of a cluster.

The lconf (Lustre configuration) utility enables administrators to configure Lustre on specific nodes using user-specified configuration files. The Lustre documentation provides extensive examples of using these commands to configure, start, reconfigure, and stop or restart Lustre services.

 

6. Future Directions for Lustre

Lustre's distributed design and use of metadata servers and Object Storage Targets as intermediaries between client requests and actual data access leads to a very scalable file system. The next few sections highlight several issues on the Lustre roadmap that will help to further improve the performance, scalability, and security of Lustre.

6.1 The Lustre Global Namespace

As mentioned earlier in this paper, distributed file systems provide a number of administrative advantages. From the end-user perspective, the primary advantages of distributed file systems are that they provide access to substantially more storage than could be physically attached to a single system, and that this storage can be accessed from any authorized workstation. However, accessing shared storage from different systems can be confusing if there is no uniform way of referring to and accessing that storage. The classic way of providing a single way of referring to and accessing the files in a distributed filesystem is by providing a "global namespace". A global namespace is typically a single directory on which an entire distributed filesystem is made available to users. This is known as mounting the distributed filesystem on that directory. In the AFS distributed file system, the global namespace is the directory /afs, which provides hierarchical access to filesets on various servers that are mounted as subdirectories somewhere under the /afs directory. When traversing fileset mountpoints, AFS does not store configuration data on the client to find the target fileset, but instead contacts a fileset location server to determine the server on which the fileset is physically stored.

 In AFS, mountpoint objects are represented as symbolic links that point to a fileset name/identifier. This requires that AFS mount objects must be translated from symbolic links to specific directories and filesets whenever you mount a fileset. Unfortunately, existing file systems like AFS contain hardwired references to mountpoints for the file systems. These file systems must therefore always be found at those locations, and can only be found at those locations.

Unlike existing distributed filesystems, Lustre intends to provide the best of both worlds by providing a global namespace that can be easily grafted onto any directory in an existing Linux filesystem. Once a Lustre filesystem is mounted, any authenticated client can access files within it using the same path and filename, but the initial mount point for the Lustre filesystem is not pre-defined and need not be the same on every client system.

If desired, a uniform mountpoint for the Lustre filesystem can be enforced administratively by simply mounting the Lustre filesystem on the same directory on every client, but this is not mandatory.

 

Lustre intends to simplify mounting remote storage by setting special bits on directories that are to be used as mount points, and then storing the mount information in a special file in each such directory. This is completely compatible with every existing Linux file system, eliminates the extra overhead required in obtaining the mount information from a symbolic link, and makes it possible to identify mountpoints without actually traversing them unless you actually need information about the remote file system.

An easily overlooked benefit of the Lustre mount mechanism is that it provides greater flexibility than existing Linux mount mechanisms. Standard Linux client systems use the file /etc/fstab to maintain information about all of the file systems that should be mounted on that client. The Lustre mount mechanism transfers the responsibility for maintaining mount information from a single, per-client file into the client file system.

 The Lustre mount mechanism also makes it easy to mount other file systems within a Lustre file system, without requiring that each client be aware of all file systems mounted within Lustre. Lustre is therefore not only a powerful distributed file system in its own right, but also serves as a powerful integration mechanism for other existing distributed file systems.

6.2. Metadata and File I/O Performance Improvements

One of the first optimizations to be introduced is the use of a writeback cache for file writes to provide higher overall performance. Currently, Lustre writes are write-through, which means that write requests are not complete until the data is actually flushed to the target OSTs. In busy clusters, this can impose a considerable delay on every file write operation. The use of a writeback cache, where file writes are journaled and committed asynchronously, promises substantially higher performance for file write requests.
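
The difference between the two write policies can be shown with a minimal sketch. A dictionary stands in for the target OST, and the class and method names are hypothetical; the sketch ignores crash recovery of the pending journal, which a real writeback cache would have to handle.

```python
class WriteThrough:
    def __init__(self, backend):
        self.backend = backend

    def write(self, key, data):
        # The call returns only after the target holds the data.
        self.backend[key] = data


class WriteBack:
    def __init__(self, backend):
        self.backend = backend
        self.journal = []  # ordered record of writes not yet committed

    def write(self, key, data):
        # Record the write and return immediately.
        self.journal.append((key, data))

    def flush(self):
        # Commit pending writes to the target in their original order.
        while self.journal:
            key, data = self.journal.pop(0)
            self.backend[key] = data
```

In the writeback case the caller's latency is the cost of appending to the journal, while the slower transfer to the target happens later in `flush`, which could run in the background or be triggered when the journal grows too large.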

 

The Lustre file system currently stores and updates all file metadata (except allocation data, which is held on the OST) through a single (failover) metadata server. While simple, accurate, and already very scalable, depending upon a single metadata server can reduce the performance of metadata operations in Lustre. Metadata performance can be greatly improved by implementing clustered metadata servers. Distributing metadata information across the cluster will also result in distributing the metadata processing load across the cluster, improving the overall throughput of metadata operations.

A writeback cache for Lustre metadata servers is also being considered. If a writeback cache for metadata is present, metadata updates would be first written to this cache and would subsequently be flushed to persistent storage on the metadata servers at a later time. This will dramatically improve the latency of updates as seen by client systems. It also enables batch metadata updates, which could reduce communications and increase parallelism.


Read scalability is achieved in a very different manner. Good examples of potential bottlenecks in a clustered environment are system files or centralized binaries that are stored in Lustre but which are required by multiple clients throughout the cluster at boot-time. If multiple clients were rebooted at the same time, they would all need to access the same system files simultaneously. However, if every client has to read the data from the same server, the network load at that server could be extremely high and the available network bandwidth at that server could pose a serious bottleneck for general file system performance. Using a collaborative cache, where frequently requested files could be cached across multiple servers, would help distribute the read load across multiple nodes, reducing the bandwidth requirements at each.


A second example arises in situations like video servers, where a large amount of data is read and written out to network attached devices. Lustre will provide QoS guarantees to support such high-rate I/O workloads.


A collaborative cache for Lustre will be added which enables multiple OSTs to cache data that is frequently used by multiple clients. In Lustre, read requests for a file are serviced in two phases: a lock request precedes the actual read request, and while the OST is providing the read lock, it can assess where in the cluster the data has already been cached and include a referral to that node for reading. The Lustre collaborative cache is globally coherent.
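
The two-phase read described above can be sketched as follows. The classes and the round-robin choice among cached copies are illustrative assumptions, not Lustre's protocol, and cache coherency (invalidating copies on write) is deliberately left out.

```python
class Node:
    """A node that can serve object data (the home OST or a caching peer)."""

    def __init__(self, name):
        self.name = name
        self.objects = {}
        self.reads_served = 0

    def read(self, obj_id):
        self.reads_served += 1
        return self.objects[obj_id]


class LockServer:
    """Home OST role: grants read locks and tracks where data is cached."""

    def __init__(self, home):
        self.home = home
        self.cached_at = {}  # obj_id -> list of nodes holding a copy
        self.next_peer = {}  # obj_id -> round-robin counter (assumption)

    def register_copy(self, obj_id, node):
        node.objects[obj_id] = self.home.objects[obj_id]
        self.cached_at.setdefault(obj_id, []).append(node)

    def lock_for_read(self, obj_id):
        """Phase 1: grant the lock and return a referral to a node."""
        candidates = [self.home] + self.cached_at.get(obj_id, [])
        i = self.next_peer.get(obj_id, 0)
        self.next_peer[obj_id] = i + 1
        return candidates[i % len(candidates)]


def client_read(server, obj_id):
    target = server.lock_for_read(obj_id)  # phase 1: lock plus referral
    return target.read(obj_id)             # phase 2: read from referred node
```

With one home node and two caching peers, six client reads of the same object are served two apiece by the three nodes, so no single node carries the whole read load.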


6.3. Advanced Security

File system security is a very important aspect of a distributed file system. The standard aspects of security are authentication, authorization, and encryption. While SANs are largely unprotected, Lustre provides the OSTs with secure network attached disk (NASD) features.


Rather than selecting and integrating a specific authentication service, Lustre can easily be integrated with existing authentication mechanisms using the Generic Security Service Application Programming Interface (GSSAPI), an open standard that provides secure session communications supporting authentication, data integrity, and data confidentiality. Lustre authentication will support Kerberos 5 and PKI mechanisms as a backend for authentication.


Lustre intends to provide authorization using access control lists that follow the POSIX ACL semantics. The flexibility and additional capabilities provided by ACLs are especially important in clusters that may support thousands of nodes and user accounts.


Data privacy is expected to be ensured using an encryption mechanism such as that provided by the StorageTek/University of Minnesota SFS file system, where data can actually be automatically encrypted and decrypted on the client based on a shared key protocol which makes file sharing by project a natural operation.


The OSTs are protected with a very efficient capability-based security mechanism which provides significant optimizations over the original NASD protocol.
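
The general idea of a capability-based check can be sketched as below. This is a generic illustration, not Lustre's or NASD's wire format: the field layout, the HMAC-SHA256 tag, and the function names are assumptions. An issuer that shares a secret with the storage target signs a statement of rights, and the target can verify it locally without contacting the issuer.

```python
import hashlib
import hmac


def issue_capability(secret: bytes, object_id: str, rights: str, expires: int) -> dict:
    """Issuer side: sign (object, rights, expiry) with the shared secret."""
    message = f"{object_id}|{rights}|{expires}".encode("utf-8")
    tag = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return {"object_id": object_id, "rights": rights, "expires": expires, "tag": tag}


def check_capability(secret: bytes, cap: dict, object_id: str, op: str, now: int) -> bool:
    """Storage side: recompute the tag and check object, operation, expiry."""
    message = f"{cap['object_id']}|{cap['rights']}|{cap['expires']}".encode("utf-8")
    expected = hmac.new(secret, message, hashlib.sha256).hexdigest()
    return (
        hmac.compare_digest(expected, cap["tag"])
        and cap["object_id"] == object_id
        and op in cap["rights"]
        and now < cap["expires"]
    )
```

A client that tampers with the rights field, presents the capability for a different object, or uses it after expiry fails the check, because the recomputed tag or one of the field comparisons no longer matches.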


 

6.4. Further Features

Sharing existing file systems in a cluster file system to provide redundancy and load balancing to existing operations is the ultimate dream for small scale clusters. Lustre's careful separation of protocols and code modules makes this a relatively simple target.

Lustre will provide file system sharing with full coherency by providing support for SAN networking, together with a combined MDS/OST which exports both the data and metadata APIs from a single file system.

File system snapshots are a cornerstone of enterprise storage management. Lustre will provide fully featured snapshots, including rollback, old file directories, and copy-on-write semantics. This will be implemented as a combination of snapshot infrastructure on the client, OSTs, and metadata servers, each requiring only a small addition to its infrastructure.
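
Copy-on-write snapshots and rollback can be illustrated with a minimal in-memory sketch. The `SnapshotVolume` class and its block-level granularity are assumptions for illustration and say nothing about how Lustre divides this work among clients, OSTs, and metadata servers.

```python
class SnapshotVolume:
    """Toy block store with copy-on-write snapshots (illustrative only)."""

    def __init__(self):
        self.blocks = {}     # live view: block number -> bytes
        self.snapshots = {}  # snapshot name -> old blocks preserved so far

    def snapshot(self, name):
        # Taking a snapshot copies nothing; old data is saved lazily.
        self.snapshots[name] = {}

    def write(self, block_no, data):
        # Copy on write: preserve the old content (None if the block did
        # not exist) in every snapshot that has not saved this block yet.
        for saved in self.snapshots.values():
            if block_no not in saved:
                saved[block_no] = self.blocks.get(block_no)
        self.blocks[block_no] = data

    def read_snapshot(self, name, block_no):
        saved = self.snapshots[name]
        if block_no in saved:
            return saved[block_no]
        return self.blocks.get(block_no)  # unchanged since the snapshot

    def rollback(self, name):
        # Restore preserved blocks, then drop snapshots taken after `name`.
        for block_no, old in self.snapshots[name].items():
            if old is None:
                self.blocks.pop(block_no, None)
            else:
                self.blocks[block_no] = old
        names = list(self.snapshots)
        for later in names[names.index(name) + 1:]:
            del self.snapshots[later]
        self.snapshots[name] = {}
```

Only blocks that change after a snapshot consume extra space, which is why each component needs only a small addition to support snapshots in a scheme of this kind.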

 


7. Summary

Lustre is an advanced storage architecture and distributed file system that provides significant performance, scalability, and flexibility to computing clusters, enterprise networks, and shared-data, network-oriented computing environments. Lustre uses an object storage model for file I/O and storage management to provide a substantially more efficient division of labor between computing and storage resources. Replicated, failover Metadata Servers (MDSs) maintain a transactional record of high-level file and file system changes. Distributed Object Storage Targets (OSTs) are responsible for actual file system I/O and for interfacing with local or networked storage devices known as Object-Based Disks (OBDs).


Lustre leverages open standards such as Linux, XML, LDAP, SNMP, readily available open source libraries, and existing file systems to provide a powerful, scalable, reliable distributed file system. Lustre uses sophisticated, cutting-edge failover, replication, and recovery techniques to eliminate downtime and to maximize file system availability, thereby maximizing performance and productivity. Cluster File Systems, Inc., creators of Lustre, are actively working with hardware manufacturers to help develop the next generation of intelligent storage devices, where hardware improvements can further offload data processing from the software components of a distributed file system to the storage devices themselves.


 

Lustre is open source software licensed under the GPL. Cluster File Systems provides customization, contract development, training, and service for Lustre. In addition to service, our partners can also provide packaged solutions under their own licensing terms.


 

Contact Cluster File Systems, Inc. at info@clusterfs.com. You can obtain additional information about Lustre, including the current documentation and source code, from the Lustre Web site at http://www.lustre.org.

Lustre Whitepaper Version 1.0: November 11th, 2002

 
