Overcoming the I/O Bottleneck with General Parallel File System

By Andrew Naiberg
It used to be that I/O was faster than computation. In fact, not too long ago a supercomputer could be loosely defined as any machine that turned a compute-bound problem into an I/O-bound problem. However, dramatic increases in CPU, memory and bus speeds have turned this relationship on its head—now disk I/O is usually the critical factor limiting application performance and the ability to share data across a computing cluster. First seen in scientific supercomputers, this I/O bottleneck is now common in many data-intensive business applications such as digital media, financial analysis, business intelligence, engineering design, medical imaging, geographic/geological analysis and so forth. And with data volumes, CPU and interconnect speeds still increasing, the I/O bottleneck problem will likely only get worse.
 
Parallel file systems (also known as cluster file systems) have emerged as a powerful solution to the I/O bottleneck and IBM’s General Parallel File System (GPFS) is among the best. Originally developed for digital-media applications, GPFS now powers many of the world’s most powerful scientific supercomputers and holds the world’s terabyte sort record and several other performance awards. Parallel file systems offer three primary advantages over traditional distributed and SAN file systems:
 
· High bandwidth—Parallel file systems are very effective when distributed or SAN file systems can’t deliver the aggregate bandwidth required for the environment. Where network file systems typically deliver less than 125 MB/second and SAN file systems top out around 500 MB/second, GPFS has delivered 15 GB/second on a single node. Moreover, GPFS can scale this performance as more nodes are attached, delivering enormous aggregate bandwidth. In addition to its world terabyte sort record, GPFS won awards for both the highest bandwidth and the most I/Os per second (running a real application) at the 2004 Supercomputing Conference.
 
· Data sharing—Another key advantage of parallel file systems is that all of the attached nodes have equal access to all of the data on the underlying disks, making parallel file systems ideal for cluster environments where many users or applications work with the same data (e.g., many engineers can share a single set of design files). And GPFS recently added unique “multi-cluster” support, enabling data sharing and collaboration across interconnected clusters; this capability is currently being used to share data and results across a consortium of European research centers.
 
· High reliability without bottlenecks—Unlike distributed file systems, which transfer all of the data through a single server and path, parallel file systems aren’t client/server designs and employ redundant paths, allowing configurations that eliminate all single points of failure; if one path fails, data can flow via another one. Even SAN file systems, which don’t transfer all of the data through a single data server, typically must access a single metadata server to initiate a transfer, again impacting performance. And despite efforts to alleviate these bottlenecks with simple mechanisms to split the load among multiple data or metadata servers, there’s simply no good way to prevent overload or failure with these designs. Conversely, GPFS stores both data and metadata across any or all of the disks in the cluster so there’s no single data server, metadata server or data path to act as a bottleneck or single point of failure.
 
With these capabilities come several others, including the ability for multiple users or applications to access different parts of a single file simultaneously and high scalability. GPFS currently supports production clusters of more than 2,200 nodes and file systems comprising more than 1,000 disks and hundreds of terabytes of storage.
 
How Does GPFS Do It?
The key to GPFS’s bandwidth is that GPFS divides individual files into multiple blocks and “stripes” (stores) these blocks across all of the disks in a file system (see Figure 1, below). To read or write a file, GPFS initiates multiple I/Os in parallel, thereby performing the transfer quickly. In addition, each block is much larger than in a traditional file system—typically 256 KB and up to 1024 KB. This enables GPFS to transfer large amounts of data with each operation and reduce the effect of seek latency.
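As a rough illustration of the idea, the sketch below (Python, not GPFS code) maps a file's blocks onto a set of disks round-robin and issues one I/O per block in parallel. The block size, disk names and placement function are illustrative assumptions, not GPFS internals.

```python
# A minimal sketch of striping: a file's blocks are spread across disks
# and read with one I/O per block, so all disks work in parallel.
# Block size, disk list and placement are illustrative assumptions.
import concurrent.futures

BLOCK_SIZE = 256 * 1024                          # e.g., a 256 KB file system block
DISKS = ["disk0", "disk1", "disk2", "disk3"]     # hypothetical disks in the file system

def block_location(block_index):
    """Map the Nth block of a file to a disk, round-robin style."""
    return DISKS[block_index % len(DISKS)]

def read_block(disk, block_index):
    # Stand-in for an actual I/O request to one disk.
    return f"{disk}: block {block_index}"

def read_file(num_blocks):
    """Issue one I/O per block, to all disks in parallel."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(DISKS)) as pool:
        futures = [pool.submit(read_block, block_location(i), i)
                   for i in range(num_blocks)]
        return [f.result() for f in futures]

if __name__ == "__main__":
    # A 2 MB file at 256 KB per block is 8 blocks, striped over 4 disks.
    for line in read_file(8):
        print(line)
```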
 
To keep track of all of these blocks and ensure data integrity, GPFS implements a distributed byte-range locking mechanism. The “distributed” part synchronizes file system operations across compute nodes so that, although file system management is distributed across many machines for optimal performance, the entire file system looks like a single file server to every node in the cluster. The “byte-range” part means that rather than locking an entire file when it’s accessed and preventing all other access, as traditional file systems do, GPFS locks individual parts of a file separately. This enables multiple users, applications or parallel jobs to work on different parts of a single file simultaneously, offering many benefits. For example, multiple engineers or applications can access and update a single design file simultaneously, eliminating the additional storage and overhead associated with storing multiple copies, not to mention the effort required to merge all of the copies into a finished product. In a broadcast news environment, video editors can work on a live video feed as it streams in from the field, accelerating the time to air.
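The sketch below illustrates the byte-range idea using ordinary POSIX advisory locks (Python's fcntl.lockf). GPFS's distributed lock manager is internal to the file system, so this only shows the concept of locking one region of a file rather than the whole file; the file name, offsets and write are made up for illustration.

```python
# A minimal sketch of byte-range locking with POSIX advisory locks.
# This illustrates the concept only; it is not how GPFS implements its
# distributed lock manager. Path and offsets are hypothetical.
import fcntl
import os

BLOCK = 256 * 1024

path = "/tmp/shared_design_file.dat"
fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
os.ftruncate(fd, 4 * BLOCK)

# Writer A locks bytes [0, BLOCK); another process could lock
# [BLOCK, 2 * BLOCK) at the same time without conflict.
fcntl.lockf(fd, fcntl.LOCK_EX, BLOCK, 0, os.SEEK_SET)   # lock only the first block
os.pwrite(fd, b"A" * BLOCK, 0)                          # update only that region
fcntl.lockf(fd, fcntl.LOCK_UN, BLOCK, 0, os.SEEK_SET)   # release the range

os.close(fd)
```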
 
Designed for High Reliability
In addition to eliminating the single points of failure associated with individual servers or paths (as previously discussed), GPFS is also designed to accommodate hardware failures. In fact, several commercial customers use GPFS primarily for its reliability rather than its performance. To protect against failure of a compute node, each node logs all of its updates and stores them on shared disks just like all of the other data and metadata. If a node fails, another node can access the failed node’s log to determine what updates were in progress and restore the affected file(s). These files can be accessed normally once they’re consistent again. To protect against disk failures, GPFS can stripe its data across RAID disks and be configured to store copies of data and metadata on different disks. Finally, GPFS doesn’t require the file system to be taken down to make configuration changes such as adding, removing or replacing disks in an existing file system or adding nodes to an existing cluster.
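As a conceptual sketch of the recovery idea described above (an illustration built on assumptions, not GPFS internals), the following logs each update before applying it and lets a surviving node replay the failed node's log to bring the affected files back to a consistent state.

```python
# A conceptual write-ahead-log sketch: record intent before updating,
# replay the log after a node failure. Names and record format are
# invented for illustration; GPFS's actual logging differs.
import json

def log_update(log, file_id, offset, data):
    """Record the intent before applying the update (write-ahead)."""
    log.append(json.dumps({"file": file_id, "offset": offset, "data": data}))

def replay_log(log, apply_update):
    """A surviving node re-applies the failed node's logged updates."""
    for entry in log:
        rec = json.loads(entry)
        apply_update(rec["file"], rec["offset"], rec["data"])

# Example: the failed node had logged one in-flight update.
shared_log = []
log_update(shared_log, "design.cad", 4096, "new header")

recovered = {}
replay_log(shared_log,
           lambda f, off, d: recovered.setdefault(f, []).append((off, d)))
print(recovered)   # {'design.cad': [(4096, 'new header')]}
```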
 
Highly Available Access
GPFS supports a variety of disk hardware and offers three configuration options ranging from a full-access SAN implementation for ultimate performance to a shared-disk server model that’s less expensive as the cluster gets very large. GPFS is a proven solution for virtually any environment requiring extremely reliable, high-bandwidth shared data access.
------------------------------------------------------------------------------------------------------------------ 
Andrew Naiberg is the product-marketing manager for pSeries software. He’s also been a software engineer and service delivery specialist since joining IBM in 1997. Andrew can be reached at anaiberg@us.ibm.com.
 
 
Figure 1. GPFS divides each file into blocks and stripes the blocks across all of the disks in a file system.