The Google File System, Part 6: MEASUREMENTS

6. MEASUREMENTS
In this section we present a few micro-benchmarks to illustrate the bottlenecks inherent in the GFS architecture and implementation, and also some numbers from real clusters in use at Google.

6.1 Micro-benchmarks
We measured performance on a GFS cluster consisting of one master, two master replicas, 16 chunkservers, and 16 clients. 
Note that this configuration was set up for ease of testing. 
Typical clusters have hundreds of chunkservers and hundreds of clients.
All the machines are configured with dual 1.4 GHz PIII processors, 2 GB of memory, two 80 GB 5400 rpm disks, and a 100 Mbps full-duplex Ethernet connection to an HP 2524 switch. 
All 19 GFS server machines are connected to one switch, and all 16 client machines to the other. 
The two switches are connected with a 1 Gbps link.

6.1.1 Reads
N clients read simultaneously from the file system. 
Each client reads a randomly selected 4 MB region from a 320 GB file set. 
This is repeated 256 times so that each client ends up reading 1 GB of data. 
The chunkservers taken together have only 32 GB of memory, so we expect at most a 10% hit rate in the Linux buffer cache. 
Our results should be close to cold cache results.

Figure 3(a) shows the aggregate read rate for N clients and its theoretical limit. 


The limit peaks at an aggregate of 125 MB/s when the 1 Gbps link between the two switches is saturated, or 12.5 MB/s per client when its 100 Mbps network interface gets saturated, whichever applies. 
The observed read rate is 10 MB/s, or 80% of the per-client limit, when just one client is reading. 
The aggregate read rate reaches 94 MB/s, about 75% of the 125 MB/s link limit, for 16 readers, or 6 MB/s per client. 
The efficiency drops from 80% to 75% because as the number of readers increases, so does the probability that multiple readers simultaneously read from the same chunkserver.
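
As a quick sanity check on these numbers, the sketch below recomputes the theoretical read limits and the reported efficiencies from the topology described above; it is back-of-envelope only and treats 1 Gbps as 125 MB/s and 100 Mbps as 12.5 MB/s.

    # Back-of-envelope check of the read limits in Section 6.1.1.
    LINK_LIMIT_MBPS = 125.0    # 1 Gbps inter-switch link
    CLIENT_NIC_MBPS = 12.5     # 100 Mbps per-client interface

    def read_limit(n_clients):
        """Aggregate theoretical read limit for N clients, in MB/s."""
        return min(n_clients * CLIENT_NIC_MBPS, LINK_LIMIT_MBPS)

    # One client: limit 12.5 MB/s; observed 10 MB/s -> 80% efficiency.
    print(10 / read_limit(1))      # 0.8
    # Sixteen clients: limit 125 MB/s; observed 94 MB/s -> ~75% efficiency.
    print(94 / read_limit(16))     # 0.752
    # Expected buffer-cache hit rate: 32 GB of chunkserver RAM vs a 320 GB file set.
    print(32 / 320)                # 0.1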


6.1.2 Writes
N clients write simultaneously to N distinct files. 
Each client writes 1 GB of data to a new file in a series of 1 MB writes. 

The aggregate write rate and its theoretical limit are shown in Figure 3(b). 


The limit plateaus at 67 MB/s because we need to write each byte to 3 of the 16 chunkservers, each with a 12.5 MB/s input connection.
The write rate for one client is 6.3 MB/s, about half of the limit. 
The main culprit for this is our network stack. 
It does not interact very well with the pipelining scheme we use for pushing data to chunk replicas. 
Delays in propagating data from one replica to another reduce the overall write rate.
Aggregate write rate reaches 35 MB/s for 16 clients (or 2.2 MB/s per client), about half the theoretical limit. 
As in the case of reads, it becomes more likely that multiple clients write concurrently to the same chunkserver as the number of clients increases. 
Moreover, collision is more likely for 16 writers than for 16 readers because each write involves three different replicas.
Writes are slower than we would like. 
In practice this has not been a major problem because even though it increases the latencies as seen by individual clients, it does not significantly affect the aggregate write bandwidth delivered by the system to a large number of clients.
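
The 67 MB/s plateau follows directly from the replication factor and the chunkservers' inbound bandwidth; a minimal check, under the same unit assumptions as the read sketch above:

    # Each byte written is received by 3 of the 16 chunkservers,
    # each limited to a 12.5 MB/s (100 Mbps) inbound connection.
    N_CHUNKSERVERS = 16
    REPLICAS = 3
    INBOUND_MBPS = 12.5

    write_limit = N_CHUNKSERVERS * INBOUND_MBPS / REPLICAS
    print(write_limit)          # ~66.7 MB/s, i.e. the ~67 MB/s plateau
    # Observed: 35 MB/s aggregate for 16 clients, about half of the limit.
    print(35 / write_limit)     # ~0.52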


6.1.3 Record Appends

Figure 3(c) shows record append performance. 


N clients append simultaneously to a single file. 
Performance is limited by the network bandwidth of the chunkservers that store the last chunk of the file, independent of the number of clients. 
It starts at 6.0 MB/s for one client and drops to 4.8 MB/s for 16 clients, mostly due to congestion and variances in network transfer rates seen by different clients.
Our applications tend to produce multiple such files concurrently. 
In other words, N clients append to M shared files simultaneously where both N and M are in the dozens or hundreds. 
Therefore, the chunkserver network congestion in our experiment is not a significant issue in practice because a client can make progress on writing one file while the chunkservers for another file are busy.

6.2 Real World Clusters
We now examine two clusters in use within Google that are representative of several others like them. 
Cluster A is used regularly for research and development by over a hundred engineers. 
A typical task is initiated by a human user and runs up to several hours. 
It reads through a few MBs to a few TBs of data, transforms or analyzes the data, and writes the results back to the cluster. 
Cluster B is primarily used for production data processing. 
The tasks last much longer and continuously generate and process multi-TB data sets with only occasional human intervention. 
In both cases, a single “task” consists of many processes on many machines reading and writing many files simultaneously.

6.2.1 Storage
As shown by the first five entries in the table, both clusters have hundreds of chunkservers, support many TBs of disk space, and are fairly but not completely full. 
“Used space” includes all chunk replicas. 
Virtually all files are replicated three times. 
Therefore, the clusters store 18 TB and 52 TB of file data respectively.
The two clusters have similar numbers of files, though B has a larger proportion of dead files, namely files which were deleted or replaced by a new version but whose storage has not yet been reclaimed. 
It also has more chunks because its files tend to be larger. 

6.2.2 Metadata
The chunkservers in aggregate store tens of GBs of metadata, mostly the checksums for 64 KB blocks of user data.
The only other metadata kept at the chunkservers is the chunk version number discussed in Section 4.5.
The metadata kept at the master is much smaller, only tens of MBs, or about 100 bytes per file on average. 
This agrees with our assumption that the size of the master’s memory does not limit the system’s capacity in practice.
Most of the per-file metadata is the file names stored in a prefix-compressed form. 
Other metadata includes file ownership and permissions, mapping from files to chunks, and each chunk’s current version. 
In addition, for each chunk we store the current replica locations and a reference count for implementing copy-on-write.
Each individual server, both chunkservers and the master, has only 50 to 100 MB of metadata. 
Therefore recovery is fast: 
it takes only a few seconds to read this metadata from disk before the server is able to answer queries. 
However, the master is somewhat hobbled for a period – typically 30 to 60 seconds – until it has fetched chunk location information from all chunkservers.
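
To see why this metadata stays small relative to the data it describes, here is a rough estimate, assuming the 32-bit checksum per 64 KB block described in Section 5.2 and the roughly 100 bytes of master metadata per file reported above (the file count below is hypothetical, chosen only for illustration):

    KB = 1024
    GB = 1024 ** 3
    TB = 1024 ** 4

    # Chunkserver side: one 32-bit (4-byte) checksum per 64 KB block.
    checksum_bytes_per_tb = (TB // (64 * KB)) * 4
    print(checksum_bytes_per_tb / GB)   # 0.0625 GB, i.e. ~64 MB of checksums per TB stored

    # Master side: ~100 bytes of metadata per file on average.
    files = 500_000                     # hypothetical file count
    print(files * 100 / (1024 ** 2))    # ~48 MB, consistent with "tens of MBs"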

6.2.3 Read and Write Rates
Table 3 shows read and write rates for various time periods. 
Both clusters had been up for about one week when these measurements were taken. 
(The clusters had been restarted recently to upgrade to a new version of GFS.)
The average write rate was less than 30 MB/s since the restart. 
When we took these measurements, B was in the middle of a burst of write activity generating about 100 MB/s of data, which produced a 300 MB/s network load because writes are propagated to three replicas.

Figure 3: Aggregate Throughputs. 
Top curves show theoretical limits imposed by our network topology. 
Bottom curves show measured throughputs. They have error bars that show 95% confidence intervals, which are illegible in some cases because of low variance in measurements.

Table 3: Performance Metrics for Two GFS Clusters
The read rates were much higher than the write rates.
The total workload consists of more reads than writes as we have assumed. 
Both clusters were in the middle of heavy read activity. 
In particular, A had been sustaining a read rate of 580 MB/s for the preceding week.
Its network configuration can support 750 MB/s, so it was using its resources efficiently.
Cluster B can support peak read rates of 1300 MB/s, but its applications were using just 380 MB/s.

6.2.4 Master Load
Table 3 also shows that the rate of operations sent to the master was around 200 to 500 operations per second. 
The master can easily keep up with this rate, and therefore is not a bottleneck for these workloads.
In an earlier version of GFS, the master was occasionally a bottleneck for some workloads. 
It spent most of its time sequentially scanning through large directories 
(which contained hundreds of thousands of files) looking for particular files. 
We have since changed the master data structures to allow efficient binary searches through the namespace. 
It can now easily support many thousands of file accesses per second. 
If necessary, we could speed it up further by placing name lookup caches in front of the namespace data structures.
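
The paper does not spell out the data structure used for these binary searches; the sketch below is only a minimal illustration of the idea, using a sorted array of full pathnames (the prefix compression is omitted for clarity, and the paths are invented):

    import bisect

    # Hypothetical sketch: the namespace kept as a sorted list of full pathnames.
    # Binary search gives O(log n) lookups, and "list directory" becomes a scan
    # over the contiguous range of names sharing the directory prefix.
    namespace = sorted([
        "/data/logs/2003-10-01",
        "/data/logs/2003-10-02",
        "/data/results/part-00000",
        "/home/alice/notes",
    ])

    def lookup(path):
        i = bisect.bisect_left(namespace, path)
        return i < len(namespace) and namespace[i] == path

    def list_dir(prefix):
        # ASCII paths assumed; prefix + "\xff" bounds the range of matching names.
        lo = bisect.bisect_left(namespace, prefix)
        hi = bisect.bisect_left(namespace, prefix + "\xff")
        return namespace[lo:hi]

    print(lookup("/data/results/part-00000"))  # True
    print(list_dir("/data/logs/"))             # the two log files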

6.2.5 Recovery Time
After a chunkserver fails, some chunks will become under-replicated and must be cloned to restore their replication levels. 
The time it takes to restore all such chunks depends on the amount of resources. 
In one experiment, we killed a single chunkserver in cluster B. 
The chunkserver had about 15,000 chunks containing 600 GB of data. 
To limit the impact on running applications and provide leeway for scheduling decisions, our default parameters limit this cluster to 91 concurrent clonings (40% of the number of chunkservers) where each clone operation is allowed to consume at most 6.25 MB/s (50 Mbps). 
All chunks were restored in 23.2 minutes, at an effective replication rate of 440 MB/s.
In another experiment, we killed two chunkservers each with roughly 16,000 chunks and 660 GB of data. 
This double failure reduced 266 chunks to having a single replica. 
These 266 chunks were cloned at a higher priority, and were all restored to at least 2x replication within 2 minutes, thus putting the cluster in a state where it could tolerate another chunkserver failure without data loss.
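
The reported effective replication rate can be reproduced from the numbers above; a quick check, treating 1 GB as 1024 MB:

    # Cloning cap: 91 concurrent clone operations (40% of the chunkservers),
    # each throttled to 6.25 MB/s (50 Mbps).
    print(91 * 6.25)                    # 568.75 MB/s upper bound
    # Effective rate actually achieved: 600 GB restored in 23.2 minutes.
    print(600 * 1024 / (23.2 * 60))     # ~441 MB/s, matching the reported 440 MB/s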

6.3 Workload Breakdown
In this section, we present a detailed breakdown of the workloads on two GFS clusters comparable but not identical to those in Section 6.2. Cluster X is for research and development while cluster Y is for production data processing.

6.3.1 Methodology and Caveats
These results include only client originated requests so that they reflect the workload generated by our applications for the file system as a whole. 
They do not include inter-server requests to carry out client requests or internal background activities, such as forwarded writes or rebalancing.
Statistics on I/O operations are based on information heuristically reconstructed from actual RPC requests logged by GFS servers. 
For example, GFS client code may break a read into multiple RPCs to increase parallelism, from which we infer the original read. 
Since our access patterns are highly stylized, we expect any error to be in the noise. 
Explicit logging by applications might have provided slightly more accurate data, but it is logistically impossible to recompile and restart thousands of running clients to do so and cumbersome to collect the results from as many machines.
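
The paper does not describe the reconstruction procedure itself; the sketch below shows one plausible form of the heuristic, coalescing contiguous read RPCs issued by the same client against the same file into a single inferred application-level read (the record layout and field names are hypothetical):

    # Hypothetical log records: (client_id, file_id, offset, length), in log order.
    rpc_reads = [
        ("client-7", "fileA", 0,       1 << 20),
        ("client-7", "fileA", 1 << 20, 1 << 20),   # contiguous with the previous RPC
        ("client-7", "fileA", 8 << 20, 1 << 20),   # a separate seek, new logical read
    ]

    def coalesce(records):
        """Merge contiguous per-RPC reads into inferred logical reads."""
        logical = []
        for client, file_id, off, length in records:
            if logical:
                c, f, o, l = logical[-1]
                if (c, f) == (client, file_id) and o + l == off:
                    logical[-1] = (c, f, o, l + length)   # extend the previous read
                    continue
            logical.append((client, file_id, off, length))
        return logical

    for r in coalesce(rpc_reads):
        print(r)   # two inferred reads: one of 2 MB, one of 1 MB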

One should be careful not to overly generalize from our workload. 
Since Google completely controls both GFS and its applications, the applications tend to be tuned for GFS, and conversely GFS is designed for these applications. 
Such mutual influence may also exist between general applications and file systems, but the effect is likely more pronounced in our case.

Table 4: Operations Breakdown by Size (%). 
For reads, the size is the amount of data actually read and transferred, rather than the amount requested.

6.3.2 Chunkserver Workload

Table 4 shows the distribution of operations by size. 


Read sizes exhibit a bimodal distribution. 
The small reads (under 64 KB) come from seek-intensive clients that look up small pieces of data within huge files. 
The large reads (over 512 KB) come from long sequential reads through entire files.

A significant number of reads return no data at all in cluster Y. Our applications, especially those in the production systems, often use files as producer-consumer queues. 
Producers append concurrently to a file while a consumer reads the end of file. 
Occasionally, no data is returned when the consumer outpaces the producers. 
Cluster X shows this less often because it is usually used for short-lived data analysis tasks rather than long-lived distributed applications.
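
As an illustration of that pattern, here is a minimal single-process mock of a producer-consumer queue built on record append; the RecordFile class below is a toy stand-in invented for this sketch, not the GFS client API, and concurrency is elided.

    class RecordFile:
        """Toy stand-in for a GFS file accessed via record append (hypothetical API)."""
        def __init__(self):
            self.records = []
        def record_append(self, data):
            self.records.append(data)        # models GFS's atomic, at-least-once appends
        def read_from(self, index):
            return self.records[index:]      # read everything past the consumer's position

    queue = RecordFile()
    consumed = 0

    # Producers append concurrently in the real system; serialized here for clarity.
    for i in range(3):
        queue.record_append(f"record-{i}")

    # The consumer reads the tail of the file; an empty result means it has
    # outpaced the producers, the "no data returned" case described above.
    batch = queue.read_from(consumed)
    consumed += len(batch)
    print(batch)                       # ['record-0', 'record-1', 'record-2']
    print(queue.read_from(consumed))   # []
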
Write sizes also exhibit a bimodal distribution. 

The large writes (over 256 KB) typically result from significant buffering within the writers. 
Writers that buffer less data, checkpoint or synchronize more often, or simply generate less data account for the smaller writes (under 64 KB).
As for record appends, cluster Y sees a much higher percentage of large record appends than cluster X does because our production systems, which use cluster Y, are more aggressively tuned for GFS.

Table 5 shows the total amount of data transferred in operations of various sizes. 


For all kinds of operations, the larger operations (over 256 KB) generally account for most of the bytes transferred. 
Small reads (under 64 KB) do transfer a small but significant portion of the read data because of the random seek workload.

6.3.3 Appends versus Writes
Record appends are heavily used especially in our production systems. 
For cluster X, the ratio of writes to record appends is 108:1 by bytes transferred and 8:1 by operation counts. 
For cluster Y, used by the production systems, the ratios are 3.7:1 and 2.5:1 respectively. 
Moreover, these ratios suggest that for both clusters record appends tend to be larger than writes. 
For cluster X, however, the overall usage of record append during the measured period is fairly low and so the results are likely skewed by one or two applications with particular buffer size choices.
As expected, our data mutation workload is dominated by appending rather than overwriting. 
We measured the amount of data overwritten on primary replicas. 
This approximates the case where a client deliberately overwrites previous written data rather than appends new data. 
For cluster X, overwriting accounts for under 0.0001% of bytes mutated and under 0.0003% of mutation operations. 
For cluster Y, the ratios are both 0.05%. Although this is minute, it is still higher than we expected. 
It turns out that most of these overwrites came from client retries due to errors or timeouts. 
They are not part of the workload per se but a consequence of the retry mechanism.

6.3.4 Master Workload

Table 6 shows the breakdown by type of requests to the master. 


Most requests ask for chunk locations (FindLocation) for reads and lease holder information (FindLeaseLocker) for data mutations.
Clusters X and Y see significantly different numbers of Delete requests because cluster Y stores production data sets that are regularly regenerated and replaced with newer versions. 
Some of this difference is further hidden in the difference in Open requests because an old version of a file may be implicitly deleted by being opened for write from scratch (mode “w” in Unix open terminology).
FindMatchingFiles is a pattern matching request that supports “ls” and similar file system operations. 
Unlike other requests for the master, it may process a large part of the namespace and so may be expensive. 
Cluster Y sees it much more often because automated data processing tasks tend to examine parts of the file system to understand global application state. 
In contrast, cluster X’s applications are under more explicit user control and usually know the names of all needed files in advance.
