Bigtable: A Distributed Storage System for Structured Data: Part 9, Lessons

Source: Internet  Published under: 网络舆论引导的重要性  Editor: 程序博客网  Time: 2024/05/16 02:18
9 Lessons
In the process of designing, implementing, maintaining, and supporting Bigtable, we gained useful experience and learned several interesting lessons.
One lesson we learned is that large distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures assumed in many distributed protocols. 
For example, we have seen problems due to all of the following causes:
memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using (Chubby, for example), overflow of GFS quotas, and planned and unplanned hardware maintenance. 
As we have gained more experience with these problems, we have addressed them by changing various protocols. 
For example, we added checksumming to our RPC mechanism. 
We also handled some problems by removing assumptions made by one part of the system about another part. 
For example, we stopped assuming a given Chubby operation could return only one of a fixed set of errors.
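The paper does not describe the checksum format used by the RPC layer; a minimal sketch of end-to-end payload checksumming, assuming a CRC32 prefix on each message (the frame layout and function names here are hypothetical), might look like this:

```python
import struct
import zlib

def frame_message(payload: bytes) -> bytes:
    """Prepend a big-endian CRC32 of the payload so the receiver can
    detect corruption introduced in memory or on the network."""
    return struct.pack(">I", zlib.crc32(payload)) + payload

def unframe_message(frame: bytes) -> bytes:
    """Verify the checksum; raise if the payload was corrupted in transit."""
    (expected,) = struct.unpack(">I", frame[:4])
    payload = frame[4:]
    if zlib.crc32(payload) != expected:
        raise ValueError("RPC payload checksum mismatch")
    return payload
```

The point of checking at the RPC layer, rather than trusting lower layers, is that TCP checksums are weak and memory corruption can occur after the network checks have passed.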
Another lesson we learned is that it is important to delay adding new features until it is clear how the new features will be used. 
For example, we initially planned to support general-purpose transactions in our API. 
Because we did not have an immediate use for them, however, we did not implement them. 
Now that we have many real applications running on Bigtable, we have been able to examine their actual needs, and have discovered that most applications require only single-row transactions. 
Where people have requested distributed transactions, the most important use is for maintaining secondary indices, and we plan to add a specialized mechanism to satisfy this need. 
The new mechanism will be less general than distributed transactions, but will be more efficient (especially for updates that span hundreds of rows or more) and will also interact better with our scheme for optimistic cross-data-center replication.
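As a sketch of what a single-row transaction provides, the toy table below serializes atomic read-modify-write sequences on one row. The class and the per-row locking scheme are illustrative assumptions, not Bigtable's actual implementation:

```python
import threading
from collections import defaultdict

class Table:
    """Toy in-memory table supporting atomic single-row transactions.

    A per-row lock stands in for whatever serialization the real
    system performs; mutations to different rows proceed concurrently,
    but all mutations to one row are atomic with respect to each other.
    """

    def __init__(self):
        self._rows = defaultdict(dict)            # row key -> {column: value}
        self._locks = defaultdict(threading.Lock) # row key -> lock

    def read_modify_write(self, row_key, fn):
        """Apply fn atomically to one row's column map; return a snapshot."""
        with self._locks[row_key]:
            fn(self._rows[row_key])
            return dict(self._rows[row_key])
```

A counter increment, for example, reads and writes the same row under one lock acquisition, so no other writer can interleave between the read and the write.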

A practical lesson that we learned from supporting Bigtable is the importance of proper system-level monitoring 
(i.e., monitoring both Bigtable itself, as well as the client processes using Bigtable). 
For example, we extended our RPC system so that for a sample of the RPCs, it keeps a detailed trace of the important actions done on behalf of that RPC. 
This feature has allowed us to detect and fix many problems such as lock contention on tablet data structures, slow writes to GFS while committing Bigtable mutations, and stuck accesses to the METADATA table when METADATA tablets are unavailable. 
Another example of useful monitoring is that every Bigtable cluster is registered in Chubby. 
This allows us to track down all clusters, discover how big they are, see which versions of our software they are running, how much traffic they are receiving, and whether or not there are any problems such as unexpectedly large latencies.
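A hypothetical sketch of such sampled per-RPC tracing (the sample rate, the `record` callback, and the handler signature are assumptions, not details from the paper):

```python
import random
import time

def traced_rpc(handler, request, sample_rate=0.01):
    """Run an RPC handler; for a sampled fraction of calls, keep a
    detailed trace of the important actions done on behalf of the RPC."""
    trace = [] if random.random() < sample_rate else None

    def record(event):
        # Timestamped trace entries make slow steps (e.g. a slow GFS
        # write during a commit) visible after the fact.
        if trace is not None:
            trace.append((time.monotonic(), event))

    record("rpc start")
    response = handler(request, record)  # handler logs its own actions
    record("rpc end")
    return response, trace
```

Sampling keeps the overhead low enough to leave tracing on in production, which is what makes rare problems like lock contention or stuck METADATA accesses catchable.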
The most important lesson we learned is the value of simple designs. 
Given both the size of our system (about 100,000 lines of non-test code), as well as the fact that code evolves over time in unexpected ways, we
have found that code and design clarity are of immense help in code maintenance and debugging. 
One example of this is our tablet-server membership protocol. 
Our first protocol was simple: 
the master periodically issued leases to tablet servers, and tablet servers killed themselves if their lease expired. 
Unfortunately, this protocol reduced availability significantly in the presence of network problems, and was also sensitive to master recovery time. 
We redesigned the protocol several times until we had a protocol that performed well. 
However,the resulting protocol was too complex and depended on the behavior of Chubby features that were seldom exercised by other applications. 
We discovered that we were spending an inordinate amount of time debugging obscure corner cases, not only in Bigtable code, but also in Chubby code. 
Eventually, we scrapped this protocol and moved to a newer simpler protocol that depends solely on widely-used Chubby features.
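That first, simple protocol can be sketched as follows. The lease duration and class names are illustrative, and the real system kills the server process outright rather than merely marking it dead:

```python
import time

LEASE_DURATION = 10.0  # seconds; an illustrative value, not from the paper

class TabletServer:
    """Sketch of the first, simple membership protocol: the master
    periodically grants a lease, and the server stops serving once
    its lease has expired without renewal."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._lease_expiry = 0.0

    def grant_lease(self):
        """Called on behalf of the master to renew this server's lease."""
        self._lease_expiry = self._clock() + LEASE_DURATION

    def is_alive(self):
        """Whether the server may still serve tablets."""
        return self._clock() < self._lease_expiry
```

The fragility the text describes falls out of this structure: any network problem that delays renewals, or a slow master recovery, causes healthy servers to expire their leases and drop out.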