Bigtable: A Distributed Storage System for Structured Data, Part 1: Abstract and Introduction

Source: Internet; Editor: 程序博客网; Time: 2024/05/22 16:41

Abstract
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.
Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance.
These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving).
Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products.
In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.

1 Introduction
Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable.
Bigtable is designed to reliably scale to petabytes of data and thousands of machines.
Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability.
Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth.
These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users.
The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, and store up to several hundred terabytes of data.
In many ways, Bigtable resembles a database: it shares many implementation strategies with databases.
Parallel databases and main-memory databases have achieved scalability and high performance, but Bigtable provides a different interface than such systems.
Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage.
Data is indexed using row and column names that can be arbitrary strings.
Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings.
Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
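
The data model described above can be illustrated with a minimal sketch. This is not Bigtable's client API: `ToyTable`, its `put` and `get` methods, the row name `user#42`, and the column name `profile` are all hypothetical names chosen for illustration. The sketch only shows the two properties stated in the text: cells are addressed by arbitrary string row and column names, and the stored value is an uninterpreted byte string that the client serializes and deserializes itself (JSON is assumed here purely as an example encoding).

```python
import json


class ToyTable:
    """Hypothetical in-memory stand-in; NOT Bigtable's client API."""

    def __init__(self):
        # Maps (row name, column name) -> uninterpreted bytes.
        self._cells = {}

    def put(self, row: str, column: str, value: bytes) -> None:
        # The table never looks inside the value; it only requires bytes.
        if not isinstance(value, bytes):
            raise TypeError("values are stored as uninterpreted bytes")
        self._cells[(row, column)] = value

    def get(self, row: str, column: str) -> bytes:
        return self._cells[(row, column)]


table = ToyTable()

# The client, not the table, decides how structured data is encoded.
profile = {"name": "example user", "visits": 3}
table.put("user#42", "profile", json.dumps(profile).encode("utf-8"))

raw = table.get("user#42", "profile")      # opaque bytes as far as the table knows
decoded = json.loads(raw.decode("utf-8"))  # interpretation happens client-side
```

Keeping the encoding on the client side is what lets very different applications share one storage system: the table only has to store and return byte strings under string names.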

Section 2 describes the data model in more detail, and Section 3 provides an overview of the client API.
Section 4 briefly describes the underlying Google infrastructure on which Bigtable depends.
Section 5 describes the fundamentals of the Bigtable implementation, and Section 6 describes some of the refinements that we made to improve Bigtable’s performance.
Section 7 provides measurements of Bigtable’s performance.
We describe several examples of how Bigtable is used at Google in Section 8, and discuss some lessons we learned in designing and supporting Bigtable in Section 9.
Finally, Section 10 describes related work, and Section 11 presents our conclusions.
