Bigtable: A Distributed Storage System for Structured Data

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach
Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
{fay,jeff,sanjay,wilsonh,kerr,m3b,tushar,fikes,gruber}@google.com
Google, Inc.
Abstract
Bigtable is a distributed storage system for managing
structured data that is designed to scale to a very large
size: petabytes of data across thousands of commodity
servers. Many projects at Google store data in Bigtable,
including web indexing, Google Earth, and Google Finance.
These applications place very different demands
on Bigtable, both in terms of data size (from URLs to
web pages to satellite imagery) and latency requirements
(from backend bulk processing to real-time data serving).
Despite these varied demands, Bigtable has successfully
provided a flexible, high-performance solution for all of
these Google products. In this paper we describe the simple
data model provided by Bigtable, which gives clients
dynamic control over data layout and format, and we describe
the design and implementation of Bigtable.
1 Introduction
Over the last two and a half years we have designed,
implemented, and deployed a distributed storage system
for managing structured data at Google called Bigtable.
Bigtable is designed to reliably scale to petabytes of
data and thousands of machines. Bigtable has achieved
several goals: wide applicability, scalability, high performance,
and high availability. Bigtable is used by
more than sixty Google products and projects, including
Google Analytics, Google Finance, Orkut, Personalized
Search, Writely, and Google Earth. These products
use Bigtable for a variety of demanding workloads,
which range from throughput-oriented batch-processing
jobs to latency-sensitive serving of data to end users.
The Bigtable clusters used by these products span a wide
range of configurations, from a handful to thousands of
servers, and store up to several hundred terabytes of data.
In many ways, Bigtable resembles a database: it shares
many implementation strategies with databases. Parallel
databases [14] and main-memory databases [13] have
achieved scalability and high performance, but Bigtable
provides a different interface than such systems. Bigtable
does not support a full relational data model; instead, it
provides clients with a simple data model that supports
dynamic control over data layout and format, and allows
clients to reason about the locality properties of the
data represented in the underlying storage. Data is indexed
using row and column names that can be arbitrary
strings. Bigtable also treats data as uninterpreted strings,
although clients often serialize various forms of structured
and semi-structured data into these strings. Clients
can control the locality of their data through careful
choices in their schemas. Finally, Bigtable schema parameters
let clients dynamically control whether to serve
data out of memory or from disk.
Section 2 describes the data model in more detail, and
Section 3 provides an overview of the client API. Section
4 briefly describes the underlying Google infrastructure
on which Bigtable depends. Section 5 describes the
fundamentals of the Bigtable implementation, and Section
6 describes some of the refinements that we made
to improve Bigtable's performance. Section 7 provides
measurements of Bigtable's performance. We describe
several examples of how Bigtable is used at Google
in Section 8, and discuss some lessons we learned in
designing and supporting Bigtable in Section 9. Finally,
Section 10 describes related work, and Section 11
presents our conclusions.
2 Data Model
A Bigtable is a sparse, distributed, persistent multidimensional
sorted map. The map is indexed by a row
key, column key, and a timestamp; each value in the map
is an uninterpreted array of bytes.
(row:string, column:string, time:int64) → string
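The map signature above can be illustrated with an in-memory sketch. This is not how Bigtable stores data; `CellKey`, `CellKeyLess`, `SparseTable`, and `ReadLatest` are hypothetical names, and a `std::map` stands in for the distributed, persistent map. The comparator orders timestamps in decreasing order so that the newest version of a cell sorts first.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <optional>
#include <string>
#include <tuple>

// Hypothetical key type mirroring (row:string, column:string, time:int64).
struct CellKey {
  std::string row;
  std::string column;
  std::int64_t timestamp;
};

// Orders by row, then column, then timestamp descending, so the newest
// version of a cell comes first within its (row, column) group.
struct CellKeyLess {
  bool operator()(const CellKey& a, const CellKey& b) const {
    return std::tie(a.row, a.column, b.timestamp) <
           std::tie(b.row, b.column, a.timestamp);
  }
};

// Only cells that were written occupy space, which makes the map sparse.
using SparseTable = std::map<CellKey, std::string, CellKeyLess>;

// Returns the most recent value stored for (row, column), if any.
std::optional<std::string> ReadLatest(const SparseTable& table,
                                      const std::string& row,
                                      const std::string& column) {
  // The largest possible timestamp sorts before every stored version.
  const CellKey probe{row, column, std::numeric_limits<std::int64_t>::max()};
  auto it = table.lower_bound(probe);
  if (it == table.end() || it->first.row != row ||
      it->first.column != column) {
    return std::nullopt;
  }
  return it->second;
}
```

Because the keys are sorted, all versions of a cell, all columns of a row, and all rows in a range are contiguous in the map.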
To appear in OSDI 2006 1
[Figure: row "com.cnn.www" with column "contents:" holding "<html>..." at timestamps t3, t5, and t6, column "anchor:cnnsi.com" holding "CNN" at t9, and column "anchor:my.look.ca" holding "CNN.com" at t8.]
Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains
the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN's home page
is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com
and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.
We settled on this data model after examining a variety
of potential uses of a Bigtable-like system. As one concrete
example that drove some of our design decisions,
suppose we want to keep a copy of a large collection of
web pages and related information that could be used by
many different projects; let us call this particular table
the Webtable. In Webtable, we would use URLs as row
keys, various aspects of web pages as column names, and
store the contents of the web pages in the contents: column
under the timestamps when they were fetched, as
illustrated in Figure 1.
Rows
The row keys in a table are arbitrary strings (currently up
to 64KB in size, although 10-100 bytes is a typical size
for most of our users). Every read or write of data under
a single row key is atomic (regardless of the number of
different columns being read or written in the row), a
design decision that makes it easier for clients to reason
about the system's behavior in the presence of concurrent
updates to the same row.
Bigtable maintains data in lexicographic order by row
key. The row range for a table is dynamically partitioned.
Each row range is called a tablet, which is the unit of distribution
and load balancing. As a result, reads of short
row ranges are efficient and typically require communication
with only a small number of machines. Clients
can exploit this property by selecting their row keys so
that they get good locality for their data accesses. For
example, in Webtable, pages in the same domain are
grouped together into contiguous rows by reversing the
hostname components of the URLs. For example, we
store data for maps.google.com/index.html under the
key com.google.maps/index.html. Storing pages from
the same domain near each other makes some host and
domain analyses more efficient.
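The hostname reversal described above can be sketched as follows. `ReverseHostKey` is a hypothetical helper, not part of any Bigtable API, and it assumes the input has no scheme or port, as in the paper's example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Reverses the dot-separated hostname components and keeps the path as is,
// e.g. "maps.google.com/index.html" becomes "com.google.maps/index.html".
std::string ReverseHostKey(const std::string& url) {
  const std::size_t slash = url.find('/');
  const std::string host = url.substr(0, slash);
  const std::string path =
      (slash == std::string::npos) ? std::string() : url.substr(slash);

  // Split the hostname on '.'.
  std::vector<std::string> parts;
  std::size_t start = 0;
  while (true) {
    const std::size_t dot = host.find('.', start);
    parts.push_back(host.substr(start, dot - start));
    if (dot == std::string::npos) break;
    start = dot + 1;
  }

  // Join the components in reverse order.
  std::string key;
  for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
    if (!key.empty()) key += '.';
    key += *it;
  }
  return key + path;
}
```

With keys built this way, every page under google.com shares the prefix "com.google.", so those rows are adjacent in lexicographic order.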
Column Families
Column keys are grouped into sets called column families,
which form the basic unit of access control. All data
stored in a column family is usually of the same type (we
compress data in the same column family together). A
column family must be created before data can be stored
under any column key in that family; after a family has
been created, any column key within the family can be
used. It is our intent that the number of distinct column
families in a table be small (in the hundreds at most), and
that families rarely change during operation. In contrast,
a table may have an unbounded number of columns.
A column key is named using the following syntax:
family:qualifier. Column family names must be printable,
but qualifiers may be arbitrary strings. An example
column family for the Webtable is language, which
stores the language in which a web page was written. We
use only one column key in the language family, and it
stores each web page's language ID. Another useful column
family for this table is anchor; each column key in
this family represents a single anchor, as shown in Figure
1. The qualifier is the name of the referring site; the
cell contents is the link text.
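As a small illustration of the family:qualifier syntax, the sketch below splits a column key into its two parts. `SplitColumnKey` is a hypothetical helper; splitting at the first ':' is an assumption, since the paper does not say how a ':' inside a qualifier is handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

// Splits "family:qualifier" at the first ':'.
// Returns nullopt when there is no ':', when the family is empty, or when
// the family contains a non-printable character. The qualifier may be any
// string, including the empty string (as in "contents:").
std::optional<std::pair<std::string, std::string>> SplitColumnKey(
    const std::string& column_key) {
  const std::size_t colon = column_key.find(':');
  if (colon == std::string::npos || colon == 0) return std::nullopt;
  const std::string family = column_key.substr(0, colon);
  for (unsigned char c : family) {
    if (!std::isprint(c)) return std::nullopt;
  }
  return std::make_pair(family, column_key.substr(colon + 1));
}
```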
Access control and both disk and memory accounting
are performed at the column-family level. In our
Webtable example, these controls allow us to manage
several different types of applications: some that add new
base data, some that read the base data and create derived
column families, and some that are only allowed to view
existing data (and possibly not even to view all of the
existing families for privacy reasons).
Timestamps
Each cell in a Bigtable can contain multiple versions of
the same data; these versions are indexed by timestamp.
Bigtable timestamps are 64-bit integers. They can be assigned
by Bigtable, in which case they represent real
time in microseconds, or be explicitly assigned by client
applications. Applications that need to avoid collisions
must generate unique timestamps themselves. Different
versions of a cell are stored in decreasing timestamp order,
so that the most recent versions can be read first.
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");
// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);
Figure 2: Writing to Bigtable.
To make the management of versioned data less onerous,
we support two per-column-family settings that tell
Bigtable to garbage-collect cell versions automatically.
The client can specify either that only the last n versions
of a cell be kept, or that only new-enough versions be
kept (e.g., only keep values that were written in the last
seven days).
In our Webtable example, we set the timestamps of
the crawled pages stored in the contents: column to
the times at which these page versions were actually
crawled. The garbage-collection mechanism described
above lets us keep only the most recent three versions of
every page.
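The two garbage-collection settings can be sketched for the versions of a single cell. The names `Versions`, `KeepLastN`, and `KeepNewerThan` are hypothetical, and the sketch applies the policy eagerly to an in-memory map; the paper does not describe when or how Bigtable itself discards old versions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Versions of one cell keyed by timestamp, newest first (decreasing order).
using Versions =
    std::map<std::int64_t, std::string, std::greater<std::int64_t>>;

// Policy 1: keep only the newest n versions.
void KeepLastN(Versions& versions, std::size_t n) {
  auto it = versions.begin();
  for (std::size_t kept = 0; it != versions.end() && kept < n; ++kept) ++it;
  versions.erase(it, versions.end());
}

// Policy 2: keep only versions whose timestamp is >= cutoff_micros.
void KeepNewerThan(Versions& versions, std::int64_t cutoff_micros) {
  // With descending order, upper_bound returns the first entry whose
  // timestamp is strictly smaller than the cutoff.
  versions.erase(versions.upper_bound(cutoff_micros), versions.end());
}
```

For the "last seven days" example, the cutoff would be the current time in microseconds minus seven days expressed in microseconds.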
3 API
The Bigtable API provides functions for creating and
deleting tables and column families. It also provides
functions for changing cluster, table, and column family
metadata, such as access control rights.
Client applications can write or delete values in
Bigtable, look up values from individual rows, or iterate
over a subset of the data in a table. Figure 2 shows
C++ code that uses a RowMutation abstraction to perform
a series of updates. (Irrelevant details were elided
to keep the example short.) The call to Apply performs
an atomic mutation to the Webtable: it adds one anchor
to www.cnn.com and deletes a different anchor.
Figure 3 shows C++ code that uses a Scanner abstraction
to iterate over all anchors in a particular row.
Clients can iterate over multiple column families, and
there are several mechanisms for limiting the rows,
columns, and timestamps produced by a scan. For example,
we could restrict the scan above to only produce
anchors whose columns match the regular expression
anchor:*.cnn.com, or to only produce anchors whose
timestamps fall within ten days of the current time.
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}
Figure 3: Reading from Bigtable.
Bigtable supports several other features that allow the
user to manipulate data in more complex ways. First,
Bigtable supports single-row transactions, which can be
used to perform atomic read-modify-write sequences on
data stored under a single row key. Bigtable does not currently
support general transactions across row keys, although
it provides an interface for batching writes across
row keys at the clients. Second, Bigtable allows cells
to be used as integer counters. Finally, Bigtable supports
the execution of client-supplied scripts in the address
spaces of the servers. The scripts are written in a
language developed at Google for processing data called
Sawzall [28]. At the moment, our Sawzall-based API
does not allow client scripts to write back into Bigtable,
but it does allow various forms of data transformation,
filtering based on arbitrary expressions, and summarization
via a variety of operators.
Bigtable can be used with MapReduce [12], a framework
for running large-scale parallel computations developed
at Google. We have written a set of wrappers
that allow a Bigtable to be used both as an input source
and as an output target for MapReduce jobs.
4 Building Blocks
Bigtable is built on several other pieces of Google infrastructure.
Bigtable uses the distributed Google File
System (GFS) [17] to store log and data files. A Bigtable
cluster typically operates in a shared pool of machines
that run a wide variety of other distributed applications,
and Bigtable processes often share the same machines
with processes from other applications. Bigtable depends
on a cluster management system for scheduling
jobs, managing resources on shared machines, dealing
with machine failures, and monitoring machine status.
The Google SSTable file format is used internally to
store Bigtable data. An SSTable provides a persistent,
ordered immutable map from keys to values, where both
keys and values are arbitrary byte strings. Operations are
provided to look up the value associated with a specified
key, and to iterate over all key/value pairs in a specified
key range. Internally, each SSTable contains a sequence
of blocks (typically each block is 64KB in size, but this
is configurable). A block index (stored at the end of the
SSTable) is used to locate blocks; the index is loaded
into memory when the SSTable is opened. A lookup
can be performed with a single disk seek: we first find
the appropriate block by performing a binary search in
the in-memory index, and then read the appropriate
block from disk. Optionally, an SSTable can be completely
mapped into memory, which allows us to perform
lookups and scans without touching disk.
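The binary search over the in-memory block index might look like the sketch below. The paper does not give the index layout, so `BlockIndexEntry` (one entry per block, holding the block's last key, offset, and size) and `FindBlock` are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical index entry: one per block, sorted by last_key.
struct BlockIndexEntry {
  std::string last_key;  // largest key stored in the block (assumed layout)
  std::uint64_t offset;  // byte offset of the block within the file
  std::uint64_t size;    // block length in bytes
};

// Returns the position of the only block that could contain `key`, or
// nullopt if `key` sorts after the last key in the SSTable.
std::optional<std::size_t> FindBlock(const std::vector<BlockIndexEntry>& index,
                                     const std::string& key) {
  // First block whose last key is >= key.
  auto it = std::lower_bound(
      index.begin(), index.end(), key,
      [](const BlockIndexEntry& entry, const std::string& k) {
        return entry.last_key < k;
      });
  if (it == index.end()) return std::nullopt;
  return static_cast<std::size_t>(it - index.begin());
}
```

Once the block is identified, a single read of `size` bytes at `offset` fetches it, which is where the one disk seek in the text comes from.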
Bigtable relies on a highly-available and persistent
distributed lock service called Chubby [8]. A Chubby
service consists of five active replicas, one of which is
elected to be the master and actively serve requests. The
service is live when a majority of the replicas are running
and can communicate with each other. Chubby uses the
Paxos algorithm [9, 23] to keep its replicas consistent in
the face of failure. Chubby provides a namespace that
consists of directories and small files. Each directory or
file can be used as a lock, and reads and writes to a file
are atomic. The Chubby client library provides consistent
caching of Chubby files. Each Chubby client maintains
a session with a Chubby service. A client's session
expires if it is unable to renew its session lease within the
lease expiration time. When a client's session expires, it
loses any locks and open handles. Chubby clients can
also register callbacks on Chubby files and directories
for notification of changes or session expiration.
Bigtable uses Chubby for a variety of tasks: to ensure
that there is at most one active master at any time; to
store the bootstrap location of Bigtable data (see Section
5.1); to discover tablet servers and finalize tablet
server deaths (see Section 5.2); to store Bigtable schema
information (the column family information for each table);
and to store access control lists. If Chubby becomes
unavailable for an extended period of time, Bigtable becomes
unavailable. We recently measured this effect
in 14 Bigtable clusters spanning 11 Chubby instances.
The average percentage of Bigtable server hours during
which some data stored in Bigtable was not available due
to Chubby unavailability (caused by either Chubby outages
or network issues) was 0.0047%. The percentage
for the single cluster that was most affected by Chubby
unavailability was 0.0326%.
5 Implementation
The Bigtable implementation has three major components:
a library that is linked into every client, one master
server, and many tablet servers. Tablet servers can be
dynamically added (or removed) from a cluster to accommodate
changes in workloads.
The master is responsible for assigning tablets to tablet
servers, detecting the addition and expiration of tablet
servers, balancing tablet-server load, and garbage collection
of files in GFS. In addition, it handles schema
changes such as table and column family creations.
Each tablet server manages a set of tablets (typically
we have somewhere between ten and a thousand tablets per
tablet server). The tablet server handles read and write
requests to the tablets that it has loaded, and also splits
tablets that have grown too large.
As with many single-master distributed storage systems
[17, 21], client data does not move through the master:
clients communicate directly with tablet servers for
reads and writes. Because Bigtable clients do not rely on
the master for tablet location information, most clients
never communicate with the master. As a result, the master
is lightly loaded in practice.
A Bigtable cluster stores a number of tables. Each table
consists of a set of tablets, and each tablet contains
all data associated with a row range. Initially, each table
consists of just one tablet. As a table grows, it is automatically
split into multiple tablets, each approximately
100-200 MB in size by default.
5.1 Tablet Location
We use a three-level hierarchy analogous to that of a B+-
tree [10] to store tablet location information (Figure 4).
[Figure: a Chubby file points to the root tablet (1st METADATA tablet), which points to the other METADATA tablets, which in turn point to the tablets of UserTable1 through UserTableN.]
Figure 4: Tablet location hierarchy.
The first level is a file stored in Chubby that contains
the location of the root tablet. The root tablet contains
the location of all tablets in a special METADATA table.
Each METADATA tablet contains the location of a set of
user tablets. The root tablet is just the first tablet in the
METADATA table, but is treated specially (it is never
split) to ensure that the tablet location hierarchy has no
more than three levels.
The METADATA table stores the location of a tablet
under a row key that is an encoding of the tablet's table
identifier and its end row. Each METADATA row stores
approximately 1KB of data in memory. With a modest
limit of 128 MB METADATA tablets, our three-level location
scheme is sufficient to address 2^34 tablets (or 2^61
bytes in 128 MB tablets).
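The capacity figures follow from simple arithmetic, worked through below as compile-time constants. The constant names are illustrative; the inputs (about 1 KB per METADATA row, 128 MB METADATA tablets, 128 MB user tablets) are the ones stated in the text.

```cpp
#include <cassert>
#include <cstdint>

// 128 MB = 2^27 bytes per METADATA tablet.
constexpr std::uint64_t kMetadataTabletBytes = 128ull << 20;
// About 1 KB = 2^10 bytes per METADATA row (one row per tablet location).
constexpr std::uint64_t kMetadataRowBytes = 1ull << 10;

// Each METADATA tablet can therefore describe 2^27 / 2^10 = 2^17 tablets.
constexpr std::uint64_t kRowsPerMetadataTablet =
    kMetadataTabletBytes / kMetadataRowBytes;

// The root tablet locates 2^17 METADATA tablets, and each of those locates
// 2^17 user tablets, giving 2^17 * 2^17 = 2^34 user tablets.
constexpr std::uint64_t kAddressableTablets =
    kRowsPerMetadataTablet * kRowsPerMetadataTablet;

// At 128 MB (2^27 bytes) per user tablet: 2^34 * 2^27 = 2^61 bytes.
constexpr std::uint64_t kAddressableBytes =
    kAddressableTablets * kMetadataTabletBytes;

static_assert(kRowsPerMetadataTablet == (1ull << 17), "2^17 rows per tablet");
static_assert(kAddressableTablets == (1ull << 34), "2^34 tablets");
static_assert(kAddressableBytes == (1ull << 61), "2^61 bytes");
```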
The client library caches tablet locations. If the client
does not know the location of a tablet, or if it discovers
that cached location information is incorrect, then
it recursively moves up the tablet location hierarchy.
If the client's cache is empty, the location algorithm
requires three network round-trips, including one read
from Chubby. If the client's cache is stale, the location
algorithm could take up to six round-trips, because stale
cache entries are only discovered upon misses (assuming
that METADATA tablets do not move very frequently).
Although tablet locations are stored in memory, so no
GFS accesses are required, we further reduce this cost
in the common case by having the client library prefetch
tablet locations: it reads the metadata for more than one
tablet whenever it reads the METADATA table.
We also store secondary information in the
METADATA table, including a log of all events pertaining
to each tablet (such as when a server begins
serving it). This information is helpful for debugging
and performance analysis.
5.2 Tablet Assignment
Each tablet is assigned to one tablet server at a time. The
master keeps track of the set of live tablet servers, and
the current assignment of tablets to tablet servers, including
which tablets are unassigned. When a tablet is
unassigned, and a tablet server with sufficient room for
the tablet is available, the master assigns the tablet by
sending a tablet load request to the tablet server.
Bigtable uses Chubby to keep track of tablet servers.
When a tablet server starts, it creates, and acquires an
exclusive lock on, a uniquely-named file in a specific
Chubby directory. The master monitors this directory
(the servers directory) to discover tablet servers. A tablet
server stops serving its tablets if it loses its exclusive
lock: e.g., due to a network partition that caused the
server to lose its Chubby session. (Chubby provides an
efficient mechanism that allows a tablet server to check
whether it still holds its lock without incurring network
traffic.) A tablet server will attempt to reacquire an exclusive
lock on its file as long as the file still exists. If the
file no longer exists, then the tablet server will never be
able to serve again, so it kills itself. Whenever a tablet
server terminates (e.g., because the cluster management
system is removing the tablet server's machine from the
cluster), it attempts to release its lock so that the master
will reassign its tablets more quickly.
The master is responsible for detecting when a tablet
server is no longer serving its tablets, and for reassigning
those tablets as soon as possible. To detect when a
tablet server is no longer serving its tablets, the master
periodically asks each tablet server for the status of its
lock. If a tablet server reports that it has lost its lock,
or if the master was unable to reach a server during its
last several attempts, the master attempts to acquire an
exclusive lock on the server's file. If the master is able to
acquire the lock, then Chubby is live and the tablet server
is either dead or having trouble reaching Chubby, so the
master ensures that the tablet server can never serve again
by deleting its server file. Once a server's file has been
deleted, the master can move all the tablets that were previously
assigned to that server into the set of unassigned
tablets. To ensure that a Bigtable cluster is not vulnerable
to networking issues between the master and Chubby,
the master kills itself if its Chubby session expires. However,
as described above, master failures do not change
the assignment of tablets to tablet servers.
When a master is started by the cluster management
system, it needs to discover the current tablet assignments
before it can change them. The master executes
the following steps at startup. (1) The master grabs
a unique master lock in Chubby, which prevents concurrent
master instantiations. (2) The master scans the
servers directory in Chubby to find the live servers.
(3) The master communicates with every live tablet
server to discover what tablets are already assigned to
each server. (4) The master scans the METADATA table
to learn the set of tablets. Whenever this scan encounters
a tablet that is not already assigned, the master adds the
tablet to the set of unassigned tablets, which makes the
tablet eligible for tablet assignment.
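Steps 3 and 4 amount to a set difference, sketched below. `FindUnassigned` and its parameter types are hypothetical: tablets are identified by plain strings, the per-server lists stand in for the replies gathered in step 3, and the vector stands in for the METADATA scan of step 4.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Returns every tablet listed in METADATA that no live server reported.
std::set<std::string> FindUnassigned(
    const std::vector<std::string>& metadata_tablets,
    const std::map<std::string, std::vector<std::string>>& tablets_by_server) {
  // Step 3: collect the tablets that live servers say they already hold.
  std::set<std::string> assigned;
  for (const auto& entry : tablets_by_server) {
    assigned.insert(entry.second.begin(), entry.second.end());
  }
  // Step 4: anything in METADATA that is not assigned becomes eligible
  // for assignment.
  std::set<std::string> unassigned;
  for (const std::string& tablet : metadata_tablets) {
    if (assigned.count(tablet) == 0) unassigned.insert(tablet);
  }
  return unassigned;
}
```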
One complication is that the scan of the METADATA
table cannot happen until the METADATA tablets have
been assigned. Therefore, before starting this scan (step
4), the master adds the root tablet to the set of unassigned
tablets if an assignment for the root tablet was not discovered
during step 3. This addition ensures that the root
tablet will be assigned. Because the root tablet contains
the names of all METADATA tablets, the master knows
about all of them after it has scanned the root tablet.
The set of existing tablets only changes when a table
is created or deleted, two existing tablets are merged
to form one larger tablet, or an existing tablet is split
into two smaller tablets. The master is able to keep
track of these changes because it initiates all but the last.
Tablet splits are treated specially since they are initiated
by a tablet server. The tablet server commits the
split by recording information for the new tablet in the
METADATA table. When the split has committed, it notifies
the master. In case the split notification is lost (either
because the tablet server or the master died), the master
detects the new tablet when it asks a tablet server to load
the tablet that has now split. The tablet server will notify
the master of the split, because the tablet entry it finds in
the METADATA table will specify only a portion of the
tablet that the master asked it to load.
5.3 Tablet Serving
The persistent state of a tablet is stored in GFS, as illustrated
in Figure 5. Updates are committed to a commit
log that stores redo records. Of these updates, the recently
committed ones are stored in memory in a sorted
buffer called a memtable; the older updates are stored in a
sequence of SSTables. To recover a tablet, a tablet server
reads its metadata from the METADATA table. This metadata
contains the list of SSTables that comprise a tablet
and a set of redo points, which are pointers into any
commit logs that may contain data for the tablet. The
server reads the indices of the SSTables into memory and
reconstructs the memtable by applying all of the updates
that have committed since the redo points.
[Figure: write operations go to the tablet log (in GFS) and to the memtable (in memory); read operations are served from the memtable and the SSTable files (in GFS).]
Figure 5: Tablet Representation
When a write operation arrives at a tablet server, the
server checks that it is well-formed, and that the sender
is authorized to perform the mutation. Authorization is
performed by reading the list of permitted writers from a
Chubby file (which is almost always a hit in the Chubby
client cache). A valid mutation is written to the commit
log. Group commit is used to improve the throughput of
lots of small mutations [13, 16]. After the write has been
committed, its contents are inserted into the memtable.
When a read operation arrives at a tablet server, it is
similarly checked for well-formedness and proper authorization.
A valid read operation is executed on a merged
view of the sequence of SSTables and the memtable.
Since the SSTables and the memtable are lexicographically
sorted data structures, the merged view can be
formed efficiently.
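The write and read paths can be sketched with in-memory stand-ins. Everything here is a simplification under stated assumptions: the hypothetical `Tablet` struct keeps its commit log in a vector rather than in GFS, ignores columns, timestamps, authorization, and group commit, and answers point lookups by checking the newest source first instead of building a merged iterator.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

using SortedRun = std::map<std::string, std::string>;

struct Tablet {
  // Redo records in commit order (stands in for the commit log in GFS).
  std::vector<std::pair<std::string, std::string>> commit_log;
  // Recently committed updates, kept sorted in memory.
  SortedRun memtable;
  // Older updates, oldest run first; each run is immutable once added.
  std::vector<SortedRun> sstables;

  // A write is committed to the log first, then inserted into the memtable.
  void Write(const std::string& key, const std::string& value) {
    commit_log.emplace_back(key, value);
    memtable[key] = value;
  }

  // A read sees the memtable and all SSTables as one view in which newer
  // data shadows older data.
  std::optional<std::string> Read(const std::string& key) const {
    if (auto it = memtable.find(key); it != memtable.end()) return it->second;
    for (auto run = sstables.rbegin(); run != sstables.rend(); ++run) {
      if (auto it = run->find(key); it != run->end()) return it->second;
    }
    return std::nullopt;
  }
};
```

A range scan would instead merge the sorted sources in one pass, which is cheap precisely because each source is already in lexicographic order.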
Incoming read and write operations can continue
while tablets are split and merged.
5.4 Compactions
As write operations execute, the size of the memtable increases.
When the memtable size reaches a threshold, the
memtable is frozen, a new memtable is created, and the
frozen memtable is converted to an SSTable and written
to GFS. This minor compaction process has two goals:
it shrinks the memory usage of the tablet server, and it
reduces the amount of data that has to be read from the
commit log during recovery if this server dies. Incoming
read and write operations can continue while compactions
occur.
Every minor compaction creates a new SSTable. If this
behavior continued unchecked, read operations might
need to merge updates from an arbitrary number of
SSTables. Instead, we bound the number of such files
by periodically executing a merging compaction in the
background. A merging compaction reads the contents
of a few SSTables and the memtable, and writes out a
new SSTable. The input SSTables and memtable can be
discarded as soon as the compaction has finished.
A merging compaction that rewrites all SSTables
into exactly one SSTable is called a major compaction.
SSTables produced by non-major compactions can contain
special deletion entries that suppress deleted data in
older SSTables that are still live. A major compaction,
on the other hand, produces an SSTable that contains
no deletion information or deleted data. Bigtable cycles
through all of its tablets and regularly applies major
compactions to them. These major compactions allow
Bigtable to reclaim resources used by deleted data, and
also allow it to ensure that deleted data disappears from
the system in a timely fashion, which is important for
services that store sensitive data.
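The difference between a merging compaction and a major compaction can be sketched with in-memory maps. `Run` and `Compact` are hypothetical names; an empty `std::optional` stands in for a deletion entry, and versions and column structure are ignored.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <optional>
#include <string>
#include <vector>

// One sorted run (an SSTable or the memtable). An empty optional is a
// deletion entry that suppresses the key in older runs.
using Run = std::map<std::string, std::optional<std::string>>;

// Merges the given runs, ordered oldest to newest, into a single run.
// If `major` is true, the inputs are assumed to be ALL runs of the tablet,
// so deletion entries have nothing left to suppress and are dropped.
Run Compact(const std::vector<Run>& runs_oldest_first, bool major) {
  Run out;
  for (const Run& run : runs_oldest_first) {
    for (const auto& entry : run) {
      out[entry.first] = entry.second;  // newer entries overwrite older ones
    }
  }
  if (major) {
    for (auto it = out.begin(); it != out.end();) {
      it = it->second.has_value() ? std::next(it) : out.erase(it);
    }
  }
  return out;
}
```

A non-major compaction must keep the deletion entries because an older SSTable outside the merge may still hold the deleted data.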
6 Refinements
The implementation described in the previous section
required a number of refinements to achieve the high
performance, availability, and reliability required by our
users. This section describes portions of the implementation
in more detail in order to highlight these refinements.
Locality groups
Clients can group multiple column families together into
a locality group. A separate SSTable is generated for
each locality group in each tablet. Segregating column
families that are not typically accessed together into separate
locality groups enables more efficient reads. For
example, page metadata in Webtable (such as language
and checksums) can be in one locality group, and the
contents of the page can be in a different group: an application
that wants to read the metadata does not need
to read through all of the page contents.
In addition, some useful tuning parameters can be
specified on a per-locality group basis. For example, a locality
group can be declared to be in-memory. SSTables
for in-memory locality groups are loaded lazily into the
memory of the tablet server. Once loaded, column families
that belong to such locality groups can be read
without accessing the disk. This feature is useful for
small pieces of data that are accessed frequently: we
use it internally for the location column family in the
METADATA table.
Compression
Clients can control whether or not the SSTables for a
locality group are compressed, and if so, which compression
format is used. The user-specified compression
format is applied to each SSTable block (whose size
is controllable via a locality group specific tuning parameter).
Although we lose some space by compressing
each block separately, we benefit in that small portions
of an SSTable can be read without decompressing
the entire file. Many clients use a two-pass custom
compression scheme. The first pass uses Bentley and
McIlroy's scheme [6], which compresses long common
strings across a large window. The second pass uses a
fast compression algorithm that looks for repetitions in
a small 16 KB window of the data. Both compression
passes are very fast: they encode at 100-200 MB/s, and
decode at 400-1000 MB/s on modern machines.
Even though we emphasized speed instead of space reduction
when choosing our compression algorithms, this
two-pass compression scheme does surprisingly well.
For example, in Webtable, we use this compression
scheme to store Web page contents. In one experiment,
we stored a large number of documents in a compressed
locality group. For the purposes of the experiment, we
limited ourselves to one version of each document instead
of storing all versions available to us. The scheme
achieved a 10-to-1 reduction in space. This is much
better than typical Gzip reductions of 3-to-1 or 4-to-1
on HTML pages because of the way Webtable rows are
laid out: all pages from a single host are stored close
to each other. This allows the Bentley-McIlroy algorithm
to identify large amounts of shared boilerplate in
pages from the same host. Many applications, not just
Webtable, choose their row names so that similar data
ends up clustered, and therefore achieve very good compression
ratios. Compression ratios get even better when
we store multiple versions of the same value in Bigtable.
Caching for read performance
To improve read performance, tablet servers use two levels
of caching. The Scan Cache is a higher-level cache
that caches the key-value pairs returned by the SSTable
interface to the tablet server code. The Block Cache is a
lower-level cache that caches SSTable blocks that were
read from GFS. The Scan Cache is most useful for applications
that tend to read the same data repeatedly. The
Block Cache is useful for applications that tend to read
data that is close to the data they recently read (e.g., sequential
reads, or random reads of different columns in
the same locality group within a hot row).
Bloom filters
As described in Section 5.3, a read operation has to read
from all SSTables that make up the state of a tablet.
If these SSTables are not in memory, we may end up
doing many disk accesses. We reduce the number of
accesses by allowing clients to specify that Bloom filters
[7] should be created for SSTables in a particular
locality group. A Bloom filter allows us to ask
whether an SSTable might contain any data for a specified
row/column pair. For certain applications, a small
amount of tablet server memory used for storing Bloom
filters drastically reduces the number of disk seeks required
for read operations. Our use of Bloom filters
also implies that most lookups for non-existent rows or
columns do not need to touch disk.
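A Bloom filter's "might contain" test can be sketched as below. This is a generic textbook construction, not Bigtable's implementation: the `BloomFilter` class, the use of `std::hash`, the salted second hash, and the sizing parameters are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

class BloomFilter {
 public:
  BloomFilter(std::size_t num_bits, int num_hashes)
      : bits_(num_bits, false), num_hashes_(num_hashes) {
    assert(num_bits > 0 && num_hashes > 0);
  }

  // Records a key, e.g. a row/column pair encoded as one string.
  void Add(const std::string& key) {
    for (int i = 0; i < num_hashes_; ++i) bits_[Index(key, i)] = true;
  }

  // False means the key was definitely never added, so the SSTable can be
  // skipped. True means the key may be present and the SSTable must be read.
  bool MightContain(const std::string& key) const {
    for (int i = 0; i < num_hashes_; ++i) {
      if (!bits_[Index(key, i)]) return false;
    }
    return true;
  }

 private:
  // Double hashing: the i-th probe is h1 + i * h2. Deriving h2 from a
  // salted copy of the key is an illustrative shortcut.
  std::size_t Index(const std::string& key, int i) const {
    const std::size_t h1 = std::hash<std::string>{}(key);
    const std::size_t h2 = std::hash<std::string>{}(key + "#salt") | 1;
    return (h1 + static_cast<std::size_t>(i) * h2) % bits_.size();
  }

  std::vector<bool> bits_;
  int num_hashes_;
};
```

The filter never reports a false negative, which is why a negative answer lets a lookup for a non-existent row or column avoid the disk entirely; false positives only cost an unnecessary read.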
Commit-log implementation
If we kept the commit log for each tablet in a separate
log file, a very large number of files would be written
concurrently in GFS. Depending on the underlying file
system implementation on each GFS server, these writes
could cause a large number of disk seeks to write to the
different physical log files. In addition, having separate
log files per tablet also reduces the effectiveness of the
group commit optimization, since groups would tend to
be smaller. To fix these issues, we append mutations
to a single commit log per tablet server, co-mingling
mutations for different tablets in the same physical log
file [18, 20].
Using one log provides significant performance benefits
during normal operation, but it complicates recovery.
When a tablet server dies, the tablets that it served
will be moved to a large number of other tablet servers:
each server typically loads a small number of the original
server's tablets. To recover the state for a tablet,
the new tablet server needs to reapply the mutations for
that tablet from the commit log written by the original
tablet server. However, the mutations for these tablets
were co-mingled in the same physical log file. One approach
would be for each new tablet server to read this
full commit log file and apply just the entries needed for
the tablets it needs to recover. However, under such a
scheme, if 100 machines were each assigned a single
tablet from a failed tablet server, then the log file would
be read 100 times (once by each server).
We avoid duplicating log reads by first sorting
the commit log entries in order of the keys
⟨table, row name, log sequence number⟩. In the
sorted output, all mutations for a particular tablet are
contiguous and can therefore be read efficiently with one
disk seek followed by a sequential read. To parallelize
the sorting, we partition the log file into 64 MB segments,
and sort each segment in parallel on different
tablet servers. This sorting process is coordinated by the
master and is initiated when a tablet server indicates that
it needs to recover mutations from some commit log file.
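The sort-then-slice idea can be sketched in a few lines. This is an illustrative model only: `LogEntry`, the function names, the entry-count segment size (in place of 64 MB), and the half-open row range used to describe a tablet are assumptions, and the per-segment sorts run sequentially here rather than in parallel on different tablet servers.

```python
from bisect import bisect_left
from collections import namedtuple

# Hypothetical record layout for a commit log entry.
LogEntry = namedtuple("LogEntry", ["table", "row", "seq", "mutation"])


def split_into_segments(log, segment_size):
    """Partition the log into fixed-size segments (by entry count here)."""
    return [log[i:i + segment_size] for i in range(0, len(log), segment_size)]


def sort_segment(segment):
    """Sort one segment by (table, row name, log sequence number)."""
    return sorted(segment, key=lambda e: (e.table, e.row, e.seq))


def entries_for_tablet(sorted_segment, table, start_row, end_row):
    """Slice out the contiguous run for rows in [start_row, end_row)."""
    keys = [(e.table, e.row) for e in sorted_segment]
    lo = bisect_left(keys, (table, start_row))
    hi = bisect_left(keys, (table, end_row))
    return sorted_segment[lo:hi]


def recover_tablet(log, segment_size, table, start_row, end_row):
    """Collect one tablet's mutations from every sorted segment, in log order."""
    recovered = []
    for segment in split_into_segments(log, segment_size):
        sorted_segment = sort_segment(segment)
        recovered.extend(entries_for_tablet(sorted_segment, table, start_row, end_row))
    # Replay order follows the log sequence numbers.
    return sorted(recovered, key=lambda e: e.seq)
```

Because each tablet's entries form one contiguous slice of every sorted segment, a recovering server reads only its slices instead of scanning the whole co-mingled log.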
Writing commit logs to GFS sometimes causes performance
hiccups for a variety of reasons (e.g., a GFS server
machine involved in the write crashes, or the network
paths traversed to reach the particular set of three GFS
servers are suffering network congestion or are heavily
loaded). To protect mutations from GFS latency spikes,
each tablet server actually has two log writing threads,
each writing to its own log file; only one of these two
threads is actively in use at a time. If writes to the active
log file are performing poorly, the log file writing is
switched to the other thread, and mutations that are in
the commit log queue are written by the newly active log
writing thread. Log entries contain sequence numbers
to allow the recovery process to elide duplicated entries
resulting from this log switching process.
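How sequence numbers allow duplicates to be elided can be illustrated with a small sketch. The paper does not give the recovery procedure in detail, so the representation of a log as a list of `(sequence number, mutation)` pairs and the function name `merge_logs` are assumptions made for this example.

```python
def merge_logs(log_a, log_b):
    """Merge two log files, keeping each sequence number exactly once.

    Each log is a list of (seq, mutation) pairs. A mutation that was queued
    when the writer switched threads may appear in both files.
    """
    seen = set()
    merged = []
    for seq, mutation in sorted(log_a + log_b, key=lambda entry: entry[0]):
        if seq in seen:
            continue  # duplicate produced by the log switch
        seen.add(seq)
        merged.append((seq, mutation))
    return merged
```

For example, if mutation 3 was written to the first log just as the switch happened and was written again by the newly active thread, the merged result still contains it once.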
Speeding up tablet recovery
If the master moves a tablet from one tablet server to
another, the source tablet server first does a minor compaction
on that tablet. This compaction reduces recovery
time by reducing the amount of uncompacted state in
the tablet server's commit log. After finishing this compaction,
the tablet server stops serving the tablet. Before
it actually unloads the tablet, the tablet server does another
(usually very fast) minor compaction to eliminate
any remaining uncompacted state in the tablet server's
log that arrived while the first minor compaction was
being performed. After this second minor compaction
is complete, the tablet can be loaded on another tablet
server without requiring any recovery of log entries.
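The two-compaction handoff can be modelled with a toy tablet. Everything here is hypothetical: the `ToyTablet` class, the use of plain dictionaries for the memtable and SSTables, and the sequential simulation of writes that arrive while the first compaction runs.

```python
class ToyTablet:
    """Hypothetical tablet: a mutable memtable plus immutable SSTable snapshots."""

    def __init__(self):
        self.memtable = {}    # uncompacted state, also recorded in the commit log
        self.sstables = []    # results of minor compactions
        self.serving = True

    def write(self, row, value):
        if not self.serving:
            raise RuntimeError("tablet is not being served")
        self.memtable[row] = value

    def minor_compaction(self):
        # Freeze the memtable into a new SSTable and start a fresh memtable.
        if self.memtable:
            self.sstables.append(dict(self.memtable))
            self.memtable = {}


def migrate(tablet, writes_during_first_compaction=()):
    """Prepare a tablet for unloading so the new server needs no log replay."""
    tablet.minor_compaction()                  # first compaction, still serving
    for row, value in writes_during_first_compaction:
        tablet.write(row, value)               # writes that arrived meanwhile
    tablet.serving = False                     # stop serving the tablet
    tablet.minor_compaction()                  # second, usually small, compaction
    return tablet.sstables                     # all state now lives in SSTables
```

After `migrate` returns, the memtable is empty, which is the property that lets another tablet server load the tablet from its SSTables alone.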
Exploiting immutability
Besides the SSTable caches, various other parts of the
Bigtable system have been simplified by the fact that all
of the SSTables that we generate are immutable. For example,
we do not need any synchronization of accesses
to the file system when reading from SSTables. As a result,
concurrency control over rows can be implemented
very efficiently. The only mutable data structure that is
accessed by both reads and writes is the memtable. To reduce
contention during reads of the memtable, we make
each memtable row copy-on-write and allow reads and
writes to proceed in parallel.
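One way to picture copy-on-write rows is the sketch below. It is not the paper's data structure: `CowMemtable`, the single writer lock, and the use of read-only dictionary views as row versions are assumptions chosen to keep the example short.

```python
import threading
from types import MappingProxyType

_EMPTY_ROW = MappingProxyType({})


class CowMemtable:
    """Hypothetical memtable in which every row version is immutable."""

    def __init__(self):
        self._rows = {}
        self._write_lock = threading.Lock()  # serializes writers only

    def write(self, row, column, value):
        with self._write_lock:
            old = self._rows.get(row, _EMPTY_ROW)
            new = dict(old)                  # copy the row, never modify it in place
            new[column] = value
            # Publish the new version with a single reference assignment.
            self._rows[row] = MappingProxyType(new)

    def read_row(self, row):
        # Readers take no lock; they see a complete old or new version.
        return self._rows.get(row, _EMPTY_ROW)
```

A reader that obtained a row before a write keeps a consistent snapshot of the old version, while later readers see the new one.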
Since SSTables are immutable, the problem of permanently
removing deleted data is transformed to garbage
collecting obsolete SSTables. Each tablet's SSTables are
registered in the METADATA table. The master removes
obsolete SSTables as a mark-and-sweep garbage collection
[25] over the set of SSTables, where the METADATA
table contains the set of roots.
Finally, the immutability of SSTables enables us to
split tablets quickly. Instead of generating a new set of
SSTables for each child tablet, we let the child tablets
share the SSTables of the parent tablet.
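The garbage collection and tablet splitting described above fit together, as the following sketch suggests. The dictionary standing in for the METADATA table, the SSTable names, and the function name are assumptions for illustration.

```python
def collect_obsolete_sstables(metadata, all_sstables):
    """Mark-and-sweep over SSTable names.

    metadata maps each tablet to the SSTables it references (the roots);
    all_sstables lists every SSTable file that currently exists.
    """
    live = set()
    for sstables in metadata.values():   # mark: everything reachable from a root
        live.update(sstables)
    return set(all_sstables) - live      # sweep: whatever was never marked


# Hypothetical state after a split: both children reference the parent's files.
metadata = {
    "child_tablet_1": ["sst_parent_1", "sst_parent_2"],
    "child_tablet_2": ["sst_parent_1", "sst_parent_2"],
}
existing = ["sst_parent_1", "sst_parent_2", "sst_old"]
obsolete = collect_obsolete_sstables(metadata, existing)
```

In this example only `sst_old` is unreferenced, so it is the only file reported as obsolete; the shared parent SSTables stay alive as long as either child tablet still references them.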
7 Performance Evaluation
We set up a Bigtable cluster with N tablet servers to
measure the performance and scalability of Bigtable as
N is varied. The tablet servers were configured to use 1
GB of memory and to write to a GFS cell consisting of
1786 machines with two 400 GB IDE hard drives each.
N client machines generated the Bigtable load used for
these tests. (We used the same number of clients as tablet
servers to ensure that clients were never a bottleneck.)
Each machine had two dual-core Opteron 2 GHz chips,
enough physical memory to hold the working set of all
running processes, and a single gigabit Ethernet link.
The machines were arranged in a two-level tree-shaped
switched network with approximately 100-200 Gbps of
aggregate bandwidth available at the root. All of the machines
were in the same hosting facility and therefore the
round-trip time between any pair of machines was less
than a millisecond.
The tablet servers and master, test clients, and GFS
servers all ran on the same set of machines. Every machine
ran a GFS server. Some of the machines also ran
either a tablet server, or a client process, or processes
from other jobs that were using the pool at the same time
as these experiments.
R is the number of distinct Bigtable row keys involved
in the test. R was chosen so that each benchmark read or
wrote approximately 1 GB of data per tablet server.
The sequential write benchmark used row keys with
names 0 to R