Hypertable - Architecture

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 14:43

Architecture

(http://hypertable.com/documentation/architecture/)


Google Architecture


Hypertable is a massively scalable database modeled after Google's Bigtable database. Bigtable is part of a group of scalable computing technologies developed by Google, which is depicted in the following diagram.

Google File System (GFS) - This is the lowest layer of the Google scalable computing stack. It is a filesystem much like any other and allows for the creation of files and directories. The primary innovation of the Google filesystem is that it is massively scalable and highly available. It achieves high availability by replicating file data across three physical machines, which means that it can lose up to two of the machines holding replicas and the data is still available. Hadoop provides an open source implementation of the GFS called HDFS.

MapReduce - This is a parallel computation framework designed to efficiently process data in the GFS. It provides a way to run a large amount of data through a piece of code (map) in parallel by pushing the code out to the machines where the data resides. It also includes a final aggregation step (reduce), which provides a way to re-order the data based on any arbitrary field. Hadoop provides an open source implementation of MapReduce.

Bigtable - This is Google's scalable database. It provides a way to create massive tables of information indexed by a primary key. As of this writing, over 90% of Google's web services are built on top of Bigtable, including Search, Google Earth, Google Analytics, Google Maps, Gmail, Orkut, YouTube, and many more. Hypertable is a high performance, open source implementation of Bigtable.

Sawzall - This is a runtime scripting language that sits on top of the whole stack and provides the ability to perform statistical analysis in an easily expressible way over large data sets. Open source projects such as Hive and Pig provide similar functionality.

Hypertable System Overview

The diagram below provides a high-level overview of the Hypertable system, followed by a brief description of each system component.

Hyperspace - This is Hypertable's equivalent to Google's Chubby service. Hyperspace is a highly available lock manager and provides a filesystem for storing small amounts of metadata. Exclusive or shared locks may be obtained on any created file or directory. High availability is achieved by running in a distributed configuration with replicas running on different physical machines. Consistency is achieved through a distributed consensus protocol. Google refers to Chubby as "the root of all distributed data structures", which is a good way to think of this system.

Master - The Master handles all meta operations such as creating and deleting tables. Client data does not move through the Master, so the Master can be down for short periods of time without clients being aware. The Master is also responsible for detecting range server failures and re-assigning ranges if necessary, and for range server load balancing. Currently there is a single Master process, but high availability is achieved through hot standbys.

RangeServer - Range servers are responsible for managing ranges of table data, handling all reading and writing of data. They can manage up to potentially thousands of ranges and are agnostic to the set of ranges that they manage or the tables of which they're a part. Ranges can move freely from one range server to another, an operation that is mostly orchestrated by the Master.

DFS Broker - Hypertable is capable of running on top of any filesystem. To achieve this, the system has abstracted the interface to the filesystem by sending all filesystem requests through a Distributed File System (DFS) broker process. The DFS broker provides a normalized filesystem interface and translates normalized filesystem requests into native filesystem requests and vice versa. DFS brokers have been developed for HDFS, MapR, Ceph, KFS, and local (for running on top of a local filesystem).
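The broker idea can be illustrated with a minimal sketch. The class and method names below (FsBroker, LocalBroker, append, read) are hypothetical and are not Hypertable's actual broker interface; the sketch only shows how a normalized interface can be translated into native filesystem calls, here for a local filesystem.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class FsBroker(ABC):
    """Hypothetical normalized filesystem interface (names are illustrative)."""

    @abstractmethod
    def append(self, path: str, data: bytes) -> None: ...

    @abstractmethod
    def read(self, path: str) -> bytes: ...


class LocalBroker(FsBroker):
    """Translates normalized requests into local filesystem calls."""

    def __init__(self, root: str) -> None:
        self.root = root

    def _native(self, path: str) -> str:
        # Map a normalized absolute path onto a directory under root.
        return os.path.join(self.root, path.lstrip("/"))

    def append(self, path: str, data: bytes) -> None:
        native = self._native(path)
        os.makedirs(os.path.dirname(native), exist_ok=True)
        with open(native, "ab") as f:
            f.write(data)

    def read(self, path: str) -> bytes:
        with open(self._native(path), "rb") as f:
            return f.read()


def demo_broker() -> bytes:
    # The caller only sees FsBroker; a different backend could be swapped in.
    with tempfile.TemporaryDirectory() as root:
        broker: FsBroker = LocalBroker(root)
        broker.append("/hypertable/log/0", b"abc")
        broker.append("/hypertable/log/0", b"def")
        return broker.read("/hypertable/log/0")
```

Code written against FsBroker would not need to change if another subclass wrapped a distributed filesystem instead.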

Data Representation


Like a relational database, Hypertable represents data as tables of information. Each row in a table has cells containing related information, and each cell is identified, in part, by a row key and column name. Support for up to 255 column names is provided when the table is created. Hypertable provides two additional features:

column qualifier - The column names defined in the table schema are referred to as column families. Applications may supply an optional column qualifier, with each distinct qualifier representing a qualified column instance belonging to the column family. The application can define an unlimited number of qualified instances of a column family. The application-supplied column name has the format family:qualifier, and column data is stored in a sparse format such that one row may have millions of qualified instances of a column family, while another row may have none or just a few instances.

timestamp - This is a 64-bit field associated with each cell that allows for different cell versions. The value represents nanoseconds since the Unix epoch and can be supplied by the application or auto-assigned by the server. The number of versions stored can be configured in the table schema, and the number of versions returned can be specified in the query predicate. The versions are stored in reverse-chronological order, so that the newest version of the cell is returned first.
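To make the data model concrete, here is a small in-memory sketch of family:qualifier columns with versioned cells. The class SparseTable and its methods are hypothetical illustrations, not part of any Hypertable API, and the version limit is applied at insert time purely for simplicity.

```python
from collections import defaultdict


class SparseTable:
    """Toy model: row -> "family:qualifier" -> list of (timestamp_ns, value)."""

    def __init__(self, families, max_versions=3):
        self.families = set(families)      # column families fixed by the schema
        self.max_versions = max_versions   # versions kept per cell
        self.rows = defaultdict(dict)      # only cells that exist are stored

    def insert(self, row, column, value, timestamp_ns):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        versions = self.rows[row].setdefault(column, [])
        versions.append((timestamp_ns, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first
        del versions[self.max_versions:]                   # drop the oldest

    def get(self, row, column, limit=1):
        # Return up to `limit` versions, newest first.
        return [v for _, v in self.rows.get(row, {}).get(column, [])[:limit]]
```

Because a row stores only the qualified columns that were actually written, one row can hold many qualifiers of a family while another holds none.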

The following diagram illustrates how data is represented in Hypertable. The table is an example taken from a web crawler that stores information for each page that it crawls in a row of the table.

The above diagram illustrates the use of the column qualifier. A Web search engine builds an index (much like the one in the back of a book) that maps words to the Web documents that contain them. Included in this index are not only the words in the Web page itself, but also the words in the anchor text of the remote links that point to the Web page. This is how image results can appear in Web search results. For example, given an image of a Ferrari (which contains no text), if there are enough links pointing to the image that contain the word "Ferrari" in the anchor text, then the page may get a high score for the query "Ferrari" and appear in the search results.

The one dimension that is missing from the above diagram is the timestamp. Imagine that each cell in the diagram above has a z-axis that contains timestamped versions of the cell. This multi-dimensional table gets flattened out, under the hood, as sorted lists of key/value pairs, as illustrated in the following diagram.
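The flattening step can be sketched as a sort over cell tuples. The tuple layout and the function name flatten are assumptions made for illustration; the ordering (row, column family, qualifier, then newest timestamp first) follows the description in this document.

```python
def flatten(cells):
    """Sort cells into the flat key/value order described above.

    cells: iterable of (row, family, qualifier, timestamp_ns, value) tuples.
    Rows, families and qualifiers sort ascending; timestamps sort descending
    so the newest version of each cell comes first.
    """
    return sorted(cells, key=lambda c: (c[0], c[1], c[2], -c[3]))
```

For instance, two versions of the same cell with timestamps 10 and 20 end up adjacent in the list, with the timestamp-20 version first.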

Anatomy of a Key


The following diagram illustrates the format of the key that Hypertable uses internally.

control - This field consists of bit flags that describe the format of the remaining fields. There are certain circumstances where the timestamp or revision number may be absent, or where they are identical, in which case they're collapsed into a single field. This field contains that information and tells Hypertable how to properly interpret the key.

row key - This field contains a '\0' terminated string that represents the row key.

column family - This field is a single-byte field that indicates the column family code.

column qualifier - This field contains a '\0' terminated string that represents the column qualifier.

flag - Deletes are handled through the insertion of special "delete" records (or tombstones) that indicate that some portion of a row's cells have been deleted. These delete records are applied at query time, and the deleted cells are garbage collected during major compactions.

timestamp - This field is an 8-byte (64-bit) field that contains the cell timestamp, represented as nanoseconds since the Unix epoch. By default, the timestamp is stored big-endian, ones' complement, so that within a given cell, versions are stored newest to oldest.

revision - This field is an 8-byte (64-bit) field that contains a high resolution timestamp that is currently used internally to provide snapshot isolation for queries.
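The effect of storing the timestamp big-endian and ones' complement can be shown with a simplified encoder. This is a sketch under stated assumptions, not the real on-disk format: it omits the control and revision fields, and the function name encode_key and the flag value used in the usage note are hypothetical.

```python
import struct

MASK64 = (1 << 64) - 1


def encode_key(row: str, family_code: int, qualifier: str,
               flag: int, timestamp_ns: int) -> bytes:
    """Simplified layout: row\\0, family byte, qualifier\\0, flag byte, timestamp."""
    inverted = ~timestamp_ns & MASK64  # ones' complement within 64 bits
    return (
        row.encode() + b"\0"
        + bytes([family_code])
        + qualifier.encode() + b"\0"
        + bytes([flag])
        + struct.pack(">Q", inverted)  # big-endian: bytes compare like integers
    )
```

With this encoding, a plain bytewise comparison places the key with timestamp 200 before the key with timestamp 100 for the same cell, so a sorted key list is newest-first without any special comparator: `encode_key("r", 1, "q", 255, 200) < encode_key("r", 1, "q", 255, 100)`.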

Access Groups


Access Groups provide a way to control the physical storage of column data to optimize disk I/O. Access Groups are defined in the table schema and instruct Hypertable to physically store all data for columns within the same access group together on disk. This feature allows you to optimize queries for columns that are accessed with high frequency by reducing the amount of data transferred from disk during query execution. Disk I/O is limited to just the data from the access groups of the columns specified in the query. For example, consider the following schema.

CREATE TABLE User (
  name,
  address,
  photo,
  profile,
  ACCESS GROUP default (name, address, photo),
  ACCESS GROUP profile (profile)
);

Hypertable will create two physical groupings of column data, one for the name, address, and photo columns, and another for the profile column. The following diagram illustrates this physical grouping.

Consider the following query for the profile column of the User table.

SELECT profile FROM User;

The execution of this query will be efficient because only the data for the profile column will be transferred from disk during query execution.
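The reasoning about which data a query touches can be sketched as a lookup from selected columns to access groups. The dictionary mirrors the User schema above; the function name groups_to_read is hypothetical and only models the decision, not actual disk I/O.

```python
# Access groups from the example User schema.
ACCESS_GROUPS = {
    "default": ["name", "address", "photo"],
    "profile": ["profile"],
}


def groups_to_read(selected_columns, access_groups=ACCESS_GROUPS):
    """Return the access groups whose stored data a query must touch."""
    wanted = set(selected_columns)
    return sorted(g for g, cols in access_groups.items() if wanted & set(cols))
```

A query that selects only profile maps to the profile group alone, so the data for name, address, and photo never needs to be read.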

RangeServer Insert Handling

The following diagram illustrates how inserts are handled inside the RangeServer.

Step 1: Commit Log - Inserts are appended to the Commit log, which resides in the distributed filesystem (DFS), followed by a sync operation that tells the filesystem to persist any buffered writes to disk. If multiple insert requests are pending, or a GROUP_COMMIT_INTERVAL is configured for the table, then the sync operation is performed after multiple Commit log appends to improve throughput.

Step 2: Add to map - The inserts are added to the in-memory CellCache (equivalent to the Memtable in the Bigtable paper).

Step 3: Acknowledge - Acknowledgement is sent back to the application.

Background Maintenance Threads - Over time, as the CellCaches fill memory, background maintenance threads will "spill" the in-memory CellCache data to on-disk CellStore files. This frees up memory inside the RangeServer, which allows it to accept more inserts.

This design makes Hypertable writes durable and consistent because inserts are not acknowledged until the Commit log has been successfully written to.
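The three insert steps and the background spill can be sketched as follows. This is a toy model under several assumptions: the commit log is a local file rather than a file in the DFS, there is no group commit, the maintenance step is called explicitly instead of running in a background thread, and the class ToyRangeServer with its attribute names is hypothetical.

```python
import os
import tempfile


class ToyRangeServer:
    def __init__(self, log_path, spill_threshold=3):
        self.log = open(log_path, "ab")
        self.cell_cache = {}    # in-memory map, standing in for the CellCache
        self.cell_stores = []   # sorted lists, standing in for CellStore files
        self.spill_threshold = spill_threshold

    def insert(self, key, value):
        # Step 1: append to the commit log, then sync to stable storage.
        self.log.write(f"{key}\t{value}\n".encode())
        self.log.flush()
        os.fsync(self.log.fileno())
        # Step 2: add the cell to the in-memory cache.
        self.cell_cache[key] = value
        # Step 3: acknowledge only after the log write has succeeded.
        return "ACK"

    def maintenance(self):
        # Spill the cache to a sorted "CellStore" once it is large enough.
        if len(self.cell_cache) >= self.spill_threshold:
            self.cell_stores.append(sorted(self.cell_cache.items()))
            self.cell_cache = {}

    def close(self):
        self.log.close()


def demo_insert():
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "commit.log")
        rs = ToyRangeServer(path)
        acks = [rs.insert(f"row{i}", f"v{i}") for i in range(3)]
        rs.maintenance()
        rs.close()
        with open(path, "rb") as f:
            logged = f.read().count(b"\n")
        return acks, logged, len(rs.cell_cache), len(rs.cell_stores)
```

If the write or the fsync raises, insert never reaches the return statement, which is the property the text relies on: no acknowledgement without a durable log entry.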

RangeServer Query Handling

The following diagram illustrates how queries are handled inside the RangeServer.

Data for a range can reside in the in-memory CellCache as well as in some number of on-disk CellStores (see the following section). To evaluate a query over a table range, the RangeServer must create a unified view of the data. It does this with a MergeScanner object, which merges together the sorted key/value pairs coming from the CellCache and CellStores. This unified stream of key/value pairs is then filtered to produce the desired results.
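The merge step is a standard k-way merge of sorted streams. The sketch below uses Python's heapq.merge to illustrate the idea; it is not Hypertable's MergeScanner, it ignores delete records and versions, and the function name merge_scan and its predicate parameter are hypothetical.

```python
import heapq


def merge_scan(cell_cache, cell_stores, predicate=lambda k, v: True):
    """Merge sorted (key, value) streams into one sorted stream, then filter.

    cell_cache: dict of key -> value (unsorted, in memory).
    cell_stores: list of lists of (key, value) pairs, each already sorted.
    """
    streams = [sorted(cell_cache.items())] + list(cell_stores)
    for key, value in heapq.merge(*streams, key=lambda kv: kv[0]):
        if predicate(key, value):
            yield key, value
```

Because every input is already sorted, the merge only needs to look at the head of each stream, so results can be streamed back without loading whole CellStores into memory.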

CellStore Format


Over time, the RangeServers will write in-memory CellCaches to on-disk files, called CellStores, whose format is shown in the illustration to the right. The following describes the sections of the CellStore file format.

Compressed blocks of cells (key/value pairs) - This section consists of a series of compressed blocks of sorted key/value pairs. By default, the compressed blocks are approximately 64KB in size. This size can be controlled by the Hypertable.RangeServer.CellStore.DefaultBlockSize property. These blocks are the minimum unit of data transfer from disk.

Bloom Filter - After the compressed blocks of key/value pairs comes the bloom filter. This is a probabilistic data structure that describes the keys that exist (with high likelihood) in the CellStore. It also signals if a key is definitively not present, which helps the RangeServer avoid unnecessary block transfer and decompression.

Block Index - After the bloom filter comes the block index. This index lists, for each block, the last key in the block followed by the block offset.

Trailer - At the end of the CellStore is the trailer. The trailer contains general statistics about the CellStore and includes the version number of the CellStore format so that the RangeServer can interpret it correctly.
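How the bloom filter and block index cooperate on a point lookup can be sketched in memory. Everything here is an illustrative assumption rather than the real file format: blocks are sized by number of pairs instead of bytes, zlib and JSON stand in for the actual compression and serialization, the bloom filter is a single integer bitmask with three hash positions, the block "offset" is a list position, and the class name ToyCellStore is hypothetical.

```python
import bisect
import hashlib
import json
import zlib


class ToyCellStore:
    def __init__(self, sorted_pairs, block_size=2, bloom_bits=256):
        self.blocks, self.index = [], []   # compressed blocks, last key per block
        self.bloom_bits, self.bloom = bloom_bits, 0
        self.blocks_read = 0               # counts block decompressions
        for i in range(0, len(sorted_pairs), block_size):
            chunk = sorted_pairs[i:i + block_size]
            self.blocks.append(zlib.compress(json.dumps(chunk).encode()))
            self.index.append(chunk[-1][0])
            for key, _ in chunk:
                for bit in self._bits(key):
                    self.bloom |= 1 << bit

    def _bits(self, key):
        # Derive three bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key.encode()).digest()
        return [int.from_bytes(digest[i:i + 4], "big") % self.bloom_bits
                for i in (0, 4, 8)]

    def get(self, key):
        if any(not (self.bloom >> bit) & 1 for bit in self._bits(key)):
            return None                    # definitely absent: no block is read
        # First block whose last key is >= the lookup key.
        pos = bisect.bisect_left(self.index, key)
        if pos == len(self.index):
            return None
        self.blocks_read += 1
        block = json.loads(zlib.decompress(self.blocks[pos]))
        return dict(block).get(key)
```

A lookup for a present key decompresses exactly one block; a lookup rejected by the bloom filter decompresses none, which is the saving the Bloom Filter paragraph describes.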

Query Routing


The following diagram illustrates the data structures that support the query routing algorithm, which is how queries get sent to the relevant RangeServers.

METADATA Table


There is a special table in Hypertable called the METADATA table that contains a row for each range in the system. A column named Location indicates which RangeServer is currently serving the range. Though the diagram shows IP addresses in the Location column, the system stores a proxy name for the RangeServer in that column so that the system can be run on public clouds such as Amazon's EC2 and operate correctly in the face of server restarts and IP address changes. A two-level hierarchy is overlaid on top of the METADATA table. The first range is the ROOT range, which contains pointers to the second-level ranges, which in turn contain pointers to the USER ranges, the ranges that make up regular user or application defined tables.
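The two-level lookup can be sketched with sorted lists and binary search. The layout below is a simplification made up for illustration: ranges are identified only by their end row, and the range names (meta1, meta2) and proxy names (rs1, rs2, rs3) are hypothetical. It is not the actual METADATA row format.

```python
import bisect


def find(entries, key):
    """entries: sorted list of (end_row, target).

    Returns the target of the first range whose end_row is >= key.
    """
    ends = [end for end, _ in entries]
    pos = bisect.bisect_left(ends, key)
    if pos == len(entries):
        raise KeyError(key)
    return entries[pos][1]


# Hypothetical layout: ROOT range -> second-level METADATA ranges -> proxy names.
ROOT = [("m", "meta1"), ("\uffff", "meta2")]
METADATA = {
    "meta1": [("f", "rs1"), ("m", "rs2")],
    "meta2": [("t", "rs3"), ("\uffff", "rs1")],
}


def locate(row_key):
    # Level 1: the ROOT range says which METADATA range covers the key.
    # Level 2: that METADATA range says which RangeServer holds the user range.
    return find(METADATA[find(ROOT, row_key)], row_key)
```

Each lookup costs two binary searches, one per level, regardless of how many user ranges exist.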

Client Library


The Client Library provides the application programming interface (API) that allows an application to talk to Hypertable. This library is linked into each Hypertable application and handles query routing. The client library includes a METADATA cache, which contains the range location information obtained by walking the METADATA hierarchy. Most application range location requests are served directly out of this cache. The ThriftBroker, which provides a high-level language interface to Hypertable, links against the client library and is a long-lived process, so its METADATA cache is usually fresh and populated. For this reason, we recommend that short-lived applications (e.g. CGI programs) use the Thrift interface to avoid having to walk the METADATA hierarchy for each request.

Adaptive Memory Allocation

The following diagram illustrates how the RangeServer adapts its memory usage based on changes in workload.

Under a write-heavy workload, the RangeServer will give more memory to the CellCaches so that they can grow as large as possible, which minimizes the amount of spilling and merging work required. Under a read-heavy workload, the system gives most of the memory to the block cache, which significantly improves query throughput and latency.