Cockroach Design Translation (11): Range Metadata

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 22:44

15 Range Metadata

The default approximate size of a range is 64M (2^26 B). In order to support 1P (2^50 B) of logical data, metadata is needed for roughly 2^(50 - 26) = 2^24 ranges. A reasonable upper bound on range metadata size is roughly 256 bytes (3*12 bytes for the triplicated node locations and 220 bytes for the range key itself). 2^24 ranges * 2^8 B would require roughly 4G (2^32 B) to store, which is too much to duplicate between machines. Our conclusion is that range metadata must be distributed for large installations.
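The sizing argument above can be checked with a few lines of arithmetic. This is only a sketch of the estimate; the constant names are made up for illustration, and the values are the ones quoted in the paragraph.

```python
# Back-of-the-envelope sizing for range metadata, using the figures quoted above.
RANGE_SIZE = 2**26            # 64M default range size, in bytes
LOGICAL_DATA = 2**50          # 1P of logical data, in bytes
NODE_LOCATION_BYTES = 3 * 12  # three replica locations at 12 bytes each
RANGE_KEY_BYTES = 220         # upper bound for the range key itself

num_ranges = LOGICAL_DATA // RANGE_SIZE                       # 2**24 ranges
metadata_per_range = NODE_LOCATION_BYTES + RANGE_KEY_BYTES    # 256 bytes
total_metadata = num_ranges * metadata_per_range              # 2**32 bytes (4G)

print(f"ranges: 2^{num_ranges.bit_length() - 1}")
print(f"metadata per range: {metadata_per_range} B")
print(f"total metadata: 2^{total_metadata.bit_length() - 1} B")
```

A 4G table is small for one machine, but every node would need its own up-to-date copy, which is the duplication cost the text rules out.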

To keep key lookups relatively fast in the presence of distributed metadata, we store all the top-level metadata in a single range (the first range). These top-level metadata keys are known as meta1 keys, and are prefixed such that they sort to the beginning of the key space. Given the metadata size of 256 bytes given above, a single 64M range would support 64M/256B = 2^18 ranges, which gives a total storage of 64M * 2^18 = 16T. To support the 1P quoted above, we need two levels of indirection, where the first level addresses the second, and the second addresses user data. With two levels of indirection, we can address 2^(18 + 18) = 2^36 ranges; each range addresses 2^26 B, and altogether we address 2^(36+26) B = 2^62 B = 4E of user data.
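The capacity figures for one and two levels of indirection follow directly from the 64M range size and the 256-byte record bound. A minimal sketch of that calculation, with illustrative variable names:

```python
# Addressable capacity with one and two levels of metadata indirection.
RANGE_SIZE = 2**26       # 64M per range, in bytes
METADATA_RECORD = 2**8   # 256-byte upper bound per range metadata record

records_per_range = RANGE_SIZE // METADATA_RECORD   # 2**18 records fit in one range
one_level_bytes = records_per_range * RANGE_SIZE    # 2**44 B = 16T
two_level_ranges = records_per_range ** 2           # 2**36 addressable ranges
two_level_bytes = two_level_ranges * RANGE_SIZE     # 2**62 B = 4E

print(f"one level:  2^{one_level_bytes.bit_length() - 1} B")
print(f"two levels: 2^{two_level_bytes.bit_length() - 1} B")
```

One level tops out at 16T, which falls short of the 1P target; the second level raises the ceiling to 4E.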

For a given user-addressable key1, the associated meta1 record is found at the successor key to key1 in the meta1 space. Since the meta1 space is sparse, the successor key is defined as the next key which is present. The meta1 record identifies the range containing the meta2 record, which is found using the same process. The meta2 record identifies the range containing key1, which is again found the same way (see examples below).

Concretely, metadata keys are prefixed by \x02 (meta1) and \x03 (meta2); the prefixes \x02 and \x03 provide for the desired sorting behaviour. Thus, key1's meta1 record will reside at the successor key to \x02<key1>.
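To illustrate the sorting behaviour the prefixes provide, here is a small sketch. The helper names meta1_key and meta2_key are hypothetical, and the example assumes user keys start with a byte greater than \x03, which the source does not state explicitly.

```python
# Prefixes taken from the text; helper names are illustrative only.
META1_PREFIX = b"\x02"
META2_PREFIX = b"\x03"

def meta1_key(key: bytes) -> bytes:
    """Build the meta1 addressing key for a user key."""
    return META1_PREFIX + key

def meta2_key(key: bytes) -> bytes:
    """Build the meta2 addressing key for a user key."""
    return META2_PREFIX + key

# Assumption: user keys begin with a byte greater than \x03.
keys = [b"apple", meta2_key(b"apple"), b"zebra",
        meta1_key(b"apple"), meta1_key(b"\xff")]
sorted_keys = sorted(keys)  # bytewise order: meta1 keys, then meta2 keys, then user keys

for k in sorted_keys:
    print(k)
```

Under bytewise ordering every meta1 key sorts before every meta2 key, and both sort before the user keys, so the metadata lands at the start of the key space.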

Note: we append the end key of each range to meta{1,2} records because the RocksDB iterator only supports a Seek() interface which acts as a Ceil(). Using the start key of the range would cause Seek() to find the key after the meta indexing record we're looking for, which would result in having to back the iterator up, an option which is both less efficient and not available in all cases.
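The effect of indexing by end key rather than start key can be shown with a stand-in for Seek(). The seek function below is a plain Python imitation of Ceil semantics built on bisect, not the RocksDB API, and the three ranges are hypothetical.

```python
import bisect

def seek(sorted_keys, target):
    """Return the first key >= target, or None (Ceil semantics)."""
    i = bisect.bisect_left(sorted_keys, target)
    return sorted_keys[i] if i < len(sorted_keys) else None

META2 = b"\x03"

# Three hypothetical ranges given as (first key, last key):
# (a, f), (g, o), (p, \xff)
by_end_key = [META2 + b"f", META2 + b"o", META2 + b"\xff"]
by_start_key = [META2 + b"a", META2 + b"g", META2 + b"p"]

lookup = META2 + b"h"  # key "h" lives in the range (g, o)

hit_end = seek(by_end_key, lookup)      # record of range (g, o): the one we want
hit_start = seek(by_start_key, lookup)  # record of range (p, \xff): one too far

print(hit_end, hit_start)
```

With end keys, a single forward seek lands on the correct record. With start keys, the seek overshoots to the next range's record, and the iterator would have to step backwards.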

The following example shows the directory structure for a map with three ranges worth of data. Ellipses indicate additional key/value pairs to fill an entire range of data. For clarity, the examples use meta1 and meta2 to refer to the prefixes \x02 and \x03. Except for the fact that splitting ranges requires updates to the range metadata with knowledge of the metadata layout, the range metadata itself requires no special treatment or bootstrapping.

Range 0 (located on servers dcrama1:8000, dcrama2:8000, dcrama3:8000)

- meta1\xff: dcrama1:8000, dcrama2:8000, dcrama3:8000
- meta2<lastkey0>: dcrama1:8000, dcrama2:8000, dcrama3:8000
- meta2<lastkey1>: dcrama4:8000, dcrama5:8000, dcrama6:8000
- meta2\xff: dcrama7:8000, dcrama8:8000, dcrama9:8000
- ...
- <lastkey0>: <lastvalue0>

Range 1 (located on servers dcrama4:8000, dcrama5:8000, dcrama6:8000)

- ...
- <lastkey1>: <lastvalue1>

Range 2 (located on servers dcrama7:8000, dcrama8:8000, dcrama9:8000)

- ...
- <lastkey2>: <lastvalue2>

Consider a simpler example of a map containing less than a single range of data. In this case, all range metadata and all data are located in the same range:

Range 0 (located on servers dcrama1:8000, dcrama2:8000, dcrama3:8000)

- meta1\xff: dcrama1:8000, dcrama2:8000, dcrama3:8000
- meta2\xff: dcrama1:8000, dcrama2:8000, dcrama3:8000
- <key0>: <value0>
- ...

Finally, a map large enough to need both levels of indirection would look like this (note that instead of showing range replicas, this example is simplified to just show range indexes):

Range 0

- meta1<lastkeyN-1>: Range 0
- meta1\xff: Range 1
- meta2<lastkey1>: Range 1
- meta2<lastkey2>: Range 2
- meta2<lastkey3>: Range 3
- ...
- meta2<lastkeyN-1>: Range 262143

Range 1

- meta2<lastkeyN>: Range 262144
- meta2<lastkeyN+1>: Range 262145
- ...
- meta2\xff: Range 500,000
- ...
- <lastkey1>: <lastvalue1>

Range 2

- ...
- <lastkey2>: <lastvalue2>

Range 3

- ...
- <lastkey3>: <lastvalue3>

Range 262144

- ...
- <lastkeyN>: <lastvalueN>

Range 262145

- ...
- <lastkeyN+1>: <lastvalueN+1>

Note that the choice of range 262144 is just an approximation. The actual number of ranges addressable via a single metadata range is dependent on the size of the keys. If efforts are made to keep key sizes small, the total number of addressable ranges would increase and vice versa.

From the examples above it's clear that key location lookups require at most three reads to get the value for <key>:

1. lower bound of meta1<key>, which yields the location of the meta2 record;
2. lower bound of meta2<key>, which yields the location of the key;
3. <key>, which yields the value.
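The three reads can be traced on a toy cluster modelled on the three-range example earlier in this section. Everything here is a simplified sketch: the Range class, the in-memory ranges table, and the single-letter keys are hypothetical, and a range ID stands in for the replica addresses.

```python
import bisect

META1, META2 = b"\x02", b"\x03"

class Range:
    """A toy range: a sorted key/value map with a Ceil-style seek."""
    def __init__(self, items):
        self.keys = sorted(items)
        self.items = dict(items)

    def seek(self, target):
        # Return the first (key, value) pair whose key is >= target.
        i = bisect.bisect_left(self.keys, target)
        if i == len(self.keys):
            raise KeyError(target)
        k = self.keys[i]
        return k, self.items[k]

# Range 0 holds all metadata plus the first slice of user data.
ranges = {
    0: Range({META1 + b"\xff": 0,
              META2 + b"f": 0, META2 + b"o": 1, META2 + b"\xff": 2,
              b"a": b"apple", b"f": b"fig"}),
    1: Range({b"g": b"grape", b"o": b"olive"}),
    2: Range({b"p": b"pear", b"z": b"zucchini"}),
}

def lookup(key):
    # Read 1: lower bound of meta1<key> names the range holding the meta2 record.
    _, meta2_range = ranges[0].seek(META1 + key)
    # Read 2: lower bound of meta2<key> names the range holding the key.
    _, data_range = ranges[meta2_range].seek(META2 + key)
    # Read 3: fetch the value itself.
    return ranges[data_range].items[key], data_range

print(lookup(b"g"))
print(lookup(b"z"))
```

In this toy layout all metadata sits in Range 0, so the first two reads hit the same range; in a map with two full levels of indirection they could land on different ranges.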

For small maps, the entire lookup is satisfied in a single RPC to Range 0. Maps containing less than 16T of data would require two lookups. Clients cache both levels of range metadata, and we expect that data locality for individual clients will be high. Clients may end up with stale cache entries. If on a lookup, the range consulted does not match the client's expectations, the client evicts the stale entries and possibly does a new lookup.
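The evict-and-retry behaviour might be sketched as follows. The source does not describe the cache's structure or how a client detects a mismatch, so the RangeCache class, the fetch callback, and the explicit evict call are all assumptions made for illustration.

```python
import bisect

class RangeCache:
    """Hypothetical client-side cache of range descriptors, indexed by last key."""
    def __init__(self, fetch):
        self.fetch = fetch   # fetch(key) -> (first_key, last_key, range_id)
        self.last_keys = []  # sorted last keys of cached descriptors
        self.entries = {}    # last_key -> (first_key, range_id)

    def _cached(self, key):
        # Ceil-style search: the first cached last key >= key.
        i = bisect.bisect_left(self.last_keys, key)
        if i < len(self.last_keys):
            last = self.last_keys[i]
            first, range_id = self.entries[last]
            if first <= key:  # the descriptor really covers this key
                return last, range_id
        return None

    def lookup(self, key):
        hit = self._cached(key)
        if hit is not None:
            return hit[1]
        first, last, range_id = self.fetch(key)  # authoritative metadata lookup
        if last not in self.entries:
            bisect.insort(self.last_keys, last)
        self.entries[last] = (first, range_id)
        return range_id

    def evict(self, key):
        hit = self._cached(key)
        if hit is not None:
            self.last_keys.remove(hit[0])
            del self.entries[hit[0]]

# Simulated authoritative metadata: (first_key, last_key, range_id).
table = [(b"a", b"o", 1), (b"p", b"z", 2)]
calls = []

def fetch(key):
    calls.append(key)
    for first, last, rid in table:
        if first <= key <= last:
            return first, last, rid
    raise KeyError(key)

cache = RangeCache(fetch)
first_answer = cache.lookup(b"h")   # miss: one metadata lookup
cached_answer = cache.lookup(b"c")  # hit: same descriptor, no new lookup

# Range 1 splits; keys h..o now live in a new range 3.
table[:] = [(b"a", b"g", 1), (b"h", b"o", 3), (b"p", b"z", 2)]
stale_answer = cache.lookup(b"h")   # cache still points at range 1
cache.evict(b"h")                   # the client notices the mismatch and evicts
fresh_answer = cache.lookup(b"h")   # new lookup returns range 3
```

Here the stale entry is only discovered when the client contacts the wrong range, which matches the text's description that eviction happens on a mismatch rather than proactively.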

0 0