




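Applying a random salt (prefixes a, b, c, d) spreads writes that would otherwise hit a single region at the same time across four regions, so write throughput can in theory be up to four times higher.

It also lets rowkeys like the foo0003 example above be placed in a predictable region, which makes retrieval faster.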
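The salting idea can be sketched in a few lines of plain Java. This is a minimal illustration, not HBase API: the class `SaltedKeys`, the method `salt`, and the `prefix-key` layout are hypothetical names chosen here, and the sketch assumes the prefix is derived from a hash of the key (rather than picked at random) so that a reader can recompute it later.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Sketch of deterministic rowkey salting; all names are hypothetical, not HBase API. */
final class SaltedKeys {
    // Four salt prefixes, matching the a/b/c/d example in the text.
    static final String[] SALTS = {"a", "b", "c", "d"};

    /** Derives the prefix from a hash of the key, so the same key always gets the same salt. */
    static String salt(String rowKey) {
        byte[] raw = rowKey.getBytes(StandardCharsets.UTF_8);
        int bucket = Math.floorMod(Arrays.hashCode(raw), SALTS.length);
        return SALTS[bucket] + "-" + rowKey;
    }

    public static void main(String[] args) {
        // Sequential keys end up under different prefixes instead of one hot region.
        for (String key : new String[] {"foo0001", "foo0002", "foo0003", "foo0004"}) {
            System.out.println(key + " -> " + salt(key));
        }
    }
}
```

Because the prefix is a pure function of the key, a client that knows `foo0003` can rebuild the full salted key and fetch the row directly. With a truly random salt, a read would have to try every prefix.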


2. Monotonically Increasing Row Keys/Timeseries Data

3. Try to minimize row and column sizes


(1) Column Families (keep family names as short as possible, and use no more than 3 families)

Try to keep the ColumnFamily names as small as possible, preferably one character (e.g. "d" for data/default).

(2) Attributes (verbose names are easier to read, but shorter attribute names are cheaper for HBase to store)

Although verbose attribute names (e.g., "myVeryImportantAttribute") are easier to read, prefer shorter attribute names (e.g., "via") to store in HBase.

(3) Rowkey Length (keep rowkeys short; a Get on a known key generally performs better than a Scan)

Keep them as short as is reasonable such that they can still be useful for required data access (e.g. Get vs. Scan). A short key that is useless for data access is not better than a longer key with better get/scan properties. Expect tradeoffs when designing rowkeys.
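To make the size advice in (1) to (3) concrete, here is a rough sketch that counts how many bytes of row, family, and qualifier names travel with a single stored value. It assumes, as background not stated in this excerpt, that HBase stores these coordinates alongside every cell, and it ignores the timestamp and other fixed per-cell overhead. `CellKeySize`, `coordinateBytes`, and the sample row key `row-0001` are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Rough per-cell size estimate; a simplification, not HBase's actual on-disk layout. */
final class CellKeySize {
    /** Bytes taken by the row, family, and qualifier names that accompany one value. */
    static int coordinateBytes(String row, String family, String qualifier) {
        return utf8Length(row) + utf8Length(family) + utf8Length(qualifier);
    }

    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        int verbose = coordinateBytes("row-0001", "data", "myVeryImportantAttribute");
        int compact = coordinateBytes("row-0001", "d", "via");
        System.out.println("verbose names: " + verbose + " bytes per cell");
        System.out.println("compact names: " + compact + " bytes per cell");
    }
}
```

The difference looks small for one cell, but it is repeated for every cell in the table, which is why short names add up.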

(4) Byte Patterns (HBase stores raw bytes, so compact binary encodings work well)

A long is 8 bytes. You can store an unsigned number up to 18,446,744,073,709,551,615 in those eight bytes. If you stored this number as a String — presuming a byte per character — you need nearly 3x the bytes.

Not convinced? Below is some sample code that you can run on your own.

import java.security.MessageDigest;
import org.apache.hadoop.hbase.util.Bytes;

// long
long l = 1234567890L;
byte[] lb = Bytes.toBytes(l);
System.out.println("long bytes length: " + lb.length);   // returns 8

String s = String.valueOf(l);
byte[] sb = Bytes.toBytes(s);
System.out.println("long as string length: " + sb.length);    // returns 10

// hash
MessageDigest md = MessageDigest.getInstance("MD5");
byte[] digest = md.digest(Bytes.toBytes(s));
System.out.println("md5 digest bytes length: " + digest.length);    // returns 16

String sDigest = new String(digest);
byte[] sbDigest = Bytes.toBytes(sDigest);
System.out.println("md5 digest as string length: " + sbDigest.length);    // returns 26

Unfortunately, using a binary representation of a type will make your data harder to read outside of your code. For example, this is what you will see in the shell when you increment a value:

hbase(main):001:0> incr 't', 'r', 'f:q', 1
COUNTER VALUE = 1

hbase(main):002:0> get 't', 'r'
COLUMN                                        CELL
 f:q                                          timestamp=1369163040570, value=\x00\x00\x00\x00\x00\x00\x00\x01
1 row(s) in 0.0310 seconds

The shell makes a best effort to print a string, and in this case it decided to just print the hex. The same will happen to your row keys inside the region names. It can be okay if you know what’s being stored, but it might also be unreadable if arbitrary data can be put in the same cells. This is the main trade-off.
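If you do know what is stored, turning the shell's hex output back into a number is straightforward. The sketch below uses only the JDK and assumes the counter is an 8-byte big-endian long, which is what the `\x00...\x01` output for a value of 1 suggests. `CounterBytes`, `toLong`, and `toHexEscapes` are hypothetical helper names, not part of HBase.

```java
import java.nio.ByteBuffer;

/** Sketch: converting between an 8-byte counter value and the \xNN form the shell prints. */
final class CounterBytes {
    /** Interprets 8 bytes as a big-endian long (ByteBuffer's default byte order). */
    static long toLong(byte[] eightBytes) {
        return ByteBuffer.wrap(eightBytes).getLong();
    }

    /** Formats each byte as a \xNN escape, similar in spirit to the shell output. */
    static String toHexEscapes(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("\\x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] stored = {0, 0, 0, 0, 0, 0, 0, 1};
        System.out.println(toHexEscapes(stored) + " decodes to " + toLong(stored));
    }
}
```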

4. Reverse Timestamps

Reverse Scan API

HBASE-4811 implements an API to scan a table or a range within a table in reverse, reducing the need to optimize your schema for forward or reverse scanning. This feature is available in HBase 0.98 and later. See https://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html#setReversed%28boolean for more information.

A common problem in database processing is quickly finding the most recent version of a value. A technique using reverse timestamps as a part of the key can help greatly with a special case of this problem. Also found in the HBase chapter of Tom White’s book Hadoop: The Definitive Guide (O’Reilly), the technique involves appending (Long.MAX_VALUE - timestamp) to the end of any key, e.g. [key][reverse_timestamp].

The most recent value for [key] in a table can be found by performing a Scan for [key] and obtaining the first record. Since HBase keys are in sorted order, this key sorts before any older row-keys for [key] and thus is first.

This technique would be used instead of using Number of Versions where the intent is to hold onto all versions "forever" (or a very long time) and at the same time quickly obtain access to any other version by using the same Scan technique.
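A minimal sketch of the [key][reverse_timestamp] layout follows, using only the JDK. It assumes non-negative millisecond timestamps and compares the resulting keys as unsigned bytes in lexicographic order, which is how HBase orders rowkeys. `ReverseTimestampKey`, `build`, and the sample key `sensor42` are hypothetical names for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Sketch of a rowkey with a reverse timestamp appended; names are hypothetical. */
final class ReverseTimestampKey {
    /** Builds [key][Long.MAX_VALUE - timestamp] with the long written big-endian. */
    static byte[] build(String key, long timestampMillis) {
        byte[] prefix = key.getBytes(StandardCharsets.UTF_8);
        long reversed = Long.MAX_VALUE - timestampMillis;
        return ByteBuffer.allocate(prefix.length + Long.BYTES)
                .put(prefix)
                .putLong(reversed)
                .array();
    }

    public static void main(String[] args) {
        byte[] older = build("sensor42", 1_369_163_040_570L);
        byte[] newer = build("sensor42", 1_369_163_040_571L);
        // A negative result means the newer row sorts before the older one.
        System.out.println("newer sorts first: " + (Arrays.compareUnsigned(newer, older) < 0));
    }
}
```

Since the newest row for a key sorts first, a scan that starts at the key prefix and stops after one result returns the latest version.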

5. Rowkeys and ColumnFamilies

Rowkeys are scoped to ColumnFamilies. Thus, the same rowkey could exist in each ColumnFamily that exists in a table without collision.

6. Immutability of Rowkeys

Rowkeys cannot be changed. The only way they can be "changed" in a table is if the row is deleted and then re-inserted. This is a fairly common question on the HBase dist-list so it pays to get the rowkeys right the first time (and/or before you’ve inserted a lot of data).

7. Relationship Between RowKeys and Region Splits

If you pre-split your table, it is critical to understand how your rowkey will be distributed across the region boundaries. To see why this matters, consider using displayable hex characters as the lead position of the key (e.g., "0000000000000000" to "ffffffffffffffff"). Running those key ranges through Bytes.split (which is the split strategy used when creating regions in Admin.createTable(byte[] startKey, byte[] endKey, numRegions)) for 10 regions will generate the following splits:

48 48 48 48 48 48 48 48 48 48 48 48 48 48 48 48                                // 0
54 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10 -10                 // 6
61 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -67 -68                 // =
68 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -124 -126  // D
75 75 75 75 75 75 75 75 75 75 75 75 75 75 75 72                                // K
82 18 18 18 18 18 18 18 18 18 18 18 18 18 18 14                                // R
88 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -40 -44                 // X
95 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -97 -102                // _
102 102 102 102 102 102 102 102 102 102 102 102 102 102 102 102                // f

(note: the lead byte is listed to the right as a comment.) Given that the first split is a '0' and the last split is an 'f', everything is great, right? Not so fast.

The problem is that all the data is going to pile up in the first 2 regions and the last region thus creating a "lumpy" (and possibly "hot") region problem. To understand why, refer to an ASCII Table. '0' is byte 48, and 'f' is byte 102, but there is a huge gap in byte values (bytes 58 to 96) that will never appear in this keyspace because the only values are [0-9] and [a-f]. Thus, the middle regions will never be used. To make pre-splitting work with this example keyspace, a custom definition of splits (i.e., and not relying on the built-in split method) is required.
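The gap is easy to check with a short, JDK-only sketch. It takes the lead bytes from the split listing above and counts how many of them can never begin a lowercase hex key. `HexLeadBytes` and its methods are hypothetical names, and the sketch looks only at the first byte of each split key, not the full 16 bytes.

```java
/** Sketch: which byte values can start a lowercase hex key, and which split points cannot. */
final class HexLeadBytes {
    static final String HEX_DIGITS = "0123456789abcdef";

    // Lead bytes of the nine split keys listed in the text.
    static final int[] SPLIT_LEAD_BYTES = {48, 54, 61, 68, 75, 82, 88, 95, 102};

    /** True if the byte value is the ASCII code of one of the 16 lowercase hex digits. */
    static boolean isHexLeadByte(int value) {
        return value >= 0 && value < 128 && HEX_DIGITS.indexOf((char) value) >= 0;
    }

    /** Counts byte values from '0' to 'f' inclusive that never start a hex key. */
    static int unusedBetweenZeroAndF() {
        int unused = 0;
        for (int b = '0'; b <= 'f'; b++) {
            if (!isHexLeadByte(b)) unused++;
        }
        return unused;
    }

    /** Counts split keys whose lead byte falls in the unused gap. */
    static int splitsInGap() {
        int count = 0;
        for (int lead : SPLIT_LEAD_BYTES) {
            if (!isHexLeadByte(lead)) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("unused byte values between '0' and 'f': " + unusedBetweenZeroAndF());
        System.out.println("split keys starting in the gap: " + splitsInGap());
    }
}
```

Most of the generated split points begin with a byte that no hex key can start with, so the regions bounded only by those points stay empty.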

Lesson #1: Pre-splitting tables is generally a best practice, but you need to pre-split them in such a way that all the regions are accessible in the keyspace. While this example demonstrated the problem with a hex-key keyspace, the same problem can happen with any keyspace. Know your data.

Lesson #2: While generally not advisable, using hex-keys (and more generally, displayable data) can still work with pre-split tables as long as all the created regions are accessible in the keyspace.

To conclude this example, the following shows how appropriate splits can be pre-created for hex-keys:

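import java.io.IOException;
import java.math.BigInteger;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableExistsException;
import org.apache.hadoop.hbase.client.Admin;

// Creates the table with explicit split keys; returns false if it already exists.
// `logger` is assumed to be defined by the enclosing class.
public static boolean createTable(Admin admin, HTableDescriptor table, byte[][] splits)
throws IOException {
  try {
    admin.createTable( table, splits );
    return true;
  } catch (TableExistsException e) {
    logger.info("table " + table.getNameAsString() + " already exists");
    // the table already exists...
    return false;
  }
}

// Divides the hex range [startKey, endKey] evenly and returns numRegions-1 split keys,
// each formatted as a 16-character lowercase hex string.
public static byte[][] getHexSplits(String startKey, String endKey, int numRegions) {
  byte[][] splits = new byte[numRegions-1][];
  BigInteger lowestKey = new BigInteger(startKey, 16);
  BigInteger highestKey = new BigInteger(endKey, 16);
  BigInteger range = highestKey.subtract(lowestKey);
  BigInteger regionIncrement = range.divide(BigInteger.valueOf(numRegions));
  lowestKey = lowestKey.add(regionIncrement);
  for(int i=0; i < numRegions-1;i++) {
    BigInteger key = lowestKey.add(regionIncrement.multiply(BigInteger.valueOf(i)));
    byte[] b = String.format("%016x", key).getBytes();
    splits[i] = b;
  }
  return splits;
}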
