HBase composite rowkeys and partial key scans

Source: Internet. Editor: 程序博客网. Time: 2024/05/17 01:20

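The name "partial key scan" does not really describe how it behaves; "prefix key scan" would be a better name. The partial key is only useful when it is a prefix of the rowkey. A sub-key in the middle of the rowkey cannot be scanned this way.

For example, take a rowkey of the form <key1>-<key2>-<key3>.

A partial scan on key2 or key3 alone is not possible.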
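The reason is easier to see with a small simulation. The sketch below is plain Python, not the HBase client API: a sorted list stands in for HBase's rowkey ordering, and the rowkeys and the `prefix_scan` helper are made up for illustration.

```python
from bisect import bisect_left

# Hypothetical rowkeys of the form <key1>-<key2>-<key3>, kept sorted
# the way HBase keeps rows sorted by rowkey.
ROWS = sorted([
    "u1-book-2012",
    "u1-music-2013",
    "u2-book-2012",
    "u2-video-2013",
    "u3-book-2013",
])


def prefix_scan(rows, prefix):
    """Seek to the first candidate row, then read until the prefix stops matching."""
    start = bisect_left(rows, prefix)
    result = []
    for row in rows[start:]:
        if not row.startswith(prefix):
            break  # sorted order: no later row can match either
        result.append(row)
    return result


print(prefix_scan(ROWS, "u1-"))   # key1 is a prefix, so the scan finds its rows
print(prefix_scan(ROWS, "book"))  # key2 is not a prefix, so nothing is found
```

Rows sharing a key1 sit next to each other in sorted order, which is what makes the seek-and-read approach work. Rows sharing a key2 are scattered across the whole key space.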

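There are several ways to work around this:

1) Redundancy. Create a second table whose composite rowkey starts with the sub-key you want to query, for example key2.

2) Exploit a sub-key with few distinct values. For example, if key3 has only a few values, put it at the start of the rowkey: <key3>-<key2>-<key1>. To query by key2, enumerate the key3 values, build a key3-key2 prefix for each one, and run a partial scan per prefix.

See http://stackoverflow.com/questions/12908378/hbase-searching-by-part-of-a-key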
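A minimal sketch of approach 2, again as a plain Python simulation rather than HBase client code. The rowkeys, the `KEY3_VALUES` list (assumed to be known in advance), and both helper functions are hypothetical.

```python
from bisect import bisect_left

# Hypothetical rowkeys laid out as <key3>-<key2>-<key1>, where key3
# (here a year) has only a few distinct values.
ROWS = sorted([
    "2012-book-u1",
    "2012-book-u2",
    "2012-music-u1",
    "2013-book-u3",
    "2013-video-u2",
])
KEY3_VALUES = ["2012", "2013"]  # assumption: the small key3 domain is known


def prefix_scan(rows, prefix):
    """Seek to the first candidate row, then read until the prefix stops matching."""
    start = bisect_left(rows, prefix)
    result = []
    for row in rows[start:]:
        if not row.startswith(prefix):
            break
        result.append(row)
    return result


def scan_by_key2(rows, key3_values, key2):
    """Run one prefix scan per key3 value, using the prefix <key3>-<key2>-."""
    result = []
    for key3 in key3_values:
        result.extend(prefix_scan(rows, f"{key3}-{key2}-"))
    return result


print(scan_by_key2(ROWS, KEY3_VALUES, "book"))
```

The number of scans equals the number of key3 values, so this only pays off while that set stays small.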

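3) FuzzyRowFilter.

It lets you scan with a wildcard-style pattern on a sub-key in the middle of the rowkey. (The keys being matched must have a fixed length, though.)

In essence this is still a full scan, but scan performance improves because part of the data is skipped. How much it improves depends on how much data can be skipped: if the set of keys to filter is large and covers many rows, almost nothing can be skipped, every row has to be matched one by one, and there is little to gain.

See http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/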

Performance of the scan based on FuzzyRowFilter usually depends on the cardinality of the fuzzy part. E.g. in the example above, if users number is several hundreds to several thousand, the scan should be very fast: there will only be several hundreds or thousand “jumps” and huge amount of rows might be skipped. If the cardinality is high then scan can take a lot of time. The worst-case scenario is when you have N records and N users, i.e. one record per user. In this case there’s simply nothing to skip.
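The matching rule behind a fuzzy row filter can be sketched in a few lines. This is a Python simulation, not the HBase `FuzzyRowFilter` class. The fixed-length key layout (4-byte user id, `_`, 4-byte action id) and the mask convention (1 = any byte, 0 = must match) are assumptions made for illustration. The "jumps" described in the quote are not modeled here.

```python
def fuzzy_match(row: bytes, pattern: bytes, mask: bytes) -> bool:
    """True if row equals pattern at every position where mask is 0."""
    if len(row) != len(pattern):
        return False  # fuzzy matching assumes fixed-length keys
    return all(m == 1 or r == p for r, p, m in zip(row, pattern, mask))


# Hypothetical layout: <4-byte user id>_<4-byte action id>.
# Match action 0002 for any user; '?' bytes are placeholders under mask 1.
PATTERN = b"????_0002"
MASK = bytes([1, 1, 1, 1, 0, 0, 0, 0, 0])

ROWS = [
    b"0001_0001",
    b"0001_0002",
    b"0002_0001",
    b"0002_0002",
    b"0003_0003",
]

matches = [row for row in ROWS if fuzzy_match(row, PATTERN, MASK)]
print(matches)
```

In this toy data set the fuzzy part (the user id) has three distinct values, so a filter that can jump between users would need at most three seeks. With one row per user there would be nothing to jump over.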

http://grokbase.com/t/hbase/user/13agwrnmej/how-to-query-based-on-partial-row-key

You can use RowFilter in combination with RegexStringComparator.
See RegexStringComparator's javadoc:

 * This comparator is for use with {@link CompareFilter} implementations, such
 * as {@link RowFilter}, {@link QualifierFilter}, and {@link ValueFilter}, for

See also TestFilter#testRowFilter()

You can also try Phoenix, which does this automatically for you.
(https://github.com/forcedotcom/phoenix)
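The regex approach quoted above can be simulated as follows. This is plain Python with the standard `re` module, not the HBase `RowFilter` or `RegexStringComparator` classes. The rowkeys and the `regex_row_filter` helper are hypothetical.

```python
import re

# Hypothetical rowkeys of the form <key1>-<key2>-<key3>.
ROWS = [
    "u1-book-2012",
    "u1-music-2013",
    "u2-book-2012",
    "u2-video-2013",
    "u3-book-2013",
]


def regex_row_filter(rows, regex):
    """Test every rowkey against the regex; nothing is skipped."""
    compiled = re.compile(regex)
    return [row for row in rows if compiled.search(row)]


# Any key1, then key2 = book, then any key3.
print(regex_row_filter(ROWS, r"^[^-]+-book-"))
```

In this simulation every row is examined, so the regex route is convenient for matching a middle sub-key but saves no scanning work.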

Rowkey design notes:

1. See the discussion in "HBase: The Definitive Guide".

2. The official site also has some material.

(http://hbase.apache.org/book/schema.html, http://hbase.apache.org/book/schema.casestudies.html)

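1) Keep the number of column families as small as possible (<= 2), preferably just one. Flushes and compactions are done per region: when one family has to flush, all the other families are flushed as well.

2) Keep the rowkey, column family, and column key as short as possible. This follows from HBase's storage format: every cell is prefixed with its rowkey, column family, and column key. (Compression should be able to mitigate this?)

Also, data types differ in storage size: an int takes less space than the same number stored as a String.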
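A rough back-of-the-envelope sketch of both points, in plain Python. The row, family, and qualifier names are made up, and `key_bytes_per_row` is a hypothetical helper that counts only the rowkey, family, and qualifier bytes repeated in each cell. It ignores the other per-cell fields and any compression or encoding.

```python
import struct

# Point 1: the same number as a 4-byte int vs. as decimal text.
value = 1234567890
as_int = struct.pack(">i", value)      # 4-byte big-endian signed int
as_text = str(value).encode("utf-8")   # one byte per decimal digit
print(len(as_int), len(as_text))


# Point 2: rowkey + family + qualifier are repeated for every cell.
def key_bytes_per_row(rowkey: bytes, family: bytes, qualifiers) -> int:
    """Bytes spent on rowkey, family and qualifier across all cells of one row."""
    return sum(len(rowkey) + len(family) + len(q) for q in qualifiers)


long_names = key_bytes_per_row(
    b"user-0000000001", b"profile",
    [b"first_name", b"last_name", b"email_address"],
)
short_names = key_bytes_per_row(
    b"u0000000001", b"p",
    [b"fn", b"ln", b"em"],
)
print(long_names, short_names)
```

The difference per row looks small, but it is multiplied by the number of rows and cell versions stored.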

0 0