Hbase 布隆过滤器BloomFilter介绍

来源：互联网发布：电脑网络格斗游戏编辑：程序博客网时间：2024/06/05 21:49

1、主要功能

提高随机读的性能

2、存储开销

bloom filter的数据存在StoreFile的meta中，一旦写入无法更新，因为StoreFile是不可变的。Bloomfilter是一个列族（cf）级别的配置属性，如果你在表中设置了Bloomfilter，那么HBase会在生成StoreFile时包含一份bloomfilter结构的数据，称其为MetaBlock；MetaBlock与DataBlock（真实的KeyValue数据）一起由LRUBlockCache维护。所以，开启bloomfilter会有一定的存储及内存cache开销。

3、控制粒度

a)ROW

根据KeyValue中的row来过滤storefile

举例：假设有2个storefile文件sf1和sf2，

sf1包含kv1（r1 cf:q1 v）、kv2（r2 cf:q1 v）

sf2包含kv3（r3 cf:q1 v）、kv4（r4 cf:q1 v）

如果设置了CF属性中的bloomfilter为ROW，那么get(r1)时就会过滤sf2，get(r3)就会过滤sf1

b)ROWCOL

根据KeyValue中的row+qualifier来过滤storefile

举例：假设有2个storefile文件sf1和sf2，

sf1包含kv1（r1 cf:q1 v）、kv2（r2 cf:q1 v）

sf2包含kv3（r1 cf:q2 v）、kv4（r2 cf:q2 v）

如果设置了CF属性中的bloomfilter为ROW，无论get(r1,q1)还是get(r1,q2)，都会读取sf1+sf2；而如果设置了CF属性中的bloomfilter为ROWCOL，那么get(r1,q1)就会过滤sf2，get(r1,q2)就会过滤sf1

4、常用场景

1、根据key随机读时，在StoreFile级别进行过滤

2、读数据时，会查询到大量不存在的key，也可用于高效判断key是否存在

5、举例说明

假设x、y、z三个key存在于table中，W不存在

使用Bloom Filter可以帮助我们减少为了判断key是否存在而去做Scan操作的次数

step1）分别对x、y、z运算hash函数取得bit mask，写到Bloom Filter结构中

step2）对W运算hash函数，从Bloom Filter查找bit mask

如果不存在：三个Bit位至少有一个为0，W肯定不存在该（Bloom Filter不会漏判）

如果存在：三个Bit位全部全部等于1，路由到负责W的Region执行scan，确认是否真的存在（Bloom Filter有极小的概率误判）

6、源码解析

1.get操作会enable bloomfilter帮助剔除掉不会用到的Storefile

在scan初始化时（get会包装为scan）对于每个storefile会做shouldSeek的检查，如果返回false，则表明该storefile里没有要找的内容，直接跳过

[java] view plain copy print?

if (memOnly == false
&& ((StoreFileScanner) kvs).shouldSeek(scan, columns)) {
scanners.add(kvs);
}

if (memOnly == false              && ((StoreFileScanner) kvs).shouldSeek(scan, columns)) {            scanners.add(kvs);  }

shouldSeek方法：如果是scan直接返回true表明不能跳过，然后根据bloomfilter类型检查。

[java] view plain copy print?

if (!scan.isGetScan()) {
return true;
}
byte[] row = scan.getStartRow();
switch (this.bloomFilterType) {
case ROW:
return passesBloomFilter(row, 0, row.length, null, 0, 0);
case ROWCOL:
if (columns != null && columns.size() == 1) {
byte[] column = columns.first();
return passesBloomFilter(row, 0, row.length, column, 0, column.length);
}
// For multi-column queries the Bloom filter is checked from the
// seekExact operation.
return true;
default:
return true;
}

if (!scan.isGetScan()) {          return true;  }  byte[] row = scan.getStartRow();  switch (this.bloomFilterType) {    case ROW:      return passesBloomFilter(row, 0, row.length, null, 0, 0);     case ROWCOL:      if (columns != null && columns.size() == 1) {        byte[] column = columns.first();        return passesBloomFilter(row, 0, row.length, column, 0, column.length);      }      // For multi-column queries the Bloom filter is checked from the      // seekExact operation.      return true;     default:      return true;}

2.指明qualified的scan在配了rowcol的情况下会剔除不会用掉的StoreFile。

对指明了qualify的scan或者get进行检查：seekExactly

[java] view plain copy print?

// Seek all scanners to the start of the Row (or if the exact matching row
// key does not exist, then to the start of the next matching Row).
if (matcher.isExactColumnQuery()) {
for (KeyValueScanner scanner : scanners)
scanner.seekExactly(matcher.getStartKey(), false);
} else {
for (KeyValueScanner scanner : scanners)
scanner.seek(matcher.getStartKey());
}

// Seek all scanners to the start of the Row (or if the exact matching row  // key does not exist, then to the start of the next matching Row).  if (matcher.isExactColumnQuery()) {    for (KeyValueScanner scanner : scanners)    scanner.seekExactly(matcher.getStartKey(), false);  } else {    for (KeyValueScanner scanner : scanners)    scanner.seek(matcher.getStartKey());  }

如果bloomfilter没命中，则创建一个很大的假的keyvalue，表明该storefile不需要实际的scan

[java] view plain copy print?

public boolean seekExactly(KeyValue kv, boolean forward)
throws IOException {
if (reader.getBloomFilterType() != StoreFile.BloomType.ROWCOL ||
kv.getRowLength() == 0 || kv.getQualifierLength() == 0) {
return forward ? reseek(kv) : seek(kv);
}
boolean isInBloom = reader.passesBloomFilter(kv.getBuffer(),
kv.getRowOffset(), kv.getRowLength(), kv.getBuffer(),
kv.getQualifierOffset(), kv.getQualifierLength());
if (isInBloom) {
// This row/column might be in this store file. Do a normal seek.
return forward ? reseek(kv) : seek(kv);
}
// Create a fake key/value, so that this scanner only bubbles up to the top
// of the KeyValueHeap in StoreScanner after we scanned this row/column in
// all other store files. The query matcher will then just skip this fake
// key/value and the store scanner will progress to the next column.
cur = kv.createLastOnRowCol();
return true;
}

public boolean seekExactly(KeyValue kv, boolean forward)        throws IOException {      if (reader.getBloomFilterType() != StoreFile.BloomType.ROWCOL ||          kv.getRowLength() == 0 || kv.getQualifierLength() == 0) {        return forward ? reseek(kv) : seek(kv);      }        boolean isInBloom = reader.passesBloomFilter(kv.getBuffer(),          kv.getRowOffset(), kv.getRowLength(), kv.getBuffer(),          kv.getQualifierOffset(), kv.getQualifierLength());      if (isInBloom) {        // This row/column might be in this store file. Do a normal seek.        return forward ? reseek(kv) : seek(kv);      }        // Create a fake key/value, so that this scanner only bubbles up to the top      // of the KeyValueHeap in StoreScanner after we scanned this row/column in      // all other store files. The query matcher will then just skip this fake      // key/value and the store scanner will progress to the next column.      cur = kv.createLastOnRowCol();      return true;  }

这边为什么是rowcol才能剔除storefile纳，很简单，scan是一个范围，如果是row的bloomfilter不命中只能说明该rowkey不在此storefile中，但next rowkey可能在。而rowcol的bloomfilter就不一样了，如果rowcol的bloomfilter没有命中表明该qualifiy不在这个storefile中，因此这次scan就不需要scan此storefile了！

7、总结

1.任何类型的get（基于rowkey或row+col）Bloom Filter的优化都能生效，关键是get的类型要匹配Bloom Filter的类型

2.基于row的scan是没办法走Bloom Filter的。因为Bloom Filter是需要事先知道过滤项的。对于顺序scan是没有事先办法知道rowkey的。而get是指明了rowkey所以可以用Bloom Filter，scan指明column同理。

3.row+col+qualify的scan可以去掉不存在此qualify的storefile，也算是不错的优化了，而且指明qualify也能减少流量，因此scan尽量指明qualify。

阅读全文

0 0