How are bloom filters used in HBase?


link:http://www.quora.com/How-are-bloom-filters-used-in-HBase?q=hbase+bl


Bloom filters in HBase are good in a few different use cases. One is access patterns where you will have a lot of misses during reads. The other is speeding up reads by cutting down internal lookups.


They are stored in the metadata of each HFile when it is written, and never need to be updated afterwards because HFiles are immutable. While I have no empirical data on how much extra space they require (this also depends on the error rate you choose, etc.), they obviously add some overhead. When an HFile is opened, typically when a region is deployed to a RegionServer, the bloom filter is loaded into memory and used to determine whether a given key is in that store file. Filters can be scoped at the row-key or column-key level; the latter needs more space because it has to store many more keys than row keys alone (unless you have exactly one column per row).
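The core property being exploited here can be shown with a minimal, self-contained sketch (this is illustrative, not HBase's actual implementation): a bit array plus k salted hashes, where a miss is definitive ("this key is NOT in this HFile") but a hit only means "maybe".

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: no false negatives, tunable false positives."""

    def __init__(self, num_bits=1024, num_hashes=7):
        self.m = num_bits
        self.k = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Derive k bit positions from k salted hashes of the key.
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # False -> key definitely absent; True -> key *may* be present.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for row_key in ("row-0001", "row-0002", "row-0003"):
    bf.add(row_key)

present = bf.might_contain("row-0002")  # True: no false negatives, ever
absent = bf.might_contain("row-9999")   # almost certainly False: skip the file
```

This is also why the column-key ("ROWCOL") scope costs more: with, say, 100 columns per row the filter holds 100 keys where the row scope would hold one.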

In terms of computational overhead the bloom filters in HBase are very efficient: they employ folding to keep the size down and combinatorial generation to speed up their creation. They add about 1 byte per entry and are mainly useful when your entries are on the larger end, say a few kilobytes. Otherwise the size of the filter compared to the size of the data is prohibitive.
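The "about 1 byte per entry" figure follows from the standard Bloom filter sizing formula (general math, not HBase-specific): an optimally sized filter needs m/n = -ln(p) / (ln 2)^2 bits per entry for a target false-positive rate p, regardless of how large each entry's value is.

```python
import math

def bits_per_entry(p):
    # Optimal Bloom filter size per entry for false-positive rate p.
    return -math.log(p) / (math.log(2) ** 2)

for p in (0.1, 0.01, 0.001):
    print(f"p={p}: {bits_per_entry(p):.1f} bits "
          f"(~{bits_per_entry(p) / 8:.2f} bytes) per entry")
# p=0.01 works out to ~9.6 bits, i.e. roughly 1.2 bytes per entry
```

At that rate a filter over cells of a few kilobytes adds well under a tenth of a percent of overhead, while over tiny cells of a few bytes it could add 10% or more, which is the size trade-off described above.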

What also matters is how you actually update data: regular changes to individual cells will spread them across all store files, which means you will have to scan all files anyway. Better suited are some sort of batched updates per entity, so that specific row keys have a chance of being in only a few store files. That way, and given larger stores (for example 1 GB), you can skip a substantial amount of disk IO during the low-level scan to find a specific row.
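The payoff of that write pattern can be sketched as follows (hypothetical data; a plain set stands in for each HFile's bloom filter): when a row's cells are confined to a few store files, the filter check lets the read path skip the rest without touching disk.

```python
# Four hypothetical HFiles; each set plays the role of that file's
# bloom filter over row keys.
store_files = [
    {"row-a", "row-b"},   # HFile 0
    {"row-c"},            # HFile 1
    {"row-a"},            # HFile 2
    {"row-d", "row-e"},   # HFile 3
]

def files_to_read(row_key):
    # Only HFiles whose filter *may* contain the row need to be opened.
    return [i for i, bloom in enumerate(store_files) if row_key in bloom]

hits = files_to_read("row-a")  # [0, 2] -> two of the four files are skipped
```

Had "row-a" been touched by scattered single-cell updates instead, it would likely appear in every file's filter and nothing could be skipped.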

Keep in mind that HBase only has a block index per file, which is rather coarse-grained and tells the reader that a key *may* be in the file because it falls into a start and end key range in the block index. But whether the key is actually present can only be determined by loading that block and scanning it.
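A small sketch of why the block index alone cannot answer the question (hypothetical keys): each index entry is just the first row key of a block, so a lookup only narrows the search to one block's key range, and the block must still be loaded and scanned to confirm.

```python
import bisect

# First row key of each block, as a coarse per-file index.
block_first_keys = ["row-a", "row-f", "row-m", "row-t"]

def candidate_block(row_key):
    # If the key exists at all, it can only be in the block whose range
    # starts at the greatest first-key <= row_key.
    idx = bisect.bisect_right(block_first_keys, row_key) - 1
    return idx if idx >= 0 else None

b1 = candidate_block("row-g")  # block 1: "may" be there, must scan to confirm
b2 = candidate_block("row-x")  # block 3, even though "row-x" was never written
```

Note the second lookup: the index happily nominates a block for a key that does not exist, which is exactly the wasted block read a bloom filter avoids.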

This also places a burden on the block cache, and you may create a lot of unnecessary churn that the bloom filters would help avoid: to perform the actual check, the RegionServer has to load the matching block and scan it to see if the key is present.

In a very busy system, using bloom filters with matching update or read patterns can obviously save a huge amount of IO. Bloom filters are also easy to turn on or off, so you can try them out and closely observe how they improve your read performance. Let us know how you do!