BloomFilter of Guava

来源:互联网 发布:怎么在淘宝注册店铺 编辑:程序博客网 时间:2024/06/13 22:34
When the Guava project released version 11.0, one of the new additions was the BloomFilter class. A BloomFilter is a unique data-structure used to indicate if an element is contained in a set. What makes a BloomFilter interesting is it will indicate if an element is absolutely not contained, or may becontained in a set. This property of never having a false negative makes the BloomFilter a great candidate for use as a guard condition to help prevent performing unnecessary and expensive operations. While BloomFilters have received good exposure lately, using one meant rolling your own, or doing a Google search for code. The trouble with rolling your own BloomFilter is getting the correct hash function to make the filter effective. Considering Guava uses the Murmur Hash for its’ implementation, we now have the usefulness of an effective BloomFilter just a library away.译者信息在 Guava 项目的11.0版中,一个新的类添加了进来—— BloomFilter(布隆过滤器)类。布隆过滤器是一种独特的数据结构,用以表明元素是否被保存在一个集合(Set)中。有趣的是,布隆过滤器能够明确指出元素 绝对不存在于一个集合中,或是 可能存在于一个集合中。由于布隆过滤器从不漏报的特性,使得它成为避免不必要和昂贵操作的约束条件的极好选择。 近来对布隆过滤器的关注开始增多,如要使用它,你可以自己写代码,也可以谷歌之。编写自己的布隆过滤器的问题在于使用正确的哈希函数来确保过滤器生效。鉴于 Guava 是使用 Murmur Hash 来实现的,仅需一个库,你就能获得这个非常有效的布隆过滤器。

BloomFilter Crash Course

BloomFilters are essentially bit vectors. At a high level BloomFilters work in the following manner:

  1. Add the element to the filter.
  2. Hash it a few times, than set the bits to 1 where the index matches the results of the hash.

When testing if an element is in the set, you follow the same hashing procedure and check if the bits are set to 1 or 0. This process is how a BloomFilter can guarantee an element does not exist. If the bits aren’t set, it’s simply impossible for the element to be in the set. However, a positive answer means the element is in the set or a hashing collision occurred. A more detaild description of a BloomFilter can be found here and a good tutorial on BloomFilters here. According to wikipedia, Google uses BloomFilters in BigTable to avoid disk lookups for non-existent items. Another interesting usage is using a BloomFilter to optimize a sql querry.

译者信息

布隆过滤器速成

布隆过滤器在本质上是二进制向量。在高层级上,布隆过滤器以下面的方式工作:

  1. 添加元素到过滤器。
  2. 对元素进行几次哈希运算,当索引匹配哈希的结果时,将该位设置为 1 的。

如果要检测元素是否属于集合,使用相同的哈希运算步骤,检查相应位的值是1还是0。这就是布隆过滤器明确元素不存在的过程。如果位未被设置,则元素绝不可能存在于集合中。当然,一个肯定的答案意味着,要不就是元素存在于集合中,要不就是遇见了哈希冲突。这里有一份很好的对布隆过滤器细节的描述,还有一份很好的教程。依据维基百科,谷歌在 BigTable 中使用了布隆过滤器,以避免在硬盘中寻找不存在的条目。另一个有趣的用法是使用布隆过滤器优化SQL查询。

Using the Guava BloomFilter

A Guava BloomFilter is created by calling the static method create on the BloomFilter class,
passing in a Funnel object and an int representing the expected number of insertions. A Funnel, also new in Guava 11, is an object that can send data into a Sink. The following example is the default implementation and has a percentage of false positives of 3%. Guava provides a Funnels class containing two static methods providing implementations of the Funnel interface for inserting a CharSequence or byte Array into a filter.

1//Creating the BloomFilter
2BloomFilter bloomFilter =
BloomFilter.create(Funnels.byteArrayFunnel(), 
1000);
3 
4//Putting elements into the filter
5//A BigInteger representing a key of some sort
6bloomFilter.put(bigInteger.toByteArray());
7 
8//Testing for element in set
9boolean mayBeContained = bloomFilter.
mayContain(bitIntegerII.toByteArray());

UPDATE: based on the comment from Louis Wasserman, here’s how to create a BloomFilter for BigIntegers with a custom Funnel implementation:

01//Create the custom filter
02class BigIntegerFunnel implements Funnel<BigInteger> {
03        @Override
04        public void funnel(BigInteger from, Sink into) {
05            into.putBytes(from.toByteArray());
06        }
07    }
08 
09//Creating the BloomFilter
10BloomFilter bloomFilter = BloomFilter.
create(
new BigIntegerFunnel(), 1000);
11 
12//Putting elements into the filter
13//A BigInteger representing a key of some sort
14bloomFilter.put(bigInteger);
15 
16//Testing for element in set
17boolean mayBeContained = bloomFilter.
mayContain(bitIntegerII);

译者信息

使用 Guava 的布隆过滤器

Guava 的布隆过滤器通过调用 BloomFilter 类中的静态函数创建, 传递一个 Funnel 对象以及一个代表预期插入数量整数。同样来自于 Guava 11 中的 Funnel 对象,用于将数据发送给一个接收器(Sink)。 下面的例子是一个默认的实现,有着3%的误报率。Guava 提供的 Funnels 类拥有两个静态方法提供了将CharSequence 或byte数组插入到过滤器的 Funnel 接口的实现。

1//Creating the BloomFilter
2BloomFilter bloomFilter = BloomFilter.
create(Funnels.byteArrayFunnel(), 
1000);
3 
4//Putting elements into the filter
5//A BigInteger representing a key of some sort
6bloomFilter.put(bigInteger.toByteArray());
7 
8//Testing for element in set
9boolean mayBeContained = bloomFilter.
mayContain(bitIntegerII.toByteArray());

更新:根据来自 Louis Wasserman 的回复,下面是如何为 BigIntegers 创建一个带有自定义 Funnel 实现的布隆过滤器:

01//Create the custom filter
02class BigIntegerFunnel implements Funnel<BigInteger> {
03        @Override
04        public void funnel(BigInteger from, Sink into) {
05            into.putBytes(from.toByteArray());
06        }
07    }
08 
09//Creating the BloomFilter
10BloomFilter bloomFilter = BloomFilter.
create(
new BigIntegerFunnel(), 1000);
11 
12//Putting elements into the filter
13//A BigInteger representing a key of some sort
14bloomFilter.put(bigInteger);
15 
16//Testing for element in set
17boolean mayBeContained = bloomFilter.
mayContain(bitIntegerII);

Considerations

It’s critical to estimate the number of expected insertions correctly. As insertions into the filter approach or exceeds the expected number, the BloomFilter begins to fill up and as a result will generate more false positives to the point of being useless. There is another version of the BloomFilter.create method taking an additional parameter, a double representing the desired level of false hit probability (must be greater than 0 and less than one). The level of false hit probability affects the number of hashes for storing or searching for elements. The lower the desired percentage, the higher number of hashes performed.

译者信息

注意事项

正确估计预期插入数量是非常关键的。当插入的数量接近或高于预期值的时候,布隆过滤器将会填满,这样的话,它会产生很多无用的误报点。这里有另一个版本的 BloomFilter.create 方法,它额外接收一个参数,一个代表假命中概率水平的双精度数字(必须大于零且小于1)。假命中概率等级影响哈希表储存或搜索元素的数量。百分比越低,哈希表的性能越好。

Conclusion

A BloomFilter is a useful item for a developer to have in his/her toolbox. The Guava project now makes it very simple to begin using a BloomFilter when the need arises. I hope you enjoyed this post. Helpful comments and suggestions are welcomed.

References

  • Unit Test Demo of Guava BloomFilter.
  • BloomFilter class
  • All You Want to Know about BloomFilters.
  • BloomFilter Tutorial.
  • BloomFilter on Wikipedia.
译者信息

总结

对于开发者来说,布隆过滤器是一件很有用的工具。Guava 项目能够让你在需要时更方便的使用布隆过滤器。我希望你喜欢这篇文章。希望听到您的意见或建议。

参考

  • Guava 布隆过滤器的单元测试演示。.
  • 布隆过滤器类
  • 所有你想知道的布隆过滤器。.
  • 布隆过滤器教程。.
  • 维基百科上的布隆过滤器。.
0 0