BitSet and the Bloom Filter

Source: Internet · Editor: 程序博客网 · Time: 2024/05/20 22:41

The Bloom filter

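The Bloom filter is a binary-vector data structure proposed by Burton Howard Bloom in 1970. It is very efficient in both space and time, and is used to test whether an element is a member of a set. If the test says yes, the element is not necessarily in the set; if the test says no, the element is definitely not in the set. A Bloom filter therefore has 100% recall. Every query returns one of two answers: "in the set (possibly wrong)" or "not in the set (definitely not)". In other words, a Bloom filter gives up some accuracy in exchange for space.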

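Bloom filters do have a drawback, chiefly false positives: for a filter of fixed size, the false-positive rate rises as more data is added. One remedy is to keep a separate list of the values that are known to be misjudged.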
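To make the growth of the false-positive rate concrete, here is a small sketch that evaluates the commonly used approximation p ≈ (1 - e^(-kn/m))^k, where m is the number of bits, k the number of hash functions and n the number of inserted elements. The class and method names, and the sample values of m and k, are illustrative assumptions and not part of the original article.

```java
// Sketch: estimate the false-positive rate of a Bloom filter.
// m = number of bits, k = number of hash functions, n = inserted elements.
// Uses the standard approximation p ≈ (1 - e^(-k*n/m))^k.
// All names and sample values here are illustrative.
class BloomFalsePositiveRate {

    static double estimate(long m, int k, long n) {
        double exponent = -1.0 * k * n / m;
        return Math.pow(1.0 - Math.exp(exponent), k);
    }

    public static void main(String[] args) {
        long m = 1L << 20; // 1,048,576 bits (128 KiB)
        int k = 7;
        // With m and k fixed, the estimate climbs as n grows.
        for (long n : new long[] {10_000, 50_000, 100_000, 200_000}) {
            System.out.printf("n=%d -> p=%.6f%n", n, estimate(m, k, n));
        }
    }
}
```

With m and k held constant, the printed estimate only goes up as n increases, which is the behaviour described above.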

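The defining property of a Bloom filter is that it never produces a false negative: if contains() returns false, the element is definitely not in the set. It does produce some false positives: if contains() returns true, the element is only possibly in the set.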
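The behaviour above can be sketched with a minimal Bloom filter built on java.util.BitSet. This is only an illustration under stated assumptions: the class name SimpleBloomFilter, the FNV-1a style hash, the two seeds and the double-hashing scheme (h1 + i*h2) are my own choices, not the implementation used by any of the libraries mentioned in this article. The method mightContain plays the role of contains() in the text.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

// Minimal Bloom filter sketch. Names and hashing scheme are illustrative.
class SimpleBloomFilter {
    private static final int SEED1 = 0x811c9dc5; // arbitrary seed (assumption)
    private static final int SEED2 = 0x9747b28c; // arbitrary seed (assumption)

    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // FNV-1a style 32-bit hash, started from a caller-supplied seed.
    private static int hash(byte[] data, int seed) {
        int h = seed;
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 16777619;
        }
        return h;
    }

    // i-th bit position, derived from two base hashes (double hashing).
    private int position(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String item) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        int h1 = hash(data, SEED1);
        int h2 = hash(data, SEED2) | 1; // keep the step non-zero
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(h1, h2, i));
        }
    }

    // false -> definitely never added; true -> possibly added.
    boolean mightContain(String item) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        int h1 = hash(data, SEED1);
        int h2 = hash(data, SEED2) | 1;
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }
}
```

Because add() only ever sets bits and nothing clears them, every item that was added keeps all of its positions set, so mightContain can never return false for it. An item that was never added may still find all of its positions set by other items, which is where false positives come from.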

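Many open-source frameworks ship a Bloom filter implementation, for example:

Elasticsearch: org.elasticsearch.common.util.BloomFilter

Guava: com.google.common.hash.BloomFilter

Hadoop: org.apache.hadoop.util.bloom.BloomFilter (implemented on top of BitSet)

If you are interested, have a look at their source code.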

How BitSet works

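Finally, a look at how BitSet works. A BitSet is an object for bit manipulation in which every value is either 0 or 1. Internally it is a long array that initially holds a single long, so the smallest size of a BitSet is 64. When more data is stored than the initial long array can hold, the BitSet grows automatically and ends up backed by N longs. This growth works much like that of List, Set, Map and similar implementations, and is transparent to the user.
1 GB of space holds 8*1024*1024*1024 = 8589934592 bits, so it can represent about 8.59 billion different numbers.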
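The arithmetic and the minimum size can be checked with a short sketch. The class and method names are illustrative, and the exact values returned by size() are those of OpenJDK's implementation; the Javadoc only describes size() as implementation-dependent. Note also that a single java.util.BitSet is indexed by int, so it can address roughly 2.1 billion bits (about 256 MB); covering a full 1 GB would take several instances or a hand-rolled long array.

```java
import java.util.BitSet;

// Sketch: bits available in 1 GB, and size() versus length() of a BitSet.
class BitSetSizeDemo {

    // Number of bits in the given number of gigabytes (1 GB = 1024^3 bytes).
    static long bitsInGigabytes(long gigabytes) {
        return gigabytes * 1024L * 1024L * 1024L * 8L;
    }

    public static void main(String[] args) {
        System.out.println(bitsInGigabytes(1)); // 8589934592

        BitSet bs = new BitSet();
        System.out.println(bs.size());   // 64 on OpenJDK: one long is allocated up front
        System.out.println(bs.length()); // 0: no bit is set yet

        bs.set(100);                     // forces the backing array to grow
        System.out.println(bs.size());   // a multiple of 64 that covers bit 100
        System.out.println(bs.length()); // 101: index of the highest set bit, plus one
    }
}
```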

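BitSet uses one bit to record whether a value has occurred: 0 means it has not, 1 means it has. One element of the long array can hold 64 such values, because a Java long takes 8 bytes = 64 bits. For the details, look at the source: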

First, the implementation of the set method:

public void set(int bitIndex) {
    if (bitIndex < 0)   // the index being set must not be negative
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

    // wordIndex shifts bitIndex right by 6 bits, so that every 64
    // consecutive numbers share one slot of the long array.
    int wordIndex = wordIndex(bitIndex);
    expandTo(wordIndex);

    words[wordIndex] |= (1L << bitIndex); // Restores invariants

    checkInvariants();
}

The implementation of the get method:

public boolean get(int bitIndex) {
    if (bitIndex < 0)
        throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

    checkInvariants();

    // As in set, find which element of the long array holds this number.
    int wordIndex = wordIndex(bitIndex);
    // Read the bit from that element of the long array.
    return (wordIndex < wordsInUse)
        && ((words[wordIndex] & (1L << bitIndex)) != 0);
}
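The index arithmetic used by set and get can be isolated in a small fixed-capacity sketch. TinyBitSet and its helper names are hypothetical and exist only for illustration; unlike java.util.BitSet, this version does not grow. One detail worth noting: for a long operand, Java's shift operators use only the lowest six bits of the shift distance, which is why 1L << bitIndex selects bit (bitIndex mod 64) without explicit masking.

```java
// Sketch: the word-index and bit-mask arithmetic behind set/get.
// TinyBitSet is a hypothetical fixed-size class, not the JDK's BitSet.
class TinyBitSet {
    private final long[] words;

    TinyBitSet(int nbits) {
        if (nbits <= 0) throw new IllegalArgumentException("nbits must be positive");
        words = new long[wordIndex(nbits - 1) + 1];
    }

    // Which long holds this bit: bitIndex / 64.
    static int wordIndex(int bitIndex) {
        return bitIndex >> 6;
    }

    // Mask for this bit inside its long: only the low 6 bits of the
    // shift distance are used, so this equals 1L << (bitIndex % 64).
    static long bitMask(int bitIndex) {
        return 1L << bitIndex;
    }

    void set(int bitIndex) {
        words[wordIndex(bitIndex)] |= bitMask(bitIndex);
    }

    boolean get(int bitIndex) {
        return (words[wordIndex(bitIndex)] & bitMask(bitIndex)) != 0;
    }
}
```

For example, bit 70 lives in word 1 at position 6, so setting it leaves bit 6 (word 0, position 6) untouched.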

Dynamic growth of BitSet capacity:

private void ensureCapacity(int wordsRequired) {
    if (words.length < wordsRequired) {
        // Allocate larger of doubled size or required size
        // By default the capacity doubles; if more than double is
        // required, the required size is used instead.
        // wordsRequired = (the bit index passed in >> 6) + 1
        int request = Math.max(2 * words.length, wordsRequired);
        words = Arrays.copyOf(words, request);
        sizeIsSticky = false;
    }
}
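The sizing rule in ensureCapacity can be mirrored in a few lines and compared with what a real BitSet reports. The helper names below are illustrative, and the sizes printed by main reflect OpenJDK's implementation; other implementations are free to grow differently.

```java
import java.util.BitSet;

// Sketch: the "double or take what is required" growth rule.
class BitSetGrowthDemo {

    // Words needed to hold the given bit index.
    static int wordsRequiredFor(int bitIndex) {
        return (bitIndex >> 6) + 1;
    }

    // New array length chosen by the rule shown in ensureCapacity.
    static int grownLength(int currentWords, int wordsRequired) {
        if (currentWords >= wordsRequired) {
            return currentWords; // already large enough, nothing to do
        }
        return Math.max(2 * currentWords, wordsRequired);
    }

    public static void main(String[] args) {
        BitSet bs = new BitSet();
        int words = 1; // a default BitSet starts with one long on OpenJDK
        for (int bitIndex : new int[] {0, 64, 1000, 1024}) {
            bs.set(bitIndex);
            words = grownLength(words, wordsRequiredFor(bitIndex));
            // On OpenJDK the two numbers printed here agree.
            System.out.println(bitIndex + ": predicted " + (words * 64)
                    + " bits, size() = " + bs.size());
        }
    }
}
```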

BitSet implements the Cloneable interface and defines the methods listed in the following table:

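SN  Method and description
1   void and(BitSet bitSet)
    ANDs the contents of the invoking BitSet with those of bitSet. The result is stored in the invoking object.
2   void andNot(BitSet bitSet)
    For each 1 bit in bitSet, clears the corresponding bit in the invoking BitSet.
3   int cardinality( )
    Returns the number of bits set to 1 in the invoking BitSet.
4   void clear( )
    Clears all bits.
5   void clear(int index)
    Clears the bit specified by index.
6   void clear(int startIndex, int endIndex)
    Clears the bits from startIndex to endIndex-1.
7   Object clone( )
    Duplicates the invoking BitSet.
8   boolean equals(Object bitSet)
    Returns true if the invoking BitSet has the same bits set as the one passed in bitSet. Otherwise returns false.
9   void flip(int index)
    Inverts the bit specified by index.
10  void flip(int startIndex, int endIndex)
    Inverts the bits from startIndex to endIndex-1.
11  boolean get(int index)
    Returns the current state of the bit at the specified index.
12  BitSet get(int startIndex, int endIndex)
    Returns a BitSet containing the bits from startIndex to endIndex-1. The invoking object is not changed.
13  int hashCode( )
    Returns the hash code of the invoking object.
14  boolean intersects(BitSet bitSet)
    Returns true if at least one pair of corresponding bits in the invoking object and bitSet are both 1.
15  boolean isEmpty( )
    Returns true if all bits in the invoking object are zero.
16  int length( )
    Returns the number of bits needed to hold the contents of the invoking BitSet. This value is determined by the position of the last 1 bit.
17  int nextClearBit(int startIndex)
    Returns the index of the next cleared bit (that is, the next 0 bit), starting from the index specified by startIndex.
18  int nextSetBit(int startIndex)
    Returns the index of the next set bit (that is, the next 1 bit), starting from the index specified by startIndex. If no such bit is set, returns -1.
19  void or(BitSet bitSet)
    ORs the contents of the invoking BitSet with those of bitSet. The result is stored in the invoking object.
20  void set(int index)
    Sets the bit specified by index.
21  void set(int index, boolean v)
    Sets the bit specified by index to the value passed in v: true sets the bit, false clears it.
22  void set(int startIndex, int endIndex)
    Sets the bits from startIndex to endIndex-1.
23  void set(int startIndex, int endIndex, boolean v)
    Sets the bits from startIndex to endIndex-1 to the value passed in v: true sets the bits, false clears them.
24  int size( )
    Returns the number of bits of space actually in use by the invoking BitSet.
25  String toString( )
    Returns the string equivalent of the invoking BitSet.
26  void xor(BitSet bitSet)
    XORs the contents of the invoking BitSet with those of bitSet. The result is stored in the invoking object.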
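A short usage sketch of several of these methods follows. The class name BitSetOpsDemo and the helper of(...) are made up for the example; the BitSet calls themselves are the standard java.util.BitSet API.

```java
import java.util.BitSet;

// Sketch: set algebra and scanning with java.util.BitSet.
class BitSetOpsDemo {

    // Hypothetical helper: build a BitSet with the given bits set.
    static BitSet of(int... indexes) {
        BitSet b = new BitSet();
        for (int i : indexes) {
            b.set(i);
        }
        return b;
    }

    public static void main(String[] args) {
        BitSet a = of(1, 3, 5, 7);
        BitSet b = of(3, 4, 5);

        // and/or/andNot modify the invoking object, so work on clones.
        BitSet both = (BitSet) a.clone();
        both.and(b);
        BitSet either = (BitSet) a.clone();
        either.or(b);
        BitSet onlyA = (BitSet) a.clone();
        onlyA.andNot(b);

        System.out.println(both);   // {3, 5}
        System.out.println(either); // {1, 3, 4, 5, 7}
        System.out.println(onlyA);  // {1, 7}

        System.out.println(a.cardinality());   // 4 bits are set
        System.out.println(a.length());        // 8 = highest set bit (7) + 1
        System.out.println(a.nextSetBit(4));   // 5
        System.out.println(a.nextSetBit(8));   // -1, nothing set from 8 on
        System.out.println(a.nextClearBit(3)); // 4

        // Idiomatic loop over all set bits.
        for (int i = a.nextSetBit(0); i >= 0; i = a.nextSetBit(i + 1)) {
            System.out.println("set bit: " + i);
        }
    }
}
```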

Use cases for the Bloom filter

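1. URL filtering in web crawlers.

2. Log analysis.

3. Counting users, and so on.

In short, a Bloom filter belongs in scenarios that can tolerate a small probability of false positives; otherwise, use it with caution.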

0 0