A Bloom filter for a whitelist service

Source: Internet. Editor: 程序博客网. Time: 2024/05/22 12:14

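A Bloom filter is a data structure for testing whether an element belongs to a set. A HashMap can of course do the same job. The difference is that a HashMap stores the key that represents the element (for the key "IKnow7", a Java HashMap stores those 6 characters, 12 bytes as UTF-16; strictly speaking the reference size counts too, although Java keeps only one copy of identical interned strings), whereas a Bloom filter represents each element with just a few bits underneath. In speed the two are close, since both come down to hashing plus random access. Because it is so compact, a Bloom filter is well suited to testing membership in very large data sets.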

Some Bloom filter concepts

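Bloom filters are not perfect. When an element is added, it is hashed to k positions in the underlying bit array, and those bits are set to 1. Because of how hash functions behave and because the bit array has a finite length, different elements may overlap on some positions. To check whether an element is present, the filter tests whether all k of its positions are 1, and if they all are, it reports the element as present. This is where the problem lies: those 1s were not necessarily set by this element and may have been set by other elements, so an element that was never added can be reported as present. This kind of error is called a false positive: an element that was added always passes, an element that was not added may also pass, but an added element is never reported as absent.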
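The add and check steps described above can be sketched in a few lines. This is a minimal illustration, not the implementation given later in this article: the class name `TinyBloomFilter` is hypothetical, and the k positions are derived from `String.hashCode()` by double hashing, which is an assumption made to keep the sketch short.

```java
import java.util.BitSet;

// Minimal illustrative Bloom filter; class name and hashing scheme are assumptions.
class TinyBloomFilter {
    private final BitSet bits;
    private final int m; // number of bits in the array
    private final int k; // number of positions set per element

    TinyBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th position from two hash values (double hashing).
    private int position(String element, int i) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String element) {
        for (int i = 0; i < k; i++) {
            bits.set(position(element, i));
        }
    }

    boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) {
                return false; // at least one bit is 0: definitely never added
            }
        }
        return true; // all bits are 1: added, or a false positive
    }

    public static void main(String[] args) {
        TinyBloomFilter filter = new TinyBloomFilter(64, 3);
        filter.add("alice");
        filter.add("bob");
        System.out.println(filter.mightContain("alice")); // added, so always true
        System.out.println(filter.mightContain("carol")); // never added; true only on a false positive
    }
}
```

An added element always finds all of its bits set, which is why there are no false negatives; a false positive needs every one of the k positions to have been set by other elements.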

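It is easy to guess that the false positive probability depends on the choice of hash function, the size of the bit array, the current number of elements, and k (the number of positions each element maps to). In general, the more uniform the hash values, the larger the bit array, and the fewer the elements, the lower the false positive probability.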
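To put numbers on this, the usual approximation for the false positive probability with m bits, n elements and k positions per element is (1 - e^(-k*n/m))^k, and the k that minimizes it is (m/n) * ln 2. Both formulas also appear in the implementation in this article. The small helper below, whose class and method names are made up for illustration, just evaluates them.

```java
// Hypothetical helper for illustration; it only evaluates the two standard formulas.
class BloomMath {
    // Optimal number of positions per element: k = (m / n) * ln 2, at least 1.
    static int optimalK(int m, int n) {
        return Math.max(1, (int) Math.round((double) m / n * Math.log(2.0)));
    }

    // Approximate false positive probability: (1 - e^(-k * n / m))^k.
    static double falsePositiveRate(int m, int n, int k) {
        return Math.pow(1 - Math.exp(-(double) k * n / m), k);
    }

    public static void main(String[] args) {
        int m = 1000; // bits in the array
        int n = 100;  // elements inserted
        int k = optimalK(m, n);
        System.out.printf("k=%d, p=%.4f%n", k, falsePositiveRate(m, n, k));
    }
}
```

With 10 bits per element (m = 1000, n = 100) this gives k = 7 and a false positive probability of roughly 0.8%. Doubling n with the same m and k raises the probability noticeably, which matches the intuition above.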

For a theoretical analysis of Bloom filter design and false positive rates, written by an expert and well worth reading, see: http://www.cnblogs.com/allensun/archive/2011/02/16/1956532.html.

Bloom filters in web applications

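Web applications often use a whitelist to filter requests, in order to avoid pointless database lookups or malicious attacks. For a whitelist that holds a very large amount of data and can tolerate some false positives, a Bloom filter is a natural choice.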

Implementing a whitelist that supports incremental updates with a Bloom filter

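A whitelist usually needs updating, either in full or incrementally. A full update is simple: create a new Bloom filter and put all the current data into it. An incremental update usually supplies the data added and deleted during some period, so the whitelist has to merge it in, adding what was added and deleting what was deleted.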

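However, a plain Bloom filter does not support deleting elements, because a given bit may be shared by several elements. One idea, impractical here because it turns every bit into a counter and gives up much of the space saving, is to keep a reference count for each position in the Bloom filter and decrement it whenever an element is deleted.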
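The reference-count idea mentioned above is usually known as a counting Bloom filter. The sketch below is only meant to show what it would look like and why it costs more memory; the class name and the hashing scheme are assumptions, and a 32-bit `int` per slot is used purely for simplicity.

```java
// Illustrative counting Bloom filter; names and hashing scheme are assumptions.
class CountingBloomSketch {
    private final int[] counters; // one counter per slot instead of one bit
    private final int k;

    CountingBloomSketch(int m, int k) {
        this.counters = new int[m];
        this.k = k;
    }

    private int position(String element, int i) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, counters.length);
    }

    void add(String element) {
        for (int i = 0; i < k; i++) {
            counters[position(element, i)]++;
        }
    }

    // Only correct for elements that were really added. Removing a false
    // positive would lower counters that belong to other elements.
    void remove(String element) {
        if (!mightContain(element)) {
            return;
        }
        for (int i = 0; i < k; i++) {
            int p = position(element, i);
            if (counters[p] > 0) {
                counters[p]--;
            }
        }
    }

    boolean mightContain(String element) {
        for (int i = 0; i < k; i++) {
            if (counters[position(element, i)] == 0) {
                return false;
            }
        }
        return true;
    }
}
```

Even with small counters, every slot now needs several bits rather than one, so the structure takes several times the space of a plain Bloom filter of the same length.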

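A workable approach is to keep a separate map of deleted elements. When checking whether an element is present, first look it up in this deleteMap: if it is there, return false immediately; if not, ask the Bloom filter. When adding an element that is in the deleteMap, remove it from the deleteMap and then add it to the Bloom filter. In practice, a whitelist usually has few elements to delete, so this approach is efficient enough. It does have one problem: when the deleteMap holds too many elements, the false positive rate of the Bloom filter is higher than it needs to be, because the bits that the deleted elements set to 1 are never reset to 0. The remedy is to run a full update of the Bloom filter once the deleteMap reaches a size threshold.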
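The update policy described above can be sketched as follows. This is a simplified stand-in, not the class from this article: `WhitelistSketch`, its `rebuildThreshold` parameter, and the `hashCode()`-based positions are all assumptions, and a `HashSet` plays the role of the deleteMap.

```java
import java.util.BitSet;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

// Illustrative whitelist with a delete set and a rebuild threshold.
// Class name, threshold parameter and hashing scheme are assumptions.
class WhitelistSketch {
    private final int m;
    private final int k;
    private final int rebuildThreshold;
    private BitSet bits;
    private final Set<String> deleted = new HashSet<>();

    WhitelistSketch(int m, int k, int rebuildThreshold) {
        this.m = m;
        this.k = k;
        this.rebuildThreshold = rebuildThreshold;
        this.bits = new BitSet(m);
    }

    private int position(String element, int i) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | 1; // forced odd so the step is never zero
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String element) {
        deleted.remove(element); // a re-added element is no longer deleted
        for (int i = 0; i < k; i++) {
            bits.set(position(element, i));
        }
    }

    void remove(String element) {
        deleted.add(element); // the bits stay set; only the delete set changes
    }

    boolean contains(String element) {
        if (deleted.contains(element)) {
            return false;
        }
        for (int i = 0; i < k; i++) {
            if (!bits.get(position(element, i))) {
                return false;
            }
        }
        return true;
    }

    // Incremental update: apply one batch of deletions and additions.
    void applyIncrement(Collection<String> added, Collection<String> removed) {
        for (String element : removed) {
            remove(element);
        }
        for (String element : added) {
            add(element);
        }
    }

    // True once enough deletions have piled up that a full update is due.
    boolean needsFullSync() {
        return deleted.size() >= rebuildThreshold;
    }

    // Full update: rebuild the bits from the complete data set and drop the delete set.
    void fullSync(Collection<String> allCurrentElements) {
        bits = new BitSet(m);
        deleted.clear();
        for (String element : allCurrentElements) {
            add(element);
        }
    }
}
```

A caller would apply each incremental batch with `applyIncrement`, then check `needsFullSync()` and, when it returns true, fetch the complete whitelist and pass it to `fullSync`. Where that complete data comes from is outside the scope of this sketch.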

Implementation of the whitelist Bloom filter

A component like this is highly reusable and can easily be integrated into existing code. Here it is in full:

import java.io.Serializable;
import java.nio.charset.Charset;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.BitSet;
import java.util.Collection;
import java.util.HashMap;

public class BloomFilter<E> implements Serializable {

    private static final long serialVersionUID = 3507830443935243576L;

    private long timestamp; // used by the timestamp-based update mechanism
    private HashMap<E, Boolean> deleteMap; // stores deleted elements
    private BitSet bitset; // bit array storage
    private int bitSetSize;
    // expected (maximum) number of elements to be added
    private int expectedNumberOfFilterElements;
    // number of elements actually added to the Bloom filter
    private int numberOfAddedElements;
    private int k; // each element maps to k bits

    // encoding used for storing hash values as strings
    static final Charset charset = Charset.forName("UTF-8");

    // MD5 gives good enough accuracy in most circumstances.
    // Change to SHA1 if it's needed
    static final String hashName = "MD5";
    static final MessageDigest digestFunction;

    static {
        // One digest instance is shared by all filters; createHash synchronizes on it.
        MessageDigest tmp;
        try {
            tmp = MessageDigest.getInstance(hashName);
        } catch (NoSuchAlgorithmException e) {
            // Fail fast instead of leaving a null digest that would throw later.
            throw new IllegalStateException("Hash algorithm not available: " + hashName, e);
        }
        digestFunction = tmp;
    }

    /**
     * Constructs an empty Bloom filter.
     *
     * @param bitSetSize defines how many bits should be used for the filter.
     * @param expectedNumberOfFilterElements defines the maximum
     *           number of elements the filter is expected to contain.
     */
    public BloomFilter(int bitSetSize, int expectedNumberOfFilterElements) {
        this.expectedNumberOfFilterElements = expectedNumberOfFilterElements;
        // k = (m / n) * ln 2. Divide as double: integer division would truncate
        // the ratio first. Keep k at 1 or more so every element sets a bit.
        this.k = Math.max(1, (int) Math.round(
                (bitSetSize / (double) expectedNumberOfFilterElements) * Math.log(2.0)));
        bitset = new BitSet(bitSetSize);
        deleteMap = new HashMap<E, Boolean>();
        this.bitSetSize = bitSetSize;
        numberOfAddedElements = 0;
    }

    /**
     * Generates a digest based on the contents of a String.
     *
     * @param val specifies the input data.
     * @param charset specifies the encoding of the input data.
     * @return digest as long.
     */
    public static long createHash(String val, Charset charset) {
        // getBytes(Charset) cannot throw UnsupportedEncodingException.
        return createHash(val.getBytes(charset));
    }

    /**
     * Generates a digest based on the contents of a String.
     *
     * @param val specifies the input data. The encoding is expected to be UTF-8.
     * @return digest as long.
     */
    public static long createHash(String val) {
        return createHash(val, charset);
    }

    /**
     * Generates a digest based on the contents of an array of bytes.
     *
     * @param data specifies input data.
     * @return digest as long (built from the first 4 bytes, so never negative).
     */
    public static long createHash(byte[] data) {
        long h = 0;
        byte[] res;
        synchronized (digestFunction) {
            res = digestFunction.digest(data);
        }
        for (int i = 0; i < 4; i++) {
            h <<= 8;
            h |= ((int) res[i]) & 0xFF;
        }
        return h;
    }

    /**
     * Compares the contents of two instances to see if they are equal.
     *
     * @param obj is the object to compare to.
     * @return True if the contents of the objects are equal.
     */
    @SuppressWarnings("unchecked")
    @Override
    public boolean equals(Object obj) {
        if (obj == null) {
            return false;
        }
        if (getClass() != obj.getClass()) {
            return false;
        }
        final BloomFilter<E> other = (BloomFilter<E>) obj;
        if (this.expectedNumberOfFilterElements !=
                other.expectedNumberOfFilterElements) {
            return false;
        }
        if (this.k != other.k) {
            return false;
        }
        if (this.bitSetSize != other.bitSetSize) {
            return false;
        }
        if (this.bitset != other.bitset &&
                (this.bitset == null || !this.bitset.equals(other.bitset))) {
            return false;
        }
        return true;
    }

    /**
     * Calculates a hash code for this class.
     * @return hash code representing the contents of an instance of this class.
     */
    @Override
    public int hashCode() {
        int hash = 7;
        hash = 61 * hash + (this.bitset != null ? this.bitset.hashCode() : 0);
        hash = 61 * hash + this.expectedNumberOfFilterElements;
        hash = 61 * hash + this.bitSetSize;
        hash = 61 * hash + this.k;
        return hash;
    }

    /**
     * Calculates the expected probability of false positives based on
     * the number of expected filter elements and the size of the Bloom filter.
     * <br /><br />
     * The value returned by this method is the <i>expected</i> rate of false
     * positives, assuming the number of inserted elements equals the number of
     * expected elements. If the number of elements in the Bloom filter is less
     * than the expected value, the true probability of false positives will be lower.
     *
     * @return expected probability of false positives.
     */
    public double expectedFalsePositiveProbability() {
        return getFalsePositiveProbability(expectedNumberOfFilterElements);
    }

    /**
     * Calculate the probability of a false positive given the specified
     * number of inserted elements.
     *
     * @param numberOfElements number of inserted elements.
     * @return probability of a false positive.
     */
    public double getFalsePositiveProbability(double numberOfElements) {
        // (1 - e^(-k * n / m)) ^ k
        return Math.pow((1 - Math.exp(-k * (double) numberOfElements
                / (double) bitSetSize)), k);
    }

    /**
     * Get the current probability of a false positive. The probability is calculated from
     * the size of the Bloom filter and the current number of elements added to it.
     *
     * @return probability of false positives.
     */
    public double getFalsePositiveProbability() {
        return getFalsePositiveProbability(numberOfAddedElements);
    }

    /**
     * Returns the value chosen for K.<br />
     * <br />
     * K is the optimal number of hash functions based on the size
     * of the Bloom filter and the expected number of inserted elements.
     *
     * @return optimal k.
     */
    public int getK() {
        return k;
    }

    /**
     * Sets all bits to false in the Bloom filter and forgets recorded deletions.
     */
    public void clear() {
        bitset.clear();
        // With every bit cleared, the recorded deletions no longer mean anything.
        deleteMap.clear();
        numberOfAddedElements = 0;
    }

    /**
     * Adds an object to the Bloom filter. The output from the object's
     * toString() method is used as input to the hash functions.
     *
     * @param element is an element to register in the Bloom filter.
     */
    public void add(E element) {
        // A re-added element must no longer be treated as deleted.
        deleteMap.remove(element);
        long hash;
        String valString = element.toString();
        for (int x = 0; x < k; x++) {
            hash = createHash(valString + Integer.toString(x));
            hash = hash % (long) bitSetSize;
            bitset.set(Math.abs((int) hash), true);
        }
        numberOfAddedElements++;
    }

    /**
     * Removes all elements of a Collection from the Bloom filter.
     * @param c Collection of elements.
     */
    public void removeAll(Collection<? extends E> c) {
        for (E element : c)
            remove(element);
    }

    /**
     * Marks an element as deleted. The bits stay set; the element is only
     * recorded in the delete map, which contains() checks first.
     *
     * @param element element to remove.
     */
    public void remove(E element) {
        deleteMap.put(element, Boolean.TRUE);
    }

    public int getDeleteMapSize() {
        return deleteMap.size();
    }

    /**
     * Adds all elements from a Collection to the Bloom filter.
     * @param c Collection of elements.
     */
    public void addAll(Collection<? extends E> c) {
        for (E element : c) {
            if (element != null)
                add(element);
        }
    }

    /**
     * Returns true if the element could have been inserted into the Bloom filter.
     * Use getFalsePositiveProbability() to calculate the probability of this
     * being correct.
     *
     * @param element element to check.
     * @return true if the element could have been inserted into the Bloom filter.
     */
    public boolean contains(E element) {
        // Deleted elements are rejected before the bits are consulted.
        Boolean contains = deleteMap.get(element);
        if (contains != null && contains)
            return false;
        long hash;
        String valString = element.toString();
        for (int x = 0; x < k; x++) {
            hash = createHash(valString + Integer.toString(x));
            hash = hash % (long) bitSetSize;
            if (!bitset.get(Math.abs((int) hash)))
                return false;
        }
        return true;
    }

    /**
     * Returns true if all the elements of a Collection could have been inserted
     * into the Bloom filter. Use getFalsePositiveProbability() to calculate the
     * probability of this being correct.
     * @param c elements to check.
     * @return true if all the elements in c could have been inserted into the Bloom filter.
     */
    public boolean containsAll(Collection<? extends E> c) {
        for (E element : c)
            if (!contains(element))
                return false;
        return true;
    }

    /**
     * Read a single bit from the Bloom filter.
     * @param bit the bit to read.
     * @return true if the bit is set, false if it is not.
     */
    public boolean getBit(int bit) {
        return bitset.get(bit);
    }

    /**
     * Set a single bit in the Bloom filter.
     * @param bit is the bit to set.
     * @param value If true, the bit is set. If false, the bit is cleared.
     */
    public void setBit(int bit, boolean value) {
        bitset.set(bit, value);
    }

    /**
     * Return the bit set used to store the Bloom filter.
     * @return bit set representing the Bloom filter.
     */
    public BitSet getBitSet() {
        return bitset;
    }

    /**
     * Returns the number of bits in the Bloom filter. Use count() to retrieve
     * the number of inserted elements.
     *
     * @return the size of the bitset used by the Bloom filter.
     */
    public int size() {
        return this.bitSetSize;
    }

    /**
     * Returns the number of elements added to the Bloom filter after it
     * was constructed or after clear() was called.
     *
     * @return number of elements added to the Bloom filter.
     */
    public int count() {
        return this.numberOfAddedElements;
    }

    /**
     * Returns the expected number of elements to be inserted into the filter.
     * This value is the same value as the one passed to the constructor.
     *
     * @return expected number of elements.
     */
    public int getExpectedNumberOfElements() {
        return expectedNumberOfFilterElements;
    }

    /**
     * Returns the timestamp used by the update mechanism.
     * @return the update timestamp.
     */
    public long getTimestamp() {
        return timestamp;
    }

    /**
     * Sets the update timestamp.
     * @param timestamp the update timestamp.
     */
    public void setTimestamp(long timestamp) {
        this.timestamp = timestamp;
    }

    @Override
    public String toString() {
        return "BloomFilter [timestamp=" + timestamp + ", bitSetSize=" + bitSetSize
                + ", expectedNumberOfFilterElements="
                + expectedNumberOfFilterElements + ", numberOfAddedElements="
                + numberOfAddedElements + ", k="
                + k + ",deleteMapSize=" + getDeleteMapSize() + "]";
    }
}

0 0