BloomFilter（大数据去重）+Redis（持久化）策略

来源：互联网发布：php在线文件管理系统编辑：程序博客网时间：2024/05/22 06:25

BloomFilter（大数据去重）+Redis（持久化）策略

背景

之前在重构一套文章爬虫系统时，其中有块逻辑是根据文章标题去重，原先去重的方式是，插入文章之前检查待插入文章的标题是否在ElasticSearch中存在，这无疑加重了ElasticSearch的负担也势必会影响程序的性能！

BloomFilter算法

简介：布隆过滤器实际上是一个很长的二进制向量和一系列随机映射函数。布隆过滤器可以用于检索一个元素是否在一个集合中。它的优点是空间效率和查询时间都远远超过一般的算法，缺点是有一定的误识别率和删除困难。
原理：当一个元素被加入集合时，通过K个散列函数将这个元素映射成一个位数组中的K个点，把它们置为1。检索时，我们只要看看这些点是不是都是1就（大约）知道集合中有没有它了：如果这些点有任何一个0，则被检元素一定不在；如果都是1，则被检元素很可能在。
优点：相比于其它的数据结构，布隆过滤器在空间和时间方面都有巨大的优势。布隆过滤器存储空间和插入/查询时间都是常数（O(k)）。而且它不存储元素本身，在某些对保密要求非常严格的场合有优势。
缺点：一定的误识别率和删除困难。
结合以上几点及去重需求（容忍误判，会误判在，在则丢，无妨），决定使用BlomFilter。

思想

位数组和k个散列函数

位数组
初始状态时，BloomFilter是一个长度为m的位数组，每一位都置为0。
添加元素（k个独立的hash函数）
添加元素时，对x使用k个哈希函数得到k个哈希值，对m取余，对应的bit位设置为1。
判断元素是否存在
判断y是否属于这个集合，对y使用k个哈希函数得到k个哈希值，对m取余，所有对应的位置都是1，则认为y属于该集合（哈希冲突，可能存在误判），否则就认为y不属于该集合。
图中y1不是集合中的元素，y2属于这个集合或者是一个false positive。

BloomFilter有以下参数:
- m 位数组的长度
- n 加入其中元素的数量
- k 哈希函数的个数
- f False Positive

问题：如何根据输入元素个数n，确定位数组的大小m和哈希函数的个数k?

BloomFilter的f满足下列公式：
这里写图片描述
在给定m和n时，能够使f最小化的k值为：

此时给出的f为：

根据以上公式，对于任意给定的f，我们有：

同时，我们需要k个hash来达成这个目标：

由于k必须取整数，我们在Bloom Filter的程序实现中，还应该使用上面的公式来求得实际的f：
这里写图片描述
以上3个公式是程序实现Bloom Filter的关键公式。
故可以通过调节参数，使用Hash函数的个数，位数组的大小来降低失误率。

实现

可以使用JDK自带的BitSet来实现，但存在两个问题：OOM和持久化问题。
结合Redis的BitMap能够解决，唯一需要注意的是Redis的BitMap只支持2^32大小，对应到内存也就是512MB,数组的下标最大只能是2^32-1。不过这个限制可以通过构建多个Redis的Bitmap通过hash取模的方式分散一下即可。万分之一的误判率，512MB可以放下2亿条数据。
好了，扯了这么多，贴代码！(注：在MagnusS/Java-BloomFilter的基础上加上了Redis持久化的实现)

@Componentpublic class BloomFilter<E> {    @Autowired    private RedisTemplate<String, E> redisTemplate;    @Value("${bloomfilter.expireDays}")    private long expireDays;    // total length of the Bloom filter    private int sizeOfBloomFilter;    // expected (maximum) number of elements to be added    private int expectedNumberOfFilterElements;    // number of hash functions    private int numberOfHashFunctions;    // encoding used for storing hash values as strings    private final Charset charset = Charset.forName("UTF-8");    // MD5 gives good enough accuracy in most circumstances. Change to SHA1 if it's needed    private static final String hashName = "MD5";    private static final MessageDigest digestFunction;    // The digest method is reused between instances    static {        MessageDigest tmp;        try {            tmp = java.security.MessageDigest.getInstance(hashName);        } catch (NoSuchAlgorithmException e) {            tmp = null;        }        digestFunction = tmp;    }    public BloomFilter() {        this(0.0001, 600000);    }    /**     * Constructs an empty Bloom filter.     *     * @param m is the total length of the Bloom filter.     * @param n is the expected number of elements the filter will contain.     * @param k is the number of hash functions used.     */    public BloomFilter(int m, int n, int k) {        this.sizeOfBloomFilter = m;        this.expectedNumberOfFilterElements = n;        this.numberOfHashFunctions = k;    }    /**     * Constructs an empty Bloom filter with a given false positive probability.     * The size of bloom filter and the number of hash functions is estimated     * to match the false positive probability.     *     * @param falsePositiveProbability is the desired false positive probability.     * @param expectedNumberOfElements is the expected number of elements in the Bloom filter.     */    public BloomFilter(double falsePositiveProbability, int expectedNumberOfElements) {        this((int) Math.ceil((int) Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2))) * expectedNumberOfElements / Math.log(2)), // m = ceil(kn/ln2)                expectedNumberOfElements,                (int) Math.ceil(-(Math.log(falsePositiveProbability) / Math.log(2)))); // k = ceil(-ln(f)/ln2)    }    /**     * Adds an object to the Bloom filter. The output from the object's     * toString() method is used as input to the hash functions.     *     * @param element is an element to register in the Bloom filter.     */    public void add(E element) {        add(element.toString().getBytes(charset));    }    /**     * Adds an array of bytes to the Bloom filter.     *     * @param bytes array of bytes to add to the Bloom filter.     */    public void add(byte[] bytes) {        if (redisTemplate.opsForValue().get(RedisConsts.CRAWLER_BLOOMFILTER) == null) {            redisTemplate.opsForValue().setBit(RedisConsts.CRAWLER_BLOOMFILTER, 0, false);            redisTemplate.expire(RedisConsts.CRAWLER_BLOOMFILTER, expireDays, TimeUnit.DAYS);        }        int[] hashes = createHashes(bytes, numberOfHashFunctions);        for (int hash : hashes) {            redisTemplate.opsForValue().setBit(RedisConsts.CRAWLER_BLOOMFILTER, Math.abs(hash % sizeOfBloomFilter), true);        }    }    /**     * Adds all elements from a Collection to the Bloom filter.     *     * @param c Collection of elements.     */    public void addAll(Collection<? extends E> c) {        for (E element : c) {            add(element);        }    }    /**     * Returns true if the element could have been inserted into the Bloom filter.     * Use getFalsePositiveProbability() to calculate the probability of this     * being correct.     *     * @param element element to check.     * @return true if the element could have been inserted into the Bloom filter.     */    public boolean contains(E element) {        return contains(element.toString().getBytes(charset));    }    /**     * Returns true if the array of bytes could have been inserted into the Bloom filter.     * Use getFalsePositiveProbability() to calculate the probability of this     * being correct.     *     * @param bytes array of bytes to check.     * @return true if the array could have been inserted into the Bloom filter.     */    public boolean contains(byte[] bytes) {        int[] hashes = createHashes(bytes, numberOfHashFunctions);        for (int hash : hashes) {            if (!redisTemplate.opsForValue().getBit(RedisConsts.CRAWLER_BLOOMFILTER, Math.abs(hash % sizeOfBloomFilter))) {                return false;            }        }        return true;    }    /**     * Returns true if all the elements of a Collection could have been inserted     * into the Bloom filter. Use getFalsePositiveProbability() to calculate the     * probability of this being correct.     *     * @param c elements to check.     * @return true if all the elements in c could have been inserted into the Bloom filter.     */    public boolean containsAll(Collection<? extends E> c) {        for (E element : c) {            if (!contains(element)) {                return false;            }        }        return true;    }    /**     * Generates digests based on the contents of an array of bytes and splits the result into 4-byte int's and store them in an array. The     * digest function is called until the required number of int's are produced. For each call to digest a salt     * is prepended to the data. The salt is increased by 1 for each call.     *     * @param data   specifies input data.     * @param hashes number of hashes/int's to produce.     * @return array of int-sized hashes     */    public static int[] createHashes(byte[] data, int hashes) {        int[] result = new int[hashes];        int k = 0;        byte salt = 0;        while (k < hashes) {            byte[] digest;            synchronized (digestFunction) {                digestFunction.update(salt);                salt++;                digest = digestFunction.digest(data);            }            for (int i = 0; i < digest.length / 4 && k < hashes; i++) {                int h = 0;                for (int j = (i * 4); j < (i * 4) + 4; j++) {                    h <<= 8;                    h |= ((int) digest[j]) & 0xFF;                }                result[k] = h;                k++;            }        }        return result;    }    public int getSizeOfBloomFilter() {        return this.sizeOfBloomFilter;    }    public int getExpectedNumberOfElements() {        return this.expectedNumberOfFilterElements;    }    public int getNumberOfHashFunctions() {        return this.numberOfHashFunctions;    }    /**     * Compares the contents of two instances to see if they are equal.     *     * @param obj is the object to compare to.     * @return True if the contents of the objects are equal.     */    @Override    public boolean equals(Object obj) {        if (obj == null) {            return false;        }        if (getClass() != obj.getClass()) {            return false;        }        final BloomFilter<E> other = (BloomFilter<E>) obj;        if (this.sizeOfBloomFilter != other.sizeOfBloomFilter) {            return false;        }        if (this.expectedNumberOfFilterElements != other.expectedNumberOfFilterElements) {            return false;        }        if (this.numberOfHashFunctions != other.numberOfHashFunctions) {            return false;        }        return true;    }    /**     * Calculates a hash code for this class.     *     * @return hash code representing the contents of an instance of this class.     */    @Override    public int hashCode() {        int hash = 7;        hash = 61 * hash + this.sizeOfBloomFilter;        hash = 61 * hash + this.expectedNumberOfFilterElements;        hash = 61 * hash + this.numberOfHashFunctions;        return hash;    }    public static void main(String[] args) {        BloomFilter<String> bloomFilter = new BloomFilter<>(0.0001, 600000);        System.out.println(bloomFilter.getSizeOfBloomFilter());        System.out.println(bloomFilter.getNumberOfHashFunctions());    }}

阅读全文

0 0