LevelDB: A Close Reading of the Bloom Filter Source Code (Data Structure)


Original article: https://yq.aliyun.com/articles/5833

I. Principles

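A Bloom filter is a binary vector data structure proposed by Burton Howard Bloom in 1970. What does "binary vector data structure" mean?

Let us break it into "binary", "vector", and "data structure" and look at each part.

1. Binary: values expressed using only 0 and 1.

2. Vector: here it means a bit vector (bit array). Picture the X axis as the sequence of bit positions (contiguous memory) and the Y axis as the two possible values, 0 and 1.

3. Data structure: a way of storing and organizing data.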

A Bloom filter can therefore be pictured as a run of bits, where each bit, 0 or 1, records something about the data the filter holds.
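The bit array described above can be sketched in a few lines of C++. `BitVector` is a hypothetical helper written for this article, not a LevelDB type; it only shows how a bit position maps to a byte index and a bit within that byte.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper type for illustration; not part of LevelDB.
class BitVector {
 public:
  // Allocate enough bytes to hold `bits` bits, all initially 0.
  explicit BitVector(std::size_t bits) : bits_(bits), bytes_((bits + 7) / 8, 0) {
    assert(bits > 0);
  }

  // Set the bit at position `pos` to 1.
  void Set(std::size_t pos) {
    assert(pos < bits_);
    bytes_[pos / 8] |= static_cast<std::uint8_t>(1u << (pos % 8));
  }

  // Return true if the bit at position `pos` is 1.
  bool Test(std::size_t pos) const {
    assert(pos < bits_);
    return (bytes_[pos / 8] & (1u << (pos % 8))) != 0;
  }

  std::size_t size() const { return bits_; }

 private:
  std::size_t bits_;                 // logical length in bits
  std::vector<std::uint8_t> bytes_;  // backing storage, 8 bits per byte
};
```

For example, `Set(11)` touches byte 1 (the second byte) and sets bit index 3 inside it, because 11 / 8 = 1 and 11 % 8 = 3.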

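To store data, a Bloom filter maps each element through K hash functions onto positions in the bit array and sets the bits at those positions to 1. A bit set to 1 means that some inserted element hashed to that position.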

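For example:

1. The bit array starts out with no data: every bit is 0.

2. Map element a through the K hash functions onto the bit array and set the bits at those positions to 1.

3. Map element b through the K hash functions onto the bit array and set the bits at those positions to 1.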

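To check whether an element is in the Bloom filter, map it through the same K hash functions onto the bit array and check whether the bits at those positions are all 1.

If they are all 1, the element is reported as present; if any of them is 0, the element is definitely not present.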
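The insert and lookup steps described above can be sketched as a small toy filter. Everything here is an assumption made for illustration: the class name `ToyBloomFilter`, the use of `std::vector<bool>`, and the choice of a seeded FNV-1a hash as the family of K hash functions. LevelDB's own implementation, shown in section II, derives its probe positions differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy Bloom filter for illustration only; names are hypothetical, not LevelDB's.
class ToyBloomFilter {
 public:
  ToyBloomFilter(std::size_t bits, std::size_t k) : bits_(bits, false), k_(k) {
    assert(bits > 0 && k > 0);
  }

  // Insert: set the bit chosen by each of the k hash functions.
  void Add(const std::string& key) {
    for (std::size_t i = 0; i < k_; ++i) bits_[Position(key, i)] = true;
  }

  // Lookup: false means "definitely absent"; true means "possibly present".
  bool MayContain(const std::string& key) const {
    for (std::size_t i = 0; i < k_; ++i) {
      if (!bits_[Position(key, i)]) return false;
    }
    return true;
  }

 private:
  // The i-th hash function: FNV-1a with a seed derived from i (an assumed choice).
  std::size_t Position(const std::string& key, std::size_t i) const {
    std::uint64_t h = 14695981039346656037ull ^ (i * 0x9e3779b97f4a7c15ull);
    for (unsigned char c : key) {
      h ^= c;
      h *= 1099511628211ull;
    }
    return static_cast<std::size_t>(h % bits_.size());
  }

  std::vector<bool> bits_;  // the bit array
  std::size_t k_;           // number of hash functions
};
```

An element that was added always passes `MayContain`, because `Add` set every bit that the lookup checks. The reverse does not hold: a `true` result only says that all K bits happen to be set.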

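Because hashes collide, a "present" answer carries some false positive rate: an element that was never added to the Bloom filter can still be reported as present.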
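The false positive rate can be estimated with the standard textbook approximation, which assumes the hash functions behave independently and uniformly. The symbols are introduced here and do not appear in the original article: m is the number of bits in the array, n the number of inserted elements, k the number of hash functions, and p the false positive rate.

```latex
p \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\text{opt}} = \frac{m}{n}\ln 2 \approx 0.69\,\frac{m}{n},
\qquad
p_{\min} \approx \left(\tfrac{1}{2}\right)^{k_{\text{opt}}} \approx 0.6185^{\,m/n}
```

Under these assumptions, a budget of m/n = 10 bits per element gives a false positive rate of roughly 0.8%. The factor 0.69 is the same constant that the constructor in section II uses to pick the number of probes from the bits-per-key setting.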

II. Code Implementation

        // Excerpt from LevelDB's util/bloom.cc. In the LevelDB source this code
        // sits inside namespace leveldb.
        #include "leveldb/filter_policy.h"
        #include "leveldb/slice.h"
        #include "util/hash.h"

        static uint32_t BloomHash(const Slice& key) // hash function
        {
            return Hash(key.data(), key.size(), 0xbc9f1d34);
        }

        class BloomFilterPolicy : public FilterPolicy
        {
        private:
            size_t bits_per_key_; // number of filter bits budgeted per key
            size_t k_; // number of hash probes per key

        public:
            explicit BloomFilterPolicy(int bits_per_key): bits_per_key_(bits_per_key)
            {
                // We intentionally round down to reduce probing cost a little bit
                k_ = static_cast<size_t>(bits_per_key * 0.69);  // 0.69 =~ ln(2)
                if (k_ < 1) k_ = 1;
                if (k_ > 30) k_ = 30;
            }

            virtual const char* Name() const
            {
                return "leveldb.BuiltinBloomFilter2";
            }

            // n: number of keys; dst: string that the generated filter is appended to
            virtual void CreateFilter(const Slice* keys, int n, std::string* dst) const
            {
                // Compute bloom filter size (in both bits and bytes)
                size_t bits = n * bits_per_key_;

                // For small n, we can see a very high false positive rate.  Fix it
                // by enforcing a minimum bloom filter length.
                // The bit array is at least 64 bits (8 bytes).
                if (bits < 64) bits = 64;

                // Number of bytes needed to hold `bits` bits, rounded up
                size_t bytes = (bits + 7) / 8;
                // Actual number of bits after rounding up to whole bytes
                bits = bytes * 8;

                const size_t init_size = dst->size();
                dst->resize(init_size + bytes, 0);
                // Store k_ in the last byte of the filter
                dst->push_back(static_cast<char>(k_));  // Remember # of probes in filter
                char* array = &(*dst)[init_size];
                for (int i = 0; i < n; i++)
                {
                    // Use double-hashing to generate a sequence of hash values.
                    // See analysis in [Kirsch,Mitzenmacher 2006].
                    uint32_t h = BloomHash(keys[i]);
                    const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
                    // Derive k_ bit positions from the single hash and set each one to 1.
                    // Probing several positions per key lowers the false positive rate.
                    for (size_t j = 0; j < k_; j++)
                    {
                        // Position of this probe within the bit array
                        const uint32_t bitpos = h % bits;
                        /*
                         bitpos/8 is the index of the byte that holds the bit;
                         (1 << (bitpos % 8)) is the mask for the bit inside that byte.
                         For example:
                         bitpos = 3:  byte 0 (the first byte), bit index 3; that bit is set to 1.
                         bitpos = 11: byte 1 (the second byte), bit index 3; that bit is set to 1.
                         |= is used because other bits in the byte may already be 1,
                         and they must be preserved when the new bit is set.
                         */
                        array[bitpos/8] |= (1 << (bitpos % 8));
                        h += delta;
                    }
                }
            }

            virtual bool KeyMayMatch(const Slice& key, const Slice& bloom_filter) const
            {
                const size_t len = bloom_filter.size();
                if (len < 2) return false;

                const char* array = bloom_filter.data();
                const size_t bits = (len - 1) * 8;

                // Use the encoded k so that we can read filters generated by
                // bloom filters created using different parameters.
                const size_t k = array[len-1];
                if (k > 30)
                {
                    // Reserved for potentially new encodings for short bloom filters.
                    // Consider it a match.
                    return true;
                }

                uint32_t h = BloomHash(key);
                const uint32_t delta = (h >> 17) | (h << 15);  // Rotate right 17 bits
                for (size_t j = 0; j < k; j++)
                {
                    const uint32_t bitpos = h % bits;
                    // If any probed bit is 0, the key is definitely not in the filter.
                    if ((array[bitpos/8] & (1 << (bitpos % 8))) == 0) return false;
                    h += delta;
                }
                return true;
            }
        };
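To make the arithmetic in the code above easier to follow, here is a standalone sketch that pulls three pieces out of it: how k is chosen from the bits-per-key setting, how many bytes a filter occupies, and which bit positions the double-hashing loop probes. The function names `ChooseK`, `FilterBytes`, and `ProbePositions` are hypothetical and exist only in this sketch; the formulas themselves are copied from the constructor, `CreateFilter`, and the probe loop.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Number of probes for a given bits-per-key budget, clamped to [1, 30].
std::size_t ChooseK(int bits_per_key) {
  assert(bits_per_key >= 0);
  std::size_t k = static_cast<std::size_t>(bits_per_key * 0.69);  // 0.69 =~ ln(2)
  if (k < 1) k = 1;
  if (k > 30) k = 30;
  return k;
}

// Bytes appended to dst for n keys: the bit array plus one trailing byte for k.
std::size_t FilterBytes(std::size_t n, std::size_t bits_per_key) {
  std::size_t bits = n * bits_per_key;
  if (bits < 64) bits = 64;   // enforce the 64-bit minimum
  return (bits + 7) / 8 + 1;  // round up to whole bytes, then add the k byte
}

// The k bit positions probed for a key whose 32-bit hash is h.
std::vector<std::uint32_t> ProbePositions(std::uint32_t h, std::size_t k,
                                          std::size_t bits) {
  assert(bits > 0);
  const std::uint32_t delta = (h >> 17) | (h << 15);  // rotate right 17 bits
  std::vector<std::uint32_t> positions;
  for (std::size_t j = 0; j < k; ++j) {
    positions.push_back(static_cast<std::uint32_t>(h % bits));
    h += delta;  // next value in the double-hashing sequence
  }
  return positions;
}
```

With 10 bits per key, `ChooseK(10)` yields 6 probes, and a filter for 100 keys takes 126 bytes: 125 bytes of bits plus the trailing byte that stores k. Because k is stored in the filter itself, `KeyMayMatch` can read filters that were built with a different bits-per-key setting.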

0 0