The BitSet Data Structure and an Analysis of Its JDK Source Code

Source: Internet. Editor: 程序博客网. Date: 2024/05/01 02:46

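I. BitSet Basics

A BitSet, also called a bitmap, shows up in many algorithm designs because it can represent a contiguous range of values in a very compact form. The figure above comes from the C++ library's bitset.

The basic idea is to use one bit to record whether a value has appeared: 0 means it has not appeared, 1 means it has. To query a value, check whether its corresponding bit is 0 or 1.

1 GB of memory holds 8*1024*1024*1024 ≈ 8.59*10^9 bits, so it can represent more than 8.5 billion distinct numbers.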

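A common application is statistics over massive data sets, such as log analysis.

It also appears frequently in interview questions, for example: find the values that do not occur among 4 billion numbers, or sort 4 billion distinct numbers.

Another example: given 10 million random numbers in the range 1 to 100 million, write an algorithm that finds the numbers between 1 and 100 million that are not among the random numbers (Baidu).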
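The "find the missing numbers" question can be sketched with `java.util.BitSet` as follows. This is only an illustration on a tiny range; the class name `MissingNumbers` and the method `missing` are made up for this example and are not part of the original problem statement.

```java
import java.util.BitSet;

class MissingNumbers {
    /** Returns a BitSet whose set bits are the values in [1, max] that do not occur in data. */
    static BitSet missing(int[] data, int max) {
        BitSet seen = new BitSet(max + 1);
        for (int v : data) {
            seen.set(v);              // mark v as present
        }
        BitSet absent = new BitSet(max + 1);
        absent.set(1, max + 1);       // start with every value in [1, max]
        absent.andNot(seen);          // drop the values that were seen
        return absent;
    }

    public static void main(String[] args) {
        int[] data = {3, 1, 7, 3, 10};
        System.out.println(missing(data, 10)); // prints {2, 4, 5, 6, 8, 9}
    }
}
```

For the range 1 to 100 million the two bitsets need roughly 12.5 MB each, which is why the bitmap approach is attractive here.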

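Programming Pearls also has a problem that uses a bitset to look up telephone numbers.

A common extension of the bitmap uses 2 or more bits per value to record additional information, such as how many times the value has occurred.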
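One possible shape for the 2-bit extension is sketched below: each value gets a two-bit saturating counter packed into a `long[]`. The class `TwoBitCounters` and its methods are hypothetical names chosen for this sketch; the JDK does not provide such a class.

```java
class TwoBitCounters {
    private final long[] words;   // 32 two-bit counters per long

    TwoBitCounters(int n) {
        words = new long[(n + 31) / 32];
    }

    /** Returns the counter for v: 0, 1, 2, or 3 (3 means "three or more"). */
    int get(int v) {
        int shift = (v & 31) * 2;                 // position of the counter inside its word
        return (int) ((words[v >>> 5] >>> shift) & 3L);
    }

    /** Increments the counter for v, saturating at 3. */
    void increment(int v) {
        int c = get(v);
        if (c == 3) {
            return;                               // already saturated
        }
        int shift = (v & 31) * 2;
        words[v >>> 5] &= ~(3L << shift);         // clear the old two bits
        words[v >>> 5] |= ((long) (c + 1)) << shift;
    }

    public static void main(String[] args) {
        TwoBitCounters counters = new TwoBitCounters(100);
        counters.increment(5);
        counters.increment(5);
        for (int i = 0; i < 5; i++) {
            counters.increment(40);
        }
        System.out.println(counters.get(5));  // 2
        System.out.println(counters.get(40)); // 3 (saturated)
        System.out.println(counters.get(6));  // 0
    }
}
```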

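II. The BitSet Implementation in Java

Although the BitSet structure is simple, its implementation has some details that deserve attention. The key part is the bit manipulation: how to flip, set, and query the state (0 or 1) of a given bit.

This article walks through the BitSet implementation in Java, in the hope that it offers some inspiration to programmers who need to design their own bitmap structure.

The basic operations on a bitmap are:

Initialize a bitset with a given size.
Clear the bitset.
Flip a given bit.
Set a given bit.
Get the state of a given bit.
Get the total number of bits in the bitset.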
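Before reading the source, it may help to see these operations through the public `java.util.BitSet` API. The snippet below is a small usage sketch; the class name `BasicOps` is made up for this example.

```java
import java.util.BitSet;

class BasicOps {
    static String demo() {
        BitSet bits = new BitSet(128);    // initialize with room for 128 bits
        bits.set(3);                      // set a bit
        bits.set(100);
        bits.flip(3);                     // flip: bit 3 goes back to 0
        bits.flip(4);                     // flip: bit 4 becomes 1
        boolean b100 = bits.get(100);     // query a bit
        String before = bits.toString();  // "{4, 100}"
        int capacity = bits.size();       // total bits of storage
        bits.clear();                     // clear every bit
        return before + " " + b100 + " " + capacity + " " + bits.isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints {4, 100} true 128 true
    }
}
```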

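1. Declaration

In Java, BitSet lives in the java.util package and has been part of the library since JDK 1.0. It has kept evolving across JDK releases. This article follows the implementation in the JDK 7 source code.

The declaration is as follows: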

    package java.util;

    import java.io.*;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;
    import java.nio.LongBuffer;

    public class BitSet implements Cloneable, java.io.Serializable {

        private long[] words;
        ........

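We can also see that a long array is used as the internal storage. This means a BitSet allocates its space in whole longs, so its capacity is always a multiple of 64 bits. The constructors below reflect this as well.

2. Constructors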

    public BitSet() {
        initWords(BITS_PER_WORD);
        sizeIsSticky = false;
    }

    public BitSet(int nbits) {
        // nbits can't be negative; size 0 is OK
        if (nbits < 0)
            throw new NegativeArraySizeException("nbits < 0: " + nbits);

        initWords(nbits);
        sizeIsSticky = true;
    }

    private void initWords(int nbits) {
        words = new long[wordIndex(nbits-1) + 1];
    }

    private static int wordIndex(int bitIndex) {
        return bitIndex >> ADDRESS_BITS_PER_WORD;
    }

    private final static int ADDRESS_BITS_PER_WORD = 6;
    private final static int BITS_PER_WORD = 1 << ADDRESS_BITS_PER_WORD;

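There are two constructors: one takes an initial size and the other does not. When no size is given, the default initial size is 2^6 = 64 bits, covering bit indexes 0 through 63. A Java long is 8 bytes, that is 8*8 = 64 bits, so a default BitSet is exactly the size of one long. The initialization function allocates the required number of longs.
Note: when an initial size is specified, it is rounded up to a multiple of 64 that is greater than or equal to the requested number. For example, 64 bits need 1 long, while 65 bits need 2 longs, that is 128 bits. This rule mainly serves memory alignment, and it also avoids special cases and keeps the code simple.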
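The rounding rule can be observed through `size()`, which reports the number of bits of storage. The sketch below assumes the OpenJDK implementation analyzed here; the class name `SizeDemo` and the helper `capacity` are made up for illustration.

```java
import java.util.BitSet;

class SizeDemo {
    /** Storage (in bits) allocated for a BitSet created with the given nbits. */
    static int capacity(int nbits) {
        return new BitSet(nbits).size();
    }

    public static void main(String[] args) {
        System.out.println(new BitSet().size()); // 64: the default is one long
        System.out.println(capacity(1));         // 64
        System.out.println(capacity(64));        // 64
        System.out.println(capacity(65));        // 128: rounded up to two longs
        System.out.println(capacity(0));         // 0: wordIndex(-1) + 1 == 0 words
    }
}
```

The last line shows that `new BitSet(0)` allocates an empty array, because `-1 >> 6` is `-1` in Java.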

3. Clearing a BitSet

a. Clear all bits, that is, set every bit to 0. This is done with a loop that zeroes the words one by one. In C, would memset be faster?

    public void clear() {
        while (wordsInUse > 0)
            words[--wordsInUse] = 0;
    }

b. Clear a single bit

    public void clear(int bitIndex) {
        if (bitIndex < 0)
            throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

        int wordIndex = wordIndex(bitIndex);
        if (wordIndex >= wordsInUse)
            return;

        words[wordIndex] &= ~(1L << bitIndex);

        recalculateWordsInUse();
        checkInvariants();
    }

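The first statement is a parameter check: if bitIndex is less than 0, an IndexOutOfBoundsException is thrown. What follows is the classic two-step routine of bitset operations: a. find the corresponding long; b. operate on the corresponding bit.
a. Find the corresponding long. The statement is int wordIndex = wordIndex(bitIndex);
b. Operate on the corresponding bit. Bit operations are usually implemented as a logical operation against a specific mask, so the first step is to obtain the mask.
To clear a bit, the mask must have 0 at the target bit and 1 everywhere else; it is then ANDed with the corresponding long.
~(1L << bitIndex); builds the mask. For a long, Java uses only the low 6 bits of the shift distance, so bitIndex does not need to be reduced modulo 64 first.
words[wordIndex] &= ~(1L << bitIndex); performs the operation.
Note: the parameter check throws an exception for a negative index, but for an index beyond the words in use it does nothing and returns immediately. A plausible reason is that bits beyond the words in use are 0 by definition, so clearing them is a no-op.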
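The two steps can be tried on a bare `long[]` outside of BitSet. This is a minimal sketch, not the JDK code: the class `ClearBit` is hypothetical, and it checks against `words.length` rather than `wordsInUse`.

```java
class ClearBit {
    /** Clears bit bitIndex in words; indexes beyond the array are treated as already 0. */
    static void clear(long[] words, int bitIndex) {
        int wordIndex = bitIndex >> 6;          // step a: which long holds the bit
        if (wordIndex >= words.length) {
            return;                             // nothing stored there, nothing to clear
        }
        long mask = ~(1L << bitIndex);          // step b: 0 at the target bit, 1 elsewhere
        words[wordIndex] &= mask;
    }

    public static void main(String[] args) {
        long[] words = {-1L, -1L};              // 128 bits, all set
        clear(words, 65);                       // bit 1 of the second long
        System.out.println(Long.toHexString(words[0])); // ffffffffffffffff
        System.out.println(Long.toHexString(words[1])); // fffffffffffffffd
        clear(words, 500);                      // out of range: silently ignored
    }
}
```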

c. Clear the bits in a given range

    /**
     * Sets the bits from the specified {@code fromIndex} (inclusive) to the
     * specified {@code toIndex} (exclusive) to {@code false}.
     *
     * @param  fromIndex index of the first bit to be cleared
     * @param  toIndex index after the last bit to be cleared
     * @throws IndexOutOfBoundsException if {@code fromIndex} is negative,
     *         or {@code toIndex} is negative, or {@code fromIndex} is
     *         larger than {@code toIndex}
     * @since  1.4
     */
    public void clear(int fromIndex, int toIndex) {
        checkRange(fromIndex, toIndex);

        if (fromIndex == toIndex)
            return;

        int startWordIndex = wordIndex(fromIndex);
        if (startWordIndex >= wordsInUse)
            return;

        int endWordIndex = wordIndex(toIndex - 1);
        if (endWordIndex >= wordsInUse) {
            toIndex = length();
            endWordIndex = wordsInUse - 1;
        }

        long firstWordMask = WORD_MASK << fromIndex;
        long lastWordMask  = WORD_MASK >>> -toIndex;
        if (startWordIndex == endWordIndex) {
            // Case 1: One word
            words[startWordIndex] &= ~(firstWordMask & lastWordMask);
        } else {
            // Case 2: Multiple words
            // Handle first word
            words[startWordIndex] &= ~firstWordMask;

            // Handle intermediate words, if any
            for (int i = startWordIndex+1; i < endWordIndex; i++)
                words[i] = 0;

            // Handle last word
            words[endWordIndex] &= ~lastWordMask;
        }

        recalculateWordsInUse();
        checkInvariants();
    }

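The approach splits the range into three parts: the start word, the intermediate words, and the stop word.
In the start word, every bit from the start position to the end of that word is set to 0; in the intermediate words, all bits of each long are set to 0; in the stop word, every bit from the beginning of the word up to (but not including) the given end position is set to 0.
The special case is when the start word and the stop word are the same long.
As the code shows, the implementation builds two masks and applies them to the start word and the stop word respectively.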
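The two masks are easier to follow with concrete numbers. The sketch below recomputes them outside of BitSet; `RangeMasks` is a made-up class name, and `WORD_MASK` is assumed to be all ones, as in the JDK source.

```java
class RangeMasks {
    static final long WORD_MASK = 0xffffffffffffffffL;

    /** Ones from bit (fromIndex mod 64) up to bit 63. */
    static long firstWordMask(int fromIndex) {
        return WORD_MASK << fromIndex;
    }

    /** Ones from bit 0 up to, but not including, bit (toIndex mod 64). */
    static long lastWordMask(int toIndex) {
        // -toIndex is reduced mod 64 by the shift, giving 64 - (toIndex mod 64)
        return WORD_MASK >>> -toIndex;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(firstWordMask(4)));   // fffffffffffffff0
        System.out.println(Long.toHexString(lastWordMask(12)));   // fff
        // Single-word case for the range [4, 12): intersect the two masks
        System.out.println(Long.toHexString(firstWordMask(4) & lastWordMask(12))); // ff0
        // A toIndex on a word boundary keeps the whole last word
        System.out.println(Long.toHexString(lastWordMask(64)));   // ffffffffffffffff
    }
}
```

Negating the shift distance works because Java only looks at the low 6 bits of the distance when shifting a long.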

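4. Two important internal maintenance functions

In the code above, most of the mutating methods end with the following two calls:
recalculateWordsInUse();
checkInvariants();
These two functions maintain and check the internal state of the bitset. A closer look at the implementation makes the idea clear: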

    /**
     * Sets the field wordsInUse to the logical size in words of the bit set.
     * WARNING:This method assumes that the number of words actually in use is
     * less than or equal to the current value of wordsInUse!
     */
    private void recalculateWordsInUse() {
        // Traverse the bitset until a used word is found
        int i;
        for (i = wordsInUse-1; i >= 0; i--)
            if (words[i] != 0)
                break;

        wordsInUse = i+1; // The new logical size
    }

wordsInUse records how many longs in the array are actually in use, so words[wordsInUse-1] is the last long that holds a set bit. This value tracks the logical size of the bitset.

    /**
     * Every public method must preserve these invariants.
     */
    private void checkInvariants() {
        assert(wordsInUse == 0 || words[wordsInUse - 1] != 0);
        assert(wordsInUse >= 0 && wordsInUse <= words.length);
        assert(wordsInUse == words.length || words[wordsInUse] == 0);
    }

checkInvariants evidently checks the internal state, in particular whether wordsInUse is valid.
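wordsInUse is private, but its effect is visible through `length()`: once the highest set bit is cleared, the logical length shrinks while the allocated storage stays in place. The sketch below illustrates this; `LogicalSize` is a made-up class name, and the exact value printed by `size()` depends on the growth policy of the implementation.

```java
import java.util.BitSet;

class LogicalSize {
    /** Returns {length before clear, length after clear, storage size in bits}. */
    static int[] demo() {
        BitSet bits = new BitSet();
        bits.set(5);
        bits.set(200);
        int lengthBefore = bits.length();   // 201: highest set bit is 200
        bits.clear(200);                    // the top words become all zero
        int lengthAfter = bits.length();    // 6: highest set bit is now 5
        return new int[] {lengthBefore, lengthAfter, bits.size()};
    }

    public static void main(String[] args) {
        int[] r = demo();
        System.out.println(r[0]);  // 201
        System.out.println(r[1]);  // 6
        System.out.println(r[2]);  // storage is not released by clear(int)
    }
}
```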

5. Flipping a given bit

Flipping turns 1 into 0 and 0 into 1; it is an XOR with 1.

    /**
     * Sets the bit at the specified index to the complement of its
     * current value.
     *
     * @param  bitIndex the index of the bit to flip
     * @throws IndexOutOfBoundsException if the specified index is negative
     * @since  1.4
     */
    public void flip(int bitIndex) {
        if (bitIndex < 0)
            throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

        int wordIndex = wordIndex(bitIndex);
        expandTo(wordIndex);

        words[wordIndex] ^= (1L << bitIndex);

        recalculateWordsInUse();
        checkInvariants();
    }

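Flipping also takes the same two basic steps: find the corresponding long, then build the mask and XOR it with the target bit.
int wordIndex = wordIndex(bitIndex);
words[wordIndex] ^= (1L << bitIndex);
Notice that before the operation, the function expandTo(wordIndex) is called. It makes sure the bitset actually contains the corresponding long; if it does not, the long array is expanded. The growth policy allocates the larger of twice the current length and the required length.
The code is as follows: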

    /**
     * Ensures that the BitSet can accommodate a given wordIndex,
     * temporarily violating the invariants.  The caller must
     * restore the invariants before returning to the user,
     * possibly using recalculateWordsInUse().
     * @param wordIndex the index to be accommodated.
     */
    private void expandTo(int wordIndex) {
        int wordsRequired = wordIndex+1;
        if (wordsInUse < wordsRequired) {
            ensureCapacity(wordsRequired);
            wordsInUse = wordsRequired;
        }
    }

    /**
     * Ensures that the BitSet can hold enough words.
     * @param wordsRequired the minimum acceptable number of words.
     */
    private void ensureCapacity(int wordsRequired) {
        if (words.length < wordsRequired) {
            // Allocate larger of doubled size or required size
            int request = Math.max(2 * words.length, wordsRequired);
            words = Arrays.copyOf(words, request);
            sizeIsSticky = false;
        }
    }
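The growth policy can be watched from outside through `size()`. The numbers in the comments below follow from the `ensureCapacity` code shown above and therefore assume the OpenJDK implementation; `GrowthDemo` is a made-up class name.

```java
import java.util.BitSet;

class GrowthDemo {
    /** Returns the storage size in bits after each step. */
    static int[] sizes() {
        BitSet bits = new BitSet();   // starts with 1 word
        int s0 = bits.size();
        bits.flip(64);                // needs 2 words:  max(2 * 1, 2)   = 2
        int s1 = bits.size();
        bits.flip(1000);              // needs 16 words: max(2 * 2, 16)  = 16
        int s2 = bits.size();
        bits.flip(1024);              // needs 17 words: max(2 * 16, 17) = 32
        int s3 = bits.size();
        return new int[] {s0, s1, s2, s3};
    }

    public static void main(String[] args) {
        for (int s : sizes()) {
            System.out.println(s);    // 64, 128, 1024, 2048
        }
    }
}
```

The last step shows the doubling: only one extra word is needed, yet the array grows from 16 to 32 words.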

Likewise, a range flip is provided, implemented in much the same way as the range clear. The code is as follows:

    public void flip(int fromIndex, int toIndex) {
        checkRange(fromIndex, toIndex);

        if (fromIndex == toIndex)
            return;

        int startWordIndex = wordIndex(fromIndex);
        int endWordIndex   = wordIndex(toIndex - 1);
        expandTo(endWordIndex);

        long firstWordMask = WORD_MASK << fromIndex;
        long lastWordMask  = WORD_MASK >>> -toIndex;
        if (startWordIndex == endWordIndex) {
            // Case 1: One word
            words[startWordIndex] ^= (firstWordMask & lastWordMask);
        } else {
            // Case 2: Multiple words
            // Handle first word
            words[startWordIndex] ^= firstWordMask;

            // Handle intermediate words, if any
            for (int i = startWordIndex+1; i < endWordIndex; i++)
                words[i] ^= WORD_MASK;

            // Handle last word
            words[endWordIndex] ^= lastWordMask;
        }

        recalculateWordsInUse();
        checkInvariants();
    }

6. Setting a given bit (OR operation)

    /**
     * Sets the bit at the specified index to {@code true}.
     *
     * @param  bitIndex a bit index
     * @throws IndexOutOfBoundsException if the specified index is negative
     * @since  JDK1.0
     */
    public void set(int bitIndex) {
        if (bitIndex < 0)
            throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

        int wordIndex = wordIndex(bitIndex);
        expandTo(wordIndex);

        words[wordIndex] |= (1L << bitIndex); // Restores invariants

        checkInvariants();
    }

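The idea is the same as flip, except that the operation is an OR with 1.
The JDK also provides an operation that sets a bit explicitly to 0 or 1, as well as operations that set a whole range.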

    public void set(int bitIndex, boolean value) {
        if (value)
            set(bitIndex);
        else
            clear(bitIndex);
    }

7. Getting the state of a given position

    /**
     * Returns the value of the bit with the specified index. The value
     * is {@code true} if the bit with the index {@code bitIndex}
     * is currently set in this {@code BitSet}; otherwise, the result
     * is {@code false}.
     *
     * @param  bitIndex   the bit index
     * @return the value of the bit with the specified index
     * @throws IndexOutOfBoundsException if the specified index is negative
     */
    public boolean get(int bitIndex) {
        if (bitIndex < 0)
            throw new IndexOutOfBoundsException("bitIndex < 0: " + bitIndex);

        checkInvariants();

        int wordIndex = wordIndex(bitIndex);
        return (wordIndex < wordsInUse)
            && ((words[wordIndex] & (1L << bitIndex)) != 0);
    }

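The same two steps again; here the bit operation is &. Note that if the requested bit lies beyond the words in use, the result is false, meaning the bit is not set.
The JDK also provides a method that fetches a given range as a bitset. The return value is a new BitSet containing only the queried bits. Its size is just large enough to hold the required bits (rounded up to a whole number of longs, of course). The code is as follows: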

    public BitSet get(int fromIndex, int toIndex) {
        checkRange(fromIndex, toIndex);

        checkInvariants();

        int len = length();

        // If no set bits in range return empty bitset
        if (len <= fromIndex || fromIndex == toIndex)
            return new BitSet(0);

        // An optimization
        if (toIndex > len)
            toIndex = len;

        BitSet result = new BitSet(toIndex - fromIndex);
        int targetWords = wordIndex(toIndex - fromIndex - 1) + 1;
        int sourceIndex = wordIndex(fromIndex);
        boolean wordAligned = ((fromIndex & BIT_INDEX_MASK) == 0);

        // Process all words but the last word
        for (int i = 0; i < targetWords - 1; i++, sourceIndex++)
            result.words[i] = wordAligned ? words[sourceIndex] :
                (words[sourceIndex] >>> fromIndex) |
                (words[sourceIndex+1] << -fromIndex);

        // Process the last word
        long lastWordMask = WORD_MASK >>> -toIndex;
        result.words[targetWords - 1] =
            ((toIndex-1) & BIT_INDEX_MASK) < (fromIndex & BIT_INDEX_MASK)
            ? /* straddles source words */
            ((words[sourceIndex] >>> fromIndex) |
             (words[sourceIndex+1] & lastWordMask) << -fromIndex)
            :
            ((words[sourceIndex] & lastWordMask) >>> fromIndex);

        // Set wordsInUse correctly
        result.wordsInUse = targetWords;
        result.recalculateWordsInUse();
        result.checkInvariants();

        return result;
    }

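There is a tricky part here: the bit at fromIndex ends up at position 0 of the returned bitset, the next bit at position 1, and so on. If fromIndex is not word-aligned, the first word of the returned bitset is assembled from two source words: the bits from fromIndex up to the end of the word that contains it, followed by the low bits of the next source word, which together fill exactly one word.
Here >>> is the unsigned (logical) right shift operator.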
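The re-basing of indexes and the difference between `>>>` and `>>` can both be checked with a few lines. This is a small usage sketch; `SubRange` is a made-up class name.

```java
import java.util.BitSet;

class SubRange {
    /** Extracts the range [68, 80) from a bitset with bits 70, 75 and 90 set. */
    static BitSet demo() {
        BitSet bits = new BitSet();
        bits.set(70);
        bits.set(75);
        bits.set(90);                 // outside the requested range
        return bits.get(68, 80);      // bit 68 of the source becomes bit 0 of the result
    }

    public static void main(String[] args) {
        System.out.println(demo());       // {2, 7}
        System.out.println(-1L >>> 60);   // 15: zeros are shifted in from the left
        System.out.println(-1L >> 60);    // -1: the sign bit is replicated
    }
}
```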

8. Getting the total number of bits currently in the bitset

    /**
     * Returns the "logical size" of this {@code BitSet}: the index of
     * the highest set bit in the {@code BitSet} plus one. Returns zero
     * if the {@code BitSet} contains no set bits.
     *
     * @return the logical size of this {@code BitSet}
     * @since  1.2
     */
    public int length() {
        if (wordsInUse == 0)
            return 0;

        return BITS_PER_WORD * (wordsInUse - 1) +
            (BITS_PER_WORD - Long.numberOfLeadingZeros(words[wordsInUse - 1]));
    }
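The formula in length() can be worked through by hand on a two-word array. The sketch below reimplements it on a plain `long[]`; `LengthFormula` is a made-up class name, and the `wordsInUse` argument stands in for the private field.

```java
class LengthFormula {
    /** Index of the highest set bit plus one, given the number of words in use. */
    static int length(long[] words, int wordsInUse) {
        if (wordsInUse == 0) {
            return 0;
        }
        // Full words below the last one, plus the used part of the last word
        return 64 * (wordsInUse - 1)
            + (64 - Long.numberOfLeadingZeros(words[wordsInUse - 1]));
    }

    public static void main(String[] args) {
        long[] words = {0L, 1L << 8};   // only bit 72 (64 + 8) is set
        // numberOfLeadingZeros(1L << 8) is 55, so the result is 64 + (64 - 55)
        System.out.println(length(words, 2)); // 73
    }
}
```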

9. hashCode

The hash code is a very important property that characterizes the content of a data structure. BitSet computes its hash code as follows:

    /**
     * Returns the hash code value for this bit set. The hash code depends
     * only on which bits are set within this {@code BitSet}.
     * Note that the hash code changes if the set of bits is altered.
     *
     * @return the hash code value for this bit set
     */
    public int hashCode() {
        long h = 1234;
        for (int i = wordsInUse; --i >= 0; )
            h ^= words[i] * (i + 1);

        return (int)((h >> 32) ^ h);
    }

This hash code takes into account both the value of every word and the position of that word. When the state of a bit changes, the hash code changes with it.
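To confirm the reading of the algorithm, the same computation can be repeated on the array returned by `toLongArray()` (available since Java 7) and compared with `hashCode()`. `HashDemo` and its `hash` method are made-up names for this sketch.

```java
import java.util.BitSet;

class HashDemo {
    /** Recomputes the BitSet hash from the words in use. */
    static int hash(long[] words) {
        long h = 1234;
        for (int i = words.length; --i >= 0; ) {
            h ^= words[i] * (i + 1);      // weight each word by its 1-based position
        }
        return (int) ((h >> 32) ^ h);     // fold the high half into the low half
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(1);
        bits.set(70);
        bits.set(300);
        // toLongArray() returns only the words in use, matching the loop bound above
        System.out.println(hash(bits.toLongArray()) == bits.hashCode()); // true
    }
}
```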