HashMap's hash algorithm explained, and three commonly used Maps: HashMap, LinkedHashMap, TreeMap

Source: Internet | Published by: 家装招标软件 | Editor: 程序博客网 | Time: 2024/05/20 23:37

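Why is the HashMap key hashed?

HashMap keeps its entries in an array (each element is an Entry<K,V>) that is stored contiguously in memory. To speed up lookups, the hash computed from the key is used to derive the array index at which the entry is stored.

How is the hash computed, and what happens when hashes collide?

Look at the put method (the listing matches the JDK 7 implementation):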

public V put(K key, V value) {
    if (table == EMPTY_TABLE) {
        inflateTable(threshold);
    }
    if (key == null)
        return putForNullKey(value);
    int hash = hash(key);
    int i = indexFor(hash, table.length);
    for (Entry<K,V> e = table[i]; e != null; e = e.next) {
        Object k;
        if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
            V oldValue = e.value;
            e.value = value;
            e.recordAccess(this);
            return oldValue;
        }
    }

    modCount++;
    addEntry(hash, key, value, i);
    return null;
}

1. First, the hash function is called:

final int hash(Object k) {
    int h = hashSeed;
    if (0 != h && k instanceof String) {
        return sun.misc.Hashing.stringHash32((String) k);
    }

    h ^= k.hashCode();

    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

2. The indexFor function then maps the hash to an array index.

static int indexFor(int h, int length) {
    // assert Integer.bitCount(length) == 1 : "length must be a non-zero power of 2";
    return h & (length-1);
}

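3. Each array element is stored as an Entry<K,V>, which holds the actual key, the value, and the hash computed for the key.

If an entry with the same key is found, that is, if (e.hash == hash && ((k = e.key) == key || key.equals(k))), its value is overwritten and the old value is returned.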

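4. If no entry with the same key is found, addEntry inserts a new one (the index is recomputed first if the insertion triggers a resize). If that array slot is already occupied, the new Entry<K,V>'s next pointer is set to the existing Entry<K,V>, and the new Entry<K,V> is stored in the array as the head of the chain.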
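Steps 1 and 2 can be tried in isolation. The sketch below copies the bit-mixing part of hash (leaving out the hashSeed and String branch) and indexFor into a hypothetical class named HashSpreadDemo, then feeds it two hash codes that differ only above the low four bits. It is an illustration under those assumptions, not the JDK code path itself.

```java
final class HashSpreadDemo {
    // Bit-mixing step from the hash listing above (hashSeed assumed to be 0).
    static int spread(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Same as indexFor: only valid when length is a power of two.
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        int a = 0x10000;
        int b = 0x20000; // a and b differ only in bits that the mask discards

        // Without mixing, both land in slot 0.
        System.out.println(indexFor(a, length) + " " + indexFor(b, length)); // prints 0 0

        // After mixing, the high bits influence the low bits.
        System.out.println(indexFor(spread(a), length) + " "
                + indexFor(spread(b), length)); // prints 1 2

        // For a power-of-two length, the mask equals a non-negative modulo.
        System.out.println(indexFor(-7, length) == Math.floorMod(-7, length)); // prints true
    }
}
```

The mask is why the table length has to be a power of two: with any other length, some slots could never be selected.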

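This raises a concern: when hash collisions are severe, lookups stop behaving like array indexing and degrade toward linked-list traversal.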
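The degradation is easy to reproduce with a stripped-down chained table. The following is a minimal sketch, assuming a fixed 16-slot array, no resizing, no null keys, and no bit mixing; ChainDemo, TinyMap, and BadKey are hypothetical names invented for the illustration.

```java
final class ChainDemo {
    // One node of a bucket chain: cached hash, key, value, link to the next node.
    static final class Entry<K, V> {
        final int hash;
        final K key;
        V value;
        Entry<K, V> next;

        Entry(int hash, K key, V value, Entry<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    static final class TinyMap<K, V> {
        @SuppressWarnings("unchecked")
        private final Entry<K, V>[] table = (Entry<K, V>[]) new Entry[16];

        V put(K key, V value) {
            int hash = key.hashCode();
            int i = hash & (table.length - 1);
            // Same key already present: overwrite and return the old value.
            for (Entry<K, V> e = table[i]; e != null; e = e.next) {
                if (e.hash == hash && (e.key == key || key.equals(e.key))) {
                    V old = e.value;
                    e.value = value;
                    return old;
                }
            }
            // Otherwise the new entry becomes the head and points at the old head.
            table[i] = new Entry<>(hash, key, value, table[i]);
            return null;
        }

        int chainLength(int i) {
            int n = 0;
            for (Entry<K, V> e = table[i]; e != null; e = e.next) {
                n++;
            }
            return n;
        }
    }

    // Hypothetical key whose hashCode is constant, so every instance collides.
    static final class BadKey {
        final int id;

        BadKey(int id) {
            this.id = id;
        }

        @Override
        public int hashCode() {
            return 42;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
    }

    // Inserts `keys` colliding keys and returns the length of their shared chain.
    static int worstChain(int keys) {
        TinyMap<BadKey, Integer> map = new TinyMap<>();
        for (int k = 0; k < keys; k++) {
            map.put(new BadKey(k), k);
        }
        return map.chainLength(42 & 15);
    }
}
```

With 100 such keys, worstChain(100) returns 100: every lookup for one of them walks a single list, which is the linked-list behavior described above.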

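To keep HashMap performing well, its constructor takes two parameters that let you tune it to the workload.

1. initialCapacity: the initial size of the array (the bucket array); the default is 16.

2. float loadFactor: the load factor; the default is 0.75.

(1) HashMap keeps (number of elements / array size) at or below the load factor, so as elements are added the array is forced to grow. If the load factor is too large, index collisions after hash and indexFor become more likely, and buckets degrade into linked lists.

(2) HashMap has an important internal variable, threshold = current array capacity * load factor. When an insertion finds that the element count has reached threshold, the array is resized. Each resize doubles the capacity, up to a maximum capacity of 1 << 30, after which threshold is set to Integer.MAX_VALUE. So if the load factor is too small, resizes happen more often.

For better performance, choose initialCapacity and loadFactor to suit the workload.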
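The trade-off can be put into numbers with a simplified model. The sketch below assumes that the table doubles whenever capacity * loadFactor is smaller than the number of elements to hold, and ignores the finer conditions in the real addEntry. SizingDemo, resizesFor, and capacityFor are hypothetical names.

```java
final class SizingDemo {
    private static final int MAX_CAPACITY = 1 << 30;

    // Number of doublings needed before capacity * loadFactor can hold n elements.
    static int resizesFor(int n, int initialCapacity, float loadFactor) {
        int capacity = initialCapacity;
        int resizes = 0;
        while (capacity < MAX_CAPACITY && (long) (capacity * loadFactor) < n) {
            capacity *= 2;
            resizes++;
        }
        return resizes;
    }

    // Smallest power-of-two capacity whose threshold holds n elements.
    static int capacityFor(int n, float loadFactor) {
        int capacity = 1;
        while (capacity < MAX_CAPACITY && (long) (capacity * loadFactor) < n) {
            capacity <<= 1;
        }
        return capacity;
    }

    public static void main(String[] args) {
        System.out.println(resizesFor(1000, 16, 0.75f));   // prints 7
        System.out.println(resizesFor(1000, 16, 0.25f));   // prints 8
        System.out.println(resizesFor(1000, 2048, 0.75f)); // prints 0
        System.out.println(capacityFor(1000, 0.75f));      // prints 2048
    }
}
```

Under this model, a map expected to hold about 1000 entries avoids every resize if it is created as new HashMap<>(2048, 0.75f), at the cost of a larger array from the start.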

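An important ordered Map: LinkedHashMap (ordered by access when accessOrder=true, by insertion when accessOrder=false)

This Map extends HashMap and additionally maintains a doubly linked list that records the order of its entries.

Note: with map = new LinkedHashMap<String, String>(16, 0.75f, true); you must not call map.get() while iterating over the map. In access-order mode, get() moves the accessed entry to the end of the list, which counts as a structural modification. No collection may be structurally modified during iteration, so the iterator throws ConcurrentModificationException.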
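A small experiment shows both orderings and the failure mode. This is a sketch assuming Java 8 or later; AccessOrderDemo and its methods are hypothetical names, while the LinkedHashMap constructor is the three-argument one mentioned above.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AccessOrderDemo {
    static Map<String, String> sample(boolean accessOrder) {
        Map<String, String> map = new LinkedHashMap<>(16, 0.75f, accessOrder);
        map.put("a", "1");
        map.put("b", "2");
        map.put("c", "3");
        return map;
    }

    // Key order after reading "a" once.
    static List<String> keysAfterGet(boolean accessOrder) {
        Map<String, String> map = sample(accessOrder);
        map.get("a");
        return new ArrayList<>(map.keySet());
    }

    // True if calling get() inside a keySet() loop makes the iterator fail fast.
    static boolean getWhileIteratingFails(boolean accessOrder) {
        Map<String, String> map = sample(accessOrder);
        try {
            for (String key : map.keySet()) {
                map.get(key); // reorders the list when accessOrder is true
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Safe alternative: entrySet() exposes values without calling get().
    static String valuesViaEntrySet(boolean accessOrder) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : sample(accessOrder).entrySet()) {
            sb.append(e.getValue());
        }
        return sb.toString();
    }
}
```

keysAfterGet(false) keeps [a, b, c], while keysAfterGet(true) yields [b, c, a]. Iterating over entrySet() and reading getValue() avoids the exception because it never goes through get().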

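A sorted Map: TreeMap. If the key type does not already implement Comparable (as String and the primitive wrapper classes do), either pass a Comparator to the TreeMap constructor or make the key class implement the Comparable interface. Otherwise a ClassCastException is thrown.

TreeMap is backed by a red-black tree, a self-balancing binary search tree that is less strictly balanced than an AVL tree and so usually needs fewer rotations on update. Lookup, insertion, and deletion all run in O(log n) time.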
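The ordering requirement can be checked directly. The sketch below assumes Java 8 or later (for Comparator.comparingInt); TreeMapDemo and Point are hypothetical names, and Point deliberately does not implement Comparable.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

final class TreeMapDemo {
    // Hypothetical key type with no natural ordering.
    static final class Point {
        final int x;

        Point(int x) {
            this.x = x;
        }
    }

    // True if a TreeMap without a Comparator rejects Point keys.
    static boolean putWithoutOrderingFails() {
        TreeMap<Point, String> map = new TreeMap<>();
        try {
            map.put(new Point(2), "two");
            map.put(new Point(1), "one");
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    // With a Comparator supplied, keys come back in sorted order.
    static List<Integer> sortedXs() {
        TreeMap<Point, String> map =
                new TreeMap<>(Comparator.comparingInt((Point p) -> p.x));
        map.put(new Point(3), "three");
        map.put(new Point(1), "one");
        map.put(new Point(2), "two");
        List<Integer> xs = new ArrayList<>();
        for (Point p : map.keySet()) {
            xs.add(p.x);
        }
        return xs;
    }
}
```

Here sortedXs() returns [1, 2, 3] regardless of insertion order, and putWithoutOrderingFails() returns true.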

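Commonly used hash functions

The general-purpose hash function library contains the following string hash algorithms, which mix addition with bit-shift operations. They differ in usage and behavior, but all of them serve as examples for learning how hash algorithms are implemented. (For implementations in other languages, see the download.)

1.RS

Taken from Robert Sedgewick's book Algorithms in C. I (the original author) added some simple optimizations to speed up the hashing process.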

public long RSHash(String str)
{
    int b = 378551;
    int a = 63689;
    long hash = 0;
    for(int i = 0; i < str.length(); i++)
    {
        hash = hash * a + str.charAt(i);
        a = a * b;
    }
    return hash;
}

2.JS

A bitwise hash function written by Justin Sobel.

public long JSHash(String str)
{
    long hash = 1315423911;
    for(int i = 0; i < str.length(); i++)
    {
        hash ^= ((hash << 5) + str.charAt(i) + (hash >> 2));
    }
    return hash;
}

3.PJW

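This hash algorithm is based on work by Peter J. Weinberger of Bell Labs. The book Compilers (Principles, Techniques, and Tools) recommends the hash function used in this algorithm.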

public long PJWHash(String str)
{
    long BitsInUnsignedInt = (long)(4 * 8);
    long ThreeQuarters = (long)((BitsInUnsignedInt * 3) / 4);
    long OneEighth = (long)(BitsInUnsignedInt / 8);
    // (long)(0xFFFFFFFF) is -1L, so this mask covers every bit from bit 28 upward.
    long HighBits = (long)(0xFFFFFFFF) << (BitsInUnsignedInt - OneEighth);
    long hash = 0;
    long test = 0;
    for(int i = 0; i < str.length(); i++)
    {
        hash = (hash << OneEighth) + str.charAt(i);
        if((test = hash & HighBits) != 0)
        {
            // Fold the overflowing high bits back in, then clear them.
            hash = (( hash ^ (test >> ThreeQuarters)) & (~HighBits));
        }
    }
    return hash;
}

4.ELF

Very similar to PJW; widely used on Unix systems.

public long ELFHash(String str)
{
    long hash = 0;
    long x = 0;
    for(int i = 0; i < str.length(); i++)
    {
        hash = (hash << 4) + str.charAt(i);
        if((x = hash & 0xF0000000L) != 0)
        {
            hash ^= (x >> 24);
        }
        hash &= ~x;
    }
    return hash;
}

5.BKDR

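This algorithm comes from The C Programming Language by Brian Kernighan and Dennis Ritchie. It is a very simple hash algorithm that uses an odd-looking family of seeds of the form 31, 131, 1313, 13131, 131313 and so on, and it looks very similar to the DJB algorithm. (As noted in one of my earlier posts, with seed 31 this is Java's string hash function.)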

public long BKDRHash(String str)
{
    long seed = 131; // 31 131 1313 13131 131313 etc..
    long hash = 0;
    for(int i = 0; i < str.length(); i++)
    {
        hash = (hash * seed) + str.charAt(i);
    }
    return hash;
}
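The link to Java's own string hash can be checked with a few lines. This sketch swaps the seed to 31 and accumulates in a 32-bit int so that overflow wraps the same way as in String.hashCode(); BkdrDemo and bkdr31 are hypothetical names, and the listing above (seed 131, long accumulator) produces different values.

```java
final class BkdrDemo {
    // BKDR with seed 31, accumulated in an int so it wraps like String.hashCode().
    static int bkdr31(String str) {
        int hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = hash * 31 + str.charAt(i);
        }
        return hash;
    }

    public static void main(String[] args) {
        String s = "hello";
        System.out.println(bkdr31(s) == s.hashCode()); // prints true
    }
}
```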

6.SDBM

This algorithm is used in the open-source SDBM project and seems to give a good distribution across many different kinds of data.

public long SDBMHash(String str)
{
    long hash = 0;
    for(int i = 0; i < str.length(); i++)
    {
        hash = str.charAt(i) + (hash << 6) + (hash << 16) - hash;
    }
    return hash;
}

7.DJB

This algorithm was invented by Professor Daniel J. Bernstein and is described as one of the most efficient hash functions published to date.

public long DJBHash(String str)
{
    long hash = 5381;
    for(int i = 0; i < str.length(); i++)
    {
        // (hash << 5) + hash is hash * 33
        hash = ((hash << 5) + hash) + str.charAt(i);
    }
    return hash;
}

8.DEK

Given by the great Donald E. Knuth in The Art of Computer Programming, Volume 3 (Sorting and Searching), Chapter 6.

public long DEKHash(String str)
{
    long hash = str.length();
    for(int i = 0; i < str.length(); i++)
    {
        hash = ((hash << 5) ^ (hash >> 27)) ^ str.charAt(i);
    }
    return hash;
}

9.AP

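A hash function contributed by Arash Partow, the author of the original article. It borrows ideas from the functions above to form a hybrid of rotating (shifting) and additive hashing.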

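public long APHash(String str)
{
    // 0xAAAAAAAA is an int literal; widened to long it becomes 0xFFFFFFFFAAAAAAAAL.
    long hash = 0xAAAAAAAA;
    for(int i = 0; i < str.length(); i++)
    {
        // Even and odd positions use different mixing steps.
        if ((i & 1) == 0)
        {
            hash ^= ((hash << 7) ^ str.charAt(i) * (hash >> 3));
        }
        else
        {
            hash ^= (~((hash << 11) + str.charAt(i) ^ (hash >> 5)));
        }
    }
    return hash;
}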

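A comparison of these algorithms has been published and is worth a quick look, and you can also run a simple test yourself. In my own tests during a VSM experiment, the algorithms showed no significant performance difference, possibly because the data set was small.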
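One simple way to run such a test is to count how many keys each function sends to each bucket. The sketch below is only a starting point under stated assumptions: the keys are synthetic strings of the form "key0", "key1", and so on, the bucket count is arbitrary, only the BKDR and DJB functions from above are included, and BucketHistogram with its methods is a hypothetical name. It says nothing about speed.

```java
import java.util.function.ToLongFunction;

final class BucketHistogram {
    static long bkdr(String str) {
        long hash = 0;
        for (int i = 0; i < str.length(); i++) {
            hash = (hash * 131) + str.charAt(i);
        }
        return hash;
    }

    static long djb(String str) {
        long hash = 5381;
        for (int i = 0; i < str.length(); i++) {
            hash = ((hash << 5) + hash) + str.charAt(i);
        }
        return hash;
    }

    // Counts how many of the synthetic keys fall into each bucket.
    static int[] histogram(ToLongFunction<String> fn, int keys, int buckets) {
        int[] counts = new int[buckets];
        for (int k = 0; k < keys; k++) {
            long h = fn.applyAsLong("key" + k);
            // floorMod keeps the index non-negative even if the hash overflowed.
            counts[(int) Math.floorMod(h, (long) buckets)]++;
        }
        return counts;
    }

    static int max(int[] counts) {
        int best = 0;
        for (int c : counts) {
            best = Math.max(best, c);
        }
        return best;
    }

    public static void main(String[] args) {
        int keys = 10000;
        int buckets = 64;
        System.out.println("BKDR fullest bucket: "
                + max(histogram(BucketHistogram::bkdr, keys, buckets)));
        System.out.println("DJB fullest bucket:  "
                + max(histogram(BucketHistogram::djb, keys, buckets)));
    }
}
```

A perfectly even spread would put keys / buckets entries in every bucket, so the closer the fullest bucket is to that figure, the better the function distributes this particular key set.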

0 0