Java 8 HashMap Source Code Analysis

Source: Internet | Published by: python tolist | Editor: 程序博客网 | Time: 2024/06/09 13:53

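Introduction

Features

HashMap stores data according to the hashCode of the key. In most cases it can locate a value directly, so access is very fast, but the iteration order is not guaranteed.
HashMap allows at most one entry whose key is null, and any number of entries whose value is null.
HashMap is not thread-safe: nothing prevents several threads from writing to it at the same time, which can leave the data inconsistent. If thread safety is required, wrap the HashMap with Collections.synchronizedMap, or use ConcurrentHashMap.
Keys in the map should be immutable objects, that is, objects whose hash value cannot change after creation. If a key's hash value changes after insertion, the map will most likely no longer be able to locate its entry.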
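The last point can be illustrated with a small sketch. `MutableKey` is a hypothetical class introduced only for this example; it is not part of the JDK or of the original article. Its hashCode depends on a mutable field, so changing that field after insertion makes the entry unreachable.

```java
import java.util.HashMap;
import java.util.Map;

final class MutableKeyDemo {
    // Hypothetical key type whose hashCode depends on a mutable field.
    static final class MutableKey {
        int id;

        MutableKey(int id) {
            this.id = id;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof MutableKey && ((MutableKey) o).id == id;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(id);
        }
    }

    // Returns true if the entry can still be found after the key was mutated.
    static boolean foundAfterMutation() {
        Map<MutableKey, String> map = new HashMap<>();
        MutableKey key = new MutableKey(1);
        map.put(key, "v");
        key.id = 2; // the hash changes, but the node stays in its old bucket
        return map.containsKey(key);
    }

    public static void main(String[] args) {
        System.out.println(foundAfterMutation()); // prints false
    }
}
```

The entry is still in the map (size() remains 1), but a lookup now probes a different bucket and finds nothing.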

Internal Implementation

Several Important Fields

transient Node<K, V>[] table;
int threshold;
final float loadFactor;
transient int size;
transient int modCount;
static final int TREEIFY_THRESHOLD = 8;

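table
- The hash bucket array.
- Its default initial length is 16, and the length must always be a power of two (a composite number).
- The conventional design makes the length a prime number to reduce the probability of hash collisions. HashMap uses a power of two instead so that the index computation and resizing can be optimized; to keep collisions low it also mixes the high bits of the hash code into the low bits (see hash() below).
loadFactor
- The load factor: the ratio of the number of elements to the table length that is allowed before the table is resized.
- The default value is 0.75.
threshold
- The maximum number of Nodes (key-value pairs) the HashMap can hold before it resizes.
- Formula: threshold = table.length * loadFactor. In other words, threshold is the maximum number of elements allowed for the current load factor and array length. Once size exceeds it, the table is resized, and the new capacity is twice the old one.
- If memory is plentiful and time efficiency matters a lot, lower loadFactor.
- If memory is tight and time efficiency matters less, raise loadFactor; it may even be greater than 1.
size
- The number of key-value pairs actually stored in the HashMap.
- Note the difference from table.length and threshold.
modCount
- Counts how many times the internal structure of the HashMap has changed. Overwriting the value of an existing key is not a structural change.
- Used to make iteration fail fast.
TREEIFY_THRESHOLD
- The chain length threshold at which a linked list is converted to a red-black tree.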
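As a rough illustration of how modCount supports fail-fast iteration, the sketch below inserts a new key while iterating over the key set. The class and method names are made up for this example. The iterator remembers the modCount it saw when it was created and throws ConcurrentModificationException when it notices a change.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Map;

final class FailFastDemo {
    // Returns true if a structural modification during iteration
    // made the iterator throw ConcurrentModificationException.
    static boolean iterationFailsFast() {
        Map<Integer, String> map = new HashMap<>();
        for (int i = 0; i < 4; i++) {
            map.put(i, "v");
        }
        try {
            for (Integer k : map.keySet()) {
                // Inserting a new key is a structural change and bumps modCount.
                map.put(100 + k, "new");
            }
        } catch (ConcurrentModificationException e) {
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(iterationFailsFast()); // prints true
    }
}
```

Fail-fast behavior is a best-effort bug detector, not a synchronization mechanism, so a program should not depend on the exception being thrown.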

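Storage Structure

Structurally, HashMap is implemented as an array plus linked lists plus red-black trees.

The source code shows that the HashMap class has one very important field, Node[] table. This is the hash bucket array shown in the original figure, and it is an array of Node.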

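static class Node<K, V> implements Map.Entry<K, V> {
    final int hash; // used to locate the index in the array
    final K key;
    V value;
    Node<K, V> next; // next node in the same bucket
    // constructor and Map.Entry methods omitted
}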

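Node is a static nested class of HashMap that implements the Map.Entry interface. It is essentially one mapping (a key-value pair); each black dot in the original figure is a Node object.

HashMap stores its data in a hash table. A hash table can resolve collisions with open addressing or with separate chaining, among other methods; Java's HashMap uses separate chaining. Put simply, separate chaining combines an array with linked lists: every array slot has a linked list, and when an entry is hashed to an array index, it is placed in the list at that index.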
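Separate chaining as described above can be sketched in a few lines. This is a deliberately simplified toy, not HashMap's code: `ChainedTable` is a hypothetical class with a fixed number of buckets, String keys, int values, no resizing and no trees, and it prepends new nodes to a chain whereas JDK 8's HashMap appends them.

```java
final class ChainedTable {
    // One node of a bucket's singly linked list.
    private static final class Entry {
        final String key;
        int value;
        Entry next;

        Entry(String key, int value, Entry next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // Fixed power-of-two length, so the index can be computed with a mask.
    private final Entry[] buckets = new Entry[8];

    private int indexOf(String key) {
        return key.hashCode() & (buckets.length - 1);
    }

    void put(String key, int value) {
        int i = indexOf(key);
        for (Entry e = buckets[i]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                e.value = value; // existing key: overwrite
                return;
            }
        }
        buckets[i] = new Entry(key, value, buckets[i]); // new key: prepend to the chain
    }

    Integer get(String key) {
        for (Entry e = buckets[indexOf(key)]; e != null; e = e.next) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null; // no mapping
    }
}
```

The strings "Aa" and "BB" have the same hashCode in Java, so putting both exercises the chain: they land in the same bucket and are told apart by equals().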

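If the bucket array is very large, even a poor hash function spreads the entries fairly well; if the bucket array is very small, even a good hash function produces many collisions. There is therefore a trade-off between space cost and time cost: the bucket array is grown or shrunk according to actual needs, and on top of that a well-designed hash function keeps collisions low.

However sensible the load factor and the hash function are, chains can still become too long, and an overly long chain seriously hurts HashMap's performance. When a chain grows beyond TREEIFY_THRESHOLD (8 by default), it is converted to a red-black tree, provided the table has at least MIN_TREEIFY_CAPACITY (64) buckets; with a smaller table, HashMap resizes instead. The red-black tree's fast insertion, deletion, and lookup then improve HashMap's performance, using the usual red-black tree insert, delete, and search algorithms.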

Core Method Analysis

Computing the Bucket Index from the Key

/**
 * Computes the hash value from the key.
 */
static final int hash(Object key) {
    int h;
    // h = key.hashCode()   step 1: take the key's hashCode
    // h ^ (h >>> 16)       step 2: XOR the high 16 bits of the hash into its low 16 bits
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

/**
 * Computes the key's index in the table from the hash value and the array length.
 * JDK 8 has no such method; it computes hash & (length - 1) inline.
 */
private static int indexFor(int hash, int length) {
    return hash & (length - 1);
}

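Whether a key-value pair is being added, removed, or looked up, locating the slot in the bucket array is the crucial first step. For any given objects, equal hashCode values always yield equal results from hash(). The usual way to turn a hash value into an array index is to take it modulo the array length, but the modulo operation is comparatively expensive. HashMap computes the index with indexFor() instead (JDK 7 has this method; JDK 8 performs the same computation inline).

indexFor() is quite clever: hash & (length - 1) gives the slot where the object is stored. Because the length of HashMap's underlying array is always a power of two, hash & (length - 1) is equivalent to hash modulo length, and & is more efficient than %. For a negative hash the result equals Math.floorMod(hash, length), not hash % length, so the index is never negative.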
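The equivalence between masking and modulo, and the effect of folding the high bits, can be checked with a short sketch. `IndexDemo` and `spread` are names chosen for this example; `spread` repeats the h ^ (h >>> 16) step from hash() above.

```java
final class IndexDemo {
    // Same mixing step as in hash(): fold the high 16 bits into the low 16 bits.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Bucket index for a table whose length is a power of two.
    static int indexFor(int hash, int length) {
        return hash & (length - 1);
    }

    public static void main(String[] args) {
        int length = 16;
        for (int h : new int[] {3, 21, -17}) {
            // The mask agrees with floorMod for every int, including negatives.
            System.out.println(h + ": mask=" + indexFor(h, length)
                    + " floorMod=" + Math.floorMod(h, length)
                    + " %=" + (h % length));
        }
        // Two hash codes that differ only in their high bits.
        int a = 0x10000, b = 0x20000;
        System.out.println(indexFor(a, length) == indexFor(b, length));                 // prints true
        System.out.println(indexFor(spread(a), length) == indexFor(spread(b), length)); // prints false
    }
}
```

Without the mixing step, the two hash codes collide in a 16-slot table because only their low four bits are used; after mixing they land in different buckets.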

Diagram illustrating the computation performed by hash() and indexFor():

The put Method

put() flow

The source code is as follows:

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K, V>[] tab;
    Node<K, V> p;
    int n, i;
    // the table is null or has length 0
    if ((tab = table) == null || (n = tab.length) == 0) {
        // resize() allocates the table
        n = (tab = resize()).length;
    }
    // the bucket at the computed index is empty
    if ((p = tab[i = hash & (n - 1)]) == null)
        // insert the new node directly
        tab[i] = newNode(hash, key, value, null);
    else {
        Node<K, V> e;
        K k;
        // the first node has the same hash and an equal key
        if (p.hash == hash && ((k = p.key) == key || (key != null && key.equals(k))))
            e = p; // existing mapping; its value is overwritten below
        else if (p instanceof TreeNode)
            // the bucket is a red-black tree: insert the key-value pair into the tree
            e = ((TreeNode<K, V>) p).putTreeVal(this, tab, hash, key, value);
        else {
            // walk the linked list
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    // reached the end of the list: append the new node
                    p.next = newNode(hash, key, value, null);
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        // the chain is now longer than 8: convert it to a red-black tree
                        treeifyBin(tab, hash);
                    break;
                }
                // found a node with the same hash and an equal key
                if (e.hash == hash &&
                        ((k = e.key) == key || (key != null && key.equals(k))))
                    // existing mapping; its value is overwritten below
                    break;
                p = e;
            }
        }
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    if (++size > threshold)
        // size exceeds the threshold: resize
        resize();
    afterNodeInsertion(evict);
    return null;
}
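The return values and the onlyIfAbsent flag of putVal() are visible through the public API: put() passes onlyIfAbsent = false and putIfAbsent() passes true. The following small sketch (the class name `PutDemo` is made up for this example) shows the observable behavior.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class PutDemo {
    // Runs a few puts and returns {r1, r2, r3, r4, final value of "a", final value of "b"}.
    static Integer[] returnedValues() {
        Map<String, Integer> map = new HashMap<>();
        Integer r1 = map.put("a", 1);         // no previous mapping: returns null
        Integer r2 = map.put("a", 2);         // existing key: returns the old value, overwrites
        Integer r3 = map.putIfAbsent("a", 3); // onlyIfAbsent: the non-null value is kept
        map.put("b", null);
        Integer r4 = map.putIfAbsent("b", 4); // a null value counts as absent, so 4 is stored
        return new Integer[] {r1, r2, r3, r4, map.get("a"), map.get("b")};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(returnedValues()));
        // prints [null, 1, 2, null, 2, 4]
    }
}
```

The fourth result follows from the condition `!onlyIfAbsent || oldValue == null` in putVal(): an existing mapping to null is replaced even when onlyIfAbsent is true.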

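Resize Mechanism

Resizing (resize) means recomputing the capacity. As elements keep being added to a HashMap, the internal array eventually cannot accommodate more of them, and the object has to enlarge the array so that it can hold more elements.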

The following example illustrates the resize process:

public static void main(String[] args) throws NoSuchFieldException, IllegalAccessException {
    HashMap<Integer, String> map = new HashMap<>(2);
    // printInfo is a helper that is not shown in the source; it prints the map's internal fields
    printInfo(map, "Info of the newly initialized HashMap:");
    int[] values = {3, 7, 5, 9};
    for (int i = 0; i < values.length; i++) {
        map.put(values[i], "v");
        printInfo(map, String.format("Info after adding element %d [%s=%s]:", i + 1, values[i], "v"));
    }
}

The output is as follows:

Info of the newly initialized HashMap:
size: 0
tableLength: 0
loadFactor: 0.75
threshold: 2
modCount: 0
table:  null

Info after adding element 1 [3=v]:
size: 1
tableLength: 2
loadFactor: 0.75
threshold: 1
modCount: 1
table:
  index | element
    0   | null
    1   | [3=v]

Info after adding element 2 [7=v] and resizing:
size: 2
tableLength: 4
loadFactor: 0.75
threshold: 3
modCount: 2
table:
  index | element
    0   | null
    1   | null
    2   | null
    3   | [3=v] --> [7=v]

Info after adding element 3 [5=v]:
size: 3
tableLength: 4
loadFactor: 0.75
threshold: 3
modCount: 3
table:
  index | element
    0   | null
    1   | [5=v]
    2   | null
    3   | [3=v] --> [7=v]

Info after adding element 4 [9=v] and resizing:
size: 4
tableLength: 8
loadFactor: 0.75
threshold: 6
modCount: 4
table:
  index | element
    0   | null
    1   | [9=v]
    2   | null
    3   | [3=v]
    4   | null
    5   | [5=v]
    6   | null
    7   | [7=v]
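The table lengths in the output above can be reproduced with a simplified model of the growth rule. This is only a sketch under stated assumptions: `ResizeSim` and its methods are hypothetical names, the model recomputes threshold as capacity * loadFactor after every doubling, and it ignores MAXIMUM_CAPACITY and the resizes that treeifyBin can trigger on small tables.

```java
final class ResizeSim {
    // Rounds the requested capacity up to the next power of two (at least 1).
    static int tableSizeFor(int cap) {
        int n = 1;
        while (n < cap) {
            n <<= 1;
        }
        return n;
    }

    // Table length after `puts` insertions of distinct keys into a map
    // created with the given initial capacity and load factor.
    static int tableLengthAfter(int initialCapacity, float loadFactor, int puts) {
        int capacity = tableSizeFor(initialCapacity);
        int threshold = (int) (capacity * loadFactor);
        for (int size = 1; size <= puts; size++) {
            if (size > threshold) { // same check as ++size > threshold in putVal()
                capacity <<= 1;
                threshold = (int) (capacity * loadFactor);
            }
        }
        return capacity;
    }

    public static void main(String[] args) {
        for (int puts = 1; puts <= 4; puts++) {
            System.out.println(puts + " puts -> table length " + tableLengthAfter(2, 0.75f, puts));
        }
        // prints table lengths 2, 4, 4, 8, matching the output above
    }
}
```

With the default capacity of 16 the same model gives a threshold of 12, so the 13th insertion is the one that doubles the table to 32.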

The output shows that HashMap grows its table in powers of two (the length doubles each time). After the array grows, each element either stays at its original index or moves to its original index plus oldCap, the old table length, which is itself a power of two. The source code that decides an element's new position after a resize is:

HashMap.resize():

Node<K,V> loHead = null, loTail = null;
Node<K,V> hiHead = null, hiTail = null;
Node<K,V> next;
do {
    next = e.next;
    if ((e.hash & oldCap) == 0) {
        if (loTail == null)
            loHead = e;
        else
            loTail.next = e;
        loTail = e;
    }
    else {
        if (hiTail == null)
            hiHead = e;
        else
            hiTail.next = e;
        hiTail = e;
    }
} while ((e = next) != null);
if (loTail != null) {
    loTail.next = null;
    newTab[j] = loHead;
}
if (hiTail != null) {
    hiTail.next = null;
    newTab[j + oldCap] = hiHead;
}

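Next, the resize that happens after the fourth element is added is used to explain how the code above works.

The hash values of the four elements are:

  key | hash
   3  |  3
   7  |  7
   5  |  5
   9  |  9

After the fourth element [9=v] has been added to the map, but before the resize (before resize() runs), the table is:

table:
  index | element
    0   | null
    1   | [5=v] --> [9=v]
    2   | null
    3   | [3=v] --> [7=v]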

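At this point ++size > threshold holds (4 > 3), so resize() has to run.
resize() creates a new array twice as long as the old one. If a bucket of the old array holds a linked list, it walks the list and checks whether (e.hash & oldCap) is 0 for each element to decide the element's new position. Here oldCap is 4:

  key | hash | (e.hash & oldCap) | is 0? | new index
   3  |  3   |         0         |  yes  | 3
   7  |  7   |         4         |  no   | 3+4
   5  |  5   |         4         |  no   | 1+4
   9  |  9   |         0         |  yes  | 1

The table above leads to this conclusion: if e.hash & oldCap is 0, the index stays the same; otherwise the new index is the original index plus oldCap. The table after the resize is therefore: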
```
table:
  index | element
    0   | null
    1   | [9=v]
    2   | null
    3   | [3=v]
    4   | null
    5   | [5=v]
    6   | null
    7   | [7=v]
```
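This check is a JDK 8 optimization: unlike JDK 7, it does not need to recompute the hash; testing the element's hash value against oldCap with a bitwise AND is enough. The design saves the time of recomputing the hash and spreads the colliding nodes evenly across the new table. In addition, JDK 8's HashMap keeps the order of the list elements unchanged when it migrates a list.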
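The rule can be verified numerically with a short sketch. `SplitDemo` and `newIndex` are names chosen for this example; the sketch compares the one-bit test against recomputing the index from scratch with the doubled length.

```java
final class SplitDemo {
    // New index after the table doubles from oldCap, using only the one-bit test.
    static int newIndex(int hash, int oldCap) {
        int oldIndex = hash & (oldCap - 1);
        return (hash & oldCap) == 0 ? oldIndex : oldIndex + oldCap;
    }

    public static void main(String[] args) {
        int oldCap = 4;
        for (int hash : new int[] {3, 7, 5, 9}) {
            int recomputed = hash & (2 * oldCap - 1); // index computed from scratch
            System.out.println(hash + " -> " + newIndex(hash, oldCap)
                    + " (recomputed: " + recomputed + ")");
        }
        // prints 3 -> 3, 7 -> 7, 5 -> 5, 9 -> 1, each equal to the recomputed index
    }
}
```

Doubling the length adds exactly one bit to the mask, and that bit is the oldCap bit, which is why testing it alone is sufficient.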

The complete code of resize() is as follows:

final Node<K, V>[] resize() {
    Node<K, V>[] oldTab = table;
    int oldCapacity = (oldTab == null) ? 0 : oldTab.length;
    int oldThreshold = threshold;
    int newCapacity, newThreshold = 0;
    if (oldCapacity > 0) {
        if (oldCapacity >= MAXIMUM_CAPACITY) { // the array has already reached the maximum size (2^30)
            threshold = Integer.MAX_VALUE; // set the threshold to the int maximum (2^31-1) so no further resize happens
            return oldTab;
        } else if ((newCapacity = oldCapacity << 1) < MAXIMUM_CAPACITY &&
                oldCapacity >= DEFAULT_INITIAL_CAPACITY)
            newThreshold = oldThreshold << 1; // double both the capacity and the threshold
    } else if (oldThreshold > 0) // initial capacity was placed in threshold
        newCapacity = oldThreshold;
    else {               // zero initial threshold signifies using defaults
        newCapacity = DEFAULT_INITIAL_CAPACITY;
        newThreshold = (int) (DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    if (newThreshold == 0) {
        float ft = (float) newCapacity * loadFactor;
        newThreshold = (newCapacity < MAXIMUM_CAPACITY && ft < (float) MAXIMUM_CAPACITY ?
                (int) ft : Integer.MAX_VALUE); // update the threshold
    }
    threshold = newThreshold;
    @SuppressWarnings({"rawtypes", "unchecked"})
    Node<K, V>[] newTab = (Node<K, V>[]) new Node[newCapacity];
    table = newTab;
    if (oldTab != null) {
        for (int j = 0; j < oldCapacity; ++j) { // iterate over the old bucket array
            Node<K, V> current;
            if ((current = oldTab[j]) != null) {
                oldTab[j] = null; // clear the old slot
                if (current.next == null) // the bucket holds a single node
                    newTab[current.hash & (newCapacity - 1)] = current;
                else if (current instanceof TreeNode) // the bucket is a red-black tree
                    ((TreeNode<K, V>) current).split(this, newTab, j, oldCapacity);
                else { // the bucket is a linked list; preserve order
                    Node<K, V> loHead = null, loTail = null;
                    Node<K, V> hiHead = null, hiTail = null;
                    Node<K, V> next;
                    do {
                        next = current.next;
                        if ((current.hash & oldCapacity) == 0) {
                            if (loTail == null)
                                loHead = current;
                            else
                                loTail.next = current;
                            loTail = current;
                        } else {
                            if (hiTail == null)
                                hiHead = current;
                            else
                                hiTail.next = current;
                            hiTail = current;
                        }
                    } while ((current = next) != null);
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCapacity] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}

Thread Safety

Concurrent rehash

In multithreaded scenarios, the non-thread-safe HashMap should be avoided as far as possible, because using HashMap concurrently from several threads can lose data.

Code for testing HashMap with multiple threads:

public static void main(String[] args) {
    HashMap<Integer, String> map = new HashMap<>(2, 0.75f);
    AtomicInteger counter = new AtomicInteger(0);
    map.put(5, "C");
    Runnable r1 = () -> {
        map.put(7, "B");
        counter.incrementAndGet();
    };
    Runnable r2 = () -> {
        map.put(3, "A");
        map.put(8, "A");
        counter.incrementAndGet();
    };
    new Thread(r1, "thread1").start();
    new Thread(r2, "thread2").start();
    // busy-wait until both threads have finished
    while (true) {
        if (counter.get() == 2) {
            printInfo(map, "");
            System.out.println(map.get(7));
            break;
        }
    }
}

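If thread1 is blocked inside resize() while thread2 runs and performs its own resize(), the final output is:

size: 4
tableLength: 8
loadFactor: 0.75
threshold: 6
modCount: 4
table:
  index | element
    0   | [8=A]
    1   | null
    2   | null
    3   | [3=A]
    4   | null
    5   | null
    6   | null
    7   | null
null

The map's size is 4, which shows that it went through four put operations, yet the table actually contains only two elements; the others were lost. The following IntelliJ IDEA multithreaded breakpoint debugging session demonstrates why the elements are lost.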

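Set up a debugging environment
Steps for simulating thread switching with the debugger
1. Click the debug button; execution stops at the breakpoint in thread1.
2. Set a breakpoint at `next = e.next` in HashMap.resize() and set its suspend policy to Thread.
3. Resume thread1. It stops at the breakpoint just set, which effectively suspends thread1.
4. Switch to thread2 and remove the breakpoint set in step 2, so that thread2 can run to completion in one go, including its resize().
5. After thread2 has finished, resume thread1 and let it continue.
6. Finally, the printed output shows that data has been lost.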

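Analysis

The source of resize() shows that every call points table at a newly allocated newTab:

······
threshold = newThr;
@SuppressWarnings({"rawtypes","unchecked"})
Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
table = newTab;
······

It then iterates over oldTab and stores the existing key-value pairs in newTab.

for (int j = 0; j < oldCap; ++j) {
    Node<K,V> e;
    if ((e = oldTab[j]) != null) {
······
if (loTail != null) {
    loTail.next = null;
    newTab[j] = loHead;
}
if (hiTail != null) {
    hiTail.next = null;
    newTab[j + oldCap] = hiHead;
}

In step 3 above, thread1 is suspended at next = e.next. By then it has already pointed table at its own, still empty newTab, but it has not yet moved [5=C] and [7=B] into it. thread2 then runs and puts [3=A] and [8=A] into that table. Putting [8=A] triggers a resize() of its own, which copies only the entries currently in the table and points table at yet another newTab. When thread1 resumes, it moves [5=C] and [7=B] into its own newTab, which the map no longer references, so those entries are lost.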

Solution

Therefore, in a multithreaded environment, replace HashMap with ConcurrentHashMap, or wrap the HashMap with Collections.synchronizedMap.
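As a minimal sketch of this advice, the code below lets several threads insert disjoint key ranges into a shared map and reports the final size. `ConcurrentPutDemo` and `fill` are names invented for this example. With a ConcurrentHashMap or a synchronizedMap wrapper no insertion is lost, even though the small initial capacity forces many resizes.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class ConcurrentPutDemo {
    // Starts `threads` workers that each put `perThread` distinct keys,
    // waits for all of them, and returns the resulting map size.
    static int fill(Map<Integer, String> map, int threads, int perThread) {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int base = t * perThread; // disjoint key range per thread
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    map.put(base + i, "v");
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return map.size();
    }

    public static void main(String[] args) {
        Map<Integer, String> concurrent = new ConcurrentHashMap<>(2);
        Map<Integer, String> wrapped = Collections.synchronizedMap(new HashMap<Integer, String>(2));
        System.out.println(fill(concurrent, 4, 10_000)); // prints 40000
        System.out.println(fill(wrapped, 4, 10_000));    // prints 40000
    }
}
```

Passing a plain HashMap to the same method is unsafe; the result is then unpredictable and may be smaller than 40000.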

Performance Comparison of HashMap in JDK 8 and JDK 7

In a HashMap, if the array indexes computed from the keys' hashes are all different, that is, if the hash function is very good, then get() runs in O(1). If the hash function produces many collisions, and in the extreme case maps every key to the same index, all key-value pairs end up in a single bucket, stored either in a linked list or in a red-black tree, with time complexity O(n) and O(log n) respectively.

Evenly Distributed Hashes

Write a Key class:

class Key implements Comparable<Key> {
    private final int value;

    Key(int value) {
        this.value = value;
    }

    @Override
    public int compareTo(Key o) {
        return Integer.compare(this.value, o.value);
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass())
            return false;
        Key key = (Key) o;
        return value == key.value;
    }

    @Override
    public int hashCode() {
        return value;
    }
}

This class overrides equals() and provides a very good hashCode function: keys with different values never have the same hashCode.

Create a Keys class that caches the Key instances, so that frequent GC does not distort the time HashMap actually spends on lookups.

public class Keys {
    public static final int MAX_KEY = 10_000_000;
    private static final Key[] KEYS_CACHE = new Key[MAX_KEY];

    static {
        for (int i = 0; i < MAX_KEY; ++i) {
            KEYS_CACHE[i] = new Key(i);
        }
    }

    public static Key of(int value) {
        return KEYS_CACHE[value];
    }
}

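Now for the experiment. All the test needs to do is create HashMaps of different sizes (1, 10, 100, ..., 10000000).

static void test(int mapSize) {
    HashMap<Key, Integer> map = new HashMap<Key, Integer>(mapSize);
    for (int i = 0; i < mapSize; ++i) {
        map.put(Keys.of(i), i);
    }
    long beginTime = System.nanoTime(); // time in nanoseconds
    for (int i = 0; i < mapSize; i++) {
        map.get(Keys.of(i));
    }
    long endTime = System.nanoTime();
    System.out.println(endTime - beginTime);
}

public static void main(String[] args) {
    for (int i = 1; i <= 10_000_000; i *= 10) {
        test(i);
    }
}

The test looks up different values and measures the time spent. To obtain the average time of a get() call, it looks up every key, measures the total time, and divides by the number of keys. The average is mainly useful for comparison; the absolute values can be affected by many environmental factors. The results are as follows: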

Extremely Uneven Hashes

Suppose we have a very bad Key whose instances all return the same hashCode. This is the worst case for HashMap. The code is changed as follows:
```
class Key implements Comparable<Key> {
    //...
    @Override
    public int hashCode() {
        return 1;
    }
}
```
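Run the main method again; the results are shown in the table below.
The results show that as size grows, the time spent by JDK 1.7 keeps increasing, while the time spent by JDK 1.8 is markedly lower and grows only logarithmically. When a chain becomes too long, JDK 1.8's HashMap dynamically replaces it with a red-black tree, which lowers the time complexity from O(n) to O(log n). The time spent also differs clearly between evenly and unevenly distributed hashes, and comparing the two cases shows how important a good hash function is.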
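Wall-clock timings depend heavily on the machine, so the sketch below counts comparisons instead. `CollisionDemo` and `BadKey` are hypothetical names for this example. Every key has the same hashCode, and the code counts how many compareTo() calls a single get() needs once the bucket has been converted to a tree. The figures in the comments are rough expectations for JDK 8 or later, not measured results.

```java
import java.util.HashMap;
import java.util.Map;

final class CollisionDemo {
    static long compareCalls = 0; // counts compareTo() invocations (single-threaded demo)

    // Hypothetical worst-case key: every instance returns the same hashCode.
    static final class BadKey implements Comparable<BadKey> {
        final int value;

        BadKey(int value) {
            this.value = value;
        }

        @Override
        public int compareTo(BadKey o) {
            compareCalls++;
            return Integer.compare(value, o.value);
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).value == value;
        }

        @Override
        public int hashCode() {
            return 1;
        }
    }

    // Builds a map of n colliding keys and returns how many compareTo()
    // calls one successful get() needed.
    static long comparisonsForOneLookup(int n) {
        Map<BadKey, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new BadKey(i), i);
        }
        compareCalls = 0;
        Integer found = map.get(new BadKey(n - 1));
        if (found == null || found != n - 1) {
            throw new IllegalStateException("lookup failed");
        }
        return compareCalls;
    }

    public static void main(String[] args) {
        // With a treeified bucket this should be a few dozen at most,
        // far fewer than the 10000 entries sharing the bucket.
        System.out.println(comparisonsForOneLookup(10_000));
    }
}
```

The tree can order the keys only because BadKey implements Comparable; without it, lookups among keys with identical hashes may have to search both subtrees.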

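Summary

Resizing is a particularly expensive operation, so when using a HashMap, estimate the size of the map and pass an approximate value at initialization to avoid frequent resizes.
The load factor can be changed and may even be greater than 1, but changing it casually is not recommended unless the situation is very unusual.
HashMap is not thread-safe. Do not operate on a HashMap from several threads at the same time in a concurrent environment; ConcurrentHashMap is recommended instead.
The red-black tree introduced in JDK 1.8 greatly improves HashMap's performance.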
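For the first recommendation, one common way to choose the initial capacity is to divide the expected number of entries by the load factor. The sketch below assumes the default load factor of 0.75; `PresizeDemo`, `capacityFor`, and `newPresizedMap` are hypothetical helper names, not JDK methods.

```java
import java.util.HashMap;
import java.util.Map;

final class PresizeDemo {
    // Smallest initial capacity for which `expected` entries stay within
    // the threshold, assuming the default load factor of 0.75.
    static int capacityFor(int expected) {
        return (int) Math.ceil(expected / 0.75);
    }

    // Creates a HashMap that should not need to resize while it holds
    // up to `expected` entries.
    static <K, V> Map<K, V> newPresizedMap(int expected) {
        return new HashMap<>(capacityFor(expected));
    }

    public static void main(String[] args) {
        System.out.println(capacityFor(12));  // prints 16
        System.out.println(capacityFor(100)); // prints 134
        Map<String, Integer> map = newPresizedMap(100);
        map.put("a", 1);
        System.out.println(map.size());       // prints 1
    }
}
```

HashMap rounds the requested capacity up to the next power of two, so a request for 134 gives a table of 256 buckets with a threshold of 192.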
