HashMap performance improvements in Java 8


Original article:

http://www.javacodegeeks.com/2014/04/hashmap-performance-improvements-in-java-8.html


HashMap<K, V> is a fast, versatile and ubiquitous data structure, found in virtually every Java program. First, some basics. As you probably know, it uses the hashCode() and equals() methods of keys to distribute entries among buckets. The number of buckets (bins) should be slightly higher than the number of entries in the map, so that each bucket holds only a few entries (preferably one). When looking up by key, we very quickly determine the bucket (using hashCode() modulo number_of_buckets), and the item is available in constant time.
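The bucket selection described above can be sketched in a few lines. This is only an illustration: the class and method names are made up for this example, and the mask-based variant is an assumption modelled on how OpenJDK 8's HashMap appears to spread and mask hash codes when the table size is a power of two.

```java
// Sketch only: two ways to turn a hash code into a bucket index.
// BucketIndexSketch and its methods are hypothetical names for this example.
class BucketIndexSketch {

    // Textbook form from the paragraph above: hash modulo bucket count.
    // Math.floorMod keeps the result non-negative for negative hash codes.
    static int indexByModulo(int hash, int buckets) {
        return Math.floorMod(hash, buckets);
    }

    // Mask-based form, valid only when `buckets` is a power of two.
    // Folding the high 16 bits into the low 16 bits first is an assumption
    // modelled on OpenJDK 8; other versions may do this differently.
    static int indexByMask(int hash, int buckets) {
        int spread = hash ^ (hash >>> 16);
        return spread & (buckets - 1);
    }

    public static void main(String[] args) {
        int buckets = 16;
        for (int hash : new int[] {0, 1, 17, -1, 65536}) {
            System.out.printf("hash=%d modulo=%d mask=%d%n",
                    hash, indexByModulo(hash, buckets), indexByMask(hash, buckets));
        }
    }
}
```

Either way, finding the bucket is a constant-time arithmetic step. The cost of a lookup is then dominated by how many entries share that bucket.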


You should already know all this. You probably also know that hash collisions have a disastrous impact on HashMap performance. When multiple hashCode() values end up in the same bucket, the entries are placed in an ad-hoc linked list. In the worst case, all keys map to the same bucket and the hash map degenerates into a linked list, turning O(1) lookups into O(n). Let's first benchmark how HashMap behaves under normal circumstances in Java 7 (1.7.0_40) and Java 8 (1.8.0-b132). To have full control over hashCode() behaviour, we define a custom Key class:


class Key implements Comparable<Key> {

    private final int value;

    Key(int value) {
        this.value = value;
    }

    @Override
    public int compareTo(Key o) {
        return Integer.compare(this.value, o.value);
    }

    // custom equals: two keys are equal when their values are equal
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass())
            return false;
        Key key = (Key) o;
        return value == key.value;
    }

    // custom hashCode: the value itself is the hash code
    @Override
    public int hashCode() {
        return value;
    }
}

The Key class is well-behaved: it overrides equals() and provides a decent hashCode() (the value itself serves as the hash code, so no two distinct keys share one). To avoid excessive GC, I cache immutable Key instances rather than creating them from scratch over and over:

public class Keys {

    public static final int MAX_KEY = 10_000_000;
    private static final Key[] KEYS_CACHE = new Key[MAX_KEY];

    // Pre-create every Key once so the benchmark loop allocates nothing
    static {
        for (int i = 0; i < MAX_KEY; ++i) {
            KEYS_CACHE[i] = new Key(i);
        }
    }

    public static Key of(int value) {
        return KEYS_CACHE[value];
    }
}

Now we are ready to experiment a little. Our benchmark simply creates HashMaps of different sizes (powers of 10, from 1 to 1 million) over a contiguous key space. In the benchmark itself we look up values by key and measure how long it takes, depending on the HashMap size:


import java.util.HashMap;

import com.google.caliper.Param;
import com.google.caliper.Runner;
import com.google.caliper.SimpleBenchmark;

public class MapBenchmark extends SimpleBenchmark {

    private HashMap<Key, Integer> map;

    // Map size under test, supplied by Caliper
    @Param
    private int mapSize;

    @Override
    protected void setUp() throws Exception {
        map = new HashMap<>(mapSize);
        for (int i = 0; i < mapSize; ++i) {
            map.put(Keys.of(i), i);
        }
    }

    // Caliper calls this with a varying number of repetitions
    public void timeMapGet(int reps) {
        for (int i = 0; i < reps; i++) {
            map.get(Keys.of(i % mapSize));
        }
    }
}
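For readers without Caliper on the classpath, the same kind of measurement can be roughly approximated with plain System.nanoTime(). The sketch below is an assumption-laden stand-in, not the benchmark used in this article: it uses Integer keys instead of the Key class (Integer's hash code is also the value itself), does a single warm-up pass, and its class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Rough timing sketch; far less rigorous than a real benchmark harness.
class LookupTimingSketch {

    // Builds a map holding the keys 0..size-1.
    static Map<Integer, Integer> buildMap(int size) {
        Map<Integer, Integer> map = new HashMap<>(size);
        for (int i = 0; i < size; i++) {
            map.put(i, i);
        }
        return map;
    }

    // Performs `reps` lookups and returns the number of hits, so that the
    // result of each get() is actually used.
    static long countHits(Map<Integer, Integer> map, int size, int reps) {
        long hits = 0;
        for (int i = 0; i < reps; i++) {
            if (map.get(i % size) != null) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int reps = 1_000_000;
        for (int size = 1; size <= 1_000_000; size *= 10) {
            Map<Integer, Integer> map = buildMap(size);
            countHits(map, size, reps); // warm-up pass, result ignored
            long start = System.nanoTime();
            long hits = countHits(map, size, reps);
            long elapsed = System.nanoTime() - start;
            System.out.printf("size=%d hits=%d ns/lookup=%.1f%n",
                    size, hits, (double) elapsed / reps);
        }
    }
}
```

Numbers from a loop like this include boxing and JIT noise, so treat them as a sanity check only. The results reported in this article come from the Caliper benchmark, not from this sketch.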

The results confirm that HashMap.get() is indeed O(1).



Interestingly, Java 8 is on average 20% faster than Java 7 in a simple HashMap.get(). The overall performance is equally interesting: even with one million entries in a HashMap, a single lookup takes less than 10 nanoseconds, which means around 20 CPU cycles on my machine*. Pretty impressive! But that's not what we set out to benchmark.


Suppose that we have a very poor map key whose hashCode() always returns the same value. This is the worst-case scenario that defeats the purpose of using HashMap altogether:


class Key implements Comparable<Key> {

    //...

    // Every key reports the same hash code, so every key lands in one bucket
    @Override
    public int hashCode() {
        return 0;
    }
}

I used the exact same benchmark to see how it behaves for various map sizes (note that the chart uses a log-log scale).




The results for Java 7 are as expected. The cost of HashMap.get() grows proportionally to the size of the HashMap itself. Since all entries sit in the same bucket in one huge linked list, looking one up requires traversing half of that list (of size n) on average. Hence the O(n) complexity visualized on the graph.
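The "half of the list on average" claim is easy to check with a small model. The sketch below is not HashMap's code: it uses an ArrayList as a stand-in for the single bucket's chain, and the class and method names are hypothetical. It counts how many entries are examined before each key is found.

```java
import java.util.ArrayList;
import java.util.List;

// Model of a single overfull bucket scanned front to back.
class ChainScanSketch {

    // Returns how many entries are examined until `target` is found
    // (or the chain length if it is absent).
    static int probes(List<Integer> chain, int target) {
        int examined = 0;
        for (int key : chain) {
            examined++;
            if (key == target) {
                return examined;
            }
        }
        return examined;
    }

    // Average number of entries examined when every key in a chain of
    // n entries is looked up exactly once.
    static double averageProbes(int n) {
        List<Integer> chain = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            chain.add(i);
        }
        long total = 0;
        for (int i = 0; i < n; i++) {
            total += probes(chain, i);
        }
        return (double) total / n;
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1000}) {
            System.out.printf("n=%d average probes=%.1f%n", n, averageProbes(n));
        }
    }
}
```

The average works out to (n + 1) / 2 comparisons, which is linear in n and matches the straight line Java 7 produces on the log-log chart.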

But Java 8 performs so much better! It's a log scale, so we are actually talking about several orders of magnitude. The same benchmark executed on JDK 8 yields O(logn) worst-case performance under catastrophic hash collisions, which is easier to see when JDK 8 is plotted alone on a log-linear scale.




What is the reason behind such a performance improvement, even in terms of big-O notation? This optimization is described in JEP-180. Basically, when a bucket becomes too big (currently: TREEIFY_THRESHOLD = 8), HashMap dynamically replaces it with an ad-hoc implementation of a tree map. This way, rather than the pessimistic O(n), we get a much better O(logn). How does it work? Previously, entries with conflicting keys were simply appended to a linked list, which later had to be traversed. Now HashMap promotes the list into a binary tree, using the hash code as the branching variable. If two hashes are different but end up in the same bucket, one is considered bigger and goes to the right. If the hashes are equal (as in our case), HashMap hopes that the keys are Comparable, so that it can establish some order. This is not a requirement for HashMap keys, but apparently a good practice. If keys are not comparable, don't expect any performance improvement in case of heavy hash collisions.
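One way to observe this behaviour without a benchmark harness is to count how many equals() calls a single get() triggers when every key collides. The sketch below assumes a Java 8 or newer runtime; CollidingKey and the counter are hypothetical helpers written for this example and are not part of the article's benchmark.

```java
import java.util.HashMap;
import java.util.Map;

// Counts equals() calls per lookup when all keys share one hash code.
class TreeBinSketch {

    static long equalsCalls = 0;

    // Hypothetical key: constant hashCode(), but Comparable, so a
    // tree-based bucket can order the keys.
    static final class CollidingKey implements Comparable<CollidingKey> {
        final int value;

        CollidingKey(int value) {
            this.value = value;
        }

        @Override
        public int hashCode() {
            return 0; // every key lands in the same bucket
        }

        @Override
        public boolean equals(Object o) {
            equalsCalls++;
            return o instanceof CollidingKey && ((CollidingKey) o).value == value;
        }

        @Override
        public int compareTo(CollidingKey other) {
            return Integer.compare(value, other.value);
        }
    }

    // Fills a map with n colliding keys and returns the largest number of
    // equals() calls observed during any single successful get().
    static long maxEqualsCallsPerLookup(int n) {
        Map<CollidingKey, Integer> map = new HashMap<>();
        for (int i = 0; i < n; i++) {
            map.put(new CollidingKey(i), i);
        }
        long worst = 0;
        for (int i = 0; i < n; i++) {
            equalsCalls = 0;
            Integer found = map.get(new CollidingKey(i));
            if (found == null || found != i) {
                throw new IllegalStateException("lookup failed for " + i);
            }
            worst = Math.max(worst, equalsCalls);
        }
        return worst;
    }

    public static void main(String[] args) {
        for (int n : new int[] {64, 1024, 16384}) {
            System.out.printf("n=%d worst equals() calls per get=%d%n",
                    n, maxEqualsCallsPerLookup(n));
        }
    }
}
```

With tree bins, the worst case should grow only by a handful of calls each time n is multiplied, in line with the O(logn) bound above. With a plain linked list it would grow in step with n.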

Why is all of this so important? Malicious software, aware of the hashing algorithm we use, might craft a couple of thousand requests that result in massive hash collisions. Repeatedly accessing such keys will significantly impact server performance, effectively resulting in a denial-of-service attack. In JDK 8 the jump from O(n) to O(logn) mitigates this attack vector and also makes performance a little more predictable. I hope this will finally convince your boss to upgrade.
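To see how cheaply colliding keys can be produced, consider String keys. "Aa" and "BB" have the same String.hashCode(), and because the hash is computed character by character, any strings built from the same number of such blocks collide as well. The sketch below only illustrates that point; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Generates 2^blocks distinct strings that all share one hash code.
class CollidingStringsSketch {

    // Each round appends either "Aa" or "BB" to every string built so far.
    // Both blocks have the same length and the same hash code, so all
    // results of a given length have equal hash codes.
    static List<String> collidingStrings(int blocks) {
        List<String> result = new ArrayList<>();
        result.add("");
        for (int i = 0; i < blocks; i++) {
            List<String> next = new ArrayList<>(result.size() * 2);
            for (String prefix : result) {
                next.add(prefix + "Aa");
                next.add(prefix + "BB");
            }
            result = next;
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = collidingStrings(10);
        long distinctHashes = keys.stream().mapToInt(String::hashCode).distinct().count();
        System.out.printf("keys=%d distinct hash codes=%d%n", keys.size(), distinctHashes);
    }
}
```

String implements Comparable, so on JDK 8 a bucket flooded with such keys can be turned into a tree as described above, rather than staying a long list.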

*Benchmarks executed on an Intel Core i7-3635QM @ 2.4 GHz with 8 GiB of RAM and an SSD drive, running 64-bit Windows 8.1 with default JVM settings.


