The Java String.intern() method

Source: Internet; editor: 程序博客网; time: 2024/04/28 03:24

source : http://java-performance.info/string-intern-in-java-6-7-8/

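This article describes how String.intern() is implemented in Java 6 and what changed in Java 7 and 8.

String pooling

String pooling means using a single shared String object to represent each distinct value, instead of keeping many separate String instances that all hold the same characters.
You can implement a string pool yourself (an example follows later), or use the String.intern() method provided by the JDK.
In the JDK 6 era, many coding standards prohibited String.intern() because it could lead to an OutOfMemoryError (OOM). Java 7 changed String.intern() partly for this reason.
Two related bug reports: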
+ http://bugs.sun.com/view_bug.do?bug_id=6962931
+ http://bugs.sun.com/view_bug.do?bug_id=6962930
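
To make the idea of pooling concrete, here is a minimal sketch. The class name InternIdentityDemo and the helper sameObject are made up for this illustration; only String.intern() itself comes from the JDK. Two strings built at run time are equal but are distinct objects, while their interned versions are one and the same object.

```java
public class InternIdentityDemo {
    /** Returns true when both arguments are the very same object (reference equality). */
    public static boolean sameObject(String a, String b) {
        return a == b;
    }

    public static void main(String[] args) {
        // new String(...) forces two distinct objects with equal contents.
        String a = new String("pooled value");
        String b = new String("pooled value");

        System.out.println(a.equals(b));                         // true: same characters
        System.out.println(sameObject(a, b));                    // false: two separate objects
        System.out.println(sameObject(a.intern(), b.intern()));  // true: one pooled object
    }
}
```

After interning, equal strings can share one instance, which is where the memory saving comes from.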

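String.intern() in Java 7

Java 7 made an important change to String.intern().
The string pool was moved out of the permanent generation (PermGen) and into the heap. (PermGen itself was removed in Java 8 and replaced by Metaspace.) This has two important consequences:
1. Objects in the string pool can be garbage collected.
2. The size of the string pool is no longer limited by the size of PermGen (in JDK 6, PermGen has a fixed maximum size).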

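The string pool implementation in Java 6, 7, and 8

In the JDK, the string pool is implemented as a hash map with a fixed number of buckets. Each bucket holds the list of strings whose hash codes map to that bucket.
In early Java 6 versions the size of this hash map could not be configured and was 1009 by default. Later versions gradually made the pool size configurable, and starting with a later Java 7 release (around 7u40) the default grew to 60013.
You can set the size of the pool's hash map with -XX:StringTableSize=N (choose a prime number for N, which reduces the chance of key collisions). In Java 6 this parameter is of little use, because the string pool is still constrained by the fixed-size PermGen. Java 6 is not discussed further below.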
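
The effect of the bucket count can be illustrated with a toy model. This is only a sketch under stated assumptions: it uses String.hashCode() and a plain modulo to pick a bucket, whereas the JVM's native string table uses its own hash function, and the names BucketModel, longestChain, and averageChain are invented for this example.

```java
public class BucketModel {
    /**
     * Spreads n generated strings over a fixed number of buckets and returns
     * the length of the longest bucket chain.
     * ASSUMPTION: bucket = hashCode mod bucket count; the real JVM table hashes differently.
     */
    public static int longestChain(int n, int buckets) {
        int[] chainLength = new int[buckets];
        int longest = 0;
        for (int i = 0; i < n; i++) {
            String s = "sample string #" + i;
            int bucket = Math.floorMod(s.hashCode(), buckets);
            chainLength[bucket]++;
            longest = Math.max(longest, chainLength[bucket]);
        }
        return longest;
    }

    /** Average chain length: every lookup walks about this many entries. */
    public static double averageChain(int n, int buckets) {
        return (double) n / buckets;
    }

    public static void main(String[] args) {
        int n = 100_000;
        for (int buckets : new int[] {1009, 60013}) {
            System.out.println(buckets + " buckets: average chain = "
                    + averageChain(n, buckets)
                    + ", longest chain = " + longestChain(n, buckets));
        }
    }
}
```

With a fixed bucket count the average chain grows linearly with the number of pooled strings, so each lookup gets slower as the pool fills.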

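Java 7 (before 7u40)

In Java 7 the string pool is limited only by the maximum heap size.
Once an application's memory footprint reaches several hundred megabytes, memory usage naturally becomes a concern. Against a footprint of that size, spending 8-10 MB on the string pool should be acceptable, and that is enough for a hash map with roughly one million buckets (remember to choose a prime number for the size).
If you want to use the string pool efficiently, set a suitable hash map size, because in the worst case a hash map degrades to the performance of a linked list.

Here are some of my test results.
Test: using the default pool size, and starting from a pool that already contains a given number of interned strings (the first number on each line), measure the time needed to intern another 10,000 strings.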
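
Picking a prime table size and estimating its fixed cost can be sketched as follows. The class and method names are hypothetical, and the 8 bytes per bucket figure is an assumption (one 64-bit pointer per bucket), not a documented JVM guarantee.

```java
public class StringTableSizing {
    /** Simple trial-division primality check; fine for table-size magnitudes. */
    public static boolean isPrime(long n) {
        if (n < 2) return false;
        if (n % 2 == 0) return n == 2;
        for (long d = 3; d * d <= n; d += 2) {
            if (n % d == 0) return false;
        }
        return true;
    }

    /** Returns the smallest prime that is greater than or equal to n. */
    public static long nextPrime(long n) {
        long candidate = Math.max(n, 2);
        while (!isPrime(candidate)) {
            candidate++;
        }
        return candidate;
    }

    /** ASSUMPTION: each bucket costs one 8-byte pointer. */
    public static long estimatedTableBytes(long buckets) {
        return buckets * 8L;
    }

    public static void main(String[] args) {
        long size = nextPrime(1_000_000);
        System.out.println("-XX:StringTableSize=" + size);                 // -XX:StringTableSize=1000003
        System.out.println(estimatedTableBytes(size) + " bytes for the bucket array");
    }
}
```

Under that assumption a table of about one million buckets costs roughly 8 MB regardless of how many strings are actually interned.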

0; time = 0.0 sec
50000; time = 0.03 sec
100000; time = 0.073 sec
150000; time = 0.13 sec
200000; time = 0.196 sec
250000; time = 0.279 sec
300000; time = 0.376 sec
350000; time = 0.471 sec
400000; time = 0.574 sec
450000; time = 0.666 sec
500000; time = 0.755 sec
550000; time = 0.854 sec
600000; time = 0.916 sec
650000; time = 1.006 sec
700000; time = 1.095 sec
750000; time = 1.273 sec
800000; time = 1.248 sec
850000; time = 1.446 sec
900000; time = 1.585 sec
950000; time = 1.635 sec
1000000; time = 1.913 sec

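My machine has a Core i5-3317U CPU at 1.7 GHz. The execution time grows linearly: once the pool contains 1,000,000 strings, only about 5,000 strings can be interned per second (10,000 strings in 1.913 sec). For an application that handles a lot of data, this is clearly unacceptable.

Next, let's raise the pool size to 100003 with -XX:StringTableSize=100003 and look at the results again: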

50000; time = 0.017 sec
100000; time = 0.009 sec
150000; time = 0.01 sec
200000; time = 0.009 sec
250000; time = 0.007 sec
300000; time = 0.008 sec
350000; time = 0.009 sec
400000; time = 0.009 sec
450000; time = 0.01 sec
500000; time = 0.013 sec
550000; time = 0.011 sec
600000; time = 0.012 sec
650000; time = 0.015 sec
700000; time = 0.015 sec
750000; time = 0.01 sec
800000; time = 0.01 sec
850000; time = 0.011 sec
900000; time = 0.011 sec
950000; time = 0.012 sec
1000000; time = 0.012 sec

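The time spent in intern() is now almost constant. With one million strings in 100003 buckets, each bucket should hold about 10 strings on average.
With the same configuration, let's increase the number of inserted strings to 10,000,000. Each bucket then holds about 100 strings.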

2000000; time = 0.024 sec
3000000; time = 0.028 sec
4000000; time = 0.053 sec
5000000; time = 0.051 sec
6000000; time = 0.034 sec
7000000; time = 0.041 sec
8000000; time = 0.089 sec
9000000; time = 0.111 sec
10000000; time = 0.123 sec

Next, increase the hash map size again, to 1,000,003:

1000000; time = 0.005 sec
2000000; time = 0.005 sec
3000000; time = 0.005 sec
4000000; time = 0.004 sec
5000000; time = 0.004 sec
6000000; time = 0.009 sec
7000000; time = 0.01 sec
8000000; time = 0.009 sec
9000000; time = 0.009 sec
10000000; time = 0.009 sec

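From 1 million to 10 million inserted strings, the time per batch barely changes. Even on my fairly slow laptop, 10,000 strings are interned in 0.004 to 0.010 sec, which is on the order of one million strings per second. A table size of 1,000,003 is clearly sufficient.

Should we still manage a string pool by hand?

Now let's build a pool ourselves using WeakHashMap: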

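import java.lang.ref.WeakReference;
import java.util.WeakHashMap;

// Hand-written pool. Keys and values are both weakly referenced,
// so strings that are no longer used elsewhere can be garbage collected.
private static final WeakHashMap<String, WeakReference<String>> MANUAL_CACHE =
        new WeakHashMap<>(100003);

private static String manualIntern(final String str) {
    final WeakReference<String> cached = MANUAL_CACHE.get(str);
    if (cached != null) {
        final String value = cached.get();
        if (value != null) {
            return value; // an equal string is already pooled
        }
    }
    // First occurrence: pool this instance and return it.
    MANUAL_CACHE.put(str, new WeakReference<String>(str));
    return str;
}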

The same test, run against this hand-written pool:

0; manual time = 0.001 sec
50000; manual time = 0.03 sec
100000; manual time = 0.034 sec
150000; manual time = 0.008 sec
200000; manual time = 0.019 sec
250000; manual time = 0.011 sec
300000; manual time = 0.011 sec
350000; manual time = 0.008 sec
400000; manual time = 0.027 sec
450000; manual time = 0.008 sec
500000; manual time = 0.009 sec
550000; manual time = 0.008 sec
600000; manual time = 0.008 sec
650000; manual time = 0.008 sec
700000; manual time = 0.008 sec
750000; manual time = 0.011 sec
800000; manual time = 0.007 sec
850000; manual time = 0.008 sec
900000; manual time = 0.008 sec
950000; manual time = 0.008 sec
1000000; manual time = 0.008 sec

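In terms of speed, our hand-written pool looks about as good as the JVM's. However, with the heap limited to 1280 MB (-Xmx1280M) and a string table size of 1,000,003, the hand-written pool ran out of memory after about 2.5M strings, while the JVM pool accepted 12.72M strings, roughly 5 times as many. Clearly there is no point in reinventing the wheel.

String.intern() in Java 7u40 and later

Java 7u40 increased the default string pool size to 60013. This size lets the pool hold roughly 30,000 distinct strings before collisions start to matter, which is usually enough for typical uses of intern().
You can check the default settings with -XX:+PrintFlagsFinal.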
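
Besides printing the flags from the command line (for example `java -XX:+PrintFlagsFinal -version` and searching the output for StringTableSize), the value can also be read from inside a running program. The following is a sketch that assumes a HotSpot-based JVM where the com.sun.management API is available; the class name StringTableSizeProbe and the -1 fallback are choices made for this example.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class StringTableSizeProbe {
    /**
     * Returns the current StringTableSize, or -1 if it cannot be read
     * (for example on a JVM that is not HotSpot-based).
     */
    public static long readStringTableSize() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            return Long.parseLong(bean.getVMOption("StringTableSize").getValue());
        } catch (RuntimeException | LinkageError e) {
            // Flag or management interface not available on this JVM.
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("StringTableSize = " + readStringTableSize());
    }
}
```

Running it once with and once without -XX:StringTableSize=N is a quick way to confirm that the option was picked up.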

Test code

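Below is the test code I used. One method creates and interns strings in a loop and records the time spent on every 10,000 interned strings.
When running the test, it is best to pass -verbose:gc so you can watch garbage collection. You can also vary the maximum heap size with -Xmx.
The code contains two tests:
1. testStringPoolGarbageCollection mainly shows that the string pool can be garbage collected, and it also records the time spent in intern(). On Java 6 this test fails because of the fixed-size PermGen.
2. The second test shows how many strings the pool can hold. On Java 6, try two settings, -Xmx128m and -Xmx1280m (10 times larger, or more). Because PermGen has a fixed size, the result should be the same for both. On Java 7 the test should keep running until the heap is full.

I typed the code below by hand, so it may not compile as is, but it is simple and any mistakes should be easy to fix.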

// Testing String.intern.
// Run this class with at least the -verbose:gc JVM parameter.
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.WeakHashMap;

public class InterTest {
    public static void main(String[] args) {
        testStringPoolGarbageCollection();
        testLongLoop();
    }

    // Use this method to see where interned strings are stored
    // and how many of them fit in the given heap size.
    private static void testLongLoop() {
        test(1000 * 1000 * 1000);
        // Uncomment the following line to see the hand-written cache performance.
        // testManual(1000 * 1000 * 1000);
    }

    // Use this method to check that unused interned strings are garbage collected.
    private static void testStringPoolGarbageCollection() {
        // First method call: use it as a reference.
        test(1000 * 1000);
        // We are going to clean the cache here.
        System.gc();
        // Check the memory consumption and how long it takes to intern strings
        // in the second method call.
        test(1000 * 1000);
    }

    private static void test(final int cnt) {
        final List<String> list = new ArrayList<>(100);
        long start = System.currentTimeMillis();
        for (int i = 0; i < cnt; i++) {
            final String str = "very long test string, which tells you about something "
                    + "very-very important, definitely deserving to be interned #" + i;
            // Uncomment the following line to test the dependency on string length.
            // final String str = Integer.toString(i);
            list.add(str.intern());
            if (i % 10000 == 0) {
                System.out.println(i + "; time = "
                        + (System.currentTimeMillis() - start) / 1000.0 + " sec");
                start = System.currentTimeMillis();
            }
        }
        System.out.println("Total length = " + list.size());
    }

    private static final WeakHashMap<String, WeakReference<String>> s_manualCache =
            new WeakHashMap<>(100000);

    private static String manualIntern(final String str) {
        final WeakReference<String> cached = s_manualCache.get(str);
        if (cached != null) {
            final String value = cached.get();
            if (value != null) {
                return value;
            }
        }
        s_manualCache.put(str, new WeakReference<String>(str));
        return str;
    }

    private static void testManual(final int cnt) {
        final List<String> list = new ArrayList<>(100);
        long start = System.currentTimeMillis();
        for (int i = 0; i < cnt; i++) {
            final String str = "very long test string, which tells you about something "
                    + "very-very important, definitely deserving to be interned #" + i;
            list.add(manualIntern(str));
            if (i % 10000 == 0) {
                System.out.println(i + "; manual time = "
                        + (System.currentTimeMillis() - start) / 1000.0 + " sec");
                start = System.currentTimeMillis();
            }
        }
        System.out.println("Total length = " + list.size());
    }
}

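Summary

Do not use String.intern() on Java 6: the fixed-size PermGen can cause an OOM.
Java 7 and 8 moved the string pool to the heap, so the pool can use the whole heap until the heap is full.
Set -XX:StringTableSize appropriately: the string pool is implemented as a hash map, so its size affects performance.
The default pool size is 1009 in Java 6 and grew to 60013 from 7u40 onward.
To see how the string pool is being used, pass -XX:+PrintStringTableStatistics. With this option enabled, the JVM prints string pool usage statistics when the program exits.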

0 0