How String Stores Its Data Internally, and Unicode

Source: Internet  Posted by: windows airplay  Editor: 程序博客网  Time: 2024/06/05 17:05

This article examines the String class and, working from its source code, explains how Java stores a String internally.

Private fields of the String class

In the JDK source examined here, the String class keeps its characters in a char[]:

    /** The value is used for character storage. */
    private final char value[];

    /** Cache the hash code for the string */
    private int hash; // Default to 0

Constructors of the String class

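To find out what the char[] named value actually holds, it helps to look at a representative constructor of the String class, shown below. Its most important parameter is int[] codePoints; the other two parameters only select the starting offset and the number of elements to take from that array.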

    public String(int[] codePoints, int offset, int count) {
        if (offset < 0) {
            throw new StringIndexOutOfBoundsException(offset);
        }
        if (count <= 0) {
            if (count < 0) {
                throw new StringIndexOutOfBoundsException(count);
            }
            if (offset <= codePoints.length) {
                this.value = "".value;
                return;
            }
        }
        // Note: offset or count might be near -1>>>1.
        if (offset > codePoints.length - count) {
            throw new StringIndexOutOfBoundsException(offset + count);
        }

        final int end = offset + count;

        // Pass 1: Compute precise size of char[]
        int n = count;
        for (int i = offset; i < end; i++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))
                continue;
            else if (Character.isValidCodePoint(c))
                n++;
            else throw new IllegalArgumentException(Integer.toString(c));
        }

        // Pass 2: Allocate and fill in char[]
        final char[] v = new char[n];
        for (int i = offset, j = 0; i < end; i++, j++) {
            int c = codePoints[i];
            if (Character.isBmpCodePoint(c))
                v[j] = (char)c;
            else
                Character.toSurrogates(c, v, j++);
        }

        this.value = v;
    }
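As a rough illustration of how this constructor behaves from the caller's side, the sketch below builds strings from a small code point array. The class name CodePointConstructorDemo and the helper fromCodePoints are made up for this example; only the String(int[], int, int) constructor itself comes from the JDK.

```java
final class CodePointConstructorDemo {

    // Hypothetical helper: thin wrapper around the JDK constructor discussed above.
    static String fromCodePoints(int[] codePoints, int offset, int count) {
        return new String(codePoints, offset, count);
    }

    public static void main(String[] args) {
        // 'A' (U+0041), a code point outside the BMP (U+1F600), and '中' (U+4E2D)
        int[] cps = {0x41, 0x1F600, 0x4E2D};

        // All three code points: 1 + 2 + 1 chars
        System.out.println(fromCodePoints(cps, 0, 3).length()); // 4

        // Only the last two code points: 2 + 1 chars
        System.out.println(fromCodePoints(cps, 1, 2).length()); // 3

        try {
            // 0x110000 is above U+10FFFF, so it is not a valid code point
            fromCodePoints(new int[] {0x110000}, 0, 1);
        } catch (IllegalArgumentException e) {
            System.out.println("invalid code point rejected");
        }
    }
}
```

The resulting length is measured in char values, which is why three code points can produce a String of length 4.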

What the codePoints parameter is

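To understand the constructor above, we first need to know what a code point is. A code point is the number that Unicode assigns to one complete character; valid code points range from U+0000 to U+10FFFF. Because not every code point fits in 16 bits (a Java char is 16 bits), the codePoints parameter is an int array, and the constructor has to convert each code point before it can be stored in char value[].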
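The difference between counting char values and counting code points can be seen with a single character outside the 16-bit range. This is a minimal sketch; the class name CodePointVersusChar and the constant CLEF are illustrative only.

```java
final class CodePointVersusChar {

    // U+1D11E (MUSICAL SYMBOL G CLEF) does not fit in a single 16-bit char.
    static final String CLEF = new String(Character.toChars(0x1D11E));

    public static void main(String[] args) {
        // length() counts char values, not code points
        System.out.println(CLEF.length());                            // 2
        // codePointCount() counts whole code points
        System.out.println(CLEF.codePointCount(0, CLEF.length()));    // 1
        // codePointAt() reassembles the original code point
        System.out.println(Integer.toHexString(CLEF.codePointAt(0))); // 1d11e
    }
}
```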

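Unicode:
Unicode (known in Chinese as 统一码, 万国码 or 单一码) is an industry standard in computing that covers a character set, encoding schemes and more. It was created to overcome the limitations of traditional character encodings: it assigns every character of every language one unified, unique code so that text can be converted and processed across languages and platforms. Version 1.0 of the standard was published in 1991 (Wikipedia).
Put simply, Unicode is a character coding that can be used worldwide: it gives every character a unique number, written as U+ followed by a group of hexadecimal digits, for example U+4E2D for 中.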

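BMP (Basic Multilingual Plane):
For this article it is enough to know that the BMP is a range of code points, U+0000 to U+FFFF. A code point inside the BMP can be written with four hexadecimal digits (16 bits); a code point outside the BMP needs more than four hexadecimal digits, up to U+10FFFF.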
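The BMP boundary can be probed directly with Character.isBmpCodePoint, the same check the constructor uses. In this sketch the class BmpCheck and the method charsNeeded are hypothetical names, and the method assumes its argument is already a valid code point.

```java
final class BmpCheck {

    // Hypothetical helper: number of char values needed to store a valid code point.
    static int charsNeeded(int codePoint) {
        return Character.isBmpCodePoint(codePoint) ? 1 : 2;
    }

    public static void main(String[] args) {
        System.out.println(charsNeeded(0x4E2D));  // 1, inside the BMP
        System.out.println(charsNeeded(0xFFFF));  // 1, last code point of the BMP
        System.out.println(charsNeeded(0x10000)); // 2, first code point beyond the BMP
        System.out.println(charsNeeded(0x1F600)); // 2
    }
}
```

For valid code points the JDK offers the same answer through Character.charCount(int).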

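Converting codePoints to char[]:
The core of the conversion is shown below. The key step is deciding whether codePoints[i] lies inside the BMP. If it does, a single char can hold it; otherwise it takes two chars, which Character.toSurrogates(c, v, j++) writes. The j++ inside that call, together with the j++ of the loop, moves j forward by two positions for such a code point.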

    // Pass 2: Allocate and fill in char[]
    final char[] v = new char[n];
    for (int i = offset, j = 0; i < end; i++, j++) {
        int c = codePoints[i];
        if (Character.isBmpCodePoint(c))
            v[j] = (char)c;
        else
            Character.toSurrogates(c, v, j++);
    }

    this.value = v;

    /*********************************************************/
    // Character.toSurrogates(c, v, j++);
    static void toSurrogates(int codePoint, char[] dst, int index) {
        // We write elements "backwards" to guarantee all-or-nothing
        dst[index+1] = lowSurrogate(codePoint);
        dst[index] = highSurrogate(codePoint);
    }
    /*********************************************************/
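The source above does not show what highSurrogate and lowSurrogate compute. The sketch below spells out the standard UTF-16 arithmetic (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves) and compares it with the public JDK methods. The class SurrogateMath and its methods high and low are illustrative names, not the JDK's own implementation, and they assume a code point above U+FFFF.

```java
final class SurrogateMath {

    // Upper 10 bits of (codePoint - 0x10000), offset into the high-surrogate range.
    static char high(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >>> 10));
    }

    // Lower 10 bits of (codePoint - 0x10000), offset into the low-surrogate range.
    static char low(int codePoint) {
        return (char) (0xDC00 + ((codePoint - 0x10000) & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        System.out.println(Integer.toHexString(high(cp))); // d83d
        System.out.println(Integer.toHexString(low(cp)));  // de00
        // Both values agree with the JDK's public helpers
        System.out.println(high(cp) == Character.highSurrogate(cp)); // true
        System.out.println(low(cp) == Character.lowSurrogate(cp));   // true
    }
}
```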

Surrogate Pair. A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit, and the second is a low-surrogate code unit. (See definition D75 in Section 3.8, Surrogates.)

High-Surrogate Code Unit. A 16-bit code unit in the range D800₁₆ to DBFF₁₆, used in UTF-16 as the leading code unit of a surrogate pair. Also known as a leading surrogate. (See definition D72 in Section 3.8, Surrogates.)

Low-Surrogate Code Unit. A 16-bit code unit in the range DC00₁₆ to DFFF₁₆, used in UTF-16 as the trailing code unit of a surrogate pair. Also known as a trailing surrogate. (See definition D74 in Section 3.8, Surrogates.)
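Going the other way, these ranges make it possible to recognise a surrogate pair in a char[] and recombine it into one code point. The following is a minimal sketch using the JDK's Character.isHighSurrogate, Character.isLowSurrogate and Character.toCodePoint; the class SurrogateDecode and its method codePointAt are hypothetical names chosen for this example.

```java
final class SurrogateDecode {

    // Hypothetical helper: reads the code point that starts at index i.
    // A high surrogate followed by a low surrogate is merged into one code point;
    // anything else is returned as the single char value.
    static int codePointAt(char[] chars, int i) {
        char c = chars[i];
        if (Character.isHighSurrogate(c)
                && i + 1 < chars.length
                && Character.isLowSurrogate(chars[i + 1])) {
            return Character.toCodePoint(c, chars[i + 1]);
        }
        return c;
    }

    public static void main(String[] args) {
        char[] chars = "A\uD83D\uDE00".toCharArray();
        System.out.println(Integer.toHexString(codePointAt(chars, 0))); // 41
        System.out.println(Integer.toHexString(codePointAt(chars, 1))); // 1f600
    }
}
```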

Summary

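Internally, String stores a Unicode string as a char array. Because a char is 16 bits wide, the content is in effect UTF-16 encoded. That does not mean one char always holds one character: a character outside the BMP is stored as two chars, a surrogate pair.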
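One practical consequence is that code which must visit whole characters should step through a String by code point rather than by char. A minimal sketch of that loop follows; the class CodePointIteration and the method codePointsOf are illustrative names, while codePointAt and Character.charCount are existing JDK methods.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointIteration {

    // Hypothetical helper: collects the code points of a String in order.
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            result.add(cp);
            // Advance by 1 char for BMP code points, 2 for surrogate pairs
            i += Character.charCount(cp);
        }
        return result;
    }

    public static void main(String[] args) {
        String s = "A\uD83D\uDE00";
        System.out.println(s.length());             // 3 chars
        System.out.println(codePointsOf(s).size()); // 2 code points
    }
}
```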

0 0