code point

Source: Internet  Editor: 程序博客网  Date: 2024/05/16 15:56
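A complete Unicode character corresponds to one code point.
A Java char is a code unit (a single 16-bit UTF-16 unit); one code point may need one or two of them.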
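To make the difference concrete, here is a minimal sketch that counts both for the same string. The class name CodeUnitDemo and its helper methods are made up for illustration; the String methods they call (length, codePointCount, codePointAt) are standard.

```java
class CodeUnitDemo {
    // Number of UTF-16 code units (Java chars) in the string.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // "a" followed by U+1D11E (MUSICAL SYMBOL G CLEF), written as a surrogate pair.
        String s = "a\uD834\uDD1E";
        System.out.println(codeUnits(s));   // 3
        System.out.println(codePoints(s));  // 2
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d11e
    }
}
```

The string holds two characters but three chars, so length() alone is not a reliable character count once supplementary characters are involved.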
The Unicode standard was originally designed as a fixed-width 16-bit character
encoding. It has since been changed to allow for characters whose representation
requires more than 16 bits. The range of legal code points is now U+0000 to
U+10FFFF, using the hexadecimal U+n notation. Characters whose code points are
greater than U+FFFF are called supplementary characters. To represent the complete
range of characters using only 16-bit units, the Unicode standard defines an
encoding called UTF-16. In this encoding, supplementary characters are represented
as pairs of 16-bit code units, the first from the high-surrogates range,
(U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to
U+DFFF). For characters in the range U+0000 to U+FFFF, the values of code points
and UTF-16 code units are the same.
The Java programming language represents text in sequences of 16-bit code
units, using the UTF-16 encoding. A few APIs, primarily in the Character class,
use 32-bit integers to represent code points as individual entities. The Java platform
provides methods to convert between the two representations.
(From the Java Language Specification, Third Edition, section 3.1)
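The surrogate-pair encoding described in the quote can be sketched by hand and compared with the conversion methods in the Character class. The class SurrogateDemo and the method toSurrogates are hypothetical names; the arithmetic (subtract 0x10000, then split the remaining 20 bits into two 10-bit halves) is the standard UTF-16 rule.

```java
import java.util.Arrays;

class SurrogateDemo {
    // Split a supplementary code point into its UTF-16 surrogate pair by hand.
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                 // 20-bit value
        char high = (char) (0xD800 + (offset >>> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));    // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        int cp = 0x1D11E;
        char[] manual = toSurrogates(cp);
        char[] library = Character.toChars(cp);           // code point -> code units
        System.out.println(Arrays.equals(manual, library));                    // true
        System.out.println(Character.toCodePoint(manual[0], manual[1]) == cp); // true
    }
}
```

In real code, Character.toChars and Character.toCodePoint should be preferred; the hand-written version is only there to show where the ranges U+D800 to U+DBFF and U+DC00 to U+DFFF come from.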
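An int value can represent every Unicode code point, including supplementary code points. The lower (least significant) 21 bits of the int hold the Unicode code point, and the upper (most significant) 11 bits must be zero.
Why are 21 bits enough?
The range of legal code points is now U+0000 to U+10FFFF.
Characters whose code points are greater than U+FFFF are called supplementary characters; their range is 0x10000 to 0x10FFFF. In binary, these two bounds are: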
0x010000 = 0000 0001 0000 0000 0000 0000
0x10FFFF = 0001 0000 1111 1111 1111 1111
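So supplementary characters, and therefore all code points, use only the low 21 bits of an int: the highest set bit of 0x10FFFF is bit 20 (counting from 0), which leaves the 11 high bits always zero.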
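The same bit count can be checked in code. This is a small sketch; CodePointBits, bitLength, and fitsIn21Bits are illustrative names, while Integer.numberOfLeadingZeros, Character.MAX_CODE_POINT, and Character.isValidCodePoint are standard.

```java
class CodePointBits {
    // Number of bits needed to write a non-negative value in binary.
    static int bitLength(int value) {
        return 32 - Integer.numberOfLeadingZeros(value);
    }

    // True when the upper 11 bits of the int are all zero.
    static boolean fitsIn21Bits(int value) {
        return (value >>> 21) == 0;
    }

    public static void main(String[] args) {
        System.out.println(bitLength(0x10000));                     // 17
        System.out.println(bitLength(Character.MAX_CODE_POINT));    // 21
        System.out.println(fitsIn21Bits(Character.MAX_CODE_POINT)); // true
        System.out.println(Character.isValidCodePoint(0x110000));   // false
    }
}
```

Note that fitting in 21 bits is necessary but not sufficient: 0x110000 also fits in 21 bits, yet it lies above U+10FFFF and is not a valid code point, which is why Character.isValidCodePoint is the right check for validity.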