Unicode and UTF-8

UTF-8 is a character encoding capable of encoding all possible Unicode code points. The encoding is defined by the Unicode standard.
Wikipedia describes UTF-8 as a character encoding that can encode every Unicode code point.

UTF-8 is an encoding - Unicode is a character set

Unicode and UTF-8 cannot really be compared directly: Unicode is a character set, while UTF-8 is an encoding.

A character set is a list of characters with unique numbers (these numbers are sometimes referred to as “code points”). For example, in the Unicode character set, the number for A is 41 in hexadecimal (U+0041), which is 65 in decimal.

A character set, then, is a collection of characters in which every character has its own unique number (its code point).
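
As a small illustration of the character-to-number mapping, here is a sketch using Python's built-in ord() and chr(). The sample characters are chosen only for demonstration.

```python
# ord() returns the Unicode code point of a character; chr() is the inverse.
code_point = ord("A")
print(code_point)       # 65
print(hex(code_point))  # 0x41

# The same mapping works for characters outside ASCII.
han = chr(0x4E2D)
print(han)              # 中
```

Note that nothing here involves bytes yet: a code point is just an integer, and how it is stored is the job of an encoding.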

An encoding, on the other hand, is an algorithm that translates a list of numbers to binary so it can be stored on disk. For example, UTF-8 would translate the number sequence 1, 2, 3, 4 like this:

00000001 00000010 00000011 00000100 

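An encoding, by contrast, is the algorithm or specification that converts those numbers to binary so that they can be stored or transmitted.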
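
The 1, 2, 3, 4 example can be reproduced with a short sketch. The variable names are illustrative; the only library call assumed is Python's standard str.encode.

```python
# Build a string from the code points 1, 2, 3, 4, then encode it as UTF-8.
code_points = [1, 2, 3, 4]
text = "".join(chr(cp) for cp in code_points)
encoded = text.encode("utf-8")

# Show each resulting byte as 8 binary digits.
bits = " ".join(f"{byte:08b}" for byte in encoded)
print(bits)  # 00000001 00000010 00000011 00000100
```

Each of these code points is below 128, so UTF-8 stores each one in a single byte whose value equals the code point.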

An example:
Say an application reads the following from the disk:

01101000 01100101 01101100 01101100 01101111
The app knows this data represents a Unicode string encoded with UTF-8 and must show it as text to the user. The first step is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:

104 101 108 108 111
Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. The resulting string is “hello”.

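The application first decodes the binary data it reads with UTF-8 to obtain the Unicode code points, and then maps each code point to its character.
What the user sees is Unicode characters; what is stored on disk is the binary produced by encoding those characters' code points with UTF-8.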


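The figure below shows the encoding options offered on Windows. According to the analysis above, Unicode is a character set, so why is there also an encoding named "Unicode"?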

[Figure: the encoding options on Windows]

Here is the explanation:

“Unicode” is unfortunately used in various different ways, depending on the context. Its most correct use (IMO) is as a coded character set - i.e. a set of characters and a mapping between the characters and integer code points representing them.

UTF-8 is a character encoding - a way of converting from sequences of bytes to sequences of characters and vice versa. It covers the whole of the Unicode character set. ASCII is encoded as a single byte per character, and other characters take more bytes depending on their exact code point (up to 4 bytes for all currently defined code points, i.e. up to U+10FFFF, and indeed the 4-byte form has room for values up to U+1FFFFF).
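
To make the variable-length behaviour concrete, here is a minimal sketch that measures the UTF-8 length of a few sample characters. The samples are arbitrary choices, one from each length class.

```python
# One sample per UTF-8 length class, written as escapes to avoid ambiguity.
samples = ["A", "\u00e9", "\u4e2d", "\U0001F600"]  # A, é, 中, 😀

# Map each character to the number of bytes in its UTF-8 encoding.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} -> {n} byte(s)")

# The highest code point, U+10FFFF, still fits in 4 bytes.
max_len = len(chr(0x10FFFF).encode("utf-8"))
print(max_len)  # 4
```

The lengths come out as 1, 2, 3 and 4 bytes respectively, which matches the rule that ASCII takes one byte and higher code points take more.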

When “Unicode” is used as the name of a character encoding (e.g. as the .NET Encoding.Unicode property) it usually means UTF-16, which encodes most common characters as two bytes. Some platforms (notably .NET and Java) use UTF-16 as their “native” character encoding. This leads to hairy problems if you need to worry about characters which can’t be encoded in a single UTF-16 value (they’re encoded as “surrogate pairs”) - but most developers never worry about this, IME.
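
The surrogate-pair behaviour can be observed from Python, used here only as a neutral way to inspect UTF-16 output (Python's own str type is not UTF-16 based). The "utf-16-le" codec is chosen so that no byte order mark is added; the sample characters are illustrative.

```python
bmp_char = "\u4e2d"         # 中, fits in a single 16-bit unit
astral_char = "\U0001F600"  # 😀, lies above U+FFFF

# Count 16-bit code units by dividing the encoded byte length by 2.
bmp_units = len(bmp_char.encode("utf-16-le")) // 2
astral_bytes = astral_char.encode("utf-16-le")
astral_units = len(astral_bytes) // 2
print(bmp_units, astral_units)  # 1 2

# Split the 4 bytes into the two surrogate code units.
high = int.from_bytes(astral_bytes[0:2], "little")
low = int.from_bytes(astral_bytes[2:4], "little")
print(hex(high), hex(low))      # 0xd83d 0xde00
```

A platform that counts string length in UTF-16 code units would therefore report a length of 2 for that single character, which is the kind of problem the passage alludes to.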