Unicode and UTF-8

Source: Internet. Editor: 程序博客网. Date: 2024/06/05 00:40

UTF-8 is a character encoding capable of encoding all possible Unicode code points. The encoding is defined by the Unicode standard.
In other words, Wikipedia describes UTF-8 as a way of encoding characters that can represent every Unicode code point.

UTF-8 is an encoding - Unicode is a character set

Unicode and UTF-8 cannot really be compared directly: Unicode is a character set, while UTF-8 is an encoding.

A character set is a list of characters with unique numbers (these numbers are sometimes referred to as “code points”). For example, in the Unicode character set, the number for A is 65, written U+0041 in hexadecimal.
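As a quick check of the code point for A, here is a minimal Python sketch. It relies only on the built-in `ord()` and `chr()` functions; the variable name `code_point` is just an illustrative choice.

```python
# Map a character to its Unicode code point and back again.
code_point = ord("A")

print(code_point)             # 65 (decimal)
print(hex(code_point))        # 0x41 (hexadecimal)
print(f"U+{code_point:04X}")  # U+0041 (the usual Unicode notation)
print(chr(0x41))              # A
```

The decimal and hexadecimal forms name the same code point; the "41" that often appears in tables is the hexadecimal spelling.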

So a character set is simply a collection of characters in which each character has its own unique number (its code point).

An encoding, on the other hand, is an algorithm that translates a list of numbers to binary so it can be stored on disk. For example, UTF-8 would translate the number sequence 1, 2, 3, 4 like this:

00000001 00000010 00000011 00000100

An encoding, then, is the algorithm or specification that converts those numbers to binary so that they can be stored or transmitted.
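The 1, 2, 3, 4 example can be reproduced with a short Python sketch. It assumes the numbers are treated as code points (here they happen to be control characters); the variable names are illustrative only.

```python
# Treat each number as a Unicode code point and build a string from them.
code_points = [1, 2, 3, 4]
text = "".join(chr(cp) for cp in code_points)

# Encode the string with UTF-8 and show every byte as 8 binary digits.
encoded = text.encode("utf-8")
bits = " ".join(f"{byte:08b}" for byte in encoded)

print(bits)  # 00000001 00000010 00000011 00000100
```

Each of these code points is below 128, so UTF-8 stores it in a single byte whose value equals the code point.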

Here is an example.
Say an application reads the following from the disk:

01101000 01100101 01101100 01101100 01101111
The app knows this data represents a Unicode string encoded with UTF-8 and must show it as text to the user. The first step is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:

104 101 108 108 111
Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. The resulting string is “hello”.

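The binary data the application reads is first decoded with UTF-8 into Unicode code points, and each code point is then mapped to its character.
What the user sees are Unicode characters; what is stored on disk is the binary produced by encoding those characters' code points with UTF-8.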
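The decoding steps in this example can be sketched in Python as follows. This is only an illustration under the assumption that the bytes arrive as a string of binary digits; `raw_bits` and the other names are made up for the sketch.

```python
# The five bytes "read from disk", written as 8-bit binary strings.
raw_bits = "01101000 01100101 01101100 01101100 01101111"
data = bytes(int(chunk, 2) for chunk in raw_bits.split())

# Step 1: decode the bytes with UTF-8 to get a Unicode string.
text = data.decode("utf-8")

# Step 2: inspect the code point behind each character.
code_points = [ord(ch) for ch in text]

print(code_points)  # [104, 101, 108, 108, 111]
print(text)         # hello

# Going the other way (what gets written to disk) is the encode step.
assert text.encode("utf-8") == data
```

In practice `decode` performs both steps at once; the code points are listed separately here only to mirror the explanation.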

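The encodings offered in Windows are shown in the figure below. By the analysis above, Unicode is a character set, so why is "Unicode" also listed as an encoding?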

[Figure: the encoding options offered in Windows]

Here is the explanation:

“Unicode” is unfortunately used in various different ways, depending on the context. Its most correct use (IMO) is as a coded character set - i.e. a set of characters and a mapping between the characters and integer code points representing them.

UTF-8 is a character encoding - a way of converting from sequences of bytes to sequences of characters and vice versa. It covers the whole of the Unicode character set. ASCII is encoded as a single byte per character, and other characters take more bytes depending on their exact code point (up to 4 bytes for all currently defined code points, i.e. up to U+10FFFF, and indeed 4 bytes could cope with up to U+1FFFFF).
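To see how the byte count grows with the code point, here is a small Python sketch. The sample characters are arbitrary picks, one from each length class, and `lengths` is just an illustrative name.

```python
# One sample character per UTF-8 length class, plus the highest code point.
samples = ["A", "é", "中", "😀", chr(0x10FFFF)]

lengths = {}
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths[ord(ch)] = len(encoded)
    hex_bytes = " ".join(f"{byte:02x}" for byte in encoded)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s): {hex_bytes}")

# U+0041: 1 byte(s): 41
# U+00E9: 2 byte(s): c3 a9
# U+4E2D: 3 byte(s): e4 b8 ad
# U+1F600: 4 byte(s): f0 9f 98 80
# U+10FFFF: 4 byte(s): f4 8f bf bf
```

ASCII stays at one byte, and even the largest valid code point, U+10FFFF, fits in four.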

When “Unicode” is used as the name of a character encoding (e.g. as the .NET Encoding.Unicode property) it usually means UTF-16, which encodes most common characters as two bytes. Some platforms (notably .NET and Java) use UTF-16 as their “native” character encoding. This leads to hairy problems if you need to worry about characters which can’t be encoded in a single UTF-16 value (they’re encoded as “surrogate pairs”) - but most developers never worry about this, IME.
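The surrogate pair behaviour can be observed from Python as well. The sketch below is only an illustration: `utf16_units` is a hypothetical helper written for this example, and it uses the little-endian form of UTF-16 (`"utf-16-le"`) so that no byte order mark is added.

```python
import struct


def utf16_units(text):
    """Return the 16-bit UTF-16 code units of text (hypothetical helper)."""
    data = text.encode("utf-16-le")  # little-endian, no byte order mark
    return list(struct.unpack(f"<{len(data) // 2}H", data))


# A common character fits in a single 16-bit unit (two bytes).
print([hex(unit) for unit in utf16_units("A")])   # ['0x41']

# A character above U+FFFF needs two units: a surrogate pair.
print([hex(unit) for unit in utf16_units("😀")])  # ['0xd83d', '0xde00']
```

Code that counts UTF-16 units rather than characters would report a length of 2 for the second string, which is where the "hairy problems" tend to start.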
