The Difference Between Unicode and UTF-8

Source: Internet   Editor: 程序博客网   Time: 2024/06/06 08:54


As Rasmus states in his article “The difference between UTF-8 and Unicode?” (link fixed):

Translation: As Rasmus writes in his article "What is the difference between UTF-8 and Unicode?":

If asked the question, “What is the difference between UTF-8 and Unicode?”, would you confidently reply with a short and precise answer? In these days of internationalization all developers should be able to do that. I suspect many of us do not understand these concepts as well as we should. If you feel you belong to this group, you should read this ultra short introduction to character sets and encodings.

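Translation: If someone asked you "What is the difference between UTF-8 and Unicode?", could you confidently give a short, accurate answer? In this age of internationalization, every developer should be able to. I suspect, though, that many of us do not understand these concepts as well as we should. If you count yourself among them, read this very short introduction to character sets and encodings.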

Actually, comparing UTF-8 and Unicode is like comparing apples and oranges:

UTF-8 is an encoding - Unicode is a character set

Translation: Comparing UTF-8 with Unicode is really like comparing apples with oranges. UTF-8 is an encoding; Unicode is a character set.

A character set is a list of characters with unique numbers (these numbers are sometimes referred to as “code points”). For example, in the Unicode character set, the number for A is 65, which is 41 in hexadecimal and is usually written U+0041.

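Translation: A character set assigns every character a unique number (these numbers are sometimes called "code points"). For example, in the Unicode character set the number for A is 65 (hexadecimal 41, written U+0041).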
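
The character-to-number mapping can be checked directly. The following is a minimal Python sketch that uses the built-in ord() and chr() functions; the variable names are only illustrative.

```python
# A character set maps characters to numbers (code points) and back.
code_point = ord("A")            # character -> code point
print(code_point)                # 65 (decimal)
print(hex(code_point))           # 0x41 (hexadecimal)
print(f"U+{code_point:04X}")     # U+0041 (conventional Unicode notation)

character = chr(0x41)            # code point -> character
print(character)                 # A
```

Note that no bytes are involved yet: at this stage there are only characters and their numbers.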

An encoding on the other hand, is an algorithm that translates a list of numbers to binary so it can be stored on disk. For example UTF-8 would translate the number sequence 1, 2, 3, 4 like this:

00000001 00000010 00000011 00000100
Our data is now translated into binary and can now be saved to disk.

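Translation: An encoding, by contrast, is an algorithm that translates a list of numbers into binary so that it can be stored on disk. For example, UTF-8 translates the number sequence 1, 2, 3, 4 into the four bytes 00000001 00000010 00000011 00000100.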

Our data has now been converted to binary and can be saved to disk.
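
The encoding step can be sketched in Python as follows. This is a small illustration rather than a full treatment of UTF-8, and the variable names are assumptions. It also shows a detail the 1, 2, 3, 4 example hides: UTF-8 uses one byte per number only for code points below 128, and more bytes for larger ones.

```python
# Turn the code points 1, 2, 3, 4 into a string, then encode it as UTF-8.
code_points = [1, 2, 3, 4]
text = "".join(chr(cp) for cp in code_points)
encoded = text.encode("utf-8")

# Show each byte as 8 binary digits.
bits = " ".join(f"{byte:08b}" for byte in encoded)
print(bits)          # 00000001 00000010 00000011 00000100

# A code point of 128 or more takes more than one byte in UTF-8.
e_acute = "\u00e9"   # the character é, code point 233
e_bits = " ".join(f"{byte:08b}" for byte in e_acute.encode("utf-8"))
print(e_bits)        # 11000011 10101001
```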

All together now

Say an application reads the following from the disk:

01101000 01100101 01101100 01101100 01101111
The app knows this data represents a Unicode string encoded with UTF-8 and must show it as text to the user. The first step is to convert the binary data to numbers. The app uses the UTF-8 algorithm to decode the data. In this case, the decoder returns this:

104 101 108 108 111
Since the app knows this is a Unicode string, it can assume each number represents a character. We use the Unicode character set to translate each number to a corresponding character. The resulting string is “hello”.

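Translation: All together now
Say an application reads the following from disk:
01101000 01100101 01101100 01101100 01101111
The app knows that this data is a Unicode string encoded with UTF-8 and that it must be shown to the user as text. The first step is to convert the binary data to numbers, so the app decodes the data with the UTF-8 algorithm. In this case the decoder returns:
104 101 108 108 111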

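Because the app knows this is a Unicode string, it can treat each number as one character. We use the Unicode character set to translate each number into its corresponding character. The resulting string is "hello".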
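
The two decoding steps described above can be reproduced with a short Python sketch. The bit string is the one from the example, written here as full 8-bit bytes; the variable names are illustrative.

```python
# The bytes read from disk, written as binary digits.
raw_bits = "01101000 01100101 01101100 01101100 01101111"

# Rebuild the raw bytes from the binary digits.
data = bytes(int(chunk, 2) for chunk in raw_bits.split())

# Step 1: the UTF-8 decoder turns bytes into code points.
# Step 2: the Unicode character set maps code points to characters.
# Python's decode() performs both and returns a string.
text = data.decode("utf-8")

code_points = [ord(ch) for ch in text]
print(code_points)   # [104, 101, 108, 108, 111]
print(text)          # hello
```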

Conclusion

So when somebody asks you “What is the difference between UTF-8 and Unicode?”, you can now confidently give a short and precise answer:

UTF-8 and Unicode cannot be compared. UTF-8 is an encoding used to translate numbers into binary data. Unicode is a character set used to translate characters into numbers.
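Translation:
Conclusion
So when someone asks you about the difference between UTF-8 and Unicode, you can give a short, accurate answer with confidence.
UTF-8 and Unicode cannot be compared. UTF-8 is an encoding, used to translate numbers into binary data. Unicode is a character set, used to translate characters into numbers.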

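My own understanding:
Storing Chinese characters on disk:

text --character set--> numbers (code points) --encoding--> binary --> disk

binary data on disk --decoding--> numbers (code points) --character set--> text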
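
The round trip above can be sketched in Python for two Chinese characters. This is a minimal illustration under stated assumptions: the sample text, the temporary file name sample.txt, and the variable names are all hypothetical choices made for the example.

```python
import os
import tempfile

text = "汉字"

# Character set: characters -> numbers (code points).
code_points = [ord(ch) for ch in text]
print([f"U+{cp:04X}" for cp in code_points])   # ['U+6C49', 'U+5B57']

# Encoding: numbers -> bytes. Each of these characters takes 3 bytes in UTF-8.
encoded = text.encode("utf-8")
print(encoded.hex(" "))                        # e6 b1 89 e5 ad 97

# Write the bytes to disk and read them back (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "wb") as f:
        f.write(encoded)
    with open(path, "rb") as f:
        raw = f.read()

# Decoding: bytes -> numbers -> characters.
restored = raw.decode("utf-8")
print(restored == text)                        # True
```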
