UCS-2 and UTF-16

Source: Internet  Editor: 程序博客网  Time: 2024/06/01 07:47

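While working on an issue in an internationalized release recently, I ran into the UCS-2 encoding. I had often heard of Unicode's UTF-8, UTF-16, and UTF-32, but this was the first time I had seen UCS-2 used in a real project.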

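After some digging, I found that this is largely a naming difference between two standards bodies. Although the content is the same, neither name was retired, and whenever the standards change, both are updated together so that the encodings stay consistent. The original text is quoted below.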

-------------------------------------------------------------------------------------------------------------

UTF16 and UCS2 are one and the same.
The mess comes from the fact that there are two different groups
dealing with the Unicode standard:
- The Unicode Consortium
- ISO 10646

They work a lot to keep the two standards in sync
(don't ask me why they don't merge)
But they don't fight so much to keep the same terminology.
The UTF is the Unicode Consortium terminology (Unicode Transformation Format)
UCS is the ISO terminology (Universal Character Set)

On the Unicode web site, in the glossary, the UTF entries now mention UCS:
"UTF-16. Unicode (or UCS) Transformation Format, 16-bit encoding form."
(http://www.unicode.org/glossary/index.html#UTF_16)

The one thing to keep in mind is that both standards are moving.
UCS and UTF are just encodings of the character tables.
Windows NT likes to use the ISO terminology to show the compliance with the
international standards.

---------------------------------------------------------------------------------------------------------------

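If you are still not convinced and suspect the two encodings differ, open a UTF-16 file and a UCS-2 file in a hex editor: both start with the BOM FF FE (for little-endian files), and for text within the Basic Multilingual Plane the bytes that follow are identical as well. The one real difference is that UTF-16 can encode code points above U+FFFF as surrogate pairs, which fixed-width UCS-2 cannot represent.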
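The comparison can also be done in code rather than in a hex editor. The following is a minimal Python sketch, not a reference implementation. Python ships no UCS-2 codec, so `encode_ucs2_le` is a hypothetical helper written here for illustration; it assumes UCS-2 means "one 16-bit unit per code point, little-endian, with a BOM". `encode_utf16_le_with_bom` is likewise just an illustrative wrapper around the standard `utf-16-le` codec.

```python
import codecs
import struct


def encode_ucs2_le(text: str) -> bytes:
    """Hypothetical helper: UCS-2 little-endian with a leading BOM.

    Assumes UCS-2 stores each code point as exactly one 16-bit unit,
    so anything above U+FFFF is rejected.
    """
    out = bytearray(codecs.BOM_UTF16_LE)  # FF FE
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            raise ValueError(f"U+{cp:04X} is outside the BMP; UCS-2 cannot encode it")
        out += struct.pack("<H", cp)  # one little-endian 16-bit unit
    return bytes(out)


def encode_utf16_le_with_bom(text: str) -> bytes:
    """UTF-16 little-endian via the standard codec, with the BOM prepended."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


if __name__ == "__main__":
    sample = "ucs2 和 utf16"
    print(encode_ucs2_le(sample).hex(" "))
    print(encode_utf16_le_with_bom(sample).hex(" "))
```

For BMP-only text such as the sample string, the two functions return the same bytes. A character such as U+1F600 takes four bytes (a surrogate pair) in UTF-16, while the UCS-2 helper raises `ValueError`.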

---------------------------------------------------------------------------------------------------------------

(A quick side note on the BOM)

BOM: Byte Order Mark

It is a byte order mark: a short marker at the very start of a file, also called a signature.

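The BOM signature tells an editor which encoding (and byte order) the file uses, which makes detection easier. Although editors do not display the BOM, it is still part of the file's content, so a program that does not strip it will pass it through to its output, where it can look like an extra blank line.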
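A small Python sketch of how the BOM ends up in the output, assuming a UTF-8 file that begins with a BOM (the byte string `data` below is made up for illustration). The standard `utf-8` codec keeps the BOM as the character U+FEFF, while `utf-8-sig` strips it if present.

```python
import codecs

# Made-up file content: a UTF-8 BOM followed by one line of text.
data = codecs.BOM_UTF8 + "hello\n".encode("utf-8")

plain = data.decode("utf-8")         # BOM survives as U+FEFF
stripped = data.decode("utf-8-sig")  # BOM is removed if present

print(repr(plain))
print(repr(stripped))
```

The leftover U+FEFF is invisible in most editors, but it is a real character in the decoded string and is written out again unless the program removes it.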

Ordinary (legacy) encodings do not use a BOM; it appears only in the Unicode encodings.

Common BOMs:

UTF-8    ║ EF BB BF

UTF-16LE ║ FF FE (little-endian)

UTF-16BE ║ FE FF (big-endian)

UTF-32LE ║ FF FE 00 00

UTF-32BE ║ 00 00 FE FF

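If a UTF-16 or UTF-32 file has no BOM, many parsers fall back to treating it as the ANSI (system default) encoding, and the text comes out garbled. UTF-8, by contrast, can usually be recognized whether or not a BOM is present.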
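BOM-based detection can be sketched in a few lines of Python. This is only an illustration: `detect_bom` and `decode_with_bom` are hypothetical names, and falling back to UTF-8 when no BOM is found is an assumption made here, not something every parser does. The UTF-32 signatures are checked first because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).

```python
import codecs

# Longest signatures first: FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (codec name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data: bytes, fallback: str = "utf-8") -> str:
    """Decode using the BOM if present; otherwise use the assumed fallback."""
    name, skip = detect_bom(data)
    return data[skip:].decode(name or fallback)


if __name__ == "__main__":
    print(detect_bom(b"\xff\xfeA\x00"))
    print(decode_with_bom(b"\xfe\xff\x00A"))
```

One ambiguity remains: a UTF-16LE file whose first character after the BOM is U+0000 has the same leading bytes as a UTF-32LE BOM, so this sketch would classify it as UTF-32LE.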

---------------------------------------------------------------------------------------------------------------
