DBCS Character Sets

Source: Internet | Published by: Zhihu founder Zhou Yuan | Editor: 程序博客网 | Time: 2024/06/05 12:44

A "character set" is a mapping of characters to their identifying code values. The character set most commonly used in computers today is Unicode, a global standard for character encoding.

Most applications written today handle character data primarily as Unicode, using the UTF-16 encoding. However, many legacy applications continue to use character sets based on code pages. Even new applications sometimes have to work with code pages, often for one of the following reasons:

To communicate with legacy applications.
To communicate with older mail and news servers, which might not always support Unicode.
To communicate with the Windows Console, which does not support Unicode.

Code page is another term for character encoding. It consists of a table of values that describes the character set for a particular language. Each code page is represented by a code page identifier, for example, 1252, and is handled by the Unicode and character set API functions. For a list of supported code page identifiers, see Code Page Identifiers. For the most consistent results, applications should use a Unicode encoding, such as UTF-8 or UTF-16, instead of a specific code page.

Windows code pages, commonly called "ANSI code pages", are code pages for which non-ASCII values (values greater than 127) represent international characters. These code pages are used natively in Windows Me, and are also available on Windows NT and later.
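To make the idea of a code page as a lookup table concrete, here is a minimal Python sketch. It assumes Python's built-in codecs `cp1252`, `cp1251`, and `cp1253` are acceptable stand-ins for the Windows code pages with the same identifiers. The helper name `decode_with` is hypothetical. The sketch decodes one non-ASCII byte value and one ASCII byte value under each code page.

```python
# Sketch: the same byte value maps to a different character on each code page.
# Assumption: Python's "cpNNNN" codecs approximate the Windows code pages.

CODE_PAGES = ("cp1252", "cp1251", "cp1253")  # Western, Cyrillic, Greek


def decode_with(code_page, data):
    """Decode raw bytes using the named code page (hypothetical helper)."""
    return data.decode(code_page)


high_byte = bytes([0xE9])   # a non-ASCII value (greater than 127)
ascii_byte = bytes([0x41])  # an ASCII value, shared by all three pages

high_results = {cp: decode_with(cp, high_byte) for cp in CODE_PAGES}
ascii_results = {cp: decode_with(cp, ascii_byte) for cp in CODE_PAGES}

for cp in CODE_PAGES:
    print(cp, repr(ascii_results[cp]), repr(high_results[cp]))
```

The ASCII value decodes to the same letter everywhere, while the value above 127 decodes to a Latin, Cyrillic, or Greek letter depending on the code page chosen. This is why a byte string is ambiguous unless its code page identifier travels with it.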

A single-byte character set (SBCS) is a mapping of 256 individual characters to their identifying code values, implemented as a code page.
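As an illustration of the 256-entry mapping, the following sketch builds the full table for one single-byte code page. It assumes Python's `cp437` codec (the original IBM PC code page) as the example, chosen only because its codec defines a character for every byte value. The names `table` and `reverse` are illustrative.

```python
# Sketch: an SBCS code page is a table from 256 byte values to characters.
# Assumption: Python's "cp437" codec, which defines all 256 entries.

CODE_PAGE = "cp437"

# Forward table: byte value -> character.
table = {value: bytes([value]).decode(CODE_PAGE) for value in range(256)}

# Reverse table: character -> byte value, used when encoding.
reverse = {char: value for value, char in table.items()}

print(len(table))        # number of entries in the code page
print(repr(table[0x41]))  # an ASCII entry
print(repr(table[0x82]))  # a non-ASCII entry

# Every character in the table encodes back to the byte it came from.
round_trips = all(
    char.encode(CODE_PAGE) == bytes([value]) for value, char in table.items()
)
print(round_trips)
```

Because one byte can only take 256 values, a table of this kind cannot hold the thousands of characters needed for languages such as Japanese or Chinese.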

A double-byte character set (DBCS), also known as an "expanded 8-bit character set", is an extended single-byte character set (SBCS), implemented as a code page. DBCSs were originally developed to extend the SBCS design to handle languages such as Japanese and Chinese. Each DBCS code page supports a different subset of characters, differently encoded, and no page supports the full breadth of characters provided by Unicode. Data converted from one DBCS code page to another is subject to corruption, because the same data value can encode a different character on each code page. Data converted from Unicode to DBCS is subject to data loss, because a given code page might not be able to represent every character used in that particular Unicode data.
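The three properties described above (mixed one-byte and two-byte characters, corruption across code pages, and loss when converting from Unicode) can be sketched in Python. This assumes Python's `cp932` and `cp936` codecs as stand-ins for the Japanese and Simplified Chinese Windows code pages. The helper `byte_widths` is hypothetical, and `errors="replace"` is one of several possible policies for characters or bytes that cannot be converted.

```python
# Sketch of DBCS behaviour, assuming Python's cp932 (Japanese) and
# cp936 (Simplified Chinese) codecs approximate the Windows code pages.


def byte_widths(text, code_page):
    """Return the number of bytes each character occupies (hypothetical helper)."""
    return [len(ch.encode(code_page)) for ch in text]


# 1. A DBCS mixes one-byte and two-byte characters in the same string.
widths = byte_widths("A\u3042", "cp932")  # Latin "A", then hiragana "a"

# 2. The same byte values encode a different character on another DBCS page.
japanese = "\u3042"
raw = japanese.encode("cp932")
misread = raw.decode("cp936", errors="replace")  # wrong code page applied

# 3. Unicode -> DBCS loses characters the code page cannot represent.
mixed = "\u65e5\u672c\u8a9e \ud55c\uad6d\uc5b4"  # Japanese text, then Korean text
lossy = mixed.encode("cp932", errors="replace").decode("cp932")

print(widths)
print(raw)
print(misread == japanese)
print(lossy)
```

In the last step the Japanese characters survive the round trip, while each Korean character, which code page 932 cannot represent, comes back as a `?` placeholder. Keeping data in Unicode until the final output step avoids both the misreading and the loss.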

http://msdn.microsoft.com/en-us/library/windows/desktop/dd317794(v=vs.85).aspx