Character sets and codepages

Archived on Fri Jan 21 12:19:02 2005

Abstract

The goal of this document is to:

Define terminology relating to character sets.
Explain how characters are mapped to glyphs.
Describe the Windows 95 WGL4 character set.
List standard codepages for Windows 95.
Explain the codepage/Unicode range encoding within a text font.

Additional references

TrueType 1.0 Font File Specification, v.1.65, Microsoft
See this specification for more information about character sets, the WGL4.0 list of characters, Macintosh compatibility, and language encoding within a font.

The Unicode Standard, Version 1.0, Addison-Wesley, 1991
See this standard for more information about Unicode script ranges and the characters they cover.

Developing International Software for Windows 95 and Windows NT: A Handbook for International Software Design, Microsoft
See this handbook for more information about various writing systems, input methods, Far East character mappings, NT-specific issues, and programming with Unicode.

Characters, glyphs and fonts

We often speak inaccurately of character sets: we may refer to a "Greek character set" or a "Latin character set". But in order to understand how different writing systems are supported by Windows, we need to be more precise about characters.

Users don't view or print characters: a user views or prints glyphs. A glyph is a representation of a character. The character "Capital Letter A" is represented by one glyph "A" in Times New Roman Bold and by a differently drawn glyph "A" in Arial Bold. A font is a collection of glyphs. Windows is able to retrieve the appropriate glyphs by using mapping information about the keyboard, the language system in use, and the glyphs associated with each character.

Fonts are designed with character sets in mind: a font for use in Russia will include glyphs representing Cyrillic characters. There is no magic word or blessing uttered to create a "character set." A character set is only a collection of characters. However, characters from different language systems are conventionally divided into different "character sets", primarily because, in the past, a limited number of characters could be "addressed" at any one time.

Preparing for TrueType Open: glyph substitution

Glyphs can also represent combinations of characters and alternate forms of characters: there is not a strict one-to-one correspondence between glyphs and characters. For example, two characters may be typed in a document, but represented by only a single glyph (a ligature glyph). Conversely, different versions of a character may appear at the beginning, middle, or ending of a word. Thus, a single character can be represented by several different glyphs in a font.

TrueType Open will provide a substitution table to handle one-to-one, one-to-many, and many-to-one mappings.
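
As a rough illustration of a many-to-one mapping, the sketch below replaces pairs of characters with a single ligature glyph. The table contents and the glyph names (`f_i`, `f_l`) are hypothetical, and the dictionary is only a toy model of the idea, not the TrueType Open table format.

```python
# Hypothetical many-to-one table: a pair of characters maps to one glyph name.
LIGATURES = {("f", "i"): "f_i", ("f", "l"): "f_l"}

def substitute(chars):
    """Turn a string of characters into a list of glyph names."""
    glyphs = []
    i = 0
    while i < len(chars):
        pair = tuple(chars[i:i + 2])
        if pair in LIGATURES:
            # Two characters are represented by a single ligature glyph.
            glyphs.append(LIGATURES[pair])
            i += 2
        else:
            # Default one-to-one mapping: the glyph is named after the character.
            glyphs.append(chars[i])
            i += 1
    return glyphs

print(substitute("office"))  # ['o', 'f', 'f_i', 'c', 'e']
```

Six characters go in and five glyphs come out, which is why an application cannot assume that the number of glyphs equals the number of characters.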

Character codes

Characters are represented by character codes. Character codes are generated and stored when a user inputs a document. Single-Byte character sets (SBCS) provide 256 character codes (2 to the 8th power). This is an adequate number to encode most of the characters needed for Western Europe. For example, the Windows Extended ANSI character set contains 256 characters consisting of Latin letters, Arabic numerals, punctuation, and drawing characters.

However, 256 character codes are not enough to represent all the characters needed by multi-lingual users in a single font, or by users in the Far East, where over 12,000 characters may need to be addressed at any one time. Consequently, Multi-Byte character sets (commonly known as Double-Byte character sets) are necessary. Double-Byte character sets (DBCS) are a mixture of Single-Byte and Double-Byte character encodings and provide over 65,000 character codes (2 to the 16th power).
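
The difference between the two kinds of character set can be observed with a short sketch. It uses Python's built-in `cp1252` and `cp932` codecs as stand-ins for a Windows Single-Byte and Double-Byte codepage; the choice of those two codepages and the helper name `encoded_lengths` are assumptions made for illustration.

```python
# Number of codes addressable with one byte and with two bytes.
SINGLE_BYTE_CODES = 2 ** 8    # 256
DOUBLE_BYTE_CODES = 2 ** 16   # 65,536

def encoded_lengths(text, codec):
    """Return the number of bytes each character occupies in the given codec."""
    return [len(ch.encode(codec)) for ch in text]

# cp1252 is single-byte: every character takes exactly one byte.
print(encoded_lengths("Aé", "cp1252"))   # [1, 1]

# cp932 (Japanese) mixes single-byte and double-byte codes in one stream.
print(encoded_lengths("Aあ", "cp932"))   # [1, 2]
```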

Unicode

Unicode is a 16-bit encoding that encompasses many characters used in general text interchange throughout the world. Each Unicode index refers unambiguously to a given character. Unicode allows a larger range of characters to be addressed than is possible using a Single-Byte character encoding. All Unicode values are Double-Byte, which simplifies the way a Unicode-based system reads a string of text. In comparison, a Double-Byte system must determine which values in a string are Single-Byte character codes and which are Double-Byte character codes.
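
The parsing difference described above can be sketched as follows. A Double-Byte stream has to be scanned byte by byte to find character boundaries, while a 16-bit stream can be cut at fixed intervals. The lead-byte ranges shown for codepage 932 are an assumption brought in for the example and are not taken from this document; the function names are hypothetical.

```python
def is_cp932_lead_byte(b):
    """True if b starts a two-byte character (assumed ranges for codepage 932)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC

def split_dbcs(data):
    """Split a cp932 byte string into per-character byte sequences."""
    chars, i = [], 0
    while i < len(data):
        # Each byte must be inspected to learn how wide the character is.
        width = 2 if is_cp932_lead_byte(data[i]) else 1
        chars.append(data[i:i + width])
        i += width
    return chars

def split_fixed16(data):
    """Split a 16-bit encoded byte string: every character is two bytes."""
    return [data[i:i + 2] for i in range(0, len(data), 2)]

text = "Aあ"
print([len(p) for p in split_dbcs(text.encode("cp932"))])         # [1, 2]
print([len(p) for p in split_fixed16(text.encode("utf-16-le"))])  # [2, 2]
```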

NT internally uses Unicode for character encoding. Under NT, applications can still support existing Single-Byte codepages (discussed below) using the NLS APIs. DBCS-to-Unicode mappings are handled via the MultiByteToWideChar and WideCharToMultiByte APIs.

Windows 95 does not use Unicode internally for character encoding. However, Windows 95 is able to handle Multi-Byte character sets, and is able to map to Unicode using International APIs (such as MultiByteToWideChar mentioned above).

Codepages

The meaning of the term "codepage" has evolved over time. Only one definition concerns us now: In Windows 95 and NT, a codepage is a list of selected character codes in a certain order. Codepages are usually defined to support specific languages or groups of languages which share common writing systems. For example, codepage 1253 provides character codes required in the Greek writing system.

The order of the character codes in a codepage allows the system to provide the appropriate character code to an application when a user presses a key on the keyboard. When a new codepage is loaded, different character codes are provided to the application.
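
To see why the loaded codepage matters, the sketch below interprets one and the same character code under several codepages. It relies on Python's built-in `cp125x` codecs as an approximation of the Windows codepages; on Windows itself the equivalent conversion would go through an API such as MultiByteToWideChar. The helper name `interpret` is hypothetical.

```python
CODEPAGES = ["cp1250", "cp1251", "cp1252", "cp1253", "cp1254"]

def interpret(byte_value, codepages=CODEPAGES):
    """Map one character code to the character each codepage assigns to it."""
    raw = bytes([byte_value])
    return {cp: raw.decode(cp) for cp in codepages}

# The same stored code, 0xE1, stands for different characters
# depending on which codepage is used to read it.
for cp, ch in interpret(0xE1).items():
    print(cp, repr(ch), f"U+{ord(ch):04X}")
```

Under codepage 1252 the code 0xE1 is a Latin "á", under 1253 it is the Greek "α", and under 1251 it is the Cyrillic "б".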

In Windows 95, codepages can be changed on the fly by the user, without changing the default language system in use. An application can determine which codepages a specific font supports and can then present language options to the user.

Preparing for TrueType Open: saving writing system information within a text stream

When a user changes codepages, character codes from the new codepage are stored in the text stream. However, most codepages support multiple writing systems, each of which may have special rules about substituting or placing glyphs. TrueType Open will allow the flexibility for multiple writing systems to be supported by a single character set. Glyph substitution and placement rules can be associated with a writing system and stored in the font. Applications requiring these advanced features will need to save in the document an indication of the writing system in use, as well as the character codes entered.

The WGL4 character set

Traditionally, a font has been designed to contain all the glyphs required by a single codepage. However, Microsoft has now defined a character set standard which includes characters required by Western, Central, and Eastern European writing systems, as well as characters required by Greek and Turkish. This "PanEuropean" character set contains 652 characters and is called WGL4: Windows Glyph List 4. WGL4 takes advantage of the ability of Windows 95 to address characters according to their Unicode Double-Byte character codes using API extensions.

Note: WGL4 fonts are not required under Windows 95. Windows 95 will continue to support fonts which worked under Windows 3.1.

The WGL4 character set covers several codepages: 1250, 1251, 1252, 1253, and 1254. A user can load a single WGL4 font, and change codepages as needed. Previously, a user desiring to switch from English to Cyrillic to Greek while typing would have to choose three different fonts: first typing in Times New Roman, then in Times New Roman Cyrillic, and then in Times New Roman Greek.
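
A small sketch can show how characters are distributed across the five codepages that WGL4 covers. It uses Python's built-in `cp125x` codecs as an approximation of the Windows codepages, and the helper name `covering_codepages` is hypothetical.

```python
WGL4_CODEPAGES = ["cp1250", "cp1251", "cp1252", "cp1253", "cp1254"]

def covering_codepages(ch):
    """Return the WGL4-related codepages that contain the character ch."""
    found = []
    for cp in WGL4_CODEPAGES:
        try:
            ch.encode(cp)
        except UnicodeEncodeError:
            continue  # this codepage has no code for the character
        found.append(cp)
    return found

print(covering_codepages("A"))  # all five codepages
print(covering_codepages("α"))  # ['cp1253']
print(covering_codepages("Ж"))  # ['cp1251']
```

A font that covers only one of these codepages can display only some of these characters, which is why three separate fonts were needed before WGL4.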

Microsoft is supporting font developers as they create new WGL4 fonts. Windows 95 will also enable font developers to create fonts for large character sets other than WGL4, and users will be able to access all the glyphs as long as the associated characters exist in codepages supported by Windows 95.

The WGL4 character set is listed in Chapter 4 of the TrueType 1.0 Font File Specification (available on MSDN). The character set is compared to Win 3.1 ANSI, UGL, and Macintosh character sets.

Identifying writing system information within a font

As mentioned earlier, Windows cannot determine an intended writing system or language based solely on the glyphs contained in a font. Before giving the user or application writing system options, Windows must know which writing systems a font covers.

Fortunately, fonts contain a great deal of information about their glyphs: in well-designed fonts you'll find hinting instructions, metrics, language information, attachment points for diacritical marks, underline and strikethrough information, and more. Fonts are composed of many data structures, commonly referred to as tables, each containing specific information.

Language information about a font is stored in the "OS/2" table of the font. This table contains a variety of information about typeface weight, superscripts, strikeouts, ascender/descender values, PANOSE classification, licensing info, and more. For more information about the structure of TrueType Font Files, see the TrueType 1.0 Font File Specification (available on MSDN).

Writing systems covered by the glyphs in a font can be specified according to the Unicode script ranges covered by the font, or the codepages covered by the font. A font manufacturer sets script ranges and/or codepages by setting the appropriate bits of the ulCodePageRange fields or the ulUnicodeRange fields in the OS/2 table of the font. Multiple ranges can be specified for a single font. This encoding cannot be changed by the user.
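
The bit-field idea can be sketched as follows. The bit assignments in the table are an assumption drawn from the published OS/2 table documentation, not from this document, and should be checked against the specification; the function name `codepages_from_range` is hypothetical, and the sketch works on an already extracted 32-bit value rather than reading a font file.

```python
# Assumed bit positions in the first ulCodePageRange field and the codepage
# each one signals (verify against the OS/2 table specification).
CODEPAGE_BITS = {
    0: 1252,  # Latin 1
    1: 1250,  # Latin 2 (Central and Eastern Europe)
    2: 1251,  # Cyrillic
    3: 1253,  # Greek
    4: 1254,  # Turkish
}

def codepages_from_range(ul_code_page_range1):
    """List the codepages whose bits are set in the given 32-bit value."""
    return [cp for bit, cp in sorted(CODEPAGE_BITS.items())
            if ul_code_page_range1 & (1 << bit)]

# A font declaring all five codepages covered by WGL4.
print(codepages_from_range(0b11111))  # [1252, 1250, 1251, 1253, 1254]

# A font declaring only Latin 1 and Greek.
print(codepages_from_range(0b01001))  # [1252, 1253]
```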

http://www.microsoft.com/typography/unicode/cscp.htm