The shortest guide to character sets you'll ever read


When the "Oracle written in cyrillic" homographic attackprank spread on the Internet, it seemed for a moment that Google had removed Oracle from its index. It was a simple trick, accomplished with Unicode Cyrillic characters that were the exact copy of Latin ones. This kind of "attack", if we can call it so, would not have existed in the pre UFT-8 era, but as web developers we must gain some background with character set in order to not fall in these simple traps. At the same time, we should produce valid pages, with unambiguos encodings, so that every client in the world can render them correctly.

Character sets and encodings

This article contains some info on character sets, the logical maps of characters used by von Neumann computers. A character set maps ideas, like the concept of a C or of a space, to codepoints: unique numbers.

Along with the definition of character set, we must consider the notion of character encoding, which is a distinct one. A character encoding is the physical representation of characters, usually as a sequence of bits.
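To make the distinction concrete, here is a minimal PHP sketch (assuming the source file itself is saved as UTF-8, and PHP 7.2+ for mb_ord()): the character é has a single codepoint, U+00E9, but different physical representations in ISO-8859-1 and UTF-8.

```php
<?php
// One character, one codepoint (U+00E9), two physical encodings:
// a single byte E9 in ISO-8859-1, the two bytes C3 A9 in UTF-8.
$char = "é"; // this source file is assumed to be saved as UTF-8

printf("codepoint: U+%04X\n", mb_ord($char, 'UTF-8'));                 // U+00E9
echo bin2hex($char), "\n";                                             // c3a9
echo bin2hex(mb_convert_encoding($char, 'ISO-8859-1', 'UTF-8')), "\n"; // e9
```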

The character sets and encodings are not presented in a strict historical order, but rather in order of complexity. For some of them, the set and the encoding coincide (ASCII), while for others they are quite different (Unicode and UTF-8).

ASCII

At first, there was only ASCII. It was originally introduced as a standard in 1963, and it uses only 7 bits to represent a character. ASCII maps 128 characters, and today its codepoints are usually stored in 8-bit entities like bytes for simplicity, leaving the high bit unused.

ASCII had all the characters necessary to write phrases in the English language. It includes not only printable characters but also control characters, originally used to drive devices like printers and now mostly obsolete, except for special ones like 0A (or, if you prefer, 10 in decimal: the line feed). It was sent to printers to tell them to advance the paper; nowadays it separates the lines of a plain text document.
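For instance, PHP's built-in ord() and chr() map between a single byte and its ASCII codepoint; a quick sketch:

```php
<?php
// ASCII fits in 7 bits; ord() and chr() map between a single
// byte and its codepoint.
var_dump(ord('A'));           // int(65), i.e. 0x41
var_dump(chr(0x0A) === "\n"); // bool(true): 0A is the line feed
```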

ISO-8859-X

The next question in character set standardization was: how do we deal with characters used in other languages and alphabets?

For example my native tongue, Italian, needs accented letters like à and è. German words carry diacritical marks in some cases, such as the umlaut in ö.

Multiple character sets were born, and we began assigning different meanings to bytes depending on the character set in use. ASCII was no longer the only choice, but this solution raised another problem: recognizing the character set of a document.

In this generation of standards, ISO-8859-1 (aka Latin 1) covers the needs of nearly all Western European languages. ISO-8859-15 later replaced some of its rarely used characters with new symbols like the Euro currency sign (€).

Unicode

Every solution brings new problems. The next question was: how do we deal with documents written in languages like Mandarin Chinese or Japanese, which have thousands of characters?

And so multibyte encodings were introduced, using more than one byte to represent some characters. Apart from the size issue, which Shannon tells us we can't avoid, multibyte strings cause problems with the string functions of nearly every programming language even today. Basically, you have to make sure that string functions, which traditionally work on single bytes, do not split the bytes belonging to the same character during manipulation.

PHP has a set of specialized string functions, the multibyte extension, which aims to solve the problem until such support is incorporated into the language itself.
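A minimal sketch of the difference, assuming the mbstring extension is enabled:

```php
<?php
// Byte-oriented strlen() versus the multibyte-aware mb_strlen()
// on a UTF-8 string: "é" occupies two bytes (C3 A9).
$word = "perché"; // 6 characters, 7 bytes in UTF-8

var_dump(strlen($word));              // int(7): counts bytes
var_dump(mb_strlen($word, 'UTF-8'));  // int(6): counts characters

// Slicing by bytes can cut a character in half...
var_dump(substr($word, 0, 6));        // "perch" plus a dangling C3 byte
// ...while the multibyte version respects character boundaries.
var_dump(mb_substr($word, 0, 6, 'UTF-8')); // "perché"
```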

By the way, the most widespread multibyte character set is Unicode, which contains every character you can imagine, from Arabic glyphs to Hebrew, Hangul, and Sanskrit. Or hieroglyphs.

UTF-8

Again, new problems started showing up. For example, all our old documents were saved in the ISO-8859-1 or ASCII encodings. How could we convert them all to an encoding using 4-byte sequences?

Moreover, four bytes for every character is very heavy on bandwidth: it quadruples the size of a plain ASCII document, wasting 75% of the transmission. Since you and I probably don't need Chinese characters, why should we take a hit for someone else's burden?

And so UTF-8 was developed: an encoding of Unicode where the most common characters are represented with one-byte sequences (copied from, and compatible with, ASCII), and the others with longer encodings of up to 4 bytes. Japanese characters, or letters with diacritical marks and accents, take more bytes to encode, but they are by far rarer than Latin characters. Even if this article were written in Japanese, all the HTML tags would still use Latin characters.

UTF-8 is very clever in its usage of the encoding space. Every byte reveals its role by the range in which it falls: for example, if it is between 00 and 7F it is a plain ASCII character, while if it falls between F0 and F4 it is the start of a 4-byte sequence, and so on.
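A hedged sketch of that classification in PHP, covering only the range check rather than a full, validating decoder:

```php
<?php
// Classify a byte's role in UTF-8 purely by the range it falls in.
// This is a sketch of the rule above, not a complete decoder.
function utf8ByteRole(int $byte): string
{
    if ($byte <= 0x7F)                  return 'single ASCII byte';
    if ($byte <= 0xBF)                  return 'continuation byte';
    if ($byte >= 0xC2 && $byte <= 0xDF) return 'start of a 2-byte sequence';
    if ($byte >= 0xE0 && $byte <= 0xEF) return 'start of a 3-byte sequence';
    if ($byte >= 0xF0 && $byte <= 0xF4) return 'start of a 4-byte sequence';
    return 'invalid in UTF-8'; // C0, C1 and F5-FF never appear
}

foreach (unpack('C*', "A€") as $byte) {
    printf("%02X: %s\n", $byte, utf8ByteRole($byte));
}
// 41: single ASCII byte
// E2: start of a 3-byte sequence
// 82: continuation byte
// AC: continuation byte
```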

Using UTF-8 allows us to treat different languages in the same page or interaction, always with the same character encoding. Every character is in its shortest possible representation.
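As for the legacy documents mentioned earlier, a one-off re-encoding is enough. A minimal sketch, where legacy.txt is a hypothetical ISO-8859-1 file used only for illustration:

```php
<?php
// Re-encode a legacy ISO-8859-1 file as UTF-8 once and for all.
// 'legacy.txt' is a hypothetical file name used for illustration.
$latin1 = file_get_contents('legacy.txt');
$utf8   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
// iconv('ISO-8859-1', 'UTF-8', $latin1) would work just as well.
file_put_contents('legacy-utf8.txt', $utf8);
```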

Specifying the character set

Now to the most important part: since there are many encodings available, where should I declare my web pages' character encoding so that clients can understand them? The answer is simple: every time you get the chance, so that the browser never has to assume a default value for it.

Nowadays, UTF-8 is usually the right choice. But we want browsers to render documents by assuming they are in UTF-8 and not in some obscure encoding, so we must let them know.

  • The first place to specify an encoding is in your editor, when you write source code or translation files. Correct representation starts there.
  • The next place is the HTML meta tag: <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> is fundamental.
  • Another useful metadata field is the equivalent HTTP header, Content-Type, which takes precedence over the meta tag (see the sketch after this list).
  • Database tables must match the charset of the rest of the application. This is often forgotten as we focus on the front end.
  • HTML forms also have an accept-charset attribute, but they can usually inherit their charset from the document.
  • Finally, UTF-8 strings should be URL-encoded when transmitting Unicode values via GET or POST. Browsers do this for us when necessary, and try to keep showing us the real characters even while sending a request for a URL like http://www.google.com/search?q=%D0%BEr%D0%B0c%D0%86%D0%B5.
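Putting the main declarations together in one place, here is a minimal sketch (the page and its search.php target are hypothetical):

```php
<?php
// Declare UTF-8 in the HTTP header first: header() must be called
// before any output is sent to the browser.
header('Content-Type: text/html; charset=utf-8');
?>
<!DOCTYPE html>
<html>
<head>
    <!-- The meta tag repeats the declaration for saved copies of the page -->
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Encoding declared consistently</title>
</head>
<body>
    <!-- accept-charset makes the form submit its values as UTF-8 -->
    <form action="search.php" method="get" accept-charset="utf-8">
        <input type="text" name="q">
    </form>
    <!-- urlencode() percent-encodes the UTF-8 bytes of a value -->
    <a href="search.php?q=<?php echo urlencode('perché'); ?>">perché</a>
</body>
</html>
```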

This was the shortest guide to character sets for a web developer. The topic is very deep and fascinating, so I encourage you to read more about it, but you shouldn't necessarily get carried away when the essential information can take you a long way. One day, I hope, the nightmare of multiple conflicting character sets will come to an end.
