How JSP and Servlets Handle Chinese Text

Source: Internet  Editor: 程序博客网  Date: 2024/05/18 00:50

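Every region of the world has its own language, and regional differences lead directly to differences in locale. When developing an internationalized program, handling language issues correctly therefore becomes very important.

Because this problem exists worldwide, Java provides a general solution for it. The approach described in this article is aimed at Chinese, but it applies equally to the languages of other countries and regions.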

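Chinese characters are double-byte. "Double-byte" means that one Chinese character occupies two bytes (16 bits), called the high byte and the low byte. The character encoding mandated in China is GB2312; it is compulsory, and almost every application that can handle Chinese supports it. GB2312 contains the level-1 and level-2 Chinese characters plus 9 rows of symbols. The high byte ranges from 0xa1 to 0xfe, and the low byte also ranges from 0xa1 to 0xfe. Within that space, the Chinese characters occupy the range 0xb0a1 to 0xf7fe.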
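The byte ranges above can be checked with a short sketch. It assumes the running JDK ships a charset named "GB2312" (most full JDK distributions do); the class and method names are hypothetical and chosen only for illustration.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original article.
class Gb2312Bytes {
    // Encodes a one-character string in GB2312 and returns the bytes as unsigned values.
    static int[] gb2312Bytes(String ch) {
        byte[] raw = ch.getBytes(Charset.forName("GB2312"));
        int[] unsigned = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            unsigned[i] = raw[i] & 0xff; // Java bytes are signed; mask to 0..255
        }
        return unsigned;
    }

    // True if there are exactly two bytes and both lie in 0xa1..0xfe.
    static boolean inGb2312Range(int[] b) {
        return b.length == 2
                && b[0] >= 0xa1 && b[0] <= 0xfe
                && b[1] >= 0xa1 && b[1] <= 0xfe;
    }

    public static void main(String[] args) {
        int[] b = gb2312Bytes("\u554a"); // the character 啊, first entry of the hanzi area
        System.out.printf("high=0x%02x low=0x%02x inRange=%b%n",
                b[0], b[1], inGb2312Range(b));
    }
}
```

The string literal uses a Unicode escape so that the example does not depend on the encoding of the source file itself.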

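There is another encoding called GBK, but it is a specification rather than a mandatory standard. GBK provides 20,902 Chinese characters and is compatible with GB2312; its code range is 0x8140 to 0xfefe. Every character in GBK can be mapped one-to-one to Unicode 2.0.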
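A minimal sketch of what "compatible with GB2312" means in practice, assuming the JDK provides both the "GB2312" and "GBK" charsets. The class and method names are hypothetical. The traditional character 龍 (U+9F8D) is used as an example of a character that GBK covers and GB2312 does not.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper class, not part of the original article.
class GbkCompat {
    static final Charset GB2312 = Charset.forName("GB2312");
    static final Charset GBK = Charset.forName("GBK");

    // True if the text encodes to identical bytes in GB2312 and GBK.
    static boolean sameBytes(String text) {
        return Arrays.equals(text.getBytes(GB2312), text.getBytes(GBK));
    }

    // True if the charset has a mapping for every character of the text.
    static boolean canEncode(Charset cs, String text) {
        return cs.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String li = "\u674e";     // 李, present in both encodings
        String dragon = "\u9f8d"; // 龍, traditional form
        System.out.println("same bytes for U+674E: " + sameBytes(li));
        System.out.println("GB2312 can encode U+9F8D: " + canEncode(GB2312, dragon));
        System.out.println("GBK can encode U+9F8D: " + canEncode(GBK, dragon));
    }
}
```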

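In the near future, China will issue another standard: GB18030-2000 (GBK2K). It adds the scripts of minority languages such as Tibetan and Mongolian, and it fundamentally solves the shortage of code positions. Note that it is no longer fixed-length. Its two-byte part is compatible with GBK, and its four-byte part holds the extended characters and glyphs. In the four-byte form, the first and third bytes range from 0x81 to 0xfe, and the second and fourth bytes range from 0x30 to 0x39.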
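The variable-length behaviour can be observed with the sketch below. It assumes a JDK that supports the "GB18030" charset, and uses the supplementary code point U+20000 as an example of a character that needs the four-byte form. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original article.
class Gb18030Lengths {
    static final Charset GB18030 = Charset.forName("GB18030");

    // Encodes one Unicode code point in GB18030 and returns unsigned byte values.
    static int[] encode(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        byte[] raw = s.getBytes(GB18030);
        int[] unsigned = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            unsigned[i] = raw[i] & 0xff; // mask the signed byte to 0..255
        }
        return unsigned;
    }

    public static void main(String[] args) {
        int[] codePoints = {0x41, 0x554a, 0x20000}; // 'A', 啊, a supplementary ideograph
        for (int cp : codePoints) {
            StringBuilder hex = new StringBuilder();
            for (int b : encode(cp)) {
                hex.append(String.format("%02x ", b));
            }
            System.out.printf("U+%04X -> %s%n", cp, hex.toString().trim());
        }
    }
}
```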

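This article does not cover Unicode itself; interested readers can find more information at http://www.unicode.org/. Unicode has one key property: it aims to include the characters of all the world's writing systems. Every regional language can therefore define a mapping to Unicode, and Java relies on exactly this to convert between different languages.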

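The encodings in the JDK that are relevant to Chinese are:

Table 1. Chinese-related encodings in the JDK

Encoding name    Description
ASCII            7-bit; same as ascii7
ISO8859-1        8-bit; same as 8859_1, ISO-8859-1, ISO_8859-1, latin1, ...
GB2312-80        16-bit; same as gb2312, gb2312-1980, EUC_CN, euccn, 1381, Cp1381, 1383, Cp1383, ISO2022CN, ISO2022CN_GB, ...
GBK              Same as MS936; note that the name is case-sensitive
UTF8             Same as UTF-8
GB18030          Same as cp1392, 1392; currently supported by very few JDKs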
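How names and aliases resolve can be inspected at run time with java.nio.charset.Charset. The sketch below is only an illustration with a hypothetical class name. It checks a handful of the aliases listed above; which aliases a given JDK accepts may differ from the table, so unsupported names are reported rather than assumed to work.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original article.
class CharsetAliases {
    // Resolves a name or alias to the canonical charset name,
    // or returns null if this JDK does not recognize the name.
    static String canonical(String nameOrAlias) {
        try {
            return Charset.forName(nameOrAlias).name();
        } catch (IllegalArgumentException e) {
            // Covers both UnsupportedCharsetException and IllegalCharsetNameException.
            return null;
        }
    }

    public static void main(String[] args) {
        String[] names = {"8859_1", "latin1", "gb2312", "EUC_CN", "GBK", "UTF8", "GB18030"};
        for (String n : names) {
            System.out.println(n + " -> " + canonical(n));
        }
    }
}
```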

In everyday programming, the encodings encountered most often are GB2312 (GBK) and ISO8859-1.

Why does "?" appear

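As mentioned above, conversion between different languages goes through Unicode. Given two different languages A and B, the conversion takes two steps: first convert A to Unicode, then convert Unicode to B.

Here is an example. GB2312 contains the Chinese character "李", whose code is "C0EE", and we want to convert it to ISO8859-1. The steps are: first convert "李" to Unicode, which gives "674E", then convert "674E" to an ISO8859-1 character. This mapping cannot succeed, of course, because ISO8859-1 has no character that corresponds to "674E".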
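The two steps for "李" can be reproduced as follows. This is a sketch with hypothetical class and method names, assuming the JDK provides the "GB2312" charset. It relies on the documented behaviour of String.getBytes(Charset), which replaces unmappable characters with the charset's default replacement bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original article.
class LiConversion {
    // Step 1: language A (GB2312 bytes) to Unicode (a Java String).
    static String decodeGb2312(byte[] bytes) {
        return new String(bytes, Charset.forName("GB2312"));
    }

    // Step 2: Unicode to language B (ISO-8859-1 bytes).
    static byte[] toLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] gb = {(byte) 0xc0, (byte) 0xee};
        String li = decodeGb2312(gb);
        byte[] latin1 = toLatin1(li);
        // prints "U+674E -> 0x3f"
        System.out.printf("U+%04X -> 0x%02x%n", (int) li.charAt(0), latin1[0] & 0xff);
    }
}
```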

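When a mapping fails, the problem appears. When converting from some language to Unicode, if the byte sequence does not correspond to any character in that language, the result is the Unicode code "\ufffd" ("\u" marks a Unicode escape). When converting from Unicode to some language, if that language has no corresponding character, the result is "0x3f" ("?"). This is where the "?" comes from.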
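Silent replacement can be avoided by asking the encoder or decoder to report failures instead. The sketch below is one possible way to do that with the standard java.nio.charset API; the class and method names are hypothetical, and it assumes the JDK provides the "GB2312" charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original article.
class StrictConversion {
    static final Charset GB2312 = Charset.forName("GB2312");
    static final Charset LATIN1 = StandardCharsets.ISO_8859_1;

    // True if every character of the text has a mapping in the target charset,
    // i.e. encoding would not produce any 0x3f substitutes.
    static boolean canEncode(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    // Decodes strictly: returns null instead of inserting U+FFFD for bad input.
    static String decodeStrict(byte[] bytes, Charset source) {
        try {
            return source.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1 can encode U+674E: " + canEncode("\u674e", LATIN1));
        byte[] bad = {(byte) 0x80, 0x40, (byte) 0xb0, (byte) 0xa1};
        System.out.println("strict GB2312 decode succeeded: " + (decodeStrict(bad, GB2312) != null));
    }
}
```

Checking before converting makes the failure visible at the point where it happens, rather than later when a "?" shows up in a page.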

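For example, take the byte sequence buf = "0x80 0x40 0xb0 0xa1" and call new String(buf, "gb2312"). The result is "\ufffd\u554a" (the exact number of replacement characters can vary with the JDK version), and printing it with println gives "?啊". The reason is that "0x80 0x40" is not a valid GB2312 sequence, since GB2312 high bytes start at 0xa1.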
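The example can be run as the sketch below, which prints the decoded code points one by one instead of relying on what the console can display. The class and method names are hypothetical, and the sketch assumes the JDK provides the "GB2312" charset. How many characters the invalid prefix turns into is left to the JDK, so the code does not assume a fixed length.

```java
import java.nio.charset.Charset;

// Hypothetical helper class, not part of the original article.
class ReplacementDemo {
    // Decodes the bytes as GB2312; invalid input is replaced with U+FFFD.
    static String decodeAsGb2312(byte[] buf) {
        return new String(buf, Charset.forName("GB2312"));
    }

    public static void main(String[] args) {
        byte[] buf = {(byte) 0x80, 0x40, (byte) 0xb0, (byte) 0xa1};
        String s = decodeAsGb2312(buf);
        // Print each decoded character as a code point.
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("U+%04X%n", (int) s.charAt(i));
        }
    }
}
```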
