UTF-16 Encoding Revisited


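The previous article covered where UTF-16 comes from. Every Unicode encoding, however, comes in two variants: with a signature and without one. The signature, also known as the byte order mark (BOM), is an identifying sequence that the text editor automatically writes into the first bytes of a Unicode file.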

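Now for the UTF-16 variants. UTF-16 has two byte orders, UTF-16LE and UTF-16BE, and each of them can be written with or without a signature. To read and parse such a file correctly, the charset name has to match the variant. In Java the names currently correspond as follows: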

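Unicode: UTF-16LE with signature

UTF-16LE: UTF-16LE without signature

UTF-16BE: UTF-16BE without signature

UTF-16: UTF-16BE with signature

Note that when reading, Java's UTF-16 charset inspects the signature itself and accepts either byte order, falling back to big-endian only when no signature is present. UTF-16LE and UTF-16BE never expect a signature.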
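To make the mapping concrete, the following sketch prints the bytes that several Java charsets produce for the single character "A", and shows how a signed little-endian byte sequence decodes. The class name Utf16SignatureDemo and its helper methods are made up for this illustration. The x-UTF-16LE-BOM charset is assumed to be available (OpenJDK-based runtimes ship it, but the Java specification does not require it), so the code checks for it first.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Utf16SignatureDemo {
    /** Formats bytes as space-separated, upper-case hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Encodes text with the given charset and returns the bytes as hex. */
    static String encode(String text, Charset cs) {
        return hex(text.getBytes(cs));
    }

    public static void main(String[] args) {
        System.out.println("UTF-16   -> " + encode("A", StandardCharsets.UTF_16));   // FE FF 00 41
        System.out.println("UTF-16LE -> " + encode("A", StandardCharsets.UTF_16LE)); // 41 00
        System.out.println("UTF-16BE -> " + encode("A", StandardCharsets.UTF_16BE)); // 00 41

        // Assumption: this charset is an implementation extra, not a guaranteed one.
        if (Charset.isSupported("x-UTF-16LE-BOM")) {
            System.out.println("x-UTF-16LE-BOM -> "
                    + encode("A", Charset.forName("x-UTF-16LE-BOM"))); // FF FE 41 00
        }

        // Decoding a little-endian sequence that starts with a signature.
        byte[] signedLe = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        String viaUtf16 = new String(signedLe, StandardCharsets.UTF_16);  // signature consumed
        String viaLe = new String(signedLe, StandardCharsets.UTF_16LE);   // signature kept as U+FEFF
        System.out.println(viaUtf16.length() + " " + viaLe.length());     // 1 2
    }
}
```

The last line is the practical consequence: decoding a signed file with UTF-16LE or UTF-16BE leaves U+FEFF at the start of the text, so the caller has to strip it.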

How can these four variants be told apart? See the code below:

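// Needs: import java.io.BufferedInputStream; import java.io.FileInputStream;
/**
  * Detects the encoding of a file.
  * @author luoyifan
  * @date 2010-03-22
  */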
 private String getCharset(String filename){  
        String charset = "GBK"; // default to ANSI (GBK)
        byte[] first3Bytes = new byte[3];  
        try {  
            boolean checked = false;  
            BufferedInputStream bis = new BufferedInputStream(new FileInputStream(filename));  
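            bis.mark(3); // the read limit must cover the 3 bytes read before reset()
            int read = bis.read(first3Bytes,0,3);  
            if (read == -1 ) { bis.close(); return charset; } // empty file: close the stream and keep the default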
            if (first3Bytes[0] == (byte) 0xFF && first3Bytes[1] == (byte)0xFE) {  
                charset = "UTF-16LE-sign"; // UTF-16LE with signature
                checked = true;  
            }  
            else if (first3Bytes[0] == (byte)0xFE && first3Bytes[1] == (byte)0xFF) {  
                charset = "UTF-16BE-sign"; // UTF-16BE with signature
                checked = true;  
            }
            else if(first3Bytes[0] == (byte)0x22 && first3Bytes[1] == (byte)0x00){
                // UTF-16LE without signature; matches only when the first character is '"' (0x22)
                charset = "UTF-16LE-unsign";
                checked = true;
            }
            else if(first3Bytes[0] == (byte)0x00 && first3Bytes[1] == (byte)0x22){
                // UTF-16BE without signature; matches only when the first character is '"' (0x22)
                charset = "UTF-16BE-unsign";
                checked = true;
            }
            else if (first3Bytes[0] == (byte)0xEF && first3Bytes[1] == (byte)0xBB && first3Bytes[2] == (byte)0xBF ) {  
                charset = "UTF-8";  
                checked = true;  
            }  
            bis.reset();  
            if (!checked){
                int loc = 0; 
                while ((read = bis.read()) != -1) {  
                    loc++;  
                    if (read >= 0xF0) break;  
                    if (0x80 <= read && read <= 0xBF) { // a lone byte in 0x80-0xBF is also treated as GBK
                        break;
                    }
                    if (0xC0 <= read && read <= 0xDF) {  
                        read = bis.read();  
                        // two-byte form (0xC0-0xDF)(0x80-0xBF); this can also occur in GB encodings
                        if (0x80 <= read && read <= 0xBF) {
                            continue;
                        }
                        else break;  
                    }  
                    else if (0xE0 <= read && read <= 0xEF) { // can still misfire, but the chance is small
                        read = bis.read();  
                        if (0x80 <= read && read <= 0xBF) {  
                            read = bis.read();  
                            if (0x80 <= read && read <= 0xBF) {  
                                charset = "UTF-8";  
                                break;  
                            }  
                            else break;  
                        }  
                        else break;  
                    }  
                }  
                //System.out.println( loc + " " + Integer.toHexString( read ) );  
            }  
 
            bis.close();  
        } catch ( Exception e ) {  
            e.printStackTrace();  
        }  
 
        return charset;  
    }
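The labels returned by getCharset, such as "UTF-16LE-sign", are not Java charset names, so they need one more translation step before the file can be read. The sketch below is one possible mapping, not part of the original code: the class CharsetLabels and the methods toJavaCharset and decode are hypothetical. It relies on the documented behavior that Java's UTF-16 decoder honors the signature in either byte order, which is why both signed labels map to UTF-16.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetLabels {
    /** Maps a label returned by getCharset to a charset usable for decoding. */
    static Charset toJavaCharset(String label) {
        switch (label) {
            case "UTF-16LE-sign":
            case "UTF-16BE-sign":
                // The UTF-16 decoder reads the signature and picks the byte order.
                return StandardCharsets.UTF_16;
            case "UTF-16LE-unsign":
                return StandardCharsets.UTF_16LE;
            case "UTF-16BE-unsign":
                return StandardCharsets.UTF_16BE;
            case "UTF-8":
                return StandardCharsets.UTF_8;
            default:
                // For example "GBK"; throws if the runtime lacks that charset.
                return Charset.forName(label);
        }
    }

    /** Decodes raw file bytes according to a getCharset label. */
    static String decode(byte[] data, String label) {
        return new String(data, toJavaCharset(label));
    }

    public static void main(String[] args) {
        byte[] signedLe = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00};
        byte[] plainBe = {0x00, 0x41};
        System.out.println(decode(signedLe, "UTF-16LE-sign"));  // A
        System.out.println(decode(plainBe, "UTF-16BE-unsign")); // A
    }
}
```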

 

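The code above is based on the leading bytes I collected, shown here as signed Java byte values: UTF-16LE with a signature starts with [-1, -2] (0xFF 0xFE), UTF-16BE with a signature starts with [-2, -1] (0xFE 0xFF), UTF-16LE without a signature starts with [34, 0], and UTF-16BE without a signature starts with [0, 34]. The pair 34 and 0 is simply the character " (0x22) in each byte order.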

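The UTF-16 files I have created on my system so far all start with the bytes listed above, but there are exceptions. When a UTF-8 file is converted to UTF-16 without a signature, the leading bytes can become, for example, [70, 85] and [85, 70]. These values are not fixed: without a signature the first two bytes are just the encoding of the file's first character, so they vary from file to file.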
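The byte values quoted above can be reproduced directly. The sketch below is an illustration only (the class LeadingBytesDemo and the method leadingBytes are made up here): it encodes a single character and prints the first two bytes as signed values. It assumes that the [70, 85] file began with the character U+5546, which is what those bytes decode to in little-endian order.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LeadingBytesDemo {
    /** Returns the first two encoded bytes of text as signed Java byte values. */
    static String leadingBytes(String text, Charset cs) {
        return Arrays.toString(Arrays.copyOf(text.getBytes(cs), 2));
    }

    public static void main(String[] args) {
        // The signature is the character U+FEFF written in the file's byte order.
        System.out.println(leadingBytes("\uFEFF", StandardCharsets.UTF_16LE)); // [-1, -2]
        System.out.println(leadingBytes("\uFEFF", StandardCharsets.UTF_16BE)); // [-2, -1]

        // Without a signature, the first bytes are the first character itself.
        System.out.println(leadingBytes("\"", StandardCharsets.UTF_16LE));     // [34, 0]
        System.out.println(leadingBytes("\"", StandardCharsets.UTF_16BE));     // [0, 34]
        System.out.println(leadingBytes("\u5546", StandardCharsets.UTF_16LE)); // [70, 85]
        System.out.println(leadingBytes("\u5546", StandardCharsets.UTF_16BE)); // [85, 70]
    }
}
```

This is why checking for [34, 0] or [0, 34] only recognizes unsigned files that happen to begin with a double quote; any other first character produces a different pair.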

 
