String Encoding: A Source Code Walkthrough

Source: Internet | Editor: 程序博客网 | Time: 2024/05/18 00:33

Using String encodings

1: The usual way to convert a byte[] to a String is new String(byte[]). A charset should be specified when calling new String; if none is specified, the default charset is used.

    // Get the name of the default charset, then decode with it
    String csn = Charset.defaultCharset().name();
    try {
        return decode(csn, ba, off, len);
    } catch (UnsupportedEncodingException x) {
        warnUnsupportedCharset(csn);
    }
    // If csn names an unsupported charset, fall back to decoding with ISO-8859-1
    try {
        return decode("ISO-8859-1", ba, off, len);
    } catch (UnsupportedEncodingException x) {
        System.exit(1);
        return null;
    }

Below is the implementation of Charset.defaultCharset(), which obtains the default charset:

    if (defaultCharset == null) {
        synchronized (Charset.class) {
            // Read the system property; equivalent to System.getProperty("file.encoding")
            String csn = AccessController.doPrivileged(
                new GetPropertyAction("file.encoding"));
            Charset cs = lookup(csn);
            // If that charset is not supported, the default becomes UTF-8
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
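To see which default a given JVM actually resolves, a minimal sketch like the following can help. The class name DefaultCharsetDemo and its helper methods are made up for illustration; the printed values depend on the platform and JVM settings, so no particular output is assumed.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DefaultCharsetDemo {
    // Decodes with the platform default charset, as new String(byte[]) does.
    public static String decodeWithDefault(byte[] bytes) {
        return new String(bytes);
    }

    // Same decoding, with the default charset passed explicitly.
    public static String decodeWithExplicitDefault(byte[] bytes) {
        return new String(bytes, Charset.defaultCharset());
    }

    public static void main(String[] args) {
        byte[] bytes = "hello".getBytes(StandardCharsets.US_ASCII);
        // Both values vary from machine to machine.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset().name());
        // The two helpers go through the same default, so they agree.
        System.out.println(decodeWithDefault(bytes).equals(decodeWithExplicitDefault(bytes)));
    }
}
```

Running this on two differently configured machines is a quick way to confirm that new String(byte[]) does not behave the same everywhere.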

2: When a charset is specified for new String(), Java offers two constructors: public String(byte bytes[], String charsetName) and public String(byte bytes[], Charset charset). Both are examined below.

public String(byte bytes[], String charsetName)

    // Get the decoder cached by the previous call
    StringDecoder sd = deref(decoder);
    // If no charset name is given, use ISO-8859-1; otherwise use the given name
    String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;
    // Check whether the cached decoder matches the requested charset;
    // if it does not, look the charset up again
    if ((sd == null) || !(csn.equals(sd.requestedCharsetName())
                          || csn.equals(sd.charsetName()))) {
        sd = null;
        try {
            // Look the charset up again
            Charset cs = lookupCharset(csn);
            if (cs != null)
                sd = new StringDecoder(cs, csn);
        } catch (IllegalCharsetNameException x) {}
        if (sd == null)
            throw new UnsupportedEncodingException(csn);
        // Store this decoder in the cache
        set(decoder, sd);
    }
    // Decode
    return sd.decode(ba, off, len);
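From the caller's side, the charsetName variant means handling the checked UnsupportedEncodingException that the code above throws. A minimal sketch follows; the class CharsetNameDemo, the helper decodeOrNull, and the charset name "x-no-such-charset" are all invented for illustration, the last one chosen on the assumption that no JDK supports it.

```java
import java.io.UnsupportedEncodingException;

public class CharsetNameDemo {
    // Decodes bytes using a charset given by name.
    // Returns null when the name does not resolve to a supported charset.
    public static String decodeOrNull(byte[] bytes, String charsetName) {
        try {
            return new String(bytes, charsetName);
        } catch (UnsupportedEncodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[] abc = {97, 98, 99}; // "abc" in ASCII, which UTF-8 shares
        System.out.println(decodeOrNull(abc, "UTF-8"));
        // Hypothetical name, assumed not to be supported anywhere.
        System.out.println(decodeOrNull(abc, "x-no-such-charset") == null);
    }
}
```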

public String(byte bytes[], Charset charset)

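The core of this constructor is very similar to sd.decode(). The only difference is where reset() is called: sd.decode() calls cd.reset() inside the else branch, whereas the Charset variant calls reset() before every decode, so each new String created with a Charset incurs a reset.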

    CharsetDecoder cd = cs.newDecoder();
    int en = scale(len, cd.maxCharsPerByte());
    char[] ca = new char[en];
    if (len == 0)
        return ca;
    boolean isTrusted = false;
    if (System.getSecurityManager() != null) {
        if (!(isTrusted = (cs.getClass().getClassLoader0() == null))) {
            ba = Arrays.copyOfRange(ba, off, off + len);
            off = 0;
        }
    }
    // The Charset variant calls reset() here
    cd.onMalformedInput(CodingErrorAction.REPLACE)
      .onUnmappableCharacter(CodingErrorAction.REPLACE)
      .reset();
    if (cd instanceof ArrayDecoder) {
        int clen = ((ArrayDecoder)cd).decode(ba, off, len, ca);
        return safeTrim(ca, clen, cs, isTrusted);
    } else {
        // The charsetName variant (sd.decode()) calls reset() here
        ByteBuffer bb = ByteBuffer.wrap(ba, off, len);
        CharBuffer cb = CharBuffer.wrap(ca);
        try {
            CoderResult cr = cd.decode(bb, cb, true);
            if (!cr.isUnderflow())
                cr.throwException();
            cr = cd.flush(cb);
            if (!cr.isUnderflow())
                cr.throwException();
        } catch (CharacterCodingException x) {
            // Substitution is always enabled,
            // so this shouldn't happen
            throw new Error(x);
        }
        return safeTrim(ca, cb.position(), cs, isTrusted);
    }
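One visible consequence of the CodingErrorAction.REPLACE settings above is that new String never fails on malformed bytes; it substitutes the replacement character U+FFFD instead. The sketch below contrasts that with a decoder configured to REPORT errors. The class ReplaceVsReportDemo and its two helpers are illustrative names, not part of the JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ReplaceVsReportDemo {
    // Lenient path: malformed input is replaced with U+FFFD.
    public static String lenientDecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict path: a decoder set to REPORT throws on malformed input.
    public static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] bad = {0x41, (byte) 0xFF, 0x42};
        System.out.println(lenientDecode(bad).equals("A\uFFFDB")); // true
        System.out.println(isValidUtf8(bad));                      // false
    }
}
```

When silently replaced characters would be a problem, validating with a strict decoder first is one option.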

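Comparing the source shows two differences between these two approaches:

First: reset() is called at a different point.

Second: the charsetName variant first fetches the decoder cached by the previous call. If the requested charset is the same as last time, it decodes right away without looking the charset up again. In this version of the source, charsetName can therefore outperform Charset, but only when the cache hit rate is high. If the application standardizes on a single charset, the hit rate rises sharply and performance improves accordingly.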

Obtaining a Charset

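JDK 1.7 introduced StandardCharsets, which provides constants for the commonly used charsets. On versions earlier than 1.7, use Charset.forName("UTF-8") instead. Charset keeps a two-level cache holding the two most recently looked-up charsets: the most recent one sits in the first-level cache and the one before it in the second-level cache.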

    private static Charset lookup(String charsetName) {
        if (charsetName == null)
            throw new IllegalArgumentException("Null charset name");
        Object[] a;
        // Try the first-level cache; on a hit return directly,
        // otherwise fall through to the second-level cache
        if ((a = cache1) != null && charsetName.equals(a[0]))
            return (Charset)a[1];
        // Try the second-level cache
        return lookup2(charsetName);
    }

    private static Charset lookup2(String charsetName) {
        Object[] a;
        // Look in the second-level cache; on a hit, promote the entry to the first level
        if ((a = cache2) != null && charsetName.equals(a[0])) {
            cache2 = cache1;
            cache1 = a;
            return (Charset)a[1];
        }
        Charset cs;
        // Not cached: ask standardProvider first, then the extended charsets and other providers
        if ((cs = standardProvider.charsetForName(charsetName)) != null ||
            (cs = lookupExtendedCharset(charsetName))           != null ||
            (cs = lookupViaProviders(charsetName))              != null)
        {
            // Cache the Charset
            cache(charsetName, cs);
            return cs;
        }
        // ... (the rest of lookup2 is omitted in the original excerpt)
    }
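The promotion and eviction behavior of this two-level cache can be sketched in isolation. The class below is a simplified stand-in, not the JDK's code: TwoSlotCharsetCache and its misses counter are made up for illustration, Charset.forName plays the role of the provider lookup, and the insertion step assumes that a new entry pushes the previous first-level entry down to the second level.

```java
import java.nio.charset.Charset;

public class TwoSlotCharsetCache {
    // Each slot holds {name, charset}; cache1 is the most recently used entry.
    private Object[] cache1;
    private Object[] cache2;
    private int misses; // lookups that fell through both slots

    public Charset lookup(String name) {
        if (name == null)
            throw new IllegalArgumentException("Null charset name");
        Object[] a;
        // First-level hit: return directly.
        if ((a = cache1) != null && name.equals(a[0]))
            return (Charset) a[1];
        // Second-level hit: swap the two slots so this entry becomes first.
        if ((a = cache2) != null && name.equals(a[0])) {
            cache2 = cache1;
            cache1 = a;
            return (Charset) a[1];
        }
        // Miss: do the real lookup and push the old first-level entry down.
        misses++;
        Charset cs = Charset.forName(name);
        cache2 = cache1;
        cache1 = new Object[] { name, cs };
        return cs;
    }

    public int misses() {
        return misses;
    }

    public static void main(String[] args) {
        TwoSlotCharsetCache cache = new TwoSlotCharsetCache();
        cache.lookup("UTF-8");
        cache.lookup("ISO-8859-1");
        cache.lookup("UTF-8");       // second-level hit, promoted
        cache.lookup("ISO-8859-1");  // second-level hit, promoted
        System.out.println(cache.misses()); // 2
        cache.lookup("US-ASCII");    // miss, evicts UTF-8
        cache.lookup("UTF-8");       // miss again
        System.out.println(cache.misses()); // 4
    }
}
```

Alternating between two charsets never misses after the first two lookups, while cycling through three misses every time, which is why standardizing on one or two charsets keeps the hit rate high.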

Why specify a charset?

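Because the default charset can differ from one machine to another, the same code may produce garbled text when it runs on different machines. A simple example: the same code runs on two machines with different default charsets. One of them writes data to the database encoded as GBK, and the other reads it back as UTF-8, so the text comes out garbled. The best way to avoid this risk is to specify one particular charset explicitly.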
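The GBK versus UTF-8 scenario above can be reproduced in a few lines. This is a sketch under two assumptions: that the running JDK ships the GBK charset (full JDK distributions normally do, but a trimmed runtime might not), and that the helper roundTrip stands in for the write-to-database and read-back steps. The class name MojibakeDemo is made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Encodes text with one charset and decodes the bytes with another,
    // standing in for "machine A writes, machine B reads".
    public static String roundTrip(String text, Charset writeCharset, Charset readCharset) {
        byte[] stored = text.getBytes(writeCharset);
        return new String(stored, readCharset);
    }

    public static void main(String[] args) {
        // Assumes GBK is available in this runtime.
        Charset gbk = Charset.forName("GBK");
        // Two Chinese characters, written as escapes so the source file's
        // own encoding cannot affect the result.
        String original = "\u7f16\u7801";

        // Same charset on both sides: the text survives.
        System.out.println(roundTrip(original, gbk, gbk).equals(original));                     // true
        // Written as GBK, read as UTF-8: the text is garbled.
        System.out.println(roundTrip(original, gbk, StandardCharsets.UTF_8).equals(original)); // false
    }
}
```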
