Notes on character encoding in Java

Source: Internet. Editor: 程序博客网. Time: 2024/05/22 02:11

I will not re-explain the character sets and encodings behind Java strings here; see [Java与字符编码问题详谈](http://hxraid.iteye.com/blog/559607) for that. Today's topic is the difference between two ways of turning a Java string into a byte array: String.getBytes() and Charset.encode(String).

The problem:

    String str = "我";
    byte[] ba1 = str.getBytes("UTF-16");
    printBytes(ba1); // bytes: 0xfe 0xff 0x62 0x11; the first two are the BOM
    // ------------- divider -------------
    Charset cs = Charset.forName("UTF-16");
    byte[] ba2 = cs.encode(str).array();
    printBytes(ba2); // bytes: 0xfe 0xff 0x62 0x11 0x00; one extra 0x00 at the end

Here is what puzzled me: why is there one extra byte? Let's look at the source.

1. The String.getBytes() method:

    public byte[] getBytes(String charsetName)
            throws UnsupportedEncodingException {
        if (charsetName == null) throw new NullPointerException();
        return StringCoding.encode(charsetName, value, 0, value.length);
    }

It simply delegates to the encode() method of the StringCoding class.

    static byte[] encode(String charsetName, char[] ca, int off, int len)
        throws UnsupportedEncodingException
    {
        StringEncoder se = deref(encoder);
        String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;
        if ((se == null) || !(csn.equals(se.requestedCharsetName())
                              || csn.equals(se.charsetName()))) {
            se = null;
            try {
                Charset cs = lookupCharset(csn);
                if (cs != null)
                    se = new StringEncoder(cs, csn);
            } catch (IllegalCharsetNameException x) {}
            if (se == null)
                throw new UnsupportedEncodingException (csn);
            set(encoder, se);
        }
        return se.encode(ca, off, len);
    }

StringCoding first fetches the StringEncoder cached for the current thread (deref(encoder)). If there is none, or if its charset name does not match the requested one, it looks up the Charset by the requested name and builds a new StringEncoder (a private nested class of StringCoding). Now look at StringEncoder:

    private static class StringEncoder {
        private Charset cs;
        private CharsetEncoder ce;
        private final String requestedCharsetName;
        private final boolean isTrusted;
        ...........

        byte[] encode(char[] ca, int off, int len) {
            int en = scale(len, ce.maxBytesPerChar());
            byte[] ba = new byte[en];
            if (len == 0)
                return ba;
            if (ce instanceof ArrayEncoder) {
                int blen = ((ArrayEncoder)ce).encode(ca, off, len, ba);
                return safeTrim(ba, blen, cs, isTrusted);
            } else {
                ce.reset();
                ByteBuffer bb = ByteBuffer.wrap(ba);
                CharBuffer cb = CharBuffer.wrap(ca, off, len);
                try {
                    CoderResult cr = ce.encode(cb, bb, true);
                    if (!cr.isUnderflow())
                        cr.throwException();
                    cr = ce.flush(bb);
                    if (!cr.isUnderflow())
                        cr.throwException();
                } catch (CharacterCodingException x) {
                    // Substitution is always enabled,
                    // so this shouldn't happen
                    throw new Error(x);
                }
                return safeTrim(ba, bb.position(), cs, isTrusted);
            }
        }
    }

As you can see, StringEncoder also delegates the actual encoding to CharsetEncoder. The important detail is the final call in StringEncoder.encode(): safeTrim(ba, bb.position(), cs, isTrusted). It copies the encoded content into the resulting byte[], and because the length argument is bb.position(), only the bytes that were actually written are kept. For details on ByteBuffer, see [java.nio.ByteBuffer用法小结](http://blog.csdn.net/zhoujiaxq/article/details/22822289).
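To make the trimming step concrete, here is a minimal sketch of the same idea using only public API. The class name `TrimSketch` and the method `encodeAndTrim` are made up for illustration, and `Arrays.copyOf` stands in for the JDK-internal `safeTrim`; this is an approximation of the non-`ArrayEncoder` branch, not the JDK's actual code.

```java
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class TrimSketch {
    // Hypothetical helper: encode into a worst-case-sized array,
    // then keep only the bytes that were actually written.
    static byte[] encodeAndTrim(String s, Charset cs) {
        CharsetEncoder ce = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        // Worst-case size: number of chars times maxBytesPerChar.
        byte[] ba = new byte[(int) Math.ceil(s.length() * (double) ce.maxBytesPerChar())];
        if (s.isEmpty()) {
            return ba;
        }
        ByteBuffer bb = ByteBuffer.wrap(ba);
        CharBuffer cb = CharBuffer.wrap(s);
        try {
            CoderResult cr = ce.encode(cb, bb, true);
            if (!cr.isUnderflow()) cr.throwException();
            cr = ce.flush(bb);
            if (!cr.isUnderflow()) cr.throwException();
        } catch (CharacterCodingException x) {
            throw new UncheckedIOException(x);
        }
        // bb.position() is the count of bytes written; the rest of ba is unused.
        return Arrays.copyOf(ba, bb.position());
    }

    public static void main(String[] args) {
        String s = "\u6211\u4eec"; // U+6211 U+4EEC, two Chinese characters
        byte[] out = encodeAndTrim(s, StandardCharsets.UTF_16);
        System.out.println("bytes kept: " + out.length);
        System.out.println("same as getBytes: "
                + Arrays.equals(out, s.getBytes(StandardCharsets.UTF_16)));
    }
}
```

The point of the sketch is the last line of `encodeAndTrim`: the working array may be larger than needed, and the position of the wrapping buffer says how much of it is real output.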

2. Charset.encode(String)
Charset.encode() first obtains a CharsetEncoder and then encodes with it, as shown below:

    public final ByteBuffer encode(CharBuffer cb) {
        try {
            return ThreadLocalCoders.encoderFor(this)
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .encode(cb);
        } catch (CharacterCodingException x) {
            throw new Error(x);         // Can't happen
        }
    }
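The two `CodingErrorAction.REPLACE` settings are why the `catch` branch is marked "Can't happen". As a small sketch (the class and method names are illustrative only), the following contrasts `Charset.encode(String)` with a freshly created `CharsetEncoder`, whose default action is `REPORT`:

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class ReplaceSketch {
    // Charset.encode(String) replaces unmappable characters instead of failing.
    static ByteBuffer lenientEncode(String s) {
        return StandardCharsets.US_ASCII.encode(s);
    }

    // A new CharsetEncoder reports unmappable characters by throwing.
    static boolean strictEncodeFails(String s) {
        try {
            StandardCharsets.US_ASCII.newEncoder().encode(CharBuffer.wrap(s));
            return false;
        } catch (CharacterCodingException x) {
            return true;
        }
    }

    public static void main(String[] args) {
        String s = "\u6211"; // U+6211, not representable in US-ASCII
        ByteBuffer bb = lenientEncode(s);
        System.out.println("lenient first byte: " + (char) bb.get(0));
        System.out.println("strict encoder throws: " + strictEncodeFails(s));
    }
}
```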

The encoding itself happens in CharsetEncoder.encode(CharBuffer):

    public final ByteBuffer encode(CharBuffer in)
        throws CharacterCodingException
    {
        int n = (int)(in.remaining() * averageBytesPerChar());
        ByteBuffer out = ByteBuffer.allocate(n);

        if ((n == 0) && (in.remaining() == 0))
            return out;
        reset();
        for (;;) {
            CoderResult cr = in.hasRemaining() ?
                encode(in, out, true) : CoderResult.UNDERFLOW;
            if (cr.isUnderflow())
                cr = flush(out);

            if (cr.isUnderflow())
                break;
            if (cr.isOverflow()) {
                n = 2*n + 1;    // Ensure progress; n might be 0!
                ByteBuffer o = ByteBuffer.allocate(n);
                out.flip();
                o.put(out);
                out = o;
                continue;
            }
            cr.throwException();
        }
        out.flip();
        return out;
    }

Both approaches end up calling the same CharsetEncoder method:
CoderResult encode(CharBuffer in, ByteBuffer out, boolean endOfInput). They differ in two ways:
1> The ByteBuffer passed in is different for the same string, because its capacity is computed differently. In StringCoding's encode method it is:

    // len == string.length()
    byte[] encode(char[] ca, int off, int len) {
        int en = scale(len, ce.maxBytesPerChar());
        ...................

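That is, the capacity is derived from maxBytesPerChar of ce (the CharsetEncoder), so it is always large enough.
In Charset's encode path, by contrast, the initial capacity is derived from averageBytesPerChar and grown on demand: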

    CharBuffer in = CharBuffer.wrap(string);
    int n = (int)(in.remaining() * averageBytesPerChar());
    // Note: string.length() == CharBuffer.wrap(string).remaining()
    ByteBuffer out = ByteBuffer.allocate(n);
    ..............
    for (;;) {
        CoderResult cr = in.hasRemaining() ?
            encode(in, out, true) : CoderResult.UNDERFLOW;
        if (cr.isUnderflow())
            cr = flush(out);
        if (cr.isUnderflow())
            break;
        if (cr.isOverflow()) {
            n = 2*n + 1;    // Ensure progress; n might be 0!
            // When the initial ByteBuffer is too small, grow it:
            // newCapacity = 2*oldCapacity + 1
            ByteBuffer o = ByteBuffer.allocate(n);
            out.flip();
            o.put(out);
            out = o;
            continue;
        }
        cr.throwException();
    }
    .....................
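The growth rule explains the extra byte with simple arithmetic. Assuming the UTF-16 encoder reports an averageBytesPerChar of 2.0 (an assumption about the JDK in use; the sketch below prints the real value), a one-character string starts with n = 1 * 2 = 2. The BOM plus one character needs 4 bytes, so the first attempt overflows and n becomes 2*2 + 1 = 5. Four bytes are written into a five-byte buffer, which leaves one unused zero byte. The following sketch prints the relevant numbers so this can be checked on a given JDK; `CapacitySketch` and `geometry` are illustrative names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class CapacitySketch {
    // Returns {position, limit, capacity} of the buffer produced by Charset.encode.
    static int[] geometry(String s, Charset cs) {
        ByteBuffer bb = cs.encode(s);
        return new int[] { bb.position(), bb.limit(), bb.capacity() };
    }

    public static void main(String[] args) {
        String str = "\u6211"; // U+6211, the character used in the article
        Charset cs = StandardCharsets.UTF_16;
        CharsetEncoder ce = cs.newEncoder();
        System.out.println("averageBytesPerChar = " + ce.averageBytesPerChar());
        System.out.println("maxBytesPerChar     = " + ce.maxBytesPerChar());

        int[] g = geometry(str, cs);
        System.out.println("position=" + g[0] + " limit=" + g[1] + " capacity=" + g[2]);
        System.out.println("getBytes length     = " + str.getBytes(cs).length);
    }
}
```

If capacity is larger than limit in the output, the difference is exactly the number of stray trailing bytes that `array()` would expose.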

2> StringCoding post-processes the resulting ByteBuffer.

    byte[] encode(char[] ca, int off, int len) {
        int en = scale(len, ce.maxBytesPerChar());
        if (len == 0)
            ...................
        if (ce instanceof ArrayEncoder) {
            ...................
        } else {
            ce.reset();
            ByteBuffer bb = ByteBuffer.wrap(ba);
            CharBuffer cb = CharBuffer.wrap(ca, off, len);
            try {
                CoderResult cr = ce.encode(cb, bb, true);
                ...................
            } catch (CharacterCodingException x) {
                ...................
            }
            return safeTrim(ba, bb.position(), cs, isTrusted);

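It calls safeTrim(), which copies only the written part of the buffer into a byte[].
Charset's encode method, on the other hand, returns the ByteBuffer itself. ByteBuffer has an array() method, which the original snippet used like this: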

    Charset cs = Charset.forName("UTF-16");
    ByteBuffer bb = cs.encode(str);
    byte[] ba = bb.array();
    printBytes(ba); // bytes: 0xfe 0xff 0x62 0x11 0x00; one extra 0x00 at the end

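ByteBuffer.array() returns the buffer's entire backing array, including the slots beyond its limit. The buffer was not completely filled during encoding, and the unused slots still hold 0 because ByteBuffer.allocate(n) zero-initializes its storage. So calling .array() directly on the ByteBuffer returned by Charset.encode() can produce a byte array with extra trailing bytes (one extra byte in the example above).
The correct way to handle the result of Charset.encode() is: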

    String str = "我们";
    Charset cs = Charset.forName("UTF-16"); // or "UTF-8", "GBK", "ASCII", etc.
    ByteBuffer bb = cs.encode(str);
    // Copy only the readable bytes (position..limit), not the whole backing array
    byte[] ba = new byte[bb.remaining()];
    bb.get(ba);
    printBytes(ba); // result for UTF-16: 0xFE 0xFF 0x62 0x11 0x4E 0xEC
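The same idea can be wrapped in a small helper and checked against String.getBytes(). This is only a sketch: `BufferBytes` and `toBytes` are hypothetical names, and working on a `duplicate()` of the buffer is a design choice made here so that the caller's position is left untouched.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BufferBytes {
    // Hypothetical helper: copy exactly the readable bytes (position..limit).
    static byte[] toBytes(ByteBuffer bb) {
        ByteBuffer view = bb.duplicate(); // shares content, has its own position/limit
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        String str = "\u6211\u4eec"; // U+6211 U+4EEC, the string used in the article
        Charset[] charsets = { StandardCharsets.UTF_16, StandardCharsets.UTF_8 };
        for (Charset cs : charsets) {
            ByteBuffer bb = cs.encode(str);
            byte[] trimmed = toBytes(bb);
            boolean same = Arrays.equals(trimmed, str.getBytes(cs));
            System.out.println(cs.name() + ": backing array length = " + bb.array().length
                    + ", readable bytes = " + trimmed.length
                    + ", equals getBytes = " + same);
        }
    }
}
```

Because the helper relies on `remaining()` rather than `array()`, it also works for buffers that have no accessible backing array.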

1 0