CP1252 and ISO8859-1

Source: Internet. Editor: 程序博客网. Time: 2024/04/29 10:18

 http://www.herongyang.com/Unicode/Java-charset-EncodingSampler-Test-encode-Method.html

http://www.herongyang.com/Unicode/Java-charset-Example-of-CP1252-ISO-8859-1-Encoding.html

JDK offers 4 methods to encode characters:

  • CharsetEncoder.encode()
  • Charset.encode()
  • String.getBytes()
  • OutputStreamWriter.write()

Here is a program that demonstrates how to encode characters with each of the above 4 methods:

/**
 * EncodingSampler.java
 * Copyright (c) 2002 by Dr. Herong Yang
 */
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
class EncodingSampler {
   static String dfltCharset = null;
   static char[] chars={0x0000, 0x003F, 0x0040, 0x007F, 0x0080, 0x00BF,
                        0x00C0, 0x00FF, 0x0100, 0x3FFF, 0x4000, 0x7FFF,
                        0x8000, 0xBFFF, 0xC000, 0xEFFF, 0xF000, 0xFFFF};
   static char hexDigit[] = {'0', '1', '2', '3', '4', '5', '6', '7',
                             '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};
   public static void main(String[] arg) {
      String charset = null;
      if (arg.length>0) charset = arg[0];
      OutputStreamWriter o = new OutputStreamWriter(
         new ByteArrayOutputStream());
      dfltCharset = o.getEncoding();
      if (charset==null) System.out.println("Default ("+dfltCharset
         +") encoding:");
      else System.out.println(charset+" encoding:");
      System.out.println("Char, String, Writer, Charset, Encoder");
      for (int i=0; i<chars.length; i++) {
         char c = chars[i];
         byte[] b1 = encodeByString(c,charset);
         byte[] b2 = encodeByWriter(c,charset);
         byte[] b3 = encodeByCharset(c,charset);
         byte[] b4 = encodeByEncoder(c,charset);
         System.out.print(charToHex(c)+",");
         printBytes(b1);
         System.out.print(",");
         printBytes(b2);
         System.out.print(",");
         printBytes(b3);
         System.out.print(",");
         printBytes(b4);
         System.out.println("");
      }
   }
   public static byte[] encodeByCharset(char c, String cs) {
      Charset cso = null;
      byte[] b = null;
      try {
         if (cs==null) cso = Charset.forName(dfltCharset);
         else cso = Charset.forName(cs);
         ByteBuffer bb = cso.encode(String.valueOf(c));
         b = copyBytes(bb.array(),bb.limit());
      } catch (IllegalCharsetNameException e) {
         System.out.println(e.toString());
      }
      return b;
   }
   public static byte[] encodeByEncoder(char c, String cs) {
      Charset cso = null;
      byte[] b = null;
      try {
         if (cs==null) cso = Charset.forName(dfltCharset);
         else cso = Charset.forName(cs);
         CharsetEncoder e =  cso.newEncoder();
         e.reset();
         ByteBuffer bb = e.encode(CharBuffer.wrap(new char[] {c}));
         b = copyBytes(bb.array(),bb.limit());
      } catch (IllegalCharsetNameException e) {
         System.out.println(e.toString());
      } catch (CharacterCodingException e) {
         //System.out.println(e.toString());
         b = new byte[] {(byte)0x00};
      }
      return b;
   }
   public static byte[] encodeByString(char c, String cs) {
      String s = String.valueOf(c);
      byte[] b = null;
      if (cs==null) {
         b = s.getBytes();
      } else {
         try {
            b = s.getBytes(cs);
         } catch (UnsupportedEncodingException e) {
            System.out.println(e.toString());
         }
      }
      return b;
   }
   public static byte[] encodeByWriter(char c, String cs) {
      byte[] b = null;
      ByteArrayOutputStream bs = new ByteArrayOutputStream();
      OutputStreamWriter o = null;
      if (cs==null) {
         o = new OutputStreamWriter(bs);
      } else {
         try {
            o = new OutputStreamWriter(bs, cs);
         } catch (UnsupportedEncodingException e) {
            System.out.println(e.toString());
         }
      }
      String s = String.valueOf(c);
      try {
         o.write(s);
         o.flush();
         b = bs.toByteArray();
         o.close();
      } catch (IOException e) {
         System.out.println(e.toString());
      }
      return b;
   }
   public static byte[] copyBytes(byte[] a, int l) {
      byte[] b = new byte[l];
      for (int i=0; i<Math.min(l,a.length); i++) b[i] = a[i];
      return b;
   }
   public static void printBytes(byte[] b) {
      for (int j=0; j<b.length; j++)
         System.out.print(" "+byteToHex(b[j]));
   }
   public static String byteToHex(byte b) {
      char[] a = { hexDigit[(b >> 4) & 0x0f], hexDigit[b & 0x0f] };
      return new String(a);
   }
   public static String charToHex(char c) {
      byte hi = (byte) (c >>> 8);
      byte lo = (byte) (c & 0xff);
      return byteToHex(hi) + byteToHex(lo);
   }
}

Note that:

  • If the same encoding is used, each of the encode methods in the program should return exactly the same byte sequence.
  • getEncoding() is called on the OutputStreamWriter class to get the name of the default encoding.
  • The String class offers no way to get the name of the default encoding.
  • When this program was written (2002), there was no default instance of Charset or CharsetEncoder, so the program looks the charset up by name. Since Java 5, Charset.defaultCharset() returns the default charset directly.
  • In encodeByEncoder(), 0x00 is used as the output when the given character cannot be encoded by the encoder.
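The last note describes a real difference in behavior between the methods: String.getBytes() substitutes a replacement byte for a character it cannot encode, while CharsetEncoder.encode() throws an exception. Here is a minimal sketch of that difference. The class name UnmappableDemo and its helper methods are made up for illustration, and it assumes the JVM supports the charset name "windows-1252" (the canonical name for Cp1252):

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper class, not part of EncodingSampler.java.
class UnmappableDemo {
   // Asks a fresh encoder whether it can encode the given character.
   static boolean canEncode(char c, String cs) {
      return Charset.forName(cs).newEncoder().canEncode(c);
   }

   // String.getBytes() replaces a non-encodable character with the
   // charset's replacement bytes instead of failing.
   static byte[] encodeLenient(char c, String cs) {
      return String.valueOf(c).getBytes(Charset.forName(cs));
   }

   // CharsetEncoder.encode(CharBuffer) throws CharacterCodingException
   // for a non-encodable character; this sketch returns null in that case.
   static byte[] encodeStrict(char c, String cs) {
      CharsetEncoder enc = Charset.forName(cs).newEncoder();
      try {
         ByteBuffer bb = enc.encode(CharBuffer.wrap(new char[] {c}));
         byte[] b = new byte[bb.remaining()];
         bb.get(b);
         return b;
      } catch (CharacterCodingException e) {
         return null;
      }
   }

   public static void main(String[] args) {
      String cs = "windows-1252";
      char c = '\u0100';   // not encodable in Cp1252, per the test output
      System.out.println("Default charset: " + Charset.defaultCharset());
      System.out.println("canEncode: " + canEncode(c, cs));
      System.out.println("getBytes() length: " + encodeLenient(c, cs).length);
      System.out.println("encoder failed: " + (encodeStrict(c, cs) == null));
   }
}
```

Calling canEncode() first is one way to avoid relying on the exception, at the cost of an extra pass over the character.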

Running the test program, EncodingSampler.java, without any argument uses the JVM's default encoding:

Default (Cp1252) encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, 3F, 3F, 3F, 00
00BF, BF, BF, BF, BF
00C0, C0, C0, C0, C0
00FF, FF, FF, FF, FF
0100, 3F, 3F, 3F, 00
3FFF, 3F, 3F, 3F, 00
4000, 3F, 3F, 3F, 00
7FFF, 3F, 3F, 3F, 00
8000, 3F, 3F, 3F, 00
BFFF, 3F, 3F, 3F, 00
C000, 3F, 3F, 3F, 00
EFFF, 3F, 3F, 3F, 00
F000, 3F, 3F, 3F, 00
FFFF, 3F, 3F, 3F, 00

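The results show that:

  • The default encoding of the String class seems to be the same as that of OutputStreamWriter: Cp1252.
  • There are a number of characters that cannot be encoded by Cp1252. The String, OutputStreamWriter, and Charset classes return 0x3F ('?') for those non-encodable characters, while the Encoder column shows 0x00 because encodeByEncoder() substitutes that value when the encoder throws an exception.
  • In this sample, every character that Cp1252 can encode falls in the 0x0000 - 0x00FF range, but not every character in that range is encodable: 0x0080 is not. The sample does not test characters such as the euro sign (0x20AC), which Cp1252 encodes as the byte 0x80.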
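The sample characters above do not include any of the characters that Cp1252 assigns to bytes 0x80 - 0x9F. Here is a minimal sketch to probe a few characters directly. The class name Cp1252Range and the method encodeOne() are made up for illustration, and it assumes the JVM supports the charset name "windows-1252":

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper class, not part of EncodingSampler.java.
class Cp1252Range {
   static final Charset CP1252 = Charset.forName("windows-1252");

   // Returns the encoded byte as a value in 0..255,
   // or -1 if Cp1252 cannot encode the character.
   static int encodeOne(char c) {
      CharsetEncoder enc = CP1252.newEncoder();
      if (!enc.canEncode(c)) return -1;
      return String.valueOf(c).getBytes(CP1252)[0] & 0xFF;
   }

   public static void main(String[] args) {
      // 0x0080: control character, 0x00FF: y with diaeresis, 0x20AC: euro sign
      char[] probes = {'\u0080', '\u00FF', '\u20AC'};
      for (char c : probes) {
         int b = encodeOne(c);
         String hex = (b < 0) ? "not encodable" : String.format("%02X", b);
         System.out.println(String.format("%04X", (int) c) + ", " + hex);
      }
   }
}
```

With the standard windows-1252 mapping, the euro sign 0x20AC should come out as the byte 0x80, which shows that Cp1252 reaches outside the 0x0000 - 0x00FF character range.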

Running the program again with 'CP1252' as argument should give us the same output as the previous run:

CP1252 encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, 3F, 3F, 3F, 00
00BF, BF, BF, BF, BF
00C0, C0, C0, C0, C0
00FF, FF, FF, FF, FF
0100, 3F, 3F, 3F, 00
3FFF, 3F, 3F, 3F, 00
4000, 3F, 3F, 3F, 00
7FFF, 3F, 3F, 3F, 00
8000, 3F, 3F, 3F, 00
BFFF, 3F, 3F, 3F, 00
C000, 3F, 3F, 3F, 00
EFFF, 3F, 3F, 3F, 00
F000, 3F, 3F, 3F, 00
FFFF, 3F, 3F, 3F, 00

Let's try another encoding, ISO-8859-1:

ISO-8859-1 encoding:
Char, String, Writer, Charset, Encoder
0000, 00, 00, 00, 00
003F, 3F, 3F, 3F, 3F
0040, 40, 40, 40, 40
007F, 7F, 7F, 7F, 7F
0080, 80, 80, 80, 80
00BF, BF, BF, BF, BF
00C0, C0, C0, C0, C0
00FF, FF, FF, FF, FF
0100, 3F, 3F, 3F, 00
3FFF, 3F, 3F, 3F, 00
4000, 3F, 3F, 3F, 00
7FFF, 3F, 3F, 3F, 00
8000, 3F, 3F, 3F, 00
BFFF, 3F, 3F, 3F, 00
C000, 3F, 3F, 3F, 00
EFFF, 3F, 3F, 3F, 00
F000, 3F, 3F, 3F, 00
FFFF, 3F, 3F, 3F, 00

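It is almost the same as CP1252. The one difference in this sample is 0x0080: ISO-8859-1 encodes it as the byte 0x80, while CP1252 cannot encode it and returns 0x3F (0x00 in the Encoder column).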
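The sample covers only a handful of characters, so a fuller comparison needs every character in the 0x0000 - 0x00FF range. Here is a minimal sketch that lists the characters for which two charsets produce different bytes through String.getBytes(). The class name CompareCharsets and the method differingChars() are made up for illustration, and it assumes the JVM supports both charset names:

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of EncodingSampler.java.
class CompareCharsets {
   // Returns the characters in 0x0000 - 0x00FF whose encoded bytes
   // differ between the two named charsets. A non-encodable character
   // is compared by its replacement byte, as String.getBytes() returns it.
   static List<Integer> differingChars(String csA, String csB) {
      Charset a = Charset.forName(csA);
      Charset b = Charset.forName(csB);
      List<Integer> diff = new ArrayList<>();
      for (int cp = 0x0000; cp <= 0x00FF; cp++) {
         String s = String.valueOf((char) cp);
         if (!Arrays.equals(s.getBytes(a), s.getBytes(b))) diff.add(cp);
      }
      return diff;
   }

   public static void main(String[] args) {
      for (int cp : differingChars("windows-1252", "ISO-8859-1")) {
         System.out.println(String.format("%04X", cp));
      }
   }
}
```

Any differences it reports should fall in the 0x0080 - 0x009F block, where ISO-8859-1 keeps control characters and CP1252 assigns printable characters instead.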

 
