How to determine the Chinese character encoding from the byte stream, and so fix garbled text that the same app shows in some provinces

Source: Internet · Published by: 水滴刷单软件 · Editor: 程序博客网 · Time: 2024/05/16 05:18

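Recently, China Telecom users in certain provinces reported that some of the data our Android client app fetched over a 3G SIM card was displayed as garbled text, while the same data displayed correctly over Wi-Fi. Initial investigation showed that the data was encoded differently before gzip compression: GBK in some provinces and UTF-8 in others. After decompression the bytes may therefore not match the GBK encoding the client expects, and the text comes out garbled. The network stream thus needs encoding detection, with the charset chosen according to the detection result.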
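To illustrate why a charset mismatch produces garbled text, here is a minimal sketch that encodes a string with one charset and decodes the bytes with another. The class name `MojibakeDemo` and the method `transcode` are illustrative names, not part of the app described here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encodes text with one charset and decodes the resulting bytes with another.
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // Two Chinese characters ("hanzi"), written as escapes to avoid
        // depending on the source file encoding.
        String original = "\u6C49\u5B57";

        // Same charset on both sides: the text survives the round trip.
        System.out.println(transcode(original, gbk, gbk).equals(original));

        // Server sends UTF-8 but the client decodes as GBK: the text is garbled.
        System.out.println(transcode(original, StandardCharsets.UTF_8, gbk).equals(original));
    }
}
```

The second case mirrors the situation above: the client always assumes GBK, so a UTF-8 response decodes into different characters.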

A simple approach is to read the character encoding from the ContentType of the HttpEntity:

    // Returns the charset parameter of the entity's Content-Type header,
    // or null if the header or the parameter is missing.
    public static String getContentCharSet(final HttpEntity entity)
            throws ParseException {
        if (entity == null) {
            throw new IllegalArgumentException("HTTP entity may not be null");
        }
        String charset = null;
        if (entity.getContentType() != null) {
            HeaderElement[] values = entity.getContentType().getElements();
            if (values.length > 0) {
                NameValuePair param = values[0].getParameterByName("charset");
                if (param != null) {
                    charset = param.getValue();
                }
            }
        }
        return charset;
    }

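Testing shows that this works, but it still does not feel entirely reliable: for example, if the HTTP header has no ContentType, how is the character encoding determined?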
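One way to make the missing-header case explicit is to have the lookup take a fallback value, so the caller can tell "no charset declared" apart from a declared one. The following is only a sketch in plain Java that works on the raw header string rather than on HttpEntity. `ContentTypeCharset` and `charsetOf` are hypothetical names, and the parsing is simplified (it does not handle whitespace around `=`).

```java
import java.util.Locale;

class ContentTypeCharset {
    // Returns the charset parameter of a Content-Type header value,
    // or the given fallback when the header or the parameter is absent.
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }
}
```

Passing `null` as the fallback lets the caller decide to run byte-level detection whenever the header says nothing about the charset.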

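There are plenty of blog posts online about detecting character encodings, but most of them deal with files, and their approaches fail on network streams. I recommend the open-source component cpdetector (494 KB in total), which can detect the encoding of both files and byte streams.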

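Below is the code of the EncodingDetector utility class. cpdetector is statistics-based: the more bytes it examines, the more accurate the result. For a file stream the byte count is known, so the number of bytes examined is the file length minus 1, capped at 2000.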

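    import java.io.BufferedInputStream;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.Charset;

    import org.apache.http.protocol.HTTP;

    import info.monitorenter.cpdetector.io.ASCIIDetector;
    import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
    import info.monitorenter.cpdetector.io.JChardetFacade;
    import info.monitorenter.cpdetector.io.ParsingDetector;
    import info.monitorenter.cpdetector.io.UnicodeDetector;

    public class EncodingDetector {
        private static final CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();

        static {
            detector.add(new ParsingDetector(false));
            detector.add(JChardetFacade.getInstance());
            detector.add(ASCIIDetector.getInstance());
            detector.add(UnicodeDetector.getInstance());
        }

        public static String getCharset(InputStream is, boolean useAvailable) {
            Charset charset = null;
            // The more bytes examined, the more accurate the result;
            // the count must not exceed the length of the text stream.
            int detectCharNum = 2000;
            try {
                if (useAvailable) {
                    int available = is.available();
                    // Some input streams cannot report a byte count (a network
                    // stream, for example, cannot know how much data has yet to arrive).
                    if (available <= 1) {
                        return HTTP.UTF_8;
                    }
                    if (detectCharNum > available) {
                        detectCharNum = available - 1;
                    }
                }
                BufferedInputStream bufferedInputStream = new BufferedInputStream(is);
                charset = detector.detectCodepage(bufferedInputStream, detectCharNum);
                // Rewind the buffered stream; a failure here is swallowed below.
                bufferedInputStream.reset();
            } catch (Exception e) {
                // Detection failed; fall through with whatever was detected so far.
            }
            return charset != null ? charset.name() : null;
        }

        public static String getCharset(ByteArrayOutputStream bos) {
            String charset = null;
            try {
                ByteArrayInputStream is = new ByteArrayInputStream(bos.toByteArray());
                charset = getCharset(is, true); // the byte count of bos is known
                is.close();
            } catch (IOException e) {
                // Closing an in-memory stream does not fail in practice.
            }
            return charset;
        }
    }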

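For a network stream, since it is impossible to know exactly how much data has yet to arrive, read and buffer the byte stream first, then detect the encoding of the buffer.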
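The "buffer first, then detect" idea can be sketched without any third-party library for the specific case of telling UTF-8 from GBK. Note that this swaps in a different technique from cpdetector: it checks whether the buffered bytes form strictly valid UTF-8 and otherwise assumes GBK. The class and method names are illustrative, and the GBK fallback is an assumption that matches the scenario in this post.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class BufferedCharsetGuess {
    // Reads the whole stream into memory so that its length is known.
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    // Returns true if the bytes decode as UTF-8 without any malformed sequence.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Decodes as UTF-8 when the bytes are valid UTF-8, otherwise as GBK (assumed fallback).
    static String decode(byte[] data) {
        Charset cs = isValidUtf8(data) ? StandardCharsets.UTF_8 : Charset.forName("GBK");
        return new String(data, cs);
    }
}
```

A caller would use `decode(readFully(is))` on the decompressed response stream. Pure ASCII input is valid UTF-8 and decodes identically either way; a GBK payload that happens to be valid UTF-8 is rare but possible, which is where a statistical detector has the advantage.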

    HttpGet httpGet = null;
    HttpResponse httpResponse;
    InputStream is = null;
    BufferedReader in = null;
    ByteArrayOutputStream bos = null;
    try {
        httpGet = new HttpGet(url);
        //httpGet.addHeader();
        httpResponse = httpClient.execute(httpGet);
        int statusCode = httpResponse.getStatusLine().getStatusCode();
        HttpEntity httpEntity = httpResponse.getEntity();
        String json = "";
        if (httpEntity != null) {
            is = httpEntity.getContent();
            Header val = httpEntity.getContentEncoding();
            if (val != null && val.getValue() != null && val.getValue().contains("gzip")) {
                is = new GZIPInputStream(is);
            } else {
                BufferedInputStream bis = new BufferedInputStream(is);
                bis.mark(2);
                // Read the first two bytes
                byte[] header = new byte[2];
                int result = bis.read(header);
                // Reset the input stream to its start position
                bis.reset();
                // Check whether the payload is in GZIP format
                if (result != -1 && Utils.toInt(header, 0) == GZip_Value) {
                    is = new GZIPInputStream(bis);
                } else {
                    is = bis;
                }
            }
            if (encoding != null) {
                // Fix for the garbled text seen in some provinces
                boolean mustUseDefault = false;
                if (needDetectEncoding) {
                    /*
                    String chartsetFromHttpEntity = EntityUtils.getContentCharSet(httpEntity);
                    if (!TextUtils.isEmpty(chartsetFromHttpEntity)) {
                        chartsetFromHttpEntity = chartsetFromHttpEntity.toUpperCase();
                        mustUseDefault = chartsetFromHttpEntity.contains("UTF");
                    }
                    */
                    bos = new ByteArrayOutputStream();
                    byte[] buff = new byte[100]; // temporary buffer for each read in the loop
                    int rc = 0;
                    while ((rc = is.read(buff, 0, 100)) > 0) {
                        bos.write(buff, 0, rc);
                    }
                    String chartsetFromInputStream = EncodingDetector.getCharset(bos);
                    if (!TextUtils.isEmpty(chartsetFromInputStream)) {
                        chartsetFromInputStream = chartsetFromInputStream.toUpperCase();
                        mustUseDefault = chartsetFromInputStream.contains("UTF");
                    }
                    //android.util.Log.e("httpGetWithZip", chartsetFromHttpEntity + chartsetFromInputStream);
                    is.close();
                    is = new ByteArrayInputStream(bos.toByteArray());
                }
                if (mustUseDefault) {
                    // Platform default charset (UTF-8 on Android)
                    in = new BufferedReader(new InputStreamReader(is));
                } else {
                    in = new BufferedReader(new InputStreamReader(is, "GBK"));
                }
            } else {
                in = new BufferedReader(new InputStreamReader(is));
            }
            String line = "";
            while ((line = in.readLine()) != null) {
                json += line;
            }
        }
        if (statusCode != HttpStatus.SC_OK) {
            // Handling of non-200 responses is left empty in the original
        }
    } catch (ClientProtocolException e) {
        if (httpGet != null) {
            httpGet.abort();
        }
    } catch (IllegalArgumentException e) {
        if (httpGet != null) {
            httpGet.abort();
        }
    } catch (OutOfMemoryError e) {
        if (httpGet != null) {
            httpGet.abort();
        }
    } catch (IOException e) {
        rspInfo.setStatusCode(NetError);
        if (httpGet != null) {
            httpGet.abort();
        }
    } finally {
        if (is != null) {
            try { is.close(); } catch (IOException e) {}
        }
        if (in != null) {
            try { in.close(); } catch (IOException e) {}
        }
        if (bos != null) {
            try { bos.close(); } catch (IOException e) {}
        }
    }
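The request code above relies on `Utils.toInt` and `GZip_Value`, which are not shown in this post. As a hedged illustration of the same check, the sketch below tests for the two gzip magic bytes (0x1f, 0x8b) directly. `GzipSniffer` and `maybeGunzip` are hypothetical names and not the app's actual helpers.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

class GzipSniffer {
    // Wraps the stream in a GZIPInputStream only when it starts with the
    // gzip magic bytes 0x1f 0x8b; otherwise returns the buffered stream as-is.
    static InputStream maybeGunzip(InputStream in) throws IOException {
        BufferedInputStream bis = new BufferedInputStream(in);
        bis.mark(2);
        // Read one byte at a time: each call returns a byte or -1 at end of
        // stream, so a short read of a 2-byte array cannot occur.
        int b0 = bis.read();
        int b1 = bis.read();
        bis.reset(); // rewind so the consumer sees the stream from its start
        boolean gzip = (b0 == 0x1f && b1 == 0x8b);
        return gzip ? new GZIPInputStream(bis) : bis;
    }
}
```

Sniffing the payload rather than trusting Content-Encoding alone covers responses where the header is missing but the body is still compressed, which is the case the else branch above handles.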

For detector.add(JChardetFacade.getInstance()); to work, put cpdetector_1.0.10.jar in the \libs\ directory, together with antlr-2.7.4.jar, chardet-1.0.jar and jargs-1.0.jar.

0 0