How to determine the Chinese character encoding from the byte stream, and so fix the garbled text the same app shows in some provinces

Source: Internet  Editor: 程序博客网  Time: 2024/05/16 05:18

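Recently, China Telecom users in some provinces reported that part of the data our company's Android client app fetched over a 3G SIM card was displayed as garbled text, while the same data displayed correctly over Wi-Fi. Initial investigation showed that the data was encoded differently before gzip compression: GBK in some provinces and UTF-8 in others. After decompression the bytes might therefore not match the expected GBK encoding, which produced the garbled text. The network stream needs encoding detection, with the charset chosen from the detection result.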
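To make the failure mode concrete, here is a minimal sketch of what happens when UTF-8 bytes are decoded as GBK. The class and method names are made up for illustration, and it assumes the JVM ships the GBK charset (Android and standard desktop JDKs do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decodes the given bytes with the named charset.
    public static String decode(byte[] data, String charsetName) {
        return new String(data, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "\u4e2d\u6587" is the two-character word for "Chinese"; escapes are
        // used so the demo does not depend on the source file's own encoding.
        String original = "\u4e2d\u6587";
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);

        // The server sent UTF-8 but the client assumes GBK: the text is garbled.
        System.out.println(decode(utf8Bytes, "GBK").equals(original));   // false
        // Decoding with the charset that was actually used restores the text.
        System.out.println(decode(utf8Bytes, "UTF-8").equals(original)); // true
    }
}
```

The bytes on the wire are identical in both calls; only the charset the client assumes differs, which is why the same app can work in one province and fail in another.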


One simple approach is to read the character encoding from the HttpEntity's ContentType:

    public static String getContentCharSet(final HttpEntity entity)
            throws ParseException {
        if (entity == null) {
            throw new IllegalArgumentException("HTTP entity may not be null");
        }
        String charset = null;
        if (entity.getContentType() != null) {
            HeaderElement[] values = entity.getContentType().getElements();
            if (values.length > 0) {
                NameValuePair param = values[0].getParameterByName("charset");
                if (param != null) {
                    charset = param.getValue();
                }
            }
        }
        return charset;
    }

Testing shows that this does work, but it is not entirely reassuring. For example, if the HTTP header has no ContentType, how is the character encoding determined?
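The gap is easy to see with a small stand-alone parser. The sketch below is not part of HttpClient; `ContentTypeCharset` and `charsetOf` are hypothetical names, and the parsing is deliberately simplified (it does not handle whitespace around `=`). It returns null both when the header is missing and when it carries no charset parameter, and those are exactly the cases where the bytes themselves have to be inspected.

```java
import java.util.Locale;

public class ContentTypeCharset {
    /**
     * Returns the charset parameter of a Content-Type header value,
     * or null when the header is missing or has no charset parameter.
     */
    public static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String part = parts[i].trim();
            if (part.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = part.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

A caller would treat a null result as "unknown" and fall back to detecting the encoding from the response body.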


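There are plenty of blog posts about detecting character encodings, but most of them deal with files, and their approach fails on network streams. I recommend the open-source component cpdetector (494 KB in total), which can detect the encoding of both files and byte streams.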


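Below is the code of the EncodingDetector utility class. cpdetector is statistics-based: the more bytes it samples, the more accurate the result. For a file stream the byte count is known, so the number of bytes sampled is the file length minus 1, capped at 2000.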

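    import java.io.BufferedInputStream;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.Charset;

    import org.apache.http.protocol.HTTP;

    import info.monitorenter.cpdetector.io.ASCIIDetector;
    import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
    import info.monitorenter.cpdetector.io.JChardetFacade;
    import info.monitorenter.cpdetector.io.ParsingDetector;
    import info.monitorenter.cpdetector.io.UnicodeDetector;

    public class EncodingDetector {
        private static final CodepageDetectorProxy detector =
                CodepageDetectorProxy.getInstance();

        static {
            detector.add(new ParsingDetector(false));
            detector.add(JChardetFacade.getInstance());
            detector.add(ASCIIDetector.getInstance());
            detector.add(UnicodeDetector.getInstance());
        }

        public static String getCharset(InputStream is, boolean useAvailable) {
            Charset charset = null;
            // The more bytes sampled, the more accurate the result; the count
            // must not exceed the total length of the stream.
            int detectCharNum = 2000;
            try {
                if (useAvailable) {
                    int available = is.available();
                    // Some input streams cannot report their remaining byte count
                    // (a network stream does not know how much data has yet to arrive).
                    if (available <= 1) {
                        return HTTP.UTF_8;
                    }
                    if (detectCharNum > available) {
                        detectCharNum = available - 1;
                    }
                }
                BufferedInputStream bufferedInputStream = new BufferedInputStream(is);
                charset = detector.detectCodepage(bufferedInputStream, detectCharNum);
                bufferedInputStream.reset();
            } catch (Exception e) {
                // Detection failed; charset keeps whatever value it had.
            }
            return null != charset ? charset.name() : null;
        }

        public static String getCharset(ByteArrayOutputStream bos) {
            String charset = null;
            try {
                ByteArrayInputStream is = new ByteArrayInputStream(bos.toByteArray());
                charset = getCharset(is, true); // the byte count of bos is known
                is.close();
            } catch (IOException e) {
                // Closing a ByteArrayInputStream does not fail in practice.
            }
            return charset;
        }
    }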
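As a dependency-free cross-check of the detector's answer, and assuming the server only ever sends UTF-8 or GBK (as in the case described here), strict UTF-8 validation is often enough to tell the two apart. This is a different technique from cpdetector's statistical approach, and `Utf8Check` is a hypothetical name. GBK-encoded Chinese text almost never forms valid UTF-8, while pure ASCII decodes the same way under both charsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true only if the whole array is well-formed UTF-8.
    public static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Two-way guess: anything that is not valid UTF-8 is assumed to be GBK.
    public static String guessCharset(byte[] data) {
        return isValidUtf8(data) ? "UTF-8" : "GBK";
    }
}
```

The check should be run on the complete buffered body: a buffer cut off in the middle of a multi-byte character would be reported as invalid.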


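For a network stream, there is no way to know exactly how much data has yet to arrive, so read and buffer the byte stream first, then detect the encoding of the buffered bytes.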

    HttpGet httpGet = null;
    HttpResponse httpResponse;
    InputStream is = null;
    BufferedReader in = null;
    ByteArrayOutputStream bos = null;
    try {
        httpGet = new HttpGet(url);
        // httpGet.addHeader();
        httpResponse = httpClient.execute(httpGet);
        int statusCode = httpResponse.getStatusLine().getStatusCode();
        HttpEntity httpEntity = httpResponse.getEntity();
        String json = "";
        if (httpEntity != null) {
            is = httpEntity.getContent();
            Header val = httpEntity.getContentEncoding();
            if (val != null && val.getValue() != null && val.getValue().contains("gzip")) {
                is = new GZIPInputStream(is);
            } else {
                BufferedInputStream bis = new BufferedInputStream(is);
                bis.mark(2);
                // Peek at the first two bytes
                byte[] header = new byte[2];
                int result = bis.read(header);
                // Reset the input stream to its start
                bis.reset();
                // Check whether the content is in GZIP format
                if (result != -1 && Utils.toInt(header, 0) == GZip_Value) {
                    is = new GZIPInputStream(bis);
                } else {
                    is = bis;
                }
            }
            if (encoding != null) {
                // Fixes the garbled text seen in some provinces
                boolean mustUseDefault = false;
                if (needDetectEncoding) {
                    /*
                    String chartsetFromHttpEntity = EntityUtils.getContentCharSet(httpEntity);
                    if (!TextUtils.isEmpty(chartsetFromHttpEntity)) {
                        chartsetFromHttpEntity = chartsetFromHttpEntity.toUpperCase();
                        mustUseDefault = chartsetFromHttpEntity.contains("UTF");
                    }
                    */
                    bos = new ByteArrayOutputStream();
                    byte[] buff = new byte[100]; // scratch buffer for each read in the loop
                    int rc = 0;
                    while ((rc = is.read(buff, 0, 100)) > 0) {
                        bos.write(buff, 0, rc);
                    }
                    String chartsetFromInputStream = EncodingDetector.getCharset(bos);
                    if (!TextUtils.isEmpty(chartsetFromInputStream)) {
                        chartsetFromInputStream = chartsetFromInputStream.toUpperCase();
                        mustUseDefault = chartsetFromInputStream.contains("UTF");
                    }
                    // android.util.Log.e("httpGetWithZip", chartsetFromHttpEntity + chartsetFromInputStream);
                    is.close();
                    is = new ByteArrayInputStream(bos.toByteArray());
                }
                if (mustUseDefault) {
                    // Platform default charset, which is UTF-8 on Android
                    in = new BufferedReader(new InputStreamReader(is));
                } else {
                    in = new BufferedReader(new InputStreamReader(is, "GBK"));
                }
            } else {
                in = new BufferedReader(new InputStreamReader(is));
            }
            String line = "";
            while ((line = in.readLine()) != null) {
                json += line;
            }
        }
        if (statusCode != HttpStatus.SC_OK) {
        }
    } catch (ClientProtocolException e) {
        if (httpGet != null) {
            httpGet.abort();
        }
    } catch (IllegalArgumentException e) {
        if (httpGet != null) {
            httpGet.abort();
        }
    } catch (OutOfMemoryError e) {
        if (httpGet != null) {
            httpGet.abort();
        }
    } catch (IOException e) {
        rspInfo.setStatusCode(NetError);
        if (httpGet != null) {
            httpGet.abort();
        }
    } finally {
        if (is != null) {
            try {
                is.close();
            } catch (IOException e) {
            }
        }
        if (in != null) {
            try {
                in.close();
            } catch (IOException e) {
            }
        }
        if (bos != null) {
            try {
                bos.close();
            } catch (IOException e) {
            }
        }
    }
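The buffering and GZIP steps in the code above can be isolated into a small helper that uses only the JDK. This is a sketch under assumptions: `BodyBuffer` and its methods are hypothetical names, and the magic-number test (bytes 0x1F 0x8B) stands in for the `Utils.toInt(header, 0) == GZip_Value` comparison, whose implementation is not shown in the original.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;

public class BodyBuffer {
    // Reads the stream to its end and returns everything that arrived.
    public static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            bos.write(buf, 0, n);
        }
        return bos.toByteArray();
    }

    // A GZIP stream starts with the two magic bytes 0x1F 0x8B.
    public static boolean looksGzipped(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xFF) == 0x1F
                && (data[1] & 0xFF) == 0x8B;
    }

    // Buffers the whole body, decompressing it first if it is gzipped.
    public static byte[] readBody(InputStream in) throws IOException {
        byte[] raw = readFully(in);
        if (!looksGzipped(raw)) {
            return raw;
        }
        try (InputStream gz = new GZIPInputStream(new ByteArrayInputStream(raw))) {
            return readFully(gz);
        }
    }
}
```

Once the body is a plain byte array its length is known, so it can be handed to an encoding detector and then decoded with `new String(bytes, charsetName)`.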


For detector.add(JChardetFacade.getInstance()); to work correctly, put cpdetector_1.0.10.jar in the \libs\ directory, along with antlr-2.7.4.jar, chardet-1.0.jar and jargs-1.0.jar.

