How to detect a web page's encoding with HttpClient in Java

Source: Internet  Published by: 元数据与大数据  Editor: 程序博客网  Date: 2024/06/03 14:51

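I used to get garbled text (mojibake) quite often when fetching web pages. After digging through some references I finally understood how page encodings work, so here is a way to make HttpClient detect a page's encoding automatically.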


First, let's look at how browsers detect the encoding (source: http://every-best.iteye.com/blog/970861).

Browsers detect the encoding in three ways:

1. The Content-Type HTTP header
2. A meta tag (two kinds of meta tag can declare the encoding)
3. The BOM

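When none of these three is present, the default according to the standard is US-ASCII, not UTF-8: the document must use an ASCII-compatible encoding, and anything other than US-ASCII itself has to be declared in a meta element. That is how the standard reads, at least; real browsers do not necessarily follow it:
Quote: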
If the document does not start with a U+FEFF BYTE ORDER MARK (BOM) character, and if its encoding is not explicitly given by a Content-Type HTTP header, then the character encoding used must be an ASCII-compatible character encoding, and, in addition, if that encoding isn't US-ASCII itself, then the encoding must be specified using a meta element with a charset attribute or a meta element in the encoding declaration state.
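
The first of the three signals, the charset parameter of the Content-Type header, can be pulled out with plain Java. The following is a minimal sketch; the class name ContentTypeCharset and the method charsetFromContentType are hypothetical names chosen for illustration, not part of HttpClient.

```java
import java.util.Locale;

class ContentTypeCharset {
    /** Returns the charset parameter of a Content-Type value, or null if there is none. */
    static String charsetFromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        // Parameters follow the media type and are separated by ';'
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            int eq = param.indexOf('=');
            if (eq < 0) {
                continue;
            }
            // Parameter names are matched case-insensitively
            String name = param.substring(0, eq).trim().toLowerCase(Locale.ROOT);
            if (name.equals("charset")) {
                String value = param.substring(eq + 1).trim();
                // Strip optional surrounding double quotes
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(charsetFromContentType("text/html; charset=GBK")); // prints GBK
        System.out.println(charsetFromContentType("text/html"));              // prints null
    }
}
```

A null result means the header gave no answer, so the caller should move on to the BOM and the meta tag.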


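An encoding name must always be one that is registered with IANA. UTF-8, for example, must be written as UTF-8; some browsers accept UTF8 (without the hyphen), but others may fail on it.
IANA also states explicitly that encoding names are case-insensitive:
Quote: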
However, no distinction is made between use of upper and lower case letters.
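
On the Java side, java.nio.charset.Charset also resolves names case-insensitively, and it throws when a name is malformed or unknown to the JVM. The following sketch shows one way to resolve a declared name defensively; the class CharsetNames and the method resolve are hypothetical helpers, and which aliases a given JVM accepts is not covered here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetNames {
    /** Resolves a declared charset name, or returns fallback when the JVM cannot use it. */
    static Charset resolve(String declared, Charset fallback) {
        if (declared == null) {
            return fallback;
        }
        try {
            return Charset.forName(declared.trim());
        } catch (IllegalArgumentException e) {
            // Covers both IllegalCharsetNameException and UnsupportedCharsetException
            return fallback;
        }
    }

    public static void main(String[] args) {
        // Upper and lower case resolve to the same Charset object value
        System.out.println(resolve("utf-8", null).equals(resolve("UTF-8", null))); // prints true
        // An unknown name falls back instead of throwing
        System.out.println(resolve("no-such-charset-xyz", StandardCharsets.US_ASCII));
    }
}
```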


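In addition, when the encoding is declared only by a meta tag (that is, there is neither an HTTP header nor a BOM), the encoding must be a superset of ASCII. Otherwise the parser cannot read the markup far enough to find the declaration, which is the "meta tag may not be found" case the original poster mentioned.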

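The standard also forbids the following encodings. The problem is that browsers can nevertheless parse some of them, UTF-7 for example, which has led to security issues:
UTF-32, JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB, the ISO-2022 family, the EBCDIC family, CESU-8, UTF-7, BOCU-1, SCSU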
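
A crawler may want to refuse a declared encoding that appears on this list instead of decoding with it. Below is a rough sketch of such a check. The class ForbiddenEncodings and the method isForbidden are hypothetical, and matching the two families by the prefixes "ISO-2022-" and "EBCDIC" is a simplifying assumption: EBCDIC code pages registered under other names (IBM code page names, for instance) are not caught.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class ForbiddenEncodings {
    // Exact names taken from the list above, stored in upper case
    private static final Set<String> EXACT = new HashSet<String>(Arrays.asList(
            "UTF-32", "JIS_C6226-1983", "JIS_X0212-1990", "HZ-GB-2312",
            "JOHAB", "CESU-8", "UTF-7", "BOCU-1", "SCSU"));

    /** Returns true when the declared name is on the forbidden list (case-insensitive). */
    static boolean isForbidden(String name) {
        if (name == null) {
            return false;
        }
        String n = name.trim().toUpperCase(Locale.ROOT);
        // Families are approximated by name prefix (an assumption of this sketch)
        return EXACT.contains(n) || n.startsWith("ISO-2022-") || n.startsWith("EBCDIC");
    }

    public static void main(String[] args) {
        System.out.println(isForbidden("utf-7"));  // prints true
        System.out.println(isForbidden("UTF-8"));  // prints false
    }
}
```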

Finally, XML-based documents (such as XHTML) must declare the encoding in the XML declaration, for example <?xml version="1.0" encoding="UTF-8"?>
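
Reading the encoding out of an XML declaration can be sketched as follows. The class XmlDeclEncoding and the method encodingFromXmlDecl are hypothetical names. The sketch assumes the document has no BOM and uses an ASCII-compatible encoding, so the declaration can be read by decoding the first bytes as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class XmlDeclEncoding {
    // Matches an XML declaration at the very start and captures the encoding value
    private static final Pattern XML_DECL = Pattern.compile(
            "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the encoding named in the XML declaration, or null if none is found. */
    static String encodingFromXmlDecl(byte[] bytes) {
        if (bytes == null) {
            return null;
        }
        // The declaration must come first, so a short prefix is enough
        int n = Math.min(bytes.length, 200);
        // ISO-8859-1 maps every byte to one char, so decoding never fails
        String head = new String(bytes, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = XML_DECL.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] xhtml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><html/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(encodingFromXmlDecl(xhtml)); // prints UTF-8
    }
}
```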


With the principles clear, writing the code is straightforward.


// Fetches a page and decodes it with the detected charset.
// Uses Apache HttpClient 4.x classes. realUrl, logger, setParameters,
// util.UserAgent, ExtractorUtil, FileUtil and ZHConverter are defined
// elsewhere in the author's project.
public String readPage(String url) {
    String html = null;
    if (StringUtils.isBlank(url))
        return null;
    if (!(url.startsWith("http://") || url.startsWith("https://"))) {
        url = "http://" + url;
    }
    HttpClient client = new DefaultHttpClient();
    setParameters(client);
    HttpResponse response = null;
    HttpContext httpContext = new BasicHttpContext();
    HttpGet get = new HttpGet(url);
    get.addHeader("Accept", "text/html");
    get.addHeader("Accept-Charset", "gb2312,utf-8");
    get.addHeader("Accept-Encoding", "gzip");
    get.addHeader("Accept-Language", "zh-cn,zh,en-US,en");
    get.addHeader("User-Agent", util.UserAgent.getUserAgent());
    HttpEntity entity = null;
    try {
        response = client.execute(get, httpContext);
        entity = response.getEntity();
        // Unwrap the body if the server sent it gzip-compressed
        Header header = entity.getContentEncoding();
        if (header != null) {
            HeaderElement[] codecs = header.getElements();
            for (int i = 0; i < codecs.length; i++) {
                if (codecs[i].getName().equalsIgnoreCase("gzip")) {
                    response.setEntity(new GzipDecompressingEntity(entity));
                }
            }
        }
        entity = response.getEntity();
        // Record the final URL after any redirects
        HttpHost targetHost = (HttpHost) httpContext.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
        HttpUriRequest realRequest = (HttpUriRequest) httpContext.getAttribute(ExecutionContext.HTTP_REQUEST);
        realUrl = ExtractorUtil.connectUrl(targetHost.toString(), realRequest.toString());
        byte[] bytes = EntityUtils.toByteArray(entity);
        // 1. charset from the Content-Type header
        String charset = EntityUtils.getContentCharSet(entity);
        // 2. otherwise the meta tag, then the BOM (see getHtmlCharset)
        if (StringUtils.isBlank(charset)) {
            charset = FileUtil.getHtmlCharset(bytes);
        }
        html = new String(bytes, charset);
        // Convert Traditional Chinese (Big5) pages to Simplified Chinese
        if (charset.equalsIgnoreCase("BIG5")) {
            html = ZHConverter.convert(html, ZHConverter.SIMPLIFIED);
        }
        EntityUtils.consume(entity);
    } catch (Exception e) {
        logger.debug(e.getMessage() + url);
        e.printStackTrace();
    }
    return html;
}

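/**
 * Guesses the encoding from the byte-order mark (BOM) at the start of the HTML.
 * Only the first two bytes are inspected; GBK is the fallback when no BOM is found.
 * @param bytes raw page bytes
 * @return the guessed charset name, or null if fewer than two bytes are given
 */
public static String getEncode(byte[] bytes) {
    String code = null;
    if (bytes == null || bytes.length < 2) {
        return code;
    }
    int p = ((int) bytes[0] & 0x00ff) << 8 | ((int) bytes[1] & 0x00ff);
    switch (p) {
    case 0xefbb: // first two bytes of the UTF-8 BOM (EF BB BF)
        code = "UTF-8";
        break;
    case 0xfffe: // FF FE is the little-endian UTF-16 BOM
        code = "UTF-16LE";
        break;
    case 0xfeff: // FE FF is the big-endian UTF-16 BOM
        code = "UTF-16BE";
        break;
    default:
        code = "GBK";
    }
    return code;
}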

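/**
 * Returns the encoding of a web page.
 * 1. Look for charset information in an HTML meta tag
 * 2. Otherwise fall back to the BOM
 * @param bytes raw page bytes
 * @return the detected charset name
 */
public static String getHtmlCharset(byte[] bytes) {
    // Decoded with the platform default charset; this only needs to keep
    // the ASCII markup of the meta tag readable
    String content = new String(bytes);
    String charset = null;
    Pattern pattern = Pattern.compile("<[mM][eE][tT][aA][^>]*([cC][Hh][Aa][Rr][Ss][Ee][Tt][\\s]*=[\\s\\\"']*)([\\w-]*)[^>]*>");
    Matcher matcher = pattern.matcher(content);
    // An empty capture (charset= with no value) is treated as not found
    if (matcher.find() && matcher.group(2).length() > 0) {
        charset = matcher.group(2);
    } else {
        charset = getEncode(bytes);
    }
    return charset;
}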
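
The getEncode method above looks at only the first two bytes and always falls back to GBK. As a possible variation, and not the author's code, the sketch below checks the full three-byte UTF-8 signature and returns null when no BOM is present, leaving the choice of fallback to the caller. The class BomSniffer and the method charsetFromBom are hypothetical names.

```java
class BomSniffer {
    /** Returns the charset implied by a leading BOM, or null when there is none. */
    static String charsetFromBom(byte[] b) {
        if (b == null) {
            return null;
        }
        // UTF-8 BOM: EF BB BF (all three bytes are required)
        if (b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (b.length >= 2) {
            int p = ((b[0] & 0xFF) << 8) | (b[1] & 0xFF);
            if (p == 0xFEFF) {
                return "UTF-16BE"; // FE FF
            }
            if (p == 0xFFFE) {
                return "UTF-16LE"; // FF FE
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'a' };
        System.out.println(charsetFromBom(utf8)); // prints UTF-8
    }
}
```

Returning null instead of GBK keeps "no BOM" distinct from "GBK detected", so a caller can apply the remaining checks before settling on a default.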

