How HttpClient in Java detects a web page's encoding

Source: Internet | Published by: 元数据与大数据 | Editor: 程序博客网 | Time: 2024/06/03 14:51

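I used to run into garbled text all the time when reading web pages. After digging through some references, I finally understood how web page encodings work. Here I'd like to share a way to make HttpClient detect a page's encoding automatically.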

First, let's look at how browsers detect the encoding (source: http://every-best.iteye.com/blog/970861).

Browsers detect the encoding in three ways:

1. The Content-Type HTTP header
2. The meta tag (two kinds of meta tag can set the encoding)
3. The BOM
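The first of these sources, the Content-Type header, carries the encoding as a charset parameter. As a minimal sketch of how that parameter can be pulled out of a header value: the class name ContentTypeCharset and its regular expression are my own illustration, not part of HttpClient, and the pattern only covers the common forms of the header.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of HttpClient.
final class ContentTypeCharset {
    // Matches the charset parameter, quoted or not, e.g. text/html; charset="utf-8"
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("(?i);\\s*charset\\s*=\\s*[\"']?([^\\s;\"']+)");

    /** Returns the charset named in a Content-Type value, or null if none is present. */
    static String parse(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(parse("text/html; charset=GBK"));
        System.out.println(parse("text/html"));
    }
}
```

When the header has no charset parameter the method returns null, which is the signal to move on to the meta tag and the BOM.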

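When none of these three is present, the browser defaults to US-ASCII rather than UTF-8. At least that is how the standard reads, although browsers do not necessarily follow the standard here: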

Quote

If the document does not start with a U+FEFF BYTE ORDER MARK (BOM) character, and if its encoding is not explicitly given by a Content-Type HTTP header, then the character encoding used must be an ASCII-compatible character encoding, and, in addition, if that encoding isn't US-ASCII itself, then the encoding must be specified using a meta element with a charset attribute or a meta element in the encoding declaration state.

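Whenever you declare an encoding, make sure the name is one registered with IANA. For example, UTF-8 must be written as UTF-8. Some browsers accept UTF8 (without the hyphen), but others may fail on it.
IANA also states explicitly that encoding names are case-insensitive: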

Quote

However, no distinction is made between use of upper and lower case letters.
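On the Java side, java.nio.charset.Charset.forName also matches names without regard to case, but it throws an exception for names the JVM does not know. A minimal sketch of a tolerant lookup follows; the class name CharsetNames and the fallback parameter are assumptions made for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper for illustration.
final class CharsetNames {
    /** Resolves a declared name to a Charset, or returns the fallback if the JVM rejects it. */
    static Charset resolve(String name, Charset fallback) {
        if (name == null || name.trim().isEmpty()) {
            return fallback;
        }
        try {
            // Lookup is case-insensitive: "utf-8" and "UTF-8" give the same Charset.
            return Charset.forName(name.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        System.out.println(resolve("utf-8", StandardCharsets.ISO_8859_1));
        System.out.println(resolve("no-such-charset-x", StandardCharsets.ISO_8859_1));
    }
}
```

Resolving the name before calling new String(bytes, charset) avoids an exception when a page declares a misspelled or unsupported encoding.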

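Also, when the encoding is declared only through a meta tag, that is, when neither the HTTP header nor a BOM is present, the encoding must be a superset of ASCII. Otherwise the markup cannot be read far enough to find the meta tag, which is the situation the original poster described where the meta tag may not be found.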
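Because the declared encoding has to be an ASCII superset, the meta tag can be searched for before the real encoding is known. The sketch below decodes only the start of the page as ISO-8859-1, which maps every byte to one character and so leaves ASCII markup intact. The class name MetaCharsetScan, the 1024-byte scan limit and the regular expression are assumptions for this example, and the pattern is a simplification of what browsers actually do.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
final class MetaCharsetScan {
    // Assumption: the declaration sits near the top of the page.
    private static final int SCAN_LIMIT = 1024;

    // Covers <meta charset="..."> and <meta http-equiv=... content="...; charset=...">
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?\\s*([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in a meta tag, or null if none is found. */
    static String find(byte[] bytes) {
        if (bytes == null) {
            return null;
        }
        int len = Math.min(bytes.length, SCAN_LIMIT);
        // ISO-8859-1 never fails to decode, so non-ASCII bytes cannot break the scan.
        String head = new String(bytes, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"gb2312\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(find(page));
    }
}
```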

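The standard also forbids the following encodings. The problem is that browsers can nevertheless parse some of them, such as UTF-7, which leads to security issues:
UTF-32, JIS_C6226-1983, JIS_X0212-1990, HZ-GB-2312, JOHAB, the ISO-2022 family, the EBCDIC family, CESU-8, UTF-7, BOCU-1, SCSU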
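A crawler can refuse these names when a page declares one of them. The following is a minimal sketch, assuming the list above is complete; the class name DisallowedEncodings is hypothetical, and the prefix checks for the ISO-2022 and EBCDIC families are a simplification, since registered names in those families do not all share one prefix.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration.
final class DisallowedEncodings {
    private static final Set<String> EXACT = new HashSet<>(Arrays.asList(
            "UTF-32", "JIS_C6226-1983", "JIS_X0212-1990", "HZ-GB-2312",
            "JOHAB", "CESU-8", "UTF-7", "BOCU-1", "SCSU"));

    /** Returns true if the declared name is on the list of forbidden encodings. */
    static boolean isDisallowed(String name) {
        if (name == null) {
            return false;
        }
        // Names are case-insensitive, so normalize before comparing.
        String n = name.trim().toUpperCase(Locale.ROOT);
        // Simplification: treat the two families as name prefixes.
        return EXACT.contains(n) || n.startsWith("ISO-2022") || n.startsWith("EBCDIC");
    }

    public static void main(String[] args) {
        System.out.println(isDisallowed("utf-7"));
        System.out.println(isDisallowed("UTF-8"));
    }
}
```

A caller that gets true back can ignore the declaration and fall through to the next detection step.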

Finally, for XML-based documents (such as XHTML), the encoding must be specified with an XML declaration, e.g. <?xml version="1.0" encoding="UTF-8"?>
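The XML declaration can be read in much the same way as the meta tag. This is a minimal sketch, assuming the declaration is the very first thing in the document with no BOM in front of it; the class name XmlDeclEncoding and the regular expression are my own illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration.
final class XmlDeclEncoding {
    // The declaration must open the document, hence the ^ anchor.
    private static final Pattern XML_DECL = Pattern.compile(
            "^<\\?xml\\s[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Returns the encoding named in the XML declaration, or null if there is none. */
    static String find(byte[] bytes) {
        if (bytes == null) {
            return null;
        }
        // The declaration is short, so the first 256 bytes are enough (assumption).
        int len = Math.min(bytes.length, 256);
        String head = new String(bytes, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = XML_DECL.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] doc = "<?xml version=\"1.0\" encoding=\"UTF-8\"?><html/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(find(doc));
    }
}
```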

Once the principles are clear, writing the code is straightforward.

Java code

// Uses Apache HttpClient 4.x (org.apache.http.*) and Commons Lang StringUtils.
// setParameters, logger, realUrl, util.UserAgent, ExtractorUtil, FileUtil and
// ZHConverter are helpers from my own project and are not shown here.
public String readPage(String url) {
    String html = null;
    if (StringUtils.isBlank(url))
        return null;
    // Default to http:// when no scheme is given.
    if (!(url.startsWith("http://") || url.startsWith("https://"))) {
        url = "http://" + url;
    }
    HttpClient client = new DefaultHttpClient();
    setParameters(client);
    HttpResponse response = null;
    HttpContext httpContext = new BasicHttpContext();
    HttpGet get = new HttpGet(url);
    get.addHeader("Accept", "text/html");
    get.addHeader("Accept-Charset", "gb2312,utf-8");
    get.addHeader("Accept-Encoding", "gzip");
    get.addHeader("Accept-Language", "zh-cn,zh,en-US,en");
    get.addHeader("User-Agent", util.UserAgent.getUserAgent());
    HttpEntity entity = null;
    try {
        response = client.execute(get, httpContext);
        entity = response.getEntity();
        // Unwrap the body if the server sent it gzip-compressed.
        Header header = entity.getContentEncoding();
        if (header != null) {
            HeaderElement[] codecs = header.getElements();
            for (int i = 0; i < codecs.length; i++) {
                if (codecs[i].getName().equalsIgnoreCase("gzip")) {
                    response.setEntity(new GzipDecompressingEntity(entity));
                }
            }
        }
        entity = response.getEntity();
        // Record the URL that was actually fetched (after any redirects).
        HttpHost targetHost = (HttpHost) httpContext.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
        HttpUriRequest realRequest = (HttpUriRequest) httpContext.getAttribute(ExecutionContext.HTTP_REQUEST);
        realUrl = ExtractorUtil.connectUrl(targetHost.toString(), realRequest.toString());
        byte[] bytes = EntityUtils.toByteArray(entity);
        // Step 1: charset from the Content-Type header.
        String charset = EntityUtils.getContentCharSet(entity);
        // Step 2: otherwise look at the meta tag, then the BOM.
        if (StringUtils.isBlank(charset)) {
            charset = FileUtil.getHtmlCharset(bytes);
        }
        html = new String(bytes, charset);
        // Convert Traditional Chinese pages to Simplified Chinese.
        if (charset.equalsIgnoreCase("BIG5")) {
            html = ZHConverter.convert(html, ZHConverter.SIMPLIFIED);
        }
        EntityUtils.consume(entity);
    } catch (Exception e) {
        logger.debug(e.getMessage() + url);
        e.printStackTrace();
    } finally {
        // Release the connections held by this one-off client.
        client.getConnectionManager().shutdown();
    }
    return html;
}
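/**
 * Guesses the encoding from the byte-order mark (BOM) at the start of the HTML.
 * Falls back to GBK when no BOM is recognized.
 * @param bytes raw bytes of the page
 * @return the charset name, or null if fewer than two bytes are available
 */
public static String getEncode(byte[] bytes) {
    String code = null;
    if (bytes == null || bytes.length < 2) {
        return code;
    }
    // Combine the first two bytes into one 16-bit value.
    int p = ((int) bytes[0] & 0x00ff) << 8 | ((int) bytes[1] & 0x00ff);
    switch (p) {
    case 0xefbb: // first two bytes of the UTF-8 BOM (EF BB BF)
        code = "UTF-8";
        break;
    case 0xfffe: // UTF-16 little-endian BOM (FF FE)
        code = "UTF-16LE";
        break;
    case 0xfeff: // UTF-16 big-endian BOM (FE FF)
        code = "UTF-16BE";
        break;
    default: // no BOM found: assume GBK
        code = "GBK";
    }
    return code;
}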
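// Requires java.nio.charset.StandardCharsets, java.util.regex.Matcher and java.util.regex.Pattern.
/**
 * Returns the encoding of a web page.
 * 1. Check whether an HTML meta tag carries charset information.
 * 2. Otherwise fall back to the BOM.
 * @param bytes raw bytes of the page
 * @return the charset name
 */
public static String getHtmlCharset(byte[] bytes) {
    // ISO-8859-1 maps every byte to one char, so the ASCII markup stays readable
    // regardless of the platform default charset or the page's real encoding.
    String content = new String(bytes, StandardCharsets.ISO_8859_1);
    String charset = null;
    Pattern pattern = Pattern.compile(
            "<meta[^>]*(charset\\s*=[\\s\"']*)([\\w-]*)[^>]*>", Pattern.CASE_INSENSITIVE);
    Matcher matcher = pattern.matcher(content);
    if (matcher.find()) {
        charset = matcher.group(2);
    }
    // No meta charset, or an empty one: fall back to the BOM.
    if (charset == null || charset.isEmpty()) {
        charset = getEncode(bytes);
    }
    return charset;
}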
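The BOM check above looks at only the first two bytes. As a variation, here is a minimal sketch that compares the complete signatures, including all three bytes of the UTF-8 BOM, and returns null when there is no BOM so that the caller can choose its own fallback. The class name BomSniffer is hypothetical, and UTF-32 BOMs are deliberately left out.

```java
// Hypothetical helper for illustration.
final class BomSniffer {
    /** Returns the charset implied by a leading BOM, or null if there is none. */
    static String detect(byte[] b) {
        if (b == null) {
            return null;
        }
        // UTF-8 BOM: EF BB BF
        if (b.length >= 3 && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        // UTF-16 big-endian BOM: FE FF
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        // UTF-16 little-endian BOM: FF FE
        if (b.length >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(detect(utf8));
    }
}
```

Returning null instead of GBK keeps the "no BOM" case distinct from a real detection, which matters if the pages being crawled are not mostly Chinese.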

0 0