
Source: Internet  Editor: 程序博客网  Date: 2024/05/21 10:31


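Recently, while building a web crawler with HttpClient, I ran into a modest problem: after sending a request to a given URL with HttpGet, the Response could not be parsed correctly and came out as garbled text such as 口口??. The character encoding, including Chinese encodings, had already been taken into account. The code was as follows: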

// Processing logic
HttpResponse response = HttpUtils.doGet(baseUrl + title + postUrl, headers);
InputStream is = getInputStreamFromResponse(response);
responseText = Utils.getStringFromStream(is);
result = EncodeUtils.unicdoeToGB2312(responseText);

// Helper functions used above
public static HttpResponse doGet(String url, Map<String, String> headers) throws IOException {
    HttpClient client = createHttpClient();
    HttpGet getMethod = new HttpGet(url);
    // Note: this excerpt does not apply the headers map to the request
    return client.execute(getMethod);
}

// Reads the stream line by line as UTF-8 text; no decompression happens here
public static String getStringFromStream(InputStream in) throws IOException {
    StringBuilder buffer = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            buffer.append(line).append("\n");
        }
    }
    return buffer.toString();
}






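The processing logic above does not account for the fact that the Response's InputStream is compressed and has to be read through the matching stream class. The response headers (captured in the original post's figure) report a Content-Encoding of gzip, so the stream must be wrapped in a GZIPInputStream. Only the function public static String getStringFromStream(InputStream in) from the listing above needs to be improved, as shown below: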

// Requires java.io.IOException, java.io.InputStream, java.io.InputStreamReader,
// java.util.zip.GZIPInputStream, org.apache.http.Header, org.apache.http.HttpResponse
public static String getStringFromResponse(HttpResponse response) {
    if (response == null) {
        return null;
    }
    InputStream in = getInputStreamFromResponse(response);
    try {
        for (Header h : response.getHeaders("Content-Encoding")) {
            if (h.getValue().indexOf("gzip") > -1) {
                // gzip response: decompress first, then decode as UTF-8
                GZIPInputStream gzin = new GZIPInputStream(in);
                InputStreamReader isr = new InputStreamReader(gzin, "utf-8");
                return Utils.getStringFromInputStreamReader(isr);
            }
        }
        // No gzip encoding declared: read the stream as-is
        return Utils.getStringFromStream(in);
    } catch (IOException exception) {
        exception.printStackTrace();
        return "";
    }
}
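To see why the first version produced garbage, the effect can be reproduced without any network access. The following is a minimal sketch using only JDK classes; the class name GzipGarbleDemo and its helper methods are made up for illustration and are not part of HttpClient or of the post's Utils class. It compresses a string the way a gzip-encoding server would, then compares decoding the compressed bytes directly with decompressing them first.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of the original post's code.
final class GzipGarbleDemo {

    // Compress text the way a server sending "Content-Encoding: gzip" would.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Correct path: decompress first, then decode the bytes as UTF-8.
    static String gunzip(byte[] compressed) {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gz.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u4f60\u597d, HttpClient"; // "你好, HttpClient"
        byte[] body = gzip(original);

        // The mistake from the first listing: treat compressed bytes as text.
        String garbled = new String(body, StandardCharsets.UTF_8);

        System.out.println(garbled.equals(original));      // false
        System.out.println(gunzip(body).equals(original)); // true
    }
}
```

No choice of charset can repair the first case, because the bytes being decoded are compressed data rather than encoded text.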





RFC 2616 for HTTP/1.1 specifies how web servers must indicate encoding transformations using the Content-Encoding header. Although on the surface Content-Encoding (e.g., gzip, deflate, compress) and Content-Type (e.g., application/x-gzip) sound similar, they are in fact two distinct pieces of information. Servers use Content-Type to specify the data type of the entity body, which is useful for client applications that want to open the content with the appropriate application. Content-Encoding is used solely to specify any additional encoding done by the server before the content was transmitted to the client. Although the HTTP RFC outlines these rules pretty clearly, some web sites respond with “gzip” as the Content-Encoding even though the server has not gzipped the content.
Our testing has shown this problem to be limited to some sites that serve Unix/Linux style “tarball” files. Tarballs are gzip-compressed archive files. By setting the Content-Encoding header to “gzip” on a tarball, the server is specifying that it has additionally gzipped the already gzipped file. This, of course, is unlikely but not impossible or non-compliant.
Therein lies the problem. A server responding with content-encoding, such as “gzip,” is specifying the necessary mechanism that the client needs in order to decompress the content. If the server did not actually encode the content as specified, then the client’s decompression would fail.
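One defensive option for a crawler, sketched here as an assumption rather than anything the quoted text or HttpClient prescribes, is to peek at the first two bytes of the body and decompress only when they match the gzip magic number (0x1f 0x8b). The class GzipSniffer and its methods are hypothetical names. This guards against a body that is declared gzip but arrives as plain data; it does not help in the tarball case above, where the payload legitimately begins with the gzip magic number.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, not part of the original post's code.
final class GzipSniffer {

    // Returns a decompressing stream only if the body really starts with
    // the gzip magic number; otherwise returns the data unchanged.
    static InputStream maybeGunzip(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw);
        in.mark(2);             // remember the start of the stream
        int b1 = in.read();
        int b2 = in.read();
        in.reset();             // rewind so no bytes are lost
        boolean looksGzipped = (b1 == 0x1f && b2 == 0x8b);
        return looksGzipped ? new GZIPInputStream(in) : in;
    }

    // Reads a whole body as UTF-8 text, decompressing when needed.
    static String readText(byte[] body) {
        try (InputStream in = maybeGunzip(new ByteArrayInputStream(body))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Test helper: gzip-compress a UTF-8 string.
    static byte[] gzip(String text) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(buffer)) {
                gz.write(text.getBytes(StandardCharsets.UTF_8));
            }
            return buffer.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] plain = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(readText(plain));         // hello
        System.out.println(readText(gzip("hello"))); // hello
    }
}
```

In the post's getStringFromResponse, a check like this would sit where the GZIPInputStream is constructed, so a mislabelled plain-text body falls through to the ordinary reader instead of raising an exception.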


Official HTTP/1.1 protocol documentation
