crawler httpclient: crawling data from some sohu (搜狐) pages

Source: Internet | Published by: php sig dfl | Editor: 程序博客网 | Time: 2024/06/06 09:52

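Some sohu pages are returned gzip-compressed, so after receiving a response you need to check its Content-Encoding header and, if it is gzip, decompress the body before parsing the content.

The following example fetches the sohu home page: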

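import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPInputStream;

import org.apache.commons.httpclient.Header;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;

public class SohuCrawler {
    public static void main(String[] args) {
        HttpClient client = new HttpClient();
        GetMethod method = new GetMethod("http://www.sohu.com");
        try {
            int statusCode = client.executeMethod(method);
            // Check whether the server compressed the body with gzip.
            Header encoding = method.getResponseHeader("Content-Encoding");
            if (encoding != null && encoding.getValue() != null
                    && encoding.getValue().toLowerCase().indexOf("gzip") != -1) {
                // Decompress the whole body into memory.
                GZIPInputStream gin = new GZIPInputStream(method.getResponseBodyAsStream());
                ByteArrayOutputStream os = new ByteArrayOutputStream();
                byte[] bs = new byte[1024];
                int len = -1;
                while ((len = gin.read(bs)) != -1) {
                    os.write(bs, 0, len);
                }
                // Decode the decompressed bytes as gb2312.
                System.out.println(os.toString("gb2312"));
            } else {
                // Body is not compressed; let HttpClient decode it.
                System.out.println(method.getResponseBodyAsString());
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            try {
                // Always return the connection to the connection manager.
                method.releaseConnection();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}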

Below are the status line and response headers returned by the server when requesting the sohu home page:

HTTP/1.1 200 OK
Content-Type: text/html
Connection: keep-alive
Date: Sat, 04 Dec 2010 15:22:41 GMT
Server: Apache
Vary: Accept-Encoding,X-Up-Calling-Line-id,X-Source-ID,X-Up-Bearer-Type
Cache-Control: max-age=105
Expires: Sat, 04 Dec 2010 15:24:26 GMT
Last-Modified: Sat, 04 Dec 2010 15:01:20 GMT
Content-Encoding: gzip
Content-Length: 64524
FSS-Cache: HIT from 10231717.19079087.10957868
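
The gzip check and decompression step from the example can also be pulled out into a small helper that depends only on the JDK, which makes it easy to try without a network connection. The following is a minimal sketch, not part of the original code: the class name GzipBodyDecoder and its methods isGzip, decode and gzip are hypothetical names chosen for illustration. Because the Content-Type header above carries no charset, the sketch takes the charset as an explicit parameter. The demo uses UTF-8 so it runs on any JDK; for the sohu page you would presumably pass Charset.forName("gb2312"), as the original example does.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper (not from the original post): decodes a response body
// based on the value of its Content-Encoding header.
public class GzipBodyDecoder {

    // Returns true if the Content-Encoding value mentions gzip (null-safe).
    public static boolean isGzip(String contentEncoding) {
        return contentEncoding != null
                && contentEncoding.toLowerCase(Locale.ROOT).contains("gzip");
    }

    // Reads the whole body, decompressing it first when it is gzip-encoded,
    // and decodes the resulting bytes with the given charset.
    public static String decode(InputStream body, String contentEncoding, Charset charset)
            throws IOException {
        InputStream in = isGzip(contentEncoding) ? new GZIPInputStream(body) : body;
        try (InputStream src = in; ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[1024];
            int len;
            while ((len = src.read(buf)) != -1) {
                out.write(buf, 0, len);
            }
            return new String(out.toByteArray(), charset);
        }
    }

    // Test utility: gzip-compresses a byte array so decode() can be exercised offline.
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Charset cs = StandardCharsets.UTF_8; // assumption for the demo only
        byte[] page = "<html>sohu</html>".getBytes(cs);

        // Simulates a response with "Content-Encoding: gzip".
        String fromGzip = decode(new ByteArrayInputStream(gzip(page)), "gzip", cs);
        // Simulates a response without a Content-Encoding header.
        String fromPlain = decode(new ByteArrayInputStream(page), null, cs);

        System.out.println(fromGzip);
        System.out.println(fromGzip.equals(fromPlain));
    }
}
```

With Commons HttpClient, such a helper could be fed method.getResponseBodyAsStream() together with the value of the Content-Encoding header, leaving the connection handling in the example unchanged.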