Process Blocking Caused by HttpClient While a Crawler Fetches Pages

Source: Internet  Editor: 程序博客网  Time: 2024/06/06 02:58

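I'm currently working on a crawler project that scrapes book detail pages from several book websites, and I ran into a really nasty problem: every other site finishes crawling quickly, but on site A the thread hangs, always blocking on some random page. I traced the fault to the download function. This is the original version: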

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLConnection;

    public String staticDownload(String urlstr, String encoding, String param) throws Exception {
        StringBuffer buffer = new StringBuffer();
        URL url = null;
        PrintWriter out = null;
        BufferedReader in = null;
        try {
            url = new URL(urlstr);
            URLConnection connection = url.openConnection();
            ((HttpURLConnection) connection).setRequestMethod("POST");
            connection.setDoOutput(true);
            connection.setDoInput(true);
            connection.setConnectTimeout(5000);
            connection.setReadTimeout(5000);
            connection.setRequestProperty("accept", "*/*");
            connection.setRequestProperty("connection", "Keep-Alive");
            connection.setRequestProperty("User-Agent", "Mozilla/5.0 "
                    + "(Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.14) "
                    + "Gecko/20080404 Firefox/2.0.0.14");
            out = new PrintWriter(connection.getOutputStream());
            // Send the request parameters
            out.print(param);
            // Flush the output stream's buffer
            out.flush();
            in = new BufferedReader(new InputStreamReader(connection.getInputStream(), encoding));
            String line;
            while ((line = in.readLine()) != null) {
                buffer.append(line);
                buffer.append("\r\n");
            }
        } catch (Exception e) {
            // TODO: handle exception
            // Note: as written, every exception is silently swallowed here,
            // including any SocketTimeoutException raised by the timeouts above.
        } finally {
            try {
                if (out != null) {
                    out.close();
                }
                if (in != null) {
                    in.close();
                }
            } catch (IOException ex) {
                ex.printStackTrace();
            }
        }
        return buffer.toString();
    }

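A brief digression: there are many ways to download a page. For a crawler it is best to simulate browser behavior, for example with WebClient. Some pages, however, need human-like input, such as a site's search page, where a search term must be submitted to get the content to crawl. As we know, there are two ways to send such a request to a site: GET and POST.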

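Both URLConnection and HttpClient access network resources over HTTP and can send either kind of request. Links comparing the two are listed in the references; I won't analyze the differences here. The function above obviously uses the former.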

After some research I learned that readLine() is a blocking call; when there is no data to read, it simply waits:

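1. It returns null only when the end of the stream is reached, for example when the server closes the connection. An I/O error surfaces as an IOException, not as null.
2. If no buffer size is specified, BufferedReader uses a buffer of 8192 characters. Until the stream ends, readLine() returns only after it sees "\r", "\n", or "\r\n".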
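To make these two rules concrete, here is a minimal sketch using only the JDK. The names ReadLineDemo and readAllLines are made up for illustration, and a StringReader stands in for the network stream, so nothing here can actually block.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

final class ReadLineDemo {
    /** Collects every line that BufferedReader.readLine() returns for the given text. */
    static List<String> readAllLines(String text) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            // readLine() strips the terminator and returns null only at end of stream.
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // "\r", "\n" and "\r\n" each end a line; the last line ends at end of stream.
        System.out.println(readAllLines("a\rb\nc\r\nd")); // prints [a, b, c, d]
    }
}
```

On a real socket the final "d" would only be returned once the peer closes the connection (or a read timeout fires), because there is no terminator after it.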

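We don't know whether the server of the site being crawled returns any content at all (an empty response also blocks), and if it does, whether each line ends with one of those three terminators. If not, readLine() blocks and the while loop never exits. That seemed to be the real problem, so readLine() had to go; the material I read also recommends avoiding readLine() on socket streams.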
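One way to drop readLine() is to read fixed-size chunks of characters and cap the total amount buffered. The sketch below is only an illustration: ChunkedReadDemo, readAll, and the maxChars limit are hypothetical names, not part of the original crawler. Note that read() on a socket stream still blocks while no data arrives, so this removes the dependence on line terminators but not the need for a timeout.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class ChunkedReadDemo {
    /**
     * Reads characters in chunks until end of stream or until maxChars
     * characters have been collected, whichever comes first.
     */
    static String readAll(Reader reader, int maxChars) throws IOException {
        StringBuilder out = new StringBuilder();
        char[] chunk = new char[1024];
        while (out.length() < maxChars) {
            int wanted = Math.min(chunk.length, maxChars - out.length());
            int n = reader.read(chunk, 0, wanted);
            if (n == -1) {
                break; // end of stream
            }
            out.append(chunk, 0, n);
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readAll(new StringReader("hello world"), 5)); // prints hello
    }
}
```

Unlike readLine(), this keeps the line terminators exactly as the server sent them.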

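Since URLConnection wasn't working, I switched to HttpClient, which is more powerful and doesn't need readLine(); I was desperate enough to try anything. Our problem is in the function that fetches a page via POST, where param is the value passed in. Running the crawler again, the problem was traced to: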

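    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.UnsupportedEncodingException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.http.HttpEntity;
    import org.apache.http.NameValuePair;
    import org.apache.http.client.ClientProtocolException;
    import org.apache.http.client.entity.UrlEncodedFormEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.message.BasicNameValuePair;
    import org.apache.http.util.EntityUtils;

    public String staticDownloadByHttpClient(String urlstr, String encoding, boolean bFrame, String param) throws IOException {
        String bufferStr = null;
        // Create a default HttpClient instance
        CloseableHttpClient httpclient = HttpClients.createDefault();
        // Create the POST request
        HttpPost httppost = new HttpPost(urlstr);
        // Build the form parameter list
        List<NameValuePair> formparams = new ArrayList<NameValuePair>();
        String name = param.split("=")[0];
        String value = param.split("=")[1];
        formparams.add(new BasicNameValuePair(name, value));
        UrlEncodedFormEntity uefEntity;
        try {
            uefEntity = new UrlEncodedFormEntity(formparams, "UTF-8");
            httppost.setEntity(uefEntity);
            CloseableHttpResponse response = httpclient.execute(httppost);
            if (response == null) {
                httpclient.close();
                return bufferStr;
            }
            try {
                HttpEntity entity = response.getEntity();
                if (entity != null) {
                    InputStream is = entity.getContent();
                    InputStreamReader in = new InputStreamReader(is, encoding);
                    int ch = 0;
                    // This if seems pointless; I added it out of worry that the site might return empty data
                    if ((ch = in.read()) != -1) {
                        // The problem occurs at the following statement
                        bufferStr = EntityUtils.toString(entity, encoding);
                    } else {
                        try {
                            // sleepTime is a field defined elsewhere in the class
                            Thread.sleep(sleepTime);
                        } catch (InterruptedException e) {
                            e.printStackTrace();
                        }
                    }
                }
                try {
                    EntityUtils.consume(entity);
                } catch (final IOException ignore) {
                }
            } finally {
                response.close();
            }
        } catch (ClientProtocolException e) {
            e.printStackTrace();
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the connection and release resources
            try {
                httpclient.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return bufferStr;
    }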

With no other choice, I went to look at the source code of toString. I made some minor changes to it, but it is basically this:

    // Adapted from EntityUtils.toString in HttpClient; logUtil is my own logging helper.
    private String toString(final HttpEntity entity, final Charset defaultCharset) throws IOException, ParseException {
        Args.notNull(entity, "Entity");
        final InputStream instream = entity.getContent();
        if (instream == null) {
            return null;
        }
        try {
            Args.check(entity.getContentLength() <= Integer.MAX_VALUE,
                    "HTTP entity too large to be buffered in memory");
            int i = (int) entity.getContentLength();
            if (i < 0) {
                i = 4096;
            }
            Charset charset = null;
            try {
                final ContentType contentType = ContentType.get(entity);
                if (contentType != null) {
                    charset = contentType.getCharset();
                }
            } catch (final UnsupportedCharsetException ex) {
                throw new UnsupportedEncodingException(ex.getMessage());
            }
            if (charset == null) {
                charset = defaultCharset;
            }
            if (charset == null) {
                charset = HTTP.DEF_CONTENT_CHARSET;
            }
            final Reader reader = new InputStreamReader(instream, charset);
            final CharArrayBuffer buffer = new CharArrayBuffer(i);
            final char[] tmp = new char[1024];
            int l;
            long dis = System.currentTimeMillis();
            // The problem is still here
            while (reader.ready() && (l = reader.read(tmp)) != -1) {
                buffer.append(tmp, 0, l);
                long now = System.currentTimeMillis();
                if (now - dis > 5 * 60 * 1000) {
                    logUtil.getLogger().error(String.format("MSG: the content that site return is too large to be buffered in memory, elapsed: %s ms", now - dis));
                    break;
                }
            }
            return buffer.toString();
        } finally {
            instream.close();
        }
    }

    private String toString(final HttpEntity entity, final String defaultCharset) throws IOException, ParseException {
        return toString(entity, defaultCharset != null ? Charset.forName(defaultCharset) : null);
    }

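I kept digging and, sure enough, it was the while loop hanging again. This code clearly reads characters in chunks rather than line by line, so readLine() is not to blame. In despair I searched Baidu for "HttpClient post 超时处理" (HttpClient POST timeout handling) and found a very short post by an expert. One sentence in it says:
BTW, in version 4.3, if no timeout is set and the server does not respond, the wait is extremely long (>24 hours).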

I checked the version of my HttpClient jar and strongly felt the problem was about to be solved, so I immediately added timeout settings:

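    import org.apache.http.client.config.RequestConfig;

    // Set the connect timeout and the socket (data transfer) timeout, in milliseconds
    RequestConfig requestConfig = RequestConfig.custom()
            .setSocketTimeout(6000)
            .setConnectTimeout(6000)
            .build();
    httppost.setConfig(requestConfig);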
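One caveat: a socket timeout limits how long each individual read may wait, not how long the whole request may take, so a server that keeps trickling data could still occupy a crawler thread for a long time. If that matters, an overall deadline can be layered on top. The following is a rough sketch using only java.util.concurrent; DeadlineDemo, callWithDeadline, and the fallback parameter are hypothetical names, and this is not what the original crawler does.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class DeadlineDemo {
    /**
     * Runs the task on a worker thread and returns its result, or the
     * fallback value if the task has not finished within deadlineMillis.
     */
    static <T> T callWithDeadline(Callable<T> task, long deadlineMillis, T fallback)
            throws InterruptedException, ExecutionException {
        ExecutorService worker = Executors.newSingleThreadExecutor(runnable -> {
            Thread thread = new Thread(runnable, "download-worker");
            thread.setDaemon(true); // a stuck download must not keep the JVM alive
            return thread;
        });
        try {
            Future<T> future = worker.submit(task);
            try {
                return future.get(deadlineMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                // Interrupts the worker; a thread blocked in socket I/O may not react to this.
                future.cancel(true);
                return fallback;
            }
        } finally {
            worker.shutdownNow();
        }
    }
}
```

A call such as callWithDeadline(() -> download(url), 60_000, null), where download is whatever download function is in use, would give up waiting after a minute. Because a thread blocked on a socket read may ignore the interrupt, this complements the socket timeout rather than replacing it.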

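I have now tested it more than 8 times and the thread has not hung again. Thinking about it afterwards, shouldn't anything that involves socket programming have timeout handling? Come on, read the following with me: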

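We know that a Socket blocks when reading data: if no data arrives, the program stays blocked there. For synchronous requests we certainly cannot allow that to happen, so we need to break the block once the request has waited a certain time, letting the program continue. Socket provides the setSoTimeout() method to set the timeout for receiving data, in milliseconds. When the timeout is greater than 0 and the Socket has still not received any data after that time, it throws a SocketTimeoutException.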
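The behavior described in that passage can be reproduced locally. In this minimal sketch (SoTimeoutDemo and readTimesOut are illustrative names), a ServerSocket on the loopback interface lets the connection be established but never sends a byte, which stands in for the unresponsive site; the client then reads with SO_TIMEOUT set.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class SoTimeoutDemo {
    /**
     * Connects to a local server that never sends any data and attempts one
     * read. Returns true if the read was ended by SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        // Port 0 asks the OS for any free port; the server never writes anything.
        try (ServerSocket silentServer = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, silentServer.getLocalPort())) {
            client.setSoTimeout(timeoutMillis); // 0 would mean "wait forever"
            InputStream in = client.getInputStream();
            try {
                in.read(); // blocks until data, end of stream, or the timeout
                return false;
            } catch (SocketTimeoutException expected) {
                return true;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readTimesOut(200));
    }
}
```

Without the setSoTimeout() call, the same read() would wait indefinitely, which matches the hang seen in the crawler.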

References:
http://blog.csdn.net/hguang_zjh/article/details/33743249
http://blog.csdn.net/wuhong_csdn/article/details/50830349
http://witcheryne.iteye.com/blog/1135817
http://www.yiibai.com/java/io/bufferedreader_ready.html
https://zhidao.baidu.com/question/330258186.html
https://my.oschina.net/u/577453/blog/173724
http://elim.iteye.com/blog/1979837

0 0