Investigating a Redis Exception

Source: Internet    Editor: 程序博客网    Time: 2024/05/16 03:14

Problem

Today a colleague asked me to help investigate a Redis problem. This is the exception he gave me:

redis.clients.jedis.exceptions.JedisConnectionException: Could not get a resource from the pool

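At first glance it looks as if the pool is full and no new connection can be obtained, which causes the error.

How it works

Before hunting for the cause, let's look at how Jedis obtains a Redis connection. redis.clients.util.Pool.getResource returns an available Redis connection from the JedisPool. Reading the source shows that JedisPool extends redis.clients.util.Pool, and that Pool manages its Jedis instances through org.apache.commons.pool2.impl.GenericObjectPool from the open-source commons-pool toolkit. The code looks like this: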

    public abstract class Pool<T> {
        protected GenericObjectPool<T> internalPool;

        /**
         * Using this constructor means you have to set and initialize the
         * internalPool yourself.
         */
        public Pool() {
        }

        public Pool(final GenericObjectPoolConfig poolConfig,
                PooledObjectFactory<T> factory) {
            initPool(poolConfig, factory);
        }

        public void initPool(final GenericObjectPoolConfig poolConfig,
                PooledObjectFactory<T> factory) {
            if (this.internalPool != null) {
                try {
                    closeInternalPool();
                } catch (Exception e) {
                }
            }
            this.internalPool = new GenericObjectPool<T>(factory, poolConfig);
        }

        public T getResource() {
            try {
                return internalPool.borrowObject();
            } catch (Exception e) {
                throw new JedisConnectionException("Could not get a resource from the pool", e);
            }
        }
    }

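commons-pool has three important properties:

MaxActive: the maximum number of connection instances that can be allocated at the same time; a negative value means no limit. (commons-pool2, which the Jedis code above uses, calls this maxTotal.)

MaxIdle: the maximum number of idle connection instances; a negative value means no limit. Before an idle instance is used, it is normally made usable through the activateObject() method of org.apache.commons.pool.BasePoolableObjectFactory.

MaxWait: the maximum time to wait for an available connection, in milliseconds. (commons-pool2 calls this maxWaitMillis.)

(Note: pool.getResource() actually calls the borrowObject() method of GenericObjectPool. When no connection is available, that method blocks according to MaxWait until it times out; see the API documentation for the exact semantics.) In other words, when the pool has no idle connection and cannot create a new one, the caller waits for up to maxWait. If no connection becomes available before the timeout, the "Could not get a resource from the pool" exception is thrown.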
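To make the waiting behaviour concrete, here is a minimal sketch of a bounded pool that hands out an idle object, creates one while it is under its limit, and otherwise waits for up to maxWaitMillis. It is a toy stand-in written with java.util.concurrent only, not the real GenericObjectPool; the class BoundedPoolSketch and its methods borrow() and giveBack() are hypothetical names used for illustration.

```java
import java.util.NoSuchElementException;
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

/** Toy pool: a simplified stand-in for GenericObjectPool, not the real class. */
final class BoundedPoolSketch {
    private final BlockingDeque<String> idle = new LinkedBlockingDeque<>();
    private final int maxTotal;        // upper bound on objects ever created
    private final long maxWaitMillis;  // negative means wait forever
    private int created = 0;

    BoundedPoolSketch(int maxTotal, long maxWaitMillis) {
        this.maxTotal = maxTotal;
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Take an idle object, else create one, else wait up to maxWaitMillis. */
    String borrow() {
        String conn = idle.pollFirst();
        if (conn == null) {
            conn = tryCreate();
        }
        if (conn == null) {
            try {
                conn = (maxWaitMillis < 0)
                        ? idle.takeFirst()
                        : idle.pollFirst(maxWaitMillis, TimeUnit.MILLISECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting", e);
            }
        }
        if (conn == null) {
            throw new NoSuchElementException("Timeout waiting for idle object");
        }
        return conn;
    }

    /** Return a borrowed object so that a waiting caller can pick it up. */
    void giveBack(String conn) {
        idle.addFirst(conn);
    }

    /** Create a new object only while the pool is below maxTotal. */
    private synchronized String tryCreate() {
        if (created >= maxTotal) {
            return null;
        }
        created++;
        return "conn-" + created;
    }

    public static void main(String[] args) {
        BoundedPoolSketch pool = new BoundedPoolSketch(1, 100);
        String first = pool.borrow();
        try {
            pool.borrow(); // pool is exhausted, so this waits and then fails
        } catch (NoSuchElementException e) {
            System.out.println("second borrow failed: " + e.getMessage());
        }
        pool.giveBack(first);
        System.out.println("after return: " + pool.borrow());
    }
}
```

With a limit of one object, the second borrow() can only succeed if the first object is given back within the wait time, which is the situation the exception message seems to describe at first glance.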

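Investigation

Now let's start looking for the problem, beginning with the pool settings described above in our system's Redis configuration.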

    redis.servers=10.20.101.148:6379
    redis.pool.maxTotal=100
    redis.pool.maxIdle=50
    redis.pool.minIdle=20
    redis.pool.testOnBorrow=true

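maxTotal is already set to 100, and every Jedis connection is used inside a try-with-resources block, so it goes back to the pool when we are done with it. A connection leak is unlikely.

If the configuration is fine, the next thing to check is whether the Redis server itself has run out of connections.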

    # Server
    redis_version:2.8.8
    redis_git_sha1:00000000
    redis_git_dirty:0
    redis_build_id:bb0fd57e1222dc84
    redis_mode:standalone
    os:Linux 2.6.32-358.el6.x86_64 x86_64
    arch_bits:64
    multiplexing_api:epoll
    gcc_version:4.4.7
    process_id:2730
    run_id:e1a4cc088c68639ba4346fe23eac91dbc4fe0af4
    tcp_port:6379
    uptime_in_seconds:11208321
    uptime_in_days:129
    hz:10
    lru_clock:6056690
    config_file:/home/lot/local/redis/conf/redis.conf

    # Clients
    connected_clients:12
    client_longest_output_list:0
    client_biggest_input_buf:0
    blocked_clients:0

Only 12 clients are connected right now, and the Redis default for maxclients is 10000. No problem here either.
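The same check can be scripted. The sketch below parses the "key:value" lines of INFO output and compares connected_clients with a limit. It assumes the INFO text has already been fetched as a string (for example with redis-cli), and RedisInfoSketch, parse() and clientLimitReached() are hypothetical helper names, not part of Jedis or Redis.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for reading fields out of Redis INFO output. */
final class RedisInfoSketch {

    /** Parse "key:value" lines; section headers ("# ...") and blank lines are skipped. */
    static Map<String, String> parse(String info) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String line : info.split("\\r?\\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int sep = line.indexOf(':'); // split at the first colon only
            if (sep > 0) {
                fields.put(line.substring(0, sep), line.substring(sep + 1));
            }
        }
        return fields;
    }

    /** True when connected_clients has reached the given maxclients value. */
    static boolean clientLimitReached(Map<String, String> fields, int maxClients) {
        int connected = Integer.parseInt(fields.getOrDefault("connected_clients", "0"));
        return connected >= maxClients;
    }

    public static void main(String[] args) {
        String info = "# Clients\r\nconnected_clients:12\r\nblocked_clients:0\r\n";
        Map<String, String> fields = parse(info);
        System.out.println("connected_clients = " + fields.get("connected_clients"));
        System.out.println("limit reached: " + clientLimitReached(fields, 10000));
    }
}
```

The limit is passed in as a parameter because the sketch does not query the server; on a live instance the configured value can be read with CONFIG GET maxclients.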

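The usual checks turn up nothing, so the only thing left is to read the source. Going back to the exception stack trace, there is one more entry in the log: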

    Caused by: java.util.NoSuchElementException: Unable to validate object
            at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:497)
            at org.apache.commons.pool2.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:360)
            at redis.clients.util.Pool.getResource(Pool.java:40)
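The outer message ("Could not get a resource from the pool") hid the more specific "Caused by" entry. A small sketch of how code can surface the innermost cause instead of only the wrapper message follows. RootCauseSketch and rootCause() are hypothetical names, and a plain RuntimeException stands in for JedisConnectionException so that the example needs nothing beyond the JDK.

```java
import java.util.NoSuchElementException;

/** Hypothetical helper that digs out the innermost exception. */
final class RootCauseSketch {

    /** Follow getCause() links until the innermost exception is reached. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        // Rebuild the shape of the exception from the log. RuntimeException
        // is only a stand-in for JedisConnectionException here.
        NoSuchElementException inner =
                new NoSuchElementException("Unable to validate object");
        RuntimeException outer =
                new RuntimeException("Could not get a resource from the pool", inner);

        Throwable root = rootCause(outer);
        // prints "NoSuchElementException: Unable to validate object"
        System.out.println(root.getClass().getSimpleName() + ": " + root.getMessage());
    }
}
```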

Back to the source:

     1  public T borrowObject(long borrowMaxWaitMillis) throws Exception {
     2      assertOpen();
     3      AbandonedConfig ac = this.abandonedConfig;
     4      if (ac != null && ac.getRemoveAbandonedOnBorrow() &&
     5              (getNumIdle() < 2) &&
     6              (getNumActive() > getMaxTotal() - 3) ) {
     7          removeAbandoned(ac);
     8      }
     9      PooledObject<T> p = null;
    10      // Get local copy of current config so it is consistent for entire
    11      // method execution
    12      boolean blockWhenExhausted = getBlockWhenExhausted();
    13      boolean create;
    14      long waitTime = 0;
    15      while (p == null) {
    16          create = false;
    17          if (blockWhenExhausted) {
    18              p = idleObjects.pollFirst();
    19              if (p == null) {
    20                  create = true;
    21                  p = create();
    22              }
    23              if (p == null) {
    24                  if (borrowMaxWaitMillis < 0) {
    25                      p = idleObjects.takeFirst();
    26                  } else {
    27                      waitTime = System.currentTimeMillis();
    28                      p = idleObjects.pollFirst(borrowMaxWaitMillis,
    29                              TimeUnit.MILLISECONDS);
    30                      waitTime = System.currentTimeMillis() - waitTime;
    31                  }
    32              }
    33              if (p == null) {
    34                  throw new NoSuchElementException(
    35                          "Timeout waiting for idle object");
    36              }
    37              if (!p.allocate()) {
    38                  p = null;
    39              }
    40          } else {
    41              p = idleObjects.pollFirst();
    42              if (p == null) {
    43                  create = true;
    44                  p = create();
    45              }
    46              if (p == null) {
    47                  throw new NoSuchElementException("Pool exhausted");
    48              }
    49              if (!p.allocate()) {
    50                  p = null;
    51              }
    52          }
    53          if (p != null) {
    54              try {
    55                  factory.activateObject(p);
    56              } catch (Exception e) {
    57                  try {
    58                      destroy(p);
    59                  } catch (Exception e1) {
    60                      // Ignore - activation failure is more important
    61                  }
    62                  p = null;
    63                  if (create) {
    64                      NoSuchElementException nsee = new NoSuchElementException(
    65                              "Unable to activate object");
    66                      nsee.initCause(e);
    67                      throw nsee;
    68                  }
    69              }
    70              if (p != null && getTestOnBorrow()) {
    71                  boolean validate = false;
    72                  Throwable validationThrowable = null;
    73                  try {
    74                      validate = factory.validateObject(p);
    75                  } catch (Throwable t) {
    76                      PoolUtils.checkRethrow(t);
    77                      validationThrowable = t;
    78                  }
    79                  if (!validate) {
    80                      try {
    81                          destroy(p);
    82                          destroyedByBorrowValidationCount.incrementAndGet();
    83                      } catch (Exception e) {
    84                          // Ignore - validation failure is more important
    85                      }
    86                      p = null;
    87                      if (create) {
    88                          NoSuchElementException nsee = new NoSuchElementException(
    89                                  "Unable to validate object");
    90                          nsee.initCause(validationThrowable);
    91                          throw nsee;
    92                      }
    93                  }
    94              }
    95          }
    96      }
    97      updateStatsBorrow(p, waitTime);
    98      return p.getObject();
    99  }
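
The exception is thrown at line 88 of the listing above (NoSuchElementException nsee = new NoSuchElementException("Unable to validate object"); this is line 497 in the actual source file). Reading the code, an object was obtained from the pool but then failed validation (line 79). Because the exception is only thrown when create is true, the object that failed was a newly created connection, not a stale idle one. Next, look at the validateObject method: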

    public boolean validateObject(PooledObject<Jedis> pooledJedis) {
        final BinaryJedis jedis = pooledJedis.getObject();
        try {
            return jedis.isConnected() && jedis.ping().equals("PONG");
        } catch (final Exception e) {
            return false;
        }
    }

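So that's the cause: validation fails whenever the client cannot get a PONG back from the server. (Redis uses PING and PONG as its heartbeat.)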
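The check in validateObject boils down to "send PING, accept only PONG". As a rough sketch of the same idea without Jedis, the code below classifies a raw reply line in the Redis wire protocol, where a simple-string reply starts with "+" and an error reply starts with "-". PingReplySketch and its methods are hypothetical names. The ping() method sends PING as an inline command over a plain socket and is included only for illustration; it is not called in main so the example runs without a server.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical health check that mirrors the PING/PONG validation idea. */
final class PingReplySketch {

    /** Only the simple-string reply "+PONG" counts as healthy. */
    static boolean isHealthy(String replyLine) {
        return replyLine != null && replyLine.trim().equals("+PONG");
    }

    /** Error replies start with '-', for example "-MISCONF ...". */
    static boolean isErrorReply(String replyLine) {
        return replyLine != null && replyLine.startsWith("-");
    }

    /** Send an inline PING and return the first reply line (needs a reachable server). */
    static String ping(String host, int port, int timeoutMillis) throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis);
            OutputStream out = socket.getOutputStream();
            out.write("PING\r\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.US_ASCII));
            return in.readLine();
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "+PONG",
            "-MISCONF Redis is configured to save RDB snapshots, "
                + "but is currently not able to persist on disk."
        };
        for (String reply : samples) {
            System.out.println(isHealthy(reply) + " <- " + reply);
        }
    }
}
```

The point of the sketch is that a server can accept the TCP connection and still fail the check: any reply other than PONG, including an error reply, makes validation return false.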

Let's log in with the Redis client and PING the server:

    127.0.0.1:6379> PING
    (error) MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.

So Redis cannot write its RDB snapshot to disk, which suggests the disk is full. Let's confirm:

    [lot@mob925 bin]$ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda2        29G  3.1G   25G  12% /
    tmpfs            24G     0   24G   0% /dev/shm
    /dev/sda1       485M   37M  423M   8% /boot
    /dev/sda5       877G  832G  4.0K 100% /home

/home really is full.

After clearing out the unneeded files, Redis went back to normal.
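A full disk like this is easy to catch early. As one possible sketch (not something the original setup had), the code below computes how full the file store behind a path is, using only java.nio.file. DiskSpaceSketch and usedPercent() are hypothetical names, and any alert threshold is left to the caller.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical disk usage check for the volume that holds the Redis data. */
final class DiskSpaceSketch {

    /** Used share of a volume in percent, given its total and usable bytes. */
    static double usedPercent(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return 0.0;
        }
        return 100.0 * (totalBytes - usableBytes) / totalBytes;
    }

    /** Used share of the file store that contains the given path. */
    static double usedPercent(Path path) throws IOException {
        FileStore store = Files.getFileStore(path);
        return usedPercent(store.getTotalSpace(), store.getUsableSpace());
    }

    public static void main(String[] args) throws IOException {
        // Pass the directory to check as the first argument, e.g. /home.
        Path path = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.printf("%s is %.1f%% full%n", path.toAbsolutePath(), usedPercent(path));
    }
}
```

Running it against the directory that holds the Redis dump file gives roughly the same Use% figure that df reports for that filesystem.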

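Lessons learned

  1. Check the log and the full stack trace before investigating a problem.
  2. Understand how the thing that is failing actually works.
  3. Reading the source code is always one way to investigate a problem.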







