一次数据库hang,大量enqueue等待事件的问题

来源:互联网 发布:邪恶镰刀淘宝 编辑:程序博客网 时间:2024/06/08 12:13
现象:数据库hang,大量enqueue等待事件,大约20分钟后数据库数据库自己恢复
平台 oracle 9.2.0.7 + hpux 11.31


收到告警:数据库断连和非空闲等待事件2948个,经查询bomc监控日志,2948个等待事件具体如下,因此原因基本可以确定为大量enqueue(锁等待)等待导致数据库异常,从p1=1397817350可以推断出lock mode为SQ,及sequence相关。Enqueue的进一步原因和相关应用sql还需要进一步确认。

       SID    SERIAL# OSUSER                          USERNAME                          SVRPROC PROCESS      EVENT                                                                         P1         P2              P3
---------- ---------- ------------------------------ ------------------------------ ------------ ------------ ---------------------------------------------------------------- ---------- ---------- ----------
         25     12810                                                                            7832                          enqueue                                                                    1397817350     149               0
         63     40789                                                                            8166                         enqueue                                                                      1397817350     149               0
         59     32332                                                                            20857                      enqueue                                                                      1397817350     149               0
         43     60179                                                                            18928                      enqueue                                                                      1397817350     149               0
       399     29721                                                                            19165                      enqueue   
.。。。。。

经查询,enqueue类型为SQ
SELECT chr(to_char(bitand(1397817350,-16777216))/16777215)||
  2                   chr(to_char(bitand(1397817350, 16711680))/65535) "Lock",
  3                   to_char( bitand(1397817350, 65535) )    "Mode" from dual;

Lock   Mode
-- -
SQ   6

SQL> select OWNER,OBJECT_NAME,OBJECT_TYPE from dba_objects where object_id=149;

OWNER                          OBJECT_NAME          OBJECT_TYPE
------------------------------ -------------------- ------------------
SYS                            AUDSES$              SEQUENCE

通过收集故障时段的statspack报告,得出执行速度比较慢的语句,特别是含有squence的部分主要如下。但这些语句理论上该seq为循环模式,不应该导致enqueue等待。很可能由于连接数超过最大数,导致部分业务不能正常连接。p2=149看,应该是AUDSES$ 导致enqueue,而AUDSES$正是登录使用的sequeue,对应v$session中的AUDSID列

SQL> SELECT text FROM ALL_SOURCE where TYPE='FUNCTION' AND NAME='DSM_GET_MG_ID';

TEXT
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
function       DSM_GET_MG_ID return varchar2 is
  V_Result varchar2(32);
  V_SEQ Varchar2(20);
begin
  select dsm_seq_mgid.nextval into v_seq from dual;

  V_Result :=to_char(sysdate,'YYYYMMDDHH24miSS')||v_seq;
  return(V_Result);
end DSM_GET_MG_ID;


10 rows selected.


SQL> select * from dba_sequences where SEQUENCE_NAME = 'DSM_SEQ_MSGID';

SEQUENCE_OWNER                 SEQUENCE_NAME                   MIN_VALUE  MAX_VALUE INCREMENT_BY C O CACHE_SIZE LAST_NUMBER
------------------------------ ------------------------------ ---------- ---------- ------------ - - ---------- -----------
CBOSS                          DSM_SEQ_MGID                      100000     999999            1 Y Y         10      489050
    
    
    继续分析,感觉sequeue不是最终的原因,查看系统日志,看出了别的问题
    经过继续分析,昨日问题现在看根本原因应该是日志文件不能正常切换导致数据库hang,进而导致各种等待事件。已1_59448.dbf文件为例,17:01就开始归档,17:39:27才最终完成,正常情况下应该1分钟内完成,由于未知原因导致花费40分钟才完成一个日志文件的归档动作。我看了下相关文件系统,未出现归档表空间满的情况,故障时间点利用率只有15%。



Mon Mar 26 17:01:41 2012
ARC1: Evaluating archive   log 3 thread 1 sequence 59448
ARC1: Unable to archive log 3 thread 1 sequence 59448
      Log actively being archived by another process
Mon Mar 26 17:01:46 2012
Beginning log switch checkpoint up to RBA [0xe83a.2.10], SCN: 0x0b09.08dc9044
Thread 1 advanced to log sequence 59450
  Current log# 2 seq# 59450 mem# 0: /dev/vgcboss02/rredo02
Mon Mar 26 17:01:46 2012
ARC1: Evaluating archive   log 3 thread 1 sequence 59448
ARC1: Unable to archive log 3 thread 1 sequence 59448
      Log actively being archived by another process
ARC1: Evaluating archive   log 1 thread 1 sequence 59449
ARC1: Beginning to archive log 1 thread 1 sequence 59449
Creating archive destination LOG_ARCHIVE_DEST_1: '/cboss/arch/1_59449.dbf'
Mon Mar 26 17:04:49 2012
Completed checkpoint up to RBA [0xe83a.2.10], SCN: 0x0b09.08dc9044
Mon Mar 26 17:14:22 2012
ORACLE Instance cbossdb - Can not allocate log, archival required
Mon Mar 26 17:14:22 2012
ARCH: Connecting to console port...
Thread 1 cannot allocate new log, sequence 59451
All online logs needed archiving
  Current log# 2 seq# 59450 mem# 0: /dev/vgcboss02/rredo02
Mon Mar 26 17:39:25 2012
ARC1: Completed archiving  log 1 thread 1 sequence 59449
Mon Mar 26 17:39:25 2012
Beginning log switch checkpoint up to RBA [0xe83b.2.10], SCN: 0x0b09.09609366


NAME                                                                             COMPLETION_TIME      S
-------------------------------------------------------------------------------- -------------------- -
/cboss/arch/1_59436.dbf                                                          2012/03/26  14:22:34 A
/cboss/arch/1_59437.dbf                                                          2012/03/26  14:30:44 A
/cboss/arch/1_59438.dbf                                                          2012/03/26  14:45:22 A
/cboss/arch/1_59439.dbf                                                          2012/03/26  14:55:34 A
/cboss/arch/1_59440.dbf                                                          2012/03/26  15:05:23 A
/cboss/arch/1_59441.dbf                                                          2012/03/26  15:16:23 A
/cboss/arch/1_59442.dbf                                                          2012/03/26  15:26:59 A
/cboss/arch/1_59443.dbf                                                          2012/03/26  15:42:02 A
/cboss/arch/1_59444.dbf                                                          2012/03/26  15:57:22 A
/cboss/arch/1_59445.dbf                                                          2012/03/26  16:12:52 A
/cboss/arch/1_59446.dbf                                                          2012/03/26  16:27:01 A

NAME                                                                             COMPLETION_TIME      S
-------------------------------------------------------------------------------- -------------------- -
/cboss/arch/1_59447.dbf                                                          2012/03/26  16:40:53 A
/cboss/arch/1_59449.dbf                                                          2012/03/26  17:39:25 A
/cboss/arch/1_59448.dbf                                                          2012/03/26  17:39:27 A
/cboss/arch/1_59450.dbf                                                          2012/03/26  17:40:32 A
/cboss/arch/1_59451.dbf                                                          2012/03/26  17:51:06 A
/cboss/arch/1_59452.dbf                                                          2012/03/26  18:06:51 A
/cboss/arch/1_59453.dbf                                                          2012/03/26  18:33:36 A
/cboss/arch/1_59454.dbf                                                          2012/03/26  19:12:46 A
/cboss/arch/1_59455.dbf                                                          2012/03/26  19:54:10 A
/cboss/arch/1_59456.dbf                                                          2012/03/26  20:33:57 A

总结分析:最终归档慢的原因没有找到,但原因却不是当初表现的seq问题,所以分析问题还需要全面进行系统查看。主机、数据库的日志是必须要分析的