VKTM导致ASMB终止导致ORALCE实例宕

来源：互联网发布：eclipse发布java程序编辑：程序博客网时间：2024/05/08 02:51

昨天在查看数据库备份情况时，查看日志发现：

RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_24/thread_1_seq_595.499.883235121 thread=1 sequence=595
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_25/thread_1_seq_596.498.883270839 thread=1 sequence=596
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_25/thread_1_seq_597.497.883270839 thread=1 sequence=597
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_26/thread_1_seq_598.496.883357239 thread=1 sequence=598
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_26/thread_1_seq_599.495.883357239 thread=1 sequence=599
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_27/thread_1_seq_600.494.883443639 thread=1 sequence=600
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_27/thread_1_seq_601.493.883443639 thread=1 sequence=601
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_28/thread_1_seq_602.492.883530039 thread=1 sequence=602
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_28/thread_1_seq_603.491.883530039 thread=1 sequence=603
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_29/thread_1_seq_604.490.883616439 thread=1 sequence=604
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_29/thread_1_seq_605.489.883616439 thread=1 sequence=605
RMAN-08137: WARNING: archived log not deleted, needed for standby or upstream capture process
archived log file name=+DATA/iokbl/archivelog/2015_06_30/thread_1_seq_606.488.883702839 thread=1 sequence=606

数据库版本为11.2.0.4，linux环境，用的是DATA GUARD,通过日志可以看出，归档日志未传递到备库，日志不能删除，而主库一直正常运行，说明备库出了问题，于是用sqlplus / as sysdba登录备库，提示连接到一个空实例，说明备库宕了。

查看alert日志：

Wed Jun 24 01:00:01 2015
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Wed Jun 24 15:05:41 2015
Warning: VKTM detected a time drift.
Time drifts can result in an unexpected behavior such as time-outs. Please check trace file for more details.
Wed Jun 24 15:05:42 2015
NOTE: ASMB terminating
Errors in file /u01/app/oracle/diag/rdbms/standby/standby/trace/standby_asmb_114106.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1261 Serial number: 11
Errors in file /u01/app/oracle/diag/rdbms/standby/standby/trace/standby_asmb_114106.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1261 Serial number: 11
ASMB (ospid: 114106): terminating the instance due to error 15064
Wed Jun 24 15:05:42 2015
System state dump requested by (instance=1, osid=114106 (ASMB)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/standby/standby/trace/standby_diag_114080_20150624150542.trc
Dumping diagnostic data in directory=[cdmp_20150624150542], requested by (instance=1, osid=114106 (ASMB)), summary=[abnormal instance termination].
Instance terminated by ASMB, pid = 114106

原来6月24号就已经挂了，因为备份在主库上做，策略是删除前7天的归档日志，也未看备库信息，所以一直没发现问题，看来以后要随时观察主备库情况，有问题及时解决

那么就先把备库起来再说

SQL> startup;
ORACLE instance started.

Total System Global Area 732352512 bytes
Fixed Size 1367144 bytes
Variable Size 331350936 bytes
Database Buffers 377487360 bytes
Redo Buffers 22147072 bytes
Database mounted.
Database opened.

启日志应用：

SQL> altter database recover managed standby database disconnect from session;

接下来分析日志吧，先看当时的VKTM trace file：

*** 2015-06-22 20:00:01.102
kstmchkdrift (kstmhighrestimecntkeeper:highres): Stall, backward drift ended at 1434974497506826 drift: 1071
kstmchkdrift (kstmhighrestimecntkeeper:lowres): Time stalled at 1434974401
kstmchkdrift (kstmhighrestimecntkeeper:lowres): Stall, backward drift ended at 1434974402 drift: 1
kstmchkdrift (kstmhighrestimecntkeeper:highres): Time stalled at 1435024898106410

*** 2015-06-23 10:00:01.197
kstmchkdrift (kstmhighrestimecntkeeper:highres): Stall, backward drift ended at 1435024898107666 drift: 1256
kstmchkdrift (kstmhighrestimecntkeeper:highres): Time stalled at 1435028498622049

*** 2015-06-23 11:00:01.108
kstmchkdrift (kstmhighrestimecntkeeper:highres): Stall, backward drift ended at 1435028498623142 drift: 1093
kstmchkdrift (kstmhighrestimecntkeeper:lowres): Time stalled at 1435028401
kstmchkdrift (kstmhighrestimecntkeeper:lowres): Stall, backward drift ended at 1435028402 drift: 1
kstmchkdrift (kstmhighrestimecntkeeper:highres): Time stalled at 1435078899223760

*** 2015-06-24 01:00:01.205
kstmchkdrift (kstmhighrestimecntkeeper:highres): Stall, backward drift ended at 1435078899224947 drift: 1187
kstmchkdrift (kstmhighrestimecntkeeper:highres): Time stalled at 1435082499729169

*** 2015-06-24 02:00:01.106
kstmchkdrift (kstmhighrestimecntkeeper:highres): Stall, backward drift ended at 1435082499730249 drift: 1080
kstmchkdrift (kstmhighrestimecntkeeper:highres): Time jumped forward by (339800056)usec at (1435129640131631) whereas (1000000) is allowed

一直有关于时间不同步的警告，查看service ntpd tatus,发现ntp已经停了，于是先启动ntp，service ntpd start

根据alert中的信息，把systemstate dump信息转换下，用aas.awk工具:

awk -f ass1038.awk /u01/app/oracle/diag/rdbms/standby/standby/trace/standby_diag_114080_20150624150542.trc

Starting Systemstate 1
.................................................................
Ass.Awk Version 1.0.38
~~~~~~~~~~~~~~~~~~~~~~
Source file : /u01/app/oracle/diag/rdbms/standby/standby/trace/standby_diag_114080_20150624150542.trc

System State 1 (2015-06-24 15:05:42.666)
~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~
1: [DEAD]
2: waiting for 'pmon timer' [DEAD]
3: waiting for 'rdbms ipc message' [DEAD]
4: waiting for 'VKTM Logical Idle Wait' [DEAD]
5: waiting for 'rdbms ipc message' [DEAD]
6: waiting for 'DIAG idle wait'
7: waiting for 'rdbms ipc message' [DEAD]
8: waiting for 'DIAG idle wait' [DEAD]
9: waiting for 'rdbms ipc message' [DEAD]
10: waiting for 'rdbms ipc message'
11: waiting for 'rdbms ipc message'
12: waiting for 'rdbms ipc message'
13: waiting for 'rdbms ipc message'
14: waiting for 'rdbms ipc message' [DEAD]
15: waiting for 'rdbms ipc message' [DEAD]
16: waiting for 'smon timer' [DEAD]
17: waiting for 'rdbms ipc message' [DEAD]
18: waiting for 'rdbms ipc message' [DEAD]
19:
20: waiting for 'rdbms ipc message' [DEAD]
21: waiting for 'rdbms ipc message' [DEAD]
22: waiting for 'wait for unread message on broadcast channel' [DEAD]
23: waiting for 'SQL*Net message from client' [DEAD]
24: waiting for 'SQL*Net message from client' [DEAD]
26: waiting for 'rdbms ipc message' [DEAD]
27: waiting for 'rdbms ipc message' [DEAD]
28: waiting for 'rdbms ipc message' [DEAD]
29: waiting for 'rdbms ipc message' [DEAD]
30: waiting for 'rdbms ipc message' [DEAD]
32: waiting for 'SQL*Net message from client' [DEAD]
34: waiting for 'SQL*Net message from client' [DEAD]
36: waiting for 'class slave wait' [DEAD]
37: waiting for 'MRP wait on archivelog arrival' [DEAD]
38: waiting for 'parallel recovery slave next change' [DEAD]
39: waiting for 'parallel recovery slave next change' [DEAD]
40: waiting for 'parallel recovery slave next change' [DEAD]
41: waiting for 'parallel recovery slave next change' [DEAD]
42: waiting for 'parallel recovery slave next change' [DEAD]
43: waiting for 'parallel recovery slave next change' [DEAD]
44: waiting for 'parallel recovery slave next change' [DEAD]
45: waiting for 'parallel recovery slave next change' [DEAD]
46: waiting for 'parallel recovery slave next change' [DEAD]
47: waiting for 'parallel recovery slave next change' [DEAD]
48: waiting for 'parallel recovery slave next change' [DEAD]
49: waiting for 'parallel recovery slave next change' [DEAD]
50: waiting for 'parallel recovery slave next change' [DEAD]
51: waiting for 'parallel recovery slave next change' [DEAD]
52: waiting for 'parallel recovery slave next change' [DEAD]
53: waiting for 'parallel recovery slave next change' [DEAD]
54: waiting for 'parallel recovery slave next change' [DEAD]
55: waiting for 'parallel recovery slave next change' [DEAD]
56: waiting for 'parallel recovery slave next change' [DEAD]
57: waiting for 'parallel recovery slave next change' [DEAD]
58: waiting for 'parallel recovery slave next change' [DEAD]
59: waiting for 'parallel recovery slave next change' [DEAD]
60: waiting for 'parallel recovery slave next change' [DEAD]
61: waiting for 'parallel recovery slave next change' [DEAD]
62: waiting for 'parallel recovery slave next change' [DEAD]
63: waiting for 'parallel recovery slave next change' [DEAD]
64: waiting for 'parallel recovery slave next change' [DEAD]
65: waiting for 'parallel recovery slave next change' [DEAD]
66: waiting for 'parallel recovery slave next change' [DEAD]
67: waiting for 'parallel recovery slave next change' [DEAD]
68: waiting for 'parallel recovery slave next change' [DEAD]
69: waiting for 'parallel recovery slave next change' [DEAD]

NO BLOCKING PROCESSES FOUND

------------------ooOoo------------------
For the LATEST version of this utility see
http://dlsunuk11.uk.oracle.com/Public/Utils.html#ass

For additional documentation see
http://dlsunuk11.uk.oracle.com/Public/TOOLS/Ass.html

Suggested improvements, bugs etc. should be sent to kevin.p.quinn@oracle.com

End of report. 49578 Lines Processed.

由于ASMB进程死掉，导致oralce实例宕，于是不能接收从主库传递过来的日志信息，起因就是ntp出了问题

具体可见MOS ORA-15064 as per ASM instance founds sessions unresponsive for over 3 mins (文档 ID 1615459.1)

根据MOS文档 Base Bug 11837095介绍，此bug已在11.2.0.3中fixed.

根据MOS文档 ID 1347586.1介绍，可以设置事件来解决此问题：

alter system set event="10795 trace name context forever, level 2" scope=spfile;

然后重启实例。后期要注意观察。另可见大神惜纷飞的博客也有说明：http://www.oratping.com/?p=359

0 0