ORA-600 [KFDAUDEALLOC2] AND INSTANCE CRASH EVEN WITH THE FIX OF BUG 14467061 (Doc ID 1903273.1)


In this Document

 Symptoms Cause Solution References


This document is being delivered to you via Oracle Support's Rapid Visibility (RaV) process and therefore has not been subject to an independent technical review.

APPLIES TO:

Oracle Database - Enterprise Edition - Version 11.2.0.3 to 12.1.0.1 [Release 11.2 to 12.1]
Information in this document applies to any platform.

SYMPTOMS

 

The customer hit ORA-600 [kfdAuDealloc2] and the ASM instance crashed even though the fix for bug 14467061 was already applied.

 

Mon Jun 02 09:39:25 2014 
Errors in file /odb1/asm/diag/asm/+asm/+ASM3/trace/+ASM3_ora_2283.trc (incident=561): 
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [187], [603], [28], [], [], [], [], [], [], [], [] 
Incident details in: /odb1/asm/diag/asm/+asm/+ASM3/incident/incdir_561/+ASM3_ora_2283_i561.trc 
Use ADRCI or Support Workbench to package the incident. 
See Note 411.1 at My Oracle Support for error and packaging details. 
ERROR: An unrecoverable error has been identified in ASM metadata. The instance will be taken down. 
Mon Jun 02 09:39:41 2014 
NOTE: AMDU dump of disk group DG2SVC created at /odb1/asm/diag/asm/+asm/+ASM3/incident/incdir_561 
NOTE: starting check of diskgroup DG2SVC 
ERROR: file +dg2svc.603.849027701: F603 PX3 => D0 A1896 => F3 PX126: fnum mismatch 
ERROR: file +dg2svc.603.849027701: F603 PX4 => D0 A1983 => F3 PX127: fnum mismatch 
ERROR: file +dg2svc.603.849027701: F603 PX5 => D0 A2214 => F1861 PX221: fnum mismatch 
ERROR: file +dg2svc.603.849027701: F603 PX6 => D0 A2228 => F1861 PX222: fnum mismatch 
.... 
ERROR: disk DG2SVC_DISK1, AT 4: D0 A2214 => F1861 X221: extent not mapped 
ERROR: disk DG2SVC_DISK1, AT 4: D0 A2216 => F1861 X232: extent not mapped 
ERROR: disk DG2SVC_DISK1, AT 4: D0 A2228 => F1861 X222: extent not mapped 
ERROR: disk DG2SVC_DISK1, asz 0, AT 14: AT full, FS avail 
NOTE: disk DG2SVC_DISK1, used AU total mismatch: DD={52750, 0} AT={53279, 0} 
ERROR: check of diskgroup DG2SVC found 51 total errors 
ORA-15049: diskgroup "DG2SVC" contains 51 error(s) 
Mon Jun 02 09:39:44 2014 
Dumping diagnostic data in directory=[cdmp_20140602093944], requested by (instance=3, osid=2283), summary=[incident=561]. 
Errors in file /odb1/asm/diag/asm/+asm/+ASM3/trace/+ASM3_ora_2283.trc (incident=562): 
ORA-00600: internal error code, arguments: [17090], [], [], [], [], [], [], [], [], [], [], [] 
Incident details in: /odb1/asm/diag/asm/+asm/+ASM3/incident/incdir_562/+ASM3_ora_2283_i562.trc 
Dumping diagnostic data in directory=[cdmp_20140602093945], requested by (instance=3, osid=2283), summary=[incident=562]. 
Use ADRCI or Support Workbench to package the incident. 
See Note 411.1 at My Oracle Support for error and packaging details. 
Errors in file /odb1/asm/diag/asm/+asm/+ASM3/trace/+ASM3_ora_2283.trc (incident=563): 
ORA-00600: internal error code, arguments: [kfdAuDealloc2], [187], [603], [28], [], [], [], [], [], [], [], [] 
ORA-00600: internal error code, arguments: [17090], [], [], [], [], [], [], [], [], [], [], [] 
Incident details in: /odb1/asm/diag/asm/+asm/+ASM3/incident/incdir_563/+ASM3_ora_2283_i563.trc 
Use ADRCI or Support Workbench to package the incident. 
See Note 411.1 at My Oracle Support for error and packaging details. 
ERROR: An unrecoverable error has been identified in ASM metadata. The instance will be taken down. 
....
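
When triaging an alert log like the excerpt above, it can help to pull out just the distinct ORA-600 argument lists and the diskgroup named by ORA-15049. The following is a minimal sketch, not an Oracle-supplied tool: the function name `summarize_alert_log` is hypothetical, and the patterns assume only the message shapes visible in this excerpt.

```python
import re

# Patterns derived only from the message shapes shown in the excerpt above
# (an assumption of this sketch; other releases may format messages differently).
ORA600_RE = re.compile(r"ORA-00600: internal error code, arguments: (.*)")
ORA15049_RE = re.compile(r'ORA-15049: diskgroup "([^"]+)" contains (\d+) error\(s\)')
ARG_RE = re.compile(r"\[([^\]]*)\]")


def summarize_alert_log(text):
    """Return (distinct non-empty ORA-600 argument lists, {diskgroup: error count})."""
    asserts = []
    diskgroups = {}
    for line in text.splitlines():
        m = ORA600_RE.search(line)
        if m:
            # Keep only the non-empty bracketed arguments, e.g. ['kfdAuDealloc2', '187', ...]
            args = [a for a in ARG_RE.findall(m.group(1)) if a]
            if args not in asserts:
                asserts.append(args)
        m = ORA15049_RE.search(line)
        if m:
            diskgroups[m.group(1)] = int(m.group(2))
    return asserts, diskgroups
```

Passing the text of the alert log to `summarize_alert_log` yields the assert signatures in order of first appearance and the diskgroups reported as corrupted, which is the information needed to decide which diskgroup to dismount.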

 

This fatal assert brought down the ASM instances on all nodes. After the ASM instances restarted, the problem diskgroup was auto-mounted. The RBAL process then detected the corruption again during COD recovery rollback, which triggered the fatal assert and crashed the ASM instances again. This cycle repeated many times on all nodes until the customer manually dropped the problem diskgroup. 

 

 

CAUSE

The bug fix of bug 14467061 was already in place, so the corruptions were not caused by this bug or its related bugs. 

The cause of the corruption was found to be some lost writes in the storage layer.

However, the corruption was confined to one diskgroup, so the expected behavior is that only the problem diskgroup is dismounted and the ASM instances do NOT crash. 

 

SOLUTION

 

Some specific types of diskgroup corruption can trigger fatal asserts that crash the ASM instances on all nodes. This makes all ASM instances, and with them the other healthy diskgroups, unusable. 

The following bug fix prevents the ASM instances from crashing AFTER diskgroup corruption is detected and the fatal assert is hit. With this fix, only the corrupted diskgroups are forcibly dismounted, so the ASM instances stay online to service the other healthy diskgroups.

Bug 11814376 - FORCE DISMOUNT AFFECTED DISKGROUP ON METADATA CORRUPTION INSTEAD OF CRASHING

A backport patch can be requested for 11.2.0.3 and above. The bug is fixed in 12.1.0.2 and 12.2.

 

Please note that this bug fix does NOT address the cause of the diskgroup corruption. Diskgroup corruption is usually the result of lost writes in the OS/storage layer. This bug fix helps reduce the impact of some diskgroup corruptions and makes recovery more robust when this type of corruption is encountered.

 

Workaround:

Manually dismounting the problem diskgroup from SQL*Plus or ASMCMD on all nodes prevents it from being auto-mounted when the ASM instances restart. This avoids hitting the fatal assert repeatedly and stabilizes the ASM instances. 
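
The commands involved in this workaround can be sketched as follows. This is an illustration only: the helper `dismount_commands` is hypothetical, the name check is a deliberately conservative assumption rather than Oracle's full naming rule, and whether the FORCE / `-f` variant is needed for a corrupted diskgroup is not stated in this note, so it is left as an option. The helper only builds the command strings for `ALTER DISKGROUP ... DISMOUNT` (SQL*Plus, connected to the ASM instance) and `asmcmd umount`; it does not run them.

```python
import re

# Conservative diskgroup-name check (an assumption of this sketch,
# not Oracle's complete naming rule).
_NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def dismount_commands(diskgroup, force=False):
    """Build the SQL*Plus and ASMCMD commands that dismount one diskgroup.

    The returned strings are meant to be reviewed and run by hand on each node.
    """
    if not _NAME_RE.match(diskgroup):
        raise ValueError(f"unexpected diskgroup name: {diskgroup!r}")
    sql = f"ALTER DISKGROUP {diskgroup} DISMOUNT" + (" FORCE" if force else "") + ";"
    asmcmd = "asmcmd umount " + ("-f " if force else "") + diskgroup
    return {"sqlplus": sql, "asmcmd": asmcmd}


if __name__ == "__main__":
    # DG2SVC is the diskgroup named in the alert log of this note.
    for tool, command in dismount_commands("DG2SVC").items():
        print(f"{tool}: {command}")
```

Only one of the two forms is needed per node. The dismount has to be repeated on every node where the diskgroup is mounted.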

 
