HAIP Drops Route and RAC Database/ASM Instances Evicts (Doc ID 1554551.1)



2013-9-5 PROBLEM

In this Document
 Symptoms Cause Solution References


Applies to:

Oracle Database - Enterprise Edition - Version 11.2.0.2 and later
Information in this document applies to any platform.

Symptoms


On 11gR2 RAC with multiple private network adapters configured for HAIP, ASM and database instances are evicted after the following is logged in the GI alert.log:

 

  • <GI_HOME>/log/<node>/alert<node>.log
2013-03-18 12:44:45.973
[/oracle/app/11.2.0/grid/bin/orarootagent.bin(3134)]CRS-5018:(:CLSN00037:) Removed unused HAIP route: 169.254.0.0 / 255.255.128.0 / 169.254.27.7 / e1000g0:1

 

  • <GI_HOME>/log/<node>/agent/ohasd/orarootagent_root/orarootagent_root.log
2013-03-18 12:44:45.911: [ GIPCNET][20]gipcmodNetworkAttrAddrOsd: failed to update name 'subnet', addr 2436b10 [0000000001311373] { gipcAddress : name '(invalid)', objFlags 0x0, addrFlags 0x3 }
2013-03-18 12:44:45.911: [ GIPCNET][20] gipcmodNetworkAttrAddrOsd: slos op : sgipcnRawAttributeFunc
2013-03-18 12:44:45.911: [ GIPCNET][20] gipcmodNetworkAttrAddrOsd: slos dep : Error 0 (0)
2013-03-18 12:44:45.911: [ GIPCNET][20] gipcmodNetworkAttrAddrOsd: slos loc : sgipcnRawFin
2013-03-18 12:44:45.911: [ GIPCNET][20] gipcmodNetworkAttrAddrOsd: slos info: empty ip address
2013-03-18 12:44:45.911: [GIPCXCPT][20] gipcInternalGetAttribute: failed during gipcInternalGetAttribute, ret gipcretFail (1)
2013-03-18 12:44:45.911: [GIPCXCPT][20] gipcGetAttributeStringF [clsinet_IntGetIpInformation : clsinet.c : 4932]: EXCEPTION[ ret gipcretFail (1) ] failure for obj 0000000001311373, name 'subnet', val fffffd7ff9b6d9a0, len 1024, flags 0x4000
2013-03-18 12:44:45.911: [ CLSINET][20] (:CLSINE0014:)failed to get IP information for mac '00-c0-dd-18-00-44', ip '169.254.27.7', ret 1
2013-03-18 12:44:45.972: [ USRTHRD][20] {0:0:2} (:CLSN00037:) Removed unused HAIP route 169.254.0.0 / 255.255.128.0 / 169.254.27.7 / e1000g0:1
....
2013-03-18 12:45:21.278: [ GIPCNET][20] gipcmodNetworkAttrAddrOsd: failed to update name 'ifName', addr 224e6d0 [0000000001311a67] { gipcAddress : name '(invalid)', objFlags 0x0, addrFlags 0x3 }
2013-03-18 12:45:21.278: [ GIPCNET][20] gipcmodNetworkAttrAddrOsd: slos op : sgipcnRawAttributeFunc
2013-03-18 12:45:21.278: [ GIPCNET][20] gipcmodNetworkAttrAddrOsd: slos dep : Error 0 (0)
2013-03-18 12:45:21.278: [ GIPCNET][20] gipcmodNetworkAttrAddrOsd: slos loc : sgipcnRawFin
2013-03-18 12:45:21.280: [ GIPCNET][20] gipcmodNetworkAttrAddrOsd: slos info: empty ip address
2013-03-18 12:45:21.280: [GIPCXCPT][20] gipcInternalGetAttribute: failed during gipcInternalGetAttribute, ret gipcretFail (1)
2013-03-18 12:45:21.280: [GIPCXCPT][20] gipcGetAttributeStringF [clsinet_IntGetIpInformation : clsinet.c : 4934]: EXCEPTION[ ret gipcretFail (1) ] failure for obj 0000000001311a67, name 'ifName', val fffffd7ff9b6dc50, len 1024, flags 0x4000
2013-03-18 12:45:21.280: [ CLSINET][20] (:CLSINE0014:)failed to get IP information for mac '00-21-28-44-da-c2', ip '10.11.118.240', ret 1
2013-03-18 12:45:22.951: [ora.crsd][13] {0:0:2} [check] clsdmc_respget return: status=0, ecode=10201

 

  • alert_+ASM1.log
Mon Mar 18 12:51:51 2013
IPC Send timeout detected. Receiver ospid 28471 [oracle@sasmwrtrsdp03 (LMD0)]
Mon Mar 18 12:51:51 2013
Errors in file /oracle/app/oracle/diag/asm/+asm/+ASM1/trace/+ASM1_lmd0_28471.trc:
Mon Mar 18 12:53:35 2013
Detected an inconsistent instance membership by instance 2
Evicting instance 2 from cluster
Waiting for instances to leave: 2

Mon Mar 18 12:53:36 2013
Dumping diagnostic data in directory=[cdmp_20130318125336], requested by (instance=2, osid=3690 (LMD0)), summary=[abnormal instance termination].
Mon Mar 18 12:53:55 2013
Remote instance kill is issued with system inc 6
Remote instance kill map (size 1) : 2
LMON received an instance eviction notification from instance 1
The instance eviction reason is 0x20000000
The instance eviction map is 2
Mon Mar 18 12:54:05 2013
NOTE: client rtrprd1:rtrprd registered, osid 1472, mbr 0x2
Waiting for instances to leave: 2
Mon Mar 18 12:54:36 2013
Waiting for instances to leave: 2
Mon Mar 18 12:55:05 2013
Waiting for instances to leave: 2
Mon Mar 18 12:55:17 2013
Reconfiguration started (old inc 4, new inc 8)

  

  • alert_+ASM2.log
Mon Mar 18 12:51:50 2013
IPC Send timeout detected. Sender: ospid 3690 [oracle@sasmwrtrsdp01 (LMD0)]
Receiver: inst 1 binc 56990 ospid 28471
IPC Send timeout to 1.0 inc 4 for msg type 65521 from opid 10
Mon Mar 18 12:51:52 2013
Communications reconfiguration: instance_number 1
Mon Mar 18 12:53:35 2013
Detected an inconsistent instance membership by instance 2
Mon Mar 18 12:53:35 2013
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.
Mon Mar 18 12:53:35 2013
Received an instance abort message from instance 1
Please check instance 1 alert and LMON trace files for detail.

LMD0 (ospid: 3690): terminating the instance due to error 481
Mon Mar 18 12:53:36 2013
System state dump requested by (instance=2, osid=3690 (LMD0)), summary=[abnormal instance termination].
System State dumped to trace file /oracle/app/oracle/diag/asm/+asm/+ASM2/trace/+ASM2_diag_3682.trc
Mon Mar 18 12:53:55 2013
ORA-1092 : opitsk aborting process
Mon Mar 18 12:53:55 2013
Termination issued to instance processes. Waiting for the processes to exit
Mon Mar 18 12:53:59 2013
License high water mark = 40
Instance terminated by LMD0, pid = 3690
USER (ospid: 28923): terminating the instance
Instance terminated by USER, pid = 28923

 

  • The RAC database alert.log has messages similar to those in the ASM alert logs

 

  • GI alert.log may also have the following:
2013-03-18 12:50:39.441
[cssd(27849)]CRS-1662:Member kill requested by node sasmwrtrsdp01 for member number 0, group DBRTRPRD
2013-03-18 12:50:50.444
[/oracle/app/11.2.0/grid/bin/oraagent.bin(29197)]CRS-5011:Check of resource "rtrprd" failed: details at "(:CLSN00007:)" in "/oracle/app/11.2.0/grid/log/sasmwrtrsdp03/agent/crsd/oraagent_oracle/oraagent_oracle.log"
2013-03-18 12:50:50.449
[crsd(28776)]CRS-2765:Resource 'ora.rtrprd.db' has failed on server 'sasmwrtrsdp03'.
2013-03-18 12:53:37.766
[cssd(27849)]CRS-1663:Member kill issued by PID 28465 for 1 members, group DB+ASM. Details at (:CSSGM00044:) in /oracle/app/11.2.0/grid/log/sasmwrtrsdp03/cssd/ocssd.log.
2013-03-18 12:54:00.649
[crsd(28776)]CRS-2765:Resource 'ora.asm' has failed on server 'sasmwrtrsdp01'.
2013-03-18 12:54:10.011
[crsd(28776)]CRS-2765:Resource 'ora.ACFS_DG.dg' has failed on server 'sasmwrtrsdp01'.
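
The CRS-5018 entries in the GI alert.log excerpts above follow a regular shape: a timestamp line, then a message line naming the subnet, netmask, HAIP address and interface. As a rough aid for checking how often a route is being dropped, the following minimal Python sketch extracts those entries. The function name `find_dropped_routes` and the sample list are illustrative assumptions, and the patterns are derived only from the excerpts in this note, so other Grid Infrastructure versions may format the message differently.

```python
import re
from datetime import datetime

# Timestamp line that precedes each GI alert.log entry in the excerpts above.
TS_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\s*$")

# CRS-5018 message text as it appears in the excerpts above (assumed stable).
ROUTE_RE = re.compile(
    r"CRS-5018:.*Removed unused HAIP route:\s*"
    r"(?P<subnet>\S+) / (?P<netmask>\S+) / (?P<haip>\S+) / (?P<iface>\S+)"
)


def find_dropped_routes(lines):
    """Yield (timestamp, subnet, netmask, haip, iface) per CRS-5018 entry."""
    last_ts = None
    for line in lines:
        line = line.rstrip("\n")
        ts_match = TS_RE.match(line)
        if ts_match:
            # Remember the most recent timestamp line for the next message.
            last_ts = datetime.strptime(ts_match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            continue
        route_match = ROUTE_RE.search(line)
        if route_match:
            yield (
                last_ts,
                route_match["subnet"],
                route_match["netmask"],
                route_match["haip"],
                route_match["iface"],
            )


# Sample lines copied from the excerpts in this note.
SAMPLE = [
    "2013-03-18 12:44:45.973",
    "[/oracle/app/11.2.0/grid/bin/orarootagent.bin(3134)]CRS-5018:(:CLSN00037:) "
    "Removed unused HAIP route: 169.254.0.0 / 255.255.128.0 / 169.254.27.7 / e1000g0:1",
    "2013-03-18 12:50:39.441",
    "[cssd(27849)]CRS-1662:Member kill requested by node sasmwrtrsdp01 "
    "for member number 0, group DBRTRPRD",
]

for row in find_dropped_routes(SAMPLE):
    print(row)
```

In practice the lines would come from reading <GI_HOME>/log/<node>/alert<node>.log; the sketch only reports matches and does not judge whether a given route removal was harmful.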

 

 

 

Cause

The HAIP route is dropped because of a transient private network error that lasts only a very short time.

 

Solution

Apply patch 16876500 to tolerate transient private network errors.

Until the patch is applied, the workaround is to unplumb all but one private network. Note that running on a single private network does NOT provide fault tolerance.
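
Before and after applying the workaround, it can help to confirm how many private interfaces the cluster has registered. The sketch below is a hedged illustration only: it assumes the text comes from `oifcfg getif` and that each line has four whitespace-separated fields (interface name, subnet, scope, type). The interface names and subnets in the sample are hypothetical, and `private_interfaces` is a helper name invented for this example.

```python
def private_interfaces(oifcfg_output):
    """Return interface names whose type includes cluster_interconnect.

    Assumes each line looks like: <name> <subnet> <scope> <type>
    """
    names = []
    for line in oifcfg_output.splitlines():
        fields = line.split()
        # Skip blank or unexpected lines rather than guessing their meaning.
        if len(fields) < 4:
            continue
        if "cluster_interconnect" in fields[3].split(","):
            names.append(fields[0])
    return names


# Hypothetical sample output; names and subnets are placeholders.
SAMPLE_OUTPUT = """\
nic0  10.0.0.0  global  public
nic1  192.168.10.0  global  cluster_interconnect
nic2  192.168.11.0  global  cluster_interconnect
"""

found = private_interfaces(SAMPLE_OUTPUT)
print(found)
if len(found) > 1:
    print("More than one private network is registered; HAIP is in use across them.")
```

The sketch only reads text and makes no configuration change; removing an interface from the cluster configuration should follow the platform's own procedure.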

 

References

BUG:16876500 - GI HAIP AGENT DROPS A ROUTE FREQUENTLY AND THAT LEADS TO THE INSTANCE EVICTION
BUG:16896235 - HAIP FAILS AND BRINGS DOWN DB AND ASM INSTANCES.
BUG:16610080 - OUTAGE AFTER "GIPCMODNETWORKATTRADDROSD: FAILED TO UPDATE NAME 'SUBNET'"
BUG:16985519 - HAIP FAILURE CAUSING INSTANCE EVICTIONS