[Oracle 11g r2(11.2.0.4.0)]案例分析4-由gipc 进程导致的节点无法启动

来源:互联网 发布:电路接线仿真软件 编辑:程序博客网 时间:2024/05/29 17:31

环境:RHEL 5.5 + 11.2.0.3 GI, 双节点。
问题描述: OBA 发现节点2 被重新启动, 之后无法加入到集群。
分析过程:
实际上这是两个不同的问题, 首先节点2被重新启动, 之后节点2无法加入集群。这需要从集群的告警日志开始分析。
1 . 节点1的集群alert.log

2014-08-26 17:23:09.846[cssd(l3527)]CRS-1612:Network communication with node **** (2) missing for 50% oftimeout interval. Removal of this node from cluster in 14.190 seconds2014-08-26 17:23:16.860[cssd(l3527) ]CRS-1611:Network communication with node **** (2) missing for 75% oftimeout interval. Removal of this node from cluster in 7.180 seconds2014-08-26 17:23:21.870[cssd(l3527) ]CRS-1610:Network communication with node **** (2) missing for 90% oftimeout interval. Removal of this node from cluster in 2.170 seconds2014-08-26 17:23:24.038[cssd(l3527)JCRS-1607:Node **** 工s being evicted in cluster incarnation 76357678;details at (:CSSNM00007:) in /***/***/log/*****/cssd/ocssd.log.2014-08-26 17:23:25.0202014-08-26 17:23:55.048[cssd(l3527) ]CRS-1601:CSSD Reconfiguration complete. Active nodes are ******

2 . 节点2的集群alert.log

2014-08-26 17:23:10.052[cssd(26141) ]CRS-1612:Network communication with node ****** (1) missing for 50%of timeout interval. Removal of this node from cluster in 14.180 seconds2014-08-26 17:23:17.066[cssd(26141) ]CRS-1611:Network communication with node ****** (1) missing for 75%of timeout interval. Removal of this node from cluster in 7.170 seconds2014-08-26 17:23:22.076[cssd(26141) ]CRS-1610:Network communication with node ****** (1) missing for 90%of timeout interval. Removal of this node from cluster in 2.160 seconds2014-08-26 17:23:24.243[cssd(26141) JCRS-1608:T h i s node w a s evicted by node 1, ******; details at(:CSSNMOOOOS:) in /***/***/log/****/cssd/ocssd.log.2014-08-26 17:23:24.243[cssd(26141) ]CRS-1608:T h i s n o d e w a s evicted by node 1, *****; details at(:CSSNMOOOOS:) in /***/***/log/****/cssd/ocssd.log.2014-08-26 17:23:24.243[cssd(2614l)]CRS-1656:The CSS daemon is terminating due to a fatal error; Detailsat (:CSSSC00012:) in /***/***/log/****/cssd/ocssd.log2014-08-26 17:23:24.243[cssd(2614l)]CRS-1652:Starting clean up of CRSD resources.2014-08-26 17:23:24.2432014-08-26 17:35:33.962[ohasd(l7652)]CRS-2112:The OLR service started on node ****.

从上面的程序可看出这两个节点之间出现了私网通信问题之后节点2被驱逐出了集群。
3 . 两个节点的操作系统日志{Ivar/log/messages)

Aug 26 17: 22: 54 ****** kernel: ******: ****** NIC Link is DownAug 26 17:22:55 **** kernel: bonding: bondO: link status defin工tely down forinterface ****, disabling itAug 26 17:22:55 ****** kernel bonding: bondO: now running without any activeinterfaceAug 26 17:25:00 ****** kernel: bnx2 0000:02:00.1: ***: NIC Copper Link is Up,1000 Mbps full duplexAug 26 17:25:01 ****** kernel: bonding: bondO: link status definitely up forinterface ***.Aug 26 17:25:01 ****** kernel: bonding: bondO: making interface *** the newactive one.Aug 26 17:25:01 ****** kernel: bonding: bondO: first active interface up!Aug 2 6 1 7 : 2 5 : 15 **** kernel: ******: *** NIC Link is Up 1000 Mbps FullDuplex, Flow Control: NoneAug 26 17:25:16 ****** kernel: bonding: bondO: link status definitely up forinterface ****.

根据上面的信息, 能看到节点2被重启的原因是节点1的私网出现了问题。但是, 根据
操作系统日志中的信息, 私网( bondO)看起来在几分钟之后就已经恢复, 这无法解释为什么
节点2无法重新加入集群。因此, 还是要从节点2的ocssd.log找到它无法加入集群中的原因。
4 . 节点2 的ocssd.log
看起来节点2的确是被节点1驱逐出集群的。

2014-08-26 17:23:10.052: [CSSD] [11123445824]clssnmPollingThread: node ****** (1)at 50毡heartbeat fatal, removal in 14.180 seconds2014-08-26 17:23:10.052: [CSSD] [11123445824]clssnmPollingThread: node ****** (1)iS impending reconfig, flag 2491406, misstime 158202014-08-26 17:23:22.567: [CSSD] [11123445824JclssnmPollingThread: node ****** (1)at 90% heartbeat fatal, removal in 14.180 seconds2014-08-26 17:23:22.567: [CSSD) [11123445824JclssnmPollingThread: node ****** (1)is impending reconfig, flag 2491406, misstime 270202014 08-26 17:23:24.243: [CSSD] (1087337264) (:CSSNMOOOOS:)clssnmvDiskKillCh eck:Aborting, evicted by node ****, number 1, sync 76357678, stamp 4164343482014-08-26 17:23:24.243: [CSSD) [1115920704]clssnmQueueC11entEvent: Node[1] state= 5, birth = 237506444, unique = 13424394612014-08-26 17: 23: 24. 243: [CSSD] [1087337264]#######################################2014-08-26 17:23:24.243: [CSSD] [1115920704]clssnmQueueClientEvent: Node[2) state= 3, birth = 237506448, unique = 1342447749

cssd 被重新启动

2014-08 26 17:36:33.738: [CSSD] [1804567846)clssscmain: Starting CSS daemon,version 11.2.0.3.0, in (clustered) mode w ith uniqueness value 1342448763

看起来节点2能够发现私网, 而且通过gipc已经开始联系集群的远程节点

2014-08-26 17:36:49.217: [CLSINET)[1234094272] Returning NETDATA : 1工nterfaces2014-08-26 17:36:49.217: [CLSINET] (1234094272] # 0 Interface ’bondO ’,ip= ’****', mac= ’ ****’ , mask= ’****’ , net= ’ ****’, use= ’cluster interconnect ’2014-08-26 17:36:49.217: [GIPCH GEN] (1234094272] gipchaNodeA ddinterface: adding工nterface information for inf Oxld6e58f0 { hos t ” , haName '* * * * * * * * ’,local (nil), ip ’ ****’, subnet ’*.* .* .* ’, mask ’**** ’ , mac ’****’,if name ’ bondO ’ , numRef 0, numFail 0, idxBoot 0, flags Ox1841 }2014-08-26 17: 36: 49.217: [GIPCHTHR] (1090517312] gipch aWorkerCreateinterface:c r e a t e d lo c a l in t e r f a c e f o r nod e ’****’, haName ’ ******’ ,inf ’udp: //*.*.*.*:37726 ’2014-08-26 17: 36: 49.217: [GIPCHTHR] [1090517312] gipchaWorkerCreateinterface:created local bootstrap multicast interface for node ’ **** ’, haName ’**** ’ , inf ’mcast://224.0.0.251:42424/*.*.*.*’

但是, 由这段程序看来, 私网和节点2的通信并没有成功, 因为节点2并没有通过网络心跳
发现节点l, 只是在VF 中找到了节点l 的磁盘心跳信息。

2014-08-26 17: 37: 37. 835: [CSSD] (1108764992] clssnmvDHBVa l工dateNCopy: node 1,******, has a disk HB, but no network HB, DHB has rcfg 237506450, wrtcnt,1217177, LATS 4284841470, lastSeqNo 1217167, uniqueness 1342439461, timestamp1342448976/4174504242014 08-26 17:37:38.039: [CSSD] [1103366464]clssnmvDHBVa l idateNCopy: node 1,****, has a disk HB, but no network HB, DHB has rcfg 237506450, wrtcnt,1217178, LATS 4284841690, lastSeqNo 1217176, uniqueness 1342439461, timestamp1342448977/4174512942014-08-26 17: 37 :38 .194: [CSSD] (1075505472] clssnmvDHBValidateNCopy: node 1,* ****, has a disk HB, but no network HB, DHB has rcfg 237506450, wrtcnt,1217179, LATS 4284841840, lastSeqN。1217169, uniqueness 1342439461, timestamp1342448977/417451314

由于节点2无法连接到节点1, 所以节点2会尝试接管集群, 但是节点1仍然能够访问VF,所以节点2无法接管集群, 只能再次abort掉。同样的过程会不断地重复发生。

2014-08-26 17:37:41.806: [CSSD] [1156349762]clssnmRcfgMgr Thread: Local Join2014-08-26 17:37:41.806: [C S S D] [1156349762]clssnm Local   Join Event: begin on node(2), waittime 1930002014 08-26 17:37:41.806: [CSSD] [1156349762JclssnmLocalJoinEvent: set curt1.me(4294845440) for my node2014-08-26 17:37:41.806: [CSSD) [1156349762]clssnmLocalJoinEvent: Node *****,number 1, is in an existing cluster with disk state 32014-08-26 17:37:41.806: [CSSD] [1156349762]clssnmLocalJoinEvent: takeover aborteddue to cluster merber node found on disk2014-08-26 17 :47: 42. 783: [CSSD] [1083387938] clssgmExecuteClientRequest: MAINTrecvd from proc 2 (Oxld4527f0)2014-08-26 17 :47: 42. 783: [CSSD] [1083387938) clssgmShutDown: Received abortiveshutdown request from client.2014-08-26 17: 47: 42.

根据上面的发现可推出, 问题可能还是出现在私网通信上, 而11gR2 版本的集群私网通信是首先需要通过gipc建立连接的。接下来, 读者自然会想到要去看一下gipc.log说了些什么。
5 . 节点2 的gipcd.log
看起来节点2 的gipc 没有问题, 能够找到私网网卡, 而且能够和ocssd 进行通信。

2014-08-26 17:36:53.621: [CLSINET] [1108675468) Returning NETDATA: 1 interfaces2014-08-26 17:36:53.621: [CLSI NET] (1108675468] # 0 Interface ’bondO ’ , ip= ’****’, mac= ’****’, mask= ’ ****’, net= ’ ****’, use= ’ cluster interconnect ’2014-08-26 17:36:53.621: [GIPCDMON] [1108675468] gipcdMonitorUpdate: New Interfacefound - [ 1op ****, subnet ****, mask ****, mac ********, if name bondO]2014-08-26 17:36:53.621: [GIPCDMON] [1108675468] gipcdMonitorUpdate: msg sent toclient thread (([update(ip: *.*.*.*, mask: *.*.*.*, subnet : ****, mac:********, ifname: bondO), tate(gipcdadapterstateUp)]))2014-08-26 17 :36: 53. 621: [GIPCDNDE] [1108675468] g i p c dEnqueue MsgFor Client:Enque using a msg for client : req: [hostnam e(), id (0000000000000000,0000000000000000), len(632), req cookie(O), type(gipcdReqTypeMonitorUpdate)]2014-08-26 17:36:53.621: [GIPCDMON] [1108675468] gipcdMonitorSaveinfMetrics: inf[0] bondO - rank 99, avgms 30000000000.000000 [ 0 IO IO]2014-08-26 17:38:04.600: [GIPCDMON] (1108675468] gipcdMonitorCssinit: Successfullyconnected to CSS2014-08-26 17:38:08.593: [ CLSINET] (1108675468] Returning NETDATA: 1 interfaces2014-08-26 17:38:08.593: [ CLSINET] (1108675468] # 0 Interface ’bondO’,ip= ’****’,mac=’****’,mask=’****’,net=’****’ ,u se =’cluster interconnect’2014-08-26 17: 38 :23. 632: [GIPCDMON] (1108675468] gipcdMonitorSaveinfMetrics: inf [0]  bondO - rank 99, avgrns 30000000000.000000 [ 15 /0/0]

6 . 节点1的gipcd.log
基于之前的节点l 的操作系统日志, 看起来gipc 也发现了私网存在问题, 并将对应的网卡标记成了down 的状态。

2014-08-26 17:22:56.233: [ CLSINET] (1126453756] Returning NETDATA: 0 interfaces2014-08-26 17:22:56.233: [GIPCDMONJ (1126453756] gipcdMonitorUpdate: interfacewent down - [ ip ****, subnet *. *. *. *, mask *. *. *. *, mac ****, if name [bondO] 2014-08-26 17:22:56.233: [GIPCDMON][1126453756] ipcdMonitorUpdate: msg sent to client thread (([update(ip: *.*.*.*, mask: * *.*.*, subnet: *.*.* *, mac:****,lf name: [bondO], state(gipcdadapterstateDown)]))

私网网卡恢复正常。

2014-08-26 1 7: 23: 43. 204: [GIPCHALO] [ 1058687 4 00 J gipchaLowerProcessNode: no validinterfaces found to node for 47210 ms, node Ox2aaaac2a3e60 { host '****’,ha Name ’9 1.pcd ha name ’ , srcLuid ****, dstLuid **** numinf 0,contigSeq 23, lastAck 24, lastValidAck 23, sendSeq (25 : 28], createTime416251644, sentRegister 1, localMonitor 1, flags Ox2008 )2014-08-26 17:23:46.248: [ CLSINET] (1126453756] Returning NETDATA: 0 interfaces2014-08-26 17: 25: 31. 279: [ CLSINET] [ 1126453756] Returning NETDATA: 1 interfaces2014-08-26 17:25:01.279: [ CLSINET] (1126453756] # 0 Interface ’bondO’,ip=’****’ ,mac=’****’,mask=’****’ ,net=’****’,use =’cluster interconnect’2014-08-26 17:25:31.279: [GIPCDMON] [1126453756] gipcdMonitorUpdate: New Interfacefound - [ 1op ****, subnet ****, mask *. * . * . *, mac ****if name [bondO]

看起来gipc的确开始试着向集群发布自己的信息。

2014-08-26 17 :25: 32 .262: [GIPCHTHR] (1058687400] gipchaWorkerCreateinterface:created local interface for node ’****’, haName ’gipcd ha name’, inf ’udp://*.*.*.*:32823'2014-08-26 17 :25: 32 .262: [GIPCHTHR] [1058687400] gipchaWorkerCreateinterface:created local bootstrap multicast interface for node ’****’, ha Name ’gipcd haname ’ , inf ’ mcast://224.0.0.251:42424/****’2014-08-26 17:25:32.263: [GIPCHTHR] [1058687400] gipchaWorkerCreateinterface:created local bootstrap multicast interface for node ’****’, haName ’gipcdha name ’, inf ’mcast://230.0.1.0:42424/****’2014-08-26 17 :25: 32. 263: [GIPCHTHR) [1058687400] gipchaWorkerCreateinterface:created local bootstrap broadcast interface for node ’****', ha Name ’gipcd_ha_name ’, inf ’udp://****:42424’2014-08-26 17:25:32.263: [GIPCHGEN] [1058687400] gipchaNodeAddinterface: addinginterface information for inf **** { host ·***’, ha Name ’gipcd_ha_name’, local (nil), ip '****':32823’, subnet ’****’, mask '****', mac '****', lfname '', numRef 0, numFail 0, idxBoot 0, flags Oxl842 )2014-08-26 17:25:32.263: [GIPCHTHR] (1058687400] gipchaWorkerCreateinterfacecreated remote bootstrap multicast interface for node ’****’, haName ’gipcdha name’, inf ’mcast://224.0.0.251:42424/*.*.*.*:32823'2014-08-26 17:25:32.263: [GIPCHTHR] [1058687400] gipchaWorkerCre ateinterface:created remote bootstrap multicast interface for node '****’, haName ’gipcd_ha name’, inf ’mcast://230.0.1.0:42424/*.*.*.*:32823’2014-08-26 17:25:32.263: [GIPCHTHR] [1058687400] gipchaWorkerCreateinterface:created remote bootstrap broadcast interface for node ’****’, ha Name ’9 ipcdha_name ’, in f ’udp://*.*.*.*:32823’2014-08-26 17:25:32.263: [GIPCHGEN] [1058687400] gipchaWorke r Attachinterface:Interface attached inf ********* { host ’****’, haName ’gipcd ha name’, localOx2aaaac220350, ip '****' :32823’, subnet ’****’, mask ’****’, mac '' if name'' , numRef 0, numFail 0, idxBoot 0, flags Oxl846}

上面的信息说明gipc检查了私网的状态之后, 认为私网存在问题, 因为私网的rank值为
0或-1。

2014-08-26 17:25:36.284: [ CLSINET] (1126453756] Returning NETDATA: 1 interfaces2014-08-26 17:25:36.284: [ CLSINET] (1126453756] # 0 Interface ’bondO’,ip=’****’,mac=’****’,mask=’****’,net=’****’,use=’cluster interconnect’2014-08-26 1 7: 25: 36. 2 97: [GIPCDMONJ [ 11264537 56 J gipcdMonitorSaveinfMetrics: inf [0] bondO - rank 0, avgms 30000000000. 000000 [ 4 I O I O )《《《2014-08-26 17:25:36.363: [GIPCDMON) (1126453756) gipcdMonitorSaveinfMetrics: inf[0] bondO rank O, avgms 30000000000. 000000 [ 19 I O I O )2014-08-26 17:25:36.463: [ CLSINET) [1126453756] Returning NETDATA: 1 interfaces2014-08-26 17:25:36.463: [ CLSINET) [1126453756] # 0 Interface ’bondO ’ ,ip=’****”,mac=’****’,mask=’****’ ,net= ’****’,use=’c luster interconnect’2014-08-26 17:25:36.678: [GIPCDMON) [1126453756) gipcdMonitorSaveinfMetrics: inf[0] bondO - rank -1, avgms 30000000000.000000 [ 0 /0 /0 ]

根据以上的信息, 问题出现在节点l的gipcd在私网恢复之后看起来没有正确地检查私网
的健康性, 并将私网的rank值标记成了。或-1。由于gipcd进程是由ohasd对应的代理进程
所管理的, 即使这个进程被终止, 代理进程也会启动一个新的gipcd守护进程。
结果:将节点l的gipcd进程使用操作系统命令kill掉之后, 新的gipcd进程便产生。重
新启动节点2, 节点2可以加入集群, 问题解决。

阅读全文
0 0