What is split-brain? [Notes on an abnormal RAC node crash]

Foreword:

This is a record of the analysis of a node crash, mainly working through the CSSD logs to understand how split-brain resolution plays out.
==================================================================================================
Last week I set up a two-node 11.2.0.3 RAC in the test environment for the developers.

Came in today and took a look...

[wh-xn01db1@root]# crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora.DATA.dg    ora....up.type ONLINE    ONLINE    wh-xn01db1
ora....ER.lsnr ora....er.type ONLINE    ONLINE    wh-xn01db1
ora....N1.lsnr ora....er.type ONLINE    ONLINE    wh-xn01db1
ora....N2.lsnr ora....er.type ONLINE    ONLINE    wh-xn01db1
ora....N3.lsnr ora....er.type ONLINE    ONLINE    wh-xn01db1
ora.asm        ora.asm.type   ONLINE    ONLINE    wh-xn01db1
ora.cvu        ora.cvu.type   ONLINE    ONLINE    wh-xn01db1
ora.gsd        ora.gsd.type   OFFLINE   OFFLINE
ora....network ora....rk.type ONLINE    ONLINE    wh-xn01db1
ora.oc4j       ora.oc4j.type  ONLINE    ONLINE    wh-xn01db1
ora.ons        ora.ons.type   ONLINE    ONLINE    wh-xn01db1
ora....ry.acfs ora....fs.type ONLINE    ONLINE    wh-xn01db1
ora.scan1.vip  ora....ip.type ONLINE    ONLINE    wh-xn01db1
ora.scan2.vip  ora....ip.type ONLINE    ONLINE    wh-xn01db1
ora.scan3.vip  ora....ip.type ONLINE    ONLINE    wh-xn01db1
ora....SM1.asm application    ONLINE    ONLINE    wh-xn01db1
ora....B1.lsnr application    ONLINE    ONLINE    wh-xn01db1
ora....db1.gsd application    OFFLINE   OFFLINE
ora....db1.ons application    ONLINE    ONLINE    wh-xn01db1
ora....db1.vip ora....t1.type ONLINE    ONLINE    wh-xn01db1
ora....db2.vip ora....t1.type ONLINE    ONLINE    wh-xn01db1
ora.xn01db.db  ora....se.type ONLINE    ONLINE    wh-xn01db1

Everything has failed over to node one, and node two is stone dead...

Could it be that I simply forgot to start it after finishing the build last week? Let me try bringing it up.

[wh-xn01db2@oracle]$ sql
SQL*Plus: Release 11.2.0.3.0 Production on Mon Jan 12 10:23:10 2015
Copyright (c) 1982, 2011, Oracle.  All rights reserved.
Connected to an idle instance.
xn01db2@SYS> startup
ORA-01078: failure in processing system parameters
ORA-01565: error in identifying file '+DATA/xn01db/spfilexn01db.ora'
ORA-17503: ksfdopn:2 Failed to open file +DATA/xn01db/spfilexn01db.ora
ORA-15077: could not locate ASM instance serving a required diskgroup
xn01db2@SYS>

[Hmm... CRS isn't even up, no wonder the database won't start...]

So I tried to start CRS as well, but it just sat there with no response for ages.
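(As a side note, a quick way to see how far the stack has actually come up on a node is the following; a minimal sketch, run as root or the grid owner from the Grid Infrastructure home:)

[wh-xn01db2@root]# crsctl check crs              # is the full stack (CRS, CSS, EVM) reporting healthy?
[wh-xn01db2@root]# crsctl stat res -t -init      # state of the lower-stack resources, including ora.cssd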

Looks like something is genuinely wrong...

errpt, however, did show some activity:
[wh-xn01db2@public]$ errpt
IDENTIFIER TIMESTAMP  T C RESOURCE_NAME  DESCRIPTION
A924A5FC   0109181415 P S SYSPROC        SOFTWARE PROGRAM ABNORMALLY TERMINATED
F7FA22C9   0108150815 I O SYSJ2          UNABLE TO ALLOCATE SPACE IN FILE SYSTEM
F7FA22C9   0108150515 I O SYSJ2          UNABLE TO ALLOCATE SPACE IN FILE SYSTEM
[wh-xn01db2@public]$ errpt -aj A924A5FC|more
---------------------------------------------------------------------------
LABEL:          CORE_DUMP
IDENTIFIER:     A924A5FC

Date/Time:       Fri Jan  9 18:14:58 CST 2015
Sequence Number: 40
Machine Id:      00C5B5F44C00
Node Id:         wh-xn01db2
Class:           S
Type:            PERM
WPAR:            Global
Resource Name:   SYSPROC

Description
SOFTWARE PROGRAM ABNORMALLY TERMINATED

Probable Causes
SOFTWARE PROGRAM

User Causes
USER GENERATED SIGNAL

        Recommended Actions
        CORRECT THEN RETRY

Failure Causes
SOFTWARE PROGRAM

        Recommended Actions
        RERUN THE APPLICATION PROGRAM
        IF PROBLEM PERSISTS THEN DO THE FOLLOWING
        CONTACT APPROPRIATE SERVICE REPRESENTATIVE

Detail Data
SIGNAL NUMBER
           6
USER'S PROCESS ID:
               8323306
FILE SYSTEM SERIAL NUMBER
          14
INODE NUMBER
           0       12354
CORE FILE NAME
/soft/product/11.2.0.3/gih/log/wh-xn01db2/cssd/core
PROGRAM NAME
ocssd.bin

ocssd.bin had crashed!!

The ocssd process provides CSS (Cluster Synchronization Services), which monitors and protects the cluster through heartbeats.
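(For context, the heartbeat thresholds CSS enforces can be queried with crsctl; a minimal sketch, where the values in the comments are just the usual 11.2 defaults rather than something read from this cluster:)

[wh-xn01db2@root]# crsctl get css misscount      # network heartbeat timeout (default 30 s in 11.2)
[wh-xn01db2@root]# crsctl get css disktimeout    # voting disk I/O timeout (default 200 s in 11.2)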

So I tried to ping the peer node's heartbeat (interconnect) IP, and sure enough, the heartbeat network was down!
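(Roughly, the check is the following; a sketch only, and the -priv hostname is illustrative, use whichever name or address is actually registered for the interconnect:)

[wh-xn01db1@root]# oifcfg getif                  # shows which interface/subnet is registered as cluster_interconnect
[wh-xn01db1@root]# ping wh-xn01db2-priv          # ping the peer's interconnect address over that network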

Why the heartbeat network went down is a matter for the network and system administrators. What I became curious about was: how does a heartbeat failure lead to a node going down, and why was it node two that went down?

After carefully comparing the CSSD logs of the two nodes and looking up what split-brain actually means, I finally understood how this crash unfolded. Notes below.

First, what is split-brain?

In a cluster, the nodes learn about each other's health through some mechanism (the heartbeat) so that they can work in coordination. Suppose only the heartbeat fails while every node keeps running: each node then believes the other nodes are down, that it is the sole survivor, and that it should take control of the whole cluster. In a cluster the storage is shared, so this means a data disaster. This situation is called "split-brain".

The usual solution is a voting algorithm (quorum algorithm). Its mechanism is roughly this: the nodes rely on the heartbeat to report their health to one another, and each report received counts as one vote. In a three-node cluster, each node normally sees 3 votes. If node A's heartbeat fails but A itself keeps running, the cluster splits into two partitions: A on one side, the remaining two nodes on the other. One partition must be evicted to keep the cluster healthy. After A's heartbeat fails, B and C form one partition with 2 votes while A has only 1 vote, so under the voting algorithm the partition formed by B and C keeps control and A is evicted.

With only 2 nodes the voting algorithm breaks down, because each node has just 1 vote. A third party is therefore introduced: the quorum device, usually a shared disk, also called the quorum disk, which itself counts as one vote. When the heartbeat between the two nodes fails, both nodes race to claim the quorum disk's vote, and the earliest request is satisfied first. The node that reaches the quorum disk first therefore holds 2 votes, and the other node is evicted.
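(In Oracle Clusterware the quorum-disk role described above is played by the voting disks; a minimal sketch of how to list them on a running node:)

[wh-xn01db1@root]# crsctl query css votedisk     # lists the voting disk(s) that CSS arbitrates through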

Obviously... the explanation above is adapted from: http://blog.csdn.net/zhangout/article/details/7206313

There is also a very detailed article on how split-brain handling works: http://czmmiao.iteye.com/blog/1577531

Here I will only walk through the sequence of events recorded in my own logs.

First, the CSSD log on node two:
2015-01-09 18:14:38.106: [    CSSD][3605]clssnmPollingThread: node wh-xn01db1 (1) at 50% heartbeat fatal, removal in 14.264 seconds   =====> node one is missing network heartbeats and will be evicted in 14 seconds
2015-01-09 18:14:38.106: [    CSSD][3605]clssnmPollingThread: node wh-xn01db1 (1) is impending reconfig, flag 2491406, misstime 15736
2015-01-09 18:14:38.106: [    CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2015-01-09 18:14:38.107: [    CSSD][2577]clssnmvDHBValidateNcopy: node 1, wh-xn01db1, has a disk HB, but no network HB, DHB has rcfg 316174449, wrtcnt, 114013, LATS 3029068518, lastSeqNo 114000, uniqueness 1420702946, timestamp 1420798476/3029041859
···
2015-01-09 18:14:45.116: [    CSSD][3605]clssnmPollingThread: node wh-xn01db1 (1) at 75% heartbeat fatal, removal in 7.254 seconds   =====> node one is still missing heartbeats; eviction in 7 seconds
2015-01-09 18:14:45.118: [    CSSD][2577]clssnmvDHBValidateNcopy: node 1, wh-xn01db1, has a disk HB, but no network HB, DHB has rcfg 316174449, wrtcnt, 114027, LATS 3029075530, lastSeqNo 114026, uniqueness 1420702946, timestamp 1420798484/3029049389
2015-01-09 18:14:45.182: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-09 18:14:45.182: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
···
2015-01-09 18:14:50.123: [    CSSD][3605]clssnmPollingThread: node wh-xn01db1 (1) at 90% heartbeat fatal, removal in 2.247 seconds,   =====> node one is still missing heartbeats; eviction in 2 seconds
2015-01-09 18:14:50.124: [    CSSD][2577]clssnmvDHBValidateNcopy: node 1, wh-xn01db1, has a disk HB, but no network HB, DHB has rcfg 316174449, wrtcnt, 114037, LATS 3029080535, lastSeqNo 114036, uniqueness 1420702946, timestamp 1420798489/3029054399
···
2015-01-09 18:14:52.371: [    CSSD][3605]clssnmPollingThread: Removal started for node wh-xn01db1 (1), flags 0x26040e, state 3, wt4c 0
2015-01-09 18:14:52.371: [    CSSD][3605]clssnmMarkNodeForRemoval: node 1, wh-xn01db1 marked for removal   =====> node two believes node one is dead and removes it from its own view of the cluster
2015-01-09 18:14:52.371: [    CSSD][3605]clssnmDiscHelper: wh-xn01db1, node(1) connection failed, endp (692), probe(0), ninf->endp 692
2015-01-09 18:14:52.371: [    CSSD][3605]clssnmDiscHelper: node 1 clean up, endp (692), init state 5, cur state 5
2015-01-09 18:14:52.371: [GIPCXCPT][3605] gipcInternalDissociate: obj 111255090 [0000000000000692] { gipcEndpoint : localAddr 'gipcha://wh-xn01db2:56be-8234-89af-30b', remoteAdd
···
2015-01-09 18:14:52.372: [    CSSD][4119]clssnmSetupAckWait: node(2) is ALIVE
2015-01-09 18:14:52.372: [    CSSD][4119]clssnmSendSync: syncSeqNo(316174449), indicating EXADATA fence initialization complete
2015-01-09 18:14:52.372: [    CSSD][4119]List of nodes that have ACKed my sync: NULL
2015-01-09 18:14:52.372: [    CSSD][4119]clssnmSendSync: syncSeqNo(316174449)
2015-01-09 18:14:52.372: [    CSSD][4119]clssnmWaitForAcks: Ack message type(11), ackCount(1)
2015-01-09 18:14:52.372: [    CSSD][4376]clssnmHandleSync: Node wh-xn01db2, number 2, is EXADATA fence capable
2015-01-09 18:14:52.372: [    CSSD][4376]clssscUpdateEventValue: NMReconfigInProgress  val 2, changes 5
2015-01-09 18:14:52.372: [    CSSD][4376]clssnmHandleSync: local disk timeout set to 27000 ms, remote disk timeout set to 27000
2015-01-09 18:14:52.372: [    CSSD][4376]clssnmHandleSync: initleader 2 newleader 2
2015-01-09 18:14:52.372: [    CSSD][4376]clssnmQueueClientEvent:  Sending Event(2), type 2, incarn 316174448
2015-01-09 18:14:52.372: [    CSSD][4376]clssnmQueueClientEvent: Node[1] state = 5, birth = 316174446, unique = 1420702946
2015-01-09 18:14:52.372: [    CSSD][4376]clssnmQueueClientEvent: Node[2] state = 3, birth = 316174448, unique = 1420704616
2015-01-09 18:14:52.372: [    CSSD][4376]clssnmHandleSync: Acknowledging sync: src[2] srcName[wh-xn01db2] seq[1] sync[316174449]
2015-01-09 18:14:52.372: [    CSSD][4376]clssnmSendAck: node 2, wh-xn01db2, syncSeqNo(316174449) type(11)
2015-01-09 18:14:52.372: [    CSSD][4376]clssnmHandleAck: Received ack type 11 from node wh-xn01db2, number 2, with seq 0 for sync 316174449, waiting for 0 acks
2015-01-09 18:14:52.372: [    CSSD][1]clssgmStartNMMon: node 1 active, birth 316174446
2015-01-09 18:14:52.372: [    CSSD][1]clssgmStartNMMon: node 2 active, birth 316174448   =====> in reality both nodes are still alive and have formed two independent sub-clusters; this is the split-brain risk, so arbitration is required
2015-01-09 18:14:52.372: [    CSSD][1]NMEVENT_SUSPEND [00][00][00][06]
2015-01-09 18:14:52.373: [    CSSD][1]clssgmSuspendAllGrocks: done
2015-01-09 18:14:52.373: [    CSSD][1]clssgmCompareSwapEventValue: changed CmInfo State  val 2, from 5, changes 13
2015-01-09 18:14:52.373: [    CSSD][1]clssgmUpdateEventValue: ConnectedNodes  val 316174448, changes 5
2015-01-09 18:14:52.373: [    CSSD][1]clssgmCleanupNodeContexts():  cleaning up nodes, rcfg(316174448)
2015-01-09 18:14:52.373: [    CSSD][1]clssgmCleanupNodeContexts():  successful cleanup of nodes rcfg(316174448)
2015-01-09 18:14:52.373: [    CSSD][1]clssgmStartNMMon:  completed node cleanup
2015-01-09 18:14:52.374: [    CSSD][2577]clssnmvDHBValidateNcopy: node 1, wh-xn01db1, has a disk HB, but no network HB, DHB has rcfg 316174449, wrtcnt, 114042, LATS 3029082785, lastSeqNo 114041, uniqueness 1420702946, timestamp 1420798491/3029056904
2015-01-09 18:14:52.374: [    CSSD][4119]clssnmSendSync: syncSeqNo(316174449), indicating EXADATA fence initialization complete
2015-01-09 18:14:52.374: [    CSSD][4119]List of nodes that have ACKed my sync: 2
2015-01-09 18:14:52.374: [    CSSD][4119]clssnmWaitForAcks: done, syncseq(316174449), msg type(11)
2015-01-09 18:14:52.374: [    CSSD][4119]clssnmSetMinMaxVersion:node2  product/protocol (11.2/1.4)
2015-01-09 18:14:52.374: [    CSSD][4119]clssnmSetMinMaxVersion: properties common to all nodes: 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
2015-01-09 18:14:52.374: [    CSSD][4119]clssnmSetMinMaxVersion: min product/protocol (11.2/1.4)
2015-01-09 18:14:52.374: [    CSSD][4119]clssnmSetMinMaxVersion: max product/protocol (11.2/1.4)
2015-01-09 18:14:52.374: [    CSSD][4119]clssnmNeedConfReq: No configuration to change
2015-01-09 18:14:52.374: [    CSSD][4119]clssnmDoSyncUpdate: Terminating node 1, wh-xn01db1, misstime(30001) state(5)
2015-01-09 18:14:52.374: [    CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
2015-01-09 18:14:52.374: [    CSSD][4119]clssnmCheckDskInfo: Checking disk info...   ==========> voting is complete; now check the result on the voting disk
2015-01-09 18:14:52.374: [    CSSD][4119]clssnmCheckSplit: Node 1, wh-xn01db1, is alive, DHB (1420798491, 3029056904) more than disk timeout of 27000 after the last NHB (1420798462, 3029027063)
2015-01-09 18:14:52.374: [    CSSD][4119]clssnmCheckDskInfo: My cohort: 2
2015-01-09 18:14:52.374: [    CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1
2015-01-09 18:14:52.374: [    CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, wh-xn01db2, is smaller than cohort of 1 nodes led by node 1, wh-xn01db1, based on map type 2   ===> node two's cohort contains only itself; both cohorts have one node, so the tie is broken in favour of the cohort containing the lowest node number, and node one's cohort survives
2015-01-09 18:14:52.374: [    CSSD][4119]###################################
2015-01-09 18:14:52.374: [    CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
2015-01-09 18:14:52.374: [    CSSD][4119]###################################
2015-01-09 18:14:52.374: [    CSSD][4119](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally   ==========> so node two is the one that dies
2015-01-09 18:14:52.374: [    CSSD][4119]

Now look at node one's log:
2015-01-09 18:14:37.145: [    CSSD][3605]clssnmPollingThread: node wh-xn01db2 (2) at 50% heartbeat fatal, removal in 14.802 seconds   =======> node one logs the same thing and prepares to evict node two
2015-01-09 18:14:37.145: [    CSSD][3605]clssnmPollingThread: node wh-xn01db2 (2) is impending reconfig, flag 2229260, misstime 15198
2015-01-09 18:14:37.145: [    CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
2015-01-09 18:14:37.147: [    CSSD][2577]clssnmvDHBValidateNcopy: node 2, wh-xn01db2, has a disk HB, but no network HB, DHB has rcfg 316174449, wrtcnt, 113662, LATS 3029042056, lastSeqNo 113659, uniqueness 1420704616, timestamp 1420798476/3029066860
2015-01-09 18:14:37.648: [    CSSD][2577]clssnmvDHBValidateNcopy: node 2, wh-xn01db2, has a disk HB, but no network HB, DHB has rcfg 316174449, wrtcnt, 113663, LATS 3029042557, lastSeqNo 113662, uniqueness 1420704616, timestamp 1420798477/3029067861
2015-01-09 18:14:37.951: [    CSSD][2063]clssnmvDiskPing: Writing with status 0x3, timestamp 3029042860/1420798477
2015-01-09 18:14:38.166: [    CSSD][3862]clssnmSendingThread: sending status msg to all nodes
2015-01-09 18:14:38.166: [    CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
2015-01-09 18:14:38.452: [    CSSD][2063]clssnmvDiskPing: Writing with status 0x3, timestamp 3029043360/1420798478
···
2015-01-09 18:14:58.473: [    CSSD][2577]clssnmReadDskHeartbeat:manual shutdown of nodename wh-xn01db2, nodenum 2 epoch 1420798498 msec -1265878681   ====> by now node one has won the arbitration and node two has taken itself down
2015-01-09 18:14:58.473: [    CSSD][4119]clssnmWaitOnEvictions: node 2, undead 1, EXADATA fence handle 0 kill reqest id 0, last DHB (1420798498, 3029088615, 93766)
2015-01-09 18:14:58.473: [    CSSD][4119]clssnmCheckKillStatus: Node 2, wh-xn01db2, down, LATS(3029063382),timeout(0)
2015-01-09 18:14:58.473: [    CSSD][4119]clssnmBldSendUpdate: syncSeqNo(316174449)
2015-01-09 18:14:58.473: [    CSSD][4119]clssnmBldSendUpdate: using msg version 4
2015-01-09 18:14:58.473: [    CSSD][4119]clssnmDoSyncUpdate: Sync 316174449 complete!
2015-01-09 18:14:58.473: [    CSSD][4119]clssnmRcfgMgrThread: Reconfig in progress...
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmHandleUpdate: sync[316174449] src[1], msgvers 4 icin 316174444
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmHandleUpdate: common properties are 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmUpdateNodeState: node wh-xn01db1, number 1, current state 3, proposed state 3, current unique 1420702946, proposed unique 1420702946, prevConuni 0, birth 316174446
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmUpdateNodeState: node wh-xn01db2, number 2, current state 5, proposed state 0, current unique 1420704616, proposed unique 1420704616, prevConuni 1420704616, birth 316174448
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmDeactivateNode: node 2, state 5
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmDeactivateNode: node 2 (wh-xn01db2) left cluster
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmSendAck: node 1, wh-xn01db1, syncSeqNo(316174449) type(15)
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmQueueClientEvent:  Sending Event(1), type 1, incarn 316174449
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmQueueClientEvent: Node[1] state = 3, birth = 316174446, unique = 1420702946
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmQueueClientEvent: Node[2] state = 0, birth = 0, unique = 0
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmHandleUpdate: SYNC(316174449) from node(1) completed
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmHandleUpdate: NODE 1 (wh-xn01db1) IS ACTIVE MEMBER OF CLUSTER
2015-01-09 18:14:58.473: [    CSSD][4376]clssscUpdateEventValue: NMReconfigInProgress  val -1, changes 12
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmHandleUpdate: local disk timeout set to 200000 ms, remote disk timeout set to 200000
2015-01-09 18:14:58.473: [    CSSD][4376]clssnmHandleAck: node wh-xn01db1, number 1, sent ack type 15 for wrong reconfig; ack is for reconfig 316174449 and we are on reconfig 316174450
2015-01-09 18:14:58.474: [    CSSD][1]clssgmStartNMMon: node 1 active, birth 316174446
2015-01-09 18:14:58.474: [    CSSD][1]clssgmStartNMMon: node 2 failed, birth (316174448, 0) (old/new)
2015-01-09 18:14:58.474: [    CSSD][1]clssgmUpdateEventValue: Reconfig Event  val 1, changes 10



That completes the analysis of how this crash happened.


PS (afterword):
This machine is one LPAR on an AIX box, so the heartbeats of the other partitions go through the same physical network port and should have been affected as well.

Sure enough, all three RAC clusters on this physical machine had lost their heartbeats. Two of them each lost one node, but one cluster... heartbeat down, yet nothing happened to it at all.

It turned out that:
[wh-uat01db01@root]# oifcfg getif
en0  10.170.42.0  global  cluster_interconnect
en1  10.170.32.0  global  public

The interconnect (heartbeat) it actually uses is:
the 10.170.42.0 subnet...... which is the dynamic LPAR's IP network, so that cluster's interconnect never went over the failed physical heartbeat port and it stayed up.

Whereas the hosts file I had planned puts the private interconnect on 10.102.1.x (the -priv entries below):
#for rac
10.170.32.93      wh-uat01db01           wh-uat01db01.srcb.com
10.170.32.94      wh-uat01db02           wh-uat01db02.srcb.com
10.170.32.95      wh-uat01db01-vip       wh-uat01db01-vip.srcb.com
10.170.32.96      wh-uat01db02-vip       wh-uat01db02-vip.srcb.com
10.170.32.97      wh-uat01db0-scan       wh-uat01db0-scan.srcb.com
10.170.32.98      wh-uat01db0-scan       wh-uat01db0-scan.srcb.com
10.170.32.99      wh-uat01db0-scan       wh-uat01db0-scan.srcb.com
10.102.1.1        wh-uat01db01-priv      wh-uat01db01-priv.srcb.com
10.102.1.2        wh-uat01db02-priv      wh-uat01db02-priv.srcb.com
#end

That cluster wasn't set up by me... right.
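(If one did want to move the interconnect onto the planned -priv subnet, the registration could be changed with oifcfg; a rough sketch only, where en2 is an assumed interface name since the adapter actually carrying 10.102.1.0 isn't shown here, and the clusterware stack would need a restart afterwards:)

[wh-uat01db01@root]# oifcfg setif -global en2/10.102.1.0:cluster_interconnect   # register the intended private network (en2 assumed)
[wh-uat01db01@root]# oifcfg delif -global en0/10.170.42.0                       # then drop the old interconnect registration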


PPS:
This heartbeat outage turned out to be a failed physical network port, so bonding the interconnect NICs with EtherChannel in production really is worthwhile!
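(For completeness, a minimal sketch of checking the physical port state on AIX; the adapter name is illustrative, and the EtherChannel itself is normally built via smitty etherchannel:)

[wh-xn01db2@root]# lsdev -Cc adapter | grep ent          # list the Ethernet adapters
[wh-xn01db2@root]# entstat -d ent0 | grep -i link        # check link status on the suspect port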


