[Translated from a MOS note] What is Split Brain in Oracle Clusterware and RAC


Source:
What is Split Brain in Oracle Clusterware and Real Application Cluster (Doc ID 1425586.1)

Applies to:
Oracle Database - Enterprise Edition - Version 10.1.0.2 and later
Information in this document applies to any platform.

Purpose:
This document explains split brain in Oracle Clusterware and RAC, along with the errors and consequences associated with it.

Details:
In general terms, split brain indicates data inconsistency that originates from two distinct data sets with overlapping scope. It arises either from the network design between the servers, or from a failure condition in an environment that depends on the servers communicating with each other and keeping their data synchronized.

Two components can experience split brain:

1. Clusterware layer:
Cluster nodes maintain their heartbeats with each other through the private network and the voting disk.
When the private network is broken, the cluster nodes cannot communicate with each other over it; once this outage lasts longer than the time defined by the misscount setting, split brain occurs.
In this case, the voting disk is used to decide which node(s) survive and which node(s) are evicted from the cluster. The general voting rules are as follows:

a. The group with more cluster nodes survives.
b. When each group has the same number of nodes, the group containing the node with the lowest node number survives.
c. Some improvements have been made to ensure that the node(s) with lower load survive when the eviction is caused by high system load.
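The voting rules above can be sketched as a small function. This is a minimal illustration of rules (a) and (b) only, not Oracle's actual algorithm; the function name and data representation are invented for the example.

```python
def surviving_group(groups):
    """Pick the sub-cluster that survives after a network partition.

    `groups` is a list of sub-clusters, each a list of node numbers.
    Rule a: the group with more nodes survives.
    Rule b: on a tie, the group containing the lowest node number survives.
    (Rule c, load-based tie-breaking, is not modeled here.)
    """
    max_size = max(len(g) for g in groups)
    candidates = [g for g in groups if len(g) == max_size]
    # Tie-break: the group containing the lowest-numbered node wins.
    return min(candidates, key=lambda g: min(g))

# Node 1 still sees node 3; node 2 is isolated: the larger group survives.
print(surviving_group([[1, 3], [2]]))  # -> [1, 3]
# A 1-vs-1 tie: the group with the lowest node number (node 1) survives.
print(surviving_group([[1], [2]]))     # -> [1]
```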

Typically, when split brain occurs, messages similar to the following appear in ocssd.log:

[ CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: : my node(2), Leader(2), Size(1) VS Node(1), Leader(1), Size(2)
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: ###################################
[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssscExit: CSSD aborting
###################################

The messages above show that communication from node 2 to node 1 is not working: node 2 can only see one node (itself), while node 1 is working fine and can see both nodes in the cluster. To avoid split brain, node 2 aborted itself.

Solution: work with the network administrator to check the private interconnect and eliminate any network problems.
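When reviewing ocssd.log after an eviction, the split-brain abort message shown above is the key line to look for. A rough sketch of such a scan, assuming the message format from the example (the helper name and the search pattern are illustrative, not an official tool):

```python
import re

# The distinctive text CSSD logs when it aborts a node to avoid split brain,
# taken from the example excerpt above.
SPLITBRAIN_RE = re.compile(r"Aborting local node to avoid splitbrain")

def find_splitbrain_lines(log_text):
    """Return the ocssd.log lines that indicate a split-brain driven abort."""
    return [line for line in log_text.splitlines()
            if SPLITBRAIN_RE.search(line)]

sample = (
    "[ CSSD]2011-01-12 23:23:08.090 [1262557536] >TRACE: clssnmCheckDskInfo: Checking disk info...\n"
    "[ CSSD]2011-01-12 23:23:08.090 [1262557536] >ERROR: clssnmCheckDskInfo: Aborting local node to avoid splitbrain.\n"
)
print(len(find_splitbrain_lines(sample)))  # -> 1
```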

2. RAC (database) layer
To ensure data consistency, each instance of a RAC database needs to maintain a heartbeat with the other instances. The heartbeat is maintained by the background processes LMON, LMD, LMS, and LCK.
If any of these processes experiences an IPC send timeout, a communication reconfiguration and instance eviction follow to avoid split brain.
Similar to the voting disk at the clusterware layer, the control file is used to determine which instance(s) survive and which instance(s) are evicted.
The voting result is similar to the clusterware voting result; as a result, one or more instances will be evicted.

Common messages in the instance alert log are similar to:

alert log of instance 1:
---------
Mon Dec 07 19:43:05 2011
IPC Send timeout detected.Sender: ospid 26318
Receiver: inst 2 binc 554466600 ospid 29940
IPC Send timeout to 2.0 inc 8 for msg type 65521 from opid 20
Mon Dec 07 19:43:07 2011
Communications reconfiguration: instance_number 2
Mon Dec 07 19:43:07 2011
Trace dumping is performing id=[cdmp_20091207194307]
Waiting for clusterware split-brain resolution
Mon Dec 07 19:53:07 2011
Evicting instance 2 from cluster
Waiting for instances to leave: 2 ...

alert log of instance 2:
---------
Mon Dec 07 19:42:18 2011
IPC Send timeout detected. Receiver ospid 29940
Mon Dec 07 19:42:18 2011
Errors in file /u01/app/oracle/diag/rdbms/bd/BD2/trace/BD2_lmd0_29940.trc:
Trace dumping is performing id=[cdmp_20091207194307]
Mon Dec 07 19:42:20 2011
Waiting for clusterware split-brain resolution
Mon Dec 07 19:44:45 2011
ERROR: LMS0 (ospid: 29942) detects an idle connection to instance 1
Mon Dec 07 19:44:51 2011
ERROR: LMD0 (ospid: 29940) detects an idle connection to instance 1
Mon Dec 07 19:45:38 2011
ERROR: LMS1 (ospid: 29954) detects an idle connection to instance 1
Mon Dec 07 19:52:27 2011
Errors in file /u01/app/oracle/diag/rdbms/bd/BD2/trace/PVBD2_lmon_29938.trc  (incident=90153):
ORA-29740: evicted by member 0, group incarnation 10
Incident details in: /u01/app/oracle/diag/rdbms/bd/BD2/incident/incdir_90153/BD2_lmon_29938_i90153.trc

In the example above, LMD0 of instance 2 (pid 29940) is the receiver in the IPC send timeout. There can be various causes of an IPC send timeout, for example:

a. Network problems
b. Process hangs
c. Bugs, etc.

Please see Top 5 issues for Instance Eviction Document 1374110.1 for more information.

In instance eviction cases, the alert log and all background process trace files should be reviewed to determine the root cause.
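As a starting point for that review, the eviction-related messages in an alert-log excerpt can be grouped by event type. This is an illustrative sketch, not an official Oracle tool; the patterns simply match the sample messages shown earlier, and the function name is invented.

```python
import re

# Event types and the alert-log text that signals them, based on the
# sample excerpts above.
PATTERNS = {
    "ipc_timeout": re.compile(r"IPC Send timeout detected"),
    "splitbrain_wait": re.compile(r"Waiting for clusterware split-brain resolution"),
    "eviction": re.compile(r"ORA-29740|Evicting instance \d+ from cluster"),
}

def classify_alert_lines(log_text):
    """Return a dict mapping each event type to its matching alert-log lines."""
    events = {name: [] for name in PATTERNS}
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                events[name].append(line)
    return events

sample = (
    "IPC Send timeout detected. Receiver ospid 29940\n"
    "Waiting for clusterware split-brain resolution\n"
    "ORA-29740: evicted by member 0, group incarnation 10\n"
)
events = classify_alert_lines(sample)
print({name: len(lines) for name, lines in events.items()})
```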

Known Issues:

1. Bug 7653579 - IPC send timeout in RAC after only short period Document 7653579.8
   Refer: ORA-29740 Instance (ASM/DB) eviction on Solaris SPARC Document 761717.1
   Fixed in: 11.2.0.1, 11.1.0.7.2 PSU and 11.1.0.7 Patch 22 on Windows
2. Unpublished Bug 8267580: Wrong Instance Evicted Under High CPU Load
   Refer: Wrong Instance Evicted Under High CPU Load in 11.1.0.7 Document 1373749.1
   Fixed in: 11.2.0.1
3. Bug 8365141 - DRM quiesce step hang causes instance eviction Document 8365141.8
   Fixed in: 10.2.0.5, 11.1.0.7.3, 11.1.0.7 patch 25 for Windows and 11.2.0.1
4. Bug 7587008 - Hung RAC instance not evicted from cluster Document 7587008.8
   Fixed in: 10.2.0.4.4, 10.2.0.5 and 11.2.0.1, one-off patch available for various 11.1.0.7 releases
5. Bug 11890804 - LMHB crashes instance with ORA-29770 after long "control file sequential read" waits Document 11890804.8
   Fixed in: 11.2.0.2.5, 11.2.0.3 and 11.2.0.2 Patch 10 on Windows
6. BUG:13732226 - NODE GETS EVICTED WITH REASON CODE 0X2
   BUG:13399435 - KJFCDRMRCFG WAITED 249 SECS FOR LMD TO RECEIVE ALL FTDONES, REQUESTING KILL
   BUG:13503204 - INSTANCE EVICTION DUE TO REASON 0X200000
   Refer: 11gR2: LMON received an instance eviction notification from instance n Document 1440892.1
   Fixed in: 11.2.0.4 and some merge patches available for 11.2.0.2 and 11.2.0.3


 
