Deadman Switch and Split Brain in HACMP

Source: Internet. Editor: 程序博客网. Time: 2024/06/05 01:12

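I have passed exam 237, but my memory of these two topics had become very hazy. I recently saw someone asking about them, so I went back and studied them again.

The HACMP Planning Guide contains these two passages: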

To ensure a clean takeover, HACMP provides a Deadman Switch, which is configured to halt the unresponsive node one second before the other nodes begin processing a node failure event. The Deadman Switch uses the Failure Detection Parameters of the slowest network to determine at what point to halt the node. Thus, by increasing the amount of time before a failure is detected, you give a node more time in which to give HACMP CPU cycles. This can be critical if the node experiences saturation at times.
To help eliminate node saturation, modify AIX 5L tuning parameters. For information about these tuning parameters, see the following sections in the Administration Guide:
• Configuring Cluster Performance Tuning in Chapter 18: Troubleshooting HACMP Clusters
• Changing the Failure Detection Rate of a Network Module in Chapter 12: Managing the Cluster Topology.
Change Failure Detection Parameters only after these other measures have been implemented.
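
The timing rule in the passage above can be sketched in a few lines of Python. This is only an illustration under assumptions: `deadman_timeout` is a hypothetical helper, not an HACMP interface, and the network names and detection times are made-up values. It takes the failure detection time of each network, picks the slowest one, and subtracts the one second the guide mentions.

```python
def deadman_timeout(detection_times):
    """Seconds an unresponsive node has before the Deadman Switch halts it.

    detection_times maps a network name to its failure detection time in
    seconds (hypothetical input; HACMP derives this from its own settings).
    """
    if not detection_times:
        raise ValueError("at least one network is required")
    # The guide says the slowest network's parameters are the ones that count.
    slowest = max(detection_times.values())
    # The node is halted one second before the other nodes start
    # processing the node failure event.
    return max(slowest - 1, 0)


# Made-up example values, for illustration only.
networks = {"ether_net": 10, "rs232_net": 20}
print(deadman_timeout(networks))
```

Raising the detection time of the slowest network raises the result, which is the "more time to give HACMP CPU cycles" effect the guide describes.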

Syncd Frequency
The syncd setting determines the frequency with which the I/O disk-write buffers are flushed. Frequent flushing of these buffers reduces the chance of deadman switch time-outs.
The AIX 5L default value for syncd as set in /sbin/rc.boot is 60. Change this value to 10. Note that the I/O pacing parameter setting should be changed first. You do not need to adjust this parameter again unless time-outs frequently occur.
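
As a rough sketch of the syncd change, the Python below rewrites the interval in a copy of the rc.boot text. It rests on an assumption: that the syncd line in /sbin/rc.boot looks like `nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &`. Check the real file before relying on this. `set_syncd_interval` is a hypothetical helper name, and it works on a string so nothing on disk is touched.

```python
import re

# Assumed shape of the syncd line in /sbin/rc.boot; verify on the real system.
SYNCD_LINE = re.compile(
    r"^([ \t]*nohup[ \t]+/usr/sbin/syncd[ \t]+)(\d+)(.*)$", re.MULTILINE
)


def set_syncd_interval(rc_boot_text, seconds):
    """Return rc_boot_text with the syncd interval replaced by `seconds`."""
    new_text, count = SYNCD_LINE.subn(
        lambda m: f"{m.group(1)}{seconds}{m.group(3)}", rc_boot_text
    )
    if count == 0:
        # Refuse to guess if the file does not match the assumed layout.
        raise ValueError("no syncd line found in the expected form")
    return new_text


sample = "nohup /usr/sbin/syncd 60 > /dev/null 2>&1 &\n"
print(set_syncd_interval(sample, 10), end="")
```

Running it on text read from a backup copy of the file, and comparing the result by eye before replacing anything, is the safer way to use a helper like this.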

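A simple explanation:

To handle a node failure correctly, the cluster has to decide whether a node is really dead. The deadman switch makes that decision using the values set in the failure detection parameters.

Problems with I/O, memory, and so on can keep the cluster manager from handling inter-node communication in time, so a cluster node may be halted by mistake.

So a few parameters need tuning:

1. I/O pacing

2. syncd

3. Increase the amount of memory available to the communications subsystem

4. Change the failure detection rate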

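I have not fully figured out split brain. Roughly, the point is that when a failure occurs, HACMP must make sure a resource is not accessed by several nodes at the same time, which would corrupt the data. This tends to happen when the TCP/IP network fails while the non-TCP/IP network is missing or has also failed: both nodes then believe they can legitimately access the VG and other resources. When this happens (TCP/IP broken, non-TCP/IP unreachable), the system brings down the node that tries to join the cluster later.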

That should be more or less right, I think.
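
The split-brain condition described above can be put into a small Python sketch. It is a simplification for illustration, not HACMP's actual logic: `classify_peer` and its return strings are hypothetical, and a missing non-TCP/IP network is treated the same as a failed one.

```python
def classify_peer(ip_heartbeat_ok, non_ip_heartbeat_ok):
    """Guess the peer's state from heartbeats on the two kinds of network.

    Pass non_ip_heartbeat_ok=False when the non-TCP/IP network is either
    failed or not configured at all.
    """
    if ip_heartbeat_ok:
        return "peer alive"
    if non_ip_heartbeat_ok:
        # Only the TCP/IP network is down; the peer is still running,
        # so it must keep its resources.
        return "network failure, no takeover"
    # No heartbeat on any path. The peer looks dead, but it may only be
    # unreachable. If both nodes reach this conclusion and both take the
    # VG, that is split brain.
    return "peer presumed dead, split-brain risk"


print(classify_peer(False, True))
print(classify_peer(False, False))
```

The second call is the case from the text: with TCP/IP broken and the non-TCP/IP path unavailable, a node has no way to tell a dead peer from an unreachable one.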