Oracle wait event: reliable message

Source: Internet | Editor: 程序博客网 | Time: 2024/05/18 00:35

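Today one of a customer's RAC environments ran into trouble.
In this two-node RAC environment, one node hung because of lock contention and could not be started again after shutdown.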

I was on the road when the failure occurred, so I hurried home to deal with it.
Once it was resolved, I went looking for the cause.

Checking the AWR data from that period, the Top 5 Timed Events section showed the following:

Top 5 Timed Events                                         Avg %Total
~~~~~~~~~~~~~~~~~~                                        wait   Call
Event                                 Waits    Time (s)   (ms)   Time Wait Class
------------------------------ ------------ ----------- ------ ------ ----------
reliable message                        354          89    251  219.4      Other
CPU time                                             32          78.3
db file sequential read               2,223          12      6   30.3   User I/O
control file sequential read         29,151           8      0   20.9 System I/O
db file scattered read                   36           2     62    5.5   User I/O
          -------------------------------------------------------------
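
As a quick sanity check on the table, the Avg wait (ms) column should be roughly Time (s) divided by Waits. The short sketch below recomputes it for the reliable message row. The function name is illustrative only, and because AWR rounds the Time (s) column, the other rows match only approximately.

```python
def avg_wait_ms(total_time_s: float, waits: int) -> float:
    """Average time per wait, in milliseconds."""
    return total_time_s * 1000.0 / waits

# Values copied from the reliable message row above.
reliable_message_avg = avg_wait_ms(89, 354)
print(round(reliable_message_avg))  # 251
```

That is about 251 ms per wait, which agrees with the figure in the report.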

The most prominent event here is reliable message. Metalink explains it as follows:

    When you send a message using the 'KSR' intra-instance broadcast
    service, the message publisher waits on this wait-event until
    all subscribers have consumed the 'reliable message' just sent.
    The publisher waits on this wait-event for three seconds and
    then re-tests if all subscribers have consumed the message, or
    until posted.

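In other words, when a message is sent across instances, the sender expects a reply from each subscriber; if it does not get a reliable reply, it keeps waiting. The wait is retried in 3-second cycles until replies have arrived from all subscribers or the sender is posted (woken up).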
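
The wait-and-recheck behaviour described in the Metalink note can be pictured with a small toy model. This is not Oracle's KSR implementation: the class `ReliableMessage`, its method names, and the `max_rounds` cap are all hypothetical, and the usage shortens the 3-second interval so the example finishes quickly.

```python
import threading

class ReliableMessage:
    """Toy model: a broadcast message that tracks pending subscribers."""

    def __init__(self, subscribers):
        self._pending = set(subscribers)
        self._cond = threading.Condition()

    def consume(self, subscriber):
        """A subscriber consumes the message and posts the publisher."""
        with self._cond:
            self._pending.discard(subscriber)
            self._cond.notify_all()

    def wait_until_consumed(self, recheck_interval=3.0, max_rounds=None):
        """Publisher side: sleep up to recheck_interval, then re-test.

        Returns (all_consumed, rounds_waited). max_rounds is an assumption
        added so the demo can stop; the note describes no such limit.
        """
        rounds = 0
        with self._cond:
            while self._pending:
                if max_rounds is not None and rounds >= max_rounds:
                    return False, rounds
                # Wakes early if posted, otherwise after the interval.
                self._cond.wait(timeout=recheck_interval)
                rounds += 1
            return True, rounds

# One subscriber never responds, so the publisher keeps re-testing.
msg = ReliableMessage({"sub_a", "sub_b"})
msg.consume("sub_a")
stuck_result = msg.wait_until_consumed(recheck_interval=0.01, max_rounds=3)
print(stuck_result)  # (False, 3)
```

In this model a subscriber that never consumes the message leaves the publisher cycling through timeouts, which is the pattern that accumulates time under this wait event.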

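So in this environment, communication between the two nodes had already broken down: one node was not getting replies from the other.
This is a frightening kind of failure, and reliable message is a wait event that causes real headaches.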

from:http://www.eygle.com/archives/2008/02/reliable_message.html

///////////////////////////////////////////////////////////////////////////////////////////////////////////////

A question about the reliable message wait event

I haven't worked with RAC systems much. Recently, while looking at the AWR report for a two-node RAC, I found this in the Top 5 Timed Events:
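Top 5 Timed Events                                                          Avg %Total
~~~~~~~~~~~~~~~~~~                                                         wait   Call
Event                                                     Waits Time (s)   (ms)   Time Wait Class
------------------------------------------------------ -------- -------- ------ ------ ----------
CPU time                                                             413          96.2
reliable message                                            422       40     95    9.4      Other
log file sync                                            38,063       32      1    7.4     Commit
log file parallel write                                  36,133       20      1    4.7 System I/O
Streams AQ: qmn coordinator waiting for slave to start        2       11  5,369    2.5      Other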
reliable message is in second place. I don't understand whether this wait event has any impact on performance.
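
One rough way to gauge the weight of this event is to back out the total call time implied by each row, assuming the columns follow the usual Top 5 Timed Events layout (Waits, Time (s), Avg wait (ms), %Total Call Time, Wait Class). This is only arithmetic on the figures quoted above, and the names in the sketch are made up for illustration.

```python
# (event, time in seconds, % of total call time), copied from the rows above
rows = [
    ("CPU time", 413, 96.2),
    ("reliable message", 40, 9.4),
    ("log file sync", 32, 7.4),
    ("log file parallel write", 20, 4.7),
    ("Streams AQ: qmn coordinator waiting for slave to start", 11, 2.5),
]

def implied_total_s(time_s: float, pct_total: float) -> float:
    """Total call time implied by one row: time / (pct / 100)."""
    return time_s * 100.0 / pct_total

implied = {event: implied_total_s(t, pct) for event, t, pct in rows}
for event, total in implied.items():
    print(f"{event}: ~{total:.0f} s")
```

Every row implies a total of roughly 425 to 440 s, so the 40 s spent on reliable message is the 9.4% share the report shows, far behind CPU time.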

Searching online, I found only a brief explanation of this wait event, by eygle:
http://www.eygle.com/archives/2008/02/reliable_message.html

The most prominent event here is reliable message. Metalink explains it as follows:

        When you send a message using the 'KSR' intra-instance broadcast
        service, the message publisher waits on this wait-event until
        all subscribers have consumed the 'reliable message' just sent.
        The publisher waits on this wait-event for three seconds and
        then re-tests if all subscribers have consumed the message, or
        until posted.

from itpub.net