Oracle RAC Technical Notes 3

Source: Internet  Editor: 程序博客网  Time: 2024/06/10 04:53

Case Studies

1 A node fails to start because the OLR is missing

Environment: RHEL 5.5 + 11.2.0.4 GI, two nodes
Problem: GI cannot be started on node 2
Diagnosis: first determine at which stage GI fails
crsctl stat res -t -init

[grid@test1 ohasd]$ crsctl stat res -t -init
CRS-4639: Could not contact Oracle High Availability Services
CRS-4000: Command Status failed, or completed with errors.

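The output above shows that not even the ohasd layer is up. Either the init.ohasd script that starts the cluster from /etc/inittab was never invoked, or the ohasd.bin daemon failed to start. This needs further verification: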
ps -ef|grep has
[Image: output of ps -ef | grep has (not available)]

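From the output above, the init.ohasd script was indeed invoked and the ohasd.bin process was launched, so the problem is that ohasd did not complete its startup. The next step is to analyze the ohasd log file.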
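The process check above can be scripted. The following is a minimal sketch, not part of the original notes: the function name ohasd_state and its three result labels are hypothetical, and it only classifies `ps -ef` output by looking for the two process names mentioned in the text.

```shell
# Classify the ohasd startup state from `ps -ef` output read on stdin.
# Prints one of: not_called, wrapper_only, daemon_running
# (function name and labels are illustrative, not Oracle terminology)
ohasd_state() {
    ps_out=$(cat)
    has_init=no
    has_bin=no
    # init.ohasd is the wrapper script started from /etc/inittab
    if printf '%s\n' "$ps_out" | grep -v grep | grep -q 'init\.ohasd'; then
        has_init=yes
    fi
    # ohasd.bin is the daemon process itself
    if printf '%s\n' "$ps_out" | grep -v grep | grep -q 'ohasd\.bin'; then
        has_bin=yes
    fi
    if [ "$has_bin" = yes ]; then
        echo daemon_running
    elif [ "$has_init" = yes ]; then
        echo wrapper_only
    else
        echo not_called
    fi
}
```

Typical use would be `ps -ef | ohasd_state`. Note that daemon_running only means the process exists; as in this case, the daemon may still be failing during initialization, which is why the log has to be read next.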

ohasd.log
2014-11-15 12:29:03.167: [ default][3037648592] OHASD Daemon Starting. Command string :reboot
2014-11-15 12:29:03.169: [ default][3037648592] Initializing OLR
2014-11-15 12:29:03.172: [ OCROSD][3037648592]utopen:6m': failed in stat OCR file/disk /**/**/**/**/cdata/***.olr, errno=2, os err string=No such file or directory
2014-11-15 12:29:03.172: [ OCROSD][3037648592]utopen:7: failed to open any OCR file/disk, errno=2, os err string=No such file or directory
2014-11-15 12:29:03.172: [ OCRRAW][3037648592]proprinit: Could not open raw device
2014-11-15 12:29:03.173: [ OCRAPI][3037648592]a_init:16!: Backend init unsuccessful : [26]
2014-11-15 12:29:03.173: [ CRSOCR][3037648592] OCR context init failure. Error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2014-11-15 12:29:03.173: [ default][3037648592] Created alert : (:OHAS00106:) OLR initialization failed, error: PROCL-26: Error while accessing the physical storage Operating System error [No such file or directory] [2]
2014-11-15 12:29:03.173: [ default][3037648592][PANIC] OHASD exiting; Could not init OLR

According to the log above, ohasd exits because it cannot access the OLR: the OLR file does not exist (errno=2).
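When the log is long, the relevant lines can be filtered out instead of read by eye. This is a small sketch under the assumption that the three message fragments seen in this case (PROCL-26, "Could not init OLR", "failed in stat OCR file") are the ones of interest; the function name olr_init_failed is hypothetical.

```shell
# Report whether an ohasd.log shows the OLR initialization failure from this case.
# $1: path to ohasd.log
# Prints the matching lines and returns 0 if any are found, returns 1 otherwise.
olr_init_failed() {
    log_file=$1
    # The patterns are the message fragments seen in this case, not a complete list
    grep -E 'PROCL-26|Could not init OLR|failed in stat OCR file' "$log_file"
}
```

For example, `olr_init_failed ohasd.log && echo "OLR problem"` would flag the log shown above.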
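Resolution: GI creates a backup of the OLR by default during installation, so the OLR can be restored from the default backup location. First check that the backup OLR exists, then restore it with ./ocrconfig -local -restore, passing the backup file name.
Finally, restart GI, and the problem is resolved:
[root@test1 ~]# ./crsctl start crs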
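The "check that the backup exists, then restore" step can be sketched as follows. The assumptions here are not from the original notes: the backups are taken to live in $GRID_HOME/cdata/<hostname> and to be named backup_YYYYMMDD_HHMMSS.olr, and the function name olr_restore_cmd is hypothetical. The function only prints the command so it can be reviewed before root runs it.

```shell
# Print the ocrconfig command that would restore the newest OLR backup.
# $1: directory holding backup_*.olr files (assumed: $GRID_HOME/cdata/<hostname>)
# Nothing is restored here; the command is only printed for review.
olr_restore_cmd() {
    backup_dir=$1
    newest=
    # Globs expand in sorted order and backup_YYYYMMDD_HHMMSS.olr names sort
    # chronologically, so the last existing match is the newest backup
    for f in "$backup_dir"/backup_*.olr; do
        if [ -e "$f" ]; then
            newest=$f
        fi
    done
    if [ -z "$newest" ]; then
        echo "no OLR backup found in $backup_dir" >&2
        return 1
    fi
    echo "ocrconfig -local -restore $newest"
}
```

After a restore, `ocrcheck -local` can be used to verify the OLR before GI is started again.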

2 The database fails to start because of HAIP

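Environment: AIX 6.1 + 11.2.0.2 GI, two nodes.
Problem: this is a newly installed cluster. root.sh reported an error when run on node 1, and the database cannot start.
Analysis: because the error came from root.sh, start with the root.sh log: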

rootcrs_***.log
2013-04-25 17:36:54: Executing cmd: /bin/crsctl start resource ora.cluster_interconnect.haip -init
2013-04-25 17:37:58: Command output:
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on '***'
CRS-5017: The resource action "ora.cluster_interconnect.haip start" encountered the following error:
> Start action for HAIP aborted. For details refer to "(:CLSN00107:)" in "/***/**/log/test1/agent/ohasd/orarootagent_root/orarootagent_root.log".
> CRS-2674: Start of 'ora.cluster_interconnect.haip' on '***' failed
> CRS-2679: Attempting to clean 'ora.cluster_interconnect.haip' on '***'
> CRS-2681: Clean of 'ora.cluster_interconnect.haip' on '***' succeeded
> CRS-4000: Command Start failed, or completed with errors.

According to the output above, the failure occurred while starting HAIP, so the agent log it points to needs to be examined for further analysis:

[CLSFRAME][2314] {0:0:79} [TIMER] New wait delay: 1650
2013-04-25 17:33:36.652: [ora.cluster_interconnect.haip][1800] {0:0:79} [start] (:CLSN00107:) clsn_agent::start {
2013-04-25 17:33:36.653: [ora.cluster_interconnect.haip][1800] {0:0:79} [start] NetworkAgent::init enter {
2013-04-25 17:33:36.653: [ora.cluster_interconnect.haip][1800] {0:0:79} [start] NetworkAgent::init exit }
2013-04-25 17:33:36.653: [ USRTHRD][1800] {0:0:79} Thread:[NetHAMain]start {
2013-04-25 17:33:36.653: [ USRTHRD][1800] {0:0:79} Thread:[NetHAMain]start }
2013-04-25 17:33:36.653: [ USRTHRD][3343] {0:0:79} [NetHAMain] thread started
2013-04-25 17:33:36.694: [ USRTHRD][3343] {0:0:79} Ocr Context init default level 319728528
2013-04-25 17:33:36.694: [ default][3343]clsvactversion:4: Retrieving Active Version from local storage.
2013-04-25 17:33:36.825: [ USRTHRD][3343] {0:0:79} HAIP: mbr num is 0.
[ CLWAL][3343]clsw_Initialize: OLR initlevel [70000]
2013-04-25 17:33:36.952: [ USRTHRD][3343] {0:0:79} HAIP: initializing to 1 interfaces
2013-04-25 17:33:36.954: [ USRTHRD][3343] {0:0:79} HAIP: configured to use 1 interfaces
2013-04-25 17:33:36.959: [ USRTHRD][3343] {0:0:79} HAIP: Updating member info HAIP1;*.*.*.*#0
2013-04-25 17:33:36.960: [ USRTHRD][3343] {0:0:79} InitializeHaips[ 0 ] infList 'inf ib0, ip *.*.*.1, sub *.*.*.*'
2013-04-25 17:33:36.961: [ USRTHRD][3343] {0:0:79} Error in getting Key SYSTEM.network.haip.group.cluster_interconnect.interface.valid in OCR
2013-04-25 17:33:36.997: [ CLSINET][3343] failed to open OLR HAIP subtype SYSTEM.network.haip.group.cluster_interconnect.interface.valid key, rc=4
2013-04-25 17:33:36.998: [ USRTHRD][3343] {0:0:79} ipMapsz 0, idxMapsz 0, restart 0, numHaip 1, infListSz 1
2013-04-25 17:33:36.998: [ USRTHRD][3343] {0:0:79} HAIP reset on new modified startup, ipSize 0 != numInf 1
2013-04-25 17:33:36.998: [ USRTHRD][3343] {0:0:79} restart 0, haipSize 0, numHaip 1, numSub 0, ipMsz 0
2013-04-25 17:33:36.998: [ USRTHRD][3343] {0:0:79} HAIP: starting inf 'ib0', suggestedIp '', assignedIp ''
2013-04-25 17:33:36.998: [ USRTHRD][3343] {0:0:79} Thread:[NetHAWork]start {
2013-04-25 17:33:36.998: [ USRTHRD][3343] {0:0:79} Thread:[NetHAWork]start }
2013-04-25 17:33:36.998: [ USRTHRD][3600] {0:0:79} [NetHAWork] thread started
2013-04-25 17:33:36.999: [ USRTHRD][3600] {0:0:79} Arp::sCreateSocket {
2013-04-25 17:33:36.999: [ USRTHRD][3600] {0:0:79} failed to create arp
2013-04-25 17:33:36.999: [ USRTHRD][3600] {0:0:79} (null) category: -2, operation: ssclsi_aix_get_phys_addr, loc: aixgetpa:4,n, OS error: 2, other:
2013-04-25 17:33:36.999: [ USRTHRD][3600] {0:0:79} Arp::sCreateSocket {
2013-04-25 17:33:36.999: [ USRTHRD][3600] {0:0:79} failed to create arp
2013-04-25 17:33:36.999: [ USRTHRD][3600] {0:0:79} (null) category: -2, operation: ssclsi_aix_get_phys_addr, loc: aixgetpa:4,n, OS error: 2, other:
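In an agent log of this length, the two facts that matter are which interface HAIP tried to start and which OS-level operation failed. A rough sketch for pulling these out is shown below. It assumes the log uses plain ASCII single quotes around the interface name and the "operation: ..., loc:" layout seen above; the function name haip_failures is hypothetical.

```shell
# Summarize HAIP startup failures in an orarootagent log.
# $1: path to the log file
# Prints the interfaces HAIP tried to start and the OS operations that failed.
haip_failures() {
    log_file=$1
    echo "interfaces:"
    # "HAIP: starting inf 'ib0', ..." -> ib0
    sed -n "s/.*HAIP: starting inf '\([^']*\)'.*/\1/p" "$log_file" | sort -u
    echo "failed operations:"
    # "operation: ssclsi_aix_get_phys_addr, loc: ..." -> ssclsi_aix_get_phys_addr
    sed -n 's/.*operation: \([^,]*\),.*/\1/p' "$log_file" | sort -u
}
```

Run against the log above, this would list a single interface and a single failing operation, which points the investigation at the physical-address lookup for that interface on AIX.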

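It appears that some operating-system-related errors occurred while starting HAIP, so the state of the private network at the OS level needs to be checked. Judging by the interface name (ib0), InfiniBand appears to be in use, and the DBA confirmed that this is the case.
ifconfig -a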

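The interface status looked normal, so the problem still lay at the GI layer. Further checking confirmed that GI 11.2.0.2 does not yet support InfiniBand on the AIX platform. For now, therefore, HAIP cannot be used, and the initialization parameter cluster_interconnects has to specify the private network IP address for the ASM and database instances.
Resolution: after cluster_interconnects was changed with the following command, the database started normally.
alter system set cluster_interconnects='XX.XX.XX.XX' scope=spfile sid='';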
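Because each instance needs its own private IP, the statement has to be issued once per SID, in the ASM instance for the ASM SIDs and in the database instance for the database SIDs. The helper below is only a sketch that prints the statements; the function name cluster_interconnects_sql is hypothetical, and so are the SIDs and addresses in the usage line, which are not values from this case.

```shell
# Print the ALTER SYSTEM statement that pins one instance to a private IP.
# $1: instance SID
# $2: private interconnect IP of the node that instance runs on
cluster_interconnects_sql() {
    sid=$1
    ip=$2
    printf "alter system set cluster_interconnects='%s' scope=spfile sid='%s';\n" "$ip" "$sid"
}
```

For example, `cluster_interconnects_sql orcl1 192.168.10.1` and `cluster_interconnects_sql orcl2 192.168.10.2` would produce one statement per database instance of a two-node cluster. Since the statement uses scope=spfile, the new value only takes effect after the instance is restarted.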
