rac目录权限导致重启失败（转）

来源：互联网发布：网络教育和函授含金量编辑：程序博客网时间：2024/05/18 02:46

周六晚上突然接到电话，省卫生厅的客户需要紧急技术支持。联系客户询问了大概情况，客户说想在周日晚上7点对两节点RAC进行重装，并在第二天早上九点前上线。简单询问了一下数据库情况，HP-UX下一套Oracle 10g RAC，数据大概有2T左右。当时只是感觉时间稍微有些紧张，而且事前没有进行测试，也没有回退方案，有一定险。不过既然客户想重装，也只能先了解下情况再说。询问客户重装的原因，客户反映数据库平时运行问题倒不大，但是一旦遇到断电或其他故障导致小机重启后，数据库就起不来，往往需要人为干预，每次都的得找人救火。因为客户刚好在外面，简单了解情况后匆匆挂了电话。

挂完电话，和同事通了电话，才知道他去年也帮这个客户处理过一次断电导致重启后RAC起不来的故障，当时客户找了很多人都解决不了，因为元鼎在河南的技术团队帮很多客户处理过问题，有一定的知名度和影响力，最后通过厂家找到元鼎。我同事到现场通过排查，发现是存储划分的lun的权限问题导致ASM起不来，进行处理恢复了服务。因为当时情况紧急，也没有排查原因，判断可能是人为误操作导致磁盘权限修改导致的。和同事商量了一下，一起讨论了下可能导致故障的原因。他建议不重装，找出客户问题原因并帮他解决问题就行了，重装一个是时间有点紧，另外一个没有测试和回退方案，再加上时间有点紧，最好和客户再沟通下。于是又和客户通了电话，把我们的建议和他进行沟通，最后说周日晚上到现场再讨论方案。

周日晚上7点到达客户现场，已经到了很多人，有存储厂家的，小机厂家的十来号人。原来客户准备升级存储和操作系统，害怕停库后出现故障，所以想直接重装RAC。大概讨论了一下，甲方的领导也到了，同意了我的建议。方案定了，就开始工作了。先让甲方工程师把数据库执行全备，配好网络，连上系统，先查看了服务状态：

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
wst1$ crs_stat -t -v
Name           Type           R/RA   F/FT   Target    State     Host       
----------------------------------------------------------------------
ora.wst.db     application    0/0    0/1    ONLINE    OFFLINE              
ora....t1.inst application    0/5    0/0    ONLINE    OFFLINE              
ora....t2.inst application    0/5    0/0    ONLINE    OFFLINE              
ora....SM1.asm application    0/5    0/0    ONLINE    UNKNOWN   wst1       
ora....T1.lsnr application    0/5    0/0    ONLINE    ONLINE    wst1       
ora.wst1.gsd   application    0/5    0/0    ONLINE    ONLINE    wst1       
ora.wst1.ons   application    0/3    0/0    ONLINE    ONLINE    wst1       
ora.wst1.vip   application    0/0    0/0    ONLINE    ONLINE    wst1       
ora....SM2.asm application    0/5    0/0    ONLINE    UNKNOWN   wst2       
ora....T2.lsnr application    0/5    0/0    ONLINE    ONLINE    wst2       
ora.wst2.gsd   application    0/5    0/0    ONLINE    ONLINE    wst2       
ora.wst2.ons   application    0/3    0/0    ONLINE    ONLINE    wst2       
ora.wst2.vip   application    0/0    0/0    ONLINE    ONLINE    wst2  
             
  然后查看了一下数据库日志和crs日志、css日志，发现数据库确实建库以来出现过几次重启，每次重启都伴随着CRS-1205错误：
           
2011-01-29 01:49:33.579
[crsd(21048)]CRS-1205:Auto-start failed for the CRS resource . Detailsin wst1.
2011-01-29 01:49:40.952
[crsd(21048)]CRS-1205:Auto-start failed for the CRS resource . Detailsin wst1.
[cssd(21108)]CRS-1603:CSSD on node wst1 shutdown by user.
2011-01-29 02:16:02.222
[cssd(2073)]CRS-1605:CSSD voting file is online: /dev/rdisk/disk33. Details in /opt/oracle/crs/log/wst1/cssd/ocssd.log.
[cssd(2073)]CRS-1601:CSSD Reconfiguration complete. Active nodes are wst1 wst2 .
2011-01-29 02:16:03.166
[crsd(1757)]CRS-1012:The OCR service started on node wst1.
2011-01-29 02:16:03.169
[evmd(1728)]CRS-1401:EVMD started on node wst1.
2011-01-29 02:16:04.095
[crsd(1757)]CRS-1201:CRSD started on node wst1.
2011-01-29 02:16:05.770
[crsd(1757)]CRS-1205:Auto-start failed for the CRS resource . Details in wst1.
2011-01-29 02:16:05.783
[crsd(1757)]CRS-1205:Auto-start failed for the CRS resource . Details in wst1.
2011-01-29 02:16:12.740
[crsd(1757)]CRS-1205:Auto-start failed for the CRS resource . Details in wst1.
2011-01-29 02:16:13.498
[crsd(1757)]CRS-1205:Auto-start failed for the CRS resource . Details in wst1.
           
2011-01-31 11:27:19.829
[evmd(16007)]CRS-1401:EVMD started on node wst1.
2011-01-31 11:27:20.756
[crsd(16035)]CRS-1201:CRSD started on node wst1.
2011-01-31 11:27:27.719
[crsd(16035)]CRS-1205:Auto-start failed for the CRS resource . Detailsin wst1.
2011-01-31 11:27:27.729
[crsd(16035)]CRS-1205:Auto-start failed for the CRS resource . Detailsin wst1.
[cssd(16272)]CRS-1601:CSSD Reconfiguration complete. Active nodes are wst1 wst2 .
           
[crsd(2606)]CRS-1012:The OCR service started on node wst1.
2012-05-12 11:15:40.140
[evmd(2579)]CRS-1401:EVMD started on node wst1.
2012-05-12 11:15:41.223
[crsd(2606)]CRS-1201:CRSD started on node wst1.
2012-05-12 11:15:55.616
[crsd(2606)]CRS-1205:Auto-start failed for the CRS resource . Details in wst1.
2012-05-12 11:20:08.620
[crsd(2606)]CRS-1205:Auto-start failed for the CRS resource . Details in wst1.
           
2012-06-20 21:04:01.723
[crsd(14516)]CRS-1205:Auto-start failed for the CRS resource . Detailsin wst1.
2012-06-20 21:04:01.785
[crsd(14516)]CRS-1205:Auto-start failed for the CRS resource . Detailsin wst1.
2012-06-20 21:04:03.334
[crsd(14516)]CRS-1205:Auto-start failed for the CRS resource . Detailsin wst1.
2012-06-20 21:04:03.344
[crsd(14516)]CRS-1205:Auto-start failed for the CRS resource . Detailsin wst1.
2012-06-20 21:04:20.033
[crsd(14516)]CRS-1205:Auto-start failed for the CRS resource . Detailsin wst1.

检查了一下用户设置、环境变量设置、目录权限等等，没有发现有太多问题。继续查看日志，这时甲方工程师说数据备好了。因为小机工程师要对系统升级，商量了一下，先关闭数据库服务并重启了服务器，在用srvctl关闭instance和asm时发现不能正常关闭，需要用sqlplus登陆关闭，当时也没太多考虑，按顺序关闭了asm和instance以及crs。

服务器重启后，发现asm和实例都没有起来，但是crs服务都起来了：

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
wst2$ crs_stat -t -v
Name           Type           R/RA   F/FT   Target    State     Host       
----------------------------------------------------------------------
ora.wst.db     application    0/0    0/1    ONLINE    OFFLINE              
ora....t1.inst application    0/5    0/0    ONLINE    OFFLINE              
ora....t2.inst application    0/5    0/0    ONLINE    OFFLINE              
ora....SM1.asm application    0/5    0/0    ONLINE    UNKNOWN   wst1       
ora....T1.lsnr application    0/5    0/0    ONLINE    ONLINE    wst1       
ora.wst1.gsd   application    0/5    0/0    ONLINE    ONLINE    wst1       
ora.wst1.ons   application    0/3    0/0    ONLINE    ONLINE    wst1       
ora.wst1.vip   application    0/0    0/0    ONLINE    ONLINE    wst1       
ora....SM2.asm application    0/5    0/0    ONLINE    UNKNOWN   wst2       
ora....T2.lsnr application    0/5    0/0    ONLINE    ONLINE    wst2       
ora.wst2.gsd   application    0/5    0/0    ONLINE    ONLINE    wst2       
ora.wst2.ons   application    0/3    0/0    ONLINE    ONLINE    wst2       
ora.wst2.vip   application    0/0    0/0    ONLINE    ONLINE    wst2   
               
wst2$ crsctl check crs
CSS appears healthy
CRS appears healthy
EVM appears healthy

用srvctl启动报错：

1
2
3
4
5
6
wst1$ srvctl start asm -n wst1
PRKS-1009 : Failed to start ASM instance "+ASM1" on node "wst1", [PRKS-1009 : Failed to start ASM instance "+ASM1" on node "wst1", [CRS-1028: Dependency analysis failed because of:
CRS-0223: Resource 'ora.wst1.ASM1.asm' has placement error.]]
[PRKS-1009 : Failed to start ASM instance "+ASM1" on node "wst1", [CRS-1028: Dependency analysis failed because of:
CRS-0223: Resource 'ora.wst1.ASM1.asm' has pl：
acement error.]]

尝试在sqlplus里启动asm，可以起来，同样instance也可以起来。看来问题基本可以定位为“发现srvctl和crs_start命令无法启动ASM和实例，但用sqlplus却可以正常启动。”google了一下，有很多人遇到过类似问题，但有和我的情况不一样，metalink也一样，按照其中一个troubleshooting的文档：Troubleshooting when srvctl can't start RAC instance, but sqlplus can start it [ID 844272.1]仔细做了排除，但是也没有发现问题。期间尝试过export SRVM_TRACE=TRUE，跟踪srvctl start asm -n wst1，但是也是报CRS-1028: Dependency analysis failed because of:

1
CRS-0223: Resource 'ora.wst1.ASM1.asm' has placement error.

折腾了1个多小时，眼看就要没希望了，但是又不能放弃，肯定要给客户解决这个问题。重新按照844272.1进行排查,注意到最后有一个排查:

1
2
e. Stop the instance in case the instance startup failed somewhere
In case of instance startup failure, a trial is performed to stop the instance and racgimon to clear leftover resources. The main logging to check the instance failure reason stands in the $ORACLE_HOME/log/<hostname>/racg directory (the $ORACLE_HOME of the rdbms or asm instance) under the name 'imon_<dbname>.log'. When the rdbms is in version 10.1, the $ORACLE_HOME/racg/dump directory was used.

查看了下$ORACLE_HOME/log/wst1/racg目录的权限，竟然owner是root，同时没有imon_wst.log,难道是因为这个日志的问题导致的，看来问题有转机了：

1
2
3
4
5
6
7
8
9
10
wst1:/opt/oracle/102/db_1/log/wst1/racg#ls -lrt
total 256
drwxr-xr-x   2 root       sys             96 Jan 25  2011 racgmain
drwxr-xr-x   2 root       sys             96 Jan 25  2011 racgeut
-rw-r--r--   1 root       sys         124905 Jan 31  2011 ora.wst1.LISTENER_WST1.lsnr.log
wst1:/opt/oracle/102/db_1/log/wst1/racg#cd ..
wst1:/opt/oracle/102/db_1/log/wst1#chown -R oracle:dba racg
wst1:/opt/oracle/102/db_1/log/wst1#ls -lrt
total 0
drwxr-xr-x   4 oracle     dba             96 Jan 25  2011 racg

同样在节点2做了同样操作，再次srvctl start asm -n wst1，果然asm起来了，接着启动实例也成功了。看来问题解决了，但是为什么racg的属主会是root呢？先不管了，通知小机工程师重启小机测试下看重启后服务能不能起来。接近十分钟的等待，crs_stat -t -v ，看来问题解决了。

以上是CRS-0223问题的解决，处理过程中有些日志不是很全了。整个处理过程中其实走了一些弯路，包括当时想重新注册asm服务等等，另外最后也没弄明白为什么racg这个目录的属主会改动，只能猜测是人为的吧。客户看见问题解决了也表示很满意，至于目录权限的问题只能让他们内部做一些规范了。

附上metalink上Troubleshooting when srvctl can't start RAC instance, but sqlplus can start it [ID 844272.1]：

Troubleshooting Details

I. Introduction

srvctl is the client utility to start/stop RAC instances/database. The srvctl command forward the start request to the crsd scheduler daemon to perform the start/stop of the instances/database on the different nodes. The result of the start/stop request is responded back to the client srvctl utility. The instance start job can be decomposed in different parts:

a. Initiate the instance startup

Start dependent resources like ons/listeners/asm in case they are not started yet. ASM is e.g. a required resource when used by the rdbms instance, so ASM need to be known as started by the clusterware before an instance can be started via srvctl command. It is not necessary to prestart the dependent resources manually. CRS knows those dependencies and will autostart first those resources. In case dependent resources like the ASM instance can't be started, then no further trial will be done to start the rdbms instance.

Prepare the init.ora to use under $CRS_HOME/racg/tmp/ora.<dbname>.<instancename>.inst.ora

(retrieved from the srvctl config database -d <dbname> -a command).

b. Start the racgimon to monitor the instance

Start the racgimon process that will monitor the instance and collect statistics like service metrics.

c. Start the instance

It is done via a spawned sqlplus session. Hence, when a instance can't be started via sqlplus, it will not

be startable via srvctl.

d. racgimon connect towards the newly started instance

The racgimon instance monitor connect to the new instance. On success, racgimon will send this info toward evmd and ons to tell the whole cluster+users that the instance is started. The racgimon will further start to collect service metrics and monitor the instance.

e. Stop the instance in case the instance startup failed somewhere

In case of instance startup failure, a trial is performed to stop the instance and racgimon to clear leftover resources. The main logging to check the instance failure reason stands in the $ORACLE_HOME/log/<hostname>/racg directory (the $ORACLE_HOME of the rdbms or asm instance) under the name 'imon_<dbname>.log'. When the rdbms is in version 10.1, the $ORACLE_HOME/racg/dump directory was used.

II. Troubleshooting

Check 1. Check the srvctl start command is received by the crsd daemon

When the 'srvctl start' command is executed, the crsd logging should show a line that tells it attempts to start the instance, e.g.

2009-03-04 11:42:14.889: [ CRSRES][35737] Attempting to start `ora.ORCL.ORCL1.inst` on member `machine1`

The crsd logging stands in $CRS_HOME/log/<hostname>/crsd/crsd.log in 10gR2 and further and in $CRS_HOME/crs/log in 10gR1. In case such logging appear, skip this section and go to Check2 further. In case such logging don't appear, the following troubleshooting checks should be followed.

a. The srvctl may be malfunctioning

In case the srvctl don't report CRS-* errors like e.g

srvctl start database -d V120

PRKP-1001 : Error starting instance V1201 on node lnx10gr2n1

CRS-0233: Resource or relatives are currently involved with another operation.

PRKP-1001 : Error starting instance V1202 on node lnx10gr2n2

CRS-0233: Resource or relatives are currently involved with another operation

but only report PRKP-* errors, then most likely, the srvctl java code is malfunctioning.

There are two srvctl command (in the ORACLE_HOME and ORA_CRS_HOME). The best is to check whether any of them give the same error and trace the srvctl by setting environment variable SRVM_TRACE to trace the srvctl command and check for errors.

The srvctl version need to be in the same version as the database to be started

(e.g. see note.279429.1)

b. The srvctl look to function correctly but crsd don't show startup trials

The srvctl show CRS-* errors or hang because crsd don't give him the response to his command requests.

=> Check other resources that are using different ORACLE_HOME,e.g. the CRS_HOME can be started/stopped (e.g. the nodeapps).

In case nodeapps can't be stopped/started, then the rdbms startup problem is more a general resource startup problem. Problem can be that the crsd.bin daemon hang or is malfunctioning. The crsd.log in

$CRS_HOME/log/<hostname>/crsd/crsd.log need to be reviewed.

Also check whether the clusterware is responding, e.g. crsctl check crs should report

crsctl check crs

CSS appears healthy

CRS appears healthy

EVM appears healthy

on all nodes. Restarting the crsd.bin (as mentioned in note:726925.1) can be a troubleshooting step.

=> Check other resources that using the same ORACLE_HOME can be started/stopped (e.g. listeners)

In case nodeapps can be started/stopped, then the crsd daemon looks to work and accept start/stop commands. However, it is possible all resource from one ORACLE_HOME can't be started. Most likely, there is a

1. a racgwrap problem (see note:740319.1)

2. a permission problem in that ORACLE_HOME

The startup of the instance is done as the user mentioned via the command:

crs_getperm ora.<dbname>.<instance_name>.inst

In case the scripts used to start the instance can't write in the $ORACLE_HOME/log/<hostname>/racg directory, then the clusterware will not be able to start the instance and it will reach the status 'UNKNOWN' (i.e. it can't be stopped either) (see note:741212.1)

Check all files with the database references are owned by the oracle user, e.g.

cd $ORACLE_HOME

find . -name '*ORCL*'

Check 2. Check the status of the instance in the cluster is OFFLINE before starting it

In case the instance resource reach the state 'UNKNOWN' (viewable via crs_stat -t)

then the instance can't be started either without a manual intervention:

1
2
3
4
5
6
crs_stat -t
Name           Type           Target    State     Host
------------------------------------------------------------
ora....01.inst application    ONLINE    UNKNOWN   lnx10gr2n1
ora....02.inst application    OFFLINE   OFFLINE
ora.V120.db    application    OFFLINE   OFFLINE

1
2
3
4
$ srvctl start database -d fsqatr
PRKP-1001 : Error starting instance fsqatr1 on node adc17
CRS-1028: Dependency analysis failed because of:
CRS-0223: Resource 'ora.fsqatr.fsqatr1.inst' has placement error.

srvctl status database -d V120

PRKO-2015 : Error in checking condition of instance on node: lnx10gr2n1

Instance V1202 is not running on node lnx10gr2n2

So, the first action is to 'stop' the instance for the CRS, i.e. make that the CRS don't think the instance is still running, via

srvctl stop instance -d V120 -i V1201

or, in case srvctl stop don't permit to do it,

crs_stop -f ora.fsqatr.fsqatr1.inst

Once the instance is in state OFFLINE, a restart trial via srvctl can be done.

Always check no leftover instance processes are running (via 'ps -ef | grep <SID>'). Leftover processes can block the semaphore and shared memory segment used by the failed instance startup and inhibit further instance startup.

Check 3. Check the instance can be started via srvctl when the instance was prestarted via sqlplus

When the instance is started via sqlplus, then the instance should be detected as started by the clusterware.

In case the instance can't be detected as started via the clusterware, i.e. the instance is started via

sqlplus and the 'srvctl start instance -d <dbname> -i <instance_name>' don't permit the clusterware to

detect the instance as started, then the problem is most likely that the racgimon process can't connect to

the instance. This problem can inhibit the instance to be started via srvctl when the instance is down.

racgimon is using net8 to connect to the instance, so net8 configuration problems can inhibit it to occur, e.g.

=> sqlnet.inbound_connect_timeout and note.402437.1

=> permission problems and note.391453.1

=> sqlnet.ora misconfiguration and note:387526.1

=> left over debugging info and note.605540.1

The racgimon logging in the $ORACLE_HOME/log/<hostname/racg/imon* should be checked.

Check 4. check the database is correctly configured in the clusterware

a. The database configuration can be seen via 'srvctl config database -d <DBNAME> -a ', e.g.

srvctl config database -d V120 -a

lnx10gr2n1 V1201 /home/oracle/oracle/product/10.2.0/db_1

lnx10gr2n2 V1202 /home/oracle/oracle/product/10.2.0/db_1

DB_NAME: null

ORACLE_HOME: /home/oracle/oracle/product/10.2.0/db_1

SPFILE: /ocfs/admin/V120/pfile/spfileV120.ora

DOMAIN: null

DB_ROLE: null

START_OPTIONS: null

POLICY: AUTOMATIC

ENABLE FLAG: DB ENABLED

The START_OPTIONS can be incorrectly set (see note:311321.1) as well as the

ORACLE_HOME and SPFILE.

The spfile/pfile used by the sqlplus when starting the instance manually stands in $ORACLE_HOME/dbs (see note:162491.1) and is thus potentially not the same as the spfile defined in the CRS.

The clusterware use the spfile that is copied to the $CRS_HOME/racg/tmp/<instance_resource_name>.ora file

before starting the instance with it. Check whether the instance can be started with that spfile, e.g.

sqlplus /nolog

connect / as sysdba

startup pfile=/oracle/crs/racg/tmp/ora.orcl.orcl1.inst.ora

In case the sqlplus can't start the instance with that clusterware spfile, check for differences

between it and the spfile/pfile used by sqlplus and correct the differences in the file

mentioned in the 'srvctl config database -d <DBNAME> -a' command. Use note:137483.1 to correct it.

The logging $ORACLE_HOME/log/<hostname>/racg/imon_<dbname>.log contains the error reported by

the sqlplus startup trial (see note.732683.1 and note.360575.1)

b. Database resources should be viewable

The 'crs_stat' command should show one *.db resource + multiple *.inst resources

for each defined instances. In case they are not viewable, then the best is to

recreate the database configuration in the clusterware (e.g. see note:455226.1 )

Check 5. check the system settings of the root user compared to the oracle user

Since the sqlplus used to start the instance is launched by the crsd.bin daemon process,the OS limitation of the crsd daemon are inherited by that sqlplus session. Since crsd.bin is a process started as root, the OS limitations applicable for the root user are used instead of the ones set for the oracle user.

In case of OS limitations set on root, but not on the oracle user, it is possible the instance

can't be started via srvctl ( see note.367442.1 and note.753516.1) but is started with sqlplus and the oracle user. Other linked issues are note.369424.1 and note.758131.1)

References

NOTE:162491.1 - Startup of an Oracle 9i, 10g, 11g Instance using SPFile or Init.ora Parameter File

NOTE:279429.1 - PRKR-1007, PRKO-2005 using the 10g srvctl on a 9i database

NOTE:311321.1 - Srvctl Cannot Start database

NOTE:360575.1 - CRS-0215: Srvctl Cant Start Instance But Sqlplus Can

NOTE:367442.1 - 'srvctl' Unable to Start Large SGA Instance ORA-27102

NOTE:387526.1 - Getting PRKP-1001, CRS-1005, CRS-0223 When Trying to Startup Instance Using Srvctl

NOTE:391453.1 - SRVCTL Does Not Work When RACGIMON Process Cannot Connect to the Database

NOTE:402437.1 - SRVCTL: PRKP-1001 : CRS-0215: Could not start resource, racgimon killed by NS

NOTE:455226.1 - Database / Instances Not Starting With the srvctl Command, Errors PRKP-1001 & CRS-0212

NOTE:605540.1 - Can not Start Instance, Get ORA-3113 via sqlplus and PRKP-1001 CRS-215 via srvctl

NOTE:726925.1 - srvctl start instance fails with PRKP-1001; srvctl trace shows error connecting to CRSD

NOTE:732683.1 - Cannot Start Instance Using SRVCTL but SQLPLUS Can

NOTE:740319.1 - CRS-215 srvctl unable to start ASM, Listener, RDBMS Resources With Error %ORACLE_HOME%/bin/

NOTE:741212.1 - Cannot Start Instance Using Srvctl, No Info in Imon Logs

NOTE:753516.1 - The difference between using srvctl vs using sqlplus for start/stop one or more database nodes

NOTE:844272.1 - Troubleshooting when srvctl can't start RAC instance, but sqlplus can start it

来自：http://www.yuandingit.com/expert/8.html