2017-07-17 DBA Diary: Spotting and Fixing a CRSDG Disk Problem by Intuition

1 Problem:
Around 08:00 on 2017-07-14, an SMS alert came in: "free space on /u01, the installation directory of database db1, is below 5%". What caused it?
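As a quick sanity check, the alert can be confirmed directly on the host. A minimal sketch, assuming an HP-UX server (bdf is the HP-UX equivalent of df; on Linux, df -k /u01 shows the same information):

bdf /u01        # shows the filesystem size, used and available KB, and the use% for /u01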
2 Analysis:
2.1 Description: around 08:00 on 2017-07-14, free space on /u01, the installation directory of database db1, dropped below 5%.
2.2 Data collection:
  • Logged in to the db1 server over ssh and ran the commands below to see which directory was consuming the most space, drilling down one level at a time (see the sketch after this list):
cd /u01
du -sk *
  • The search showed that the ASM trace directory ./trace occupied the most space.
  • Inside the trace directory there was a trace file of about 10 GB whose modification time was the current time.
  • Besides that trace file, alert_+ASM1.log was also being updated at the current time.
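A minimal sketch of the drill-down, repeated one level at a time until the heavy directory is found; the ADR trace path shown here is an assumption about this host's layout, not taken from the original diagnosis:

cd /u01
du -sk * | sort -rn | head -5               # largest entries at this level, biggest first
cd app/grid/diag/asm/+asm/+ASM1             # hypothetical path; descend into whichever entry is largest
du -sk * | sort -rn | head -5               # repeat until one directory stands out (here: trace)
ls -lrt trace | tail -5                     # newest files last; the 10 GB trace and alert_+ASM1.log show current mtimes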
2.3 Analysis
  • The 10 GB trace file should be the culprit behind this alert.
  • Did this 10 GB file accumulate slowly, or did some event make it grow to 10 GB rapidly?
  • Another detail worth noting: the modification times of both the trace file and the alert file kept changing.
2.4 Conclusion
  • Removing the 10 GB trace file will almost certainly clear the alert, but it will not necessarily reveal the deeper cause behind it.
  • Follow-up action: determine whether the trace file grew rapidly or accumulated gradually (see the sketch below).
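A hedged sketch of that follow-up: compare the file size a minute apart to see whether it is still growing, then reclaim the space by truncating in place rather than deleting, since the ASM process may still hold the file open (the trace file name and path are illustrative):

TRC=/u01/app/grid/diag/asm/+asm/+ASM1/trace/+ASM1_ora_12345.trc   # illustrative; substitute the real 10 GB trace file
ls -l $TRC                     # record the current size
sleep 60
ls -l $TRC                     # compare: still growing, or static?
cat /dev/null > $TRC           # truncate in place; rm would not free the blocks while ASM keeps the file open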
3. Second round of analysis:
3.1 Description: what produced the 10 GB trace file?
3.2 Data collection:
  • Tried to open the 10 GB trace file to inspect its contents; it was too large and the attempt failed.
  • Opened the alert file and found the following errors:
NOTE: Attempting voting file relocation on diskgroup CRSDG
NOTE: Failed voting file relocation on diskgroup CRSDG
NOTE: Attempting voting file relocation on diskgroup CRSDG
NOTE: Failed voting file relocation on diskgroup CRSDG
NOTE: Attempting voting file relocation on diskgroup CRSDG
NOTE: Failed voting file relocation on diskgroup CRSDG
NOTE: Attempting voting file relocation on diskgroup CRSDG
NOTE: Failed voting file relocation on diskgroup CRSDG
  • Checked when these messages first appeared, how long they had persisted, and at what frequency:
The messages started right after the HP storage failure on 2017-05-16 and kept appearing.
They continued through 2017-07-14.
Frequency: one occurrence roughly every 20 seconds.
  • Checked the state of the CRSDG ASM disks with select * from v$asm_disk; two disks showed mount_status = IGNORED (see the sketch below).
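A sketch of how those two checks can be scripted; the alert log location and the use of sqlplus as sysasm are assumptions about this host, not commands recorded during the original incident:

ALERT=/u01/app/grid/diag/asm/+asm/+ASM1/trace/alert_+ASM1.log      # assumed location of the ASM alert log
grep -c "Failed voting file relocation on diskgroup CRSDG" $ALERT  # total number of occurrences since 2017-05-16
# at a 20-second interval that is roughly 3 per minute * 60 * 24 = ~4320 occurrences per day
# run as the Grid Infrastructure owner with the +ASM1 environment set
sqlplus -s "/ as sysasm" <<'EOF'
set linesize 200 pagesize 100
col path format a40
select group_number, disk_number, mount_status, header_status, mode_status, state, path
  from v$asm_disk
 order by group_number, disk_number;
EOF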
3.3 Analysis
  • When the HP storage failed on 2017-05-16, the number of usable CRSDG disks dropped from three to one. Even after the storage fault was repaired, Oracle ASM could not re-recognize the disks automatically, so the cluster's voting file relocation kept failing.
  • IGNORED - Disk is present in the system but is ignored by Oracle ASM because of one of the following: the disk is detected by the system library but is ignored because an Oracle ASM library discovered the same disk, or Oracle ASM has determined that the membership claimed by the disk header is no longer valid.
  • For further confirmation, support.oracle.com was consulted. The note "V$ASM_DISK View Shows Some Disk Header Status as IGNORED and Group Number as "0". (Doc ID 1299866.1)" explains the situation in detail; the relevant excerpt follows:
This is not a normal situation but can happen in the following scenario.
1] If there are more than 2 disks dropped forcefully from the same diskgroup created with normal redundancy while the diskgroup is still mounted, these disks still show "MEMBER" and group number as "0". If a new disk is added to the diskgroup or existing disks are added in different order, there is a possibility that the same disk number and disk name can be assigned to the ones that was assigned to one of dropped disks which has not been added yet.
This situation can happen either when one of SAN failgroup crash or when one of cell node from Exadata crash.
2] In RAC environment, ASM disk added to an existing diskgroup should be seen from all nodes. If one of the nodes doesn't see the disk being added, the operation will fail with "ORA-15075 disk(s) are not visible cluster-wide" but the disk being added keeps the new disk number and disk name showing "MEMBER" status. If, after this, a new disk is added to the RAC node successfully, chances are the new disk gets the same disk number and disk name as the one that failed in the first place.
The following similar error occurs when these disks showing IGNORED status are attempted to be added to an existing diskgroup.
SQL> alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1' ;
alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15033: disk '/dev/oracleasm/disks/ASM1' belongs to diskgroup "DATA"
3] Wrong multipath configuration where multiple single-device paths are visible in asm_diskstring.
  • So what exactly had happened to those offlined disks in CRSDG?
4. Third round of analysis
4.1 Description: have the offlined CRSDG disks suffered a hardware fault?
4.2 Data collection: as root, ran ioscan -fnkCdisk to check the state of the disk paths and found that some disk paths were in the NO_HW state (see the sketch below).
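A minimal sketch of that check on HP-UX (run as root); the grep patterns simply filter the S/W State column of the ioscan output:

ioscan -fnkCdisk | grep NO_HW            # disk paths whose hardware no longer responds
ioscan -fnkCdisk | grep -c CLAIMED       # for contrast, the number of healthy, claimed disk paths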
4.3 Analysis:
NO_HW: the hardware at this address no longer responds.
NO_HW in ioscan output means the device was responding when the machine booted but is not responding now.
NO_HW can come from a faulty device, or from a device that has been moved or removed.
4.4 Conclusion:
The IGNORED disks in CRSDG are already treated as invalid devices at the OS and storage layers.
5. Solution:
5.1 Based on the official Oracle documentation, the plan was:
  • Repair the broken paths so that the OS layer can recognize the disks again.
  • Run alter diskgroup crsdg add disk 'path' force;
5.2 In practice, the incident was handled without first researching Oracle's official procedure thoroughly. After confirming with the storage administrator, the host was rebooted to repair the paths and the DB instance was restarted, but the disks were still not recognized automatically. Two new disks were therefore added immediately and placed into CRSDG, after which the "Failed voting file relocation on diskgroup CRSDG" errors disappeared from the alert log (see the verification sketch below). Finally, the permissions and group ownership of the IGNORED disks were changed back to their original settings (e.g. root:sys), and the disks were handed back to the storage administrator for reclamation.
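A hedged sketch of how the result can be verified after new disks are added to CRSDG; the voting_file column of v$asm_disk (present in 11.2) and crsctl query css votedisk are standard checks, and their use here is an assumption rather than a record of what was actually run:

# run as the Grid Infrastructure owner with the +ASM1 environment set
sqlplus -s "/ as sysasm" <<'EOF'
col path format a40
select d.path, d.mount_status, d.header_status, d.mode_status, d.voting_file
  from v$asm_disk d, v$asm_diskgroup g
 where d.group_number = g.group_number
   and g.name = 'CRSDG';
EOF
crsctl query css votedisk                # should list ONLINE voting files on the CRSDG disks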
6. References
V$ASM_DISK View Shows Some Disk Header Status as IGNORED and Group Number as "0". (Doc ID 1299866.1)
APPLIES TO:
Oracle Database - Enterprise Edition - Version 10.2.0.1 to 11.2.0.2 [Release 10.2 to 11.2]
Information in this document applies to any platform.
SYMPTOMS
1] V$ASM_DISK view shows disk header status as IGNORED and group number as "0".
2] The background ASM trace file corresponding to the sqlplus session shows a message similar to the following.
Ignoring dsk because it is a duplicatekfdsk:0xb7b91620 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
disk: num: 21/47626513311654 grp: 0/47622597378048 compat: 10.1.0.0.0 dbcompat:10.1.0.0.0
fg: path: /dev/oracleasm/disks/ASM1 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
mnt: O hdr: M mode: v v(--) p(r-) a(-) d(-) sta: N flg: 1001
slot 65535 ddeslot 65535 numslots 65535 dtype 0 enc 0 part 0 flags 0
kfts: 2013/03/02 10:11:53.307000
kfts: 2013/11/15 01:18:09.263000
pcnt: 0 ()
kfkid: 0xb7bbf940, kfknm: , status: IDENTIFIED
fob: (KSFD)ba77c470, magic: bebe ausize: 0
kfdds: dn=21 inc=3915933606 dsk=0xb7b91620 usrp=0x2b50f0992118
kfkds 0x2b50f08b07f8, kfkid 0xb7bbf940, magic abbe, libnum 0, bpau 0, fob 0xba77dec0
Ignoring dsk because it is a duplicatekfdsk:0xb7b919a0
CAUSE
This is not a normal situation but can happen in the following scenario.
1] If there are more than 2 disks dropped forcefully from the same diskgroup created with normal redundancy while the diskgroup is still mounted, these disks still show "MEMBER" and group number as "0". If a new disk is added to the diskgroup or existing disks are added in different order, there is a possibility that the same disk number and disk name can be assigned to the ones that was assigned to one of dropped disks which has not been added yet.
This situation can happen either when one of SAN failgroup crash or when one of cell node from Exadata crash.
2] In RAC environment, ASM disk added to an existing diskgroup should be seen from all nodes. If one of the nodes doesn't see the disk being added, the operation will fail with "ORA-15075 disk(s) are not visible cluster-wide" but the disk being added keeps the new disk number and disk name showing "MEMBER" status. If, after this, a new disk is added to the RAC node successfully, chances are the new disk gets the same disk number and disk name as the one that failed in the first place.
The following similar error occurs when these disks showing IGNORED status are attempted to be added to an existing diskgroup.
SQL> alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1' ;
alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1'
*
ERROR at line 1:
ORA-15032: not all alterations performed
ORA-15033: disk '/dev/oracleasm/disks/ASM1' belongs to diskgroup "DATA"
3] Wrong multipath configuration where multiple single-device paths are visible in asm_diskstring.
SOLUTION
IGNORED status can be seen for ASM disks which are not part of any currently mounted diskgroup but it shows the same disk header information with one of disks that are currently mounted such as disk number, disk name and diskgroup and the status as "MEMBER" and group number as "0".
Check whether there are at least 2 ASM disks showing the same disk header information using kfed, as described in the Appendix.
1] When the diskgroup from which the disk has been dropped is currently mounted, try to add the disk showing "IGNORED" status using force option to the original diskgroup from which the disk has been dropped. Please see Document:946213.1 - for details how to add a disk back to the original diskgroup.
Example)
SQL> alter diskgroup data add failgroup F2 disk '/dev/oracleasm/disks/ASM1' force ;
Diskgroup altered
2] When the diskgroup from which the disk has been dropped is not currently mounted, v$asm_disk shows the correct disk as MEMBER by comparing the creation time from the disk directory. If a disk shows the same disk number but has a different creation time in the disk directory, its status is seen as IGNORED.
example) /dev/oracleasm/disks/ASM1 and /dev/oracleasm/disks/ASM2 disks below show the same disk header information from kfed output.
SQL>@diskinfo
G_N D_N NAME FAILGROUP M_STATU H_STATUS MO_STAT STATE PATH M_DATE
---- ---- ----------- ---------- ------- ---------- ------- -------- ------------------------- --------------------
0 0 CLOSED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM2 2011/02/25 21:07:49
0 1 IGNORED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM1 2011/02/25 18:08:43
0 2 CLOSED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM4 2011/02/25 20:59:01
0 3 CLOSED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM3 2011/02/25 21:07:49
1 0 PLAY_0000 PLAY_0000 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM5 2011/02/20 20:52:31
3 0 KYLE_0000 KYLE_0000 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM6 2011/02/20 20:52:44
2-1. ASM discards the disk showing IGNORED status automatically by comparing creation time from ASM metadata called disk directory when mounting DATA diskgroup.
SQL> alter diskgroup data mount ;
Diskgroup altered.
Note that /dev/oracleasm/disks/ASM1 has been excluded from DATA diskgroup.
SQL> @diskinfo
G_N D_N NAME FAILGROUP M_STATU H_STATUS MO_STAT STATE PATH M_DATE
---- ---- ----------- ---------- ------- ---------- ------- -------- ------------------------- --------------------
0 1 IGNORED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM1 2011/02/25 18:08:43 <<<<<<<<<<<<<<<<<<< Here, ASM1 has been excluded from DATA diskgroup
1 0 PLAY_0000 PLAY_0000 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM5 2011/02/20 20:52:31
2 0 DATA_0000 F1 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM2 2011/02/25 21:09:47
2 1 DATA_0001 F2 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM4 2011/02/25 21:09:47
2 2 DATA_0002 F2 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM3 2011/02/25 21:09:47
3 0 KYLE_0000 KYLE_0000 CACHED MEMBER ONLINE NORMAL /dev/oracleasm/disks/ASM6 2011/02/20 20:52:44
2-2. And then the disk showing IGNORED can be added using force option described in step 1.
Appendix
1. How to check whether there are duplication disks showing the same disk header information.
o Check whether the kfed executable exists in your <ASM_HOME>/bin
o Rebuild kfed if kfed executable doesn't exist in $GRID_HOME/bin using the following way.
$cd $ORACLE_HOME/rdbms/lib
$make -f ins_rdbms.mk ikfed
o Take a kfed output for the disks. If the following information of the header is the same, the 2 disks are considered to be the same.
Note mntstmp.hi and mntstmp.lo (disk mount time) can be different to be the same disk in this exercise.
ex) $ kfed read /dev/oracleasm/disks/ASM1 | egrep 'dsknum|dskname|grpname|fgname|hdrsts|mntstmp.hi|mntstmp.lo'
2. diskinfo.sql
set linesize 200
col g_name format a10
col g_n format 99
col d_n format 999
col m_status format a7
col mo_status format a7
col h_status format a11
col name format a30
col path format a45
col failgroup format a15
select g.group_number g_n,
g.disk_number d_n,
g.name name,
g.failgroup,
g.mount_status m_status,
g.header_status h_status,
g.mode_status mo_status,
g.path ,
to_char(g.mount_date, 'YYYY/MM/DD HH24:MI:SS') m_date
from v$asm_disk g
order by g_n, d_n
/