Resolving ORA-752 or ORA-600 [3020] During Standby Recovery (Doc ID 1265884.1)
来源:互联网 发布:淘宝联盟优惠券那里找 编辑:程序博客网 时间:2024/06/05 17:02
In this Document
APPLIES TO:
Oracle Database - Enterprise Edition - Version 10.2.0.1 to 11.2.0.4 [Release 10.2 to 11.2]Information in this document applies to any platform.
SYMPTOMS
Standby Redo Apply can terminate due to a failure of redo-data consistency checks, a problem called stuck recovery. Stuck recovery can occur when an underlying operating system or storage system loses a write issued by the Primary or Standby database during normal operation. Because there is an inconsistency between the information stored in the redo and the information stored in a database block being recovered, the database signals an internal error when applying the redo.
ORA-10567: Redo is inconsistent with data block (file# 1, block# 419819)
ORA-10564: tablespace USER1
ORA-01110: data file ’/oracle/datafiles/user1.dbf’
CAUSE
The ORA-600 [3020] stuck recovery error could occur on the Standby database for several reasons including: a lost write on the Primary, a lost write on the Standby, missing redo, or logical corruption on the primary resulting in an incomplete redo chain.
Note:With DB_LOST_WRITE_PROTECT enabled on the Primary and Standby, the Standby Redo Apply terminates with the ORA-752 error when a Primary lost write is detected.
This ORA-752 error indicates a lost write occurred on the Primary database. Oracle strongly recommends enabling DB_LOST_WRITE_PROTECT (and DB_BLOCK_CHECKSUM=FULL) for greater detection and protection from lost writes. Studies have shown the impact on the primary database is negligible.
SOLUTION
In the majority of cases, Standby stuck recovery errors indicate a corruption of the Primary database. No errors may have been reported on the Primary.
An ORA-752 error definitively identifies a lost write on the Primary. Consider failing over to the Standby immediately if data integrity is critical and some data loss is acceptable. Oracle Support should also be engaged immediately when an ORA-600 [3020] error occurs by opening a Service Request via My Oracle Support.
When media recovery encounters a problem, the alert log may indicate that recovery can continue if it is allowed to corrupt the data block causing the problem. The alert log contains information about the block: its block type, block address, the tablespace it belongs to, and so forth. For blocks containing user data, the alert log may also report the data object number.
For a block containing user data, you can query the database to find out which object or table owns this block. If the block belongs to an object that can be recreated or is unimportant then it might be advisable to allow recovery to proceed after marking the block corrupt. This procedure is covered in a subsequent section.
Determining the Extent of Corruption
To determine if the corruption is isolated run a diagnostic trial recovery, this scans the redo for problems but does not actually make any changes to the recovered database. Trial recovery reports any additional corruptions in the alert_<SID>.log. You can use the RECOVER ... TEST statement to invoke trial recovery. Refer to Document 283262.1 for additional details on trial recovery.
Determining Root Cause
Information that needs to be gathered and sent to Oracle Support immediately:
- Complete or an extract from the alert.log covering at least the period from the last successful database startup.
- RDA report (or at least init.ora or the spfile). See Document 314422.1
- All tracefiles generated at and after the time of failure.
- Any system and I/O subsytem log/error files covering the period from the last successful startup.
- Dump of the controlfilesSQL> alter session set events 'immediate trace name controlf level xx';
- Dump of the datafile headers: SQL> alter session set events 'immediate trace name file_hdrs level 10';
- Dump of the redo log headers: SQL> alter session set events 'immediate trace name redohdr level 10';
What actions can be taken when an ORA-752 lost write error is signalled?
Option 1: Determine if affected objects can be recreated and recovery allowed to continue:
- First determine the affected objects. The alert log message will provide the datafile number along with the corresponding block number. For blocks containing user data, the alert log may also report the data object number. Using this information you can determine which objects are affected by the corruption:
Spool the output and keep them handy.SQL> Select * from DBA_EXTENTS
where FILE_ID=&file_number and
&block_number BETWEEN BLOCK_ID and BLOCK_ID+BLOCKS-1;
If the error provides the object number determine the affected object with the following query:
SQL> Select * from DBA_OBJECTS where DATA_OBJECT_ID = &object_number; - If feasible, drop and recreate the affected objects on the primary.
- Once the objects have been recreated, use the following procedure to skip corrupted block on the standby:
- Temporarily disable lost write protection on the standby: SQL> ALTER SYSTEM SET DB_LOST_WRITE_PROTECT = NONE;
- Allow recovery to proceed in spite of block corruptions by running the RECOVER command with the ALLOW n CORRUPTION clause, where n is the number of allowable corrupt blocks. SQL> alter database recover automatic standby database
allow 1 corruption; - Once the alert log indicates the blocks have been marked corrupt, restart managed recovery. SQL> alter database recover cancel;
SQL> alter database recover managed standby database
using current logfile disconnect;
- Temporarily disable lost write protection on the standby:
Option 2: Activate the standby database
If the affected objects cannot be recreated then activate the standby database. By activating the standby database you will be experiencing data loss but data integrity will be ensured.
- Issue the following SQL statement on the standby database to convert it to a primary: SQL> ALTER DATABASE ACTIVATE STANDBY DATABASE;
If the standby database is managed by the Data Guard broker, issue the following Data Guard broker command to perform an immediate failover to the standby database:
Note: Under certain situations, "alter database activate standby database" command can fail on a physical standby database because of inability to archive the final partial standby redo logs. To work around the failure to archive the partial archived redo logs, please perform the following steps:
1) Shutdown & Mount one instance of the physical standby database
2) Drop all the standby redo log groups which are ACTIVE using ALTER DATABASE DROP LOGFILE GROUP <standby_group_#>. The group# of the standby redo log groups can be found by doing a query on V$standby_log
3) Repeat the 'alter database activate standby database' command.
4) Take a full backup of the new production database.
2. Back up the new primary. Performing a backup immediately is a necessary safety measure, because you cannot recover changes made after the failover without a complete backup copy of the database. As a result of the failover, the original primary database can no longer participate in the Data Guard configuration, and all other standby databases will now receive and apply redo data from the new primary database.
3. Open the new primary database.
4. An optional step is to recreate the failed primary as a physical standby. This can be done using the database backup taken at the new primary in step 3. (You cannot use flashback database or the Data Guard broker to reinstantiate the old primary database in this situation.)
Be aware that a physical standby created using the backup taken from the new primary will have the same datafiles as the old standby. Therefore, any undetected lost writes that the old standby had before it was activated will not be detected by the new standby, since the new standby will be comparing the same blocks. Any new lost writes that happen on either the primary or the standby will be detected.
What action can be taken when an ORA-600 [3020] is signalled?
Engage Oracle Support immediately when an ORA-600 [3020] error occurs. Be prepared to supply the information listed under the “Determining the Root Cause” section when opening a Service Request via My Oracle Support.
Protecting Against Lost Writes
Protect against lost write by setting DB_LOST_WRITE_PROTECT to TYPICAL on primary and standby databases. By doing so the physical standby Redo Apply process will compare the block SCN on the standby to the block SCN stored in the primary redo stream (when the block was read) to decide whether there is a lost write on the primary. Redo Apply is able to do this using the additional information logged at the primary when DB_LOST_WRITE_PROTECT is enabled. If the block SCN on the primary database is lower than on the standby database, then it detects a lost write on the primary database and throws an external error (ORA-752).
Oracle strongly recommends enabling DB_LOST_WRITE_PROTECT and DB_BLOCK_CHECKSUM=FULL for greater detection and protection from lost writes. Studies have shown the impact on the primary database is negligible.
Note that DB_LOST_WRITE_PROTECT is only available in Oracle 11g and later.
Steps to recover from bug 11674485
These steps in this section will help you determine if you have hit the bug 11674485 and explain the subsequent workaround necessary to repair the standby database.
If you detect you have hit bug 11674485 (see below for details), the primary has not lost a write nor does it have a corruption. The only workaround necessary is to recover the datafile at the standby that has been marked with a corrupt block by managed standby recovery when the false positive was triggered.
Detection – how do we know “Lost write at the Primary” detected at the standby is a false positive?
When standby recovery detects that a lost write that happened on the primary database, standby recovery signals ORA-752 and stops. Below is an example output from the standby alert log when this happens:
STANDBY REDO APPLICATION HAS DETECTED THAT THE PRIMARY DATABASE
LOST A DISK WRITE OF BLOCK 657290, FILE 786
NO REDO AT OR AFTER SCN 10753662227200 CAN BE USED FOR RECOVERY.
Tue Apr 26 22:21:08 2011
Slave exiting with ORA-752 exception
Errors in file
/diag/rdbms/gmfcdwp_lvt/gmfcdwp8/trace/gmfcdwp8_pr0m_28759.trc:
ORA-00752: recovery detected a lost write of a data block
ORA-10567: Redo is inconsistent with data block (file# 786, block# 657290, file offset is 1089552384 bytes) <ORA-10564: tablespace ODS_DATA
ORA-01110: data file 786: '+DATA/gmfcdwp/datafile/ods_data_306.dbf'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object#4718509
Tue Apr 26 22:21:09 2011
Recovery Slave PR0M previously exited with exception 752
LOST A DISK WRITE OF BLOCK 657290, FILE 786
The block read on the primary had SCN 10753662227193 (0x09c7.c83792f9) seq 1
(0x01)
while expected to have SCN 10753662227199 (0x09c7.c83792ff) seq 1 (0x01)
The block was read at SCN 10753662227200 (0x09c7.c8379300), BRR:
CHANGE #1 TYP:0 CLS:32772 AFN:786 DBA:0x95ca078a OBJ:4718509 SCN:0x09c7.c83792f9 SEQ:1 OP:23.2 ENC:0 RBL:1
- The redo record SCN should be the same SCN observed in the trace file message "The block was read at SCN ..., BRR:" In the case of the example above, the redo record SCN should be 10753662227200 (0x09c7.c8379300)
- The redo record should have the exact matching information as the change vector header dump line above "CHANGE #1 TYP: 0 ...". (There may be multiple change vectors in the redo dump matching this line).
SCN: 0x09c7.c8379300 SUBSCN: 1 04/26/2011 21:59:30
CHANGE #1 TYP:0 CLS:32772 AFN:786 DBA:0x95ca078a OBJ:4718509
SCN:0x09c7.c83792f9 SEQ:1 OP:23.2 ENC:0 RBL:1
Block Read - afn: 786 rdba: 0x95ca078a BFT:(1024,2513045386) non-BFT:
599,657290)
scn: 0x09c7.c83792f9 seq: 0x01
flags: 0x00008004 ( ckval ping )
- Change vector header has "RBL:1" set.
- The change vector has the "ping" flag. (There may or may not be other flags set (e.g. chval) too, but most important thing is that "ping" flag is set).
- On the primary database force a checkpoint via the alter system checkpoint command or by putting the tablespace associated with the affected datafile in hot backup mode followed by taking it out of hot backup mode.
- Copy the above datafile from the primary to the standby using the procedure outlined in the following MOS note:
How to Copy ASM datafile From Primary Database to Standby Database on ASM using RMAN (Doc ID 605234.1)
REFERENCES
NOTE:283262.1 - Trial Recovery
NOTE:314422.1 - Remote Diagnostic Agent (RDA) - Getting Started
BUG:11689702 - ORA-600 [3020] DURING RECOVERY AFTER DATAFILE RESIZE
NOTE:1302614.1 - Rman or User Managed Restore/Recovery Fails With Ora-600 [3020] if Datafile resize Operation was Performed.