Resolving ORA-752 or ORA-600 [3020] During Standby Recovery (Doc ID 1265884.1)

来源：互联网发布：淘宝联盟优惠券那里找编辑：程序博客网时间：2024/06/05 17:02

To Bottom

In this Document

Symptoms Cause Solution Determining the Extent of Corruption Determining Root Cause What actions can be taken when an ORA-752 lost write error is signalled? What action can be taken when an ORA-600 [3020] is signalled? Protecting Against Lost Writes Steps to recover from bug 11674485 Detection – how do we know “Lost write at the Primary” detected at the standby is a false positive? References

APPLIES TO:

Oracle Database - Enterprise Edition - Version 10.2.0.1 to 11.2.0.4 [Release 10.2 to 11.2]
Information in this document applies to any platform.

SYMPTOMS

Standby Redo Apply can terminate due to a failure of redo-data consistency checks, a problem called stuck recovery. Stuck recovery can occur when an underlying operating system or storage system loses a write issued by the Primary or Standby database during normal operation. Because there is an inconsistency between the information stored in the redo and the information stored in a database block being recovered, the database signals an internal error when applying the redo.

ORA-00600: internal error code, arguments: [3020], [2885689059], [1], [419819],[26750], [808], [], [] 
ORA-10567: Redo is inconsistent with data block (file# 1, block# 419819) 
ORA-10564: tablespace USER1 
ORA-01110: data file ’/oracle/datafiles/user1.dbf’

CAUSE

The ORA-600 [3020] stuck recovery error could occur on the Standby database for several reasons including: a lost write on the Primary, a lost write on the Standby, missing redo, or logical corruption on the primary resulting in an incomplete redo chain.

Note:With DB_LOST_WRITE_PROTECT enabled on the Primary and Standby, the Standby Redo Apply terminates with the ORA-752 error when a Primary lost write is detected.

CAUTION: If you have a RAC primary and you don’t have the Oracle RDBMS patch that has a fix for bug 11674485, the “lost write on the primary” detected at the Standby maybe a “false positive”. Please follow the steps described in Section “Steps to recover from bug 11674485” first to make this determination before you proceed any further.

ORA-752: recovery detected a lost write of a data block

This ORA-752 error indicates a lost write occurred on the Primary database. Oracle strongly recommends enabling DB_LOST_WRITE_PROTECT (and DB_BLOCK_CHECKSUM=FULL) for greater detection and protection from lost writes. Studies have shown the impact on the primary database is negligible.

SOLUTION

In the majority of cases, Standby stuck recovery errors indicate a corruption of the Primary database. No errors may have been reported on the Primary.

WARNING: Do not repair the Standby by restoring a backup taken on the Primary, as that will ensure that the Standby is also corrupt! The only exception is when the Standby is known to have a lost write, but this determination should be made by Oracle Support.

An ORA-752 error definitively identifies a lost write on the Primary. Consider failing over to the Standby immediately if data integrity is critical and some data loss is acceptable. Oracle Support should also be engaged immediately when an ORA-600 [3020] error occurs by opening a Service Request via My Oracle Support.

When media recovery encounters a problem, the alert log may indicate that recovery can continue if it is allowed to corrupt the data block causing the problem. The alert log contains information about the block: its block type, block address, the tablespace it belongs to, and so forth. For blocks containing user data, the alert log may also report the data object number.

For a block containing user data, you can query the database to find out which object or table owns this block. If the block belongs to an object that can be recreated or is unimportant then it might be advisable to allow recovery to proceed after marking the block corrupt. This procedure is covered in a subsequent section.

Determining the Extent of Corruption

To determine if the corruption is isolated run a diagnostic trial recovery, this scans the redo for problems but does not actually make any changes to the recovered database. Trial recovery reports any additional corruptions in the alert_<SID>.log. You can use the RECOVER ... TEST statement to invoke trial recovery. Refer to Document 283262.1 for additional details on trial recovery.

Determining Root Cause

Information that needs to be gathered and sent to Oracle Support immediately:

Complete or an extract from the alert.log covering at least the period from the last successful database startup.
RDA report (or at least init.ora or the spfile). See Document 314422.1
All tracefiles generated at and after the time of failure.
Any system and I/O subsytem log/error files covering the period from the last successful startup.
Dump of the controlfiles
SQL> alter session set events 'immediate trace name controlf level xx';
Dump of the datafile headers:
SQL> alter session set events 'immediate trace name file_hdrs level 10';
Dump of the redo log headers:
SQL> alter session set events 'immediate trace name redohdr level 10';

What actions can be taken when an ORA-752 lost write error is signalled?

Option 1: Determine if affected objects can be recreated and recovery allowed to continue:

First determine the affected objects. The alert log message will provide the datafile number along with the corresponding block number. For blocks containing user data, the alert log may also report the data object number. Using this information you can determine which objects are affected by the corruption:

Spool the output and keep them handy.
SQL> Select * from DBA_EXTENTS
     where FILE_ID=&file_number and
     &block_number BETWEEN BLOCK_ID and BLOCK_ID+BLOCKS-1;

If the error provides the object number determine the affected object with the following query:

SQL> Select * from DBA_OBJECTS    where DATA_OBJECT_ID = &object_number;
If feasible, drop and recreate the affected objects on the primary.
Once the objects have been recreated, use the following procedure to skip corrupted block on the standby:
1. Temporarily disable lost write protection on the standby:
  SQL> ALTER SYSTEM SET DB_LOST_WRITE_PROTECT = NONE;
2. Allow recovery to proceed in spite of block corruptions by running the RECOVER command with the ALLOW n CORRUPTION clause, where n is the number of allowable corrupt blocks.
  SQL> alter database recover automatic standby database
  allow 1 corruption;
3. Once the alert log indicates the blocks have been marked corrupt, restart managed recovery.
  SQL> alter database recover cancel;
  SQL> alter database recover managed standby database
  using current logfile disconnect;

Option 2: Activate the standby database

If the affected objects cannot be recreated then activate the standby database. By activating the standby database you will be experiencing data loss but data integrity will be ensured.

Issue the following SQL statement on the standby database to convert it to a primary:
SQL> ALTER DATABASE ACTIVATE STANDBY DATABASE;

If the standby database is managed by the Data Guard broker, issue the following Data Guard broker command to perform an immediate failover to the standby database:

DGMGRL> FAILOVER TO database-name IMMEDIATE;

Note: Under certain situations, "alter database activate standby database" command can fail on a physical standby database because of inability to archive the final partial standby redo logs. To work around the failure to archive the partial archived redo logs, please perform the following steps:

1) Shutdown & Mount one instance of the physical standby database
2) Drop all the standby redo log groups which are ACTIVE using ALTER DATABASE DROP LOGFILE GROUP <standby_group_#>. The group# of the standby redo log groups can be found by doing a query on V$standby_log
3) Repeat the 'alter database activate standby database' command.
4) Take a full backup of the new production database.

2. Back up the new primary. Performing a backup immediately is a necessary safety measure, because you cannot recover changes made after the failover without a complete backup copy of the database. As a result of the failover, the original primary database can no longer participate in the Data Guard configuration, and all other standby databases will now receive and apply redo data from the new primary database.

3. Open the new primary database.

4. An optional step is to recreate the failed primary as a physical standby. This can be done using the database backup taken at the new primary in step 3. (You cannot use flashback database or the Data Guard broker to reinstantiate the old primary database in this situation.)

Be aware that a physical standby created using the backup taken from the new primary will have the same datafiles as the old standby. Therefore, any undetected lost writes that the old standby had before it was activated will not be detected by the new standby, since the new standby will be comparing the same blocks. Any new lost writes that happen on either the primary or the standby will be detected.

What action can be taken when an ORA-600 [3020] is signalled?

Engage Oracle Support immediately when an ORA-600 [3020] error occurs. Be prepared to supply the information listed under the “Determining the Root Cause” section when opening a Service Request via My Oracle Support.

Protecting Against Lost Writes

Protect against lost write by setting DB_LOST_WRITE_PROTECT to TYPICAL on primary and standby databases. By doing so the physical standby Redo Apply process will compare the block SCN on the standby to the block SCN stored in the primary redo stream (when the block was read) to decide whether there is a lost write on the primary. Redo Apply is able to do this using the additional information logged at the primary when DB_LOST_WRITE_PROTECT is enabled. If the block SCN on the primary database is lower than on the standby database, then it detects a lost write on the primary database and throws an external error (ORA-752).

Oracle strongly recommends enabling DB_LOST_WRITE_PROTECT and DB_BLOCK_CHECKSUM=FULL for greater detection and protection from lost writes. Studies have shown the impact on the primary database is negligible.

Note that DB_LOST_WRITE_PROTECT is only available in Oracle 11g and later.

Steps to recover from bug 11674485

These steps in this section will help you determine if you have hit the bug 11674485 and explain the subsequent workaround necessary to repair the standby database.

If you detect you have hit bug 11674485 (see below for details), the primary has not lost a write nor does it have a corruption. The only workaround necessary is to recover the datafile at the standby that has been marked with a corrupt block by managed standby recovery when the false positive was triggered.

Detection – how do we know “Lost write at the Primary” detected at the standby is a false positive?

When standby recovery detects that a lost write that happened on the primary database, standby recovery signals ORA-752 and stops. Below is an example output from the standby alert log when this happens:

Hex dump of (file 786, block 657290) in trace file /diag/rdbms/gmfcdwp_lvt/gmfcdwp8/trace/gmfcdwp8_pr0m_28759.trc
STANDBY REDO APPLICATION HAS DETECTED THAT THE PRIMARY DATABASE
LOST A DISK WRITE OF BLOCK 657290, FILE 786
NO REDO AT OR AFTER SCN 10753662227200 CAN BE USED FOR RECOVERY.
Tue Apr 26 22:21:08 2011
Slave exiting with ORA-752 exception
Errors in file
/diag/rdbms/gmfcdwp_lvt/gmfcdwp8/trace/gmfcdwp8_pr0m_28759.trc:
ORA-00752: recovery detected a lost write of a data block
ORA-10567: Redo is inconsistent with data block (file# 786, block# 657290, file offset is 1089552384 bytes) <ORA-10564: tablespace ODS_DATA
ORA-01110: data file 786: '+DATA/gmfcdwp/datafile/ods_data_306.dbf'
ORA-10561: block type 'TRANSACTION MANAGED INDEX BLOCK', data object#4718509
Tue Apr 26 22:21:09 2011
Recovery Slave PR0M previously exited with exception 752

If the primary is RAC, then it is possible that this incident is caused by bug 11674485. When bug 11674485 is encountered, standby recovery declares that it found a primary lost write even though there was no lost write or any other corruption at the primary. If primary is not RAC, then bug 11674485 cannot occur.

In order to determine if a particular incident of ORA-752 was caused by bug 11674485, you need to look into the trace file of the recovery process that detected the lost write. In the case of example above, we need to look into trace file gmfcdwp8_pr0m_28759.trc.

The trace file should contain the following message:

STANDBY REDO APPLICATION HAS DETECTED THAT THE PRIMARY DATABASE

Immediately following the above message is information of the redo change vector header that triggered standby recovery to declare primary lost write. For example:

STANDBY REDO APPLICATION HAS DETECTED THAT THE PRIMARY DATABASE
LOST A DISK WRITE OF BLOCK 657290, FILE 786
The block read on the primary had SCN 10753662227193 (0x09c7.c83792f9) seq 1
(0x01)
 while expected to have SCN 10753662227199 (0x09c7.c83792ff) seq 1 (0x01)
The block was read at SCN 10753662227200 (0x09c7.c8379300), BRR:
CHANGE #1 TYP:0 CLS:32772 AFN:786 DBA:0x95ca078a OBJ:4718509 SCN:0x09c7.c83792f9 SEQ:1 OP:23.2 ENC:0 RBL:1

The lines starting with “The block was read at SCN…” and the immediate next line starting with “CHANGE #1 TYP:0 CLS:…” are important for the analysis below, so please make a note of these lines.

The same trace file will contain the redo dump of all relevant redo for the particular block (in the case of the example above, that block is file# 786, block# 657290). The particular redo record can be found as follows:

The redo record SCN should be the same SCN observed in the trace file message "The block was read at SCN ..., BRR:" In the case of the example above, the redo record SCN should be 10753662227200 (0x09c7.c8379300)
The redo record should have the exact matching information as the change vector header dump line above "CHANGE #1 TYP: 0 ...". (There may be multiple change vectors in the redo dump matching this line).

Below is an example of redo dump that matches the two criteria described above:

REDO RECORD - Thread:8 RBA: 0x000058.000a60b9.0038 LEN: 0x0034 VLD: 0x10
SCN: 0x09c7.c8379300 SUBSCN:  1 04/26/2011 21:59:30
CHANGE #1 TYP:0 CLS:32772 AFN:786 DBA:0x95ca078a OBJ:4718509
SCN:0x09c7.c83792f9 SEQ:1 OP:23.2 ENC:0 RBL:1
 Block Read - afn: 786 rdba: 0x95ca078a BFT:(1024,2513045386) non-BFT:
599,657290)
              scn: 0x09c7.c83792f9 seq: 0x01
              flags: 0x00008004 ( ckval ping )

The lost write detection is caused by bug 11674485 if both the following two conditions are met:

Change vector header has "RBL:1" set.
The change vector has the "ping" flag. (There may or may not be other flags set (e.g. chval) too, but most important thing is that "ping" flag is set).

Workaround for Bug 11674485

When bug 11674485 has been encountered managed standby recovery will marked one of the blocks in a datafile as corrupt. The name and number of the of the datafile is easily determined from the standby alert log line below:

Reading datafile '+DATA/gmfcdwp/datafile/ods_data_306.dbf' for corruption at rdba: 0x95ca078a (file 786, block 657290)

On the primary database force a checkpoint via the alter system checkpoint command or by putting the tablespace associated with the affected datafile in hot backup mode followed by taking it out of hot backup mode.
Copy the above datafile from the primary to the standby using the procedure outlined in the following MOS note:

How to Copy ASM datafile From Primary Database to Standby Database on ASM using RMAN (Doc ID 605234.1)

REFERENCES

NOTE:283262.1 - Trial Recovery
NOTE:314422.1 - Remote Diagnostic Agent (RDA) - Getting Started
BUG:11689702 - ORA-600 [3020] DURING RECOVERY AFTER DATAFILE RESIZE
NOTE:1302614.1 - Rman or User Managed Restore/Recovery Fails With Ora-600 [3020] if Datafile resize Operation was Performed.