RMAN Restore Performance


Quite often when considering RMAN restore performance, the first things that spring to mind are the speed of the tape device, the media manager configuration, the restore script and the number of channels being allocated.  Yet the backup itself, and how it was originally written, can have a huge impact on restore performance, and that is what this note is primarily concerned with.

This note is intended for Database Administrators and Support personnel investigating RMAN restore performance, specifically from a tape backup.  It looks at issues related to the processing of RMAN metadata and the retrieval of data from tape, and at how to determine, when a restore is performing badly, where the time is being spent: in Oracle or in the Media Manager Layer.  Problems related to writing back to disk/storage devices are outside the scope of this document.

A basic understanding of RMAN and Oracle is assumed.


Backup Fundamentals

  • RMAN creates one or more backupsets.
  • Each backupset maps to one or more backuppieces.
  • Each backuppiece maps to one or more physical tapes.
  • RMAN will try to balance the workload across the allocated channels during backup such that each channel processes the same number of input blocks (based on physical datafile size); files from the same tablespace may therefore end up in different backuppieces and on different physical tapes, which is worth remembering when specifying a tablespace name for restore. The file:backuppiece mapping can be checked with the query shown after this list.
  • Allocating more channels than there are available tape devices during backup results in several backuppieces being written to the same tape; as the channels process in parallel, the backuppieces are interleaved on the tape. This is referred to as HARDWARE MULTIPLEXING, which improves backup performance at the cost of restore performance.
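
The mapping can be inspected from the target database controlfile views (a recovery catalog holds the same information in RC_BACKUP_DATAFILE and RC_BACKUP_PIECE). A minimal sketch, joining each backupset's datafiles to its pieces:

col handle format a50
SELECT d.file#, p.piece#, p.handle, p.device_type
FROM v$backup_datafile d, v$backup_piece p
WHERE d.set_stamp = p.set_stamp
AND d.set_count = p.set_count
ORDER BY d.file#, p.piece#;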

Restore Fundamentals

  • In general a restore script should allocate the same number of channels as was used in the original backup; this allows RMAN to restore files in the same order that they were written and with the same degree of parallelism.
  • RMAN knows from inspecting backup metadata exactly how a database backup was written in terms of file:backuppiece mapping and will generate PL/SQL such that files in the same backuppiece are restored together.
  • A channel can only process one backuppiece at a time, so the level of parallelism achieved during a restore is determined by the file:backuppiece:tape mapping and NOT by the number of channels allocated.
  • Files are multiplexed into a backuppiece with the file header from each file written LAST; during restore, RMAN knows that all blocks for a particular file have been retrieved once its file header is read.
  • The most efficient restore is a restore of the whole database, or of ALL the files in a particular backuppiece, such that every block read is written back to disk.
  • The higher the filesperset value used during backup, the larger the backuppiece and the more inefficient the restore of a single file.
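
Since filesperset is fixed at backup time, restore efficiency for single files is largely determined when the backup is taken. A minimal sketch of limiting it (the sbt parms are site-specific):

run {
allocate channel t1 type sbt....;
backup database filesperset 4;
}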

Restore Performance

RMAN restore performance may be poor if:

  • The backup was hardware multiplexed
  • Individual files or tablespaces are being restored
  • A different number of channels is used for restore

Worst case scenario:

  • a single file is being restored
  • the header of the file being restored is the last block written to the backuppiece
  • the backuppiece size is very large
  • a very high level of multiplexing has been used
  • the backuppiece is hardware multiplexed with a number of other backuppieces onto a single tape

Case Histories

Consider the following backups:

The first backup is not hardware multiplexed; each of the 3 channels writes its own 8Gb backuppiece to its own tape device (a 24Gb database backup in total).

  • A full restore using 3 channels is the most efficient as all blocks are restored in a single scan of each backuppiece
  • As 3 tape devices are accessible concurrently, the restore is parallelised
  • A restore of a single file will scan at most 8Gb (a third of the total database backup)
  • Allocating more than 3 channels will not make the restore run any faster and results in idle channels as no more than 3 backuppieces are processed
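
For this layout, the full restore script simply mirrors the backup, e.g. (sbt parms are site-specific):

run {
allocate channel t1 type sbt....;
allocate channel t2 type sbt....;
allocate channel t3 type sbt....;
restore database;
}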

The second backup is hardware multiplexed; all 3 channels write to a single tape device, so the three backuppieces are interleaved on the tape.

  • A full restore using 3 channels will request all 3 backuppieces at the same time, so the media manager MUST ALSO SUPPORT MULTIPLEXED BACKUPS if ALL three backuppieces are to be retrieved with a single scan of the tape.
  • Oracle 10g allows media managers to tell RMAN whether or not multiplexing is supported – be aware of Note 433335.1: RMAN Restore Of Incremental Backups Will Not Parallelise and Uses Only One Channel
  • Certain media managers have specific parameters that affect restore performance from multiplexed backups – always check with your media manager vendor.  
  • A restore of a single file needs a scan of potentially 24Gb (the whole database backup) depending on where the header of the file being restored is positioned on the tape. 

If the media manager does NOT support multiplexing, only ONE backuppiece can be returned at a time; bearing in mind that each backuppiece is multiplexed with two others, this restore would require THREE scans of at most 24Gb each (with a tape rewind after each scan).

The third backup is NOT hardware multiplexed; each channel writes to its own tape device with a multiplex level of 4, creating a single 32Gb backuppiece spanning 4 tapes.  A full restore using 3 channels is the most efficient as the restore is parallelised with each tape read ONCE.

Particular care must be taken when restoring individual files/tablespaces.

Consider the following script:

 

run {  
allocate channel t1 type sbt....; 
allocate channel t2 type sbt....; 
allocate channel t3 type sbt....; 
restore datafile 2; 
restore datafile 3; 
restore datafile 5; 
restore datafile 7; 
restore datafile 11; 
}

 

  • RMAN commands within a script are executed serially – so this restore is NOT parallelised
  • Only one channel is ever in use at any one time
  • Each restore retrieves a single file and potentially scans up to 32Gb of backup data
  • A tape rewind is required after each restore
  • Backuppiece 1 is scanned 3 times
  • Backuppiece 2 is scanned twice

A better script to maximise throughput and parallelise the restore uses a single restore command:

 

run {   
allocate channel t1 type sbt....;  
allocate channel t2 type sbt....;  
allocate channel t3 type sbt....;  
restore datafile 2,3,5,7,11;
}

 

RMAN will work out how to balance the restore across the allocated channels based on file:backuppiece mappings such that:

  • Files 2,5,7 are restored in a single scan of backuppiece 1
  • Files 3 and 11 are restored in a single scan of backuppiece 2
  • If only one channel is allocated then only one backuppiece can be processed at a time and the restore will take longer
  • Two channels will allow both backuppieces to be processed in parallel
  • The script allocates 3 channels because this is what was used for backup; however, the 3rd channel will be idle
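
The actual file:backuppiece mapping can be confirmed before planning a restore, e.g.:

RMAN> list backup of datafile 2,3,5,7,11;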

If available disk space is limited such that no more than N files can be restored at any one time, plan the restore according to the file:backuppiece mappings and group together files from the same backuppiece so that each backuppiece is scanned no more than once. You can, if you wish, assign datafiles to individual channels:

 

run {     
allocate channel t1 type sbt....;    
allocate channel t2 type sbt....;    
allocate channel t3 type sbt....;    
restore datafile 2, 5, 7 channel t1   
datafile 3 channel t2   
datafile 10 channel t3;       
}   

 

Be careful when restoring tablespaces and do not assume that a single tablespace restore requires a single channel.  This script is fine if all files for tablespace X were written to a single backuppiece:

 

run {   
allocate channel t1 type sbt....;   
restore tablespace X;   
}

 

However, if tablespace X comprises files 8, 9 and 10, each written to a different backuppiece, then the following script can run up to 3 times faster as the restore is parallelised:

 

run {
allocate channel t1 type sbt....;
allocate channel t2 type sbt....;
allocate channel t3 type sbt....;
restore tablespace X;
}

 

Restore Performance Checklist 

The following checklist helps to assess the current restore configuration and performance. Working through it will identify where the time is being spent during a restore; once this is known, action can be taken to resolve the issue. If restore performance is still poor after working through the checklist, use it to collect diagnostics: capture the results from each point and raise an SR with Oracle Support Services, uploading the results along with any trace files generated.

a. RMAN restore performance may be poor if any of the following is true:

  • The backup was hardware multiplexed
  • Individual files or tablespaces are being restored
  • A different number of channels is used for restore

Get the RMAN backup script and the backup log for the backup that is being restored and check whether any of the 3 cases described above applies – if so, amend the restore script accordingly.

Assuming that a full database restore is being done and poor restore performance cannot be explained by any of the scenarios above, there are THREE potential areas to investigate:

  • Oracle
  • Disk IO
  • Media Manager Layer

b.  Oracle

Before a restore even begins, the controlfile is queried, datafile headers are read, media managers are initialised, the catalog is resynced, PL/SQL is generated and compiled, and RMAN metadata (held in the controlfile or the catalog) is queried. All this incurs IO against the controlfile, the datafile headers and the catalog.

How long does it take before the physical restore starts? Check the RMAN log, e.g.:

 

Recovery Manager: Release 10.2.0.4.0 Production on Thu Sep 18 10:22:15 2008   
.....  
RMAN-03090: Starting restore at 18-SEP-08 10:23:35 2008 

 

Physical restore takes a long time to start

It should take at most a few minutes (dependent on the amount of metadata to be processed) for the restore to start. If a catalog is used, run the restore again without the catalog connection; if the ‘time to start’ improves dramatically then the problem lies in catalog performance – see Note 748257.1: RMAN Troubleshooting Catalog Performance Issues. If using nocatalog makes no difference then most likely the problem lies in processing the controlfile metadata – refer to the same note for guidelines on troubleshooting this.
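
For example, if the usual restore connects to a catalog, re-run the same restore connected to the target only (the log file name here is illustrative):

%rman target / nocatalog log rman_nocat.log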

Physical restore starts very quickly

Processing of RMAN metadata is clearly not an issue if the physical restore starts very quickly.
Once the RMAN-03090 message is logged, each channel passes the names of the backuppieces to be restored to the media manager for validation – this step must complete before the backuppieces can be opened and read. Once a backuppiece is being read, the channel process is primarily concerned with processing IO. So we need to determine where the time is being spent – in the media manager or in writing back to disk:

 

RMAN> restore X validate;

 

Where X is a database | tablespace | datafile specification. When using the VALIDATE option:

  • only the backuppiece is scanned
  • no physical files are restored to disk
  • the difference in runtime between this and the normal restore represents time spent reading from tape in the media manager layer
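
For a full database comparison, run the validate with the same channel configuration as the real restore, e.g.:

run {
allocate channel t1 type sbt....;
allocate channel t2 type sbt....;
allocate channel t3 type sbt....;
restore database validate;
}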

c. Disk IO

If the runtime for VALIDATE is very fast then there is a bottleneck writing back to disk.  Refer to Note 850988.1: RMAN Restore Performance from Tape is Very Poor – check the disk buffer size used when writing back to disk per this note.

If restoring to ASM, you may find setting the following two parameters useful:

_BACKUP_KSFQ_BUFCNT
_BACKUP_KSFQ_BUFSZ

The default is 16 for _BACKUP_KSFQ_BUFCNT and 1 MB for _BACKUP_KSFQ_BUFSZ. These defaults are acceptable for ASM diskgroups with fewer than 32 disks and an ASM stripe size of 1 MB; otherwise set _BACKUP_KSFQ_BUFCNT to the number of disks in the ASM diskgroup and _BACKUP_KSFQ_BUFSZ to the size of the ASM stripe.
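
As a sketch, for (say) a 64-disk diskgroup with a 4 MB stripe (illustrative values; assumes an spfile, and as with any underscore parameter, change only after confirming with Oracle Support):

SQL> alter system set "_backup_ksfq_bufcnt"=64 scope=both;
SQL> alter system set "_backup_ksfq_bufsz"=4194304 scope=both;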

Where the database resides on ASM and the platform is Linux, make sure that ASMLib has been installed and configured correctly (Note 394953.1: Tips On Installing and Using ASMLib on Linux).
Bearing in mind that RMAN is simply a piece of software making IO requests to the OS, the physical implementation of the datafiles, the disk sub-system and how the OS is handling the IO requests may need to be investigated in collaboration with storage and OS vendors; this is outside the scope of this document.

d. Media Manager Layer

If restore validate performance is still very slow then the bulk of the time is being spent in reading from tape and we need to investigate what is happening in the media manager layer: are the backuppieces still being validated or are we scanning the backuppieces?

Check the RMAN processes and wait events:

 

col program format a20
col action format a20
SELECT s.sid, p.spid, s.program, s.client_info, s.action, seq#, event, wait_time,
       seconds_in_wait AS sec_wait
FROM v$session s, v$process p
WHERE s.paddr = p.addr AND s.program like '%rman%';

 

Run the above query several times and note the value of seq# and event for the channel process.
If neither seq# nor event changes then the restore process has hung; the event will identify the resource that we were waiting for when the hang occurred. A wait for any sbt resource indicates a hang in the media manager layer.
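
Progress can also be tracked via v$session_longops; if SOFAR stops increasing across repeated runs of the following, the restore is not moving:

SELECT sid, serial#, opname, sofar, totalwork,
       round(sofar/totalwork*100, 2) AS pct_done
FROM v$session_longops
WHERE opname LIKE 'RMAN%'
AND totalwork > 0
AND sofar <> totalwork;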

Before killing the RMAN processes, get a process stack (pstack) if you can, or use truss/strace/tusc or a similar utility on the channel process (identified by the spid value above) to see if you can determine what the process is doing. A hang in the media manager layer must be investigated by the media manager vendor (unless, of course, you are using Oracle Secure Backup) as, unfortunately, Oracle Support does not have access to third party media manager software or code. However, the following traces will supply additional information that can be passed to the media manager vendor:

SBT Trace:

 

allocate channel t1 type sbt parms 'env=........' trace=2; 

 

Look for the trace file in the target's udump directory (USER_DUMP_DEST). The process id (spid) of the RMAN channel servers can be identified using:

 

SQL> SELECT s.client_info, s.sid, p.spid, s.program, s.action
     FROM v$session s, v$process p
     WHERE s.paddr = p.addr AND s.program like '%rman%';

 

The trace file for a channel will include the spid value in the file name: <sid>_ora_<spid>.trc

Example trace file from a hung session:

 

*** SESSION ID:(65.29461) 2008-07-29 10:27:29.937  
skgfgdvi(se=0xffff79c0, ctx=0x101b8600, dev=0x102205b0)  
skgfidev(se=0xffff79c0 ctx=0x101b8600, dev=0x102205b0)  
skgfalo(se=0xffff7de0, ctx=0x101b8600, dev=0x102205b0, devparms=ENV=(NSR_SERVER=<svr>,NSR_GROUP=<grp>), flags=33554432)  
skgfidev(se=0xffff7de0 ctx=0x101b8600, dev=0x102205b0)  
skgfidev(): processing: ENV=(NSR_SERVER=<svr>,NSR_GROUP=<grp>)  
skgfidev(): setting environment variable: NSR_SERVER=<svr>  
skgfidev(): setting environment variable: NSR_GROUP=<grp>  
entering sbtinit on line 2201  
return from sbtinit on line 2211  
skgfqsbi(ctx=0x101b8600, vtapi=API Version 1.1, id=MMS Version 2.2.0.1)  
skgfqcre(se=0xffff7530, ctx=0x101b8600, dev=0x102205b0, file=0x1099ce50, fparms=, flags=0x0)  
entering sbtopen on line 670

 

Each entry into and exit from the media manager layer is logged in the trace file, as shown in the calls to sbtinit above. A hang in the media manager appears as an entry into a media manager module (entering sbtX...) with no corresponding exit (return from sbtX…), as in the call to sbtopen above.

RMAN debug trace

It is often useful to accompany sbt trace with RMAN debug trace to see what RMAN is doing at the same time:

 

%rman target / log rman.log trace rman.trc  
run{  
allocate channel t1 type sbt.... trace=2;
allocate channel t2 type sbt.... trace=2;
allocate channel t3 type sbt.... trace=2;
debug on;  
restore database;  
debug off;  
}  

 

The debug trace will be written to file rman.trc.

Sbtio.log

Look for the sbtio.log file – this is the only file in the Oracle file system that is written to by 3rd party media managers; Oracle does not write to this file. The amount of information in this file varies according to the media manager vendor and some vendors only write to this file if diagnostics are specifically set.

Media Manager trace

Contact your media manager vendor for any other media-manager-specific environment variables that can be set to obtain further diagnostics.

 
