Tuning Inter-Instance Performance in RAC and OPS (Doc ID 181489.1)



APPLIES TO:

Oracle Database - Enterprise Edition - Version 9.0.1.0 to 10.2.0.4 [Release 9.0.1 to 10.2]
Generic UNIX

This note was written to help DBAs and Support Analysts understand Inter-Instance
Performance and Tuning in RAC.

Real Application Clusters uses the interconnect to transfer blocks and messages 
between instances. If inter-instance performance is bad, almost all database 
operations can be delayed. This note describes methods of identifying and 
resolving inter-instance performance issues.


TUNING INTER-INSTANCE PERFORMANCE IN RAC



SYMPTOMS OF INTER-INSTANCE PERFORMANCE PROBLEMS
-----------------------------------------------

The best way to monitor inter-instance performance is to take AWR or statspack 
snaps on each instance (at the same time) at regular intervals. 

If there are severe inter-instance performance issues or hung sessions, you 
may also want to run the racdiag.sql script from the following note 
to collect additional RAC specific data:

Note: 135714.1 
Script to Collect RAC Diagnostic Information (racdiag.sql) 

The output of the script has tips for how to read the output. 

Within the AWR, statspack report, or racdiag.sql output, you can use the wait 
events and global cache statistics to monitor inter-instance performance. It 
will be important to look for symptoms of inter-instance performance issues. 
These symptoms include:

1. The average cr or current block receive time will be high. This value is calculated 
by dividing the 'global cache cr block receive time' statistic by the 
'global cache cr blocks received' statistic (in the case of cr blocks):

global cache cr block receive time
----------------------------------
global cache cr blocks received

Multiply the result by 10 to express the average time in milliseconds (these 
statistics are recorded in centiseconds). In a
9.2 statspack report you can also use the following Global Cache Service Workload 
characteristics:

Ave receive time for CR block (ms): 4.1

The following query can also be run to monitor the average cr block receive time 
since the last startup:

set numwidth 20
column "AVG CR BLOCK RECEIVE TIME (ms)" format 9999999.9
select b1.inst_id,
       b2.value "GCS CR BLOCKS RECEIVED",
       b1.value "GCS CR BLOCK RECEIVE TIME",
       ((b1.value / b2.value) * 10) "AVG CR BLOCK RECEIVE TIME (ms)"
from   gv$sysstat b1, gv$sysstat b2
where  b1.name = 'global cache cr block receive time'
and    b2.name = 'global cache cr blocks received'
and    b1.inst_id = b2.inst_id ;

The average cr or current block receive time is the average latency of a 
consistent-read request round trip from the requesting instance to the holding 
instance and back to the requesting instance. It should typically be less than 
15 milliseconds, depending on your system configuration and volume. 
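
A similar query can be used to monitor the average current block receive time 
since the last startup. This is a sketch based on the 9i statistic names used 
above (in 10g the equivalent statistics use a 'gc' prefix instead of 
'global cache'):

set numwidth 20
column "AVG CUR BLOCK RECEIVE TM (ms)" format 9999999.9
select b1.inst_id,
       b2.value "GCS CUR BLOCKS RECEIVED",
       b1.value "GCS CUR BLOCK RECEIVE TIME",
       ((b1.value / b2.value) * 10) "AVG CUR BLOCK RECEIVE TM (ms)"
from   gv$sysstat b1, gv$sysstat b2
where  b1.name = 'global cache current block receive time'
and    b2.name = 'global cache current blocks received'
and    b1.inst_id = b2.inst_id ;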

Please note that if you are on 9i and the global cache current block receive 
time is abnormally high and the average wait time for the 'global cache null 
to x' wait event is low (under 15ms) then you are likely hitting bug 2130923 
(statistics bug). This is a problem in the way statistics are reported and does 
not impact performance.

More about that issue is documented in the following note:

Note: 243593.1 
RAC: Ave Receive Time for Current Block is Abnormally High in Statspack 
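
To check the current average wait time for the 'global cache null to x' event 
described above, a query along these lines can be used (a sketch against 
gv$system_event; AVERAGE_WAIT is reported in centiseconds in these releases, 
hence the multiplication by 10):

-- Average wait for 'global cache null to x' on each instance, in ms.
select inst_id, event, total_waits, time_waited,
       (average_wait * 10) "AVG WAIT (ms)"
from   gv$system_event
where  event = 'global cache null to x'
order by inst_id ;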

2. "Global cache" or "gc" events will be the top wait event. Some of these wait
events show the amount of time that an instance has requested a data block for a 
consistent read or current block via the global cache. 


When a consistent read buffer cannot be found in the local cache, an attempt is 
made to find a usable version in another instance. There are 3 possible outcomes, 
depending on whether any instance in the cluster has the requested data block 
cached or not: 

a) A cr block is received (i.e. another instance found or managed to produce the 
wanted version). The "global cache cr blocks received" statistic is incremented. 
b) No other instance has the block cached, and therefore the requesting instance 
needs to read it from disk, but a shared lock is granted to the requestor. 
The "global cache gets" statistic is incremented. 
c) A current block is received (the current block is good enough for 
the query). The "global cache current blocks received" statistic is 
incremented.

In all three cases, the requesting process may wait on the 'global cache cr 
request' event. The view X$KCLCRST (CR Statistics) may be helpful in debugging 
'global cache cr request' wait issues. It returns the number of requests that were 
handled for data or undo header blocks, the number of requests resulting in the 
shipment of a block (cr or current), and the number of times a read-from-disk 
status was returned.

It should be noted that having 'global cache' or 'gc' waits does not always
indicate an inter-instance performance issue. Many times this wait is 
completely normal if data is read and modified concurrently on multiple
instances. Global cache statistics should also be examined to determine if 
there is an inter-instance performance problem.
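
As a quick check for sessions currently waiting on these events, a sketch such as 
the following can be run against gv$session_wait (the LIKE patterns cover both 
the 9i 'global cache ...' and 10g 'gc ...' event names):

-- Sessions currently waiting on global cache events, per instance.
select inst_id, sid, event, p1, p2, p3, wait_time, seconds_in_wait
from   gv$session_wait
where  (event like 'global cache%'
        or event like 'gc cr%'
        or event like 'gc current%')
order by inst_id, seconds_in_wait desc ;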

3. The GCS/GES may run out of tickets. When viewing the racdiag.sql output 
(Note 135714.1) or querying the gv$ges_traffic_controller or 
gv$dlm_traffic_controller views, you may find that TCKT_AVAIL shows '0'. Tickets 
represent the available network buffer space: the maximum number of tickets 
available is a function of the network send buffer size. The LMD and LMON 
processes always buffer their messages in case no tickets are available. A node 
relies on messages coming back from the remote node to release tickets for reuse. 
A query to check ticket usage is shown below.
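
This is only a sketch; the TCKT_* columns are those shown in the racdiag.sql 
output, and the view name differs by release (gv$dlm_traffic_controller in 9i, 
gv$ges_traffic_controller in 10g):

-- Ticket availability per remote node; TCKT_AVAIL = 0 indicates that the
-- instance has run out of tickets.
select inst_id, local_nid, remote_nid, tckt_avail, tckt_limit, tckt_wait
from   gv$dlm_traffic_controller
order by tckt_avail ;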

4. The above information should be enough to identify an inter-instance performance
problem, but additional calculations for monitoring inter-instance performance can 
be found in the documentation.


IDENTIFYING AND RESOLVING INTER-INSTANCE PERFORMANCE PROBLEMS
-------------------------------------------------------------

Inter-Instance performance issues can be caused by:

1. Under-configured network settings at the OS level. Check UDP or other network 
protocol settings and tune them. See your OS-specific documentation for 
instructions on how to do this. If using UDP, make sure the parameters relating to 
send buffer space, receive buffer space, send highwater, and receive highwater are 
set well above the OS defaults. The alert.log indicates which protocol is being 
used. Example:

cluster interconnect IPC version:Oracle RDG
IPC Vendor 1 proto 2 Version 1.0

Changing network parameters to optimal values:

Sun (UDP Protocol) 
UDP related OS parameters can be queried with the following commands:
ndd /dev/udp udp_xmit_hiwat
ndd /dev/udp udp_recv_hiwat 
ndd /dev/udp udp_max_buf 
Set udp_xmit_hiwat and udp_recv_hiwat to the OS maximum with:
ndd -set /dev/udp udp_xmit_hiwat <value>
ndd -set /dev/udp udp_recv_hiwat <value>
ndd -set /dev/udp udp_max_buf <1M or higher>
IBM AIX (UDP Protocol)
UDP related OS parameters can be queried with the following command:
no -a
Set udp_sendspace and udp_recvspace to the OS maximum with:
no -o udp_sendspace=<value>
no -o udp_recvspace=<value>
Linux (edit files)
/proc/sys/net/core/rmem_default 
/proc/sys/net/core/rmem_max
/proc/sys/net/core/wmem_default
/proc/sys/net/core/wmem_max 
HP-UX (HMP Protocol):
The file /opt/clic/lib/skgxp/skclic.conf contains the Hyper Messaging Protocol (HMP)
configuration parameters that are relevant for Oracle:
- CLIC_ATTR_APPL_MAX_PROCS - Maximum number of Oracle processes. This includes 
  the background and shadow processes. It does not include non-IPC processes 
  like SQL client processes.
- CLIC_ATTR_APPL_MAX_NQS - This is a derivative of the first parameter; it will 
  be removed in the next release. For the time being, this should be set to 
  the value of CLIC_ATTR_APPL_MAX_PROCS.
- CLIC_ATTR_APPL_MAX_MEM_EPTS - Maximum number of buffer descriptors. Oracle 
  seems to require about 1500-5000 of them depending on the block size (8K or 
  2K). You can choose the maximum value indicated above.
- CLIC_ATTR_APPL_MAX_RECV_EPTS - Maximum number of Oracle ports. Typically, 
  Oracle requires as many ports as there are processes. Thus it should be 
  identical to CLIC_ATTR_APPL_MAX_PROCS.
- CLIC_ATTR_APPL_DEFLT_PROC_SENDS - Maximum number of outstanding sends. You 
  can leave it at the default value of 1024.
- CLIC_ATTR_APPL_DEFLT_NQ_RECVS - Maximum number of outstanding receives on a 
  port or buffer. You can leave it at the default value of 1024.
HP-UX (UDP Protocol):
Not tunable before HP-UX 11i Version 1.6.
On HP-UX 11i Version 1.6 or later, the commands below can be used to set 
socket_udp_rcvbuf_default and socket_udp_sndbuf_default:
ndd -set /dev/udp socket_udp_rcvbuf_default 1048576
echo $?
ndd -set /dev/udp socket_udp_sndbuf_default 65535
echo $? 
HP Tru64 (RDG Protocol):
RDG related OS parameters are queried with the following command:
/sbin/sysconfig -q rdg 
The most important parameters and settings are:
- rdg_max_auto_msg_wires - MUST be set to zero.
- max_objs - Should be set to at least <# of Oracle processes * 5> and up to 
the larger of 10240 or <# of Oracle processes * 70>. Example: 5120
- msg_size - Needs to be set to at least the database block size, but we recommend 
setting it to 32768, since Oracle9i supports different block sizes for each 
tablespace.
- max_async_req - Should be set to at least 100 but 256+ may provide better 
performance.
- max_sessions - Should be set to at least <# of Oracle processes + 20>, 
example: 500 
HP Tru64 (UDP Protocol):
UDP related OS parameters are queried with the following command:
/sbin/sysconfig -q udp 
udp_recvspace 
udp_sendspace 


2. If the interconnect is slow, busy, or faulty, look for dropped packets,
retransmits, or cyclic redundancy check (CRC) errors. You can use netstat commands
to check the network. On Unix, check the man page for netstat for a list of options. 
Also check the OS logs for any errors and make sure that inter-instance traffic is 
not routed through a public network. 


With most network protocols, you can use 'oradebug ipc' to see which interconnects 
the database is using:

SQL> oradebug setmypid
SQL> oradebug ipc

This will dump a trace file to the user_dump_dest. The output will look something 
like this:

SSKGXPT 0x1a2932c flags SSKGXPT_READPENDING info for network 0
socket no 10 IP 172.16.193.1 UDP 43749
sflags SSKGXPT_WRITESSKGXPT_UP info for network 1
socket no 0 IP 0.0.0.0 UDP 0...

Here you can see that the instance is using IP 172.16.193.1 with the UDP protocol.


3. A large number of processes in the run queue waiting for CPU, or scheduling
delays, can also increase latency. If your CPUs have limited idle time and your 
system typically processes long-running queries, then latency may be higher. 
Ensure that the LMSx processes get enough CPU.
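
One simple way to verify this is to find the operating system PIDs of the LMS 
background processes and then check their CPU usage with OS tools such as ps or 
top. A minimal sketch joining gv$bgprocess to gv$process:

-- List the LMS background processes and their OS process IDs on each instance.
select b.inst_id, b.name, p.spid "OS PID"
from   gv$bgprocess b, gv$process p
where  b.paddr = p.addr
and    b.inst_id = p.inst_id
and    b.name like 'LMS%' ;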

4. Latency can be influenced by a high value for the DB_FILE_MULTIBLOCK_READ_COUNT 
parameter. This is because a requesting process can issue more than one request 
for a block depending on the setting of this parameter. 
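
The current setting can be checked on each instance with a query like the 
following sketch against gv$parameter:

-- Show the db_file_multiblock_read_count setting for each instance.
select inst_id, name, value
from   gv$parameter
where  name = 'db_file_multiblock_read_count' ;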


ADDITIONAL RAC AND OPS PERFORMANCE TIPS
---------------------------------------

1. Poor SQL or bad optimization paths can cause additional block gets via the
interconnect, just as they would via I/O. 

2. Tuning normal single instance wait events and statistics is also very 
important.

3. A poor gc_files_to_locks or db_files_to_locks setting can cause problems. In 
almost all cases in RAC, gc_files_to_locks does not need to be set at all. 

4. The use of locally managed tablespaces (instead of dictionary managed) with 
the 'SEGMENT SPACE MANAGEMENT AUTO' option can reduce dictionary and freelist 
block contention. Symptoms of this can include 'buffer busy' waits. See the 
following notes for more information (a creation example follows them):

Note: 105120.1
Advantages of Using Locally Managed vs Dictionary Managed Tablespaces 

Note: 103020.1 
Migration from Dictionary Managed to Locally Managed Tablespaces 

Note: 180608.1
Automatic Space Segment Management in RAC Environments
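
As an illustration, a locally managed tablespace with automatic segment space 
management can be created as in the sketch below; the tablespace name, datafile 
path, and sizes are examples only and should be adjusted for your environment:

-- Example: locally managed tablespace with automatic segment space management.
CREATE TABLESPACE app_data
  DATAFILE '/u01/oradata/db01/app_data01.dbf' SIZE 500M
  EXTENT MANAGEMENT LOCAL
  SEGMENT SPACE MANAGEMENT AUTO;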


Following these recommendations can help you achieve maximum performance in
your clustered environment.

