Oracle RAC GES/GCS Principles (Repost)

Source: Internet. Published by: 手机能编程么. Editor: 程序博客网. Time: 2024/05/22 01:29

1. RAC GES/GCS Principles (Part 1)

为了保证群集中的实例的同步，两个虚拟服务将被实现：全局排队服务(GES),它负责控制对锁的访问。全局内存服务（GCS)，控制对数据块的访问。

GES 是分布式锁管理器(DLM)的扩展，它是这样一个机制，可以用来管理oracle 并行服务器的锁和数据块。在一个群集环境中，你需要限制对数据库资源的访问，这些资源在单instance数据库中被latches 或者locks 来保护。比如说，在数据库字典内存中的对象都被隐性锁所保护，而在库高速缓存中的对象在被引用的时候，必须被pin所保护。

在RAC群集中，这些对象代表了被全局锁所保护的资源。GES 是一个完整的RAC组件，它负责和群集中的实例全局锁进行沟通，每个资源有一个主节点实例，这个实例记录了它当前的状态。而且，资源的当前的状态也记录在所有对这个资源有兴趣的实例上。

GCS,是另一个RAC组件，负责协调不同实例间对数据块的访问。对这些数据块的访问以及跟新都记录在全局目录中（GRD）,这个全局目录是一个虚拟的内存结构，在所有的实例中使用扩张。

每个块都有一个master实例，这个实例负责对GSD的访问进行管理，GSD里记录了这个块的当前状态信息。GCS 是oracle 用来实施Cache fusion的机制。被GCS 和GES 管理的块和锁叫做资源。对这些资源的访问必须在群集的多个实例中进行协调。这个协调在实例层面和数据库层面都有发生。实例层次的资源协调叫做本地资源协调；数据库层次的协调叫做全局资源协调。

本地资源协调的机制和单实例oracle的资源协调机制类似，包含有块级别的访问，空间管理，dictionary cache、library cache管理，行级锁，SCN 发生。全局资源协调是针对RAC的，使用了SGA中额外的内存组件、算法和后台进程。

GCS 和GES 从设计上就是在对应用透明的情况下设计的。换一句话来说，你不需要因为数据库是在RAC上运行而修改应用,在单实例的数据库上的并行机制在RAC上也是可靠地。

支持GCS 和GES的后台进程使用私网心跳来做实例之间的通讯。这个网络也被Oracle 的群集组件使用，也有可能被群集文件系统（比如OCFS)所使用。GCS 和GES 独立于 Oracle 群集组件而运行。但是，GCS 和GES依靠这些群集组件获得群集中每个实例的状态。如果这些信息不能从某个实例获得，这个实例将被关闭。这个关闭操作的目的是保护数据库的完整性，因为每个实例需要知道其他实例的情况，这样可以更好的确定对数据库的协调访问。

GES控制数据库中所有的 library cache锁和dictionary cache锁。这些资源在单实例数据库中是本地性的，但是到了RAC群集中变成了全局资源。全局锁也被用来保护数据的结构，进行事务的管理。一般说来，事务和表锁在RAC环境或是单实例环境中是一致的。

Oracle的各个层次使用相同的GES 功能来获得，转化以及释放资源。在数据库启动的时候，全局队列的个数将被自动计算。

GES 使用后台进程 LMD0和LCK0来执行它的绝大多数活动。一般来说，各种进程和本地的LMD0 后台进程沟通来管理全局资源。本地的LMD0 后台进程与别的实例上的 LMD0进程进行沟通。

LCK0 后台进程用来获得整个实例需要的锁。比如，LCK0 进程负责维护dictionary cache 锁。

影子进程(服务进程）与这些后台进程通过AST(异步陷阱）消息来通信。异步消息被用来避免后台进程的阻塞，这些后台进程在等待远端实例的的回复的时候将阻塞。后台进程也能发送 BAST(异步锁陷阱）来锁定进程，这样可以要求这些进程把当前的持有锁置为较低级限制的模式。

English original / translation

In order to achieve synchronization between the instances of a cluster, two virtual services are implemented: the Global Enqueue Service (GES), which controls access to locks, and the Global Cache Service (GCS), which controls access to blocks.

为了保证群集中的实例的同步，两个虚拟服务将被实现：全局排队服务(GES),它负责控制对锁的访问。全局内存服务（GCS)，控制对数据块的访问。

The GES is a development of the Distributed Lock Manager (DLM), which was the mechanism used to manage both locks and blocks in Oracle Parallel Server (OPS). Within a clustered environment, you need to restrict access to database resources that are typically protected by latches or locks in a single-instance database.

GES 是分布式锁管理器(DLM)的扩展，它是这样一个机制，可以用来管理oracle 并行服务器的锁和数据块。在一个群集环境中，你需要限制对数据库资源的访问，这些资源在单instance数据库中被latches 或者locks 来保护。

For example, objects in the dictionary cache are protected by implicit locks, and objects in the library cache must be protected by pins while they are being referenced.

比如说，在数据库字典内存中的对象都被隐性锁所保护，而在库高速缓存中的对象在被引用的时候，必须被pin所保护。

In a RAC cluster, these objects are represented by resources that are protected by global locks.

在RAC群集中，这些对象代表了被全局锁所保护的资源。

GES is an integrated RAC component that coordinates global locks between the instances in the cluster.

GES 是一个完整的RAC组件，它负责和群集中的实例全局锁进行沟通，

Each resource has a master instance that records its current status. In addition, the current status is recorded in all instances with an interest in that resource.

每个资源有一个主节点实例，这个实例记录了它当前的状态。而且，资源的当前的状态也记录在所有对这个资源有兴趣的实例上。

The GCS, which is another integrated RAC component, coordinates access to database blocks by the various instances. Block access and update are recorded in the Global Resource Directory (GRD), which is a virtual memory structure spanning all instances.

GCS,是另一个RAC组件，负责协调不同实例间对数据块的访问。对这些数据块的访问以及跟新都记录在全局目录中（GRD）,这个全局目录是一个虚拟的内存结构，在所有的实例中扩张。

Each block has a master instance that maintains an entry in the GRD describing the current status of the block. GCS is the mechanism that Oracle uses to implement Cache Fusion.

每个块都有一个master实例，这个实例负责对GSD的访问进行管理，GSD里记录了这个块的当前状态信息。GCS 是oracle 用来实施Cache fusion的机制。

The blocks and locks maintained by GCS and GES are known as resources. Access to these resources must be coordinated between all instances in the cluster. This coordination occurs at both instance level and database level. Instance-level resource coordination is known as local resource coordination; database-level coordination is known as global resource coordination.

被GCS 和GES 管理的块和锁叫做资源。对这些资源的访问必须在群集的多个实例中进行协调。这个协调在实例层面和数据库层面都有发生。实例层次的资源协调叫做本地资源协调；数据库层次的协调叫做全局资源协调。

Local resource coordination in a RAC instance is identical to that in single-instance Oracle and includes block-level access, space management, dictionary cache and library cache management, row-level locking, and System Change Number (SCN) generation. Global resource coordination is specific to RAC and uses additional SGA memory structures, algorithms, and background processes.

Both GCS and GES are designed to operate transparently to applications. In other words, you do not need to modify applications to run on a RAC cluster, as the same concurrency mechanisms are available in RAC as are found in single-instance Oracle databases.

GCS 和GES 从设计上就是在对应用透明的情况下设计的。换一句话来说，你不需要因为数据库是在RAC上运行而修改应用,在单实例的数据库上的并行机制在RAC上也是可用地。

The background processes that support GCS and GES use the interconnect network to communicate between instances. This network is also used by Oracle Clusterware and may optionally be used by the cluster file system (e.g., OCFS). GCS and GES operate independently of Oracle Clusterware.

支持GCS 和GES的后台进程使用私网心跳来做实例之间的通讯。这个网络也被Oracle 的群集组件使用，也有可能被群集文件系统（比如OCFS)所使用。

GCS 和GES 独立于 Oracle 群集组件而运行。但是，GCS 和GES依靠这些群集组件获得群集中每个实例的状态。如果这些信息不能从某个实例获得，这个实例将被关闭。这个关闭操作的目的是保护数据库的完整性，因为每个实例需要知道其他实例的情况，这样可以更好的确定对数据库的协调访问。

Global Enqueue Services

In a RAC database, GES is responsible for inter-instance resource coordination. GES manages all non-Cache Fusion inter-instance resource operations. It tracks the status of all Oracle enqueue mechanisms for resources that are accessed by more than one instance. Oracle uses GES enqueues to manage concurrency for resources operating on transactions, tables, and other structures within a RAC environment. GES is also responsible for deadlock detection.

全局队列服务

在一个RAC数据库中，GES 是对实例间资源协调负责的。GES 负责所有的非 cache fusion 实例间资源操作。它将跟踪所有的被两个以上实例访问的 Oracle 资源队列机制。Oracle 使用GES 队列来访问并行的资源，这些资源被事务、表和RAC环境下其它组件所使用。GES 也负责对死锁进行检测。

GES controls all library cache locks and dictionary cache locks in the database. These resources are local in a single-instance database but global in a RAC database. Global locks are also used to protect the data structures used for transaction management. In general, transaction and table lock processing operate the same way in RAC as they do in single-instance Oracle databases.

All layers of Oracle use the same GES functions to acquire, convert, and release resources. The number of global enqueues is calculated automatically at start-up.

As with enqueues on single-instance Oracle, deadlocks may occur with global enqueues in a RAC cluster. For example, Instance 1 has an exclusive lock on Resource A, and Instance 2 has an exclusive lock on Resource B. If each instance then requests an exclusive lock on the resource held by the other, neither request can be granted. This deadlock situation will be detected by the LMD0 background process, which will write an error message to the alert log, for example:

Global Enqueue Services Deadlock detected.More info in file /u01/app/oracle/admin/RAC/bdump/rac1_lmd0_25084.trc

Oracle的各个层次使用相同的GES 功能来获得、转化以及释放资源。

Background Processes

GES performs most of its activities using the LMD0 and LCK0 background processes. In general, processes communicate with their local LMD0 background process to manipulate the global resources. The local LMD0 background process communicates with the LMD0 processes on other instances.

The LCK0 background process is used to obtain locks that are required by the entire instance. For example, LCK0 is responsible for maintaining dictionary cache locks.

LCK0 后台进程用来获得整个实例需要的锁。比如，LCK0 进程负责维护dictionary cache 锁。

Server processes communicate with these background processes using messages known as Asynchronous Traps (AST). Asynchronous messages are used to avoid the background processes having to block while they are waiting for replies from instances on remote nodes. Background processes can also send Blocking Asynchronous Traps (BAST) to lock-holding processes to request that they downgrade a currently held lock to a less restrictive mode.

2. RAC GES/GCS Principles (Part 2)

资源的概念：

资源是内存结构，这些结构代表了数据库中的组件，对这些组件的访问必须为限制模式或者串行化模式。换一句话说，这个资源只能被一个进程或者一直实例并行访问。如果这个资源当前是处于使用状态，其他想访问这个资源的进程必须在队列中等待，直到资源变得可用。

队列是内存结构，它负责并行化对特殊资源的访问。如果这些资源只被本地实例需求，那么这个队列可以本地来获得，而且不需要协同。但是如果这个资源被远程实例所请求，那么本地队列必须变成全球化。

优化全局队列

全局锁将明显的影响性能，这将造成增加的等待次数，甚至是死锁。然而，一些简单的方法能够极大地减小全局锁的影响。

很多全局锁与分析活动有关，因此，你应该随处避免不必要的解析。有很多种方式可以达到这个目的。在OLTP 环境中，文字应该被绑定变量所替代。这个置换方式实现的最好方式是修改源代码。但是，如果你没有访问源代码的权限，也许你可以考虑使能游标共享，这可以达到一样的目的，但会增加轻微的额外开销，因为在数据库中每个被执行的语句必须被做为文字的扫描，然后再做语句的解析。

PL/SQL 将包含一系列的优化策略，目的是提升性能。比如，当一个语句执行完毕之后，pl/sql 不会关闭游标。替代的是，它将高效的把游标放在一个池里，以备他们被再次需要，这种情况下，如果近期这个语句被再次执行可以避免再次做重复的软解析。如果你用C或者JAVA来开发应用，你可以把上面的信息保留下来并因此减少被解析的量。

另一种减少解析的方式是优化library cache的大小，减少发生这种情况游标的个数，这些游标从cache中age out之后很快又被reloaded。你可以把常用的包以及游标 pin到内存中，这样来提高数据库的效率。

你也应该尝试从你的应用移除不必要的DDL语句。最常见的原因是，在复杂的步骤中，有中间步骤需要建立临时表，这些步骤包括报告或者批量子进程。这些表可以经常被全局临时表所代替。

最终，减少全局enqueue 的一个方式是执行更少的SQL语句，要达到这个效果，我们可以通过减少不必要的语句来实现。它也可以通过合并存在的SQL语句来实现。

比如说，使用UNION ALL.而且，我们也值得花时间检查下你的应用逻辑，确认是否正在被周期的执行的sql语句能够被作为一个集合操作执行。比如说，你可能能够修改一个sql语句，这个语句能够使用一条语句一次更新所有的行，代替老的单命令语句100次更新语句。

English original / translation

Resources and Enqueues

A resource is a memory structure that represents some component of the database to which access must be restricted or serialized. In other words, the resource can only be accessed by one process or one instance at a time. If the resource is currently in use, other processes or instances needing to access the resource must wait in a queue until the resource becomes available.

资源是内存结构，这些结构代表了数据库中的组件，对这些组件的访问必须为限制模式或者串行化模式。换一句话说，这个资源只能被一个进程或者一直实例并行访问。如果这个资源当前是出于使用状态，其他想访问这个资源的进程必须在队列中等待，直到资源变得可用。

An enqueue is a memory structure that serializes access to a particular resource. If the resource is only required by the local instance, then the enqueue can be acquired locally, and no coordination is necessary. However, if the resource is required by a remote instance, then the local enqueue must become global.

优化全局队列:

Optimizing Global Enqueues

Global locking can significantly impact performance, causing increased wait times and possibly even deadlocks. However, a few simple measures can greatly reduce the impact of global locking.

全局锁将明显的影响性能，这将造成增加的等待次数，甚至是死锁。然而，一些简单的方法能够极大地减小全局锁的影响。

Much global locking is related to parsing activity. Therefore, you should avoid unnecessary parsing wherever possible. There are many ways to achieve this.

很多全局锁与分析活动有关，因此，你应该随处避免不必要的解析。有很多种方式可以达到这个目的。

In OLTP environments, literals should be replaced by bind variables. This replacement is best achieved by modifying the source code.

在OLTP 环境中，文字应该被绑定变量所替代。这个置换方式实现的最好方式是修改源代码。

However, if you do not have access to the source code, you might consider enabling cursor sharing, which achieves the same goal but incurs a slight overhead, as the text of every statement executed in the database must be scanned for literals before the statement is parsed.

但是，如果你没有访问源代码的权限，也许你可以考虑使能游标共享，这可以达到一样的目的，但会增加轻微的额外开销，因为在数据库中每个被执行的语句必须被做为文字的扫描，然后再做语句的解析。
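The difference between literals and bind variables can be sketched as follows. This is a minimal illustration in Python that uses the standard-library sqlite3 module as a stand-in for an Oracle connection; the orders table and its columns are hypothetical, and SQLite's ? placeholder plays the role that a named bind variable would play in Oracle.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in for an Oracle session.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, "NEW") for i in range(1, 6)])

# Literal version: every value produces a different statement text,
# so the database would have to parse each one separately.
literal_texts = {f"SELECT status FROM orders WHERE id = {i}"
                 for i in range(1, 6)}

# Bind version: one statement text is reused with different values.
bind_text = "SELECT status FROM orders WHERE id = ?"
bind_results = [conn.execute(bind_text, (i,)).fetchone()[0]
                for i in range(1, 6)]

print(len(literal_texts), "distinct literal statements vs 1 bind statement")
```

With literals the database sees five distinct statement texts; with a bind variable it sees a single shareable one, which is the effect the paragraph above aims for.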

PL/SQL contains a number of optimizations aimed at improving performance. For example, it does not close cursors when a statement completes. Instead, it effectively retains cursors in a pool in case they are needed again, which avoids the need to perform a soft parse if the cursor is executed again in the near future. If you develop your own applications in C or Java, you can copy this behavior and thereby reduce the amount of parsing required.

Another way to reduce parsing is simply to optimize the size of the library cache to reduce the number of cursors that are aged out of the cache and subsequently reloaded. You may also benefit from pinning commonly used packages and cursors in memory.

You should also attempt to remove unnecessary DDL statements from your application. The most common cause of these is the creation of temporary tables for intermediate steps in complex tasks, such as reports and batch processes. These tables can often be replaced by global temporary tables.

Finally, the impact of global enqueues can be reduced by simply executing fewer SQL statements, which can often be achieved by eliminating unnecessary statements. It can also be achieved by combining existing SQL statements, for example, by using a UNION ALL. It is also worth examining your application logic to establish whether statements that are being executed procedurally (i.e., row by row) could be executed as a set operation. For example, you might be able to replace 100 single-row update statements with a single statement that updates all rows at the same time.
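The row-by-row versus set-based contrast can be sketched as follows. This is an illustrative example only: it uses Python's sqlite3 module rather than Oracle, and the accounts table is a hypothetical name introduced for the demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [(i, 100) for i in range(1, 101)])

# Procedural approach: 100 separate single-row UPDATE executions.
row_by_row_statements = 0
for account_id in range(1, 101):
    conn.execute("UPDATE accounts SET balance = balance + 10 WHERE id = ?",
                 (account_id,))
    row_by_row_statements += 1

# Set-based approach: one UPDATE that touches the same 100 rows.
cursor = conn.execute(
    "UPDATE accounts SET balance = balance + 10 WHERE id BETWEEN 1 AND 100")
set_based_rows = cursor.rowcount

total = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]
```

Both approaches change the same rows, but the second issues a single statement instead of one hundred.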

3. RAC GES/GCS Principles (Part 3)

Lock Types

Every lock has a type, which is a two-character alphabetic identifier (e.g., BL, CU, SE, NB). The number of lock types varies with each release. Some lock types are only used in RDBMS instances, others in ASM instances, and the remainder are used in both.

每一种锁都有个类型，它是一个两字母的标识符（e.g BL,CU,SE,NB). 每种锁类型的个数因为数据库的版本而不同。某些类型的锁仅仅用在RDBMS 实例中，某些在ASM 实例中，其它剩下的在两种实例中都有。

Each lock type has two parameters, which are called tags. Each tag value is a 32-bit number. The tag values differ according to the lock type, but the name and the two tag values form a unique identifier for the lock. For example, for a library cache object LB lock, the parameters represent a portion of the hash value for the object, which is derived from the object name. On the other hand, for a TM lock, the first parameter contains the object number, and the second parameter describes whether the object is a table or a partition.

每种锁类型都有2个参数，它们被叫做标签。每个标签值为一个32位的数字。标签的数值因为锁类型的不同而不同，它们的格式一般如下，名字以及两标签的数值形成了锁类型的唯一标识符。比如说，对于一个library cache 对象 LB 锁，这个参数代表了对象hash 数值的一部分，他来自于对象名。另一方面，针对一个TM锁，第一个参数包含了对象号，第二个参数代表对象是一个表还是一个分区。
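The "type plus two 32-bit tags" identifier described above can be modeled with a small value type. This is a hypothetical sketch: the LockId class, its field names, and the sample tag values are assumptions made for illustration and do not reflect Oracle's internal structures.

```python
from dataclasses import dataclass

TAG_MAX = 2**32 - 1  # each tag is a 32-bit number


@dataclass(frozen=True)
class LockId:
    """Hypothetical model of a lock identifier: type name plus two tags."""
    lock_type: str
    tag1: int
    tag2: int

    def __post_init__(self):
        if len(self.lock_type) != 2 or not self.lock_type.isalpha():
            raise ValueError("lock type must be two alphabetic characters")
        for tag in (self.tag1, self.tag2):
            if not 0 <= tag <= TAG_MAX:
                raise ValueError("tag must fit in 32 bits")


# A TM (table) lock: tag1 carries the object number. The value 0 for tag2
# is an illustrative placeholder, not Oracle's actual encoding.
tm_lock = LockId("TM", 51234, 0)

# Because the three fields together identify the lock, a frozen dataclass
# can be used directly as a dictionary key.
holders = {tm_lock: "instance 1"}
```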

In Oracle 10.1 and above, the V$LOCK_TYPE dynamic performance view summarizes all implemented lock types.

在oracle 10.1 以及以上版本中，v$LOCK_TYPE 动态视图概括了所有的当前锁类型。

Some lock types, for example, the TX transaction lock and the CU cursor lock, only affect the local instance; therefore, they can be managed locally. Other lock types, such as the TM table lock and all library cache locks and pins, must be observed by all instances in the database; therefore, they must be managed globally.

某些锁类型，比如说，TX 事务锁和CU 游标锁，仅仅影响本地实例；因此，它们可以被本地管理。其它的锁类型，比如TM 表锁和所有的library cache锁和pins，必须被数据库中所有的实例来观察；因此，它们必须被全局来管理。

The most common lock types seen in a RAC database are listed in Table 22-1.

Table 22-1. Common Lock Types

Type  Description
BL    Block (GCS)
CU    Cursor lock
HW    High water mark lock
L*    Library cache lock
N*    Library cache pin
Q*    Dictionary cache lock
SQ    Sequence cache
TM    Table lock
TS    Temporary segment
TT    Tablespace lock (for DDL)
TX    Transaction lock

Library Cache Locks

Each RAC instance has its own library cache. The library cache contains all statements and packages currently in use by the instance. In addition, the library cache contains all objects that are referenced by these statements and packages.

每个RAC下的实例都有它自己的library cache。Library cache 包含了所有被当前实例所使用的包以及语句。而且，library cache 包含了所有被这些语句和包引用到的对象。

When a DML or DDL statement is parsed, all database objects that are referenced by that statement are locked using a library cache lock for the duration of the parse call. These objects include tables, indexes, views, packages, procedures, and functions. Referenced objects are also locked in the library cache during the compilation of all PL/SQL packages and Java classes.

当一个DML或DDL操作被解析的时候，所有被引用的数据库对象将被锁定，使用的是library cache锁，指导这个语句分析结束，这个锁才会被释放。这些对象包含有表，索引，视图，包，存储过程和函数。被引用的对象在 PL/SQL 和JAVA类被编译的过程中，也是被锁定在library cache里的。

When a statement is executed, all referenced objects are locked briefly to allow them to be pinned. Objects are pinned during statement execution to prevent modification of them by other processes, such as those executing DDL statements.

当一个语句被执行的时候，所有它涉及到的对象将被很快的锁定，并允许它们被pin操作读取。在这个语句执行的时候，这些对象被pin住了，

防止被其他进程所访问，比如哪些执行DDL 语句的情况。

Namespaces

Every object in the library cache belongs to a namespace. The number of namespaces is release dependent; in Oracle 10.2 there can be a maximum of 64, although not all are used in that release.

Library cache中的每个对象都属于一个 namespace. namespace的个数是根据版本的，在ORACLE 10.2 版本里，最多可以有64个namespace，尽管不是所有的namespace都被使用。

Within a namespace, each object name must be unique. For example, one of the namespaces is called TABL/PRCD/TYPE, which ensures that no table, procedure, or user-defined type can have the same name.

在一个namespace中，每个对象都必须唯一。比如说，有一个namespace叫做 TABL/PRCD/TYPE,可以保证表、存储过程、用户定义的类型不能同名。

The namespace for each object is externalized as a number in the KGLHDNSP column of the X$KGLOB family of views. You can obtain limited statistics for objects in the library cache, such as the number of gets and pins, from the V$LIBRARYCACHE view. Note, however, that this view returns only a subset of namespaces from the X$KGLST base view.

命名空间对应了 X$KGLOB 家族视图中 KGLHDNSP列的一个数字。你可以获得library cache中对象的有限的信息，比如说从

v$librarycache视图中 gets 和 pins的个数。请注意，这个视图仅仅返回命名空间x$KGLST 基本视图的子集。

Prior to Oracle 10g, you could also identify the namespace for each object from a library cache dump as follows:

SQL> ALTER SESSION SET EVENTS 'immediate trace name library_cache level 8';

In this dump, each object in the library cache has a namespace attribute. Unfortunately, in Oracle 10.1 and 10.2, this attribute has become confused with the object type, which is externalized as KGLOBTYP in the X$KGLOB family of views. Although the namespace attribute is incorrect in the dump, you can still determine the true namespace by inspecting the instance lock types for the library cache locks and pins as described later in this section.

在这个dump里面，每个library cache里的对象都有一个命名空间属性。不幸的是在Oracle 10.1和 10.2 ，这个属性和对象类型容易混淆，对象类型在X$KGLST 家族视图的KGLOBTYP列中可以找到。虽然在dump文件里命名空间属性是不正确的，你仍然可以从查看实例锁类型来决定真正的命名空间，这个将在这个章节的后面介绍这些实例锁（针对library cache locks 和 pins 的）。

Hash Values

Every object in the library cache has a hash value. This is derived from a combination of the namespace and a name. In the case of stored objects, such as tables, the name is derived from the owner name, object name, and optionally, a remote link. In the case of transient objects, such as SQL statements, the name is derived from the text of the SQL statement.

Library cache里面的每个对象都有一个hash数值。这个数值从命名空间和对象名的集合获得。对一个储存的对象来说，比如表，

它的hash名是从对象属主名、对象名，还有可选的，从远程链接名获得。针对一个临时对象，比如SQL语句，它的hash 名是从

SQL 语句的文字获得的。

Prior to Oracle 10.1, the hash value was represented by a 32-bit number, which is still calculated and externalized in the KGLNAHSH column of X$KGLOB. This 32-bit hash value was sufficient for most purposes, but could, on occasion, lead to collisions. Therefore, in Oracle 10.1 and above, a new hash value is calculated using a 128-bit value, which more or less guarantees uniqueness. This new value is externalized in the KGLNAHSV column of X$KGLOB.

在10.1版本前，hash数值是一个32位的数字，它将被计算之后在 X$KGLOB 视图的 KGLNAHSH 列上表示出来。针对绝大多数情况，这个32位的 hash数值对绝大多数情况来说是足够了。不过在某些情况下，将导致冲突。因此，在ORACLE 10.1 以及以上版本，将使用128位数值value，这或多或少保证了唯一性。这个新的数值可以通过查看 X$KGLOB 的 KGLNAHSV列来得到。
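A rough sense of why a 32-bit hash can collide "on occasion" while a 128-bit hash "more or less guarantees uniqueness" comes from the birthday bound. The sketch below assumes an idealized, uniformly distributed hash function; it does not model Oracle's actual hashing algorithm, and the figure of 100,000 cached objects is an arbitrary illustrative choice.

```python
import math


def collision_probability(n_objects: int, bits: int) -> float:
    """Birthday-bound estimate of P(at least one collision).

    Assumes n_objects values drawn uniformly from a space of 2**bits.
    """
    exponent = n_objects * (n_objects - 1) / 2 ** (bits + 1)
    # -expm1(-x) computes 1 - exp(-x) accurately even for very small x.
    return -math.expm1(-exponent)


p32 = collision_probability(100_000, 32)
p128 = collision_probability(100_000, 128)
print(f"32-bit: {p32:.3f}   128-bit: {p128:.3e}")
```

Under these assumptions a collision among 100,000 objects is more likely than not with 32 bits, and vanishingly unlikely with 128 bits.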

4. RAC GES/GCS Principles (Part 4)

资源的信息被保存在 GRD中，由GCS、GES 来管理。GRD 是一个内存结构，在所有的实例中分配。GRD的目的是提供优化的表现。每个实例负责对SGA中部分的GRD信息进行管理，因此，访问GRD的开销在RAC的所有实例中共享。GRD中的信息对于所有的实例都是可以访问的，如果这个信息是在本地，可以通过直接访问；如果不在本地，可以通过和远程节点的后台进程通信来访问。GRD 也被设计来提供容错。如果发生一个节点失败事件，GRD将被剩下来的实例所重构。在恢复之后，只要还有一个活动的实例。这个共享的数据库还是可以被访问的。GCS和GES 被设计来在多个并行节点的失败情况下恢复。一个节点加入或离开群集都会导致GRD被重建。GRD的动态执行将方便RAC的任何实例都可以在任何时候以任何顺序启停。节点关系的任何改动都将导致群集信息的重构。每个资源初始时候都是通过hash 算法来指派给某一个实例的。这个实例叫做资源属主(resource master)。某个特定资源的 master 实例可能在每次群集信息重构(cluster reconfiguration)的时候改变，这种改动方式叫做静态资源管理。在oracle 10.1以及以上，资源能够通过使用模式来被重新指定属主，目的是降低网络访问和随之的CPU资源损耗。这种叫做动态资源管理。在ORACLE 10.1上，GCS 将定时的评估资源的管理情况。如果它发先某个实例和某个数据文件上的数据块资源之间有密切的关系，那么这个文件上的所有块的master 属主将被动态分配给这个实例。在ORACLE 10.2以及以上，动态资源属主管理（dynamic resource mastering)是在段级别来实施的，GCS 发现了某个实例和一个segment上的数据有密切的联系后，将启动重新指派属主（initiate remastering）的动作。每个实例保存了GRD 的一部分，它包含了全局资源的某个子集的当前状态。这个信息，在实例进行失败恢复的时候或群集信息重配的时候都被使用，包含有数据块的识别符号，数据块的当前版本的位置，该数据块被任何实例持有的模式，模式可以是null（N)，shared(S),或exclusive(X).而每个持有数据块的角色可以是本地的或者全局的。

当一个实例申请一个资源的时候，比如某个数据块，它首先会和资源属主(resource master)进行沟通,确定资源的当前状态。如果这个资源目前不被使用，则可以本地获取。如果这个资源正在被使用，资源属主将向占用此资源的实例要求把资源发送给要求资源的实例。如果资源随后被1个或多个实例要求修改，GRD 将被修改并申明这个资源是全局资源。如果本地实例需要某个数据块的读一致性版本，它会先联系资源属主来确认是否这个版本的数据块是否在远程节点的buffer cache中有同样版本或更新版本的数据块。如果这个数据块存在，那么资源属主将对远程实例发送一个要求，要求把读一致性版本的数据块

发送到本地。如果远程实例持有要求SCN 时间点版本的数据块，它将立即发送数据块。如果远程实例持有一个更新的数据块版本，它将建立数据块的一个副本，然后应用undo信息把这个副本回滚到要求的时间点scn。当某个 RAC 实例要求一个数据块，这个数据块当前正在被本地实例修改，那么这个需求在RAC实例的处理方式将和本地单实例处理方式一直。然而，当一个RAC实例要求某个被别的实例跟新的数据块的时候，那么这个块信息将首先被定位，然后准备，最后通过远程实例的GCS 的后台进程LMSn 传输到本地。

一个数据块可以存在多个buffer cache中。它可以被多个实例以不同的模式持有，持有的模式要依据这个数据块的状态，是被读还是被更新。GCS使用持有模式来决定是否实例当前拥有修改这个数据块的权限。有三种持有模式：null 模式(N)，共享模式（S),排他模式（X)。这些模式在下表中可以看到

Null N 没有访问权限

Shared S 共享访问权限，可以被多个实例读，不能被任何实例修改。

Exclusive X 持有X模式的实例有权限可以修改这个数据块，只有一个实例

能够以X模式来访问资源

你可以通过访问V$BH动态视图的STATUS列来查看某个实例的buffer 在buffer cache中的状态。STATUS 列包含下表22-5所示的信息：

V$BH 状态列数值

资源属主的状态信息

FREE Buffer当前没有在使用状态

CR NULL 一致性读（只读）

SCUR S 共享当前的数据块（只读）

XCUR X 对当前块的专用模式（可以被修改）

PI NULL 旧映象（只读）

SCUR 和 PI 这两个状态是RAC独有的。数据块被某个实例修改前，必须先把它修改为XCUR状态。在一个群集数据库 buffer cache中,可能有某个块的多个拷贝，但任何时间点但是其中只有一个拷贝是出于XCUR状态。

Global Resource Directory (GRD)

Information about resources is maintained in the GRD by the GCS and GES. The GRD is a memory

structure that is distributed across all instances.

资源的信息被保存在 GRD中，由GCS、GES 来管理。GRD 是一个内存结构，在所有的实例中分配。

The GRD is designed to provide enhanced runtime performance. Each instance is responsible for maintaining part of the GRD in its SGA; therefore, the overhead of maintaining the GRD is shared between all active instances.

GRD的目的是提供优化的表现。每个实例负责对SGA中部分的GRD信息进行管理，因此，访问GRD的开销在RAC的所有实例中共享。

Information in the GRD is available to all instances, either directly if that information is maintained locally, or indirectly through communication with background processes on the remote node.

GRD中的信息对于所有的实例都是可以访问的，如果这个信息是在本地，可以通过直接访问；如果不在本地，可以通过和远程节点的后台进程通信来访问。

The GRD is also designed to provide fault tolerance. In the event of a node failure, the GRD is

reconstructed by the remaining instances.

GRD 也被设计来提供容错。如果发生一个节点失败事件，GRD将被剩下来的实例所重构。

As long as at least one active instance remains after recovery is completed, the shared database will still be accessible. GCS and GES are designed to be resilient in the event of multiple concurrent node failures.

在恢复之后，只要还有一个活动的实例。这个共享的数据库还是可以被访问的。GCS和GES 被设计来在多个并行节点的失败情况下恢复。

The GRD is reconstructed whenever a node joins or leaves the cluster. The dynamic implementation

of the GRD enables RAC instances to start and stop at any time and in any order. Every change

in node membership results in a cluster reconfiguration.

一个节点加入或离开群集都会导致GRD被重建。GRD的动态执行将方便RAC的任何实例都可以

在任何时候以任何顺序启停。节点关系的任何改动都将导致群集信息的重构。

Each resource is initially mapped onto an instance using a hashing algorithm. This instance is

called the resource master. The master instance for a specific resource may change each time there

is a cluster reconfiguration, which is known as static resource mastering.

每个资源初始时候都是通过hash 算法来指派给某一个实例的。这个实例叫做资源属主(resource master)。某个特定资源的 master 实例可能在每次群集信息重构(cluster reconfiguration)的时候改变，这种改动方式叫做静态资源管理。
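Static resource mastering, as described above, amounts to hashing each resource onto one of the currently active instances. The sketch below is an assumption-laden illustration: Oracle's real hashing algorithm is not given in the text, so zlib.crc32 is used as a stand-in, and the resource names and instance numbers are hypothetical.

```python
import zlib


def resource_master(resource_name, active_instances):
    """Map a resource onto one active instance using a stable hash.

    crc32 is only a stand-in for whatever hash the database really uses.
    """
    ordered = sorted(active_instances)
    bucket = zlib.crc32(resource_name.encode("utf-8")) % len(ordered)
    return ordered[bucket]


resources = [f"block-{n}" for n in range(1000)]

# Masters with three active instances, then after instance 3 leaves
# the cluster and a reconfiguration takes place.
before = {r: resource_master(r, [1, 2, 3]) for r in resources}
after = {r: resource_master(r, [1, 2]) for r in resources}
moved = sum(1 for r in resources if before[r] != after[r])
print(f"{moved} of {len(resources)} resources changed master")
```

Every resource that was mastered by the departed instance must move, which is why the master of a given resource may change at each cluster reconfiguration.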

In Oracle 10.1 and above, resources can also be remastered based on usage patterns to reduce

network traffic and the consequent CPU resource consumption. This is known as dynamic resource

mastering.

在oracle 10.1以及以上，资源能够通过使用模式来被重新指定属主，目的是降低网络访问和随之的

CPU资源损耗。这种叫做动态资源管理。

In Oracle 10.1, GCS evaluates resource mastering periodically. If it detects a high level of

affinity between a particular instance and blocks from a specific data file, then all blocks in the file

may be remastered by that instance.

在ORACLE 10.1上，GCS 将定时的评估资源的管理情况。如果它发先某个实例和某个数据文件

上的数据块资源之间有密切的关系，那么这个文件上的所有块的master 属主将被动态分配给这个

实例。

In Oracle 10.2 and above, dynamic resource mastering is performed on a segment level, and GCS will initiate remastering if there is a high level of affinity between a particular instance and blocks from a specific segment.

在ORACLE 10.2以及以上，动态资源属主管理（dynamic resource mastering)是在段级别来实施的，GCS 发现了某个实例和一个segment上的数据有密切的联系后，将启动重新指派属主（initiate remastering）的动作。

Each instance maintains a portion of the GRD containing information about the current status

of a subset of the global resources. This information, which is used during instance failure recovery

and cluster reconfigurations, includes data block identifiers, the location of the current version of

the data block, modes in which the data block is held by each instance, which can be null (N), shared (S),or exclusive (X), and the role in which each instance is holding the data block, which can be local or

global.

每个实例保存了GRD 的一部分，它包含了全局资源的某个子集的当前状态。这个信息，在

实例进行失败恢复的时候或群集信息重配的时候都被使用，包含有数据块的识别符号，数据块

的当前版本的位置，该数据块被任何实例持有的模式，模式可以是null（N)，shared(S),或

exclusive(X).而每个持有数据块的角色可以是本地的或者全局的。

When an instance requests a resource, such as a data block, it first contacts the resource master

to ascertain the current status of the resource. If the resource is not currently in use, it can be acquired locally.

当一个实例申请一个资源的时候，比如某个数据块，它首先会和资源属主(resource master)进行沟通,确定资源的当前状态。如果这个资源目前不被使用，则可以本地获取。

If the resource is currently in use, then the resource master will request that the holding instance pass the resource to the requesting instance.

如果这个资源正在被使用，资源属主将向占用此资源的实例要求把资源发送给要求资源的实例。

If the resource is subsequently required for modification by one or more instances, the GRD will be modified to indicate that the resource is held globally.

如果资源随后被1个或多个实例要求修改，GRD 将被修改并申明这个资源是全局资源。

If the local instance requires a read-consistent version of a block, it still contacts the resource master to ascertain if a version of the block that has the same or a more recent SCN exists in the buffer cache of any remote instance.

如果本地实例需要某个数据块的读一致性版本，它会先联系资源属主来确认是否这个版本的数据块

是否在远程节点的buffer cache中有同样版本或更新版本的数据块。

If such a block exists, then the resource master will send a request to the relevant remote instance to forward a read-consistent version of the block to the local instance.

如果这个数据块存在，那么资源属主将对远程实例发送一个要求，要求把读一致性版本的数据块

发送到本地。

If the remote instance is holding a version of the block at the requested SCN, it sends the block

immediately. If the remote instance is holding a newer version of the block, it creates a copy of the

block and applies undo to the copy to revert it to the correct SCN.

如果远程实例持有要求SCN 时间点版本的数据块，它将立即发送数据块。如果远程实例持有一个

更新的数据块版本，它将建立数据块的一个副本，然后应用undo信息把这个副本回滚到要求

的时间点scn。
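The construction of a read-consistent copy from a newer block can be sketched as follows. This is a simplified model under stated assumptions: a block is represented as a dictionary of rows, and each undo record is a hypothetical (change_scn, row_id, old_value) tuple. Real undo records and block formats are far more involved.

```python
import copy


def consistent_read_copy(current_block, undo_records, target_scn):
    """Return a copy of current_block rolled back to target_scn.

    undo_records holds (change_scn, row_id, old_value) tuples, where
    old_value is the row content before the change made at change_scn.
    """
    cr_block = copy.deepcopy(current_block)  # the current version is untouched
    newest_first = sorted(undo_records, key=lambda u: u[0], reverse=True)
    for change_scn, row_id, old_value in newest_first:
        if change_scn <= target_scn:
            break  # remaining changes are already visible at target_scn
        cr_block["rows"][row_id] = old_value
    cr_block["scn"] = target_scn
    return cr_block


current = {"scn": 120, "rows": {1: "c", 2: "y"}}
undo = [(110, 1, "a"), (115, 2, "x"), (120, 1, "b")]
cr = consistent_read_copy(current, undo, target_scn=112)
```

Here the changes made at SCN 120 and 115 are undone, while the change made at SCN 110 remains because it precedes the requested SCN.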

When a RAC instance requires a data block that is currently being updated on the local instance, the request is processed in exactly the same way that it would be in a single-instance database. However, when a RAC instance requests a data block that is being updated on another instance, the block images are located, prepared, and transmitted by the GCS background processes (LMSn) on the remote instance.

当某个 RAC 实例要求一个数据块，这个数据块当前正在被本地实例修改，那么这个需求

在RAC实例的处理方式将和本地单实例处理方式一直。然而，当一个RAC实例要求某个

被别的实例跟新的数据块的时候，那么这个块信息将首先被定位，然后准备，最后通过

远程实例的GCS 的后台进程LMSn 传输到本地。

5. RAC GES/GCS Principles (Part 5)

Resource Modes

A data block can exist in multiple buffer caches. It can be held by multiple instances in different

modes depending on whether the block is being read or updated by the instance.

一个数据块可以存在多个buffer cache中。它可以被多个实例以不同的模式持有，持有的模式

要依据这个数据块的状态，是被读还是被更新。

GCS uses the resource mode to determine whether the instance currently holding the block can modify it. There are three modes: null (N) mode, shared (S) mode, and exclusive (X) mode. These modes are summarized in Table 22-4.

GCS使用这种持有模式来决定是否实例当前拥有修改这个数据块的权限。有三种持有模式：null 模式(N)，共享模式（S),排他模式（X)。这些模式在下表中可以看到：

Table 22-4. Resource Modes

Resource Mode  Identifier  Description
Null           N           No access rights.
Shared         S           Shared resources can be read by multiple instances but cannot be updated by any instance.
Exclusive      X           An instance holding a block in exclusive mode can modify the block. Only one instance can hold the resource in exclusive mode.

Null N 没有访问权限

Shared S 共享访问权限，可以被多个实例读，不能被任何实例修改。

Exclusive X 持有X模式的实例有权限可以修改这个数据块，只有一个实例

能够以X模式来访问资源
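The three resource modes imply a simple compatibility rule, sketched below. The can_grant function is a hypothetical helper written for illustration; it encodes only what the table states (many shared holders, at most one exclusive holder) and is not a description of the actual GCS implementation.

```python
def can_grant(requested, held_by_others):
    """Return True if `requested` is compatible with modes held elsewhere.

    Modes are "N" (null), "S" (shared), and "X" (exclusive).
    """
    active = [mode for mode in held_by_others if mode != "N"]
    if requested == "N":
        return True                      # null confers no access rights
    if requested == "S":
        return all(mode == "S" for mode in active)
    if requested == "X":
        return not active                # exclusive tolerates no S or X holder
    raise ValueError(f"unknown resource mode: {requested!r}")


print(can_grant("S", ["S", "N"]))  # shared alongside another reader
print(can_grant("X", ["S"]))       # exclusive while a reader holds the block
```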

You can verify the current state of any buffer in the buffer cache of an instance by selecting the

STATUS column from the V$BH dynamic performance view. The STATUS column can contain the values

shown in Table 22-5.

你可以通过访问V$BH动态视图的STATUS列来查看某个实例的buffer 在buffer cache中的状态。

STATUS 列包含下表22-5所示的信息：

Table 22-5. V$BH Status Column Values

Status  Resource Mode  Description
FREE                   Buffer is not currently in use
CR      NULL           Consistent read (read only)
SCUR    S              Shared current block (read only)
XCUR    X              Exclusive current block (can be modified)
PI      NULL           Past image (read only)

V$BH 状态列数值

资源属主的状态信息

FREE Buffer当前没有在使用状态

CR NULL 一致性读（只读）

SCUR S 共享当前的数据块（只读）

XCUR X 对当前块的专用模式（可以被修改）

PI NULL 旧映象（只读）

The SCUR and PI states are RAC specific. The XCUR state must be assigned before the block can be modified. There can be only one copy of a block in the XCUR state in any buffer cache in the cluster database at any one time.

SCUR 和 PI 这两个状态是RAC独有的。数据块被某个实例修改前，必须先把它修改为XCUR状态。

在一个群集数据库 buffer cache中,可能有某个块的多个拷贝，但任何时间点但是其中只有一个拷贝是出于XCUR状态。
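The rule that only one XCUR copy of a block may exist cluster-wide can be checked over a snapshot of buffer states. The following sketch assumes the snapshot has already been gathered into (instance, block, status) tuples; the sample data and the function name are hypothetical.

```python
from collections import Counter

# Resource mode implied by each V$BH status, taken from the table above.
STATUS_TO_MODE = {"FREE": None, "CR": "NULL", "SCUR": "S",
                  "XCUR": "X", "PI": "NULL"}


def blocks_with_multiple_xcur(buffers):
    """Return block ids that appear in XCUR state more than once.

    buffers is an iterable of (instance_id, block_id, status) tuples.
    """
    counts = Counter(block for _inst, block, status in buffers
                     if status == "XCUR")
    return sorted(block for block, n in counts.items() if n > 1)


sample = [
    (1, "file 4 block 100", "XCUR"),
    (2, "file 4 block 100", "PI"),
    (3, "file 4 block 100", "CR"),
    (1, "file 4 block 200", "SCUR"),
    (2, "file 4 block 200", "SCUR"),
]
violations = blocks_with_multiple_xcur(sample)
```

A consistent snapshot yields an empty list: one instance holds the current block exclusively while others hold past images, consistent read copies, or shared current copies.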

Resource Roles

每个分配给某个实例的资源组都会被分配一个角色。这个角色可以使本地的或者全局的。

当某个数据块最被读入某个buffer cache中，而没有其他实例读过这个数据块，那么这个

数据块可以被本地管理。

GCS分配一个本地角色给某个数据块。如果这个数据块被某个实例修改并被传输到另一个实例

去，那么它将被全局化管理，而且GCS将分配一个全局角色给这个数据块。

当一个数据块被被传送的过程中，资源模式可能会维持为专有模式，或者它将从专有改为共享模式。

GCS 将跟踪所有实例中每个buffer cache 中的每个块的位置、资源属主、资源角色。GCS被用来

确保cache 的一致性，如果buffer cache中的数据块的当前版本被另一个实例申请修改的时候。

Cache的同步

cache同步在很多计算技术中是一个重要的概念。在ORACLE 的RAC数据库中，它被定义为

在多个cache中进行数据块的同步。

GCS 为了保证cache的同步，需要通过要求实例在全局级别来申请资源。GCS将同步全局访问，

同一时间只允许一个实例来修改数据块。

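Oracle uses a multiversioning architecture, in which there can be only one current version of a block across all instances in the cluster. Only the current version of a block may be modified. In addition, there can be any number of consistent read versions of a block. A consistent read version represents a snapshot of the data in the block at a specific point in time, which is represented by the SCN.

Consistent read blocks cannot be modified, although they can be used as a starting point to construct earlier consistent read blocks. The GCS manages both current and consistent read versions of blocks. If a local instance has modified a block and a remote instance requests it, the local instance creates a past image (PI) of the block before it transfers the block to the remote instance. In the event of an instance or node failure, the PI can be used to reconstruct current and consistent read versions of the block.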

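Cache Fusion

Cache Fusion addresses several types of concurrency between different nodes:

Concurrent reads
Concurrent reads and writes
Concurrent writes

Concurrent Reads

Concurrent reads on multiple nodes occur when two instances need to read the same block. In this case, no synchronization is required, as multiple instances can share data blocks for read access without any conflict.

Concurrent Reads and Writes

If one instance needs to read a block that has been modified by another instance, but the dirty block has not yet been written to disk, the block can be shipped across the interconnect from the holding instance to the requesting instance.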

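Concurrent Writes

When an instance updates a block in the buffer cache, the resulting block is called a dirty buffer. Only the current version of the block can be modified. The instance must acquire the current version of the block before it can modify it. If the current version of the block is not available, the instance must wait.

Before an instance can modify a block in the buffer cache, it must construct the redo for all changes to be applied to the block. When the redo has been copied to the redo buffer, the changes can be applied to the block in the buffer cache. The dirty block will subsequently be written to disk by the DBWn background process. However, the dirty block cannot be written to disk until the corresponding redo has been written from the redo log buffer to the redo log file.

If the local instance needs to update a block that it does not currently hold, it contacts the resource master to establish whether any other instance is holding the block. If a remote instance is holding a dirty version of the block, the remote instance ships the dirty block across the interconnect, so that the local instance can modify the current version of the block.

Note that a block does not need to be held by an instance in exclusive mode until the end of the transaction. Once a local instance has modified a row in the current version of a block, the block can be shipped to a remote instance, where another transaction can modify a different row. However, the remote instance cannot modify the row changed by the local instance until the local transaction has committed or rolled back. In this respect, row locking behaves in the same way as it does on a single-instance database.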

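Prior to Oracle 8.1.5, if a local instance required a block that was dirty in the buffer cache of another instance, the remote instance wrote the block back to the datafile and signaled the local instance. The local instance then read the block from disk into its buffer cache. This process is known as a disk ping. Disk pings are very resource intensive, as they require disk I/O as well as IPC communication between the instances.

In Oracle 8.1.5 and above, if the local instance requires a consistent read version of a block that is dirty in the buffer cache of a remote instance, the remote instance constructs a consistent read image of the block at the required SCN and sends it across the interconnect. This algorithm is known as Cache Fusion Phase I and was a significant step forward in cluster database technology.

In Oracle 9.0.1 and above, both consistent read blocks and current blocks can be transferred across the interconnect. The transfer of current blocks is made possible by the existence of past images (PIs). This algorithm is known as Cache Fusion Phase II.

Although Cache Fusion processing in a RAC database incurs overhead in the form of additional messaging, it does not necessarily increase the amount of I/O performed against the storage.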

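When a local instance attempts to read a block that is not in its local buffer cache, it first contacts the resource master, which checks the current status of the block in the GRD. If a remote instance is currently holding the block, the resource master requests that the remote instance send the block to the local instance. For a consistent read, the remote instance applies any undo required to restore the block to the appropriate SCN.

Therefore, if the local instance attempts to read a block that is in the cache of a remote instance, it receives a copy of the block over the interconnect and does not need to read the block from disk. Although this mechanism requires the participation of two or three nodes and consumes CPU and network resources, it is still generally less expensive than performing a physical disk read.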

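When a local instance modifies a block, the changes are written immediately to the redo buffer and are flushed to the redo log by the LGWR background process when the transaction commits. However, the modified block is not written to disk until the buffer cache needs free buffers or a checkpoint occurs.

If a remote instance needs to modify the block, the block is not written to disk by the local instance. Instead, it is shipped across the interconnect to the remote node for further modification. A past image (PI), which is a copy of the block at a point in time, is retained in the buffer cache of the local instance until it receives confirmation that the remote instance has written the block to disk, after which the PI can be discarded.

As with reads, this mechanism involves two or three nodes and is designed to avoid disk I/O at the expense of additional CPU and network resources.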

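The number of nodes involved in a read or write operation depends on the location of the resource master. If the resource master is the same instance as the source instance for a read, or the same instance as the destination instance for a write, then only two instances participate in the operation. If the resource master is neither the source nor the destination instance, then three instances participate, provided, of course, that the cluster has at least three nodes.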
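The two-or-three-instance rule can be expressed compactly by counting the distinct instances that play a role in a transfer. This is a toy model: the function name and the integer instance numbers are hypothetical and serve only to illustrate the counting argument.

```python
def instances_involved(requester, holder, master):
    """Count the distinct instances taking part in one block transfer.

    requester: instance asking for the block
    holder:    instance currently holding the block
    master:    instance that masters the resource in the GRD
    """
    return len({requester, holder, master})


# Master coincides with the requester or the holder: two participants.
two_way_a = instances_involved(requester=1, holder=2, master=1)
two_way_b = instances_involved(requester=1, holder=2, master=2)

# Master is a third instance: three participants.
three_way = instances_involved(requester=1, holder=2, master=3)
```

However many nodes the cluster has, a single transfer never involves more than these three roles.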

The English original with its Chinese translation follows:


Resource Roles

A role is assigned to every resource held by an instance. This role can be either local or global. When

a block is initially read into the buffer cache of an instance and no other instance has read the same

block, the block can be locally managed.

实例持有的每个资源都会被分配一个角色。这个角色可以是本地的或者全局的。

当某个数据块最初被读入某个实例的buffer cache中，而没有其他实例读过这个数据块，那么这个

数据块可以被本地管理。

The GCS assigns a local role to the block. If the block has been modified by one instance and is transmitted to another instance, then it becomes globally managed, and the GCS assigns a global role to the block.

GCS分配一个本地角色给某个数据块。如果这个数据块被某个实例修改并被传输到另一个实例

去，那么它将被全局化管理，而且GCS将分配一个全局角色给这个数据块。

When the block is transferred, the resource mode may remain exclusive, or it may be converted from exclusive to shared.

当一个数据块被传送时，资源模式可能会维持为专有模式，或者从专有模式转换为共享模式。

The GCS tracks the location, resource mode, and resource role of each block in the buffer cache

of all instances. The GCS is used to ensure cache coherency when the current version of a data block

is in the buffer cache of one instance and another requires the same block for update.

GCS 将跟踪所有实例的buffer cache 中每个块的位置、资源模式、资源角色。当某个数据块的当前版本

位于一个实例的buffer cache中，而另一个实例需要修改同一个数据块时，GCS被用来确保cache 的一致性。

Cache Coherency

Cache coherency is an important concept in many computing technologies. In an Oracle RAC

database, it is defined as the synchronization of data in multiple caches, so that reading a memory

location through any cache will return the most recent data written to that location through any

other cache. In other words, if a block is updated by any instance, then all other instances will be

able to see that change the next time they access the block.

Cache的一致性

cache一致性在很多计算技术中是一个重要的概念。在ORACLE 的RAC数据库中，它被定义为

多个cache中数据的同步，使得通过任何一个cache读取某个内存位置，都能得到通过其他任何cache最近写入该位置的数据。换句话说，如果某个数据块被任何一个实例修改，那么其他所有实例在下次访问该数据块时都能看到这个改动。

The GCS ensures cache coherency by requiring instances to acquire resources at a global level

before modifying a database block. The GCS synchronizes global cache access, allowing only one

instance to modify a block at a time.

GCS 通过要求实例在修改数据块之前先在全局级别申请资源，来保证cache的一致性。GCS将同步对全局cache的访问，

同一时间只允许一个实例来修改数据块。

Oracle uses a multiversioning architecture, in which there can be one current version of a block

throughout all instances in the cluster. Only the current version of a block may be updated. There

can also be any number of consistent read (CR) versions of the block. A consistent read version of

a block represents a snapshot of the data in that block at a specific point in time. The time is represented by the SCN.

Oracle使用多版本结构，这种结构下，在群集的多个实例中数据块只有一个当前版本。只有

数据块的当前版本容许被修改。同时，允许存在一系列数据块的读一致性版本。某个数据块的

读一致性版本代表了某个数据块在特定时间点的快照。这个时间点通过SCN来表示。

Consistent read blocks cannot be modified, though they can be used as a starting

point to construct earlier consistent blocks. The GCS manages both current and consistent read

blocks.

一致性读数据块不能被修改，但它能够作为构建更早的一致性数据块的起点。GCS管理数据块的

当前版本和一致性读版本。

If a local instance has modified a block and a remote instance requests it, the local instance

creates a past image (PI) of the block before it transfers the block to the remote instance.

如果一个本地实例修改了一个数据块，而某个远程实例请求这个数据块，那么本地实例将先建立一个

数据块的过去镜像(PI)，然后再把这个数据块传输到远程实例。

In the event of a node or instance failure, the PI can be used to reconstruct current and consistent read versions of the block.

在某个实例、节点失败的情况下，pi能够被用来构建数据块的当前以及一致性读版本。

Cache Fusion

Cache Fusion addresses several types of concurrency between different nodes:

• Concurrent reads

• Concurrent reads and writes

• Concurrent writes

Concurrent Reads

Concurrent reads on multiple nodes occur when two instances need to read the same block. In this

case, no synchronization is required, as multiple instances can share data blocks for read access

without any conflict.

Cache fusion

Cache Fusion 处理不同节点之间的几种并发类型：

• 并发读

• 并发读和写

• 并发写

并发读

当两个实例需要读取同一个数据块的时候，就会发生多个节点上的并发读。在这种情况下，

不需要任何同步机制，因为多个实例可以共享数据块读而不发生冲突。

Concurrent Reads and Writes

If one instance needs to read a block that was modified by another instance and has not yet been

written to disk, this block can be transferred across the interconnect from the holding instance to

the requesting instance. The block transfer is performed by the GCS background processes (LMSn)

on the participating instances.

并发读和写

如果一个实例需要读某个数据块，而这个数据块已经被别的实例改动过了，但脏数据还没有

写回硬盘，这个数据块可以通过内联网络从持有的实例传输到请求的实例上来。块的传输由参与实例上的GCS后台进程(LMSn)完成。

Concurrent Writes

When an instance updates a block in the buffer cache, the resulting block is called a dirty buffer.

Only the current version of the block can be modified. The instance must acquire the current version

of the block before it can modify it. If the current version of the block is not currently available,

the instance must wait.

并发写

当一个实例修改buffer cache中的数据块，修改之后的数据块叫做dirty buffer.只有数据块的当前

版本能够被修改。实例必须先获取数据块的当前版本，然后才能修改它。如果数据块的当前版本

无法获取，实例必须等待。

Before an instance can modify a block in the buffer cache, it must construct a redo record

containing all the changes that will be applied to the block. When the redo record has been copied to

the redo buffer, the changes it contains can be applied to the block(s) in the buffer cache. The dirty

block will subsequently be written to disk by the DBWn background process. However, the dirty block

cannot be written to disk until the change vector in the redo buffer has been flushed to the redo

log file.

在一个实例可以修改buffer cache中的数据块之前，它首先需要建立针对这个数据块的所有

redo信息。当redo信息被拷贝至redo buffer之后，redo信息就可以被应用到buffer cache中的

数据块了。脏数据块将随后通过DBWn后台进程被写到磁盘。然而，必须先把redo buffer中的

改变向量(change vector)写入到 redo log文件之后，脏块才可以写到磁盘上来。
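The ordering rule above (redo reaches the log file before the dirty block reaches the datafile) can be sketched in a few lines. This is a toy model, not Oracle's implementation: the class names `RedoLog` and `Buffer`, the sequence numbers, and the helper functions are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RedoLog:
    """Toy redo buffer: each record gets an increasing sequence number."""
    next_seq: int = 1
    flushed_up_to: int = 0  # highest sequence number already in the redo log file

    def copy_to_buffer(self) -> int:
        seq = self.next_seq
        self.next_seq += 1
        return seq

    def flush(self) -> None:
        # LGWR-like step: everything currently in the buffer reaches the log file.
        self.flushed_up_to = self.next_seq - 1


@dataclass
class Buffer:
    value: int = 0
    last_redo_seq: int = 0  # redo record protecting the latest change
    dirty: bool = False


def modify(buf: Buffer, redo: RedoLog, new_value: int) -> None:
    # The redo record is copied to the redo buffer before the block is changed.
    buf.last_redo_seq = redo.copy_to_buffer()
    buf.value = new_value
    buf.dirty = True


def can_write_to_disk(buf: Buffer, redo: RedoLog) -> bool:
    # DBWn-like check: the protecting redo must already be in the log file.
    return buf.last_redo_seq <= redo.flushed_up_to
```

In this sketch a freshly modified buffer fails `can_write_to_disk` until `flush()` has been called on the redo log.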

If the local instance needs to update a block, and it does not currently hold that block, it contacts

the resource master to identify whether any other instance is currently holding the block. If

a remote instance is holding a dirty version of the block, the remote instance will send the dirty

block across the interconnect, so that the local instance can perform the updates on the most recent

version of the block.

如果本地实例需要修改一个数据块，而它当前并不持有那个数据块，它将和资源属主（resource master)

联系，确认是否其他的实例正持有这个数据块。如果远程实例正持有一个数据块的脏版本，那么

远程实例将通过内联网络传输脏块，然后本地实例可以对这个数据块的当前版本进行修改操作。

The remote instance will retain a copy of the dirty block in its buffer cache until

it receives a message confirming that the block has subsequently been written to disk. This copy is

called a past image (PI). The GCS manages past images and uses them in failure recovery.

Note that a block does not have to be held by an instance in exclusive mode until the transaction

has completed. Once a local instance has modified a row in current version of the block, the block

can be passed to a remote instance where another transaction can modify a different row. However,

the remote instance will not be able to modify the row changed by the local instance until the transaction on the local instance either commits or rolls back.

注意数据块不必被某个实例以专有模式一直持有到事务结束。一旦某个本地实例已经修改

了某个数据块的当前版本的某行，这个数据块可以被传输到远程实例上，另外一个事务可以修改

不同的行。然而，远程实例不能够修改被本地实例修改过的行，除非本地的事务已经提交或

回滚。

In this respect, row locking behavior is identical to that on a single-instance Oracle database.

Prior to Oracle 8.1.5, if a local instance required a block that was currently dirty in the buffer

cache of another instance, the remote instance would write the block back to the datafile and signal

the local instance.

在这个方面，行级锁的表现与单实例数据库上行级锁的表现是完全相同的。

在oracle 8.1.5之前，如果一个本地实例需要一个数据块，而这个数据块在另外一个实例的

buffer cache中处于"脏"的状态,远程实例将把数据块写回数据文件，并通知本地实例。

The local instance would then read the block from disk into its buffer cache. This

process is known as a disk ping. Disk pings are very resource intensive, as they require disk I/O and

IPC communication between the instances.

本地实例将把数据块从磁盘读到buffer cache中来，这个过程叫做disk ping。Disk ping是非常

消耗资源的，因为它们需要磁盘I/O以及实例间的IPC通信。

In Oracle 8.1.5 and above, if the local instance required a block that was currently dirty in the

buffer cache of another instance for a consistent read, the remote instance would construct a consistent image of the block at the required SCN and send the consistent block across the interconnect.

Oracle 8.1.5以及以上，如果本地实例需要一个数据块的一致性读版本，而这个数据块在远程实例的buffer cache中是脏的状态，远程实例将先构建该数据块在所需SCN时间点的一致性读镜像，

再通过内连网络发送一致性数据块。

This algorithm was known as Cache Fusion Phase I and was a significant step forward in cluster

database technology. In Oracle 9.0.1 and above, both consistent blocks and current blocks can be sent across the interconnect. The transfer of current blocks is made possible by the existence of past images (PI). This algorithm is known as Cache Fusion Phase II.

这个算法叫做Cache fusion 阶段1 。这个技术在群集数据库技术中是相当前进的一步。

oracle 9.0.1以上，一致性数据块和当前的数据块都能通过内联网络进行传输。而传输当前

块的技术是因为PI 存在而变得可行的。这个算法叫做cache fusion 阶段2。

Although in a RAC database, Cache Fusion processing incurs overheads in the form of additional

messaging, it does not necessarily increase the amount of I/O performed against the storage.

虽然在RAC数据库，cache fusion 进程因为额外的信息传递而导致大的开销，但这不一定会

增加对磁盘阵列的 I/O 开销。

When a local instance attempts to read a block that is not currently in the local buffer cache, it first

contacts the resource master, which checks the current status of the block in the GRD. If a remote

instance is currently holding the block, the resource master requests that the remote instance send

the block to the local instance. For a consistent read, the remote instance will apply any undo necessary to restore the block to the appropriate SCN.

当一个本地实例尝试读一个数据块，而这个数据块不在本地的buffer cache中，它会首先和资源属主

(resource master)联系，资源属主将在GRD中检查当前数据块的状态。如果有一个远程实例持有

这个数据块，那这个资源属主将要求远程实例把数据块发送到本地的实例上来。为了一致性

读，远程实例将应用必要的undo信息来把数据块恢复到相应的SCN。

Therefore, if the local instance attempts to read a block that is in the cache of any other instance, it will receive a copy of the block over the interconnect network. In this case, it is not necessary for the local instance to read the block from disk.

因此，如果本地实例尝试读一个远程实例cache中的数据块，它将通过内连网络接收这个数据块的

副本。这种情况下，本地实例没有必要从磁盘读取数据块。

While this mechanism requires the participation of two or three instances, consuming CPU and

networking resources, these are generally less expensive than the cost of performing a single physical

disk I/O.

虽然这个机制需要两个或三个实例的参与，耗费CPU和网络资源，但这些开销通常比执行一次

物理磁盘I/O 要小。

When a local instance modifies a block, the changes are written immediately to the redo buffer

and are flushed to the redo log by the log writer (LGWR) background process when the transaction is

committed. However, the modified block is not written to disk by the database writer (DBWn) background

process until a free buffer is required in the buffer cache for another block or a checkpoint

occurs.

当一个本地实例修改一个数据块的时候，修改的信息将立即写入redo buffer，

并在事务commit时通过lgwr后台进程写入redo log。然而，只有当buffer cache 需要空闲的buffer或者check point发生

的时候，修改后的数据块才会被写入到磁盘中。

If a remote instance requests the block for modification, the block will not be written to disk

by the local instance. Instead, the block will be passed over the interconnect network to the remote

instance for further modification.

如果一个远程实例需要修改某个数据块，这个数据块不会通过本地实例写入到磁盘。而是通过

内联网络传输到远程节点，进行进一步的修改。

A past image (PI) block, which is a copy of the block at the time it was transferred, is retained in the buffer cache of the local instance until it receives confirmation that the remote instance has written the block to disk.

而PI 数据块是数据块在被传输那一刻的拷贝，它将保留在本地实例的buffer cache中，直到本地实例收到

确认，得知远程实例已经把该数据块写回到磁盘（此后这个PI数据块才可以被清除）。

As with reads, this mechanism requires the participation of two or three instances and is designed to avoid disk I/Os at the expense of additional CPU and networking resources.

与读操作一样，这个机制需要两个或三个实例的参与，其目的是通过耗费额外的CPU和网络资源

来避免磁盘IO。

The number of nodes involved in a read or write request that is satisfied by a block transfer

across the interconnect depends on the location of the resource master.

在读、写过程中，需要涉及到的节点个数依据于资源属主(resource master)的位置。

If the resource master is the same instance as the source instance for a read or the destination instance for a write, then only two instances will participate in the operation.

如果资源属主与读操作的源实例是同一个实例，或者与写操作的目标实例是同一个实例，

那么只有两个实例将参与这个操作。

If the resource master is on a different instance than the source or destination instance, three instances will participate in the operation. Obviously, there must be at least three active instances in the cluster for this situation to arise.

而如果资源属主与源实例、目标实例都不是同一个实例，三个实例将参与这个操作。当然，前提是

群集中至少有三个活动实例。
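The two-or-three-instance rule can be written down as a small counting function. This is only an illustrative sketch: the function name and its integer instance numbers are assumptions, and it simply counts the distinct roles (requesting instance, holding instance, resource master) described in the text.

```python
def participating_instances(requester: int, holder: int, master: int) -> int:
    """Count the distinct instances involved in one interconnect block transfer.

    requester: instance that asks for the block
    holder:    instance that currently holds the block and ships it
    master:    instance that is the resource master for the block
    """
    if requester == holder:
        raise ValueError("no transfer is needed when the requester already holds the block")
    # The master adds a third participant only if it is neither of the other two.
    return len({requester, holder, master})
```

With the instance numbers used in the examples that follow (requester 2, holder 4, master 3) three instances take part; if the master were Instance 4 or Instance 2 itself, only two would.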

六、Rac 的GES/GCS原理（6）

Cache Fusion Examples

The following section contains examples of the different types of behaviors of the Cache Fusion

algorithm for block reads and updates. Each example will follow a single block as it is requested by

various instances for both shared and exclusive access.

cache fusion 举例

下面的章节包含了Cache Fusion 算法在块的读取和更新中各种行为的例子。每个例子将跟踪一个

数据块，看它被不同实例要求来做共享或专用访问的情况。

Example 1: Current Read with No Transfer

In the following examples, the database cluster consists of four instances, each of which has access

to a database on shared disk. The block is initially only in the database on disk. It is currently at SCN

1318 (Figure 22-1)

在下面的例子中，数据库群集包含有4个实例，每个实例都有访问共享存储上数据库的权限。

这个数据块最开始只存在于数据库的磁盘上。它的当前SCN为1318。

Instance 2 requests a current read of the block (Figure 22-2).

It first identifies the instance that is the resource master for the block by applying a hash function

to the database block address (DBA) of the resource. In this case, the resource master instance

is Instance 3. As Instance 2 only needs to perform a current read on the block, it sends a message to

Instance 3 requesting shared access to the block.

Instance 3 checks in the GRD and discovers that the block is not currently being accessed by

any instances. Therefore, Instance 3 immediately grants shared access on the block to Instance 2

(Figure 22-3).

实例2要求读这个数据块。

实例2首先通过对资源的数据块地址(DBA)应用hash函数，来确定这个数据块的资源属主实例。

在我们的例子中，数据块的资源属主是实例3。因为实例2仅仅只需要做一个对数据块的当前读，

它首先发送一个消息给实例三，要求对数据块做一个共享访问。

实例三将首先检查GRD，实例三发现这个数据没有被任何实例所访问。因此，实例三将立即

把数据块的共享访问权限赋予实例二。

Instance 2 issues a system call to read the block directly from the database (Figure 22-4). The

block is currently at SCN 1318.

The block is returned from disk and a current version is stored in the buffer cache (Figure 22-5).

The SCN is still 1318.

实例2发起一个系统调用，直接从数据库读取数据块。这个数据块的SCN 在1318。

数据块从磁盘返回，而它的当前版本储存在buffer cache中，当前的版本SCN也是1318。
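The text says the resource master is found by applying a hash function to the database block address (DBA), but does not say which function. The sketch below uses a plain modulo purely as a stand-in; `resource_master` and its 1-based instance numbering are assumptions for illustration, not Oracle's actual algorithm.

```python
def resource_master(dba: int, instance_count: int) -> int:
    """Map a database block address to a master instance number (1-based).

    Any deterministic function works for the illustration: every instance
    computes the same answer locally, so no message is needed to find the master.
    """
    if instance_count < 1:
        raise ValueError("the cluster needs at least one instance")
    return dba % instance_count + 1
```

The point of the design is that the lookup itself costs no interconnect traffic; only the subsequent request to the master does.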

Example 2: Read-to-Write Transfer

This example shows how GCS satisfies a request by an instance to read a block for update where

that block is already held read-only by another instance.

Instance 1 requests an exclusive read on the block. Initially, Instance 1 is not aware that the

block is currently held by Instance 2.

Instance 1 identifies the instance that is the resource master

for the block by applying a hash function to the database block address (DBA) of the resource. Instance 1

sends a message to Instance 3 requesting exclusive access to the block (Figure 22-6).

例2：读-写传输

这个例子解释了GCS机制在下面条件下的工作原理：一个实例准备从某个数据块读数据以便

更新，而这个数据块已经被另外的实例以只读模式持有了。

实例1要求对这个数据块进行专有读。最开始，实例1并不知道这个数据块被实例2持有。

实例1通过对资源的数据块地址(DBA)应用hash函数，来确定这个数据块的资源属主实例。

实例1发送一个消息给实例3，要求对数据块进行专有访问。

Instance 3 checks in the GRD and discovers that the block is currently held by Instance 2 for

shared (read) access. Instance 3 sends a message to Instance 2 requesting that it send the block to

Instance 1 (Figure 22-7).

实例3检查GRD，发现这个数据块当前正被实例2持有，以共享读方式访问。

实例3发送消息，要求实例2把数据块发送给实例1。

Instance 2 sends the block, still SCN 1318, to Instance 1 and downgrades the shared lock to

a null lock (Figure 22-8).

实例2将把这个数据块（SCN仍然是1318）发送给实例1，并把共享锁（S）降级为空锁(null)。

Instance 1 receives the block and sends a message to Instance 3 to update the resource status

by setting the lock mode on the block to exclusive (X) (Figure 22-9). Instance 3 updates the GRD.

Instance 1 assumes an exclusive lock and updates the block at SCN 1320.

实例1收到这个数据块之后，将数据块的锁模式设置为专有模式（x),并发送消息给实例3,

要求实例3来修改资源状态。实例3修改GRD。实例1申明对此数据块一个专有锁，并修改数据

块到SCN 1320时间点。

Note that RAC is designed to minimize the number of messages that pass across the interconnect

network. Therefore, Instance 3 sends a message to Instance 2 requesting that it pass the block

to Instance 1 and downgrades the shared lock to a null lock. However, Instance 2 does not reply

directly back to Instance 3. Instead, it passes both the block and the resource status to Instance 1,

which forwards the updated resource status back to Instance 3.

注意RAC数据库的目的是减少内联网络中信息的传递。因此，实例三发送信息给实例2，要求实例

2把数据块发送给实例1并降级自己对数据块的锁模式由share 变为null。但是，实例2不会直接

把结果返回给实例三。实例2会首先把数据块和资源状态信息发送给实例1，实例1再把修改后的

资源状态信息发送给实例3。

Example 3: Write-to-Write Transfer

This example shows how GCS satisfies a request by an instance to read a block for update where

that block is already held for update by another instance.

例3：写-写传输

这个例子将展示GCS 如何满足这样的请求：一个实例读取某个数据块以便更新，而这个数据块

已经被另外一个instance以update 为目的持有了。

Instance 4 has also requested an exclusive read on the block. Initially, Instance 4 is not aware

that the block is currently held by Instance 1. Instance 4 identifies that Instance 3 is the resource

master instance for the block. Instance 4 sends a message to Instance 3 requesting exclusive access

to the block (Figure 22-10).

实例4也要求对这个数据块进行专有读。最开始，实例4不知道这个数据块已经被实例1

持有了。实例4将发现实例3是该数据块的资源属主。实例4将发送一个消息给实例3，要求

对这个数据块进行专有访问。

Instance 3 checks in the GRD and discovers that the block is currently held by Instance 1 for

exclusive (update) access. Instance 3 sends a message to Instance 1 requesting that it send the block

to Instance 4 (Figure 22-11).

实例3将检查GRD，并发现数据块当前正被实例1以专有(修改)模式持有。实例三发送消息给实例

1，要求它发送数据块给实例4

Instance 1 sends the block, still SCN 1320, to Instance 4 and downgrades the exclusive lock to

a null lock (Figure 22-12).

实例1将发送这个数据块（时间点仍然是1320）给实例4，并把专有锁降级为空锁。

Instance 4 receives the block and sends a message to Instance 3 to update the resource status,

again setting the lock mode on the block to exclusive (Figure 22-13). Instance 3 updates the GRD.

Instance 4 assumes an exclusive lock and updates the block at SCN 1323.

实例4将接收这个数据块，并发送消息给实例3来更新资源状态，再次把这个数据块上的锁

设置为专有模式。实例3更新GRD。实例4获得专有锁，并在SCN 1323的时间点

修改这个数据块。

Note that Instance 1 will retain a past image (PI) copy of the block at SCN 1320 until the current

version of the block (SCN 1323) is written to disk by Instance 4.

注意实例1将保留这个数据块在SCN 1320的过去镜像(PI)，直到该数据块的当前版本(scn 1323)

被实例4写回到磁盘。
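Examples 1 to 3 can be replayed against a toy version of the GRD entry for the block. This is a sketch under stated assumptions: the `Resource` class, the single-letter modes ("S", "X", "N" for shared, exclusive, null), and the `request` and `update` helpers are invented for illustration, and write-to-read transfers (Example 4) are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """GRD entry for one block, as tracked by the resource master (toy model)."""
    modes: dict = field(default_factory=dict)        # instance -> "S", "X" or "N"
    past_images: dict = field(default_factory=dict)  # instance -> SCN of its PI
    disk_scn: int = 1318
    current_scn: int = 1318


def request(res: Resource, requester: int, mode: str) -> str:
    """Grant `mode` to `requester`; return where the block comes from."""
    holders = [i for i, m in res.modes.items() if m in ("S", "X") and i != requester]
    if mode == "S" and any(res.modes[h] == "X" for h in holders):
        raise NotImplementedError("write-to-read transfers are not modelled here")
    source = f"instance {holders[0]}" if holders else "disk"
    if mode == "X":
        for h in holders:
            if res.modes[h] == "X" and res.current_scn > res.disk_scn:
                # Write-to-write: the sender keeps a past image of the dirty block.
                res.past_images[h] = res.current_scn
            res.modes[h] = "N"  # the sender downgrades to a null lock
    res.modes[requester] = mode
    return source


def update(res: Resource, instance: int, new_scn: int) -> None:
    if res.modes.get(instance) != "X":
        raise ValueError("only the exclusive holder may update the block")
    res.current_scn = new_scn
```

Replaying the text: Instance 2 requests "S" (served from disk), Instance 1 requests "X" and updates to SCN 1320, then Instance 4 requests "X" and updates to SCN 1323. The model ends with Instance 4 holding "X", Instances 1 and 2 holding "N", and a past image at SCN 1320 recorded for Instance 1.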

Example 4: Write-to-Read Transfer

The behavior of the write-to-read transfer is determined by the _FAIRNESS_THRESHOLD parameter,

which was introduced in Oracle 8.1.5 and defaults to 4. Prior to the introduction of this parameter,

when Instance A held a block in exclusive mode and Instance B requested a read-only copy of that

block, Instance A would downgrade its exclusive lock to a shared lock and send the block to Instance B, which would also set a shared lock on the block.

例4：写-读传输

写-读传输的行为由_FAIRNESS_THRESHOLD参数决定，这个参数从ORACLE 8.1.5 开始引入，

默认取值为4。在引进这个参数之前，当实例A以专有模式持有某个数据块，而实例B请求

这个数据块的只读拷贝时，实例A将把专有锁降级为共享锁，并把数据块发送给实例B，

而实例B也将对数据块加一个共享锁。

However, if Instance A is performing frequent updates on the block, it will need to reacquire the block and set an exclusive lock again. If this process is repeated frequently, then Instance A will be continually interrupted, as it has to downgrade the exclusive lock to a shared lock and wait until Instance B has finished reading the block before it can convert the shared lock back into an exclusive lock.

然而，如果实例A在频繁地对这个数据块做更新操作，它将需要重新获取这个数据块，并

再次设置专有锁。如果这个过程频繁重复，实例A将不断被打断，因为它必须把专有锁降级为共享锁，并等待实例B完成

对数据块的读操作之后，才能把共享锁转换回专有锁。

The _FAIRNESS_THRESHOLD parameter modifies this behavior. When this parameter is set,

Instance A will no longer downgrade the exclusive lock. Instead, it sends a null lock to Instance B,

and then it can continue processing. However, if instance B requests the block _FAIRNESS_THRESHOLD

times, by default 4, then Instance A will revert to the original behavior—it will downgrade the

exclusive lock to a shared lock and ship the block to Instance B, which will also set a shared lock on

the block.

而_FAIRNESS_THRESHOLD参数改变了这个行为。当这个参数被设置之后，

实例A将不再降级专有锁。取而代之，它将发送一个空锁给实例B，然后实例A可以继续

处理。然而，如果实例B请求这个数据块的次数达到了_FAIRNESS_THRESHOLD次

（默认是4次），实例A将恢复原来的做法：它将把专有锁降级为共享锁，然后把数据块传输到实例

B，实例B也会对这个数据块设置一个共享锁。

This behavior is explained in the following example.

Instance 2 requests a current read on the block. Instance 2 is not aware that the block is currently

held by Instance 4. Instance 2 identifies that Instance 3 is the resource master instance for the block

and sends a message to Instance 3 requesting shared access to the block (Figure 22-14).

这个表现是以下面的例子来解释的：

实例2需要对这个数据块做一个当前读。实例2并不知道这个数据块被实例4持有。实例2发现实例

3是数据块的资源属主，于是发送消息给实例3，要求对这个数据块做一个共享访问。

Instance 3 checks in the GRD and discovers that the block is currently held by Instance 4 for

exclusive (update) access. Instance 3 sends a message to Instance 4 requesting that it send the block

to Instance 2 in shared mode (Figure 22-15).

实例三检查GRD，发现数据块正在被实例4以专有锁持有。实例3发送一个消息给实例4，要求

实例4 以共享模式传输数据块给实例2。

Instance 4 sends the block to Instance 2 (Figure 22-16). Because the block has been requested

less than _FAIRNESS_THRESHOLD times, Instance 4 retains the exclusive lock, and Instance 2 receives

a null lock.

实例4将传输这个数据块给实例2。因为这个数据块被请求的次数少于_FAIRNESS_THRESHOLD

的次数，实例4将保留exclusive锁模式，而实例2将收到一个空锁。

Instance 2 receives the block and sends a message to Instance 3 to update the resource status

(Figure 22-17). Instance 3 updates the GRD. Instance 2 assumes a null lock on the block and reads it.

Instance 4 can update the block for which it continues to hold an exclusive lock.

实例2收到了这个数据块，并发送信息到实例3来更新资源状态。实例3将更新GRD。实例2

对这个数据块持有一个空锁并读取它。同时，实例4能够更新这个数据块，因为它还持有这

个数据块的专有锁。

The preceding process can be repeated the number of times specified by the

_FAIRNESS_THRESHOLD parameter. By default, the value of this parameter is 4.

Instance 2 has now flushed the block from its buffer cache and needs to read it again. It sends

a read request to the resource master, which is Instance 3 (Figure 22-18).

上面的过程可以重复多次，具体次数由参数_FAIRNESS_THRESHOLD决定。

这个参数的默认值是4。实例2现在已经把数据块从buffer cache 清除了，又需要

再读取这个数据块。它这时又把要求发送给资源属主，属主是实例3

Instance 3 (the resource master) sends a message to Instance 4 requesting that it send a copy

of the block to Instance 2 (Figure 22-19). Instance 4 notes that it has already sent the block

_FAIRNESS_THRESHOLD times to Instance 2.

实例3（资源属主）发送消息给实例4，要求它发送数据块的拷贝给实例2。实例4发现，

自己已经向实例2发送了_FAIRNESS_THRESHOLD 次这个数据块。

Instance 4, therefore, downgrades the exclusive lock to a shared lock and sends the block to

Instance 2 (Figure 22-20).

因此，实例4降级本地对这个数据块的专有锁到共享锁模式，并把数据块发送给实例2

Instance 2 receives the block, assumes a shared lock, and sends a message to Instance 3 to

update the resource status, setting the lock mode on the block to shared (Figure 22-21). Instance 3

updates the GRD.

实例2收到这个数据块，获得一个共享锁，并发送消息给实例3来更新资源状态，

把该数据块的锁模式设置为共享。实例3更新GRD。
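The effect of _FAIRNESS_THRESHOLD in Example 4 can be modelled with a simple counter on the exclusive holder. The sketch below is an assumption-laden illustration: `ExclusiveHolder`, `serve_read`, and the reset of the counter after a downgrade are invented here and are not taken from Oracle's code; only the default of 4 and the null-then-shared behavior come from the text.

```python
FAIRNESS_THRESHOLD = 4  # default value stated in the text


class ExclusiveHolder:
    """Toy model of the instance that holds a block in exclusive mode."""

    def __init__(self, threshold: int = FAIRNESS_THRESHOLD):
        self.threshold = threshold
        self.mode = "X"
        self.reads_served = 0

    def serve_read(self) -> str:
        """Ship the block to a reader; return the lock mode the reader receives."""
        if self.mode == "X" and self.reads_served < self.threshold:
            self.reads_served += 1
            return "N"  # holder keeps its exclusive lock, reader gets a null lock
        # Threshold reached: fall back to the original behavior and share the block.
        self.mode = "S"
        self.reads_served = 0  # assumption: the count starts over after a downgrade
        return "S"
```

Calling `serve_read()` five times on a fresh holder hands out a null lock for the first four requests and a shared lock on the fifth, which matches the sequence in Figures 22-14 through 22-21 as described.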

Past Images

Recall that a past image is a copy of the block, including any changes, that is retained by the sending

instance when a block is transferred to another instance.

过去镜像是某块的一个复制，包含所有的改动，当数据块传输到别的实例上的时候，它会被发送实例保存。

Past images are only created for write-write transfers, which occur when a block is modified by one instance and the same block then needs to be changed by another instance.

过去镜像只因为写-写传输才产生，就是当一个数据块已经被某个实例改动了，又被一个不同的实例

申请访问并修改。

Past images are retained for performance reasons, because in certain circumstances, they can reduce recovery time following an instance failure.

保留过去镜像是出于性能的原因，因为在某些情况下，它可以减少实例失败之后的恢复时间。

You saw in Example 3 that Instance 1 kept a past image copy of the block at SCN 1320 until the

current version of the block (SCN 1323) was written back to disk by Instance 4. As all changes are

recorded in the redo logs, you may be wondering why it is necessary to retain past images.

在例子3中，实例1保存了某个数据块在scn 1320的过去镜像，直到scn 为1323的数据块的当前

版本被实例4写回到磁盘。要知道所有的改动都会被redo 日志所记载，所以你可能会奇怪为什么

有必要保留过去镜像。

Consider the following scenario: Instance A reads a block into cache and then modifies it 1,000

times. Before the block has been written back to disk, Instance B requests the current version of the

block to update it further. Instance A transfers the block and Instance B modifies it once. Instance B

fails before the block is written back to disk.

考虑下面的场景：实例A把一个数据块读入cache，然后修改它1000次。在这个数据块被写回

磁盘之前，实例B请求这个数据块的当前版本以便进一步更新。实例A传输了这个数据块，

而实例B修改了它1次。实例B在数据块被写回磁盘之前失败了。

If Instance A did not retain a past image copy of the block at the time of transfer to Instance B,

then in order to perform recovery for Instance B, it would be necessary to read the block from disk

and apply all 1,001 changes.

如果在把最新时间点镜像块传输给实例B前，实例A没有保存数据块的过去时间点镜像，

如果发生了上面的情况要恢复数据，那则有必要从磁盘读取这个数据块，并应用redo 日志

1001次。

However, if Instance A retained a past image, Instance A could be signaled to write the past image of the block back to disk, and only the redo for the single change made

by Instance B would need to be applied.

然而，如果实例A保留了过去镜像，就可以通知实例A把这个数据块的过去镜像写回磁盘，

这时，只需要应用实例B所做的那一次修改的redo。
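The arithmetic behind the 1,001-versus-1 comparison is small enough to write as a function. The name `redo_changes_to_apply` and its parameters are hypothetical; the function only restates the scenario in the text, in which the on-disk block contains none of the changes.

```python
def redo_changes_to_apply(changes_by_a: int, changes_by_b: int, past_image_kept: bool) -> int:
    """Redo changes needed to rebuild the block after Instance B fails."""
    if past_image_kept:
        # The past image on Instance A already contains all of A's changes,
        # so only the changes made by B after the transfer must be replayed.
        return changes_by_b
    # Without a past image, recovery starts from the on-disk version.
    return changes_by_a + changes_by_b
```

For the scenario above, `redo_changes_to_apply(1000, 1, False)` gives 1001 and `redo_changes_to_apply(1000, 1, True)` gives 1.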

Disk Writes

Dirty blocks in the buffer cache are written to disk when the instance requires additional buffers to

satisfy free buffer requests or when it performs a checkpoint.

磁盘写

当实例要求额外的buffer 时，当发生checkpoint 事件时，buffer cache中的脏数据需要被写回磁盘。

Write requests can originate from any instance that has a current or past image of a block. The GCS ensures that only the current version of the block is written back to disk.

写要求可以源自任何有数据块的当前镜像或过去镜像的实例。GCS将确保只有数据块的当前版本

被写回到磁盘。

In addition, it checks that any past image versions of the block are purged from the buffer caches of all other instances.

此外，它还会确认这个数据块的所有过去镜像版本都已经从其他所有实例的buffer cache

中清除。

In the following example, the instance holding a past image buffer in null mode requests that

Oracle write the buffer to disk.

在如下的例子中，以null模式持有某个数据块过去镜像buffer的实例，请求oracle把这个buffer写回磁盘。

In Figure 22-22, Instance 4 has an exclusive lock on the current version of the block at SCN

1323. The block is dirty. Instance 1 has a null lock on a past image of the block at SCN 1320.

在图22-22中，实例4对某个在SCN 1323 当前版本的数据块持有一个专有锁。这个数据块是

脏数据块。而实例1对此数据块在SCN 1320的过去镜像持有一个空锁。

Instance 1 requires the buffer occupied by the past image of the block at SCN 1320. The shared disk currently contains the block at SCN 1318.

实例1需要使用被该数据块SCN 1320的过去镜像所占用的buffer。而共享磁盘中当前包含的是这个数据块在

scn 1318的版本。

Instance 1 sends a message to Instance 3, which is the resource master instance for the block,

requesting permission to flush the past image (Figure 22-23).

实例1发送消息给实例3（实例3是这个数据块的资源属主），请求允许清除(flush)过去镜像。

Instance 3 sends a message to Instance 4, which is holding the current version of the block,

requesting that it write the block to disk (Figure 22-24).

Instance 4 writes the block to disk (Figure 22-25).

实例3发送一个消息给实例4（实例4正持有数据块的当前版本），要求它把数据块写入磁盘。

实例4把数据块写入磁盘。

Instance 4 notifies the resource master, which is Instance 3, that the block has been successfully

written to disk. The resource role can now become local, as only Instance 1 is holding a copy

of the block.

实例4通知资源属主(实例3）,数据块已经成功写入磁盘。这个资源现在可以本地化了，因为

目前只有实例1持有块的拷贝。

Instance 3 sends a message to Instance 1 requesting that it flush the past image for the block

(Figure 22-27). Instance 1 frees the buffer used by the past image and releases the resource.

实例3发送一个消息给实例1，要求它清除该数据块的过去镜像。实例1释放被过去镜像

占用的buffer，并释放资源。

At this point, if more than one instance was holding a past image of the block, the resource

master would signal each instance requesting that the block be flushed.

在这个时间点，如果超过1个实例持有了这个数据块的过去时间点镜像，资源属主

将向其中每个实例发出信号，要求它把数据块清除。
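The write sequence in Figures 22-22 through 22-27 can be summarized as an ordered list of messages. The function below is a sketch only: `handle_write_request`, its tuple-based message format, and the message wording are assumptions made to illustrate the order of events described in the text.

```python
def handle_write_request(requester, master, current_holder, current_scn, past_images):
    """Replay one write request that starts from a past image holder.

    past_images maps instance number -> SCN of the past image it holds.
    Returns (disk_scn, remaining_past_images, messages), where each message
    is a (sender, receiver, text) tuple in the order it would be sent.
    """
    messages = [
        (requester, master, "request to flush past image"),
        (master, current_holder, "write current block to disk"),
    ]
    # Only the current version of the block is written, never the past image.
    disk_scn = current_scn
    messages.append((current_holder, master, "write complete"))
    # The master then tells every past image holder to discard its copy.
    for instance in sorted(past_images):
        messages.append((master, instance, "discard past image"))
    return disk_scn, {}, messages
```

With the values from the example (requester 1, master 3, current holder 4, current SCN 1323, one past image at SCN 1320 on Instance 1), the disk ends up at SCN 1323 and four messages are exchanged.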

System Change Numbers (SCNs)

In order to track changes to data blocks, Oracle assigns a numeric identifier to each version of the block. This identifier is effectively a logical time stamp and is known as an SCN. Oracle records the SCN in the redo record header for each set of changes in the redo log. It is also included in both the undo and data block changes.

为了跟踪对数据块的改变，ORACLE 为数据块的每个版本分配一个数字标识符。

这个标识符实际上是一个逻辑时间戳，被叫做SCN。Oracle 在redo log中每组改变的redo记录

头部记录SCN。它也包含在undo 和数据块的改动中。

When a session requests a consistent-read for a block, it specifies the block number and the SCN.

If the current version of the block contains changes applied after the requested SCN (i.e., the current

version of the block has a higher SCN), the current version will be cloned to a new buffer and

undo will be applied to roll back the changes on the data block to the requested SCN.

当一个会话需要某个数据块的一致性读镜像，它将指定数据块号以及SCN。如果当前版本的数据块

包含了在要求SCN时间点之后改动的数据（即数据块的当前版本比我们要的时间版本更高），

当前版本将被克隆到一个新的buffer中，而undo将被应用来把数据块恢复到所要求的时间点。
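Building a consistent read version (clone the current buffer, then apply undo back to the requested SCN) can be sketched as follows. The `UndoRecord` type, the row-keyed dictionary standing in for a block, and the `consistent_read` function are simplifying assumptions for illustration; real undo records and block formats are far more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UndoRecord:
    scn: int           # SCN at which the change was made
    row: str           # row that was changed
    old_value: object  # value of the row before the change


def consistent_read(current_rows: dict, undo: list, requested_scn: int) -> dict:
    """Return a consistent read copy of the block as of `requested_scn`."""
    clone = dict(current_rows)  # the current version itself is left untouched
    # Roll back, newest change first, every change made after the requested SCN.
    for rec in sorted(undo, key=lambda r: r.scn, reverse=True):
        if rec.scn > requested_scn:
            clone[rec.row] = rec.old_value
    return clone
```

Because the function works on a copy, the current version stays available for updates while any number of consistent read versions are built from it.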

The SCN is also sometimes called the System Commit Number, because on OLTP systems with

short transactions, it appears that a new SCN is issued every time a transaction commits. However,

this is not entirely correct; for a long-running transaction, several SCNs may be used during the lifetime

of the transaction. In this case, changes written to the redo log for the same transaction will

have different SCNs.

SCN也叫做系统提交号，因为在OLTP系统中有很多短的事务，你会发现每次一个事务提交的时候

一个新的SCN就会产生。然而，这不是完全正确的；针对一个长时间运行的事务，它的生命

周期里可能会使用多个SCN。在这种情况下，针对同样的事务，写入redo log的记录可能会有不同

的SCN。

Therefore, the name System Change Number is more appropriate.

Changes are written to the redo log in SCN order; therefore, if the database needs to be recovered,

the redo log files can be scanned sequentially.

因此，名字系统改变号可能比系统提交号更加合适。写给redo 日志的改动是以SCN的顺序

提交的。因此，如果这个数据库需要被修复，我们可以顺序的扫描redo 日志文件。

In a RAC environment, the SCN is used to order data block change events within each instance

and across all instances. When a RAC database is recovered, the SCN is used to synchronize the

application of changes recorded in each of the redo logs.

在RAC环境中，SCN 是用来在一个实例内或在所有的实例间对数据块的改变事件进行排序的。

当一个RAC数据库被恢复的时候，SCN将被用来同步应用对数据库的改变，这些改变都被记录

在每个redo 日志里了。

The generation of SCNs in a RAC database is more complex. Conceptually SCNs are generated

by a single, central point similar to a sequence in a single-instance database. However, in reality,

each instance generates SCNs independently of the other instances.

在RAC数据库中,SCN的产生更加复杂。从概念上说，SCN是由一个单一的中心点产生的，

类似于单实例数据库中的序列。然而，现实中，每个实例都独立于其它实例

产生SCN。

As each instance can generate its own series of SCNs, a busy instance may generate more SCNs than a quiet instance. If this were allowed to happen, synchronizing the redo logs of the two instances during recovery processing would not be possible; SCNs must be regularly synchronized between instances to ensure that all instances are using the same SCN.

因为每个实例可以产生它自己的一系列SCN，一个繁忙的实例可能比一个安静的实例产生

更多的SCN。如果我们允许这个现象发生，在恢复过程中对两个实例的redo 日志的同步

将变得不再可能。SCN 必须在实例间定期的同步，保证所有的实例使用同样的SCN。

Therefore, Oracle maintains the SCN globally. On Linux, by default, the SCN is maintained by

GCS using the Lamport SCN generation scheme. This scheme generates SCNs in parallel on all

instances.

因此，ORACLE在全局范围内维护SCN。在linux上，默认情况下SCN 由GCS 维护，使用

Lamport SCN 产生机制。这个机制在所有实例上并行地产生SCN。

These generated SCNs are piggybacked on all messages passed between instances. If

a local instance receives an SCN from a remote instance that is higher than its current SCN, then the

SCN of the remote instance is used instead. Therefore, multiple instances can generate SCNs in parallel without additional messaging between the instances.

这些SCN 被实例间传输的信息所承载。如果本地实例收到了远程实例的SCN，这个SCN比

当前的SCN更高，那么远程实例的SCN将被使用。因此，多个实例的进程可以并行的产生

SCN，而不需要进程之间额外的通信。
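The piggybacking rule described above is the classic Lamport logical clock idea: take the maximum of the local and received counters. The sketch below illustrates only that rule; the `Instance` class and its method names are assumptions and do not reflect how Oracle actually stores or transmits SCNs.

```python
class Instance:
    """Toy instance with a Lamport-style local SCN counter."""

    def __init__(self, name: str, scn: int = 0):
        self.name = name
        self.scn = scn

    def next_scn(self) -> int:
        # Local generation: no message to any other instance is required.
        self.scn += 1
        return self.scn

    def send(self) -> int:
        # Every outgoing message carries the sender's current SCN.
        return self.scn

    def receive(self, piggybacked_scn: int) -> None:
        # Adopt the remote SCN only if it is higher than the local one.
        self.scn = max(self.scn, piggybacked_scn)
```

A busy instance that has generated ten SCNs brings a quiet one up to date with a single ordinary message: after `quiet.receive(busy.send())`, the quiet instance continues from the busy instance's SCN rather than from its own lower value.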

Note that if the instances are idle, messages will still be exchanged between instances at regular

intervals to coordinate the SCN. Each change to the SCN is recorded in the redo log. Therefore, an

idle instance will always be performing a small amount of work and generating redo.

注意，即使实例处于空闲状态，实例间仍然会定期交换消息来协调SCN。每个SCN的改变

都记录在redo 日志中。因此，一个空闲的实例也总会做少量的工作并产生redo。

Consequently, it is more difficult to perform controlled tests in a RAC environment than in single-instance Oracle.

因此，在RAC 环境中比在单节点环境中，做一个可控的测试往往更难。

Prior to Oracle 10.2, the Lamport SCN generation scheme was used when the value for the

MAX_COMMIT_PROPAGATION_DELAY parameter was larger than the default value of 7 seconds. If this

parameter was set to a value less than 7, Oracle would use the hardware clock for SCN generation.

在ORACLE 10.2之前，Lamport SCN产生机制在这种情况下使用，即 MAX_COMMIT_PROPAGATION_DELAY 参数比默认的 7秒更大。如果这个参数被修改到比

7小，oracle将使用硬件时钟机制来产生SCN。

In Oracle 10.2 and above, the broadcast on commit scheme is the default. You can verify which SCN generation scheme is in use by checking the alert log. For example, the alert log may contain the following:

Picked broadcast on commit scheme to generate SCNs

which indicates that the broadcast on commit SCN generation scheme is in use.
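The alert log check described above could be automated along the following lines. This is a minimal sketch: the helper name find_scn_scheme is hypothetical, the sample text is made up apart from the message line quoted above, and the pattern assumes that other schemes are reported with the same wording.

```python
import re

# Assumption: every scheme is reported as "Picked <scheme> scheme to generate SCNs".
SCHEME_PATTERN = re.compile(r"Picked (.+?) scheme to generate SCNs")


def find_scn_scheme(alert_log_text):
    """Return the most recently reported SCN generation scheme, or None."""
    matches = SCHEME_PATTERN.findall(alert_log_text)
    return matches[-1] if matches else None


# Illustrative input only; a real alert log contains many other lines.
sample = """\
(other alert log lines)
Picked broadcast on commit scheme to generate SCNs
(other alert log lines)
"""

print(find_scn_scheme(sample))  # broadcast on commit
```

In practice the text would be read from the instance's alert log file, whose location depends on the installation.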

In Oracle 10.2 and above, Oracle recommends that the MAX_COMMIT_PROPAGATION_DELAY parameter is always set to 0.

7. RAC GES/GCS Principles (7)

Optimizing the Global Cache

In this section, you have seen that the amount of work performed by the cluster in support of the Global Cache Service is highly dependent on the usage of data blocks within the cluster.

Although most applications will run on RAC without modification, you can reasonably expect to optimize their performance by partitioning them across the nodes.

The benefit of application partitioning is that the aggregate size of the buffer cache across all instances is effectively increased, since fewer duplicate data blocks will be stored within it.
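The effect of partitioning on the aggregate buffer cache can be sketched with a toy model. The figures and the helper name distinct_cached_blocks are hypothetical, and the two workloads are idealized extremes (fully overlapping versus fully disjoint), so this only illustrates the direction of the effect.

```python
def distinct_cached_blocks(caches):
    """Count the distinct block ids held across all instance buffer caches."""
    return len(set().union(*caches))


CACHE_SIZE = 1000  # blocks per instance (illustrative figure)

# Unpartitioned: both instances cache the same hot blocks.
unpartitioned = [set(range(CACHE_SIZE)), set(range(CACHE_SIZE))]

# Partitioned: each instance caches its own disjoint range of blocks.
partitioned = [set(range(CACHE_SIZE)), set(range(CACHE_SIZE, 2 * CACHE_SIZE))]

print(distinct_cached_blocks(unpartitioned))  # 1000
print(distinct_cached_blocks(partitioned))    # 2000
```

With the same memory on each node, the partitioned workload keeps twice as many distinct blocks in memory because no block is cached twice.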

The partitioning option, which has been available since Oracle 8.0, allows you to physically partition tables. The introduction of database services in Oracle 10.1 provides a built-in mechanism that can be used to implement application partitioning. In particular, database services and the partitioning option can be used together to achieve node affinity for each physical partition. Finally, the introduction of dynamic resource remastering at object level allows Oracle to optimize the use of partitioned objects across the cluster by ensuring that they are locally mastered.

The key, therefore, to optimizing the global cache is to design your application in such a way that you minimize the number of blocks that are shared between instances, which will minimize the amount of interconnect traffic and the work performed by remote instances.

Other ways to achieve this optimization might include implementing sequences, reverse key indexes, global temporary tables, or smaller block sizes, or rebuilding tables with fewer rows per block.
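Of the techniques listed above, reverse key indexes are perhaps the least intuitive. The idea is that reversing the bytes of each key scatters consecutive values, so that instances inserting sequential keys do not all contend for the same index leaf block. The sketch below uses fixed-width big-endian integers as a stand-in for Oracle's internal key format, which is an assumption made purely for illustration; reverse_key is a hypothetical helper.

```python
def reverse_key(value, width=4):
    """Reverse the bytes of a fixed-width big-endian integer key."""
    return value.to_bytes(width, "big")[::-1]


keys = range(1000, 1008)  # e.g. consecutive values drawn from a sequence

# The leading byte dominates the sort order, and so the position in the index.
normal_leading = {k.to_bytes(4, "big")[0] for k in keys}
reversed_leading = {reverse_key(k)[0] for k in keys}

print(len(normal_leading))    # 1: all keys sort next to each other
print(len(reversed_leading))  # 8: every key sorts into a different region
```

The trade-off is that range scans over the original key order can no longer use adjacent index entries.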

Instance Recovery

In the final section of this chapter, we will briefly describe instance recovery. While it is important that your cluster can survive an instance failure and that the impact on throughput during instance recovery is acceptable to your business, understanding the concepts behind instance recovery is less important than understanding those underlying GCS or GES. This is because instance recovery only has a temporary performance impact, whereas the Global Services can affect your performance at all times.

In RAC, recovery is performed when an instance fails and another Oracle instance detects the failure. If more than one node fails at the same time, all of the failed instances will be recovered.

The amount of recovery processing required is proportional to the number of failed nodes and the amount of redo generated on those nodes since the last checkpoint. Following recovery, data blocks become available immediately.

When an instance detects that another instance has failed, the first phase of the recovery is the reconfiguration of the GES. Following this, the GCS resources can be reconfigured. During this phase, all GCS resource requests and write requests are suspended, and the GRD is frozen. However, instances can continue to modify data blocks as long as these blocks are in the local buffer cache and appropriate enqueues are already held.

In the next phase, the redo log file of the failed instance is read to identify all the blocks that may need to be recovered. In parallel, the GCS resources are remastered. This remastering involves redistributing the blocks in the buffer cache of the remaining instances to new resource masters. At the end of this phase, all blocks requiring recovery will have been identified.

In the third phase, buffer space is allocated for recovery and resources are created for all of the blocks requiring recovery.

The buffer caches of all remaining instances are searched for past images (PIs) of blocks that may have been in the buffer cache of the failed instance at the time of failure. If a PI buffer exists for any block, then this buffer is used as a starting point for the recovery of that block. At this point, all resources and enqueues that will be required for the recovery process have been acquired. Therefore, the GRD can be unfrozen, allowing any data blocks that are not involved in the recovery process to be accessed. The system is now partially available.

In the next phase, each block that has been identified as needing recovery is reconstructed from a combination of the original block on disk, the most recent past image available, and the contents of the redo log of the failed instance. The recovered block is written back to disk, immediately after which the recovery resources are released, so that the block becomes available again. When all blocks have been recovered, the system is fully available again.

A block may have been modified by the failed instance but not exist as a past image or a current block in any of the remaining instances. In this case, Oracle uses the redo log files of the failed instance to reconstruct the changes made to the block since the block was last written to disk.
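The block reconstruction step described above can be sketched as follows. This is a simplified model, not Oracle's recovery code: recover_block is a hypothetical helper, images are modeled as (scn, data) tuples, and redo records as (scn, change) tuples.

```python
def recover_block(disk_image, past_images, redo_records):
    """Rebuild one block from its best starting image plus later redo.

    Returns (scn, data, number_of_redo_records_applied). The tuple and
    dict shapes used here are hypothetical simplifications.
    """
    # Start from the newest image available: the most recent past image
    # if one exists, otherwise the block as it was last written to disk.
    start_scn, start_data = max([disk_image, *past_images], key=lambda img: img[0])
    data = dict(start_data)
    scn = start_scn
    applied = 0
    # Apply, in SCN order, only the redo generated after the starting image.
    for redo_scn, change in sorted(redo_records, key=lambda rec: rec[0]):
        if redo_scn > scn:
            data.update(change)
            scn = redo_scn
            applied += 1
    return scn, data, applied


disk = (10, {"balance": 100})
redo = [(12, {"balance": 120}), (15, {"balance": 150}), (18, {"balance": 90})]

# With a past image at SCN 15, only the redo at SCN 18 has to be applied.
print(recover_block(disk, [(15, {"balance": 150})], redo))  # (18, {'balance': 90}, 1)

# Without a past image, all redo since the disk version is applied.
print(recover_block(disk, [], redo))  # (18, {'balance': 90}, 3)
```

Both paths end with the same block contents; the past image simply shortens the amount of redo that has to be applied.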

The buffer cache is flushed to disk by the database writer process when free buffers are required for new blocks. It is also flushed to disk whenever there is a database checkpoint, which occurs whenever there is a log file switch. Therefore, you only need to recover redo generated since the last log file switch. Consequently, the amount of redo and the amount of time required to apply the redo will be proportional to the size of the redo log file.
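The proportionality stated above can be made concrete with a back-of-the-envelope estimate. This is a sketch only: the function name and the redo apply rate of 10 MB per second are hypothetical, and real recovery time depends on many other factors.

```python
def worst_case_recovery_seconds(log_file_mb, apply_rate_mb_per_s):
    """Upper-bound estimate: one full log file of redo since the last switch."""
    return log_file_mb / apply_rate_mb_per_s


# Assumed apply rate; measure your own system rather than relying on this figure.
APPLY_RATE_MB_PER_S = 10

for size_mb in (50, 200, 1000):
    print(size_mb, worst_case_recovery_seconds(size_mb, APPLY_RATE_MB_PER_S))
# 50 5.0
# 200 20.0
# 1000 100.0
```

Under this model, a redo log file that is twenty times larger takes twenty times longer to apply in the worst case.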

You can reduce the amount of time required for recovery by adjusting the mean time to recovery using the FAST_START_MTTR_TARGET parameter, which allows you to specify the maximum amount of time, in seconds, in which you wish the database to perform crash recovery.

The FAST_START_MTTR_TARGET parameter adjusts the frequency of checkpoints to limit potential recovery time. This parameter was introduced in Oracle 9.0.1 and replaces two older parameters: FAST_START_IO_TARGET, which specifies the target number of database blocks to be written in the event of a recovery, and LOG_CHECKPOINT_INTERVAL, which specifies the frequency of checkpoints in terms of 512-byte redo log blocks.
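To get a feel for the units of the older LOG_CHECKPOINT_INTERVAL parameter, the small conversion below turns a setting expressed in redo log blocks into the volume of redo written between checkpoints. It is a sketch that assumes the 512-byte block size stated above; the function name and the example value of 20480 are hypothetical.

```python
REDO_BLOCK_BYTES = 512  # redo log block size as stated in the text


def redo_bytes_between_checkpoints(log_checkpoint_interval):
    """Convert an interval in redo log blocks to bytes of redo per checkpoint."""
    return log_checkpoint_interval * REDO_BLOCK_BYTES


interval = 20480  # illustrative setting, not a recommendation
mib = redo_bytes_between_checkpoints(interval) / (1024 * 1024)
print(mib)  # 10.0
```

A time-based target such as FAST_START_MTTR_TARGET avoids this kind of manual sizing, since the desired recovery time is stated directly in seconds.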

Summary

This chapter described some of the internal algorithms that are used within a RAC database. While understanding the exact implementation of each algorithm is not essential, having an appreciation of them when designing new systems is highly beneficial.

As you have seen, much of the performance of an Oracle RAC database depends on the application. Therefore, testing and benchmarking your application prior to going into production are vital to determine its effect on GCS and GES. By aiming to minimize the amount of interconnect traffic your application generates, you will maximize both performance and scalability.
