wait for a undo record模拟和解决

来源:互联网 发布:软件测试未来发展方向 编辑:程序博客网 时间:2024/04/30 15:32
       今天在生产环境的数据库被hang住,检查当时数据库的等待事件:wait for a undo record,并且有大量的的Txn Recovery Start。现在模拟和记录下这个问题的解决思路和过程

1.故障模拟
SQL> create tablespace ydtbs datafile '/u01/oradata/orcl/dytbs01.dbf' size 5000M;

Tablespace created.

SQL> create table test tablespace ydtbs as select * from dba_objects;

Table created

SQL> insert into test select * from test;

15919040 rows created.

Elapsed: 00:08:20.53

SQL>
SQL> select count(*) from test;

  COUNT(*)
----------
  31838080


在另一个会话里面:
SQL> col sql_text format a50
SQL> l
  1  select SQL_ID,HASH_VALUE,sql_text from v$sql
  2* where sql_text like '%delete%test%'
SQL> /

SQL_ID        HASH_VALUE SQL_TEXT
------------- ---------- --------------------------------------------------
b11zktbz980db 4271112619 select SQL_ID,HASH_VALUE,sql_text from v$sql where
                          sql_text like '%delete%test%'

cb5m2z04qzgzn  158318580 delete from test


SQL> select sid,serial# from v$session where sql_hash_value = 158318580;

       SID    SERIAL#
---------- ----------
       159          3



SQL> alter system kill session '159,3';

第一个会话
SQL> delete from test;
delete from test
            *
ERROR at line 1:
ORA-00028: your session has been killed

在第二个会话检查等待事件:
SQL> col event format a50
SQL> select sid, event, p1,p2,p3 from v$session_wait;

       SID EVENT                                                      P1         P2         P3
---------- -------------------------------------------------- ---------- ---------- ----------
       142 wait for a undo record                                      0          0          0
       143 db file sequential read                                     5      59452          1
       145 db file sequential read                                     5      59450          1
       146 SQL*Net message from client                        1650815232          1          0
       147 SQL*Net message to client                          1650815232          1          0
       148 db file sequential read                                     5      59451          1
       159 SQL*Net message from client                        1650815232          1          0
       160 rdbms ipc message                                         100          0          0
       161 rdbms ipc message                                         300          0          0
       162 rdbms ipc message                                         300          0          0
       163 rdbms ipc message                                         500          0          0
       164 rdbms ipc message                                      180000          0          0
       165 wait for stopper event to be increased                      0          0          0
       166 log file parallel write                                     1       1906          1
       170 pmon timer                                                300          0          0

21 rows selected.

SQL>
SQL> select pid, state, undoblocksdone from v$fast_start_servers;

       PID STATE       UNDOBLOCKSDONE
---------- ----------- --------------
        21 RECOVERING            6407
        22 RECOVERING               0
        23 RECOVERING               0
        24 RECOVERING               0

SQL> select ADDR,KTUXEUSN,KTUXESLT,KTUXESQN,KTUXESIZ from x$ktuxe where KTUXECFL='DEAD';

ADDR       KTUXEUSN   KTUXESLT   KTUXESQN   KTUXESIZ
-------- ---------- ---------- ---------- ----------
B7F38D9C          8         35        278    1116399


SQL> select pid, state, undoblocksdone from v$fast_start_servers;

       PID STATE       UNDOBLOCKSDONE
---------- ----------- --------------
        17 RECOVERING            8686
        18 IDLE                     0
        19 IDLE                     0
        20 IDLE                     0



SQL> col event format a50
select sid, event, p1,p2,p3 from v$session_wait;
SQL>

       SID EVENT                                                      P1         P2         P3
---------- -------------------------------------------------- ---------- ---------- ----------
       145 Streams AQ: waiting for time management or cleanup          0          0          0
            tasks

       148 Streams AQ: qmn slave idle wait                             0          0          0
       151 Streams AQ: qmn coordinator idle wait                       0          0          0
       152 db file sequential read                                     2     100220          1
       153 wait for a undo record                                      0          0          0
       154 db file sequential read                                     5      41930          1
       155 wait for a undo record                                      0          0          0
       158 SQL*Net message from client                        1650815232          1          0
       159 SQL*Net message to client                          1650815232          1          0

       SID EVENT                                                      P1         P2         P3
---------- -------------------------------------------------- ---------- ---------- ----------
       160 rdbms ipc message                                         100          0          0
       161 rdbms ipc message                                         300          0          0
       162 rdbms ipc message                                         500          0          0
       163 rdbms ipc message                                      180000          0          0
       164 wait for stopper event to be increased                      0          0          0
       165 rdbms ipc message                                         300          0          0
       166 rdbms ipc message                                         152          0          0
       167 rdbms ipc message                                         300          0          0
       168 rdbms ipc message                                         300          0          0
       169 rdbms ipc message                                         300          0          0
       170 pmon timer                                                300          0          0

20 rows selected.


2.故障解决
解决思路:
wait for a undo record是cpu并行恢复的资源争用undo记录的等待,调整相关参数
fast_start_parallel_rollback 为false,进行串行恢复。
调整参数  fast_start_parallel_rollback 为false,这样回滚的进程就只会有一个,速度有时候反而比并行回滚还快。

SQL> show parameter  fast_start_parallel_rollback

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
fast_start_parallel_rollback         string      LOW
SQL>


alter system set fast_start_parallel_rollback=false;
关闭并行恢复,启用串行恢复

SQL> alter system set fast_start_parallel_rollback=false;

System altered.

SQL> show parameter fast_start_parallel_rollback

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
fast_start_parallel_rollback         string      FALSE

SQL> set linesize 300
SQL> select sid, event, p1,p2,p3 from v$session_wait
SQL> /

       SID EVENT                                                      P1         P2         P3
---------- -------------------------------------------------- ---------- ---------- ----------
       170 pmon timer                                                 72          0          0
       169 rdbms ipc message                                         198          0          0
       160 rdbms ipc message                                         300          0          0
       161 rdbms ipc message                                          76          0          0
       167 rdbms ipc message                                         239          0          0
       163 rdbms ipc message                                        1152          0          0
       165 rdbms ipc message                                         258          0          0
       166 rdbms ipc message                                         241          0          0
       168 rdbms ipc message                                         207          0          0
       162 rdbms ipc message                                         300          0          0
       159 db file sequential read                                     1       5226          1

       SID EVENT                                                      P1         P2         P3
---------- -------------------------------------------------- ---------- ---------- ----------
       164 db file sequential read                                     5      35991          1
       156 SQL*Net message to client                          1650815232          1          0

13 rows selected.


SQL> select pid, state, undoblocksdone from v$fast_start_servers;

no rows selected

问题解决
-----------
结论:
1.大事务运行过程中被异常终止是一件比较严重的事情,死事务的回滚可能会占严重消耗的系统资源。
2.cpu争用回滚段造成等待事件:wait for a undo record
3. fast_start_parallel_rollback参数
false  -- 即是关闭parallel rollback功能
low    -- 也是10g的默认值,该值含有是最大的rollback进程为2*cpu_count个
high   -- 当设置为该值时,最大的rollback进程为4*cpu_count个
     参数其最大值是要受参数parallel_max_servers的限制的,如果是rac环境,那么还跟参数parallel_threads_per_cpu有关系,这里需要说明一点的是,该参数跟recovery_parallelism不同的,recovery_parallelism参数是指在进行instance crash recovery时的并行恢复进程个数。

    关于fast_start_parallel_rollback相关学习,请参考roger大师的blog,具体链接如下:

http://www.killdb.com/2011/10/07/%E5%85%B3%E4%BA%8Eparallel-rollback%E7%9A%84%E4%B8%80%E7%82%B9%E6%80%BB%E7%BB%93.html

=========================================================================
-- The End --