oracle ORA-32701 hang分析（一）

来源：互联网发布：淘宝加购什么意思编辑：程序博客网时间：2024/05/16 17:18

今日对某局方的数据库进行巡检，发现alert.log日志里面有大量的ORA-32701: Possible hangs up to hang ID=57 detected报错，完整的日志报错如下：
Sun Dec 13 01:08:12 2015
Errors in file /oracle/app/oracle/diag/rdbms/huibuy/huibuy1/trace/huibuy1_dia0_80848.trc (incident=200081):
ORA-32701: Possible hangs up to hang ID=57 detected
Incident details in: /oracle/app/oracle/diag/rdbms/huibuy/huibuy1/incident/incdir_200081/huibuy1_dia0_80848_i200081.trc
DIA0 requesting termination of session sid:5217 with serial # 33617 (ospid:41409) on instance 2
due to a GLOBAL, HIGH confidence hang with ID=57.
Hang Resolution Reason: Although the number of affected sessions did not
justify automatic hang resolution initially, this previously ignored
hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=57.

进一步跟踪日志：
/oracle/app/oracle/diag/rdbms/huibuy/huibuy1/trace/huibuy1_dia0_80848.trc，
发现：
Incident details in: /oracle/app/oracle/diag/rdbms/huibuy/huibuy1/incident/incdir_200081/huibuy1_dia0_80848_i200081.trc
inst# SessId Ser# OSPID PrcNm Event

  1   6610  9703     89862  M000 enq: WF - contention  2   5217 33617     41409  M000 not in wait

M000进程出现等待，根据运维人员的反馈，出现hang的时间很短，业务侧反馈，数据库主机只能够承受1000左右的连接，256G内存，64CORE的X86性能怎么会如此低下？还会导致库经常出现hang？感觉非常不科学。并且1000左右的连接就消耗完了整个主机内存。基本定位出，用户连接不合理导致消耗大量的内存，M000等待内存分配从而出现hang。

附上完整的数据库hang分析步骤：

收集systemstate dump:

Oracle$ sqlplus -prelim / as sysdba SQL>oradebug setmypid SQL>oradebug unlimit SQL>oradebug –g all dump systemstate 266 Wait for 30 seconds SQL>oradebug -g all dump systemstate 266 Wait for 30 seconds SQL>oradebug -g all dump systemstate 266 SQL>oradebug tracefile_name 显示trace file name

hanganalyze:

SQL>sqlplus -prelim / as sysdba SQL>oradebug setmypid SQL>oradebug unlimit; SQL>oradebug –g all hanganalyze 3 Wait for 30 seconds SQL>oradebug –g all hanganalyze 3 Wait for 30 seconds SQL>oradebug –g all hanganalyze 3 SQL>oradebug tracefile_name 显示trace file name

这个hang只是一个表象，真实的原因还是内存的分配，结合awr报告，可以看到两个节点之间有大量的gc，还有当前x86主机的Pagetables占用了将近25GB。所以一个整体的优化思路是：
一、优化gc
二、优化cache buffer chain和lantch
三、优化内存，采用Hugepages。

具体优化手段见后续的（二）。

2 0