oracle ORA-32701 hang分析(一)

来源:互联网 发布:淘宝加购什么意思 编辑:程序博客网 时间:2024/05/16 17:18

今日对某局方的数据库进行巡检,发现alert.log日志里面有大量的ORA-32701: Possible hangs up to hang ID=57 detected报错,完整的日志报错如下:
Sun Dec 13 01:08:12 2015
Errors in file /oracle/app/oracle/diag/rdbms/huibuy/huibuy1/trace/huibuy1_dia0_80848.trc (incident=200081):
ORA-32701: Possible hangs up to hang ID=57 detected
Incident details in: /oracle/app/oracle/diag/rdbms/huibuy/huibuy1/incident/incdir_200081/huibuy1_dia0_80848_i200081.trc
DIA0 requesting termination of session sid:5217 with serial # 33617 (ospid:41409) on instance 2
due to a GLOBAL, HIGH confidence hang with ID=57.
Hang Resolution Reason: Although the number of affected sessions did not
justify automatic hang resolution initially, this previously ignored
hang was automatically resolved.
DIA0: Examine the alert log on instance 2 for session termination status of hang with ID=57.

进一步跟踪日志:
/oracle/app/oracle/diag/rdbms/huibuy/huibuy1/trace/huibuy1_dia0_80848.trc,
发现:
Incident details in: /oracle/app/oracle/diag/rdbms/huibuy/huibuy1/incident/incdir_200081/huibuy1_dia0_80848_i200081.trc
inst# SessId Ser# OSPID PrcNm Event


  1   6610  9703     89862  M000 enq: WF - contention  2   5217 33617     41409  M000 not in wait

M000进程出现等待,根据运维人员的反馈,出现hang的时间很短,业务侧反馈,数据库主机只能够承受1000左右的连接,256G内存,64CORE的X86性能怎么会如此低下?还会导致库经常出现hang?感觉非常不科学。并且1000左右的连接就消耗完了整个主机内存。基本定位出,用户连接不合理导致消耗大量的内存,M000等待内存分配从而出现hang。

附上完整的数据库hang分析步骤:

  1. 收集systemstate dump:
Oracle$ sqlplus -prelim / as sysdba SQL>oradebug setmypid SQL>oradebug unlimit SQL>oradebug –g all dump systemstate 266 Wait for 30 seconds SQL>oradebug -g all dump systemstate 266 Wait for 30 seconds SQL>oradebug -g all dump systemstate 266 SQL>oradebug tracefile_name 显示trace file name 
  1. hanganalyze:
SQL>sqlplus -prelim / as sysdba SQL>oradebug setmypid SQL>oradebug unlimit; SQL>oradebug –g all hanganalyze 3 Wait for 30 seconds SQL>oradebug –g all hanganalyze 3 Wait for 30 seconds SQL>oradebug –g all hanganalyze 3 SQL>oradebug tracefile_name 显示trace file name

这个hang只是一个表象,真实的原因还是内存的分配,结合awr报告,可以看到两个节点之间有大量的gc,还有当前x86主机的Pagetables占用了将近25GB。所以一个整体的优化思路是:
一、优化gc
二、优化cache buffer chain和lantch
三、优化内存,采用Hugepages。

具体优化手段见后续的(二)。

2 0
原创粉丝点击