ASE12.5.4 Error: 2823接着报current process infected with 11的处理

来源:互联网 发布:红外光谱数据库 编辑:程序博客网 时间:2024/06/11 06:00

        重庆校讯通数据库出现问题,进行数据库日志分析,查看日志格式如下图:


开始关键错误日志一:
03:00000:01217:2011/12/04 06:07:37.05 server  Error: 2823, Severity: 20, State: 2
03:00000:01217:2011/12/04 06:07:37.05 server  Process 1217 tried to remove a PROC_BUF 0x2ab48502a9e0 named '*ss0825993466_0743205685ss*' that it does not hold in the hold procedure linked list.
03:00000:01217:2011/12/04 06:07:37.05 kernel  ************************************
03:00000:01217:2011/12/04 06:07:37.07 kernel  SQL causing error : 
03:00000:01217:2011/12/04 06:07:37.07 kernel  ************************************
03:00000:01217:2011/12/04 06:07:37.07 server  SQL Text: 
03:00000:01217:2011/12/04 06:07:37.07 kernel  curdb = 4 tempdb = 6 pstat = 0x10000
03:00000:01217:2011/12/04 06:07:37.07 kernel  lasterror = 2823 preverror = 0 transtate = 0
03:00000:01217:2011/12/04 06:07:37.07 kernel  curcmd = 193 program = jTDS                          
03:00000:01217:2011/12/04 06:07:37.08 kernel  pc: 0x0x0000000000b26449 pcstkwalk+0x289()
03:00000:01217:2011/12/04 06:07:37.08 kernel  pc: 0x0x0000000000b2607f ucstkgentrace+0x17f()
03:00000:01217:2011/12/04 06:07:37.08 kernel  pc: 0x0x0000000000adc4df ucbacktrace+0x8f()
03:00000:01217:2011/12/04 06:07:37.08 kernel  pc: 0x0x00000000004481ff terminate_process+0xc2f()
03:00000:01217:2011/12/04 06:07:37.08 kernel  pc: 0x0x00000000004a1b20 close_network+0x10()
03:00000:01217:2011/12/04 06:07:37.08 kernel  pc: 0x0x00000000004a1af2 hdl_default+0x42()
03:00000:01217:2011/12/04 06:07:37.08 kernel  pc: 0x0x00000000007bc7ed s_handle+0x8ed()
03:00000:01217:2011/12/04 06:07:37.08 kernel  pc: 0x0x00000000004a1462 ex_raise+0x1a2()
03:00000:01217:2011/12/04 06:07:37.08 kernel  pc: 0x0x000000000056ef05 procrm+0x285()
03:00000:01217:2011/12/04 06:07:37.08 kernel  pc: 0x0x000000000056ac6f proc_cacheplan+0x12f()
03:00000:01217:2011/12/04 06:07:37.08 kernel  pc: 0x0x000000000078c02f s_compile+0x66f()
03:00000:01217:2011/12/04 06:07:37.08 kernel  pc: 0x0x00000000007baef5 s_recompile+0x235()
------
经过Sybase® Adaptive ServerTM
Enterprise Troubleshooting and
Error Messages Guide
Volume 1 - Volume3中的第3卷错误号对应的说明文档翻阅查到,详见官网:
http://infocenter.sybase.com/help/topic/com.sybase.39996_1250/pdf/svrtsg.pdf
嫌慢的查
https://apps.state.or.us/tech/sybase/svrtsg3.pdf  或
http://manuals.sybase.com/onlinebooks/group-asarc/asg1150e/svrtsg/@ebt-link;pt=56;lang=de?target=%25N%15_11019_START_RESTART_N%25
 
2823 20 Process %d tried to remove a PROC_BUF 0x%lx named `%.*s' that it does not hold in the
hold procedure linked list.
Explanation: Adaptive Server encountered an internal error affecting the current
process. Report the error to your System Administrator.
 
Targeted CR List for ASE 12.5.4  ESD#1 (ebf13738)查到针对2823发生的原因,详见官网:
http://www.sybase.com/detail?id=1039924
bug id     describtion
427888Error 2823 "Process x tried to remove a PROC_BUF 0xxxxxx name '*ssxxxxxxxxxxx'.." maybe produced when two sessions execute the same SQL in an ASE where "statement cache" is enabled.

而目前外网装的是
 ASE 12.5.4 ebf13396 比 ASE 12.5.4  ESD#1 (ebf13738)要低,所以 针对这个版本,这个bug id 427888是存在的
当然引起这个bug是需要特定情况触发的,再来看解释是由于两个连接会话在语句缓存中调用了相同的sql语句。
所以满足两个条件:1补丁ebf没有安装,2不同会话在语句缓存中同时调用相同的语句。
由于触发的第2个条件可能应用中也没有很好机制控制住,所以我的建议是外网为了防止这个问题重现,进行针对这个版本的ebf安装
安装版本为最新的ASE 12.5.4

EBF 16800: 12.5.4 ESD #10 - 64 Bit 12.5.420 Nov 2009EBF/Patch
or
EBF 16831: 12.5.4 ESD #9.1 - 64 Bit 12.5.420 Apr 2009EBF/Patch

这两个我本身也存在疑惑哪个是最新的根据esd号,10比9.1大,根据ebf号16831又比16800大,时间上来讲,
16800的又比16831出的迟。我的建议还是先装EBF 16800: 12.5.4 ESD #10 这个。
另附议:如果其他地方也装着ASE 12.5.4 ebf13396 都安排统一打上EBF 16800: 12.5.4 ESD #10
 
接着马上出现以下错误日志:
11:00000:01398:2011/12/04 06:07:37.16 kernel  current process (0xd702db) infected with 11
11:00000:01398:2011/12/04 06:07:37.16 kernel  Address 0x0x00000000004d69c4 (treecpynames+0x14), siginfo (code, address) = (1, 0x0x0000002f00000013)
11:00000:01398:2011/12/04 06:07:37.16 kernel  ************************************
11:00000:01398:2011/12/04 06:07:37.16 kernel  SQL causing error : _numberproce
11:00000:01398:2011/12/04 06:07:37.16 kernel  ************************************
11:00000:01398:2011/12/04 06:07:37.16 server  SQL Text: _numberproce
11:00000:01398:2011/12/04 06:07:37.16 kernel  curdb = 4 tempdb = 6 pstat = 0x10000
11:00000:01398:2011/12/04 06:07:37.16 kernel  lasterror = 0 preverror = 0 transtate = 0
11:00000:01398:2011/12/04 06:07:37.16 kernel  curcmd = 193 program = jTDS                          
11:00000:01398:2011/12/04 06:07:37.16 kernel  pc: 0x0x0000000000b26449 pcstkwalk+0x289()
11:00000:01398:2011/12/04 06:07:37.16 kernel  pc: 0x0x0000000000b2607f ucstkgentrace+0x17f()
11:00000:01398:2011/12/04 06:07:37.16 kernel  pc: 0x0x0000000000adc4df ucbacktrace+0x8f()
11:00000:01398:2011/12/04 06:07:37.16 kernel  pc: 0x0x00000000004481ff terminate_process+0xc2f()
11:00000:01398:2011/12/04 06:07:37.16 kernel  pc: 0x0x0000000000af7335 kisignal+0x195()
11:00000:01398:2011/12/04 06:07:37.16 kernel  pc: 0x0x00000000004d69c4 treecpynames+0x14()
11:00000:01398:2011/12/04 06:07:37.16 kernel  pc: 0x0x00000000004d6b0b tree_copynames+0x6b()
11:00000:01398:2011/12/04 06:07:37.16 kernel  pc: 0x0x0000000000856cc4 copyres+0x2c4()
11:00000:01398:2011/12/04 06:07:37.16 kernel  pc: 0x0x000000000084f4ef com_query+0x5af()
11:00000:01398:2011/12/04 06:07:37.16 kernel  pc: 0x0x000000000084986b compile_cmd+0x6b()
11:00000:01398:2011/12/04 06:07:37.16 kernel  pc: 0x0x000000000078df0a s_code_generate+0xca()
11:00000:01398:2011/12/04 06:07:37.17 kernel  pc: 0x0x000000000078d5bb s_compile_stmt+0x88b()
11:00000:01398:2011/12/04 06:07:37.17 kernel  pc: 0x0x000000000078c3ca s_compile+0xa0a()
11:00000:01398:2011/12/04 06:07:37.17 kernel  pc: 0x0x00000000007baef5 s_recompile+0x235()
------
以下错误日志重复不停出现一直持续(2011/12/04 06:07:37  -- 2011/12/05 14:15:47
结束关键错误日志二:
02:00000:03120:2011/12/05 14:15:47.97 kernel  current process (0x3f870694) infected with 11
02:00000:03120:2011/12/05 14:15:47.97 kernel  Address 0x0x000000000054d0b9 (mem_pageallocate+0xe9), siginfo (code, address) = (1, 0x0x0000003530335040)
02:00000:03120:2011/12/05 14:15:47.98 kernel  ************************************
02:00000:03120:2011/12/05 14:15:47.98 kernel  SQL causing error : ?"
02:00000:03120:2011/12/05 14:15:47.98 kernel  ************************************
02:00000:03120:2011/12/05 14:15:47.98 server  SQL Text: ?"
02:00000:03120:2011/12/05 14:15:47.98 kernel  curdb = 4 tempdb = 2 pstat = 0x10000
02:00000:03120:2011/12/05 14:15:47.98 kernel  lasterror = 0 preverror = 0 transtate = 1
02:00000:03120:2011/12/05 14:15:47.98 kernel  curcmd = 224 program = jTDS                          
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x0000000000b26449 pcstkwalk+0x289()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x0000000000b2607f ucstkgentrace+0x17f()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x0000000000adc4df ucbacktrace+0x8f()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x00000000004481ff terminate_process+0xc2f()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x0000000000af7335 kisignal+0x195()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x000000000054d0b9 mem_pageallocate+0xe9()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x000000000054a349 memalloc+0x1f9()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x000000000054a659 memalloc_zero+0x19()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x000000000081b84e mksclause+0xe()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x0000000000898150 q_bld_sclause+0xd0()
02:00000:03120:2011/12/05 14:15:47.98 kernel  [Handler pc: 0x(nil)  installed by the following function:-]
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x0000000000897ca5 qry_init+0x5a5()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x0000000000857682 decision+0xb2()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x00000000007ae942 optimize_cmd+0x82()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x00000000007ae7b9 s_optimizecmds+0xb9()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x000000000078d687 s_compile_stmt+0x957()
02:00000:03120:2011/12/05 14:15:47.98 kernel  pc: 0x0x000000000078c3ca s_compile+0xa0a()
------
从以上错误日志开始时间以及错误的持续情况,可以确定数据库服务器在2011/12/04 06:07:37 已经出现异常一直持续到12.5号的数据库重启。
2011/12/04 06:07:37  这个时间点到底发生了什么事情呢,综合后面的日志update index statistics eb_student
推断诱发bug以及后续问题产生的原因是由于周六开始的进行重新校讯通库表的维护索引统计量更新的程序直到周日早上6点多钟也没有跑完,
并且在跑的过程中出现了数据库异常,导致后面许多访问的sql语句产生错误。再结合
 
sybase官方文档Targeted CR List for ASE 12.5.4查找相关bug:官方总列表链接http://www.sybase.com/detail?id=1023895
Targeted CR List for ASE 12.5.4:
331477The message "current process infected with 11 in ns__exec_stmt" is printed in the errorlog together with a stacktrace containing the modules 'call_ns_exec', 'ns_execute' and 'ns__exec_stmt' when using internal JDBC with the CallableStatement.setNull() methode passing in a java.sql.Types.INTEGER datatype as second param.
--排除331477 错误日志中没有in ns__exec_stmt
 
349480The message "current process infected with 11" in the module 'oj_syboj_markojonclause' may be reported in the error log when executing a SELECT with invalid cyclic outerjoin.
 
365822Occasionally, the message "current process infected with 11" in the module 'omni_adjust_eops' together with a stack trace which includes the modules 'omni_opt_select' and 'omni_code_gen_estep' may be reported in the error log if there is an invalid subtraction between two DATETIME columns with some subqueries involved
 
385017In rare circumstances, ASE running on HP-UX, the message current process infected with 11 may be reported when dbcc printolog() is executed.
--排除385017,ASE运行在HP-UX系统中
 
388074Under stress conditions, ASE running on HP-UX, the message 'current process infected with 11' together with a stack trace which include the module 'addCharArrays' may be reported in the error log when the client is using SSL.
--排除388074,ASE运行在HP-UX系统中
 
391080The message "current process infected with 11" in the module 'pre_alttab' may be reported in the error log when executing the 'ALTER TABLE' command for a table having an 'auto identity' column and if the command needs a data copy. This may happen if the session level 'quoted_identifier' option is on or if the table name length is greater than 28 characters.
--排除391080通过alter table 命令更改有一个自增列的表导致需要进行数据拷贝
 
420453The message "current process infected with 11" in the module 'bufdlink' together with a stack trace that includes 'bufpindirty' may be reported in the error log when running DBCC CHECKALLOC on a database that has tables and/or indexes that are bound to a cache other than the 'default data cache' and the configuration parameter 'number of large I/O buffers' is too small.
 
 
Targeted CR List for ASE 12.5.4 ESD#1:
357188Under some circumstances, on a heavily-loaded multi-engine system and multiple clients running java (activated through "enable java" configuration parameter), the message "current process infected with 11" or "current process infected with 10", together with a stack trace in function Opinvokevirtual or Opinvokestatic or Opinvokespecial may be reported in the errorlog.
--357188排除,重庆没打开enable java
 
Targeted CR List for ASE 12.5.4 ESD#2:无相关bug
 
Targeted CR List for ASE 12.5.4 ESD#3:
389409The message "current process infected with 11" in the module 'memcpy' togther with a stacktrace which includes the module 'ct_get_data' may be reported when accessing parsed XML documents using a local proxy table mapped to a remote proxy table that has an external directory defined on it.
--排除389409 ,重庆没使用代理表访问XML文档
 
417573The message "current process infected with 11" in the module 'ksmask__complete' together with a stack trace which includes the module 'ksmasknserver' may be reported in the error log. This stack trace may be repeated a few times followed by ASE listener failure with a stack trace accompanied by the messages in error log such as 'kernel nopen: accept, Software caused connection abort', 'kernel nopen: accept, Invalid argument', 'kernel Listener with protocol tcp, host < host-name > , port < port-number > , engine 0 is quitting after a fatal error.'
--排除417573 ,后台日志里没出现包含以下错误信息 'kernel nopen: accept, Software caused connection abort', 'kernel nopen: accept, Invalid argument', 'kernel Listener with protocol tcp, host < host-name > , port < port-number > , engine 0 is quitting after a fatal error.'
 
425693In rare circumstances, on a multi-engine system, the message "current process infected with 11" together with a stack trace in ubffree may be reported in the error log while loading java classes.
 
430553An unexpected 557 error "The cursor ' < cursor name > ' can not be used as it could not be found. It is possible that either it was not declared or it is not available in the current context." may be raised or the message "current process infected with 11" in the module 'execudf' may be reported in the errorlog if a JAVA user defined function that uses the internal JDBC ASE driver is executed a high number of times (more than 30000) by the same connection and at the same time one or more cursors are active on that particular connection.
 
 
440923The message "current process infected with 11 in the module sendtext()" together with a stacktrace which includes the modules 's_execute', 'exec_eop' and 'run' maybe reported in the errorlog when running a SELECT with a UNION ALL clause and ORDERBY selecting text or image columns from a table.
--排除440923,日志不带in the module sendtext()
 
Targeted CR List for ASE 12.5.4 ESD#4:无相关bug
 
Targeted CR List for ASE 12.5.4 ESD#5:
450105The message 'current process infected with 11' in module 'level0_restart' may be reported in the error log when SP_SHOWPLAN is run or the configuration option 'plan text pipe active' is set and concurrent activity on table Syscolumns takes place. Modules 'showplan', 'showKeys' and 'getcolname_level0_fromcache' will also be reported in the stack trace.
--排除450105,没人调用sp_showplan
 
Targeted CR List for ASE 12.5.4 ESD#6:
448978
Under some rare conditions, the message "current process infected with 11" in the module 'qualpage' togetherwith a stacktrace which includes the module 'getpage_with_validation' may be reported in the error log if a query is executed with materialized derived table.
 
Targeted CR List for ASE 12.5.4 ESD#7:无相关bug
 
Targeted CR List for ASE 12.5.4 ESD#8:
443332The message "current process infected with 11" in the module 'uwdisconnect' together with a stack trace which includes the modules 'err_pll_primary_processing' and 'pll_exec' may be reported in the error log when a query being executed in parallel on a multiengine server was aborted by CTRL+C or client application was killed.
 
477067On HPUX 64 bit machines, the message "current process infected with 11" in the module 'pcexchandlerprint' together with a stack trace which includes module 'traceassert' may be reported in the errorlog when the trace flag 1205 is turned ON.
--排除477067,服务器不是 HPUX
 
482129Under some circumstances the message "current process infected with 11" maybe reported in the error log when executing a batch of DML statements and a trigger executed a ROLLBACK TRIGGER WITH RAISERROR.
 
484076Under some circumstances the execution of a stored procedure which uses execute immediate for creating other stored procedures and had to be recompiled maybe be aborted reporting in the errorlog the message "current process infected with 11" in the module 'set_cast' including a stack trace which includes the modules 's_renormalize' and 's_recompile'.
-,484076 bug id吻合,后台日志信息中每次current process infected with 11出现伴随着s_compile以及s_recompile,bug意思是在一些特定环境下,执行存储过程(当然包括系统过程),用来创建其他的过程并且进行重新编译失败。
 
Targeted CR List for ASE 12.5.4 ESD#9:
494337The message "current process infected with 11" together with a stack trace which includes the modules 'cursor_process_error' and 's_handle' maybe be reported in the error log if a client session with a distributed transaction having open cursors had to be aborted.
 

494742Occasionally wrong results or even procedure cache corruption could happen raising messages like "current process infected with 11" with routine names like 'mjutil_sdes_update' and 'memcpy' when a query joining two or more tables is executed in parallel inside a stored procedure, the configuration parameter "enable sort-merge join and JTC" is ON and the query uses one or more of the stored procedure parameters in the WHERE clause.

--排除494742,重新没打开enable sort-merge join and JTC
 
 
Targeted CR List for ASE 12.5.4 ESD #9.1:无相关bug

Targeted CR List for ASE 12.5.4 ESD#10:

491277The message "Current process infected with 11" together with a stacktrace which includes the modules 'mda__check_waiting_procs' and 'mda_populate_monState' may be reported in the error log when a SELECT query is executed against the monState MDA table.

--排除491277,MDA监控关闭,重庆
 
556752The message, "Current process infected with 11", together with a stack trace which include the modules 'srvdes_alloc' and 'omni_connect' may be reported in the errorlog when accessing a proxy table or remote server when ASE has no more free memory. As a workaround, increase the "max memory" configuration parameter.
--排除556752,重庆没启用代理表。
 
总结:以上的bug最吻合的就是484076 ,原因跟错误日志吻合,而且呼应了最早错误日志
03:00000:01217:2011/12/04 06:07:37.05 server  Process 1217 tried to remove a PROC_BUF 0x2ab48502a9e0 named '*ss0825993466_0743205685ss*'  that it does not hold in the hold procedure linked list.
意思进程1217试图清除 一个名叫'*ss0825993466_0743205685ss*'(数据库系统内部的命名串) 过程的过程缓存,这个过程在有效的过程链路列表里面不再有效。(水平有限,翻译的不是很准确,大概这个意思),结合错误发生时机刚好是后台shell ,update statistics更新校讯通库表的索引统计量刚好发生错误而导致统计量更新没有正常结束,被数据库本身强行中断(此时数据库出现了以上异常了),所以我根据错误日志以及产生错误日志发生点数据库高负荷正在做的事情判断,数据库由于update statistics高负荷运行,导致了过程的缓存出现异常,进行导致客户端请求的一系列sql语句需要通过系统过程调用产生查询分析计划的引导语句执行,在调用系统过程出现问题情况下,导致一些sql语句不能正常运行,使得后台不停持续相同类似的报错。
其他没有排除的bug可能存在是引起的问题原因。
 
针对以上情况好的解决方法就是,补丁升级到最新 ASE 12.5.4 ESD#10,数据库服务重新启动。
并且针对update statistics的shell在每次本来计划应该正常跑完毕的时间点,进行监控检查,是否正常跑完,并且数据库后台日志在跑这个shell过程中是否正常,如果跑shell过程中不正常,
为了保证周一致周五的系统的正常,该强行终止shell就终止,并且对数据库服务进行重新启动。
 
最坏的情况就看下面的官方报道了,下面的解说简单来讲就是可能硬件引起了问题,堆栈溢出,
处理方法:
1.当由于是sql发生错误是由于编译过程,触发器,视图引起的,那么删除这些对象重新创建。
2.当由于是sql发生错误不是编译对象引起,那么移除掉这些问题的数据,可以用以下方法:
  a.通过select into 法,把老的表导入的新的表,然后删除老的表,然后把新的表名更新回老的表名。
  b.通过bcp out把有问题的表数据导出,然后创建一张新表,再把导出的数据bcp in进入新的表,
以上的做法可能会丢失有问题的数据,尽量检查能引起硬件问题的硬件日志。
并建议更换出问题的硬件。

Current Process Infected with %d

Error message text

current process (0x%x) infected with %d

This error may be caused by a hardware problem.

Explanation

Adaptive Server reports this error when it detects a UNIX signal specifying an error. The signal values ("%d") displayed in the above message vary by platforms and Adaptive Server versions; the most common are either 10 or 11.

Current process infected with 11

A value of 11 [SIGSEGV] means that the operating system detected a segment violation error.

Sometimes this error occurs in conjunction with stack overflow or data corruption. For more information on stack overflow, refer to Stack Guardword Error.

The message appears in the Adaptive Server error log followed by a stack trace. The SQL causing error or thelasterror displayed in the Server error log may be the underlying cause for this error. But they can also be just the last data Adaptive Server had in its cache space.

In order to identify the lasterror (except in the cases where the lasterror is 0), get the number displayed by Adaptive Server in the lasterror field from the Adaptive Server error log and consult this manual for more information on this error number.

In the following example, the value for lasterror is 614.

00: 94/02/14 11:32:26.02 kernel: current process (0x1fb001d)infected with 1100: 94/02/14 11:32:26.07 kernel: Address 0x808a6ef (closetable+0x2f7), siginfo (code, address) = (2, 0x30)00: 94/02/14 11:32:26.07 kernel: ************************************00: 94/02/14 11:32:26.07 kernel: SQL causingerror : CREATE TRIGGER00: 94/02/14 11:32:26.07 kernel: curdb = 22 pstat = 0x10018 lasterror = 614

Action

In order to correct this error, try to eliminate the lasterror first, as this might be one of the causes for this error (except when lasterror is 0).

Also, try rerunning the command referenced in the SQL causing error to see if the problem reoccurs.

If the process is infected with 11 and the problem can be reproduced, try and correct it as follows:

  1. If the SQL causing error is a compiled object such as a stored procedure, trigger, or view, drop and recreate the object.

  2. If the SQL causing error is ad hoc rather than a compiled object, moving the data may fix the problem. Use one of these options:

    • Select the table data into a new table, drop the old table, and rename the new table to the old table name.

    • Bulk copy the affected table out, drop and re-create the table, and bulk copy back in. This is the most efficient solution for a very large table.

    If moving the data corrects the problem, the data may have been corrupt. Be aware that moving corrupted data can lead to a data loss.

Check your hardware error log as this error can be caused by hardware failure as well.


原创粉丝点击