Outage on February 7, 2013: the database was to blame


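Starting on February 6, branch outlets widely reported that they could not log in to the web application, and that transactions were slow for those who did manage to log in. The Apache log showed the following entries: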

[Thu Feb 07 07:20:57 2013] [error] PROTOCOL_ERROR [line 842 of ../nsapi/URL.cpp]: Backend Server not responding
[Thu Feb 07 07:22:40 2013] [error] PROTOCOL_ERROR [line 842 of ../nsapi/URL.cpp]: Backend Server not responding
[Thu Feb 07 07:22:49 2013] [error] PROTOCOL_ERROR [line 842 of ../nsapi/URL.cpp]: Backend Server not responding
[Thu Feb 07 07:22:50 2013] [error] PROTOCOL_ERROR [line 842 of ../nsapi/URL.cpp]: Backend Server not responding
[Thu Feb 07 07:22:51 2013] [error] PROTOCOL_ERROR [line 842 of ../nsapi/URL.cpp]: Backend Server not responding
[Thu Feb 07 07:22:54 2013] [error] PROTOCOL_ERROR [line 842 of ../nsapi/URL.cpp]: Backend Server not responding
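
To gauge how often the error occurs, the entries can be tallied per minute. The following is a minimal Python sketch and was not part of the original troubleshooting. The function name errors_per_minute and the regular expression are assumptions based on the entry format shown above. Weekday and month names are parsed assuming an English (C) locale.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed entry format, taken from the excerpt above:
# [Thu Feb 07 07:20:57 2013] [error] PROTOCOL_ERROR [line ...]: message
ENTRY = re.compile(
    r"\[(\w{3} \w{3} \d{2} \d{2}:\d{2}:\d{2} \d{4})\] \[error\] (\w+)"
)


def errors_per_minute(log_text, error_type="PROTOCOL_ERROR"):
    """Count error entries of the given type, grouped by minute."""
    counts = Counter()
    # findall scans the whole text, so it also copes with entries
    # that were fused onto a single line.
    for stamp, kind in ENTRY.findall(log_text):
        if kind != error_type:
            continue
        when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        counts[when.strftime("%Y-%m-%d %H:%M")] += 1
    return counts
```

Calling errors_per_minute on the contents of the Apache log returns a Counter keyed by minute, which makes bursts of failures easy to spot.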

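These entries give no detail. To see the specific errors, we consulted the parameters of the Apache plug-in for WebLogic and, to enable verbose logging, changed the WebLogic section of httpd.conf as follows: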

<IfModule mod_weblogic.c>
  MatchExpression /bmfx/* WebLogicHost=10.154.2.80|WebLogicPort=8001
  MatchExpression /itplat* WebLogicCluster=10.154.2.69:8001,10.154.2.85:8001
  MatchExpression /shp/* WebLogicHost=10.154.2.80|WebLogicPort=8001
  MatchExpression /chexian/* WebLogicHost=10.154.2.80|WebLogicPort=8001
  MatchExpression /qcpiao/* WebLogicHost=10.154.2.80|WebLogicPort=8001
  MatchExpression /hcp/* WebLogicHost=10.154.2.80|WebLogicPort=8001
  MatchExpression /ltka/* WebLogicHost=10.154.2.80|WebLogicPort=8001
  MatchExpression /piaowu/* WebLogicHost=10.154.2.80|WebLogicPort=8001
  Debug ON
  WLLogFile /tmp/wl_proxy.log
  DebugConfigInfo On
</IfModule>

Debug ON and the related parameters turn on verbose logging and direct it to /tmp/wl_proxy.log.

That log showed the following errors:

Thu Feb  7 10:02:19 2013 <4811360202239410> *******Exception type [READ_TIMEOUT] (no read after 300 seconds) raised at line 205 of ../nsapi/Reader.cpp
Thu Feb  7 10:02:19 2013 <4811360202239410> caught exception in readStatus: READ_TIMEOUT [os error=0,  line 205 of ../nsapi/Reader.cpp]: no read after 300 seconds at line 822
Thu Feb  7 10:02:19 2013 <4811360202239410> PROTOCOL_ERROR: Backend Server not responding - isRecycled:1
Thu Feb  7 10:02:19 2013 <4811360202239410> *******Exception type [PROTOCOL_ERROR] (Backend Server not responding) raised at line 842 of ../nsapi/URL.cpp
Thu Feb  7 10:02:19 2013 <4811360202239410> sendRequest: exception caught while parsingHeaders w/ recycled connection to 10.154.2.85:8001, numfailures=1
Thu Feb  7 10:02:19 2013 <4811360202239410> Marking 10.154.2.85:8001 as bad
Thu Feb  7 10:02:19 2013 <4811360202239410> got exception in sendRequest phase: PROTOCOL_ERROR [line 842 of ../nsapi/URL.cpp]: Backend Server not responding at line 2994
Thu Feb  7 10:02:19 2013 <4811360202239410> Failing over after sendRequest() exception: PROTOCOL_ERROR as Idempotent is set to ON
Thu Feb  7 10:02:19 2013 <4811360202239410> attempt #1 out of a max of 5
Thu Feb  7 10:02:19 2013 <4811360202239410> general list: trying connect to '10.154.2.69'/8001/0 at line 2619 for '/itplat/essePage.action?menuid=2111&menuCode=2111'
Thu Feb  7 10:02:19 2013 <4811360202239410> INFO: New NON-SSL URL
Thu Feb  7 10:02:19 2013 <4811360202239410> Connect returns -1, and error no set to 115, msg 'Operation now in progress'
Thu Feb  7 10:02:19 2013 <4811360202239410> EINPROGRESS in connect() - selecting
Thu Feb  7 10:02:19 2013 <4811360202239410> Local Port of the socket is 59856
Thu Feb  7 10:02:19 2013 <4811360202239410> Remote Host 10.154.2.69 Remote Port 8001
Thu Feb  7 10:02:19 2013 <4811360202239410> general list: created a new connection to '10.154.2.69'/8001 for '/itplat/essePage.action?menuid=2111&menuCode=2111', Local port:59856

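In short, the log says that while Apache was load balancing across the WebLogic servers, a request sent to host 85 (10.154.2.85) received no response within 300 seconds and timed out. Apache then failed the request over to the WebLogic server on host 69 (10.154.2.69), opening a new connection, and logged the error

Backend Server not responding

So because host 85 was slow to respond, a large number of requests had to be re-sent to host 69, which produced the errors seen in the Apache log. But was host 85 really the cause?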
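
The reading of wl_proxy.log described above (a read timeout, the backend being marked bad, then a new connection to the other cluster member) can also be summarized mechanically. The following is a minimal Python sketch under stated assumptions: the function name summarize_proxy_log is hypothetical, and the three patterns are derived only from the log lines quoted in this post, not from any documented log format.

```python
import re
from collections import Counter

# Patterns assumed from the wl_proxy.log excerpt in this post.
READ_TIMEOUT = re.compile(r"Exception type \[READ_TIMEOUT\]")
MARKED_BAD = re.compile(r"Marking (\d+\.\d+\.\d+\.\d+:\d+) as bad")
NEW_CONN = re.compile(r"created a new connection to '([\d.]+)'/(\d+)")


def summarize_proxy_log(log_text):
    """Summarize timeouts, bad backends, and new connections in a proxy log."""
    return {
        # Number of read timeouts raised by the plug-in.
        "read_timeouts": len(READ_TIMEOUT.findall(log_text)),
        # How often each host:port was marked as bad.
        "marked_bad": Counter(MARKED_BAD.findall(log_text)),
        # Which host:port pairs received newly created connections.
        "new_connections": Counter(
            f"{host}:{port}" for host, port in NEW_CONN.findall(log_text)
        ),
    }
```

If one backend dominates "marked_bad" while the other dominates "new_connections", that matches the failover pattern described here. It does not by itself say whether the slow backend or something behind it is at fault.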

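Checking the application servers and the Oracle RAC cluster behind them, we found that one node in the RAC cluster had abnormal ping times. Our hypothesis was that Oracle was slow to respond, which made the application time out, which in turn made the web server slow. Because the database runs as an Oracle RAC service, we can stop one of its instances at any time without taking down the database service as a whole.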

Check the status of the instances in the RAC cluster:

[oracle@sdbmzdb1 ~]$ srvctl status database -d essedb

Instance essedb1 is running on node sdbmzdb1

Instance essedb2 is running on node sdbmzdb2

Both instance 1 and instance 2 are running normally.

Then stop instance 1:

[oracle@sdbmzdb1 ~]$ srvctl stop instance -d essedb -i essedb1

Check the database status again:

[oracle@sdbmzdb1 ~]$ srvctl status database -d essedb

Instance essedb1 is not running on node sdbmzdb1

Instance essedb2 is running on node sdbmzdb2

Instance essedb1 on node sdbmzdb1 is now stopped.
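
For repeated checks, the srvctl status output can be turned into a small data structure instead of being read by eye. The following is a minimal Python sketch: the function name parse_srvctl_status is hypothetical, and the pattern assumes one "Instance ... is [not] running on node ..." line per instance, as in the output shown in this post. It only parses text that has already been captured and does not invoke srvctl itself.

```python
import re

# Assumed line format, taken from the srvctl output in this post:
#   Instance essedb1 is not running on node sdbmzdb1
STATUS = re.compile(r"Instance (\S+) is (not )?running on node (\S+)")


def parse_srvctl_status(output):
    """Map each instance name to its node and running state."""
    return {
        instance: {"node": node, "running": not_flag == ""}
        for instance, not_flag, node in STATUS.findall(output)
    }
```

The result maps each instance name to its node and a boolean running flag, so a script can confirm that at least one instance is still up before or after stopping another.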

When we checked the Apache application again, the error messages had disappeared. With that, the outage was resolved.
