An outage caused by improperly using a thread pool to process MQ messages

Source: Internet. Compiled by: 程序博客网. Time: 2024/05/21 23:45

Situation

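The business side reported that the website was extremely slow. The colleague in charge of operations monitoring said the MQ message queue was backed up. The middleware team said memory usage on the application server was very high, GC could not reclaim any memory, the GC threads were taking nearly 100% of the CPU, and almost every other thread was waiting. The database was healthy and under no pressure at all. With no better option, we dumped the threads and the heap and restarted, after which the application was back to normal.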

Analysis

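We had in fact run into this kind of failure before. Analysis of the heap dumped at that time showed that the task queue of the thread pool processing MQ messages held on the order of a million entries and occupied more than 1.3 GB of memory. None of that memory could be reclaimed, because every pending task was still referenced by the queue.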

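The current implementation works like this: the upstream system pushes messages to MQ, and we pull them from MQ and process them. Each message type has one thread that pulls messages from MQ, wraps each message in a task, and submits it to the thread pool for that type. The code can be simplified to: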

<code class="java prettyprint" style="position: relative; overflow-y: hidden; overflow-x: auto;">package net.coderbee.mq.demo;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class MQListener {

    // Fixed pool of 8 worker threads that process the messages
    public ExecutorService executor = Executors.newFixedThreadPool(8);

    // Called by the pulling thread for every message taken from MQ
    public void onMessage(final Object message) {
        executor.execute(new Runnable() {
            @Override
            public void run() {
                // Time-consuming, complex message handling logic
                complicateHandle(message);
            }
        });
    }

    private void complicateHandle(Object message) {
    }
}</code>

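This implementation is the root cause of the failure. The task queue of the pool created by Executors.newFixedThreadPool(8) is effectively unbounded: a LinkedBlockingQueue constructed without a capacity argument accepts up to Integer.MAX_VALUE elements.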

<code class="java prettyprint" style="position: relative; overflow-y: hidden; overflow-x: auto;">public static ExecutorService newFixedThreadPool(int nThreads) {
    return new ThreadPoolExecutor(nThreads, nThreads,
                                  0L, TimeUnit.MILLISECONDS,
                                  new LinkedBlockingQueue&lt;Runnable&gt;());
}</code>

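This time the upstream system had suffered an outage, and once it recovered it pushed messages to MQ at full speed. The MQListener in our system kept pulling messages from MQ and putting them straight into the thread pool. Because the pool processed messages far more slowly than they arrived, its queue kept growing until it had taken up the entire heap. From then on Full GC ran again and again, but no Full GC could reclaim any memory, and the application simply hung.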
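
The growth described above is easy to reproduce on a small scale. The following is a minimal sketch, not the production code: the class `UnboundedQueueDemo` and the method `backlogAfterBurst` are hypothetical names, and a latch stands in for the slow message handling so that the workers stay busy while the burst arrives.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class UnboundedQueueDemo {

    // Submits `messages` tasks that cannot finish yet and returns how many
    // of them are waiting in the pool's internal queue after the burst.
    static int backlogAfterBurst(int nThreads, int messages) {
        // newFixedThreadPool returns a plain ThreadPoolExecutor, so the cast is safe.
        ThreadPoolExecutor executor =
                (ThreadPoolExecutor) Executors.newFixedThreadPool(nThreads);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < messages; i++) {
                executor.execute(() -> {
                    try {
                        release.await(); // stand-in for slow message handling
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Nothing was rejected and nothing blocked the submitter:
            // everything beyond the busy workers is sitting in the queue.
            return executor.getQueue().size();
        } finally {
            release.countDown(); // let the workers drain the queue
            executor.shutdown();
            try {
                executor.awaitTermination(10, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("queued tasks: " + backlogAfterBurst(8, 100_000));
    }
}
```

The first 8 tasks go directly to the 8 worker threads, so the call in `main` reports 99992 queued tasks. With real messages, each queued task also keeps its message object alive, which is how a queue like this can end up holding more than a gigabyte.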

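The earlier failure was also caused by a backlog in the thread pool queue. That time the trigger was that the message handling logic called an external interface whose responses were very slow, which badly slowed down message processing. Switching to an asynchronous call improved things somewhat, but the root problem was never fixed: when the upstream system flooded us with messages yesterday, our system went down again.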

Solution

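My idea is simple. MQ exists to decouple systems and also acts as a buffer, yet the current implementation uses the message-processing thread pool as a second MQ. Messages must not be allowed to enter the pool's task queue without any control, so the pool should use a fixed-capacity blocking queue, and pulling should pause when that queue is full. Replace the thread pool with: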

<code class="java prettyprint" style="position: relative; overflow-y: hidden; overflow-x: auto;">// Fields of MQListener; ArrayBlockingQueue, ExecutorService, ThreadPoolExecutor
// and TimeUnit all come from java.util.concurrent
private int nThreads = 8;
private int MAX_QUEUE_SIZE = 2000;

// Bounded task queue; when it is full, the submitting thread runs the task itself
private ExecutorService executor = new ThreadPoolExecutor(nThreads,
        nThreads, 0L, TimeUnit.MILLISECONDS,
        new ArrayBlockingQueue&lt;Runnable&gt;(MAX_QUEUE_SIZE),
        new ThreadPoolExecutor.CallerRunsPolicy());</code>

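When the pool's queue is full, the caller (that is, the MQListener thread) executes the task itself, which slows down message pulling. When MQListener pulls the next message and finds room in the queue again, it submits the task to the pool for the worker threads to handle and goes back to pulling at its normal rate.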

This both caps the memory the thread pool can use and gives message processing one extra thread whenever the pool cannot keep up.
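
To see the throttling effect in isolation, here is a small sketch under stated assumptions: `CallerRunsDemo`, `run` and `simulateSlowHandling` are hypothetical names, and a 1 ms sleep stands in for the real handling logic. It counts how many tasks ran on pool workers, how many ran on the submitting thread, and the largest queue size observed.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class CallerRunsDemo {

    // Returns {tasks run by pool workers, tasks run by the submitter, largest queue size seen}.
    static int[] run(int nThreads, int queueSize, int messages) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(nThreads, nThreads,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                new ThreadPoolExecutor.CallerRunsPolicy());
        Thread listener = Thread.currentThread(); // plays the role of the MQListener thread
        AtomicInteger byWorkers = new AtomicInteger();
        AtomicInteger byListener = new AtomicInteger();
        int maxQueued = 0;
        for (int i = 0; i < messages; i++) {
            // If the queue is full, execute() runs the task right here,
            // so the loop cannot hand over the next message until it is done.
            executor.execute(() -> {
                if (Thread.currentThread() == listener) {
                    byListener.incrementAndGet();
                } else {
                    byWorkers.incrementAndGet();
                }
                simulateSlowHandling();
            });
            maxQueued = Math.max(maxQueued, executor.getQueue().size());
        }
        executor.shutdown();
        try {
            executor.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] { byWorkers.get(), byListener.get(), maxQueued };
    }

    private static void simulateSlowHandling() {
        try {
            Thread.sleep(1); // assumed cost of handling one message
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        int[] r = run(8, 2000, 20_000);
        System.out.println("workers=" + r[0] + " listener=" + r[1] + " maxQueued=" + r[2]);
    }
}
```

The split between workers and listener varies from run to run, but the observed queue size never exceeds the configured capacity, and every message is processed by one thread or the other.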

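Because the code above lets the caller run tasks, message ordering has to be considered. Handling one order, for example, may involve several steps that correspond to several MQ messages, so you need to decide whether it is acceptable for those steps to be processed out of order: the message for step 3 may be handled by MQListener while the message for step 2 is still waiting in the thread pool's queue.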
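
The reordering can be shown deterministically with a deliberately tiny pool. This is only an illustration, assuming one worker and a queue of capacity 1; `OrderingDemo` and `completionOrder` are hypothetical names, and the integers 1 to 3 stand for the three steps of one order.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class OrderingDemo {

    // Returns the order in which the three steps actually finished.
    static List<Integer> completionOrder() {
        List<Integer> done = Collections.synchronizedList(new ArrayList<>());
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor executor = new ThreadPoolExecutor(1, 1,
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.CallerRunsPolicy());

        // Step 1 occupies the only worker and is slow (it waits for the latch).
        executor.execute(() -> {
            try {
                release.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            done.add(1);
        });
        // Step 2 fills the single queue slot and waits behind step 1.
        executor.execute(() -> done.add(2));
        // Step 3 is rejected by the full queue, so the caller runs it immediately.
        executor.execute(() -> done.add(3));

        release.countDown(); // only now can step 1 finish, then step 2
        executor.shutdown();
        try {
            executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new ArrayList<>(done);
    }

    public static void main(String[] args) {
        System.out.println(completionOrder()); // prints [3, 1, 2]
    }
}
```

Step 3 finishes first because the submitting thread runs it while steps 1 and 2 are still held by the pool. If that is unacceptable for a message type, the handling logic has to tolerate reordering or the messages need to be serialized some other way.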
