Troubleshooting a Spark ERROR: org.apache.spark.shuffle.FetchFailedException


Background

Over the last couple of days I upgraded and rolled out the Spark 2.2.1 Thrift Server, so I have been watching the service closely, with particular attention to failed jobs. Today I noticed that tasks on one machine were failing at an unusually high rate, all with FetchFailedException. In the past I would have attributed this to resource contention: an executor gets killed for running out of memory, and its shuffle blocks can no longer be fetched. A closer look today revealed a different cause.

The investigation proceeded as follows:

1) First, the Spark web UI showed the error:


```
FetchFailed(BlockManagerId(149, hadoop848.bx.com, 11681, None), shuffleId=135, mapId=12, reduceId=154,
  message=org.apache.spark.shuffle.FetchFailedException: Failed to connect to hadoop848.bx.com/10.88.69.188:11681
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
    at
```

2) Next, the executor's stderr page showed the task log:


```
17/12/11 11:42:02 ERROR RetryingBlockFetcher: Exception while beginning fetch of 6 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to hadoop972.bx.com/10.87.112.82:15733
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
    at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: hadoop972.bx.com/10.87.112.82:15733
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
    at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
    at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(Sing
```

The key point in the log is the connection timeout, which persisted even after multiple retries.
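On the Spark side, the retry and timeout knobs involved here can be raised. These are standard Spark 2.x properties (defaults noted in comments); the values below are illustrative, not tuned recommendations:

```properties
# spark-defaults.conf (illustrative values)
spark.network.timeout              600s   # default 120s; the umbrella timeout for network interactions
spark.shuffle.io.maxRetries        6      # default 3; fetch retries before FetchFailedException is thrown
spark.shuffle.io.retryWait         10s    # default 5s; wait between fetch retries
spark.shuffle.io.connectionTimeout 300s   # defaults to spark.network.timeout if unset
```

Raising these only papers over the contention described below, but it buys headroom when a remote node is briefly too busy to accept connections.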

3) Logged in to hadoop244.bx and used top to inspect per-process resource usage.

One process, PID 95479, was consistently using more than 100% CPU.

4) Then ran jps to identify the process.

PID 95479 turned out to be a Tez job that had been monopolizing CPU for a long time. Other processes on the node, including the Spark executor in this case, had to wait so long for CPU time that their network connections timed out, which is what failed the Spark tasks.
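The inspection in steps 3 and 4 can be reproduced non-interactively. A minimal sketch, assuming procps `ps` is available and a JDK (for `jps`) is on the PATH; 95479 is the PID from this incident:

```shell
# Rank processes by CPU usage: a batch-mode equivalent of what top showed.
ps -eo pid,pcpu,etime,comm --sort=-pcpu | head -n 5

# Map a suspicious PID back to a JVM and its main class; `jps -ml` prints the
# full main-class name and arguments, so a Tez container stands out at a glance.
# Guarded because jps only exists where a JDK is installed.
command -v jps >/dev/null && jps -ml | grep 95479 || true
```

`ps -o etime` is worth including: a container that has been pegging a core for hours, not minutes, is a much stronger signal of a runaway job than an instantaneous CPU reading.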

Summary: YARN should enforce cgroup-style resource limits so that no single process can monopolize a node's resources for long periods, preventing one misbehaving job from affecting others. In addition, raising the relevant timeouts makes jobs more robust when compute or network conditions are degraded.
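The cgroup-based CPU isolation suggested above can be sketched in yarn-site.xml. The property names below are the standard Hadoop 2.x ones; this assumes the `container-executor` binary is already installed with the permissions LinuxContainerExecutor requires, and cluster-specific cgroup mount settings are omitted:

```xml
<!-- yarn-site.xml: enforce per-container CPU limits via cgroups (sketch) -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <!-- Hard-cap each container at its vcore allocation instead of letting it
       burst into idle CPU, so one Tez container cannot starve its neighbors. -->
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value>
</property>
```

The trade-off of strict-resource-usage is lower overall utilization: spare CPU stays idle rather than being lent to busy containers, which is exactly the behavior wanted here.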