Spark ERROR: org.apache.spark.shuffle.FetchFailedException, an investigation
Background

We upgraded our Spark 2.2.1 Thrift Server deployment over the last two days, so I have been watching the service closely, paying particular attention to failed jobs. Today I noticed that tasks on one machine had an unusually high failure rate, all reporting FetchFailedException. In the past I would have attributed this to resource contention: an executor killed for running out of memory, so its shuffle blocks could no longer be fetched. Digging in today, I found a different cause.

The investigation went as follows:
1) First, the Spark web UI showed the error:

```
FetchFailed(BlockManagerId(149, hadoop848.bx.com, 11681, None), shuffleId=135, mapId=12, reduceId=154,
  message=org.apache.spark.shuffle.FetchFailedException: Failed to connect to hadoop848.bx.com/10.88.69.188:11681
	at org.apache.spark.storage.ShuffleBlockFetcherIterator.throwFetchFailedException(ShuffleBlockFetcherIterator.scala:513)
	at
```
2) Next, the stderr page for the failing task:

```
17/12/11 11:42:02 ERROR RetryingBlockFetcher: Exception while beginning fetch of 6 outstanding blocks (after 1 retries)
java.io.IOException: Failed to connect to hadoop972.bx.com/10.87.112.82:15733
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:232)
	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
	at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:97)
	at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
	at org.apache.spark.network.shuffle.RetryingBlockFetcher.lambda$initiateRetry$0(RetryingBlockFetcher.java:169)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:745)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: hadoop972.bx.com/10.87.112.82:15733
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:257)
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:291)
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:631)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(Sing
```
The key point in the log is the connection timeout, which persisted across multiple retries.
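The retry and timeout behavior seen in that log is governed by a few standard Spark settings (spark.shuffle.io.maxRetries, spark.shuffle.io.retryWait, spark.network.timeout, with defaults of 3 retries, a 5s wait, and a 120s timeout). A minimal sketch of raising them; the values are illustrative assumptions, not what we actually used, and `my_app.jar` is a placeholder:

```shell
# Illustrative values only: more fetch retries, a longer wait between
# retries, and a larger overall network timeout. "my_app.jar" stands
# in for the real application.
spark-submit \
  --conf spark.shuffle.io.maxRetries=10 \
  --conf spark.shuffle.io.retryWait=15s \
  --conf spark.network.timeout=300s \
  my_app.jar
```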
3) I logged in to hadoop244.bx and ran top to inspect per-process resource usage.
One process, PID 95479, was consistently using over 100% CPU.
4) I then ran jps to see what that process was.
PID 95479 turned out to be a Tez job that had been hogging a large share of CPU for a long time. Other processes on the machine, including the Spark executor in this case, had to wait so long for CPU time that their connections timed out, and the Spark job failed.
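The two inspection steps above can be reproduced from a shell. A generic sketch (the PID 95479 was specific to this incident and will differ on other machines):

```shell
# One-shot snapshot of the heaviest CPU consumers: header plus top 5,
# sorted by %CPU descending.
ps -eo pid,pcpu,comm --sort=-pcpu | head -6

# jps (shipped with the JDK) maps JVM PIDs to their main classes, which
# tells us whether a hot PID is a Spark executor, a Tez container, etc.
# Guarded, since jps is only present where a JDK is installed.
if command -v jps >/dev/null 2>&1; then
  jps -l
fi
```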
Summary: YARN should enforce cgroup-style resource limits so that no single process can monopolize resources for a long period, preventing one misbehaving job from harming others. At the same time, raising the timeout settings makes jobs more robust when the compute or network environment degrades.
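For the cgroup side, YARN's LinuxContainerExecutor can place containers under CPU cgroups. The property names below are real yarn-site.xml settings, but this fragment is only a sketch of the direction, not a configuration we deployed or tested:

```xml
<!-- Sketch: enable cgroup-based CPU isolation for YARN containers. -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler-class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <!-- Hard-cap each container at its allocated vcores instead of
       letting it borrow idle CPU indefinitely. -->
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value>
</property>
```

With strict resource usage enabled, a runaway Tez container like the one above would be throttled to its vcore allocation rather than starving co-located Spark executors.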