Spark Shuffle FetchFailedException Error

Source: Internet  Editor: 程序博客网  Date: 2024/06/07 11:39


org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 0
org.apache.spark.shuffle.FetchFailedException: Failed to connect to ip-xxxxxxxx
org.apache.spark.shuffle.FetchFailedException: Error in opening FileSegmentManagedBuffer{file=/mnt/yarn/nm/usercache/xxxx/appcache/application_1450751731124_8446/blockmgr-8a7b17b8-f4c3-45e7-aea8-8b0a7481be55/08/, offset=12329181, length=2104094}

This error is almost always caused by memory issues on your executors. I can think of a few ways to address these types of problems.

1) You could try to run with more partitions (call repartition on your DataFrame). Memory issues typically arise when one or more partitions contain more data than will fit in memory.

2) I notice that you have not explicitly set spark.yarn.executor.memoryOverhead, so it defaults to max(384, 0.10 * executorMemory) MB, which in your case is about 400MB. That sounds low to me. I would try increasing it to, say, 1GB (note that if you increase memoryOverhead to 1GB, you need to lower --executor-memory to 3GB).

3) Look in the log files on the failing nodes. You want to look for the text “Killing container”. If you see the text “running beyond physical memory limits”, increasing memoryOverhead will - in my experience - solve the problem.
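The arithmetic behind point 2 can be sketched in plain Python. This is only an illustration, not Spark code: the function names are hypothetical, the 4096 MiB starting value is inferred from the "400MB" figure above, and the sketch assumes the default overhead is max(384 MiB, 10% of executor memory) and ignores any rounding that YARN applies to container sizes.

```python
from typing import Optional

# Assumed defaults: 384 MiB floor, 10% of executor memory.
MIN_OVERHEAD_MIB = 384
OVERHEAD_FACTOR = 0.10


def default_overhead_mib(executor_memory_mib: int) -> int:
    """Overhead used when spark.yarn.executor.memoryOverhead is not set."""
    return max(MIN_OVERHEAD_MIB, int(OVERHEAD_FACTOR * executor_memory_mib))


def container_request_mib(executor_memory_mib: int,
                          overhead_mib: Optional[int] = None) -> int:
    """Memory requested per executor: heap plus off-heap overhead."""
    if overhead_mib is None:
        overhead_mib = default_overhead_mib(executor_memory_mib)
    return executor_memory_mib + overhead_mib


# Untuned: 4 GiB heap, overhead left at its default.
untuned = container_request_mib(4096)
# Tuned as suggested above: 3 GiB heap, 1 GiB explicit overhead.
tuned = container_request_mib(3072, overhead_mib=1024)

print("default overhead for 4096 MiB heap:", default_overhead_mib(4096))
print("untuned container request (MiB):", untuned)
print("tuned container request (MiB):", tuned)
```

The point of the comparison is that the tuned setup gives the executor more than twice the off-heap headroom while keeping the total request close to the original container size, which is why the heap is lowered when the overhead is raised.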




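An additional note: if the data is skewed, you can also attack the problem from that angle by using a better partitioning scheme, for example by changing the number of partitions, changing the partitioning method, or defining a custom partitioner.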
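As a rough illustration of why skew matters and how a different keying scheme helps, here is a small pure-Python simulation. It does not use Spark at all: CRC32 stands in for a hash partitioner, the key names and the 90% hot-key ratio are made up, and salt_keys is a hypothetical helper that shows the common "key salting" idea (append a small suffix so one hot key becomes several distinct keys).

```python
import zlib
from collections import Counter
from typing import List


def partition_of(key: str, num_partitions: int) -> int:
    """Deterministic stand-in for a hash partitioner (assumption: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def partition_sizes(keys: List[str], num_partitions: int) -> List[int]:
    """Number of records that land in each partition."""
    counts = Counter(partition_of(k, num_partitions) for k in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


def salt_keys(keys: List[str], num_salts: int) -> List[str]:
    """Append a round-robin salt so a hot key is split into num_salts keys."""
    return [f"{k}#{i % num_salts}" for i, k in enumerate(keys)]


# Made-up skewed data set: one key accounts for 90% of the records.
keys = ["hot"] * 9000 + [f"user{i}" for i in range(1000)]

before = partition_sizes(keys, 8)
after = partition_sizes(salt_keys(keys, 8), 8)

print("largest partition before salting:", max(before))
print("largest partition after salting: ", max(after))
```

In a real job the salt has to be removed again after the first aggregation (aggregate on the salted key, strip the suffix, then aggregate once more), so this technique fits aggregations and joins rather than every workload.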
