Problems caused by data skew in ODPS

Source: Internet. Editor: 程序博客网. Time: 2024/05/19 22:26

The first part is reposted from: https://help.aliyun.com/knowledge_detail/43141.html

MaxCompute MapReduce error: FAILED: ODPS-0123144: Fuxi job failed - WorkerRestart

Symptom

When running MapReduce or a UDF, the following error is reported:

FAILED: ODPS-0123144: Fuxi job failed - WorkerRestart errCode:252,errMsg:kInstanceMonitorTimeout, usually caused by bad udf performance.
Exception in thread "main" com.aliyun.odps.OdpsException: ODPS-0123144: Fuxi job failed - WorkerRestart errCode:252,errMsg:kInstanceMonitorTimeout, usually caused by bad udf performance.

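Cause

This error occurs when a Slave node in the cluster times out during computation, so the Master node concludes that the child node has died. The timeout is currently 10 minutes and cannot be configured by users for now. The most common cause is a large loop inside Reduce, for example when the input contains long-tail data or the job computes a Cartesian product. Users should avoid such large loops as far as possible. Long-tail data can be pulled out and processed separately. Alternatively, users can send a heartbeat manually by calling context.progress(); this call has a performance cost, however, so it should not be called too frequently.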
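One way to call context.progress() without paying its cost on every iteration is to throttle it by elapsed time. The following is only a minimal sketch under stated assumptions: ProgressReporter is a hypothetical stand-in for the task context (only progress() is modeled, so the example runs without the ODPS SDK), and the 60-second interval is an assumed value chosen to stay well under the 10-minute timeout.

```java
import java.util.function.LongSupplier;

/** Hypothetical stand-in for the task context; only progress() is modeled. */
interface ProgressReporter {
    void progress();
}

/** Forwards to progress() at most once per interval. */
final class ThrottledHeartbeat {
    private final ProgressReporter reporter;
    private final long intervalMillis;
    private final LongSupplier clock;
    private long lastBeat;

    ThrottledHeartbeat(ProgressReporter reporter, long intervalMillis, LongSupplier clock) {
        this.reporter = reporter;
        this.intervalMillis = intervalMillis;
        this.clock = clock;
        this.lastBeat = clock.getAsLong();
    }

    /** Cheap enough to call on every loop iteration; reports only when the interval has elapsed. */
    void tick() {
        long now = clock.getAsLong();
        if (now - lastBeat >= intervalMillis) {
            reporter.progress();
            lastBeat = now;
        }
    }
}

final class HeartbeatDemo {
    public static void main(String[] args) {
        int[] beats = {0};
        // In a real reducer the lambda would be replaced by context::progress (assumption).
        ThrottledHeartbeat hb =
            new ThrottledHeartbeat(() -> beats[0]++, 60_000L, System::currentTimeMillis);
        for (long i = 0; i < 1_000_000L; i++) {
            // ... expensive per-record work would go here ...
            hb.tick();
        }
        System.out.println("heartbeats sent: " + beats[0]);
    }
}
```

The clock is injected so that the throttling logic can be checked without waiting for real time to pass. A heartbeat only hides the timeout; the large loop itself still needs to be reduced.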

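Case analysis

First, use group by to check how skewed the data is.

For example: select devmac,merchantid,isencrypt,count(1) as num from wi_passer_flow_log where dt = '2016-08-02' group by devmac,merchantid,isencrypt order by num desc limit 50;

The result shows that the 2016-08-02 partition of the passenger-flow statistics log table is severely skewed: in the worst case, a single device produced about 450,000 arrive/authenticate/leave records for one store within one day.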
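The same check can be done inside a program on a sample of records, for instance while debugging a job locally. The sketch below is illustrative only: SkewReport and topKeys are hypothetical names, and the composite key is assumed to be the three grouped columns joined into one string.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class SkewReport {
    /** Counts records per key and returns the n most frequent keys, highest count first. */
    static List<Map.Entry<String, Long>> topKeys(List<String> keys, int n) {
        Map<String, Long> counts = keys.stream()
            .collect(Collectors.groupingBy(k -> k, Collectors.counting()));
        return counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample keys in the assumed form devmac|merchantid|isencrypt.
        List<String> sample = new ArrayList<>();
        for (int i = 0; i < 1000; i++) sample.add("mac-A|shop-1|0");
        for (int i = 0; i < 3; i++) sample.add("mac-B|shop-2|1");
        for (Map.Entry<String, Long> e : topKeys(sample, 2)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

A key whose count is orders of magnitude above the rest is a candidate for separate handling or filtering.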

Solution

Modify the map or reduce program to filter out the dirty data.
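A filter of this kind might look roughly like the sketch below. It assumes that the dirty keys have already been identified with the group by query and are passed in as a set. DirtyKeyFilter, keyOf and the "|" separator are hypothetical, and the ODPS record types are replaced by plain strings so that the example stays self-contained.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

final class DirtyKeyFilter {
    private final Set<String> dirtyKeys;

    DirtyKeyFilter(Set<String> dirtyKeys) {
        this.dirtyKeys = new HashSet<>(dirtyKeys);
    }

    /** Builds the composite key from the three grouped columns (separator is an assumption). */
    static String keyOf(String devmac, String merchantid, String isencrypt) {
        return devmac + "|" + merchantid + "|" + isencrypt;
    }

    /** Returns true when the record should be kept, false when it belongs to a dirty key. */
    boolean accept(String devmac, String merchantid, String isencrypt) {
        return !dirtyKeys.contains(keyOf(devmac, merchantid, isencrypt));
    }

    public static void main(String[] args) {
        // Made-up key standing in for the device/store pair found by the skew query.
        DirtyKeyFilter filter = new DirtyKeyFilter(
            Collections.singleton(keyOf("mac-A", "shop-1", "0")));
        System.out.println(filter.accept("mac-A", "shop-1", "0"));
        System.out.println(filter.accept("mac-B", "shop-2", "1"));
    }
}
```

In a mapper, accept would be checked before emitting a record, so that the skewed key never reaches a single reducer.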
