Lost executor on YARN


1. Lost executor on YARN ALS iterations

debasish83 Q:

During the 4th ALS iteration, I am noticing that one of the executors gets disconnected:

14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding SendingConnectionManagerId not found
14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5 disconnected, so removing it
14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost executor 5 on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated
14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 12)

Any idea if this is a bug related to akka on YARN? I am using master.

PS: ALS = alternating least squares.

Xiangrui Meng A:
We know that the container got killed by YARN because it used much more memory than it requested. But we haven't figured out the root cause yet.

debasish83 Q:
I can reproduce it with YARN 1.0 or 1.1, so it looks like an issue related to YARN.
At least for me, what works right now is standalone mode.
Sandy Ryza A:
The fix is to raise spark.yarn.executor.memoryOverhead until this goes away. This setting controls the buffer between the JVM heap size and the amount of memory requested from YARN (JVMs can take up memory beyond their heap size). You should also make sure that, in the YARN NodeManager configuration, yarn.nodemanager.vmem-check-enabled is set to false.

debasish83 Q:
I set spark.yarn.executor.memoryOverhead 1024 in spark-defaults.conf, but I don't see the property under "Spark properties" on the web UI -> Environment tab. Does it need to be set in spark-env.sh?
Sandy Ryza A:
For now the approach is to keep raising spark.yarn.executor.memoryOverhead until the job stops failing. We do have plans to try to scale it automatically based on the amount of memory requested, but it would still only be a heuristic.
debasish83 Q:
If I use 40 executors with 16 GB of memory each, on a large 100M (100 million) x 10M (10 million) matrix, i.e. billions of ratings, what would a typical spark.yarn.executor.memoryOverhead be?
Sandy Ryza A:
I would expect 2 GB to be plenty, let alone 16 GB (unless ALS is using a bunch of off-heap memory?). You mentioned earlier that the property did not show up in the Environment tab; are you sure it took effect? If it had, it would appear there.
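Putting this thread together, here is a minimal PySpark sketch (an illustration of mine, not code from the thread) of setting the overhead in the driver's SparkConf and then checking that it took effect. Note that yarn.nodemanager.vmem-check-enabled is a NodeManager setting and has to be changed in the YARN configuration, not here.

from pyspark import SparkConf, SparkContext

# Sketch of the driver script, assumed to be launched with
# spark-submit --master yarn --deploy-mode client, as in this thread.
# 1024 (MB) is only an illustrative starting point; keep raising it
# until YARN stops killing the executors.
conf = SparkConf().set("spark.yarn.executor.memoryOverhead", "1024")
sc = SparkContext(conf=conf)

# Confirm the property actually took effect; it should also show up under
# "Spark properties" on the web UI -> Environment tab.
print(sc.getConf().get("spark.yarn.executor.memoryOverhead"))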

2. Getting error in Spark: Executor lost

Q:
One master and two slaves, each with 32 GB of RAM, reading a CSV file of about 18 million records (the first row is the column names):

./spark-submit --master yarn --deploy-mode client --executor-memory 10g <path/to/.py file>

rdd = sc.textFile("<path/to/file>")
h = rdd.first()
header_rdd = rdd.filter(lambda l: h in l)
data_rdd = rdd.subtract(header_rdd)
data_rdd.first()

Error message:

15/10/12 13:52:03 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 192.168.1.114:51058
15/10/12 13:52:03 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster has disassociated: 192.168.1.114:51058
15/10/12 13:52:03 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkYarnAM@192.168.1.114:51058] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/10/12 13:52:03 ERROR cluster.YarnScheduler: Lost executor 1 on hslave2: remote Rpc client disassociated
15/10/12 13:52:03 INFO scheduler.TaskSetManager: Re-queueing tasks for 1 from TaskSet 3.0
15/10/12 13:52:03 WARN remote.ReliableDeliverySupervisor: Association with remote system [akka.tcp://sparkExecutor@hslave2:58555] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
15/10/12 13:52:03 WARN scheduler.TaskSetManager: Lost task 6.6 in stage 3.0 (TID 208, hslave2): ExecutorLostFailure (executor 1 lost)

The error occurred while running rdd.subtract(). I then changed the code, removed rdd.subtract(), and used rdd.filter() instead:

rdd = sc.textFile("<path/to/file>")
h = rdd.first()
data_rdd = rdd.filter(lambda l: h not in l)

and got the same error.

A:
This is not a bug in Spark; it is most likely related to the configuration of your Java, YARN, and Spark config files.
You can increase the Java memory, raise the Akka frame size and timeout settings, and so on.

spark-defaults.conf:

spark.master                           yarn-cluster
spark.yarn.historyServer.address       <your cluster url>
spark.eventLog.enabled                 true
spark.eventLog.dir                     hdfs://<your history directory>
spark.driver.extraJavaOptions          -Xmx20480m -XX:MaxPermSize=2048m -XX:ReservedCodeCacheSize=2048m
spark.checkpointDir                    hdfs://<your checkpoint directory>
yarn.log-aggregation-enable            true
spark.shuffle.service.enabled          true
spark.shuffle.service.port             7337
spark.shuffle.consolidateFiles         true
spark.sql.parquet.binaryAsString       true
spark.speculation                      false
spark.yarn.maxAppAttempts              1
spark.akka.askTimeout                  1000
spark.akka.timeout                     1000
spark.akka.frameSize                   1000
spark.rdd.compress                     true
spark.storage.memoryFraction           1
spark.core.connection.ack.wait.timeout 600
spark.driver.maxResultSize             0
spark.task.maxFailures                 20
spark.shuffle.io.maxRetries            20

You may also want to set how many partitions your Spark program uses, and perhaps add some partitionBy(partitioner) calls to your RDDs, so your code might look like this:

# Note: partitionBy()/HashPartitioner applies to key-value RDDs in PySpark, so for a
# plain text RDD the number of partitions is set via minPartitions (or repartition()).
rdd = sc.textFile("<path/to/file>", minPartitions=<your number of partitions>)
h = rdd.first()
header_rdd = rdd.filter(lambda l: h in l)
data_rdd = rdd.subtract(header_rdd)
data_rdd.first()

Finally, you may want to adjust your spark-submit command by adding parameters for the number of executors, executor memory, and driver memory:

./spark-submit --master yarn --deploy-mode client --num-executors 100 --driver-memory 20G --executor-memory 10g <path/to/.py file>

3. Tuning Spark job runtime parameters

Many Spark job problems ultimately come from running out of system resources; monitoring and the logs usually point to excessive memory usage as the cause, so it is worth trying to solve them through the configuration parameters below (a combined example follows the list).
1.

--conf spark.akka.frameSize=100

This parameter controls the maximum size of messages exchanged within Spark (such as task results). The default is 10 MB; when processing large data, a task's output may exceed this limit, so set a higher value based on your actual data.
2.

--conf spark.shuffle.manager=SORT

By default Spark uses hash-based shuffle. In hash mode, every shuffle creates M*R files (M: number of map tasks, R: number of reduce tasks); for example, 1,000 map tasks and 500 reduce tasks produce 500,000 shuffle files. When M and R are large this means an enormous number of files and, with them, substantial memory overhead. To reduce resource usage, you can switch to sort mode, which produces only M files (1,000 in the example above), at the cost of longer running time.
3.

--conf spark.yarn.executor.memoryOverhead=4096

This sets the executor's off-heap memory allowance; if your program uses a lot of off-heap memory, increase this value. The overhead is added to the executor memory when sizing the YARN container: with --executor-memory 16g and an overhead of 4096 MB, for example, each container is requested at roughly 20 GB.
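For convenience, here is a hedged PySpark sketch (an illustration, not code from the original post) that combines the three settings above in one SparkConf, using the same example values; they can equally be passed as --conf flags to spark-submit as shown earlier.

from pyspark import SparkConf, SparkContext

# Illustrative values taken from the three tips above; tune them for your own job.
conf = (SparkConf()
        .set("spark.akka.frameSize", "100")                   # max message size in MB (default 10)
        .set("spark.shuffle.manager", "SORT")                  # sort-based shuffle: M files instead of M*R
        .set("spark.yarn.executor.memoryOverhead", "4096"))    # off-heap headroom per executor, in MB
sc = SparkContext(conf=conf)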
