Spark Problem 10: Spark job fails at runtime because a node runs out of disk space


More code: https://github.com/xubo245/SparkLearning

Spark ecosystem: Alluxio learning. Versions: alluxio (tachyon) 0.7.1, spark-1.5.2, hadoop-2.6.0

1. Problem description

1.1 Summary

A script was used to submit many applications in sequence; after a dozen or so had run, the jobs started failing with the error below.

org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 1.0 failed 4 times, most recent failure: Lost task 8.3 in stage 1.0 (TID 25, Mcnode4): org.apache.spark.SparkException: File ./DSA.jar exists and does not match contents of http://Master:41701/jars/DSA.jar

Checking the history server showed that a node had run out of disk space: http://master:18080/history/app-20170209152626-0632/stages/stage/?id=1&attempt=0
The real cause only shows up after drilling down to the task-level records; the driver-side error in 1.2 does not report it directly.

java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:345)
    at org.spark-project.guava.io.ByteStreams.copy(ByteStreams.java:211)
    at org.spark-project.guava.io.ByteSource.copyTo(ByteSource.java:204)
    at org.spark-project.guava.io.Files.copy(Files.java:436)
    at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$copyRecursive(Utils.scala:514)
    at org.apache.spark.util.Utils$.copyFile(Utils.scala:485)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:362)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)

Checking HDFS and the Mcnode4 node confirmed that the disk really was full: only 2.47 MB was free, while DSA.jar is 2.5 MB. The main cause was the application records accumulating under the worker's work directory.
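As a quick check, free space and the size of the worker's application directories can be inspected from the master; this is only a sketch, and the host name (Mcnode4) and the work-directory path follow this cluster's layout (see the mv commands in 2.2):

# check free space on every filesystem of the affected worker node
ssh Mcnode4 df -h
# measure how much the standalone worker's per-application directories occupy
ssh Mcnode4 'du -sh ~/cloud/spark-1.5.2/work/app-*'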

1.2 Error log

hadoop@Master:~/disk2/xubo/project/alignment/SparkSW/SparkSW20161114/alluxio-1.3.0$ ./cloudSWatmtimequerystandalone.sh > cloudSWatmtimequerystandalonetime201702072344.txt
[Stage 1:>                                                       (0 + 16) / 128]
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 8 in stage 1.0 failed 4 times, most recent failure: Lost task 8.3 in stage 1.0 (TID 25, Mcnode4): org.apache.spark.SparkException: File ./DSA.jar exists and does not match contents of http://Master:41701/jars/DSA.jar
    at org.apache.spark.util.Utils$.copyFile(Utils.scala:464)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:362)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1944)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1007)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:989)
    at org.apache.spark.rdd.RDD$$anonfun$takeOrdered$1.apply(RDD.scala:1370)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
    at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1357)
    at org.apache.spark.rdd.RDD$$anonfun$top$1.apply(RDD.scala:1338)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
    at org.apache.spark.rdd.RDD.top(RDD.scala:1337)
    at org.dsa.core.DSW.align(DSW.scala:39)
    at org.dsa.core.SequenceAlignment$$anonfun$run$1.apply(SequenceAlignment.scala:33)
    at org.dsa.core.SequenceAlignment$$anonfun$run$1.apply(SequenceAlignment.scala:32)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at org.dsa.core.SequenceAlignment.run(SequenceAlignment.scala:32)
    at org.dsa.core.DSW$.main(DSW.scala:138)
    at org.dsa.time.CloudSWATMQueryTime$$anonfun$main$1$$anonfun$apply$mcVI$sp$3$$anonfun$apply$mcVI$sp$4.apply$mcVI$sp(CloudSWATMQueryTime.scala:93)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.dsa.time.CloudSWATMQueryTime$$anonfun$main$1$$anonfun$apply$mcVI$sp$3.apply$mcVI$sp(CloudSWATMQueryTime.scala:92)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.dsa.time.CloudSWATMQueryTime$$anonfun$main$1.apply$mcVI$sp(CloudSWATMQueryTime.scala:85)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.dsa.time.CloudSWATMQueryTime$.main(CloudSWATMQueryTime.scala:13)
    at org.dsa.time.CloudSWATMQueryTime.main(CloudSWATMQueryTime.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: org.apache.spark.SparkException: File ./DSA.jar exists and does not match contents of http://Master:41701/jars/DSA.jar
    at org.apache.spark.util.Utils$.copyFile(Utils.scala:464)
    at org.apache.spark.util.Utils$.fetchFile(Utils.scala:362)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:405)
    at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$5.apply(Executor.scala:397)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:98)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:226)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:39)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:98)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:397)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:193)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

2. Solutions

2.1 Add disk space

As the heading says: add or free up disk space on the node.
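Besides mounting a larger disk, the standalone worker can be pointed at it so that application jars and scratch files land there in the first place. A minimal sketch, assuming the larger disk is mounted at ~/disk2 (as on this cluster) and that conf/spark-env.sh is used on each worker; the worker must be restarted for this to take effect:

# conf/spark-env.sh on each worker node
export SPARK_WORKER_DIR=~/disk2/spark/work   # where application jars, logs and scratch space are placed
export SPARK_LOCAL_DIRS=~/disk2/spark/tmp    # scratch space for shuffle and spill files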

2.2 Delete or move files

Move the application records under the work directory to disk2:

cd ~/disk2/backup
mv time20161102/spark-1.5.2/ .
mv time20161212/spark/work/* spark-1.5.2/work/
mv ~/cloud/spark-1.5.2/work/app-201* spark-1.5.2/work/
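Moving the directories by hand frees space, but they will pile up again as more applications run. A sketch of the standalone worker's built-in periodic cleanup, using the standard spark.worker.cleanup.* properties set through SPARK_WORKER_OPTS in conf/spark-env.sh (it only removes directories of applications that have already stopped; the worker needs a restart):

# conf/spark-env.sh on each worker node
export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true \
 -Dspark.worker.cleanup.interval=1800 \
 -Dspark.worker.cleanup.appDataTtl=604800"
# check every 30 minutes; keep each stopped application's data for at most 7 days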

3. Run record

After the files were moved, the jobs ran normally again.

References

[1] http://spark.apache.org/docs/1.5.2/programming-guide.html
[2] https://github.com/xubo245/SparkLearning