TaskSetManager conflict causes the SparkContext to shut down unexpectedly


Background

While I was happily writing code, colleagues on the business side reported a flood of alert SMS messages for a Spark Streaming job running in production. The application's web UI showed that the Spark application had already exited. The first step was to restart the production job; only then did we start tracking down the root cause. The code in this article is based on Spark 1.6.1.

Locating the problem

Logging into the production machine and checking the error log, the system kept reporting "Cannot call methods on a stopped SparkContext." The full log is as follows:

[ERROR][JobScheduler][2017-03-08+15:56:00.067][org.apache.spark.streaming.scheduler.JobScheduler]Error running job streaming job 1488959760000 ms.0
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:
org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
org.apache.spark.streaming.StreamingContext$.createNewSparkContext(StreamingContext.scala:874)
org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:81)
com.xxxx.xxxx.MainApp$.createStreamingContext(MainApp.scala:46)
com.xxxx.xxxx.MainApp$$anonfun$15.apply(MainApp.scala:126)
com.xxxx.xxxx.MainApp$$anonfun$15.apply(MainApp.scala:126)
scala.Option.getOrElse(Option.scala:120)
org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:864)
com.xxxx.xxxx.MainApp$.main(MainApp.scala:125)
com.xxxx.xxxx.MainApp.main(MainApp.scala)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:498)
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

At this point the situation is clear: the SparkContext had already been stopped. The next step is to analyze what caused it to stop, starting from the shutdown log. From the SparkContext code we know that once the shutdown finishes, it prints a log line announcing the successful stop:

logInfo("Successfully stopped SparkContext")

Using grep to locate that log line, we find it here:

[INFO][dag-scheduler-event-loop][2017-03-03+22:16:30.841][org.apache.spark.SparkContext]Successfully stopped SparkContext

The log shows that it was the dag-scheduler-event-loop thread that stopped the SparkContext. The log of that thread shows the following:

java.lang.IllegalStateException: more than one active taskSet for stage 4571114: 4571114.2,4571114.1
        at org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:173)
        at org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1052)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:921)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1214)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1637)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

So a single stage had two active TaskSetManagers at the same time. The exception is thrown from TaskSchedulerImpl.submitTasks, whose code is as follows:

  override def submitTasks(taskSet: TaskSet) {
    val tasks = taskSet.tasks
    logInfo("Adding task set " + taskSet.id + " with " + tasks.length + " tasks")
    this.synchronized {
      val manager = createTaskSetManager(taskSet, maxTaskFailures)
      val stage = taskSet.stageId
      val stageTaskSets =
        taskSetsByStageIdAndAttempt.getOrElseUpdate(stage, new HashMap[Int, TaskSetManager])
      stageTaskSets(taskSet.stageAttemptId) = manager
      val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
        ts.taskSet != taskSet && !ts.isZombie
      }
      if (conflictingTaskSet) {
        throw new IllegalStateException(s"more than one active taskSet for stage $stage:" +
          s" ${stageTaskSets.toSeq.map{_._2.taskSet.id}.mkString(",")}")
      }
    .........
  }

This was surprising: how could there be two? Going back through the earlier logs, we find that stage 4571114 had been resubmitted:

[INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Resubmitting ShuffleMapStage 4571114 (map at MainApp.scala:73) because some of its tasks had failed: 0
[INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.547][org.apache.spark.scheduler.DAGScheduler]Submitting ShuffleMapStage 4571114 (MapPartitionsRDD[3719544] at map at MainApp.scala:73), which has no missing parents

Now look at the code that resubmits a stage. The following excerpt is taken from DAGScheduler.handleTaskCompletion:

          case smt: ShuffleMapTask =>
            val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
            updateAccumulators(event)
            val status = event.result.asInstanceOf[MapStatus]
            val execId = status.location.executorId
            logDebug("ShuffleMapTask finished on " + execId)
            if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
              logInfo(s"Ignoring possibly bogus $smt completion from executor $execId")
            } else {
              shuffleStage.addOutputLoc(smt.partitionId, status)
            }
            if (runningStages.contains(shuffleStage) && shuffleStage.pendingPartitions.isEmpty) {
              markStageAsFinished(shuffleStage)
              logInfo("looking for newly runnable stages")
              logInfo("running: " + runningStages)
              logInfo("waiting: " + waitingStages)
              logInfo("failed: " + failedStages)
              // We supply true to increment the epoch number here in case this is a
              // recomputation of the map outputs. In that case, some nodes may have cached
              // locations with holes (from when we detected the error) and will need the
              // epoch incremented to refetch them.
              // TODO: Only increment the epoch number if this is not the first time
              //       we registered these map outputs.
              mapOutputTracker.registerMapOutputs(
                shuffleStage.shuffleDep.shuffleId,
                shuffleStage.outputLocInMapOutputTrackerFormat(),
                changeEpoch = true)
              clearCacheLocs()
              if (!shuffleStage.isAvailable) {
                // Some tasks had failed; let's resubmit this shuffleStage
                // TODO: Lower-level scheduler should also deal with this
                logInfo("Resubmitting " + shuffleStage + " (" + shuffleStage.name +
                  ") because some of its tasks had failed: " +
                  shuffleStage.findMissingPartitions().mkString(", "))
                submitStage(shuffleStage)
              } else {
                // Mark any map-stage jobs waiting on this stage as finished
                if (shuffleStage.mapStageJobs.nonEmpty) {
                  val stats = mapOutputTracker.getStatistics(shuffleStage.shuffleDep)
                  for (job <- shuffleStage.mapStageJobs) {
                    markMapStageJobAsFinished(job, stats)
                  }
                }
              }

We can see that a resubmit is triggered only when shuffleStage.pendingPartitions is empty and, at the same time, shuffleStage.isAvailable is false. Let's look at how these two values are maintained. pendingPartitions holds the partitions whose tasks are still running; a partition is removed from it when its task finishes:

    val stage = stageIdToStage(task.stageId)
    event.reason match {
      case Success =>
        listenerBus.post(SparkListenerTaskEnd(stageId, stage.latestInfo.attemptId, taskType,
          event.reason, event.taskInfo, event.taskMetrics))
        // remove this partition from the set of partitions still being processed
        stage.pendingPartitions -= task.partitionId

isAvailable checks whether the number of partitions whose shuffle output locations have been reported to the driver equals the total number of partitions:

def isAvailable: Boolean = _numAvailableOutputs == numPartitions

This counter is also updated when a ShuffleMapTask finishes. Note, however, that it is only updated if the executor holding the shuffle data is still alive; if the executor that ran the ShuffleMapTask has already died, the shuffle data on its disk can no longer be fetched from it, so the output location is not registered.

          case smt: ShuffleMapTask =>
            val shuffleStage = stage.asInstanceOf[ShuffleMapStage]
            updateAccumulators(event)
            val status = event.result.asInstanceOf[MapStatus]
            val execId = status.location.executorId
            logDebug("ShuffleMapTask finished on " + execId)
            if (failedEpoch.contains(execId) && smt.epoch <= failedEpoch(execId)) {
              logInfo(s"Ignoring possibly bogus $smt completion from executor $execId")
            } else {
              shuffleStage.addOutputLoc(smt.partitionId, status)
            }
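
Putting the two conditions together: a resubmit fires exactly when every task of the current attempt has reported completion (pendingPartitions is empty) yet at least one partition has no registered shuffle output (isAvailable is false), which is what happens when the executor that produced the output dies before its completion event is processed. Below is a minimal, self-contained sketch of that bookkeeping; SimpleShuffleStage, taskCompleted and needsResubmit are hypothetical names used for illustration only, not Spark's actual classes or API.

import scala.collection.mutable

class SimpleShuffleStage(val numPartitions: Int) {
  // partitions whose tasks have not finished yet
  val pendingPartitions = mutable.HashSet[Int]((0 until numPartitions): _*)
  // shuffle output location (executor id) per partition, None if not registered
  private val outputLocs = Array.fill[Option[String]](numPartitions)(None)

  def numAvailableOutputs: Int = outputLocs.count(_.isDefined)
  def isAvailable: Boolean = numAvailableOutputs == numPartitions

  // Mirrors the handling of a finished ShuffleMapTask described above: the partition
  // always leaves pendingPartitions, but its output location is recorded only if the
  // executor that produced it is still considered alive.
  def taskCompleted(partition: Int, execId: String, executorAlive: Boolean): Unit = {
    pendingPartitions -= partition
    if (executorAlive) outputLocs(partition) = Some(execId)
  }

  // The resubmit condition from the article: no task is still running, yet some
  // partition has no usable shuffle output.
  def needsResubmit: Boolean = pendingPartitions.isEmpty && !isAvailable
}

object ResubmitDemo extends App {
  val stage = new SimpleShuffleStage(numPartitions = 1)
  // The only task finishes, but its executor died in the meantime, so the
  // "possibly bogus" completion is ignored and no output location is registered.
  stage.taskCompleted(partition = 0, execId = "4", executorAlive = false)
  println(stage.needsResubmit) // true -> the stage would be resubmitted
}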

This is the only place that can cause the stage to be rescheduled. Searching the logs for that key message confirms it happened:

[INFO][dag-scheduler-event-loop][2017-03-03+22:16:27.427][org.apache.spark.scheduler.DAGScheduler]Ignoring possibly bogus ShuffleMapTask(4571114, 0) completion from executor 4

However, even if the executor that had just finished the ShuffleMapTask died and caused the stage to be resubmitted, that alone should not lead to a TaskSetManager conflict, because by that time taskSet.isZombie should already have been set to true, given that TaskSetManager.handleSuccessfulTask is supposed to run before DAGScheduler.handleTaskCompletion:
val conflictingTaskSet = stageTaskSets.exists { case (_, ts) =>
  ts.taskSet != taskSet && !ts.isZombie
}

TaskSetManager.handleSuccessfulTask:

  def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
    val info = taskInfos(tid)
    val index = info.index
    info.markSuccessful()
    removeRunningTask(tid)
    // This method is called by "TaskSchedulerImpl.handleSuccessfulTask" which holds the
    // "TaskSchedulerImpl" lock until exiting. To avoid the SPARK-7655 issue, we should not
    // "deserialize" the value when holding a lock to avoid blocking other threads. So we call
    // "result.value()" in "TaskResultGetter.enqueueSuccessfulTask" before reaching here.
    // Note: "result.value()" only deserializes the value when it's called at the first time, so
    // here "result.value()" just returns the value and won't block other threads.
    // Ultimately this posts a CompletionEvent to the DAGScheduler's event queue for processing.
    sched.dagScheduler.taskEnded(
      tasks(index), Success, result.value(), result.accumUpdates, info, result.metrics)
    if (!successful(index)) {
      tasksSuccessful += 1
      logInfo("Finished task %s in stage %s (TID %d) in %d ms on %s (%d/%d)".format(
        info.id, taskSet.id, info.taskId, info.duration, info.host, tasksSuccessful, numTasks))
      // Mark successful and stop if all the tasks have succeeded.
      successful(index) = true
      if (tasksSuccessful == numTasks) {
        isZombie = true
      }
    } else {
      logInfo("Ignoring task-finished event for " + info.id + " in stage " + taskSet.id +
        " because task " + index + " has already completed successfully")
    }
    failedExecutors.remove(index)
    maybeFinishTaskSet()
  }

Some readers may already see the problem. To make it clearer, here is a sequence diagram of a successful task execution.


Sequence diagram of a successful task execution


Combining the sequence diagram with the code, we can see that DAGScheduler.handleTaskCompletion ran before isZombie was set to true inside TaskSetManager.handleSuccessfulTask. handleSuccessfulTask runs on a task-result-getter thread, so isZombie was still false when the DAGScheduler triggered the stage resubmission, which ultimately produced the TaskSetManager conflict.
The following log lines show when the resubmit was issued and when handleSuccessfulTask finished. Since setting isZombie to true does not print a timestamp of its own, these logs only show indirectly that the resubmit happened before handleSuccessfulTask completed.

Finish time of handleSuccessfulTask:
[INFO][task-result-getter-2][2017-03-03+22:16:29.999][org.apache.spark.scheduler.TaskSchedulerImpl]Removed TaskSet 4571114.1, whose tasks have all completed, from pool

Time the stage was resubmitted:
[INFO][dag-scheduler-event-loop][2017-03-03+22:16:29.549][org.apache.spark.scheduler.TaskSchedulerImpl]Adding task set 4571114.2 with 1 tasks

The timeline of the events:


Event timeline
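
The root of the problem is that taskEnded only enqueues a CompletionEvent; the DAGScheduler consumes it asynchronously on the dag-scheduler-event-loop thread, while isZombie is set afterwards on the task-result-getter thread. The following is a minimal, self-contained sketch of the same ordering hazard using plain Scala threads; RaceSketch and the string event are illustrative names only, not Spark code.

import java.util.concurrent.LinkedBlockingQueue

object RaceSketch extends App {
  @volatile var isZombie = false                      // set by the "result getter" thread
  val eventQueue = new LinkedBlockingQueue[String]()  // stands in for the DAGScheduler event queue

  // Plays the role of dag-scheduler-event-loop: whether it "resubmits" depends on
  // the value of the flag at the moment the event is processed.
  val eventLoop = new Thread(new Runnable {
    def run(): Unit = {
      val event = eventQueue.take()
      if (!isZombie) println(s"handling $event: isZombie still false -> conflicting resubmit")
      else println(s"handling $event: isZombie already true -> no conflict")
    }
  })
  eventLoop.start()

  // Plays the role of handleSuccessfulTask on the task-result-getter thread:
  // the event is posted first (taskEnded), the flag is flipped afterwards.
  eventQueue.put("CompletionEvent")
  Thread.sleep(100) // give the event loop a chance to run before the flag changes
  isZombie = true
  eventLoop.join()
}

Depending on thread scheduling, the event loop may observe the flag either before or after it flips, which is why the failure only shows up occasionally in production.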

Fixing the problem

The fix itself is simple: modify TaskSetManager.handleSuccessfulTask so that the CompletionEvent is sent only after isZombie has been set to true. That way, by the time the DAGScheduler processes the CompletionEvent and decides whether to resubmit the stage, the current TaskSetManager is already a zombie and the conflict check in submitTasks no longer fires. The fixed code looks like this:

  def handleSuccessfulTask(tid: Long, result: DirectTaskResult[_]): Unit = {
    val info = taskInfos(tid)
    val index = info.index
    info.markSuccessful()
    removeRunningTask(tid)
    // This method is called by "TaskSchedulerImpl.handleSuccessfulTask" which holds the
    // "TaskSchedulerImpl" lock until exiting. To avoid the SPARK-7655 issue, we should not
    // "deserialize" the value when holding a lock to avoid blocking other threads. So we call
    // "result.value()" in "TaskResultGetter.enqueueSuccessfulTask" before reaching here.
    // Note: "result.value()" only deserializes the value when it's called at the first time, so
    // here "result.value()" just returns the value and won't block other threads.
    if (!successful(index)) {
      tasksSuccessful += 1
      logInfo("Finished task %s in stage %s (TID %d) in %d ms on %s (%d/%d)".format(
        info.id, taskSet.id, info.taskId, info.duration, info.host, tasksSuccessful, numTasks))
      // Mark successful and stop if all the tasks have succeeded.
      successful(index) = true
      if (tasksSuccessful == numTasks) {
        isZombie = true
      }
    } else {
      logInfo("Ignoring task-finished event for " + info.id + " in stage " + taskSet.id +
        " because task " + index + " has already completed successfully")
    }
    failedExecutors.remove(index)
    maybeFinishTaskSet()
    // The CompletionEvent is posted only after isZombie has been updated, so the
    // DAGScheduler cannot resubmit the stage while this TaskSetManager still looks active.
    sched.dagScheduler.taskEnded(
      tasks(index), Success, result.value(), result.accumUpdates, info, result.metrics)
  }
