Checkpointing Fault Tolerance for Spark Streaming on YARN and Standalone


Spark On Yarn:

When deploying Spark Streaming in Spark-on-YARN mode, we need to create the StreamingContext instance with StreamingContext.getOrCreate and point it at a checkpoint directory of our own, which is where the checkpoint data is stored.
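
For example, a minimal sketch of this pattern (the object name, checkpoint path and batch interval below are placeholder assumptions, not values from any particular job):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object CheckpointedStreamingApp {
    // Hypothetical checkpoint directory -- replace with your own HDFS path.
    val checkpointDir = "hdfs:///user/spark/streaming-checkpoint"

    // Builds a brand-new StreamingContext; only invoked when no checkpoint data exists yet.
    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("CheckpointedStreamingApp")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)   // checkpoint data is written here
      // ... define DStream sources and transformations here ...
      ssc
    }

    def main(args: Array[String]): Unit = {
      // On a restart, the context and its DStream lineage are recovered from checkpointDir;
      // createContext is only called when the directory holds no usable checkpoint.
      val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
      ssc.start()
      ssc.awaitTermination()
    }
  }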

Fault-tolerance case 1:

After we successfully submit an application with spark-submit, jps shows the CoarseGrainedExecutorBackend process. If we kill that process directly, a brand-new CoarseGrainedExecutorBackend starts up again immediately. This is the automatic-restart fault tolerance provided by YARN: the restarted CoarseGrainedExecutorBackend reads the data in the checkpoint and continues the computation.


Fault-tolerance case 2:

If instead we kill the job directly with yarn application -kill <appId>, then on the next launch you will find that the application no longer starts; it throws an exception straight away:

Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:85)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:62)

The exception message is already quite clear: "Yarn application has already ended! It might have been killed or unable to launch application master." To pin down the cause, we have to look at the source code, specifically the waitForApplication method of YarnClientSchedulerBackend:

  /**
   * Report the state of the application until it is running.
   * If the application has finished, failed or been killed in the process, throw an exception.
   * This assumes both `client` and `appId` have already been set.
   */
  private def waitForApplication(): Unit = {
    assert(client != null && appId.isDefined, "Application has not been submitted yet!")
    val (state, _) = client.monitorApplication(appId.get, returnOnRunning = true) // blocking
    if (state == YarnApplicationState.FINISHED ||  // if the application has finished, failed or been killed, throw immediately
        state == YarnApplicationState.FAILED ||
        state == YarnApplicationState.KILLED) {
      throw new SparkException("Yarn application has already ended! " +
        "It might have been killed or unable to launch application master.")
    }
    if (state == YarnApplicationState.RUNNING) {
      logInfo(s"Application ${appId.get} has started running.")
    }
  }
From the code we can infer the Spark designers' intent: if one of the application's processes dies abnormally, YARN takes care of restarting it automatically, exactly as in fault-tolerance case 1 above. But if the application was killed by a person, finished, or failed, it is assumed that either the program has completed and does not need to run again, or the program itself is broken and should not be rerun.


So in on-YARN mode, this fault tolerance is handled by YARN itself.


Extension: since YARN will not let us resubmit a killed job and recover it from the checkpoint, if our Streaming application consumes Kafka we have to save the Kafka consumer offsets ourselves, so that on the next launch we can resume consuming from where we left off. A sketch of this idea follows below.
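
A minimal sketch of manual offset management, assuming the spark-streaming-kafka-0-10 direct API; saveOffsets/readOffsets, the broker address, group id and topic partitions are all hypothetical placeholders backed by whatever external store you choose (ZooKeeper, HBase, a database, ...):

  import org.apache.kafka.common.TopicPartition
  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.streaming.StreamingContext
  import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign
  import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
  import org.apache.spark.streaming.kafka010.{HasOffsetRanges, KafkaUtils, OffsetRange}

  object ManualOffsetSketch {
    // Hypothetical helpers: persist / load offsets in an external store of your choice.
    def saveOffsets(ranges: Array[OffsetRange]): Unit = ???
    def readOffsets(): Map[TopicPartition, Long] = ???

    def buildKafkaStream(ssc: StreamingContext): Unit = {
      val kafkaParams = Map[String, Object](
        "bootstrap.servers"  -> "broker1:9092",
        "key.deserializer"   -> classOf[StringDeserializer],
        "value.deserializer" -> classOf[StringDeserializer],
        "group.id"           -> "my-streaming-group",
        "enable.auto.commit" -> (false: java.lang.Boolean)  // we manage offsets ourselves
      )

      val fromOffsets = readOffsets()  // resume from the offsets saved by the previous run
      val stream = KafkaUtils.createDirectStream[String, String](
        ssc,
        PreferConsistent,
        Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
      )

      stream.foreachRDD { rdd =>
        val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        // ... process the batch here ...
        saveOffsets(ranges)  // persist offsets only after the batch has been processed successfully
      }
    }
  }

Saving the offsets only after the batch succeeds gives at-least-once semantics; whether that is acceptable, or whether offsets and results must be written in one transaction, depends on the job.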


Spark On Standalone:

If the job is submitted in Spark standalone mode and we kill it directly from the web UI, then on resubmission it reloads the previous checkpoint directory and recovers.

Standalone mode also restarts a CoarseGrainedExecutorBackend automatically after it has been killed.


Conclusion: fault tolerance differs across deployment modes, so the actual behavior has to be judged for each concrete setup.




