Temporary directory problems when running Spark in Standalone mode

While a Spark job runs it writes a large amount of temporary data to its local directories. The default location is /tmp/spark-*, so on a busy cluster these directories can fill up the / partition.

Versions used in this project:
Hadoop: 2.7.1
Spark: 1.6.0
JDK: 1.8.0

1. Operations requirement
On the production cluster the /tmp/spark-* directories keep filling up the / partition. The ops team asked us to either optimize the application or move these directories to a subdirectory under /home/.

2. Solutions
2.1 Option 1 (not recommended)
Run rm -rf /tmp/spark-* periodically from crontab. The drawback: if a Spark job happens to be running when the cron job fires, its freshly created /tmp/spark-* files get deleted out from under it, the files can no longer be found, and the job fails. A sketch of such a cron entry follows.
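For reference, such an entry would look roughly like the following (the schedule is made up purely for illustration); the race described above is exactly why this approach is not recommended:

# crontab -e; run every day at 02:00 (illustrative schedule only)
0 2 * * * rm -rf /tmp/spark-*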

2.2 Option 2 (recommended: configure it in spark-env.sh, not in spark-defaults.conf)
The relevant setting is spark.local.dir, with the matching environment variable SPARK_LOCAL_DIRS: "storage directories to use on this node for shuffle and RDD data".

The setting can go into conf/spark-defaults.conf or conf/spark-env.sh; below we test each in turn to see which one actually works.
(1) In spark-defaults.conf, point the temporary directories somewhere else by adding one line:
spark.local.dir /diskb/sparktmp,/diskc/sparktmp,/diskd/sparktmp,/diske/sparktmp,/diskf/sparktmp,/diskg/sparktmp
Multiple directories can be listed, separated by commas.
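A small preparatory sketch (assuming the directory layout above and that the standalone workers run as the hadoop user; both are assumptions): create each directory up front and make it writable by the Spark user, otherwise an executor cannot create its temporary subdirectories there and the directory gets ignored.

for d in /diskb /diskc /diskd /diske /diskf /diskg; do
  # create the per-disk directory and hand it to the Spark user (user/group are assumptions)
  mkdir -p "$d/sparktmp" && chown hadoop:hadoop "$d/sparktmp"
done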

(2) Alternatively, add the following to spark-env.sh:
export SPARK_LOCAL_DIRS=/diskb/sparktmp,/diskc/sparktmp,/diskd/sparktmp,/diske/sparktmp,/diskf/sparktmp,/diskg/sparktmp
If both spark-env.sh and spark-defaults.conf are configured, SPARK_LOCAL_DIRS overrides spark.local.dir.
In production we proceeded along these lines.
First attempt: add one line to spark-defaults.conf:
spark.local.dir /home/hadoop/data/sparktmp

Then submit a test job to verify:

bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master spark://10.4.1.1:7077 \
  --total-executor-cores 4 \
  --driver-memory 2g \
  --executor-memory 2g \
  --executor-cores 1 \
  lib/spark-examples*.jar 10

After the job finishes, the executor logs under some workers' work directories contain errors like the following:
16/09/08 15:55:53 INFO util.Utils: Successfully started service 'sparkExecutorActorSystem' on port 50212.
16/09/08 15:55:53 ERROR storage.DiskBlockManager: Failed to create local dir in . Ignoring this directory.
java.io.IOException: Failed to create a temp directory (under ) after 10 attempts!
    at org.apache.spark.util.Utils$.createDirectory(Utils.scala:217)
    at org.apache.spark.storage.DiskBlockManager$$anonfun$createLocalDirs$1.apply(DiskBlockManager.scala:135)
    at org.apache.spark.storage.DiskBlockManager$$anonfun$createLocalDirs$1.apply(DiskBlockManager.scala:133)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
    at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:108)
    at org.apache.spark.storage.DiskBlockManager.createLocalDirs(DiskBlockManager.scala:133)
    at org.apache.spark.storage.DiskBlockManager.<init>(DiskBlockManager.scala:45)
    at org.apache.spark.storage.BlockManager.<init>(BlockManager.scala:76)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:365)
    at org.apache.spark.SparkEnv$.createExecutorEnv(SparkEnv.scala:217)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:186)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:151)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:253)
    at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
16/09/08 15:55:53 ERROR storage.DiskBlockManager: Failed to create any local dir.
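To locate which executors hit this error, a quick sketch (assuming the default standalone layout, where each executor's stderr lives under $SPARK_HOME/work/<app-id>/<executor-id>/):

grep -rl "Failed to create local dir" $SPARK_HOME/work/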

Let's trace the cause through the Spark source.
(1) DiskBlockManager.createLocalDirs
The log messages point straight at this method: a root directory that cannot be created is logged and skipped (None), and only after every configured directory has failed does the executor report "Failed to create any local dir." and give up.
/**
   * Create local directories for storing block data. These directories are
   * located inside configured local directories and won't
   * be deleted on JVM exit when using the external shuffle service.
   */
  private def createLocalDirs(conf: SparkConf): Array[File] = {
    Utils.getConfiguredLocalDirs(conf).flatMap { rootDir =>
      try {
        val localDir = Utils.createDirectory(rootDir, "blockmgr")
        logInfo(s"Created local directory at $localDir")
        Some(localDir)
      } catch {
        case e: IOException =>
          logError(s"Failed to create local dir in $rootDir. Ignoring this directory.", e)
          None
      }
    }
  }

(2) SparkConf.validateSettings

This method tells us that, from Spark 1.0 on, a spark.local.dir set in the configuration is overridden by the cluster manager (via SPARK_LOCAL_DIRS in standalone/Mesos mode and LOCAL_DIRS on YARN), so setting it in spark-defaults.conf is effectively deprecated.
/** Checks for illegal or deprecated config settings. Throws an exception for the former. Not
    * idempotent - may mutate this conf object to convert deprecated settings to supported ones. */
  private[spark] def validateSettings() {
    if (contains("spark.local.dir")) {
      val msg = "In Spark 1.0 and later spark.local.dir will be overridden by the value set by " +
        "the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN)."
      logWarning(msg)
    }

    val executorOptsKey = "spark.executor.extraJavaOptions"
    val executorClasspathKey = "spark.executor.extraClassPath"

    // ... (remaining checks omitted)
  }
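Incidentally, with spark.local.dir set in spark-defaults.conf, the driver output should contain a warning built from the msg string above; something along these lines (timestamp omitted, logger prefix approximate):

WARN spark.SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).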

(3) Utils.getConfiguredLocalDirs
The code below shows that when SPARK_LOCAL_DIRS is not set in spark-env.sh, the standalone path ultimately falls through to conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(","), and creating directories from that value is what produced the error above. Setting SPARK_LOCAL_DIRS directly in spark-env.sh therefore resolves it, so we configured:
export SPARK_LOCAL_DIRS=/home/hadoop/data/sparktmp
/**
   * Return the configured local directories where Spark can write files. This
   * method does not create any directories on its own, it only encapsulates the
   * logic of locating the local directories according to deployment mode.
   */
  def getConfiguredLocalDirs(conf: SparkConf): Array[String] = {
    val shuffleServiceEnabled = conf.getBoolean("spark.shuffle.service.enabled", false)
    if (isRunningInYarnContainer(conf)) {
      // If we are in yarn mode, systems can have different disk layouts so we must set it
      // to what Yarn on this system said was available. Note this assumes that Yarn has
      // created the directories already, and that they are secured so that only the
      // user has access to them.
      getYarnLocalDirs(conf).split(",")
    } else if (conf.getenv("SPARK_EXECUTOR_DIRS") != null) {
      conf.getenv("SPARK_EXECUTOR_DIRS").split(File.pathSeparator)
    } else if (conf.getenv("SPARK_LOCAL_DIRS") != null) {
      conf.getenv("SPARK_LOCAL_DIRS").split(",")
    } else if (conf.getenv("MESOS_DIRECTORY") != null && !shuffleServiceEnabled) {
      // Mesos already creates a directory per Mesos task. Spark should use that directory
      // instead so all temporary files are automatically cleaned up when the Mesos task ends.
      // Note that we don't want this if the shuffle service is enabled because we want to
      // continue to serve shuffle files after the executors that wrote them have already exited.
      Array(conf.getenv("MESOS_DIRECTORY"))
    } else {
      if (conf.getenv("MESOS_DIRECTORY") != null && shuffleServiceEnabled) {
        logInfo("MESOS_DIRECTORY available but not using provided Mesos sandbox because " +
          "spark.shuffle.service.enabled is enabled.")
      }
      // In non-Yarn mode (or for the driver in yarn-client mode), we cannot trust the user
      // configuration to point to a secure directory. So create a subdirectory with restricted
      // permissions under each listed directory.
      conf.get("spark.local.dir", System.getProperty("java.io.tmpdir")).split(",")
    }
  }

Resubmitting the job and watching the driver output, the "Deleting directory" lines now point at the new location, so the change has taken effect:
16/09/08 14:56:19 INFO ui.SparkUI: Stopped Spark web UI at http://10.4.1.1:4040
16/09/08 14:56:19 INFO cluster.SparkDeploySchedulerBackend: Shutting down all executors
16/09/08 14:56:19 INFO cluster.SparkDeploySchedulerBackend: Asking each executor to shut down
16/09/08 14:56:19 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/09/08 14:56:19 INFO storage.MemoryStore: MemoryStore cleared
16/09/08 14:56:19 INFO storage.BlockManager: BlockManager stopped
16/09/08 14:56:19 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
16/09/08 14:56:19 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/09/08 14:56:19 INFO spark.SparkContext: Successfully stopped SparkContext
16/09/08 14:56:19 INFO util.ShutdownHookManager: Shutdown hook called
16/09/08 14:56:19 INFO util.ShutdownHookManager: Deleting directory /home/hadoop/data/sparktmp/spark-a72435b2-71e7-4c07-9d60-b0dd41b71ecc
16/09/08 14:56:19 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/09/08 14:56:19 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/09/08 14:56:19 INFO util.ShutdownHookManager: Deleting directory /home/hadoop/data/sparktmp/spark-a72435b2-71e7-4c07-9d60-b0dd41b71ecc/httpd-7cd8762c-85b6-4e62-8e91-be668830b0a7
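As a worker-side cross-check (a sketch assuming the same standalone layout as above), the "Created local directory at" message logged by createLocalDirs should now mention the new path, and the directory itself shows up while a job is running:

grep -r "Created local directory at" $SPARK_HOME/work/ | tail
ls /home/hadoop/data/sparktmp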

