"Spark 1.6 + Alluxio 1.2 + OFF_HEAP" 的配置

Source: Internet | Published by: ubuntu 16.04 优麒麟 | Editor: 程序博客网 | Time: 2024/06/06 17:18

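As we know, Spark + Tachyon addresses several problems that come up when using Spark, chiefly data sharing and GC pressure.
However, Tachyon was renamed Alluxio this year, and its URI scheme changed from tachyon to alluxio. Spark 1.6 is still being maintained, yet so far Spark has not switched its earlier Tachyon protocol over to the Alluxio protocol, so Alluxio cannot be combined smoothly with Spark. "Combined" here refers specifically to persisting an RDD with the OFF_HEAP storage level.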

I. How can this problem be solved?

The approach is to develop a custom AlluxioBlockManager for Alluxio that extends ExternalBlockManager. Before writing one myself, I searched online for existing open-source implementations and found two:
1. https://github.com/winse/spark-alluxio-blockstorage
2. https://github.com/chengqiangboy/spark-alluxio-blockstore

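In my tests the first one worked without problems, while the second one kept throwing errors, so the configuration below uses the first implementation.
Note that ALLUXIO_UNDERFS_ADDRESS in Alluxio must not be set to a local directory; otherwise the tests keep failing. Point it at an HDFS cluster instead. I did not investigate why.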
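As a rough illustration of the under-storage note above, the setting in conf/alluxio-env.sh might look like the fragment below. The NameNode host, port, and the /alluxio path are placeholders I made up for the example; substitute the values of your own HDFS cluster.

```shell
# conf/alluxio-env.sh (fragment)
# Assumption: "namenode-host", port 9000 and the /alluxio directory are
# placeholders, not values taken from the original setup.
export ALLUXIO_UNDERFS_ADDRESS="hdfs://namenode-host:9000/alluxio"
```

The only point being made is that the address uses an hdfs:// URI rather than a local directory path.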

II. Configuring Spark 1.6 + Alluxio 1.2

1. Download the AlluxioBlockManager.scala file from
https://github.com/winse/spark-alluxio-blockstorage
then compile and package it as spark-alluxio-blockstore.jar

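2. Place alluxio-core-client-spark-1.2.0-jar-with-dependencies.jar and spark-alluxio-blockstore.jar in the lib directory of every Spark node, and add the paths of both jars to the SPARK_CLASSPATH environment variable in conf/spark-env.sh. Every node in the Spark cluster needs this configuration.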

echo 'export SPARK_CLASSPATH=/usr/spark-1.6.0/lib/alluxio-core-client-spark-1.2.0-jar-with-dependencies.jar:$SPARK_CLASSPATH' >> conf/spark-env.sh
echo 'export SPARK_CLASSPATH=/usr/spark-1.6.0/lib/spark-alluxio-blockstore.jar:$SPARK_CLASSPATH' >> conf/spark-env.sh
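Since every node needs both jars, a quick sanity check can save some debugging. The helper below is only a sketch: missing_classpath_entries is a name I made up, and it simply prints each entry of a colon-separated classpath that does not exist on the local node.

```shell
# Hypothetical helper (not part of Spark or Alluxio):
# print every classpath entry that is missing on this node.
missing_classpath_entries() {
  # $1: colon-separated classpath, e.g. the value of SPARK_CLASSPATH
  old_ifs=$IFS
  IFS=:
  for entry in $1; do
    # Skip empty entries produced by a trailing or doubled colon.
    [ -z "$entry" ] && continue
    # Report the entry only if no such file or directory exists.
    [ -e "$entry" ] || printf '%s\n' "$entry"
  done
  IFS=$old_ifs
}
```

After sourcing conf/spark-env.sh on a node, running missing_classpath_entries "$SPARK_CLASSPATH" should print nothing if both jars are in place.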

3. In conf/spark-defaults.conf, set the spark.externalBlockStore options as follows:

spark.externalBlockStore.blockManager org.apache.spark.storage.AlluxioBlockManager
spark.externalBlockStore.subDirectories 8
spark.externalBlockStore.url alluxio://hdfs-yarn-1:19998
spark.externalBlockStore.baseDir /tmp_spark_alluxio
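To confirm that a node really picked up these settings, the values can be read back from the file. This is a minimal sketch: spark_conf_get is a hypothetical helper, and it assumes the whitespace-separated "key value" layout used above, with values that contain no spaces.

```shell
# Hypothetical helper: print the value of one key from a
# spark-defaults.conf style file (whitespace-separated "key value" lines).
spark_conf_get() {
  # $1: property name, $2: path to the configuration file
  # Lines starting with "#" never match because their first field differs from the key.
  awk -v key="$1" '$1 == key { print $2 }' "$2"
}
```

For example, spark_conf_get spark.externalBlockStore.url conf/spark-defaults.conf should print the alluxio:// URL configured on that node, and an empty result means the key is not set.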

III. Testing

1. Accessing Alluxio as files

val file = sc.textFile("/home/hadoop/sample-1g")
file.saveAsTextFile("alluxio://hdfs-yarn-1:19998/sample-1g")
val alluxioFile = sc.textFile("alluxio://hdfs-yarn-1:19998/sample-1g")
alluxioFile.count()

2. Calling persist on an RDD and caching the data with OFF_HEAP
Run the following test code in spark-shell:

val file = sc.textFile("/home/hadoop/sample-1g", 4)
file.persist(org.apache.spark.storage.StorageLevel.OFF_HEAP)
file.count()

3. Checking what is cached in Alluxio
[Image: cache status in Alluxio]
As the image above shows, the data has been cached in Alluxio.
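The same check can also be made from the command line instead of the web page. The sketch below wraps the Alluxio shell's fs ls command. The function name list_spark_blocks and the ALLUXIO_HOME variable are my own assumptions, and how the block manager lays out files under the base directory depends on the implementation, so treat the output as a rough indicator only.

```shell
# Hypothetical wrapper: list the directory Spark uses as its external block store.
# Assumption: ALLUXIO_HOME points at the Alluxio installation directory.
list_spark_blocks() {
  # $1: Alluxio path to list; defaults to the baseDir from spark-defaults.conf
  "${ALLUXIO_HOME:?set ALLUXIO_HOME first}/bin/alluxio" fs ls "${1:-/tmp_spark_alluxio}"
}
```

Calling list_spark_blocks after the file.count() action has run should show entries under /tmp_spark_alluxio if the persist went to Alluxio.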

Because this is an open-source solution, test it thoroughly before applying it in a real environment.

0 0