Alluxio Paper


Original paper: http://people.eecs.berkeley.edu/~haoyuan/papers/2014_EECS_tachyon.pdf

Reliable, Memory Speed Storage for Cluster Computing Frameworks

Abstract

Tachyon is a distributed file system enabling reliable data sharing at memory speed across cluster computing frameworks. While caching today improves read workloads, writes are either network or disk bound, as replication is used for fault-tolerance. Tachyon eliminates this bottleneck by pushing lineage, a well-known technique borrowed from application frameworks, into the storage layer. The key challenge in making a long-lived lineage-based storage system is timely data recovery in case of failures. Tachyon addresses this issue by introducing a checkpointing algorithm that guarantees bounded recovery cost and resource allocation strategies for recomputation under common resource schedulers. Our evaluation shows that Tachyon outperforms in-memory HDFS by 110x for writes. It also improves the end-to-end latency of a realistic workflow by 4x. Tachyon is open source and is deployed at multiple companies.

Abstract (Translation)

Tachyon is a distributed file system that provides reliable data sharing across cluster computing frameworks at memory speed. Although caching today effectively improves read performance, writes remain slow, because most current distributed systems rely on replication for fault tolerance, which makes the network or the disk the bottleneck. Tachyon pushes the lineage technique from application frameworks (such as Spark) down into the storage layer, avoiding heavy replication and thus removing the network and disk bottlenecks. The main challenge in building a long-lived lineage-based storage system is recovering data in a timely manner after failures. To solve this, Tachyon introduces a checkpointing algorithm that bounds recovery time and allocates resources appropriately for recomputation. In our evaluation, Tachyon writes 110x faster than in-memory HDFS and speeds up the end-to-end latency of a realistic workflow by 4x. Tachyon is open source and has already been deployed at several companies.
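To make the lineage idea concrete, here is a minimal, hypothetical sketch (not the Tachyon API; all names are invented for illustration) of a store that keeps a single in-memory copy of each file and records how that file was produced, so a lost copy can be recomputed on demand instead of being replicated up front.

```python
# Hypothetical sketch (not the Tachyon API): contrasts replication-based
# durability with lineage-based durability as described in the abstract.

class LineageStore:
    """Toy in-memory store that records lineage instead of replicating."""

    def __init__(self):
        self.data = {}      # file name -> bytes (single copy, memory speed)
        self.lineage = {}   # file name -> (recompute_fn, input file names)

    def write(self, name, payload, recompute_fn=None, inputs=()):
        # Only one in-memory copy is written; durability comes from
        # remembering *how* to recreate the file, not from extra copies.
        self.data[name] = payload
        if recompute_fn is not None:
            self.lineage[name] = (recompute_fn, tuple(inputs))

    def read(self, name):
        if name not in self.data:            # e.g. lost on a node failure
            fn, inputs = self.lineage[name]  # recompute from recorded lineage
            self.data[name] = fn(*(self.read(i) for i in inputs))
        return self.data[name]


# Usage: a two-step pipeline where step 2's output can be recomputed.
def double(raw):
    return b" ".join(str(2 * int(x)).encode() for x in raw.split())

store = LineageStore()
store.write("raw", b"1 2 3 4")
store.write("doubled", double(b"1 2 3 4"), recompute_fn=double, inputs=["raw"])

del store.data["doubled"]        # simulate losing the in-memory copy
print(store.read("doubled"))     # b'2 4 6 8', rebuilt from lineage
```

The point of the design is that the common-case write touches only local memory; the network and disk are paid for only on failure (or, in Tachyon, during asynchronous checkpointing).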

1. Introduction

Over the past few years, there have been tremendous efforts to improve the speed and sophistication of large-scale data-parallel processing systems. Practitioners and researchers have built a wide array of programming frameworks [29, 30, 31, 37, 46, 47] and storage systems [13, 14, 22, 23, 34] tailored to a variety of workloads.
Over the past few years, a great deal of effort has gone into improving the speed and sophistication of large-scale data-parallel processing systems.
Practitioners and researchers have built a wide variety of programming frameworks and storage systems tailored to different workloads.

As the performance of many of these systems is I/O bound, the traditional means of improving their speed is to cache data into memory [8, 11]. While caching can dramatically improve read performance, unfortunately, it does not help much with write performance. This is because these highly parallel systems need to provide fault-tolerance, and the way they achieve it is by replicating the data written across nodes. Even replicating the data in memory can lead to a significant drop in the write performance, as both the latency and throughput of the network are typically much worse than that of local memory.
Slow writes can significantly hurt the performance of job pipelines, where one job consumes the output of another. These pipelines are regularly produced by workflow managers such as Oozie [6] and Luigi [9], e.g., to perform data extraction with MapReduce, then execute a SQL query, then run a machine learning algorithm on the query's result. Furthermore, many high-level programming interfaces [2, 5, 40], such as Pig [33] and FlumeJava [16], compile programs into multiple MapReduce jobs that run sequentially. In all these cases, data is replicated across the network in-between each of the steps.
Most of these systems are I/O bound, so the traditional way to speed them up is to cache data in memory. Unfortunately, while caching can dramatically improve read performance, it does little for write performance. This is because these highly parallel systems need to provide fault tolerance, and they achieve it by replicating written data across nodes.
Even replicating the data only in memory noticeably reduces write performance, because network latency and throughput are far worse than those of local memory. Slow writes severely hurt the performance of a job pipeline, in which each job consumes the output of the previous one. Workflow managers such as Oozie and Luigi generate such pipelines, for example running a MapReduce data-extraction job, then a SQL query, then a machine learning algorithm over the query result. Moreover, high-level programming interfaces such as Pig and FlumeJava compile programs into multiple MapReduce jobs that run sequentially. In all of these scenarios, data is passed between steps by replicating it across the network.
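The following back-of-envelope sketch uses assumed, illustrative bandwidth numbers (they are not figures from the paper) to show why pushing even one in-memory replica over the network dominates the cost of a pipeline stage's write.

```python
# Back-of-envelope sketch (illustrative numbers, not from the paper):
# writing to local memory vs. also replicating the same data over the network.

GB = 1e9
data_size = 10 * GB              # output of one pipeline stage

mem_bandwidth = 20 * GB          # ~20 GB/s local memory bandwidth (assumed)
net_bandwidth = 1.25 * GB        # 10 Gbps link ~= 1.25 GB/s (assumed)
replicas = 2                     # one extra remote copy for fault tolerance

local_write = data_size / mem_bandwidth
replicated_write = local_write + (replicas - 1) * data_size / net_bandwidth

print(f"local in-memory write : {local_write:.1f} s")
print(f"with network replica  : {replicated_write:.1f} s "
      f"({replicated_write / local_write:.0f}x slower)")
```

Under these assumptions the replicated write is roughly an order of magnitude slower than the local one, which is exactly the gap that lineage-based storage tries to close.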
