HDFS Erasure Coding


Comparison of Common Storage Strategies in Distributed File Systems

To handle large files, distributed file systems usually divide a file into fixed-size logical blocks, which are then mapped onto physical blocks in the cluster. The mapping can be done in two ways: contiguous or striped. Contiguous mapping is simple to implement: data is laid out on physical blocks in sequence. Striped mapping, by contrast, divides a logical block into smaller cells and writes data to the different physical blocks in round-robin fashion.
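As a rough illustration of the striped mapping, the sketch below deals cells out to the storage blocks of a group in round-robin order. The cell size, stripe width, and all class and method names are assumptions made for the example, not HDFS internals.

/** Illustrative sketch of a round-robin striped layout; names and sizes are assumptions. */
public class StripedLayoutSketch {
    static final long CELL_SIZE = 1L << 20;   // assume 1 MB cells
    static final int DATA_BLOCKS = 6;         // assume a 6-wide stripe, e.g. the data part of RS(6,3)

    /** Index of the storage block (within the block group) that holds the byte at fileOffset. */
    static int storageBlockIndex(long fileOffset) {
        long cellIndex = fileOffset / CELL_SIZE;       // which cell the byte falls into
        return (int) (cellIndex % DATA_BLOCKS);        // cells are dealt out round-robin
    }

    /** Offset of that byte inside its storage block. */
    static long offsetInStorageBlock(long fileOffset) {
        long cellIndex = fileOffset / CELL_SIZE;
        long stripeIndex = cellIndex / DATA_BLOCKS;    // how many full stripes precede this cell
        return stripeIndex * CELL_SIZE + fileOffset % CELL_SIZE;
    }

    public static void main(String[] args) {
        long off = 7L * (1 << 20) + 123;                 // byte 123 of the 8th cell
        System.out.println(storageBlockIndex(off));      // -> 1 (cell 7 lands on block index 1)
        System.out.println(offsetInStorageBlock(off));   // -> 1048699 (second cell of that block, plus 123)
    }
}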

Combining the two layout options (contiguous vs. striped) with the two redundancy options (replication vs. erasure coding) divides storage designs into four quadrants, as pictured below. Ceph sits in quadrants 1 and 4: it implements both striped replication and striped erasure coding. HDFS, in contrast, uses a contiguous replication strategy.

[Figure: the four quadrants formed by contiguous vs. striped layout and replication vs. erasure coding]

Design and Implementation of HDFS Erasure Coding

Several design choices had to be made for the HDFS erasure coding implementation. A contiguous layout is simple to implement, but it is only suitable for large files. Take RS(10,4) with a 128 MB block size as an example: for a file smaller than 128 MB, the storage overhead would be 400%, a storage efficiency even worse than three-way replication.
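The small-file penalty can be seen with a back-of-the-envelope calculation. The sketch below is illustrative arithmetic only, with an assumed 1 MB cell size; it is not HDFS code.

/** Illustrative arithmetic only; the block and cell sizes are assumptions, not HDFS code. */
public class SmallFileOverhead {
    public static void main(String[] args) {
        long MB = 1L << 20;
        long cellSize = 1 * MB;          // assumed striping cell size
        long fileSize = 10 * MB;         // a file much smaller than one 128 MB block

        // 3-way replication: 200% overhead regardless of file size.
        long replicationExtra = 2 * fileSize;

        // Contiguous RS(10,4): the file sits in one (partial) data block, yet still
        // needs 4 parity blocks covering the same extent -> 400% overhead.
        long contiguousParity = 4 * fileSize;

        // Striped RS(10,4): parity is computed per stripe of 10 cells,
        // so overhead stays near the nominal 40% even for small files.
        long stripes = (fileSize + 10 * cellSize - 1) / (10 * cellSize);
        long stripedParity = stripes * 4 * cellSize;

        System.out.println("replication extra bytes: " + replicationExtra / MB + " MB (200%)");
        System.out.println("contiguous EC parity:    " + contiguousParity / MB + " MB (400%)");
        System.out.println("striped EC parity:       " + stripedParity / MB + " MB (~40%)");
    }
}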

A striped layout, by contrast, works for both small and large files. Striping into small cells also allows the erasure code to be generated directly as the data is written. Its drawback is a significant performance impact on locality-sensitive computation tasks; to address this, it would ideally be possible to convert striped files into a contiguous layout.

Based on this analysis, file size is the decisive factor: if the file system holds mostly large files, a contiguous erasure-coded layout is the better fit. The typical file size in the system also deserves attention, because it strongly influences how the RS(X, Y) parameters are chosen.

The key to the design and implementation of HDFS erasure coding is a block structure that supports data striping. In the existing HDFS, however, the notion of a contiguous block runs through every part of the design, so a mapping between contiguous blocks and striped data had to be devised. To this end, the concepts of the storage block and the logical block are introduced: a logical block represents a contiguous range of data in a file, while a storage block is a chunk of data actually stored on a DataNode.
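A minimal sketch of this two-level abstraction is shown below; the class and field names are hypothetical and only meant to make the relationship concrete, not the actual NameNode data structures.

/** Hypothetical sketch of the logical-block / storage-block relationship. */
class StorageBlock {
    final long blockId;        // ID reported by a DataNode
    final String dataNode;     // where this chunk physically lives
    StorageBlock(long blockId, String dataNode) {
        this.blockId = blockId;
        this.dataNode = dataNode;
    }
}

/** A logical block (block group): a contiguous byte range of the file, backed by several storage blocks. */
class LogicalBlock {
    final long groupId;                  // identifies the group as a whole
    final StorageBlock[] storageBlocks;  // data cells plus parity cells, indexed by position in the stripe
    LogicalBlock(long groupId, int dataUnits, int parityUnits) {
        this.groupId = groupId;
        this.storageBlocks = new StorageBlock[dataUnits + parityUnits];
    }
}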

NameNode Extensions

The HDFS NameNode computes the correspondence between each storage block and its logical block: a logical block is located by its logical block ID, and the mapping is then used to find its storage blocks. The problem with a straightforward mapping is that it greatly increases NameNode memory usage.

To address this memory overhead, a new block naming protocol was introduced. Today HDFS assigns block IDs sequentially based on block creation time. The new protocol divides each block ID into two or three segments: the first bit of the ID is a flag, 0 for a contiguous block and 1 for a striped block. For striped blocks, the remaining bits are split into two parts: the middle part identifies the logical block (the block group), and the trailing part is the index of the storage block within that group. This design lets the NameNode map a logical block onto its group of storage blocks simply by processing DataNode block reports, and keeps the increase in NameNode memory usage within 70%. The figure below shows the resulting mapping.

[Figure: striped block ID layout and the mapping from a block group to its storage blocks]
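The naming scheme boils down to bit manipulation on the block ID. The sketch below assumes, for illustration only, that the top bit is the striped flag and the low 4 bits carry the index within the group; the exact field widths used by HDFS may differ.

/** Sketch of the striped block ID scheme; the bit widths are illustrative assumptions. */
public class BlockIdSketch {
    static final int INDEX_BITS = 4;                       // assume the low 4 bits carry the index
    static final long INDEX_MASK = (1L << INDEX_BITS) - 1;
    static final long STRIPED_FLAG = 1L << 63;             // assume the top bit marks striped blocks

    static boolean isStriped(long blockId) {
        return (blockId & STRIPED_FLAG) != 0;
    }

    /** ID of the logical block (block group) a striped storage block belongs to. */
    static long groupId(long blockId) {
        return blockId & ~INDEX_MASK;                      // drop the per-block index, keep flag + group bits
    }

    /** Position of the storage block inside its block group (data or parity slot). */
    static int indexInGroup(long blockId) {
        return (int) (blockId & INDEX_MASK);
    }
}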

Supporting the logical block abstraction required a fair number of changes to the NameNode, for example to the data recovery algorithm, file scanning, and the balancer.

Client Extensions

The HDFS client data path is implemented mainly by the DFSInputStream and DFSOutputStream classes. We extended them as DFSStripedInputStream and DFSStripedOutputStream to support striping and erasure coding. The main purpose of these extensions is to let the client operate on the storage blocks of a logical block in parallel.

On the write path, DFSStripedOutputStream manages the group of storage blocks of a logical block spread across different DataNodes. The streaming is largely asynchronous, and a coordinator tracks the state of the whole logical block and handles tasks such as allocating new block groups.
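A simplified picture of such a write path is sketched below: one streamer per target block, with parity computed per stripe before the cells are sent out. The class names and the RS(6,3) choice are assumptions for the example, not the real DFSStripedOutputStream.

import java.util.concurrent.*;

/** Simplified sketch of a striped write path; not the real DFSStripedOutputStream. */
public class StripedWriteSketch {
    static final int DATA = 6, PARITY = 3;               // RS(6,3), chosen for illustration
    final ExecutorService streamers = Executors.newFixedThreadPool(DATA + PARITY);

    /** Writes one full stripe: DATA data cells plus PARITY parity cells, one cell per target block. */
    void writeStripe(byte[][] dataCells, Encoder encoder) throws Exception {
        byte[][] parityCells = encoder.encode(dataCells);   // parity is generated on the client, per stripe
        CountDownLatch done = new CountDownLatch(DATA + PARITY);
        for (int i = 0; i < DATA + PARITY; i++) {
            final int blockIndex = i;                        // i-th storage block of the block group
            final byte[] cell = i < DATA ? dataCells[i] : parityCells[i - DATA];
            streamers.submit(() -> { sendToDataNode(blockIndex, cell); done.countDown(); });
        }
        done.await();                                        // the coordinator waits for the whole stripe
    }

    interface Encoder { byte[][] encode(byte[][] dataCells); }

    void sendToDataNode(int blockIndex, byte[] cell) { /* stream the cell to the DataNode holding that block */ }
}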

On the read path, DFSStripedInputStream translates a read of a logical block into reads of the storage blocks on the DataNodes, and fetches the data in parallel.
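The parallel read can be sketched in the same spirit; again the names, RS(6,3) width, and cell size below are assumptions, and the degraded-read path (decoding from parity when a cell is missing) is only hinted at in a comment.

import java.util.*;
import java.util.concurrent.*;

/** Simplified sketch of a striped parallel read; not the real DFSStripedInputStream. */
public class StripedReadSketch {
    static final int DATA = 6;                  // RS(6,3) data width, chosen for illustration
    static final int CELL = 1 << 20;            // assumed 1 MB cell size
    final ExecutorService pool = Executors.newFixedThreadPool(DATA);

    /** Reads one stripe of a logical block by fetching its data cells from the DataNodes in parallel. */
    byte[] readStripe(long groupId, long stripeIndex) throws Exception {
        List<Future<byte[]>> cells = new ArrayList<>();
        for (int i = 0; i < DATA; i++) {
            final int blockIndex = i;
            cells.add(pool.submit(() -> readCell(groupId, blockIndex, stripeIndex)));
        }
        byte[] stripe = new byte[DATA * CELL];
        for (int i = 0; i < DATA; i++) {
            // Reassemble the logical byte range in stripe order; a failed read would instead
            // trigger decoding the missing cell from parity (omitted here).
            System.arraycopy(cells.get(i).get(), 0, stripe, i * CELL, CELL);
        }
        return stripe;
    }

    byte[] readCell(long groupId, int blockIndex, long stripeIndex) {
        return new byte[CELL];                  // placeholder for the actual DataNode read
    }
}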

DataNode Extensions

To avoid the heavy cost of reconstructing data on the client, it is essential to detect and repair DataNode failures in the background. As with replicated storage, the NameNode tracks missing blocks and assigns the repair work to DataNodes. On the DataNode, recovery is handled by a new component, the ErasureCodingWorker. The recovery procedure is: create a thread pool, read data from the surviving sources, and recompute the missing data from the erasure code.

Decoding and output: just like the client, the ECWorker uses the codec framework to encode and decode, and then transfers the reconstructed blocks to the target DataNodes.
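The overall reconstruction flow might look roughly like the sketch below: read enough surviving cells in parallel, decode the missing one, and ship it to the target node. All names and the RS(6,3) choice are assumptions for illustration, not the real ErasureCodingWorker.

import java.util.concurrent.*;

/** Simplified sketch of background reconstruction; not the real ErasureCodingWorker. */
public class ReconstructionSketch {
    static final int DATA = 6, PARITY = 3;              // RS(6,3), chosen for illustration
    final ExecutorService readers = Executors.newFixedThreadPool(DATA);

    /** Rebuilds the storage block at missingIndex and ships it to targetNode. */
    void reconstruct(long groupId, int missingIndex, String targetNode, Decoder decoder) throws Exception {
        // 1. Read DATA healthy cells of the stripe from surviving DataNodes, in parallel.
        byte[][] inputs = new byte[DATA + PARITY][];
        CompletionService<Void> cs = new ExecutorCompletionService<>(readers);
        int launched = 0;
        for (int i = 0; i < DATA + PARITY && launched < DATA; i++) {
            if (i == missingIndex) continue;
            final int idx = i;
            cs.submit(() -> { inputs[idx] = readCellFrom(groupId, idx); return null; });
            launched++;
        }
        for (int i = 0; i < launched; i++) cs.take().get();

        // 2. Decode the missing cell from the cells that were read.
        byte[] rebuilt = decoder.decode(inputs, missingIndex);

        // 3. Transfer the reconstructed block to the chosen target DataNode.
        sendTo(targetNode, groupId, missingIndex, rebuilt);
    }

    interface Decoder { byte[] decode(byte[][] inputs, int erasedIndex); }
    byte[] readCellFrom(long groupId, int blockIndex) { return new byte[1 << 20]; }
    void sendTo(String node, long groupId, int blockIndex, byte[] data) { /* DataNode transfer */ }
}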

Codec Calculation Framework

Data encoding/decoding is very CPU intensive and can be a major overhead when using erasure coding. To mitigate this in HDFS-EC, we leverage Intel's open-source Intelligent Storage Acceleration Library (Intel ISA-L), which accelerates EC-related linear algebra calculations by exploiting advanced hardware instruction sets like SSE, AVX, and AVX2. ISA-L supports all major operating systems, including Linux and Windows.

In HDFS-EC we implemented the Reed-Solomon algorithm in two forms: one based on ISA-L and another in pure Java (suitable for systems without the required CPU models). We have compared the performance of these two implementations, as well as the coder from Facebook's HDFS-RAID implementation. A (6,3) schema is used in all tests in this section.
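The pluggable coders share a simple encode/decode shape. The toy sketch below uses single-parity XOR purely so that the example stays short and verifiable; it only stands in for the interface shape of the Reed-Solomon coders discussed above and is not one of them.

/** Toy single-parity (XOR) coder, standing in for the RS coders; shows the interface shape only. */
public class XorCoderSketch {
    /** Encode: the parity cell is the XOR of all data cells. */
    static byte[] encode(byte[][] dataCells) {
        byte[] parity = new byte[dataCells[0].length];
        for (byte[] cell : dataCells)
            for (int i = 0; i < parity.length; i++) parity[i] ^= cell[i];
        return parity;
    }

    /** Decode one erased data cell: XOR of the parity and all surviving data cells. */
    static byte[] decode(byte[][] survivingCells, byte[] parity) {
        byte[] missing = parity.clone();
        for (byte[] cell : survivingCells)
            for (int i = 0; i < missing.length; i++) missing[i] ^= cell[i];
        return missing;
    }

    public static void main(String[] args) {
        byte[][] data = { {1, 2}, {3, 4}, {5, 6} };
        byte[] parity = encode(data);
        // Pretend data[1] is lost; recover it from the other cells plus parity.
        byte[] recovered = decode(new byte[][] { data[0], data[2] }, parity);
        System.out.println(recovered[0] + "," + recovered[1]);   // -> 3,4
    }
}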

Figure 8 first shows results from an in-memory encoding/decoding micro benchmark. The ISA-L implementation outperforms the HDFS-EC Java implementation by more than 4x, and the Facebook HDFS-RAID coder by ~20x. Based on the results, we strongly recommend the ISA-L accelerated implementation for all production deployments.

Figure 8. Encoding and decoding performance comparison

We also compared end-to-end HDFS I/O performance with these different coders against HDFS's default scheme of three-way replication. The tests were performed on a cluster with 11 nodes (1 NameNode, 9 DataNodes, 1 client node) interconnected with a 10 GigE network. Figure 9 shows the throughput results of 1) the client writing a 12GB file to HDFS; and 2) the client reading a 12GB file from HDFS. In the reading tests we manually killed two DataNodes so the results include decoding overhead.

As shown in Figure 9, in both the sequential write and read benchmarks, throughput is greatly constrained by the pure Java coders (HDFS-RAID and our own implementation). The ISA-L implementation is much faster than the pure Java coders because of its excellent CPU efficiency. It also outperforms replication by 2-3x because the striped layout allows the client to perform I/O with multiple DataNodes in parallel, leveraging the aggregate bandwidth of their disk drives. We have also tested read performance without any DataNode failure: HDFS-EC is roughly 5x faster than three-way replication.

Note that further performance gains should be possible. With an RS (6,3) layout, a striped layout should be able to achieve approximately a 6x improvement in I/O throughput, or approximately 1 GB/s of throughput. The current performance does not meet the theoretical optimum partially because the striped layout spreads logically sequential I/O requests to multiple DataNodes, potentially degrading the sequential I/O pattern on local disk drives. We plan to add more advanced prefetching and write buffering to the client as a future optimization.
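The per-stream baseline implied by the "6x, about 1 GB/s" estimate can be backed out with trivial arithmetic; the snippet below is illustrative only and uses no measured numbers beyond those quoted above.

/** Illustrative arithmetic only: the per-stream rate implied by "6x, about 1 GB/s". */
public class ThroughputEstimate {
    public static void main(String[] args) {
        int parallelDataStreams = 6;              // RS(6,3): the client streams to 6 data blocks at once
        double theoreticalGBps = 1.0;             // optimum quoted in the text above
        double perStreamMBps = theoreticalGBps * 1024 / parallelDataStreams;
        System.out.printf("implied single-stream throughput: ~%.0f MB/s%n", perStreamMBps);
    }
}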

Figure 9. HDFS I/O performance comparison

Another important optimization in ISA-L is support for incremental coding. This means applications do not have to wait for all the source data before starting the coding process. This will potentially enable HDFS-EC to efficiently handle slow writing applications, as well as append operations.

Future Work

This article summarizes the first development phase for HDFS-EC. Many exciting extensions and optimizations have been identified and documented under HDFS-8031. A major follow-on task is to build a generic EC policy framework which allows system users to deploy and configure multiple coding schemas such as conventional Reed-Solomon, HitchHiker, LRC, and so forth. By abstracting and modularizing common codec logic, the framework will also enable users to easily develop new EC algorithms. We also plan to further optimize NameNode memory consumption and reduce data reconstruction latency.

To save storage space on files belonging to locality-sensitive workloads, we have established HDFS-EC phase II (HDFS-8030) to support EC with a contiguous block layout.

 
