Replica Placement in DFS

来源：互联网发布：东南亚旅游知乎编辑：程序博客网时间：2024/05/20 16:13

Hadoop:
How does the namenode choose which datanodes to store replicas on? There’s a tradeoff between reliability and write bandwidth and read bandwidth here.

For example, placing all replicas on a single node incurs the lowest write bandwidth penalty since the replication pipeline runs on a single node, but this offers no real redundancy (if the node fails, the data for that block is lost). Also, the read bandwidth is high for off-rack reads. At the other extreme, placing replicas in different data centers may maximize redundancy, but at the cost of bandwidth. Even in the same data center (which is what all Hadoop clusters to date have run in), there are a variety of placement strategies.
Indeed, Hadoop changed its placement strategy in release 0.17.0 to one that helps keep a fairly even distribution of blocks across the cluster. (See “balancer” on page 284 for details on keeping a cluster balanced.)
Hadoop’s strategy is to place the first replica on the same node as the client (for clients running outside the cluster, a node is chosen at random, although the system tries not to pick nodes that are too full or too busy). The second replica is placed on a different rack from the first (off-rack), chosen at random.The third replica is placed on the same rack as the second, but on a different node chosen at random. Further replicas are placed on random nodes on the cluster, although the system tries to avoid placing too many replicas on the same rack.
Once the replica locations have been chosen, a pipeline is built, taking network topology into account. For a replication factor of 3, the pipeline might look like Figure 3-4.
Overall, this strategy gives a good balance between reliability (blocks are stored on two racks), write bandwidth (writes only have to traverse a single network switch), read
performance (there’s a choice of two racks to read from), and block distribution across the cluster (clients only write a single block on the local rack).

// some other issues about placement algorithm

1, 应该记录/预测前一段时间内向某机器分配的数据量, 到达一定值后降低分配到该机器的概率. 因为该机器网络会成为瓶劲.

2, 在考虑机器磁盘free disk size 时, 可能会好的办法是考虑磁盘的空间使用率, 使用率高的被选中的概率高. 因为集内可能磁盘异构.