HDFS Summary

Source: Internet | Published by: 淄博用友软件 | Editor: 程序博客网 | Time: 2024/06/01 08:34

1、HDFS: Motivation

（1）Based on Google’s GFS

（2）Redundant storage of massive amounts of data on cheap and unreliable computers

（3）Why not use an existing file system?

– Different workload and design priorities

– Handles much bigger dataset sizes than other filesystems

2、HDFS Design Decisions

（1）Files stored as blocks

– Much larger block size than most filesystems (default is 64 MB in Hadoop 1.x; 128 MB since Hadoop 2.x)

（2）Reliability through replication

– Each block replicated across 3+ DataNodes

（3）Single master (NameNode) coordinates access, metadata

– Simple centralized management

（4）No data caching

– Little benefit, because data sets are large and reads are streaming

（5）Familiar interface, but a customized API

– Simplify the problem; focus on distributed apps
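The block and replication decisions above can be illustrated with a little arithmetic. The following is a minimal sketch, not HDFS code: it assumes the 64 MB block size and replication factor of 3 from the notes, and the function names `block_count` and `place_replicas` are hypothetical. The round-robin placement is a stand-in chosen for simplicity; real HDFS uses its own rack-aware placement policy.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the default cited in these notes
REPLICATION = 3                 # replicas per block, as in the notes


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file of file_size bytes (ceiling division)."""
    return (file_size + block_size - 1) // block_size


def place_replicas(block_index, datanodes, replication=REPLICATION):
    """Pick `replication` distinct DataNodes for one block.

    Hypothetical round-robin policy for illustration only;
    it is NOT the placement policy HDFS actually uses.
    """
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    start = block_index % len(datanodes)
    return [datanodes[(start + i) % len(datanodes)] for i in range(replication)]


one_gb = 1024 ** 3
nodes = ["dn1", "dn2", "dn3", "dn4"]   # made-up DataNode names

print(block_count(one_gb))        # 16
print(place_replicas(0, nodes))   # ['dn1', 'dn2', 'dn3']
print(place_replicas(3, nodes))   # ['dn4', 'dn1', 'dn2']
```

A 1 GB file therefore becomes 16 blocks, and with three replicas each the cluster holds 48 block copies. Large blocks keep this count small, which in turn keeps the amount of metadata the single master must track small.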

3、HDFS Client Block Diagram

4、Based on GFS Architecture

5、Metadata

（1）Single NameNode stores all metadata

– Filenames, and the locations on DataNodes of each file's blocks

（2）Maintained entirely in RAM for fast lookup

（3）DataNodes store opaque file contents in “block” objects on underlying local filesystem
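One way to picture the metadata described above is as two in-memory maps: one from file name to its ordered block IDs, and one from block ID to the DataNodes that hold a replica. The sketch below is an illustrative assumption, not the NameNode's actual data structures; the class `NameNodeSketch`, its methods, and the sample path and node names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class NameNodeSketch:
    """Toy in-memory metadata store (hypothetical, for illustration only)."""
    # file path -> ordered list of block IDs making up the file
    files: Dict[str, List[int]] = field(default_factory=dict)
    # block ID -> names of DataNodes holding a replica of that block
    locations: Dict[int, Set[str]] = field(default_factory=dict)

    def add_block(self, path: str, block_id: int, datanodes: List[str]) -> None:
        """Append a block to a file and record where its replicas live."""
        self.files.setdefault(path, []).append(block_id)
        self.locations.setdefault(block_id, set()).update(datanodes)

    def lookup(self, path: str) -> List[Tuple[int, List[str]]]:
        """Return (block ID, sorted DataNode names) for each block of a file."""
        return [(b, sorted(self.locations[b])) for b in self.files[path]]


nn = NameNodeSketch()
nn.add_block("/logs/a.log", 1, ["dn1", "dn2", "dn3"])
nn.add_block("/logs/a.log", 2, ["dn2", "dn3", "dn4"])
print(nn.lookup("/logs/a.log"))
# [(1, ['dn1', 'dn2', 'dn3']), (2, ['dn2', 'dn3', 'dn4'])]
```

Both maps are plain dictionaries held in memory, which mirrors the point that lookups are fast because nothing is read from disk. The block contents themselves never appear here: they stay on the DataNodes as opaque objects.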

6、HDFS Conclusions

（1）HDFS supports large-scale processing workloads on commodity hardware

– designed to tolerate frequent component failures

– optimized for huge files that are mostly appended and read

– filesystem interface is customized for the job, but still retains familiarity for developers

– simple solutions can work (e.g., single master)

（2）Reliably stores several TB in individual clusters