Hadoop Distributed FileSystem (HDFS) Architectural Documentation - Overview
Full text: http://kazman.shidler.hawaii.edu/ArchDoc.html
3 Overview of the HDFS Architecture
This section provides a quick overview of the architecture of HDFS. The material here is elaborated in other sections. The figure below gives a run-time view of the architecture showing three types of address spaces: the application, the NameNode, and the DataNode. An essential feature of HDFS is that there are multiple instances of DataNode.
The application incorporates the HDFS client library into its address space. The client library manages all communication from the application to the NameNode and the DataNodes. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per computer node in the cluster, which manage storage attached to the nodes that they run on.
The NameNode and DataNode are pieces of software designed to run on commodity machines. These machines typically run a GNU/Linux operating system (OS). HDFS is built using the Java language; any machine that supports Java can run the NameNode or the DataNode software. Use of the Java language means that HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case.
3.1 HDFS Files
There is a distinction between an HDFS file and a native (Linux) file on the host computer. A computer in an HDFS installation is (typically) allocated to one NameNode or one DataNode, and each computer has its own file system. Information about an HDFS file (the metadata) is managed by the NameNode, and the persistent part of that information is stored in the NameNode's host file system. The information contained in an HDFS file is managed by the DataNodes and stored in each DataNode's host computer file system.
HDFS exposes a file system namespace and allows user data to be stored in HDFS files. An HDFS file consists of a number of blocks. Each block is typically 64 MB. Each block is replicated some specified number of times. The replicas of a block are stored on different DataNodes, chosen to reflect the load on each DataNode as well as to provide both speed in transfer and resiliency in case of failure of a rack. See Block Allocation for a description of the allocation algorithm.
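The relationship between file size, block count, and raw storage can be illustrated with a small calculation. This is only a sketch: the function names are hypothetical, the replication factor of three is taken from the Block Allocation section, and the code assumes the final block occupies only the bytes it actually holds.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # typical block size cited in the text
REPLICATION = 3                 # assumed replication factor


def block_layout(file_size, block_size=BLOCK_SIZE):
    """Return the size in bytes of each block of a file (hypothetical helper)."""
    if file_size < 0:
        raise ValueError("file size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # Assumption: the last block is only as large as the data it holds.
        sizes.append(remainder)
    return sizes


def raw_storage(file_size, replication=REPLICATION):
    """Total bytes stored across all DataNodes for one file."""
    return file_size * replication
```

Under these assumptions, a 200 MB file is split into three full 64 MB blocks plus one 8 MB block, and consumes 600 MB of raw storage across the cluster.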
A standard directory structure is used in HDFS. That is, HDFS files exist in directories that may in turn be sub-directories of other directories, and so on. There is no concept of a current directory within HDFS. HDFS files are referred to by their fully qualified name, which is a parameter of many of the interactions between the Client and the other elements of the HDFS architecture.
The NameNode executes HDFS file system namespace operations like opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The list of blocks belonging to each HDFS file, the current location of the block replicas on the DataNodes, the state of the file, and the access control information make up the metadata for the cluster, which is managed by the NameNode.
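The metadata described above can be pictured as two maps: one from fully qualified file names to per-file records, and one from block IDs to the DataNodes holding replicas. The following is a minimal in-memory sketch, not the actual NameNode data structures; every class, field, and state name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    owner: str = ""
    mode: int = 0o644                       # access control information
    state: str = "under_construction"       # hypothetical state name
    blocks: list = field(default_factory=list)  # ordered block IDs


class NameNodeMeta:
    """Toy model of the metadata the text attributes to the NameNode."""

    def __init__(self):
        self.files = {}       # fully qualified path -> FileMeta
        self.locations = {}   # block ID -> set of DataNode names

    def create(self, path, owner):
        if not path.startswith("/"):
            # There is no current directory, so only full names are accepted.
            raise ValueError("path must be fully qualified: %r" % path)
        if path in self.files:
            raise FileExistsError(path)
        self.files[path] = FileMeta(owner=owner)

    def add_block(self, path, block_id, datanodes):
        self.files[path].blocks.append(block_id)
        self.locations[block_id] = set(datanodes)

    def rename(self, src, dst):
        if dst in self.files:
            raise FileExistsError(dst)
        # Only metadata moves; block replicas stay where they are.
        self.files[dst] = self.files.pop(src)

    def block_locations(self, path):
        return [(b, sorted(self.locations[b])) for b in self.files[path].blocks]
```

In this model a rename touches only the path map, which illustrates why namespace operations can be handled by the NameNode without involving any DataNode.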
The DataNodes are responsible for serving read and write requests from the HDFS file system's clients. The DataNodes also perform block replica creation, deletion, and replication upon instruction from the NameNode. The DataNodes are the arbiters of the state of the replicas, and they report this state to the NameNode.
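Because the DataNodes are the authority on which replicas actually exist, the NameNode's view of replica locations has to follow what each DataNode reports. A rough sketch of that reconciliation is shown here; the function name and the shape of the report (a plain list of block IDs) are assumptions made for illustration, not the real protocol.

```python
def apply_block_report(locations, datanode, reported_blocks):
    """Update a block ID -> set-of-DataNodes map from one DataNode's report.

    Hypothetical helper: the DataNode's report is treated as the truth,
    so entries it no longer reports are dropped and new ones are added.
    """
    reported = set(reported_blocks)
    for block_id, holders in locations.items():
        if block_id not in reported:
            holders.discard(datanode)   # replica no longer present on this node
    for block_id in reported:
        locations.setdefault(block_id, set()).add(datanode)
    return locations
```

For example, if the map says `dn1` holds blocks 1 and 2 but `dn1` reports blocks 2 and 3, the map ends up with `dn1` removed from block 1 and added to block 3.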
The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbiter and repository for all HDFS metadata. The client sends data directly to and reads directly from DataNodes, so client data never flows through the NameNode.
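The separation between the metadata path and the data path can be sketched as follows. This is an in-memory illustration only: the classes and method names are hypothetical stand-ins, not the HDFS client API. The point it shows is that the NameNode object returns only block locations, while the bytes come from DataNode objects.

```python
class NameNode:
    def __init__(self, block_map):
        # path -> list of (block ID, [DataNode names]); metadata only, no file data
        self.block_map = block_map

    def get_block_locations(self, path):
        return self.block_map[path]


class DataNode:
    def __init__(self, blocks):
        self.blocks = blocks   # block ID -> bytes

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    """Read a file by asking the NameNode where blocks live, then
    fetching each block directly from a DataNode that holds it."""
    data = bytearray()
    for block_id, holders in namenode.get_block_locations(path):
        for name in holders:
            node = datanodes.get(name)   # None models an unreachable DataNode
            if node is not None and block_id in node.blocks:
                data += node.read_block(block_id)
                break
        else:
            raise IOError("no reachable replica for block %d" % block_id)
    return bytes(data)
```

In this sketch the client falls back to the next listed DataNode when one is unreachable, which is one reason each block has several replicas.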
3.2 Block Allocation
Each block is replicated some number of times; the default replication factor for HDFS is three. When addBlock() is invoked, space is allocated for each replica. Each replica is allocated on a different DataNode. The algorithm for performing this allocation attempts to balance performance and reliability. This is done by considering the following factors:
- The dynamic load on the set of DataNodes. Preference is given to more lightly loaded DataNodes.
- The location of the DataNodes. Communication between two nodes in different racks has to go through switches. In most cases, network bandwidth between machines in the same rack is greater than network bandwidth between machines in different racks.
- For the common case, when the replication factor is three, HDFS's placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack. This policy cuts the inter-rack write traffic, which generally improves write performance. The chance of rack failure is far less than that of a node failure; therefore this co-location policy does not adversely impact data reliability and availability guarantees. However, it does reduce the aggregate network bandwidth used when reading data, since a block is placed in only two unique racks rather than three. With this policy, the replicas of a file are not evenly distributed across the racks. One third of the replicas are on one node on some rack; the other two thirds are on distinct nodes on a different rack. This policy improves write performance without compromising data reliability or read performance.
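The placement policy for a replication factor of three can be sketched as below. This is not the HDFS implementation: the function names are hypothetical, and choosing the single least loaded node deterministically is a simplifying assumption that stands in for "preference is given to more lightly loaded DataNodes".

```python
def least_loaded(nodes, load, exclude=()):
    """Pick the node with the smallest load, breaking ties by name."""
    candidates = [n for n in nodes if n not in exclude]
    if not candidates:
        raise ValueError("no eligible DataNode")
    return min(candidates, key=lambda n: (load.get(n, 0), n))


def place_replicas(local_rack, topology, load):
    """Choose three DataNodes for one block.

    topology: rack name -> list of DataNode names
    load:     DataNode name -> current load (any comparable number)
    """
    # Replica 1: a node in the writer's local rack.
    first = least_loaded(topology[local_rack], load)

    # Replicas 2 and 3 share one remote rack, so it needs at least two nodes.
    remote_racks = [r for r, nodes in topology.items()
                    if r != local_rack and len(nodes) >= 2]
    if not remote_racks:
        raise ValueError("no remote rack with two DataNodes")
    remote = min(remote_racks,
                 key=lambda r: (min(load.get(n, 0) for n in topology[r]), r))

    # Replica 2: a node in the remote rack; replica 3: a different node there.
    second = least_loaded(topology[remote], load)
    third = least_loaded(topology[remote], load, exclude={second})
    return [first, second, third]
```

With this layout only one copy of the block crosses between racks during the write, while the block still survives the loss of either rack.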
The figure below shows how blocks are replicated on different DataNodes.
Blocks are linked to the file through the INode. Each block is given a timestamp that is used to determine whether a replica is current. We discuss this further in the Block and Replica Management section.
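How a per-block stamp can identify out-of-date replicas is sketched below. The comparison rule is an assumption made for illustration (a replica is treated as current only when its stamp is not older than the stamp recorded for the block); the function name is hypothetical, and the actual rules are those described in the Block and Replica Management section.

```python
def stale_replicas(block_stamp, replica_stamps):
    """Return the DataNodes whose replica is older than the block's stamp.

    block_stamp:    stamp recorded for the block in the metadata
    replica_stamps: DataNode name -> stamp of the replica it holds
    """
    return sorted(dn for dn, stamp in replica_stamps.items()
                  if stamp < block_stamp)
```

For instance, if a DataNode was offline while a block was updated, its replica keeps the older stamp and is reported as stale by this check.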