The Nutch Search Engine's Distributed File System (NDFS)

1. Introduction
  NDFS stores huge, stream-oriented files across a set of machines, with multi-machine storage redundancy and load balancing. Files are stored block by block on separate machines in NDFS, and a traditional input/output stream interface is provided for reading and writing. Details such as locating blocks and moving data over the network are handled automatically by NDFS and are transparent to the user. NDFS also manages the set of storage machines gracefully: machines can be added and removed easily, and when a machine becomes unavailable, NDFS automatically preserves file availability. As long as the machines on the network provide enough storage space, NDFS keeps operating normally. NDFS is built on ordinary disks; it needs no RAID controller or other disk-array solution.
2. Semantics
  1) A file can be written only once; after being written it becomes read-only (but it can still be deleted).
  2) Files are stream-oriented: bytes can only be appended at the end of a file, and the read and write pointers can only move forward.
  3) Files have no storage access control.
  Consequently, all access to NDFS goes through trusted client code; no API is provided for other programs to use. In effect, Nutch itself is the sole user of NDFS.
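These semantics can be modeled with a tiny in-memory sketch. The `WriteOnceFile` class below is purely illustrative, not part of the Nutch API: it enforces append-only writes while the file is open and immutability once writing finishes.

```java
import java.io.ByteArrayOutputStream;

// Illustrative model of NDFS file semantics: append-only while being
// written, read-only once the writer finishes, but still deletable.
// Not part of the actual Nutch API.
class WriteOnceFile {
    private final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    private boolean closed = false;

    // Bytes can only be added at the end of the file.
    void append(byte[] data) {
        if (closed) throw new IllegalStateException("file is read-only once written");
        bytes.write(data, 0, data.length);
    }

    // After writing completes, the file becomes read-only.
    void closeWriter() { closed = true; }

    byte[] read() { return bytes.toByteArray(); }
}
```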
3. System design
  NDFS contains two kinds of machines: NameNodes and DataNodes. NameNodes maintain the namespace, while DataNodes store data blocks. An NDFS installation contains exactly one NameNode and any number of DataNodes; every DataNode is configured to communicate with that single NameNode.
  1) NameNode: stores the entire namespace and the layout of the file system. It is a single point of failure and must not go down, but it does relatively little work, so it is not a load bottleneck. It maintains a table, kept on disk: filename-0 -> BlockID_A, BlockID_B ... BlockID_X, etc. A filename is just a string; a BlockID is a unique identifier. Each filename can map to any number of blocks.
  2) DataNode: stores the data. A block should be replicated on several DataNodes, while a given DataNode holds at most one replica of any block. It maintains a table: BlockID_X -> array of bytes.
  3) Cooperation: after starting up, a DataNode contacts the NameNode and reports which blocks it holds locally. From these reports the NameNode builds a tree describing how to locate every block in NDFS; the tree is updated in real time. Each DataNode also sends periodic messages to the NameNode to prove it is still alive; when the NameNode stops receiving them, it considers that DataNode down.
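The two tables, plus the block-location map rebuilt from DataNode reports, can be sketched as plain Java maps. The class and field names here are illustrative; the real state lives in FSNamesystem.java and FSDataset.java and is considerably more elaborate.

```java
import java.util.*;

// Illustrative sketch of the NDFS bookkeeping tables. Not the real
// Nutch data structures, just their shape.
class NdfsTables {
    // NameNode table: filename -> ordered list of block IDs.
    final Map<String, List<String>> nameToBlocks = new HashMap<>();

    // For each block, the set of DataNodes currently holding a replica.
    // Rebuilt from the block reports DataNodes send at startup.
    final Map<String, Set<String>> blockToDataNodes = new HashMap<>();

    // DataNode table: block ID -> the block's bytes.
    final Map<String, byte[]> dataNodeBlocks = new HashMap<>();

    // A DataNode's startup report: tell the NameNode one block it holds.
    void reportBlock(String dataNode, String blockId) {
        blockToDataNodes.computeIfAbsent(blockId, k -> new HashSet<>())
                        .add(dataNode);
    }
}
```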
  4) Reading a file: suppose a client wants to read foo.txt. The steps are:
    a. The client contacts the NameNode over the network and submits the filename "foo.txt".
    b. The client receives a reply from the NameNode listing the blocks that make up "foo.txt" and, for each block, the DataNodes that hold it.
    c. The client reads the blocks in order: for each block it picks a suitable DataNode from the block's list, sends it a request, and the DataNode streams the data back to the client.
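The three-step read can be simulated end to end with in-memory maps standing in for the NameNode, the block locations, and the DataNode stores. Everything below is a stand-in for illustration, not the real Nutch networking code; "pick a suitable DataNode" is simplified to "pick the first one".

```java
import java.util.*;

// Simulated NDFS read: ask the "NameNode" for the file's blocks, then
// fetch each block from one of its "DataNodes". Illustrative only.
class ReadSketch {
    static Map<String, List<String>> nameNode = new HashMap<>();  // file  -> blocks
    static Map<String, List<String>> locations = new HashMap<>(); // block -> DataNodes
    static Map<String, String> dataNodeStore = new HashMap<>();   // "node/block" -> data

    static String read(String filename) {
        StringBuilder out = new StringBuilder();
        // Steps a + b: contact the NameNode, receive the block list
        // and each block's DataNode list.
        for (String block : nameNode.get(filename)) {
            // Step c: pick a suitable DataNode (here simply the first),
            // request the block, and append the returned bytes.
            String node = locations.get(block).get(0);
            out.append(dataNodeStore.get(node + "/" + block));
        }
        return out.toString();
    }
}
```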
4. Availability
  NDFS availability depends on block replication, i.e. on how many DataNodes keep a copy of each block. When resources allow, a good setting is 3 desired replicas with a floor of 2 (the DESIRED_REPLICATION and MIN_REPLICATION constants in fs.FSNamesystem). When a block's replica count falls below MIN_REPLICATION, the NameNode directs DataNodes to make new copies.
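The re-replication decision can be sketched as a simple threshold check. The constant names mirror DESIRED_REPLICATION and MIN_REPLICATION in fs.FSNamesystem, but the method and its logic are a simplification for illustration, not the actual NameNode code.

```java
import java.util.*;

// Sketch of the NameNode's re-replication check. Constant names echo
// fs.FSNamesystem; the logic is a simplified illustration.
class ReplicationCheck {
    static final int DESIRED_REPLICATION = 3;
    static final int MIN_REPLICATION = 2;

    // How many extra copies of a block should be scheduled, given the
    // DataNodes currently holding a replica of it.
    static int replicasNeeded(Set<String> holders) {
        if (holders.size() < MIN_REPLICATION) {
            // Below the floor: bring the block back up to the desired count.
            return DESIRED_REPLICATION - holders.size();
        }
        return 0; // At or above the minimum: no urgent action.
    }
}
```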
5. Files in the net.nutch.fs package
  1) NDFS.java: contains two main functions, one for the NameNode and one for the DataNode.
  2) FSNamesystem.java: maintains the namespace and implements the NameNode's functionality, e.g. locating blocks and tracking the available DataNodes.
  3) FSDirectory.java: called by FSNamesystem to maintain the namespace state. It logs every NameNode state change, so that the NameNode can recover from this log after a crash.
  4) FSDataset.java: used by a DataNode to maintain its set of blocks.
  5) Block.java and DatanodeInfo: hold block information.
  6) FSResults.java and FSParam.java: used to pass parameters over the network.
  7) FSConstants.java: constants, used for parameter tuning.
  8) NDFSClient.java: reads and writes data.
  9) TestClient.java: contains a main function providing commands for accessing NDFS.
6. A simple example
  1) Start the NameNode:
    Machine A: java net.nutch.fs.NDFS$NameNode 9000 namedir
  2) Start the DataNodes:
    Machine B: java net.nutch.fs.NDFS$DataNode datadir1 machineB 8000 machineA:9000
    Machine C: java net.nutch.fs.NDFS$DataNode datadir2 machineC 8000 machineA:9000
  After steps 1 and 2 you have a running NDFS with one NameNode and two DataNodes. (The nodes can also be installed in different directories on a single machine.)
  3) Accessing files from a client:
    Create a file: java net.nutch.fs.TestClient machineA:9000 CREATE foo.txt
    Read it: java net.nutch.fs.TestClient machineA:9000 GET foo.txt
    Rename it: java net.nutch.fs.TestClient machineA:9000 RENAME foo.txt bar.txt
    Read it again: java net.nutch.fs.TestClient machineA:9000 GET bar.txt
    Delete it: java net.nutch.fs.TestClient machineA:9000 DELETE bar.txt
= IPC =
IPC, short for InterProcess Communication, is a fast and easy RPC mechanism. Unlike Sun's standard RPC package, it does not use standard Java serialization; instead it requires the author of every class to write the relevant serialization methods. That extra work might seem like a drawback, but if you have ever tried to debug Sun's class versioning system, you will find the extra work is in fact welcome.
IPC does not require a special compiler of any kind to create network stubs and skeletons. Rather, it uses introspection to examine a declared "publicly-available interface" and determine how to marshall/unmarshall arguments.
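The stub-free approach can be illustrated with Java's own dynamic proxies, which rely on the same introspection idea: intercept a call on a declared interface and marshall it by method name and arguments. This toy "marshals" the call into a string instead of onto a socket, and the `Greeter` interface is invented for the example; it is not how Hadoop's IPC is actually wired up.

```java
import java.lang.reflect.Proxy;
import java.util.*;

// Toy illustration of stub-free RPC via introspection: a dynamic proxy
// intercepts calls on a declared interface and "marshals" them by name
// and arguments. A real IPC layer would write them onto a socket.
class ProxyDemo {
    interface Greeter { String greet(String name); }

    static Greeter client(final List<String> wire) {
        return (Greeter) Proxy.newProxyInstance(
            Greeter.class.getClassLoader(),
            new Class<?>[] { Greeter.class },
            (proxy, method, args) -> {
                // Marshall: record the introspected method name + argument.
                wire.add(method.getName() + "(" + args[0] + ")");
                // "Server side": dispatch and return a result.
                return "hello, " + args[0];
            });
    }
}
```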
IPC is used as the internal procedure call mechanism for all of Hadoop and Nutch.
== Use Model ==
IPC is a client/server system. A server process offers service to others by opening a socket and exposing one or more Java interfaces that remote callers can invoke. User server code must indicate the port number and an instance of an object that will receive remote calls. (see RPC.getServer())
A client contacts a server at a specified host and port, and invokes methods exposed by the server. User client code must indicate the target hostname and port, and also the name of the Java interface that the client would like to invoke. While a single IPC server object can expose several interfaces simultaneously, a client can invoke only one of them at a time. (see RPC.getClient())
There is no way for an IPC server to invoke methods of the client. There are places in Hadoop where bidirectional communication is helpful (e.g., in DFS, where the Name and Data nodes must report status to each other). In these cases, one side acts as a client, making the same call over and over again. The server always returns a special "status" object, which the client may then interpret as a request to perform work.
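The polling workaround described above can be sketched as follows. The names (`heartbeat`, `Status`, `workToDo`) are illustrative, not the actual Hadoop classes: the client makes the same call over and over, and interprets the returned status object as a request to perform work.

```java
// Sketch of the one-directional workaround: the server cannot call the
// client, so the client polls and treats the returned "status" object as
// a work request. All names here are illustrative.
class PollingSketch {
    // The status object the server hands back on every call.
    static class Status {
        final String workToDo;                 // null means "nothing to do"
        Status(String workToDo) { this.workToDo = workToDo; }
    }

    interface Server { Status heartbeat(String clientId); }

    // One round of the client's polling loop.
    static String poll(Server server, String clientId) {
        Status s = server.heartbeat(clientId);   // client -> server call
        if (s.workToDo != null) {
            return "performing: " + s.workToDo;  // client acts on the reply
        }
        return "idle";
    }
}
```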
== Under the covers ==
The IPC mechanism automatically inspects the client's requested interface, plus the server's exposed interfaces, and figures out how to marshall/unmarshall arguments for the remote call. This system works fine as long as every method argument is one of Java's built-in types, a String, or an implementation of the Writable interface (or an array of one of those types).
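The hand-written serialization that Writable asks for amounts to a matched write/readFields pair per class. The `IntPair` class below mimics that contract with plain `java.io` streams so it runs without the Hadoop/Nutch jars; it does not implement the actual Writable interface.

```java
import java.io.*;

// Sketch of Writable-style serialization: each class writes and reads
// its own fields in a fixed order. Mimics the write/readFields contract
// without depending on the Hadoop/Nutch jars.
class IntPair {
    int first, second;

    // Serialize the fields, in order, onto the stream.
    void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Deserialize the fields in the same order they were written.
    void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }
}
```

The fixed field order is the whole contract: because the author controls both sides, there is no class-versioning metadata on the wire, which is exactly the trade-off described at the top of this page.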