Cluster file system MooseFS: installation, configuration and disaster recovery


http://contrib.meharwal.com/home/moosefs


Large Scale Data Storage with MooseFS (MFS)

Last Updated: June/21/2011  (Comments to : contrib+moosefs@meharwal.com)
MooseFS is a fault-tolerant, large-scale network distributed file system available for Unix/Linux-compatible environments. It is a horizontally scalable file system that is easy to set up using commodity hardware. It is POSIX compliant and mounts the file system through the FUSE driver, which is commonly supported by various Unix/Linux distributions. It builds on top of any of your favorite native file systems (ext3, ext4, xfs, zfs, ...). It fits right into your UNIX/Linux networked environment: the file system can be mounted with the mount command, the system fstab file or the ubiquitous automounter. Some of the exciting fault-tolerant features of MooseFS include N copies of data replication, a trash bin (recovery of deleted files) and snapshots, all configurable at a very granular level of the file system. Visit the MooseFS web site http://www.moosefs.org for a detailed feature set and architecture.

  Large-scale network-based distributed file systems are an evolving field. These file systems have their own strengths and weaknesses; some of the well-known distributed file systems are listed here. A network file system is a core component of large-scale application frameworks, and setup simplicity is one of the important factors for any storage administrator when choosing a system. In our experience, MooseFS shines in terms of performance, massive scalability using commodity hardware, POSIX compliance, a useful feature list, quick recovery and simplicity of setup.

Contents

  1. MooseFS Architecture
  2. MFS Cluster Setup
    2.1 CGI status viewer Setup
    2.2 Master Server Setup
    2.3 Metalogger Server Setup
    2.4 Chunk Server Setup
    2.5 Mounting Filesystem on Client Hosts
  3. MFS Cluster Maintenance and Operation
    3.1 Goals (Replication setup)
    3.2 Trash bin and data quarantine time setup
    3.3 Metadata backup, recovery and redundancy
    3.4 Few things to know
    3.5 Other Goodies
  4. MFS Metadata Maintenance (Disaster Recovery)
    4.1 Few Basics about metadata
    4.2 Recovery from crashed master server
    4.3 Moving master metadata server in a planned manner
    4.4 Recovery from Metalogger server data
    4.5 Additional copies of metadata
    4.6 Speeding up chunk replication and re-balancing
  5. Wrapping up
  6. Conclusion


MooseFS Architecture

MooseFS consists of Master, Metalogger, Chunk and Client hosts. If desired, these components can co-exist on a single server.

The Master Server is responsible for the file system metadata. All other MFS components communicate with the master server on specified network ports. The current version (1.6.20) of MooseFS does not provide active/active redundancy for the master server, so it is a single point of failure. However, moving the master to a backup metalogger server (manually) is a quick and easy process. The master server keeps a copy of all current metadata in memory (RAM), which makes metadata access very fast. It is important to set up the master server on a computer with disk redundancy (such as RAID-1 or higher) and ECC memory.

Metalogger Server(s) are optional but absolutely recommended for disaster recovery. Metaloggers passively replicate file system metadata from the master server in real time. It is recommended to set up at least two metalogger servers in addition to the master server. Data from metalogger servers can be copied around to create a new master server in case the original master server is lost.

Chunk Servers are the bulk data storage units of an MFS cluster. Chunk servers can be added to or removed from the MFS cluster dynamically, at any time. An MFS cluster can grow horizontally into a very large scale file system just by adding chunk servers later. If enough replicas (MFS goals per file/folder) are set up, then a chunk server can be removed, rebooted or shut down without any disruption to file system operation.

Client computers use the FUSE driver to mount an MFS volume. The volume can also be auto-mounted via autofs or via system-level fstab files.

The CGI status server provides a web-based status interface. Although optional, it is absolutely recommended for viewing the health of MFS clusters. Either use the MooseFS-supplied CGI server or install the cgi package on any server running a web server such as Apache. The CGI program communicates with the master server(s) at a given network port to get status about the MFS cluster(s). No additional package is needed. A single CGI status server is enough and can communicate with all MFS clusters by providing different web URL arguments.

Figure 1: MFS Cluster



MFS Cluster Setup

The following sections describe the MooseFS file system cluster setup for CentOS, Scientific Linux (SL) or any RHEL-based OS using commodity hardware. MooseFS components can run on several Linux/Unix flavors, and a similar setup can be applied to those platforms as well.

Design considerations:
The following design considerations are suggested for better MFS management per our experience. They are not mandatory and individual preferences may vary. It is advisable to run MFS as a non-root user; typically, create a dedicated user mfs and group mfs on all MFS servers. Set up the MFS master server listening on a separate VIP (Virtual IP address) and only listen on that VIP using the default ports, i.e. do not use the server's primary IP address for the MFS master server. This is useful when setting up several MFS clusters within the same network and using a single piece of hardware for all MFS master servers with default ports. Also, when moving the MFS master server from one physical server to another, moving the VIP helps a lot: metalogger servers, chunk servers and client hosts don't have to re-bind to a different IP address of the master server and wait for DNS or caches (such as nscd) to time out. This makes migration of the master server (either planned or emergency) less disruptive.
Client hosts can use an automounter (autofs) setup to mount the MFS file system for easy scalability and setup. Restrict root mounts [MFS_ROOT] with full root privileges to only the MFS master (or any designated server), for MFS cluster maintenance and better security.
 
The following examples assume an MFS cluster named foo is to be set up, with the file system mount point /net/mfs/foo (this can be any path where client hosts can mount a network file system). The master server VIP name is mfsmaster-foo.example.com (with IP address 192.168.1.10).

Source code or Pre-compiled binaries:
In the following examples RPM installs are used; these are downloaded from the http://packages.sw.be/mfs/ site. For source code compile and install, refer to the documentation on the MooseFS site.

CGI status viewer Setup

It is advisable to set up the CGI status server first. Later, when MFS cluster components (master, metalogger and chunk servers) are added, they can easily be verified from the web GUI. MooseFS also ships with a basic CGI httpd server that can be used. However, the following steps describe a setup using the Apache web server. First install the MFS cgi RPM, for example rpm -ivh mfs-cgi-1.6.20-1.el6.rf.x86_64.rpm, then install an /etc/httpd/conf.d/mfs.conf file as below:
Alias /mfs/ "/var/www/html/mfs/"
ScriptAlias /mfscgi/ "/var/www/cgi-bin/mfs/"
<Directory "/var/www/cgi-bin/mfs/">
    AllowOverride None
    Options None
    Order allow,deny
    Allow from all
</Directory>
Reload Apache. The MFS cluster can then be accessed via the URL:
 http://[APACHE-SERVER-HOSTNAME]/mfscgi/mfs.cgi?masterhost=mfsmaster-foo.example.com
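As a quick sanity check (a hedged sketch, assuming the CGI viewer runs on this Apache host and uses the example master VIP above), the status page can also be fetched from the command line:

[root]# curl -s "http://localhost/mfscgi/mfs.cgi?masterhost=mfsmaster-foo.example.com" | head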



Master Server Setup

Install the MFS RPM, for example rpm -ivh mfs-1.6.20-1.el6.rf.x86_64.rpm. Create a separate VIP (Virtual IP) interface for mfsmaster-foo.example.com; a Linux command such as 'ifconfig eth0:foo 192.168.1.10 netmask 255.255.255.0 up' or the file /etc/sysconfig/network-scripts/ifcfg-eth0:foo can be used to create a VIP interface like eth0:foo (a sketch of this file follows the next paragraph). Then edit the configuration files in the /etc/mfs/ area.
 If you are planning to run multiple MFS master servers on a single host for more than one MFS cluster (e.g. foo1, foo2, ...), create separate VIPs (e.g. eth0:foo1, eth0:foo2, ...), corresponding configuration files under separate sub-directories (e.g. /etc/mfs/{foo1,foo2}), distinct master metadata directories (e.g. /var/mfs/{foo1,foo2,...}/) and separate boot scripts for each MFS cluster's master server (e.g. /etc/init.d/{mfsmaster-foo1, mfsmaster-foo2,...}).
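A minimal sketch of a persistent VIP definition (assuming the eth0 device and the 192.168.1.10 address used in this example; adjust DEVICE, IPADDR and NETMASK for your network):

#-- /etc/sysconfig/network-scripts/ifcfg-eth0:foo
DEVICE=eth0:foo
IPADDR=192.168.1.10
NETMASK=255.255.255.0
ONBOOT=yes

Bring the interface up with 'ifup eth0:foo' and verify it with 'ip addr show eth0'.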

mfsmaster.cfg : Master's primary configuration file.
#-----------------------------------------------------
# MFS master server for "foo" cluster
#-----------------------------------------------------
WORKING_USER = mfs
WORKING_GROUP = mfs
SYSLOG_IDENT = mfsmaster-foo
# LOCK_MEMORY = 0
# NICE_LEVEL = -1

# File path for exports definition for this cluster.
EXPORTS_FILENAME = /etc/mfs/mfsexports.cfg

# Meta data for this master stored here.
DATA_PATH = /var/mfs/        #-- Use distinct directory, if running multiple master servers on this host 

# BACK_LOGS = 50
# REPLICATIONS_DELAY_INIT = 300
# REPLICATIONS_DELAY_DISCONNECT = 3600

#-- For MetaLogger server connections
# MATOML_LISTEN_HOST = *     #-- Change this default from all interfaces to just VIP interface.
MATOML_LISTEN_HOST = mfsmaster-foo.example.com
# MATOML_LISTEN_PORT = 9419

#-- For ChunkServer connection
# MATOCS_LISTEN_HOST = *     #-- Change this default from all interfaces to just VIP interface 
MATOCS_LISTEN_HOST = mfsmaster-foo.example.com
# MATOCS_LISTEN_PORT = 9420

#-- For Client connection
# MATOCU_LISTEN_HOST = *    #-- Change this default from all interfaces to just VIP interface.   
MATOCU_LISTEN_HOST = mfsmaster-foo.example.com
# MATOCU_LISTEN_PORT = 9421

# CHUNKS_LOOP_TIME = 300
# CHUNKS_DEL_LIMIT = 100
# CHUNKS_WRITE_REP_LIMIT = 1
# CHUNKS_READ_REP_LIMIT = 5
# REJECT_OLD_CLIENTS = 0

# deprecated, to be removed in MooseFS 1.7
# LOCK_FILE = /var/run/mfs/mfsmaster.lock
 


mfsexports.cfg: This file describes the share exports definition for the cluster. Clients' mount requests are controlled via this file. Refer to the example file supplied in the /etc/mfs directory. The exports file can be set up to restrict mounts based on IP addresses, to limit root access and to password-protect exported shares.
#-----------------------------------------------------------------------------
#-- [MFS_ROOT] : root location / of MFS and all paths are relative after that.
#-----------------------------------------------------------------------------
#-- Only master can see full  [MFS_ROOT] and with full root priv. access.
#-- Mount on a separate path like:
#--    mfsmount /mfsfoo -H mfsmaster-foo.example.com
192.168.1.10             /       rw,alldirs,ignoregid,maproot=0
# Allow "-o mfsmeta". from master
192.168.1.10             .       rw

#-- [MFS_ROOT]/   Allow RW access from all, but root priv disabled.
192.168.1.0/24          /   rw,alldirs,ignoregid,maproot=nobody
 
Before starting the master server for the first time, seed an empty metadata file: cd /var/mfs; cp metadata.mfs.empty metadata.mfs. Apply proper permissions: chown -Rh mfs:mfs /var/mfs. Test starting/stopping the master server by hand using the mfsmaster command: mfsmaster -c /etc/mfs/mfsmaster.cfg start/stop. Activate the boot script /etc/init.d/mfsmaster to start/stop the master server at boot/shutdown time. The master server is now ready to accept connections; view its status using the CGI viewer. A consolidated command sketch follows below.
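A consolidated first-start sketch (assuming the default /var/mfs DATA_PATH and the packaged /etc/init.d/mfsmaster boot script mentioned above):

[root]# cd /var/mfs && cp metadata.mfs.empty metadata.mfs     #-- seed empty metadata
[root]# chown -Rh mfs:mfs /var/mfs                            #-- MFS runs as user/group mfs
[root]# mfsmaster -c /etc/mfs/mfsmaster.cfg start             #-- manual test start
[root]# mfsmaster -c /etc/mfs/mfsmaster.cfg stop              #-- manual test stop (graceful)
[root]# chkconfig mfsmaster on && service mfsmaster start     #-- enable and start via the boot script (RHEL/CentOS)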

Metalogger Server Setup

Although optional, it is absolutely recommended to set up at least two additional metalogger servers. Install the MFS RPM: rpm -ivh mfs-1.6.20-1.el6.rf.x86_64.rpm. Edit the configuration files in the /etc/mfs/ area. If you are planning to run metalogger servers for multiple MFS clusters on a single host, then create separate configuration files, distinct metalogger data directories and corresponding boot scripts for each MFS cluster's metalogger server.

mfsmetalogger.cfg: Metalogger's  configuration file.
#-----------------------------------------------------
# MFS metalogger(backup) server for "foo" cluster
#-----------------------------------------------------
WORKING_USER = mfs
WORKING_GROUP = mfs
SYSLOG_IDENT = mfsmetalogger-foo
# LOCK_MEMORY = 0
# NICE_LEVEL = -19

# Data directory for metalogger data
DATA_PATH = /var/mfs      # Use distinct directories, if running several metaloggers on this host.

# BACK_LOGS = 50
# META_DOWNLOAD_FREQ = 24

# MASTER_RECONNECTION_DELAY = 5

MASTER_HOST = mfsmaster-foo.example.com
# MASTER_PORT = 9419

# MASTER_TIMEOUT = 60

# deprecated, to be removed in MooseFS 1.7
# LOCK_FILE = /var/run/mfs/mfsmetalogger.lock 

Apply proper permissions: chown -Rh mfs:mfs /var/mfs. Test starting/stopping the metalogger server by hand using the command mfsmetalogger -c /etc/mfs/mfsmetalogger.cfg start/stop. Activate the boot script /etc/init.d/mfsmetalogger to start/stop the metalogger server at boot/shutdown time. The metalogger server is now ready and should be fetching data from the master periodically; a quick check is sketched below.
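To verify that the metalogger is actually syncing (a hedged sketch, assuming the default /var/mfs DATA_PATH and the SYSLOG_IDENT from the configuration above):

[root]# ls -l /var/mfs/changelog_ml.*.mfs                     #-- changelogs replicated from the master
[root]# grep mfsmetalogger-foo /var/log/messages | tail       #-- connection/download messages via syslog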


Chunk Server Setup

The bulk of the storage data resides on chunk servers. Chunk servers are highly redundant in nature, and if enough replicas (goals) are set up for the data, a chunk server can be removed or rebooted without any loss of data. Chunk servers can be added to the MFS cluster dynamically at any time in order to expand storage capacity.
On each chunk server, install the MFS RPM, for example rpm -ivh mfs-1.6.20-1.el6.rf.x86_64.rpm, and edit the configuration files in the /etc/mfs/ area. It is recommended to use a single chunk server for only one MFS cluster at a time.

mfschunkserver.cfg  : Chunk server  configuration file.
#-----------------------------------------------------
# MFS chunk server for "foo" cluster
#-----------------------------------------------------
WORKING_USER = mfs
WORKING_GROUP = mfs
SYSLOG_IDENT = mfschunkserver-foo
# LOCK_MEMORY = 0
# NICE_LEVEL = -19

# DATA_PATH = /var/mfs

# MASTER_RECONNECTION_DELAY = 5

# BIND_HOST = *
MASTER_HOST = mfsmaster-foo.example.com
# MASTER_PORT = 9420

# MASTER_TIMEOUT = 60

# CSSERV_LISTEN_HOST = *
# CSSERV_LISTEN_PORT = 9422

# HDD_CONF_FILENAME = /etc/mfs/mfshdd.cfg
# HDD_TEST_FREQ = 10

# deprecated, to be removed in MooseFS 1.7
# LOCK_FILE = /var/run/mfs/mfschunkserver.lock
# BACK_LOGS = 50
# CSSERV_TIMEOUT = 5
 
Data chunks are stored in directories created on a native file system (ext3, ext4, zfs, ...) on the chunk server. Chunk directories are added to the mfshdd.cfg file; multiple directories (chunk areas) can be specified. Normally there is no need to use a RAID setup for chunk storage if goals > 1 are set for all MooseFS data.
Data redundancy is handled at the MooseFS level: data chunks are replicated on separate chunk servers to survive a complete hardware failure of a single chunk server. For critical data it is recommended to set goals >= 3 or to set up appropriate RAID on the chunk servers to achieve a higher level of redundancy.

mfshdd.cfg  :  Chunk storage data directories
#-- mount points of HDD drives.
/data/sda3/mfschunk
/data/sdb4/mfschunk

Apply proper permissions: chown -Rh mfs:mfs /var/mfs. Test starting/stopping the chunk server by hand using the command mfschunkserver -c /etc/mfs/mfschunkserver.cfg start/stop. Activate the boot script /etc/init.d/mfschunkserver to start/stop the chunk server at boot/shutdown time. The chunk server is now ready and should be communicating with the master server; verify via the CGI status viewer. A preparation sketch for the chunk directories is shown below.
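Before the first start, the chunk directories listed in mfshdd.cfg must exist and be writable by the mfs user. A minimal preparation sketch, assuming the example paths from mfshdd.cfg above and that the underlying file systems are already mounted:

[root]# mkdir -p /data/sda3/mfschunk /data/sdb4/mfschunk
[root]# chown -Rh mfs:mfs /data/sda3/mfschunk /data/sdb4/mfschunk /var/mfs
[root]# mfschunkserver -c /etc/mfs/mfschunkserver.cfg start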
It is noted that, at this time, there is no built-in mechanism to prevent unauthorized chunk servers from joining an MFS cluster via the master server. In theory, any non-root user on the network can set up a chunk server, point it to a master server and join an existing MFS cluster. Although this makes the whole process very convenient when adding a new chunk server, it presents a little operational risk. Hopefully a future release of MooseFS will add some authentication mechanism between chunk and master servers to avoid security risks and operational accidents, especially when running multiple MFS clusters in the same network segment.


Mounting Filesystem on Client Hosts:

Install the mfs-client RPM, for example rpm -ivh mfs-client-1.6.20-1.el6.rf.x86_64.rpm. Also install fuse and fuse-libs; for RHEL/CentOS type OSes, yum install fuse fuse-libs should install those. Clients can now mount a share of MFS cluster foo by following one of the methods below; a quick verification sketch is given at the end.

Mount file system to /net/mfs/foo locally, using mfsmount command.
[root]# mfsmount /net/mfs/foo -H mfsmaster-foo.example.com -o mfssubfolder=/
 
Mount the file system using the /etc/fstab file. Add an entry like the one below and then run the command mount /net/mfs/foo.
Note: During the boot process, the MFS cluster may come up later than the file system mount attempt from /etc/fstab. One workaround is to put a line like 'mount -a' in the /etc/rc.local file, which is executed as the last step of the boot process. This ensures the MFS share is mounted across reboots.
mfsmount /net/mfs/foo fuse mfsmaster=mfsmaster-foo.example.com,mfssubfolder=/ 0 0

Mount the file system using the automounter (autofs): cd /net/mfs/foo should auto-mount the MFS file share.
File /etc/auto.master:  /net/mfs /etc/auto.mfs --timeout=120
File /etc/auto.mfs:     foo -fstype=fuse,mfsmaster=mfsmaster-foo.example.com,mfssubfolder=/ \:mfsmount
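Whichever method is used, a quick check (a hedged sketch, assuming the mount point above) confirms the share is live:

[root]# df -hT /net/mfs/foo                #-- should show a FUSE-based file system with the MFS capacity
[root]# mount | grep /net/mfs/foo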





MFS Cluster Maintenance and Operation

Once the MFS cluster is up and running, it requires very little maintenance effort. The following sections describe a few operational tips.

Goals (Replication setup)

Folder and file replication (goals) can be set up at a very granular level. Use the commands mfsgetgoal and mfssetgoal to inspect and set goals. MFS ensures that replicated chunks are kept on different physical servers, so folders set up with N goals have N-1 redundancy at the chunk server level. By default, goal settings are inherited from the parent folder, but individual files and folders can be set up with different goals. The MFS cluster maintains the goal level on data chunks at write time, and it also re-balances from time to time to ensure N-1 level redundancy at the chunk server level. If a chunk server disappears, the MFS cluster starts replicating chunks to attain the goal level set for the data. The CGI web interface displays the chunk status matrix with any under-goal (orange) chunks. Data chunks with zero goal (red) are a fatal condition and indicate that some chunks are not available at the moment; the file system will throw an I/O error when data hitting those missing chunks is requested. This can happen if more chunk servers than desired become offline. Normally, bringing those chunk servers back online will return the file system to a healthy condition. It is highly recommended to keep an eye on the CGI web interface while doing chunk server maintenance such as rebooting, replacing or retiring chunk servers. A per-file verification sketch follows the goals example below.

[root]# mfsgetgoal /net/mfs/foo
/net/mfs/foo/: 1

[root]# mfssetgoal -r 2 /net/mfs/foo                <== Setting goals=2 recursively on existing files and folders.
2:
/net/mfs/foo:
 files with goal        2 :               10
 directories with goal  2 :              3
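To confirm that individual files actually carry the expected number of chunk copies, MooseFS also ships mfscheckfile and mfsfileinfo. A brief sketch (the file name is hypothetical):

[root]# mfscheckfile /net/mfs/foo/somefile.dat     #-- summary of how many copies each chunk has
[root]# mfsfileinfo /net/mfs/foo/somefile.dat      #-- per-chunk list of the chunk servers holding each copy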


Trash bin and data quarantine time setup

Accidents do happen. A powerful /bin/rm -rf * command is a friend and a foe. Last night's backup can give some relief, but may still cost a day's worth of effort. MooseFS enables a trash bin setup for all folders. By default, all file data deleted from the MFS cluster is kept for 1 day. The trash time can be set on individual folders or files, like the goals setup, using the commands mfsgettrashtime and mfssettrashtime. When a file is deleted, the metadata related to the file is kept in a special .(META) area and the related data chunks are left on the chunk servers for the specified period of time. The administrator has to mount a special share with the option -o mfsmeta to access the trash metadata. To perform an undelete operation, move the related metadata file to the trash/undel area and the MFS cluster will bring the related data back online.

To recover deleted data within the specified quarantine time, log in to an authorized server.

[root]# mfsmount /mnt/mfsmeta-foo -o mfsmeta -H mfsmaster-foo.example.com       <== will mount .(META) area on /mnt/mfsmeta-foo
[root]# cd /mnt/mfsmeta-foo/trash
 
Find the file path to be un-deleted and move that entry to the trash/undel area, as sketched below.
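A hedged sketch of the undelete step (the file name is hypothetical; trash entries encode the original path in their names, with path separators replaced):

[root]# find /mnt/mfsmeta-foo/trash -maxdepth 1 -name '*lostfile*'
[root]# mv /mnt/mfsmeta-foo/trash/'00000123|some|path|lostfile.txt' /mnt/mfsmeta-foo/trash/undel/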


Metadata backup, recovery and redundancy

MFS metadata redundancy is a sweet-and-sour experience for operations folks. MFS metadata storage files are simple and stored in a single directory. Metadata can be moved around manually, and standing up a new master server is very easy and takes only a few minutes. Built-in automatic failover redundancy is a highly desired feature for an enterprise file system cluster setup. Currently, third-party solutions such as UCARP or Linux Heartbeat with a DRBD setup may fill this gap. These solutions may be overkill and prone to false positives for a MooseFS setup, compared to how quick and easy it is to manually recover the master metadata server. However, for 24x7 operations, automated failover redundancy is a hard feature to do without. The good news is that the MooseFS developers have indicated that it is on the radar for future releases.


Few things to know

   Currently (MooseFS 1.6.20), global POSIX file locking is not supported the way NFS supports it at the server level. However, MFS via FUSE supports kernel-level POSIX file locking within the same OS kernel. It looks like the author is planning to support global file locking in future releases.

Also, it seems the O_DIRECT open flag is not supported either; this is more of a FUSE driver issue. A dd command like the one below would fail with an error (an strace of such a process shows the open call with O_DIRECT failing):

[root] # dd if=/dev/zero of=testfile bs=1024 count=10 oflag=direct
dd: opening `testfile': Invalid argument

An strace of the above command further showed:
  open("test1", O_WRONLY|O_CREAT|O_TRUNC|O_DIRECT, 0666) = -1 EINVAL (Invalid argument)
If you are writing your own application then these are not major factors, as workarounds can be used. However, any third-party application already using these calls may exhibit strange behavior when using FUSE-based mounts such as those used by MooseFS and several other file systems.

MFS uses a 64KB block size and a 64MB (max) chunk file size on disk. These are hard-coded limits and seem to work very well in general. If you are planning to use the MFS cluster for lots of small files, such as for code development or version control (Subversion, CVS repositories), then you will notice a spike in the metadata size on the master server. Having additional RAM to support larger metadata on the master server will surely help in this case.


Other Goodies

   Most of the MooseFS commands start with mfs.... MooseFS has a very nice set of commands such as mfsdirinfo and mfsfileinfo to get instant file system metadata. Given that a copy of the metadata is cached in RAM on the master metadata server, even a command like du -hs is extremely fast compared to several native file systems. Removal of a chunk server can be performed simply by shutting down the mfschunkserver process, and that is immediately reflected in the CGI web GUI. Recovery of MooseFS is generally very quick in case of a network blip or a master server crash/reboot cycle. Clients simply retry, and as soon as all necessary MooseFS components are online, the file system responds right away, which is very impressive.

   Retiring a chunk server, or a chunk area from a particular chunk server, is a very easy process. To remove a chunk storage area, simply mark the file system area with an * (asterisk) in front of the directory path in mfshdd.cfg (e.g. */data/sda3/mfschunk to retire the /data/sda3/mfschunk storage area) and restart mfschunkserver. The MooseFS cluster then replicates the desired chunks to other areas and prepares for the storage area's removal, while keeping the desired goal level for files and folders. In case of a storage chunk area loss due to a hard disk failure or similar, simply prepend a # (hash) in front of the chunk storage area and start mfschunkserver. An example mfshdd.cfg follows below.
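For illustration, an mfshdd.cfg marked this way might look as follows (a sketch based on the example directories above):

#-- mount points of HDD drives.
#-- '*' marks /data/sda3/mfschunk for retirement: its chunks are replicated away first
*/data/sda3/mfschunk
/data/sdb4/mfschunk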




MFS Metadata Maintenance (Disaster Recovery)

Anyone administering a MooseFS cluster must understand how to recover MFS metadata in case of a hardware component failure or a sudden system crash. Without the associated metadata, the data stored on the chunk servers is nothing but a heap of trash. It is very important to guard the metadata and keep enough backup provisions to recover the file system in case of emergency.

Few Basics about metadata

metadata.mfs: This file is created on the master server when the master server process is shut down gracefully. All active metadata, including any pending changelogs, are written to this file; it contains a full set of metadata at the time of graceful shutdown. This is the only file required at the next master metadata server startup. This file is not present while the master metadata server is running normally.

metadata.mfs.back: When the master server is started, it reads the metadata.mfs file and initializes the metadata.mfs.back file from its last state. During normal operation of the master server, all pending changelogs are merged into this file hourly at the top of the hour. It is highly recommended to make an hourly copy, preferably at [HH]:30, keep the last 24 copies of this file, and possibly back them up away from the master server. In very rare circumstances, if the master server and metalogger servers are beyond recovery for any reason, these hourly backup files can be used to recover the MFS file system to its last known good state.

changelogs.mfs*: These are changelog files written to disk from time to time. These changelogs are merged and written to the metadata.mfs.back file on an hourly basis. Backup metaloggers sync these files and keep themselves up to date on a regular basis.


Recovery from crashed master server

When the master server process is killed unexpectedly, it is left with the metadata.mfs.back file and pending changelog* files that have not been merged yet. At the next startup, the master server will fail because of the missing metadata.mfs file. Run the mfsmetarestore command on the master server to merge all the pending changes manually, then start the master server.
mfsmetarestore -a -d /var/mfs          #-- /var/mfs is the location where the master metadata is written
It is highly recommended to always gracefully shut down the master server whenever possible.
 
 

Moving master metadata server in a planned manner.

Follow these steps when moving the master metadata server from one server to another. First, shut down all metalogger servers and chunk servers (if possible, although not required), then shut down the old master metadata server; ensure the master server is gracefully shut down. Prepare the new master server by moving the VIP (Virtual IP) to it. Copy the master metadata directory to the new server and start the master metadata server on the new host. Then start the metalogger and chunk servers as well. A command sketch follows after the note below.

Note: Using a VIP (virtual IP address) for the master metadata server and moving it around has a great advantage: chunk servers and client hosts that were previously mounting the MFS volume will be able to rejoin with much more ease. If you change the IP address of the master metadata server instead, then chunk servers and client hosts may experience a long disruption due to DNS caching, nscd, failed mounts against the old IP, etc.
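A minimal sketch of the planned move, assuming the example VIP interface eth0:foo, the default /var/mfs metadata directory and a hypothetical new host newmaster.example.com:

#-- On the old master
[root@oldmaster]# service mfsmaster stop                      #-- graceful shutdown writes metadata.mfs
[root@oldmaster]# ifdown eth0:foo                             #-- release the VIP
[root@oldmaster]# scp -rp /var/mfs/ newmaster.example.com:/var/
#-- On the new master
[root@newmaster]# ifup eth0:foo                               #-- VIP mfsmaster-foo.example.com now lives here
[root@newmaster]# chown -Rh mfs:mfs /var/mfs
[root@newmaster]# service mfsmaster start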


Recovery from Metalogger server data.

In the event that the master metadata server hardware is completely lost and the latest master metadata cannot be retrieved from it, the metadata from the metalogger servers can be used to recover. In this case, prepare a new master metadata server by moving its designated VIP to it. Copy all metalogger data to the new server into some temporary location (e.g. /tmp/metalogger/) and then run the following command to recreate the [MasterDir]/metadata.mfs file.

mfsmetarestore -o /var/mfs/metadata.mfs  -m /tmp/metalogger/metadata.mfs.back /tmp/metalogger/changelog_ml.*.mfs

Once completed, start the master metadata server. This will bring the MFS file system to the latest state cached by the metalogger servers at the time when the master server went down.

Additional copies of metadata.

Metadata is a very critical asset and there is every reason to be paranoid about not losing it. Two or more metaloggers are highly recommended; in addition, a script like the one below can keep an hourly copy of the metadata for historical purposes. These older snapshots may be helpful if you want to go back in time and use older metadata (e.g. the current metadata got corrupted and was also copied over to the metaloggers). Using historical metadata won't bring everything current, but it may be able to bring the file system up to its last known good state. Obviously, the corresponding chunks must still be present on the chunk servers to match up with the metadata, or you will notice errors about missing chunks. A sample cron entry for running the script is shown after it.

#!/bin/sh
#-----------------------------------------------------------------
#   Backing up master metalogs locally. Run this script
#   from cronjob frequently (at every HOUR:30 or so), to capture
#   frequent snapshots. This will keep last 24 snapshots of metadata
#   and overwrite after that. For additional protection, backup this
#   data using your backup system frequently.
#-----------------------------------------------------------------
MFS_BASE="/var/mfs"
MFS_LOCALBACKUP="$MFS_BASE/LocalBackup"
#-- Assuming active master metalogs are stored as
#-- $MFS_BASE/metadata.mfs.back
DATE=`date`
CURRENT_METADATA="$MFS_BASE/metadata.mfs.back"
if [ ! -d ${MFS_LOCALBACKUP} ]; then
 mkdir $MFS_LOCALBACKUP
 if [ $? != 0 ]; then
  echo "Oops! Can not create '$MFS_LOCALBACKUP' directory. Aborting..."
  exit 1
 fi
fi
if [  -f $CURRENT_METADATA ]; then
  echo "$DATE: Backing up metadata $CURRENT_METADATA"
  HOUR_NUM=`date +%0H`
  #-- copy and replace previous file.
  cp -u $CURRENT_METADATA $MFS_LOCALBACKUP/metadata.mfs-hour:$HOUR_NUM
fi
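To run the script every hour at HH:30, as its comments suggest, a cron entry like the following could be placed in /etc/cron.d/ (the script path is hypothetical):

30 * * * * root /usr/local/sbin/mfs-metadata-backup.sh >> /var/log/mfs-metadata-backup.log 2>&1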


Speeding up chunk replication and re-balancing.

By default, MooseFS gives higher I/O priority to client file system operations and uses very little I/O for chunk replication and re-balancing. This is mostly preferable in normal operation. However, when you replace an existing chunk server, some of the chunks may be under goal and need quick attention. If you want to speed up the chunk replication process by sacrificing I/O for other file system operations, then tweak the following two parameters in the mfsmaster.cfg file on the master server and restart the master server process. You may have to experiment a little to find the correct balance between the chunk replication rate and the I/O available for other file system operations.

CHUNKS_WRITE_REP_LIMIT = 5             #default value 1
CHUNKS_READ_REP_LIMIT = 25             #default value 5


Wrapping up

Some of the missing but desired features are worth mentioning here. Native redundancy for the master metadata server is a highly desired feature. Currently one can use UCARP or DRBD with Heartbeat solutions, but it would be nice to see built-in redundancy in future releases.
Although the cgi-bin web interface is sufficient for most information, more command line options to gather information about MooseFS would be nice, so they can be used in scripts and monitoring systems.
Global POSIX file system locking would be nice to have and looks like it is promised for the next release.

Advanced features such as file system compression, encryption and data de-duplication are also important for many data center environments. Currently, some of these features can be used with native file systems such as ext4, zfs or lessfs. There are two zfs-linux ports in the works as well: zfs-fuse and native ZFS for the Linux kernel. Btrfs is supposed to be the next Linux file system and is worth keeping an eye on.

Conclusion.

In our experience, MooseFS turned out to be a great file system. Its simplicity and resiliency surpass many other distributed network file systems. MooseFS components such as the master server and chunk servers can be restarted gracefully without significant issues on the client side mounting MFS shares. MooseFS is able to recover very well even if more than one critical component is down at a given point and comes back after a certain time. Replication and trash quarantine setup at a very granular level is superb. For example, one can set a very high replication factor (say goals=4) for a mission-critical data folder, thus allowing more chunk server failures without affecting the availability of such data, while keeping goals=2 for less critical data to utilize disk space better. The cgi-bin status web interface is great and gives lots of the information needed for file system operation.