The Linux multipath implementation (reposted)

Original author : Christophe Varoqui
Creation : Feb 2004
Last update : Dec 2010

Introduction


The most common multipathed environment today is a Fibre Channel (FC) Storage Area Network (SAN). These beasts can be found in most datacenters. The Lego blocks forming a SAN are :

  • FC switches : core switches (multi-protocol chassis and FC boards), or stacked switches linked by Inter Switch Links (ISL). This layer, the Fabric layer, can be subdivided into two major fabric types :
    • Simple fabrics : all storage ports can be routed to all host ports. Hosts and storage controllers can have a single attachment to the fabric.
    • Dual independent fabrics : two sets of switches are completely segregated (no ISL). They form two independent naming domains. The hosts and storage controllers must be attached to both fabrics to ensure redundancy. This technology is used to provide maximum availability, as one fabric can be shut down, for planned or unplanned reasons, without disturbing the other one.
  • FC storage controllers : most of them provide multiple ports to attach to the switch layer. The physical storage they drive is arranged in virtual drives we will refer to as Logical Units (LU). Each LU is assigned a unique identifier by its host controller, as per the SCSI standard. We will refer to this identifier as the World Wide Identifier (WWID) or World Wide Name (WWN).
  • Host Bus Adapters (HBA) : the PCI / FC coupling adapters. A server can hold multiple HBAs.


The term multipath simply means that a host can access a LU by multiple paths, a path being a route from one host HBA port to one storage controller port.

Examples :

  • A host with 2 HBAs attached to a single fabric is presented a LU by a 4-port storage controller. The host then sees 8 paths to the LU.
  • A host with 2 HBAs attached to a dual independent fabric (1 HBA on each fabric) is presented a LU by a 4-port storage controller (2 ports on each fabric). The host then sees 4 paths to the LU : 2 paths through fabric A, plus 2 through fabric B.
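The arithmetic behind these two examples can be sketched in a few lines of Python. This is only an illustration: the function name is made up, and it assumes that within a fabric every HBA port can reach every controller port (no zoning restrictions).

```python
def count_paths(fabrics):
    """Count host-to-LU paths.

    `fabrics` is a list of (host_hba_ports, controller_ports) pairs, one per
    independent fabric. Assumption: within a fabric every HBA port can reach
    every controller port, so each fabric contributes the product of the two.
    """
    return sum(hba_ports * ctrl_ports for hba_ports, ctrl_ports in fabrics)


# Single fabric: 2 HBAs and a 4-port controller, all on the same fabric.
single_fabric = count_paths([(2, 4)])

# Dual independent fabrics: 1 HBA and 2 controller ports on each fabric.
dual_fabric = count_paths([(1, 2), (1, 2)])

print(single_fabric, dual_fabric)  # prints: 8 4
```

Splitting the same hardware across two fabrics halves the path count, which is the price paid for the isolation between fabrics.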


The Linux kernel chooses not to mask the individual paths, which appear as normal SCSI disks (sd).

Multipath awareness and support for an operating system can be described as :

  • Provide a single block device node for a multipathed LU
  • Ensure that IO is re-routed to available paths when a path is lost, with no userspace process disruption other than a short pause.
  • Ensure that failed paths get revalidated as soon as possible
  • Ensure stability of the naming of that node
  • Configure the multipaths to maximize performance : spread IO across paths when path switching is free, and do not spread it when switching is costly.
  • Configure the multipaths automatically early at boot to permit OS install and boot on a multipathed LU
  • Reconfigure the multipaths automatically when events occur
  • The multipaths must be partitionable
  • In the Linux way : simple and hardware vendor agnostic


All these goals are met by leveraging a set of userspace tools and kernel subsystems :

  • the kernel device mapper
  • the hotplug kernel subsystem
  • the udev device naming tool
  • the multipath userspace configuration tool
  • the multipathd userspace daemon
  • the kpartx userspace configuration tool
  • the early userspace Linux kernel boot environment


The rest of this document describes these individual tools and subsystems and their interactions.

Device Mapper


Starting with Linux kernel 2.6, a new lightweight block subsystem named Device Mapper enables advanced storage management with style. This component features a pluggable design. At the time of this writing, the available plugins are :

  • segment concatenation
  • segment striping
  • segment snapshotting
  • segment mirroring, with and without persistence
  • segment on-the-fly encryption
  • segment multipathing


This subsystem is the core component of the multipath tool chain. It is not included in the main kernel tree as of linux-2.6.10. It is part of a patchset created by Joe Thornber, and now maintained by Alasdair G Kergon (agk at redhat dot com), that can be downloaded at http://sources.redhat.com/dm/

This component fills the following requirements :

  • Provide a single block device node for a multipathed LU
  • Ensure that IO is re-routed to available paths when a path is lost, with no userspace process disruption


So, let's see how it works.

The Device Mapper is configured one map at a time. A device map, also referred to as a table, is a list of segments in the form of :


0 35258368 linear 8:48 65920
35258368 35258368 linear 8:32 65920
70516736 17694720 linear 8:16 17694976
88211456 17694720 linear 8:16 256


The first 2 parameters of each line are the segment's starting block in the virtual device and the length of the segment. The next keyword is the target policy (linear). The rest of the line holds the target parameters.
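To make the layout concrete, here is a minimal Python sketch that splits such a table into its fields and checks that the segments tile the virtual device without gaps. The `Segment` type and both function names are illustrative only; they are not part of libdevmapper or any other real API.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    start: int      # first block of the segment in the virtual device
    length: int     # segment length in blocks
    target: str     # target policy, e.g. "linear"
    params: list    # remaining, target-specific parameters


def parse_table(text):
    """Parse a device map table into a list of Segment tuples."""
    segments = []
    for line in text.strip().splitlines():
        fields = line.split()
        segments.append(Segment(int(fields[0]), int(fields[1]), fields[2], fields[3:]))
    return segments


def is_contiguous(segments):
    """Check that the segments tile the virtual device with no gap or overlap."""
    expected_start = 0
    for seg in segments:
        if seg.start != expected_start:
            return False
        expected_start = seg.start + seg.length
    return True


TABLE = """
0 35258368 linear 8:48 65920
35258368 35258368 linear 8:32 65920
70516736 17694720 linear 8:16 17694976
88211456 17694720 linear 8:16 256
"""

segments = parse_table(TABLE)
total_blocks = segments[-1].start + segments[-1].length
print(is_contiguous(segments), total_blocks)  # prints: True 105906176
```

For a linear target the two remaining parameters are the backing device (major:minor) and the starting offset on that device, which is why the last two segments can both sit on 8:16 at different offsets.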

The Device Mapper can be fed its tables through the use of a library : libdevmapper. EVMS2, dmsetup, LVM2, the multipath configuration tool and kpartx all link this lib. A table setup boils down to sprintf'ing the right segment definitions in a char *. Should the DM user-kernel interface change from being ioctl based to a pseudo filesystem, the libdevmapper API should remain stable.

Here is an example of a multipath target :

                             [----------- 1st path group -----------] [---------- 2nd path group ----------]
0 71014400 multipath 0 0 2 1 round-robin 0 2 1 66:128 1000 65:64 1000 round-robin 0 2 1 8:0 1000 67:192 1000
^ ^        ^         ^ ^ ^ ^ ^           ^ ^ ^ ^      ^
| |        |         | | | | |           | | | |      number of IOs to send to this path before switching
| |        |         | | | | |           | | | path major:minor numbers
| |        |         | | | | |           | | number of path arguments
| |        |         | | | | |           | number of paths in this path group
| |        |         | | | | |           number of selector arguments
| |        |         | | | | path selector
| |        |         | | | next path group to try
| |        |         | | number of path groups
| |        |         | number of hwhandler
| |        |         number of features
| |        target name
| target length in 512-byte blocks
starting offset of the target
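The field layout annotated above can be walked mechanically. The following Python sketch does so for the example line; it follows the annotation only and is not how the kernel or libdevmapper parse the table. The function name and the shape of the returned dictionary are made up for illustration.

```python
def parse_multipath_table(line):
    """Parse a multipath target line, following the field layout annotated above."""
    tokens = line.split()
    pos = 0

    def take(count=1):
        # Consume `count` tokens and return them as a list.
        nonlocal pos
        taken = tokens[pos:pos + count]
        if len(taken) != count:
            raise ValueError("truncated multipath table")
        pos += count
        return taken

    start, length, target = take(3)
    if target != "multipath":
        raise ValueError("not a multipath target")
    features = take(int(take()[0]))      # count, then that many feature words
    hwhandler = take(int(take()[0]))     # count, then that many hwhandler words
    nr_groups, next_group = (int(t) for t in take(2))

    groups = []
    for _ in range(nr_groups):
        selector = take()[0]
        selector_args = take(int(take()[0]))
        nr_paths, nr_path_args = (int(t) for t in take(2))
        paths = []
        for _ in range(nr_paths):
            device = take()[0]
            paths.append((device, take(nr_path_args)))
        groups.append({"selector": selector, "selector_args": selector_args, "paths": paths})

    return {
        "start": int(start), "length": int(length),
        "features": features, "hwhandler": hwhandler,
        "next_group": next_group, "groups": groups,
    }


LINE = ("0 71014400 multipath 0 0 2 1 "
        "round-robin 0 2 1 66:128 1000 65:64 1000 "
        "round-robin 0 2 1 8:0 1000 67:192 1000")
parsed = parse_multipath_table(LINE)
for group in parsed["groups"]:
    print(group["selector"], [device for device, _ in group["paths"]])
```

Every variable-length part of the line is preceded by its own count, which is what lets a reader (or a parser) find the boundary between the two path groups without any delimiter.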


For completeness, here is an example of a pure failover target definition for the same LU :


0 71014400 multipath 0 0 4 1 round-robin 0 1 1 66:112 1000 round-robin 0 1 1 67:176 1000 round-robin 0 1 1 68:240 1000 round-robin 0 1 1 65:48 1000

And a full spread (multibus) target one :


0 71014400 multipath 0 0 1 1 round-robin 0 4 1 66:112 1000 67:176 1000 68:240 1000 65:48 1000
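Since a table setup boils down to formatting the right string, the failover and multibus definitions can both be produced by one small routine that differs only in how the paths are grouped. The Python sketch below is an illustration of that idea, not the multipath tool's code; the function name is hypothetical, and it hard-codes the choices seen in the examples (no features, no hardware handler, round-robin selector, first group tried first).

```python
def build_multipath_table(size, path_groups, repeat_count=1000):
    """Format a multipath table line from a list of path groups.

    `path_groups` is a list of lists of "major:minor" strings. Features and
    hardware handler are left empty ("0 0"), every group uses the round-robin
    selector, and the first group is the next one to try.
    """
    fields = [0, size, "multipath", 0, 0, len(path_groups), 1]
    for paths in path_groups:
        # selector, nb of selector args, nb of paths, nb of args per path
        fields += ["round-robin", 0, len(paths), 1]
        for device in paths:
            fields += [device, repeat_count]
    return " ".join(str(field) for field in fields)


PATHS = ["66:112", "67:176", "68:240", "65:48"]

# Failover: one path per group. Multibus: all paths in a single group.
failover = build_multipath_table(71014400, [[p] for p in PATHS])
multibus = build_multipath_table(71014400, [PATHS])
print(failover)
print(multibus)
```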

Upon device map creation, a new block kernel object named dm-[0-9]* is instantiated, and a hotplug call is triggered. Each device map can be assigned a symbolic name when created through libdevmapper, but this name won't be available anywhere but through a libdevmapper request.

hotplug subsystem and udev


Starting with Linux kernel 2.6, the hotplug callbacks are provided by the sysfs pseudo filesystem events. This filesystem presents kernel objects like buses, driver instances or block devices to userspace in a hierarchical and homogeneous manner. /sbin/hotplug is called upon file creation and deletion in the sysfs filesystem.
Udev acts as a proxy for sysfs events. The multipath tools collect events from a udev event-relaying unix socket.

For our needs this facility provides :

  • userspace callbacks upon path additions and removals
  • userspace callbacks upon device map additions and removals


Since linux-2.6.4, and its integration of the transport class for sysfs, it can also provide callbacks upon FC transport events like a “Port Database Rescan”. These callbacks could now be used to trigger a SCSI bus rescan to bring a fully dynamic storage layer. (Or am I wrong ?)

Here is how we use these callbacks for the multipath implementation :

  • The path addition and removal callbacks are listened to by the multipath userspace daemon described later. This daemon keeps the multipath maps up-to-date with the fabric topology, which ensures optimal performance by adding new paths to the existing maps as soon as they become available.
  • The udev userspace tool is triggered upon every block sysfs entry creation and removal, and assumes the responsibility of creating and naming the associated device node. Udev default naming policies can be complemented by add-on scripts or binaries. As it does not currently have a default policy for naming device maps, we plug in a little tool named devmap_name that resolves the sysfs dm-[0-9]* names into the map names set at map creation time. Provided the map naming is done right, this plugin provides the naming stability and meaningfulness required for a proper multipath implementation.
  • The userspace callbacks upon device map additions and removals also trigger the kpartx tool to create device maps over any partitions found.
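The name resolution performed by devmap_name can be illustrated with a short Python sketch. Note the assumption: devmap_name itself queried libdevmapper, whereas this sketch reads a `dm/name` attribute that more recent kernels expose under the dm-[0-9]* sysfs entry. It shows the idea of the lookup, not the original tool's method, and the function name is made up.

```python
import os
import tempfile


def resolve_map_name(kernel_name, sysfs_root="/sys"):
    """Return the map name of a dm-N block device, or None if unavailable.

    Assumption: the kernel exposes the name in <sysfs_root>/block/dm-N/dm/name.
    """
    path = os.path.join(sysfs_root, "block", kernel_name, "dm", "name")
    try:
        with open(path) as attr:
            return attr.read().strip()
    except OSError:
        return None


# Demonstration against a fake sysfs tree, so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as fake_sysfs:
    dm_dir = os.path.join(fake_sysfs, "block", "dm-0", "dm")
    os.makedirs(dm_dir)
    with open(os.path.join(dm_dir, "name"), "w") as attr:
        attr.write("3600a0b80000b5c9c0000044d3b667c19\n")

    resolved = resolve_map_name("dm-0", sysfs_root=fake_sysfs)
    missing = resolve_map_name("dm-1", sysfs_root=fake_sysfs)

print(resolved, missing)
```

On a real system the function would be called with the default `sysfs_root`, and the returned name is what a naming rule would use for the device node instead of the unstable dm-N kernel name.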


Udev is a reimplementation in userspace of the devfs kernel facility. It provides a dynamic /dev space, with an agnostic naming policy. Greg Kroah-Hartman is the original developer of this package, and it is now maintained by Kay Sievers. It can be found at http://ftp.kernel.org/pub/linux/utils/kernel/hotplug/

To summarize, these subsystems fill the following implementation requirements :

  • Relay events to the multipath daemon
  • Create multipath device entries
  • Trigger kpartx to create device maps over multipath partitions

 

multipath userspace config tool


This tool implements a stateless subset of the multipath daemon features. It can work without the daemon running. It can handle path coalescing and device map creation.

Here is how it works :

  • Draw up a list of all available devices in the system through a sysfs scan. For each device, get a bunch of information :
    • Host / Bus / Target / Lun tuple
    • SCSI Device Strings : Vendor / Model / Revision
    • SCSI Serial String
  • Considering the information fetched, select a LU UID fetching method and an IO spreading policy, i.e. deal with hardware specifics.
  • Get the LU UID with the selected method. This method defaults to the standard 128 bit EUID found in the EVPD 0x83 inquiry page of the device.
  • Coalesce the paths to form the multipath structs, i.e. group paths by UID.
  • Create and name the device maps associated with the multipath structs, with the selected IO spreading policy.
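The coalescing step is essentially a group-by on the LU UID. Here is a minimal Python sketch of it; the record layout is hypothetical, the scan data is hand-written rather than read from sysfs, and the second WWID is a placeholder.

```python
from collections import defaultdict


def coalesce(paths):
    """Group path records by LU UID; each group is one multipath."""
    multipaths = defaultdict(list)
    for path in paths:
        multipaths[path["wwid"]].append(path["dev"])
    return dict(multipaths)


# Hypothetical scan result: four paths to one LU, plus a single-path LU.
SCAN = [
    {"dev": "sdat", "hbtl": "3:0:0:5", "wwid": "3600a0b80000b5c9c0000044d3b667c19"},
    {"dev": "sdz",  "hbtl": "2:0:2:5", "wwid": "3600a0b80000b5c9c0000044d3b667c19"},
    {"dev": "sdbn", "hbtl": "3:0:2:5", "wwid": "3600a0b80000b5c9c0000044d3b667c19"},
    {"dev": "sdf",  "hbtl": "2:0:0:5", "wwid": "3600a0b80000b5c9c0000044d3b667c19"},
    {"dev": "sda",  "hbtl": "0:0:0:0", "wwid": "example-single-path-wwid"},
]

multipaths = coalesce(SCAN)
for wwid, devs in multipaths.items():
    print(wwid, devs)
```

A LU seen through a single path still yields a multipath of its own, so a map can be created for it and extended when further paths show up.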


There are currently 3 IO routing policies implemented :

  • failover : 1 path per priority group. IO thus get routed to one path only.
  • multibus : 1 priority group containing all paths to the LU. Brings the maximum spreading, but assumes that all paths are usable without penalty.
  • group_by_serial : 1 priority group per storage controller (serial); paths through one controller are assigned to the associated PG. This policy applies to controllers that impose a latency penalty on LU management hand-over between a pair of redundant controllers.
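The three policies differ only in how the paths of one LU are partitioned into priority groups. The Python sketch below shows that partitioning under stated assumptions: the function name is illustrative, and the controller serials "ctrlA" and "ctrlB" are invented for the example.

```python
def group_paths(paths, policy):
    """Arrange (device, controller_serial) pairs into priority groups."""
    if policy == "failover":
        # One path per priority group.
        return [[dev] for dev, _ in paths]
    if policy == "multibus":
        # A single priority group holding every path.
        return [[dev for dev, _ in paths]]
    if policy == "group_by_serial":
        # One priority group per controller serial, in order of first appearance.
        groups = {}
        for dev, serial in paths:
            groups.setdefault(serial, []).append(dev)
        return list(groups.values())
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical paths: the serial values "ctrlA" and "ctrlB" are made up.
PATHS = [("66:208", "ctrlA"), ("65:144", "ctrlA"), ("68:16", "ctrlB"), ("8:80", "ctrlB")]

for policy in ("failover", "multibus", "group_by_serial"):
    print(policy, group_paths(PATHS, policy))
```

With group_by_serial, IO is spread only among paths that end on the same controller, so the costly controller hand-over happens only when a whole group fails.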


Policy assignment can be set manually at the command line. This one sets the policy to multibus for the multipath containing the device with major 8 and minor 0 (/dev/sda) :


multipath -pmultibus -D 8 0


These policies can optionally be stored in a config file (/etc/multipath.conf). If the file is present, its content overrides the in-code defaults. For the multipath tool to work, all multipath hardware you use must be described either in the config file, if you have one, or in the in-code defaults table if not.

The device map naming policy is “name by LU WWID”, with a provision for defining per-LU aliases.

To illustrate this synopsis, here is an example verbose output :

[root@cl039 multipath-tools-0.3.9]# multipath -v2 3600a0b80000b5c9c0000044d3b667c19
unchanged: 3600a0b80000b5c9c0000044d3b667c19
[size=34675MB][features="0"][hwhandler="0"]
\_ round-robin 0 [first]
  \_ 3:0:0:5 sdat 66:208  [ready ][active]
  \_ 2:0:2:5 sdz  65:144  [ready ][active]
\_ round-robin 0
  \_ 3:0:2:5 sdbn 68:16   [ready ][active]
  \_ 2:0:0:5 sdf  8:80    [ready ][active]



The first section shows the list of all paths detected on the host. The second shows the multipath structs produced by the coalescing logic. The third shows the device maps submitted to the Device Mapper.

Of interest is the creation of device maps for single-path LUs : this enables systems to operate normally when booted in a degraded SAN context. The missing paths will be added to the maps when they become available.

The implementation requirements filled by this tool are :

  • Work in early userspace with minimal dependencies.
  • Ensure naming stability of the multipathed LU (in complement of udev)
  • Configure the multipaths to maximize availability and performance

 

the multipathd daemon


This daemon can do everything the multipath command does and, additionally, is in charge of checking the paths in case they come up or go down. When this occurs, it will reconfigure the multipath map the path belongs to, so that this map regains its maximum performance and redundancy.
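The core of that checking logic is a comparison between what the daemon knew about the paths and what the latest check reports. The Python sketch below illustrates the comparison only; the function name, the "up"/"down" states and the action names are invented for the example and are not multipathd commands or internals.

```python
def plan_actions(previous, current):
    """Compare two path-state snapshots and list the map updates they imply.

    Both arguments map a path device to "up" or "down". The action names are
    illustrative; they are not multipathd commands.
    """
    actions = []
    for dev in sorted(set(previous) | set(current)):
        before = previous.get(dev)
        after = current.get(dev)
        if before == after:
            continue                              # nothing changed for this path
        if after is None:
            actions.append(("remove", dev))       # path disappeared from the system
        elif before is None:
            actions.append(("add", dev))          # newly discovered path
        elif after == "up":
            actions.append(("reinstate", dev))    # failed path came back
        else:
            actions.append(("fail", dev))         # checker saw the path go down
    return actions


before = {"sdat": "up", "sdz": "up", "sdbn": "down"}
after = {"sdat": "down", "sdz": "up", "sdbn": "up", "sdf": "up"}
actions = plan_actions(before, after)
print(actions)  # prints: [('fail', 'sdat'), ('reinstate', 'sdbn'), ('add', 'sdf')]
```

Each resulting action would translate into a reload of the affected device map, so that the map reflects the paths that are actually usable.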

The implementation requirements filled by this daemon are :

  • Ensure naming stability of the multipathed LU (in complement of udev)
  • Configure the multipaths to maximize availability and performance
  • Ensure that failed paths get revalidated as soon as possible
  • Reconfigure the multipaths automatically when events occur (add/remove paths, switch path groups, ...)

 

kpartx userspace config tool


This tool, derived from util-linux' partx, reads the partition table on the specified device and creates device maps over the partition segments detected. It is called from hotplug upon device map creation and deletion.
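Each partition can be expressed as a one-segment linear table laid over the multipath map. The Python sketch below shows what such tables would look like, under assumptions: the partition layout is hand-written rather than read from disk, the map name "lun5", the device number 253:0 and the "<map>p<N>" naming are all hypothetical, and the function is not kpartx code.

```python
def partition_tables(map_name, parent_dev, partitions):
    """Build one linear table per partition of a multipath map.

    `partitions` is a list of (start_block, size_in_blocks) pairs, as they
    would be read from the partition table. `parent_dev` is the major:minor
    of the multipath map. The "<map>p<N>" naming is an assumption.
    """
    tables = {}
    for number, (start, size) in enumerate(partitions, start=1):
        # Each partition is a single linear segment mapped onto the parent map.
        tables[f"{map_name}p{number}"] = f"0 {size} linear {parent_dev} {start}"
    return tables


# Hypothetical layout: two partitions on a map whose device node is 253:0.
tables = partition_tables("lun5", "253:0", [(63, 1048576), (1048639, 69965761)])
for name, table in tables.items():
    print(name, "->", table)
```

Because the partition maps sit on top of the multipath map rather than on an individual sd path, IO to a partition benefits from the same path failover as IO to the whole LU.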

kpartx is part of the multipath-tools package

Early userspace


Starting with Linux kernel 2.6, an early userspace execution environment is available under the name initramfs. The grand plan is to package a set of tools in a cpio archive concatenated to the kernel. This archive is expanded into an in-memory filesystem early at boot, and the tools are called to take over logic that previously belonged in the kernel : dhcp requests and setups, nfsroot stuff...

The multipath implementation toolbox fits in this early userspace definition. udev, multipath and kpartx are packaged with the initrd or initramfs to bring up the multipathed device early enough to boot on.

Thus the last multipath implementation requirement is met.