Btrfs metadata design (reposted design document; gives a clear, basic view of its extended B+ tree data structure design)

Source: Internet. Editor: 程序博客网. Time: 2024/06/01 12:42

Directories and subvolumes

A directory's keys simply list all of the files contained within the directory, twice. The first list consists of a sequence of DIR_ITEM keys, ordered by the hash of the item's name (this is stored in the offset of the key). The second list consists of a sequence of DIR_INDEX keys, ordered by the "natural" order of the directory (typically in creation order). Both key types store the same structure, which references the key of this object's inode, and holds the full name of the object within this directory. The referenced key will be either of type INODE_ITEM, in which case it is an ordinary POSIX filesystem object and can be looked up in this tree; or it will be of type ROOT_ITEM, in which case it's a subvolume, and the subvolume objectid should be looked up in the tree of tree roots to find the corresponding FS tree.

Btrfs design

Btrfs is implemented with simple and well-known constructs. It should perform well, but the long-term goal of maintaining performance as the filesystem ages and grows is more important than winning a short-lived benchmark. To that end, benchmarks are being used to try to simulate performance over the life of a filesystem.

Btree Data structures

The Btrfs btree provides a generic facility to store a variety of data types. Internally it only knows about three data structures: keys, items, and a block header:

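struct btrfs_header {
    u8 csum[32];            /* checksum of the block contents */
    u8 fsid[16];            /* uuid of the filesystem that owns the block */
    __le64 blocknr;         /* where this block is supposed to live */
    __le64 flags;
    u8 chunk_tree_uid[16];
    __le64 generation;      /* transaction id that allocated the block */
    __le64 owner;
    __le32 nritems;
    u8 level;               /* level of the block in the tree */
};

struct btrfs_disk_key {
    __le64 objectid;
    u8 type;
    __le64 offset;
};

struct btrfs_item {
    struct btrfs_disk_key key;
    __le32 offset;          /* where the item data starts in the leaf */
    __le32 size;            /* length of the item data */
};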

Upper nodes of the trees contain only [ key, block pointer ] pairs. Tree leaves are broken up into two sections that grow toward each other. Leaves have an array of fixed-size items, and an area where item data is stored. The offset and size fields in the item indicate where in the leaf the item data can be found. Example:

[ Item 0 | ... | Item N | Free Space | Data for Item N | ... | Data for Item 0 ]
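The two-ended layout can be sketched in C as follows. This is a simplified in-memory model under stated assumptions, not the on-disk format: LEAF_SIZE, struct item, struct leaf and the leaf_* helpers are hypothetical names, the key is reduced to a single integer, the block header is left out, and items are appended rather than kept in key order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LEAF_SIZE 256 /* hypothetical leaf size, kept tiny on purpose */

/* Simplified item header: records where the item's data lives in the leaf. */
struct item {
    uint64_t key;    /* stand-in for struct btrfs_disk_key */
    uint32_t offset; /* byte offset of the data within the leaf */
    uint32_t size;   /* length of the data in bytes */
};

struct leaf {
    uint32_t nritems;
    uint32_t data_start;      /* lowest byte used by item data so far */
    uint8_t bytes[LEAF_SIZE]; /* item headers grow up, item data grows down */
};

static void leaf_init(struct leaf *l)
{
    l->nritems = 0;
    l->data_start = LEAF_SIZE;
    memset(l->bytes, 0, sizeof(l->bytes));
}

/* Free space is the gap between the end of the item array and the data. */
static uint32_t leaf_free_space(const struct leaf *l)
{
    return l->data_start - l->nritems * (uint32_t)sizeof(struct item);
}

/* Append an item; returns 0 on success, -1 if the leaf is full. */
static int leaf_insert(struct leaf *l, uint64_t key,
                       const void *data, uint32_t size)
{
    struct item it;

    if (leaf_free_space(l) < sizeof(struct item) + size)
        return -1;
    l->data_start -= size; /* data is packed from the end of the leaf */
    memcpy(l->bytes + l->data_start, data, size);
    it.key = key;
    it.offset = l->data_start;
    it.size = size;
    memcpy(l->bytes + l->nritems * sizeof(struct item), &it, sizeof(it));
    l->nritems++;
    return 0;
}

/* Read back the header of item number idx. */
static struct item leaf_item(const struct leaf *l, uint32_t idx)
{
    struct item it;

    assert(idx < l->nritems);
    memcpy(&it, l->bytes + idx * sizeof(struct item), sizeof(it));
    return it;
}
```

Because headers and data consume the same free region from opposite ends, a leaf can hold many small items or a few large ones without a fixed split between the two areas.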

Item data is variably sized, and various filesystem data structures are defined as different types of item data. The type field in struct btrfs_disk_key indicates the type of data stored in the item.

The block header contains a checksum for the block contents, the uuid of the filesystem that owns the block, the level of the block in the tree, and the block number where this block is supposed to live. These fields allow the contents of the metadata to be verified when the data is read. Everything that points to a btree block also stores the generation field it expects that block to have. This allows Btrfs to detect phantom or misplaced writes on the media.

The checksum of the lower node is not stored in the node pointer to simplify the FS writeback code. The generation number will be known at the time the block is inserted into the btree, but the checksum is only calculated before writing the block to disk. Using the generation will allow Btrfs to detect phantom writes without having to find and update the upper node each time the lower node checksum is updated.
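A minimal sketch of the read-time check described above, assuming a reduced header and a parent pointer that remembers the expected block number and generation. The names struct header, struct block_ptr, enum verify_result and verify_block are hypothetical, and the checksum comparison is left out to keep the example short.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reduced header: only the fields used for verification in this sketch. */
struct header {
    uint8_t fsid[16];
    uint64_t blocknr;
    uint64_t generation;
    uint8_t level;
};

/* What a parent node pointer remembers about the child it points to. */
struct block_ptr {
    uint64_t blocknr;
    uint64_t generation;
};

enum verify_result { BLOCK_OK, BAD_FSID, MISPLACED_WRITE, PHANTOM_WRITE };

static enum verify_result verify_block(const struct header *h,
                                       const uint8_t fsid[16],
                                       const struct block_ptr *expected)
{
    assert(h != NULL && fsid != NULL && expected != NULL);

    /* The block must belong to this filesystem. */
    if (memcmp(h->fsid, fsid, 16) != 0)
        return BAD_FSID;
    /* A block that records a different home was written to the wrong place. */
    if (h->blocknr != expected->blocknr)
        return MISPLACED_WRITE;
    /* An unexpected generation means the expected write never reached disk. */
    if (h->generation != expected->generation)
        return PHANTOM_WRITE;
    return BLOCK_OK;
}
```

The parent only needs the generation, which is known when the child is inserted into the tree, so nothing in the parent has to change when the child's checksum is computed at writeback time.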

The generation field corresponds to the transaction id that allocated the block, which enables easy incremental backups and is used by the copy on write transaction subsystem.

Filesystem Data Structures

Each object in the filesystem has an objectid, which is allocated dynamically on creation. A free objectid is simply a hole in the key space of the filesystem btree, that is, an objectid that doesn't already exist in the tree. The objectid makes up the most significant bits of the key, allowing all of the items for a given filesystem object to be logically grouped together in the btree.

The offset field of the key stores the byte offset for a particular item in the object. For file extents, this would be the byte offset of the start of the extent in the file. The type field stores the item type information, and has extra room for expanded use.
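The grouping effect of this key layout can be illustrated with a comparison function that orders keys by objectid, then type, then offset. This is a sketch on a CPU-order copy of the key: struct key, key_cmp and sort_keys are hypothetical names, and TYPE_FILE_EXTENT is a placeholder value chosen for the example (only the inode type value of one comes from the text).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* CPU-order copy of the three key fields. */
struct key {
    uint64_t objectid;
    uint8_t type;
    uint64_t offset;
};

enum {
    TYPE_INODE = 1,       /* "a type value of one" in the text */
    TYPE_FILE_EXTENT = 12 /* placeholder value, for illustration only */
};

/* Compare objectid first, then type, then offset. */
static int key_cmp(const struct key *a, const struct key *b)
{
    if (a->objectid != b->objectid)
        return a->objectid < b->objectid ? -1 : 1;
    if (a->type != b->type)
        return a->type < b->type ? -1 : 1;
    if (a->offset != b->offset)
        return a->offset < b->offset ? -1 : 1;
    return 0;
}

/* Adapter with the signature qsort() expects. */
static int key_cmp_qsort(const void *a, const void *b)
{
    return key_cmp(a, b);
}

static void sort_keys(struct key *keys, size_t n)
{
    assert(keys != NULL);
    if (n > 1)
        qsort(keys, n, sizeof(keys[0]), key_cmp_qsort);
}
```

Sorting a mixed set of keys this way places every item of object 256 before any item of object 257, and within one object the inode item (type one, offset zero) comes first.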

Inodes

Inodes are stored in struct btrfs_inode_item at offset zero in the key, and have a type value of one. Inode items are always the lowest valued key for a given object, and they store the traditional stat data for files and directories. The inode structure is relatively small, and will not contain embedded file data or extended attribute data. These things are stored in other item types.

Files

Small files that occupy less than one leaf block may be packed into the btree inside the extent item. In this case the key offset is the byte offset of the data in the file, and the size field of struct btrfs_item indicates how much data is stored. There may be more than one of these per file.

Larger files are stored in extents. struct btrfs_file_extent_item records a generation number for the extent and a [ disk block, disk num blocks ] pair to record the area of disk corresponding to the file. Extents also store the logical offset and the number of blocks used by this extent record into the extent on disk. This allows Btrfs to satisfy a rewrite into the middle of an extent without having to read the old file data first. For example, writing 1MB into the middle of an existing 128MB extent may result in three extent records:

[ old extent: bytes 0-64MB ], [ new extent: 1MB ], [ old extent: bytes 65MB-128MB ]
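The split in this example can be sketched as follows. The record layout is a simplification working in bytes rather than blocks, and struct file_extent, its field names and split_for_write are hypothetical. The point is that the two remaining pieces keep pointing at the old on-disk extent and only change their offset into it, so no old data has to be read or copied.

```c
#include <assert.h>
#include <stdint.h>

#define MB (1024ULL * 1024ULL)

/* Simplified extent record, in bytes instead of blocks. */
struct file_extent {
    uint64_t file_offset;   /* key offset: where this record starts in the file */
    uint64_t disk_start;    /* start of the extent on disk */
    uint64_t disk_len;      /* length of the whole extent on disk */
    uint64_t extent_offset; /* offset of this record into the on-disk extent */
    uint64_t len;           /* bytes of the file covered by this record */
};

/*
 * Overwrite [w_off, w_off + w_len) inside `old` with data stored at
 * new_disk. Writes up to three records to out[] and returns the count.
 */
static int split_for_write(const struct file_extent *old,
                           uint64_t w_off, uint64_t w_len,
                           uint64_t new_disk, struct file_extent out[3])
{
    uint64_t old_end = old->file_offset + old->len;
    uint64_t w_end = w_off + w_len;
    int n = 0;

    assert(w_len > 0 && w_off >= old->file_offset && w_end <= old_end);

    if (w_off > old->file_offset) { /* head: same disk extent, shorter */
        out[n] = *old;
        out[n].len = w_off - old->file_offset;
        n++;
    }

    out[n].file_offset = w_off; /* the newly written data */
    out[n].disk_start = new_disk;
    out[n].disk_len = w_len;
    out[n].extent_offset = 0;
    out[n].len = w_len;
    n++;

    if (w_end < old_end) { /* tail: same disk extent, shifted start */
        out[n] = *old;
        out[n].file_offset = w_end;
        out[n].extent_offset = old->extent_offset + (w_end - old->file_offset);
        out[n].len = old_end - w_end;
        n++;
    }
    return n;
}
```

For a 128MB extent and a 1MB write at offset 64MB, this yields a 64MB head, the 1MB new extent, and a 63MB tail whose extent_offset is 65MB into the original disk extent.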

File data checksums are stored in a dedicated btree in a struct btrfs_csum_item. The offset of the key corresponds to the byte number of the extent. The data is checksummed after any compression or encryption is done and it reflects the bytes sent to the disk.

A single item may store a number of checksums. struct btrfs_csum_items are only used for file extents. File data inline in the btree is covered by the checksum at the start of the btree block.

Directories

Directories are indexed in two different ways. For filename lookup, there is an index comprised of keys:

(Directory Objectid, BTRFS_DIR_ITEM_KEY, 64 bit filename hash)

The default directory hash used is crc32c, although other hashes may be added later on. A flags field in the super block will indicate which hash is used for a given FS.

The second directory index is used by readdir to return data in inode number order. This more closely resembles the order of blocks on disk and generally provides better performance for reading data in bulk (backups, copies, etc). Also, it allows fast checking that a given inode is linked into a directory when verifying inode link counts. This index uses an additional set of keys:

(Directory Objectid, BTRFS_DIR_INDEX_KEY, Inode Sequence number)

The inode sequence number comes from the directory. It is increased each time a new file or directory is added.
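A sketch of how the two keys for one new directory entry could be built. Several things here are assumptions: struct key, struct dir and dir_add_name are hypothetical names, the numeric key type values are placeholders, and crc32c below is the textbook bitwise CRC-32C with the usual initial value and final inversion, which may not match the seed Btrfs applies to names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum {
    DIR_ITEM_KEY = 84, /* assumed values; only the distinction matters here */
    DIR_INDEX_KEY = 96
};

struct key {
    uint64_t objectid;
    uint8_t type;
    uint64_t offset;
};

/* Per-directory state: its objectid and the next sequence number to hand out. */
struct dir {
    uint64_t objectid;
    uint64_t next_index;
};

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;

    while (len--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Fill in both index keys for a new entry called `name` in directory d. */
static void dir_add_name(struct dir *d, const char *name,
                         struct key *by_name, struct key *by_index)
{
    assert(d != NULL && name != NULL && by_name != NULL && by_index != NULL);

    by_name->objectid = d->objectid; /* lookup index: keyed by name hash */
    by_name->type = DIR_ITEM_KEY;
    by_name->offset = crc32c(name, strlen(name));

    by_index->objectid = d->objectid; /* readdir index: keyed by sequence */
    by_index->type = DIR_INDEX_KEY;
    by_index->offset = d->next_index++;
}
```

Both keys share the directory's objectid, so the two indexes sit next to each other in the tree, separated only by the key type.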

Reference Counted Extents

Reference counting is the basis for the snapshotting subsystems. For every extent allocated to a btree or a file, Btrfs records the number of references in a struct btrfs_extent_item. The trees that hold these items also serve as the allocation map for blocks that are in use on the filesystem. Some trees are not reference counted and are only protected by a copy on write logging. However, the same type of extent items are used for all allocated blocks on the disk.

Extent Block Groups

Extent block groups allow allocator optimizations by breaking the disk up into chunks of 256MB or more. For each chunk, they record information about the number of blocks available. Files and directories will have a preferred block group which they try first for allocations.

Block groups have a flag that indicates whether they are preferred for data or metadata allocations, and at mkfs time the disk is broken up into alternating metadata (33% of the disk) and data groups (66% of the disk). As the disk fills, a group's preference may change back and forth, but Btrfs always tries to avoid intermixing data and metadata extents in the same group. This substantially improves fsck throughput, and reduces seeks during writeback while the FS is mounted. It does slightly increase the seeks while reading.

Extent Trees and DM integration

The Btrfs extent trees are intended to divide up the available storage into a number of flexible allocation policies. Each extent tree owns a section of the underlying disk, and they can be assigned to a collection of (or a single) tree roots, directories or inodes. Policies will direct how a given allocation is spread across the extent trees available, allowing the admin to direct which parts of the filesystem are striped, mirrored or confined to a given device.

Btrfs will try to tie in with DM in order to easily manage large pools of storage. The basic idea is to have at least one extent tree per spindle, and then allow the admin to assign those extent trees to subvolumes, directories or files.

Explicit Back References

Back references have three main goals:

1. Differentiate between all holders of references to an extent so that when a reference is dropped we can make sure it was a valid reference before freeing the extent.
2. Provide enough information to quickly find the holders of an extent if we notice a given block is corrupted or bad.
3. Make it easy to migrate blocks for FS shrinking or storage pool maintenance. This is actually the same as #2, but with a slightly different use case.

File Extent Backrefs

File extents can be referenced by:

Multiple snapshots, subvolumes, or different generations in one subvol
Different files inside a single subvolume
Different offsets inside a file

[The remainder of this section refers to the extent_ref_v0 structure, which is not used on current btrfs filesystems]

The extent ref structure has fields for:

Objectid of the subvolume root
Generation number of the tree holding the reference
Objectid of the file holding the reference
Offset in the file corresponding to the key holding the reference

When a file extent is allocated the fields are filled in:

(root objectid, transaction id, inode objectid, offset in file)

When a leaf is cow'd, new references are added for every file extent found in the leaf. It looks the same as the create case, but the transaction id will be different when the block is cow'd.

(root objectid, transaction id, inode objectid, offset in file)

When a file extent is removed either during snapshot deletion or file truncation, the corresponding back reference is found by searching for:

(btrfs_header_owner(leaf), btrfs_header_generation(leaf), inode objectid, offset in file)

Btree Extent Backrefs

Btree extents can be referenced by:

Different subvolumes
Different generations of the same subvolume

Storing sufficient information for a full reverse mapping of a btree block would require storing the lowest key of the block in the backref, and it would require updating that lowest key either before write out or every time it changed.

Instead, the objectid of the lowest key is stored along with the level of the tree block. This provides a hint about where in the btree the block can be found. Searches through the btree only need to look for a pointer to that block, and they stop one level higher than the level recorded in the backref.

Some btrees do not do reference counting on their extents. These include the extent tree and the tree of tree roots. Backrefs for these trees always have a generation of zero.

When a tree block is created, back references are inserted:

(root objectid, transaction id or zero, level, lowest objectid)

The level is stored in the objectid slot of the backref to differentiate between Btree back references and file data back references. The highest possible level is 255, and the lowest possible file objectid has been raised to 256. So, if the objectid field in the back reference is less than 256, it corresponds to a Btree block.
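The overloading of the objectid slot can be sketched like this. struct extent_ref, FIRST_FILE_OBJECTID and the helper functions are hypothetical names for illustration. The mapping of the four tuple positions onto four 64 bit fields follows the tuples shown in this section, with the lowest objectid assumed to occupy the slot that file backrefs use for the file offset.

```c
#include <assert.h>
#include <stdint.h>

#define FIRST_FILE_OBJECTID 256ULL /* lowest objectid a file may have */

/* Four 64 bit fields shared by both kinds of back reference. */
struct extent_ref {
    uint64_t root;       /* objectid of the subvolume root */
    uint64_t generation; /* transaction id, or zero */
    uint64_t objectid;   /* file objectid, or tree level if below 256 */
    uint64_t offset;     /* offset in file, or lowest objectid in the block */
};

static struct extent_ref make_file_backref(uint64_t root, uint64_t transid,
                                           uint64_t inode, uint64_t file_off)
{
    struct extent_ref r = {root, transid, inode, file_off};

    assert(inode >= FIRST_FILE_OBJECTID);
    return r;
}

/* A uint8_t level can never reach 256, so it cannot look like a file. */
static struct extent_ref make_tree_backref(uint64_t root, uint64_t transid,
                                           uint8_t level, uint64_t lowest)
{
    struct extent_ref r = {root, transid, level, lowest};

    return r;
}

static int backref_is_tree_block(const struct extent_ref *r)
{
    return r->objectid < FIRST_FILE_OBJECTID;
}
```

A single comparison against 256 is enough to tell the two kinds apart, so no separate type field is needed in the back reference.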

When a tree block is cow'd in a reference counted root, new back references are added for all the blocks it points to:

(root objectid, transaction id, level, lowest objectid)

Because the lowest_key_objectid and the level are just hints, they are not used when backrefs are deleted. When a snapshot is created a new reference is taken directly on the root block. This means the owner field of the root block may be different from the objectid of the snapshot. So, when dropping references on tree roots, the objectid of the root structure is always used. When a backref is deleted:

if backref was for a tree root:
    root_objectid = root->root_key.objectid
else:
    root_objectid = btrfs_header_owner(parent)

(root_objectid, btrfs_header_generation(parent) or zero, 0, 0)

Back Reference Key Construction

Back references have four fields, each 64 bits long. This is hashed into a single 64 bit number and placed into the key offset. The key objectid corresponds to the first byte in the extent, and the key type is set to BTRFS_EXTENT_REF_KEY.

Hash overflows on the offset field are handled by adding one to the calculated hash and searching forward. The searching stops when the correct back reference structure is found or

Snapshots and Subvolumes

A subvolume is basically a named btree that holds files and directories. Subvolumes have inodes inside the tree of tree roots and can have non-root owners and groups. Subvolumes can be given a quota of blocks, and once this quota is reached no new writes are allowed. All of the blocks and file extents inside of subvolumes are reference counted to allow snapshotting. Up to 2⁶⁴ subvolumes may be created on the FS.

Snapshots are identical to subvolumes, but their root block is initially shared with another subvolume. When the snapshot is taken, the reference count on the root block is increased, and the copy on write transaction system ensures changes made in either the snapshot or the source subvolume are private to that root. Snapshots are writable, and they can be snapshotted again any number of times. If read only snapshots are desired, their block quota is set to one at creation time.

Btree Roots

Each Btrfs filesystem consists of a number of tree roots. A freshly formatted filesystem will have roots for:

The tree of tree roots
The tree of allocated extents
The default subvolume tree

The tree of tree roots records the root block for the extent tree and the root blocks and names for each subvolume and snapshot tree. As transactions commit, the root block pointers are updated in this tree to reference the new roots created by the transaction, and then the new root block of this tree is recorded in the FS super block.

The tree of tree roots acts as a directory of all the other trees on the filesystem, and it has directory items recording the names of all snapshots and subvolumes in the FS. Each snapshot or subvolume has an objectid in the tree of tree roots, and at least one corresponding struct btrfs_root_item. Directory items in the tree map names of snapshots and subvolumes to these root items. Because the root item key is updated with every transaction commit, the directory items reference a generation number of (u64)-1, which tells the lookup code to find the most recent root available.
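The "(u64)-1 means most recent" convention can be sketched as a lookup that returns the root item with the highest transaction id not exceeding the requested generation. This is an illustration only: struct root_key and find_root are hypothetical names, the key type is omitted because every entry is a root item, and a linear scan over an array stands in for the btree search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reduced root item key: (subvolume objectid, transaction id). */
struct root_key {
    uint64_t objectid;
    uint64_t transid;
};

/*
 * Return the newest root of `objectid` whose transaction id is at most
 * `generation`, or NULL if there is none. Passing UINT64_MAX, which is
 * (u64)-1, therefore selects the most recent root available.
 */
static const struct root_key *find_root(const struct root_key *keys, size_t n,
                                        uint64_t objectid, uint64_t generation)
{
    const struct root_key *best = NULL;

    assert(keys != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (keys[i].objectid != objectid)
            continue;
        if (keys[i].transid > generation)
            continue;
        if (best == NULL || keys[i].transid > best->transid)
            best = &keys[i];
    }
    return best;
}
```

Since the directory item stores (u64)-1 rather than a concrete transaction id, it never has to be rewritten when a commit inserts a newer root item for the same subvolume.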

The extent trees are used to manage allocated space on the devices. The space available can be divided between a number of extent trees to reduce lock contention and give different allocation policies to different block ranges.

The diagram below depicts a collection of tree roots. The super block points to the root tree, and the root tree points to the extent trees and subvolumes. The root tree also has a directory to map subvolume names to struct btrfs_root_items in the root tree. This filesystem has one subvolume named 'default' (created by mkfs), and one snapshot of 'default' named 'snap' (created by the admin some time later). In this example, 'default' has not changed since the snapshot was created and so both trees point to the same root block on disk.

Copy on Write Logging

Data and metadata in Btrfs are protected with copy on write logging (COW). Once the transaction that allocated the space on disk has committed, any new writes to that logical address in the file or btree will go to a newly allocated block, and block pointers in the btrees and super blocks will be updated to reflect the new location.

Some of the btrfs trees do not use reference counting for their allocated space. This includes the root tree, and the extent trees. As blocks are replaced in these trees, the old block is freed in the extent tree. These blocks are not reused for other purposes until the transaction that freed them commits.

All subvolume (and snapshot) trees are reference counted. When a COW operation is performed on a btree node, the reference count of all the blocks it points to is increased by one. For leaves, the reference counts of any file extents in the leaf are increased by one. When the transaction commits, a new root pointer is inserted in the root tree for each new subvolume root. The key used has the form:

(Subvolume inode number, BTRFS_ROOT_ITEM_KEY, Transaction ID)

The updated btree blocks are all flushed to disk, and then the super block is updated to point to the new root tree. Once the super block has been properly written to disk, the transaction is considered complete. At this time the root tree has two pointers for each subvolume changed during the transaction. One item points to the new tree and one points to the tree that existed at the start of the last transaction.

Any time after the commit finishes, the older subvolume root items may be removed. The reference count on the subvolume root block is lowered by one. If the reference count reaches zero, the block is freed and the reference count on any nodes the root points to is lowered by one. If a tree node or leaf can be freed, it is traversed to free the nodes or extents below it in a depth first fashion.
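The interplay between COW and the later drop of the old root can be sketched with an in-memory model. struct block, cow_block and drop_ref are hypothetical names, blocks are plain structs rather than disk extents, a freed flag stands in for returning space to the allocator, and the recursion ignores the incremental progress records a real implementation would need.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHILDREN 4

/* In-memory stand-in for a reference counted tree block. */
struct block {
    int refs;
    int freed;
    int nchildren;
    struct block *child[MAX_CHILDREN];
};

/* COW a node: the copy points at the same children, so each gains a ref. */
static void cow_block(const struct block *old, struct block *copy)
{
    assert(old != NULL && copy != NULL && !old->freed);

    *copy = *old;
    copy->refs = 1;
    copy->freed = 0;
    for (int i = 0; i < copy->nchildren; i++)
        copy->child[i]->refs++;
}

/* Drop one reference; free the block and recurse when it reaches zero. */
static void drop_ref(struct block *b)
{
    assert(b != NULL && b->refs > 0 && !b->freed);

    if (--b->refs > 0)
        return; /* still shared with another root */
    b->freed = 1;
    for (int i = 0; i < b->nchildren; i++)
        drop_ref(b->child[i]); /* depth first */
}
```

After a root with two leaves is copied, each leaf has two references. Dropping the old root frees only the old root block, and the leaves are freed once the new root is dropped as well.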

The traversal and freeing of the tree may be done in pieces by inserting a progress record in the root tree. The progress record indicates the last key and level touched by the traversal so the current transaction can commit and the traversal can resume in the next transaction. If the system crashes before the traversal completes, the progress record is used to safely delete the root on the next mount.

Ohad Rodeh presented this reference counted snapshot algorithm at the 2007 Linux Filesystem and Storage Workshop:

Slides: http://www.cs.huji.ac.il/~orodeh/papers/LinuxFS_Workshop.pdf

Paper: http://www.cs.tau.ac.il/~ohadrode/papers/btree_TOS.pdf

The Btrfs snapshotting implementation is based on the ideas he presented.

Btrfsck

The filesystem checking utility is a crucial tool, but it can be a major bottleneck in getting systems back online after something has gone wrong. Btrfs aims to be tolerant of invalid metadata, and will avoid using metadata it determines to be incorrect. The disk format allows Btrfs to deal with most corruptions at run time, without crashing the system and without requiring offline filesystem checking.

An offline btrfsck is being developed, in part to help verify the filesystem during testing, and as an emergency tool to make sure the filesystem is safe for mounting. The existing tool only verifies the extent allocation maps, making sure that reference counts are correct and that all extents are accounted for. If the extent maps are correct, there is no risk of incorrectly writing over existing data or metadata as blocks are allocated for new use.

btrfsck is able to read metadata in roughly disk order. As it scans the btrees on disk, it collects the locations of nodes and leaves and pulls them from the disk in large sequential batches. For the most part, btrfsck is bound by the sequential read throughput of the storage, and it is able to take advantage of multi-spindle arrays. The price paid for the extra speed is more RAM: btrfsck uses about 3x more RAM than ext2fsck.