Notes on the Linux Virtual File System (VFS)

Source: Internet. Editor: 程序博客网. Time: 2024/09/21 09:21

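Understanding the concepts first and then reading the source code gets twice the result for half the effort, so I collected a lot of material and recorded the key points here.

super_block is the superblock. Every mounted filesystem has one, and its member s_op, of type super_operations *, points to the methods that operate on the superblock.

Allocating an inode means calling the method supplied by the filesystem's own superblock when it provides one (otherwise a generic allocation is used). In fs/inode.c:
alloc_inode() -> sb->s_op->alloc_inode(sb);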

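s_op also provides methods such as destroy_inode, write_inode, put_super, and statfs. Methods like mknod belong to inode_operations instead.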
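The dispatch through s_op can be illustrated with a small user-space sketch. Everything here (toy_sb, toy_super_ops, myfs_alloc_inode, toy_alloc_inode) is a hypothetical stand-in rather than kernel code; it only shows the pattern of a generic layer calling a per-filesystem method through a table of function pointers and falling back to a default when the method is absent.

```c
#include <assert.h>
#include <stdlib.h>

struct toy_sb;

struct toy_inode {
    struct toy_sb *sb;
    int from_fs;    /* 1 if the filesystem's own allocator produced it */
};

/* Hypothetical stand-in for struct super_operations. */
struct toy_super_ops {
    struct toy_inode *(*alloc_inode)(struct toy_sb *sb);
};

struct toy_sb {
    const struct toy_super_ops *s_op;
};

/* A filesystem-specific allocator, as a concrete filesystem might supply. */
static struct toy_inode *myfs_alloc_inode(struct toy_sb *sb)
{
    struct toy_inode *inode = calloc(1, sizeof(*inode));

    (void)sb;
    if (inode)
        inode->from_fs = 1;
    return inode;
}

static const struct toy_super_ops myfs_ops = { .alloc_inode = myfs_alloc_inode };

/* Generic layer: prefer the filesystem's method, else use a default. */
static struct toy_inode *toy_alloc_inode(struct toy_sb *sb)
{
    struct toy_inode *inode;

    assert(sb != NULL);
    if (sb->s_op && sb->s_op->alloc_inode)
        inode = sb->s_op->alloc_inode(sb);
    else
        inode = calloc(1, sizeof(*inode));
    if (inode)
        inode->sb = sb;
    return inode;
}
```

A superblock whose s_op is &myfs_ops gets inodes from myfs_alloc_inode(); one with no table gets the default allocation.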

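An inode represents one object in the filesystem and has a unique identifier. The inode structure has two important members, inode_operations *i_op and file_operations *i_fop. i_op defines the methods that operate directly on the inode; i_fop defines the methods related to files and directories, that is, the ones behind the standard system calls. Their relationship is shown in the figure below:

Figure 5. The inode structure and its associated operations

The inode cache and the directory entry cache hold the most recently used inodes and dentries, respectively.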

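In a Linux filesystem every file is assigned a unique number, the inode number, which identifies its index node (inode). Inodes are stored in an inode table. In traditional on-disk filesystems this table is allocated when the disk is formatted, and every physical disk or partition has its own inode table. An inode holds all the information about a file except its name: the on-disk addresses of its data blocks, its size, its type, its timestamps, and so on. File names are stored in directory blocks. A directory block holds entries that pair a file name with an inode number, which is how names are tied to inodes. Consequently one inode may be referenced by several directory entries (hard links), but each directory entry refers to exactly one inode.

The superblock, in contrast, holds information about the disk or partition as a whole, such as the filesystem type and size. The superblock structure is defined as follows: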

// from /usr/src/kernel/`uname -r`/include/linux/fs.h
struct super_block {
    struct list_head s_list;                /* Keep this first */
    dev_t s_dev;                            /* search index; _not_ kdev_t */
    unsigned long s_blocksize;              // data block size in bytes
    unsigned char s_blocksize_bits;         // log2 of the block size
    unsigned char s_dirt;                   // dirty flag: set when the superblock has been modified
    unsigned long long s_maxbytes;          // maximum size of a single file
    struct file_system_type *s_type;        // filesystem type
    struct super_operations *s_op;          // operations supported by the superblock
    struct dquot_operations *dq_op;         // disk quota operations
    struct quotactl_ops *s_qcop;            // quota control operations
    struct export_operations *s_export_op;  // NFS export operations

    unsigned long s_flags;
    unsigned long s_magic;                  // magic number of the filesystem
    struct dentry *s_root;                  // dentry of the root directory
    struct rw_semaphore s_umount;
    struct mutex s_lock;
    int s_count;
    int s_syncing;
    int s_need_sync_fs;
    atomic_t s_active;                      // active reference count
    void *s_security;
    struct xattr_handler **s_xattr;

    struct list_head s_inodes;              // all inodes
    struct list_head s_dirty;               // dirty inodes
    struct list_head s_io;                  // inodes queued for writeback
    struct hlist_head s_anon;               // anonymous dentries for NFS exporting
    struct list_head s_files;               // head of the list of open files

    struct block_device *s_bdev;
    struct list_head s_instances;
    struct quota_info s_dquot;              // quota-specific options

    int s_frozen;
    wait_queue_head_t s_wait_unfrozen;

    char s_id[32];                          // informational name

    void *s_fs_info;                        /* Filesystem private info */
    /*
     * The next field is for VFS *only*. No filesystems have any business
     * even looking at it. You had been warned.
     */
    struct mutex s_vfs_rename_mutex;        /* Kludge */
    /* Granularity of c/m/atime in ns. Cannot be worse than a second */
    u32 s_time_gran;
};

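The filesystem on a disk or partition consists of the superblock + the inode table + the data blocks. The superblock comes first. The inode size and the data block size are fixed for a given filesystem, and their values depend on the filesystem type.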

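The steps for creating a file named userlist are:

1. Store the attributes. The kernel first finds a free inode; assume its number is 47 and that the file occupies 3 data blocks. The kernel records the file's information in the inode.

2. Store the data. The kernel finds 3 free blocks in the free block list and copies the data from the kernel buffer into them.

3. Record the allocation. The numbers of the 3 data blocks are recorded in the inode's list of disk block numbers, in the first 3 positions of that list.

4. Add the file name to the directory. The new file is named userlist, so the kernel adds the entry (47, userlist) to the directory file.

5. Step 3 shows that the block list cannot hold every block number of a large file; in the classic layout described here it has at most 13 entries. When a file needs more blocks than the list can address directly, Linux uses indirect blocks. For example, to record 14 data blocks, the inode stores the first 10 block numbers directly, and the other 4 block numbers are stored in a separate data block. Entry 11 of the inode's list holds a pointer to that block, through which the remaining 4 block numbers can be found. A data block used to hold block numbers in this way is called an indirect block. When one indirect block is not enough, second-level and third-level indirect blocks can be set up as well.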
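The direct/indirect scheme in step 5 can be sketched as a small classification function. This is only an illustration under stated assumptions: 10 direct entries as in the text, one entry each for single, double, and triple indirection (which is how 13 entries would be used), and a hypothetical parameter ptrs_per_block for how many block numbers fit in one indirect block. The names classify_block and blk_level are made up for this sketch.

```c
#include <assert.h>

#define NDIRECT 10    /* direct entries in the inode's list, as in the text */

enum blk_level { BLK_DIRECT, BLK_SINGLE, BLK_DOUBLE, BLK_TRIPLE, BLK_TOO_BIG };

/*
 * Classify logical block number n (0-based) of a file.
 * ptrs_per_block: block numbers that fit in one indirect block (assumed).
 * *index receives the 0-based position of the block within that level.
 */
static enum blk_level classify_block(unsigned long n,
                                     unsigned long ptrs_per_block,
                                     unsigned long *index)
{
    unsigned long p = ptrs_per_block;

    assert(index != 0 && p > 0);
    if (n < NDIRECT) {
        *index = n;
        return BLK_DIRECT;
    }
    n -= NDIRECT;
    if (n < p) {
        *index = n;
        return BLK_SINGLE;
    }
    n -= p;
    if (n < p * p) {
        *index = n;
        return BLK_DOUBLE;
    }
    n -= p * p;
    if (n < p * p * p) {
        *index = n;
        return BLK_TRIPLE;
    }
    *index = 0;
    return BLK_TOO_BIG;
}
```

With 256 block numbers per indirect block, the 14th block of a file (n = 13) falls in the single indirect block at position 3, matching the example in step 5 where 4 block numbers spill over.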

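Next, the process of looking up a file by the absolute path /tmp/temp/adb:

1. Find the dentry and inode of the root directory of the root filesystem.

2. Use the interface provided by that inode, i_op->lookup(), to find the dentry and inode of the next component, tmp, and in the same way use the inode of tmp to find the dentry and inode of temp.

3. Use the inode of temp to find the dentry and inode of the next component, adb.

As can be seen, the whole lookup repeats the same step for every path component, so it is recursive in nature.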
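The component-by-component lookup can be sketched on a toy in-memory tree. The structures and functions here (toy_dentry, toy_lookup, toy_path_walk) are hypothetical and much simpler than the kernel's: toy_lookup plays the role of i_op->lookup(), and the walk is written as a loop over path components.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_dentry {
    const char *name;
    struct toy_dentry *children[4];    /* NULL-terminated; enough for a demo */
};

/* Analogue of i_op->lookup(): find the child of dir whose name matches. */
static struct toy_dentry *toy_lookup(struct toy_dentry *dir,
                                     const char *name, size_t len)
{
    for (int i = 0; i < 4 && dir->children[i]; i++) {
        const char *cname = dir->children[i]->name;

        if (strlen(cname) == len && memcmp(cname, name, len) == 0)
            return dir->children[i];
    }
    return NULL;
}

/* Resolve an absolute path one component at a time, starting at root. */
static struct toy_dentry *toy_path_walk(struct toy_dentry *root, const char *path)
{
    struct toy_dentry *cur = root;

    assert(root != NULL && path != NULL);
    while (cur && *path) {
        while (*path == '/')            /* skip separators */
            path++;
        if (!*path)
            break;
        size_t len = strcspn(path, "/"); /* length of this component */
        cur = toy_lookup(cur, path, len);
        path += len;
    }
    return cur;                          /* NULL if some component is missing */
}
```

For a tree root -> tmp -> temp -> adb, walking "/tmp/temp/adb" performs three lookups, one per component, and returns the adb entry.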

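The relationship among inode, dentry, and superblock, seen from the point of view of a process:

Each time a process opens a file, a file structure is created for it. The same process can open the same file several times and obtain several file structures, all of which refer to the same dentry. Two dentries that refer to the same inode are the result of a hard link (the ln command).

A dentry and an inode can only describe a physical file; they cannot describe the notion of a file being "open", which is why the file structure is introduced.

Every process has a structure of type files_struct, whose fd_array member is an array dedicated to holding pointers to file structures. User space uses the fd handle returned when the file was opened to find the corresponding file structure in fd_array.

Figure: the relationship between files_struct, the fd array, and the file structures.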
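The point that each open produces its own file object, while both objects share one dentry, can be sketched with a toy descriptor table. All names here (toy_files, toy_file, toy_open, toy_fget, file_pool) are hypothetical and only mimic the shape of the kernel structures.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_OPEN 8

struct toy_dentry { const char *name; };

struct toy_file {
    struct toy_dentry *dentry;    /* what is open */
    long pos;                     /* per-open read/write position */
    int in_use;
};

struct toy_files {
    struct toy_file *fd_array[TOY_NR_OPEN];
};

static struct toy_file file_pool[TOY_NR_OPEN];

/* Take a free file object, bind it to dentry, install it at the lowest free fd. */
static int toy_open(struct toy_files *files, struct toy_dentry *dentry)
{
    assert(files != NULL && dentry != NULL);
    for (int fd = 0; fd < TOY_NR_OPEN; fd++) {
        if (files->fd_array[fd])
            continue;
        for (int i = 0; i < TOY_NR_OPEN; i++) {
            if (!file_pool[i].in_use) {
                file_pool[i] = (struct toy_file){ dentry, 0, 1 };
                files->fd_array[fd] = &file_pool[i];
                return fd;
            }
        }
        return -1;                /* no free file object */
    }
    return -1;                    /* no free descriptor */
}

/* Map a descriptor back to its file object, or NULL if it is not open. */
static struct toy_file *toy_fget(struct toy_files *files, int fd)
{
    if (fd < 0 || fd >= TOY_NR_OPEN)
        return NULL;
    return files->fd_array[fd];
}
```

Opening the same dentry twice yields two descriptors whose file objects have independent positions but the same dentry pointer.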

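The relationship between dentry and dentry_cache:

dentry_cache, or dcache for short, is the directory entry cache that Linux uses to make handling dentry objects efficient. It consists mainly of two data structures:
1. The hash table dentry_hashtable: every dentry object in the dcache is linked into the appropriate hash chain through its d_hash field.
2. The list of unused dentry objects, dentry_unused: every dentry in the dcache that is in the unused or negative state is linked into dentry_unused through its d_lru field. This list is also called the LRU list. (dentry_unused is the global list of older kernels; later kernels keep a per-superblock LRU list instead.)
The dcache is the master of the inode cache (icache): the dentry objects in the dcache control the lifetime of the inode objects in the icache. As long as a dentry exists in the dcache and is not negative, its inode always exists, because the inode's reference count i_count stays greater than 0. When a dentry in the dcache is released, iput() is called on the corresponding inode.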
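The "dcache is the master of icache" rule is essentially reference counting. The sketch below is a simplified model with hypothetical names (toy_inode, toy_dentry, toy_instantiate, toy_dentry_release, toy_iput); in particular, each dentry takes its own reference here, and the inode is marked evicted as soon as the count reaches zero, whereas a real kernel may keep an unreferenced inode cached for a while.

```c
#include <assert.h>
#include <stddef.h>

struct toy_inode {
    int i_count;     /* number of references held */
    int evicted;     /* set once the last reference is dropped */
};

struct toy_dentry {
    struct toy_inode *d_inode;    /* NULL means a negative dentry */
};

/* Drop one reference; the inode goes away when none are left. */
static void toy_iput(struct toy_inode *inode)
{
    assert(inode->i_count > 0);
    if (--inode->i_count == 0)
        inode->evicted = 1;
}

/* Attach an inode to a dentry; in this toy the dentry takes a reference. */
static void toy_instantiate(struct toy_dentry *d, struct toy_inode *inode)
{
    inode->i_count++;
    d->d_inode = inode;
}

/* Releasing a dentry drops the reference it held on its inode. */
static void toy_dentry_release(struct toy_dentry *d)
{
    if (d->d_inode) {
        toy_iput(d->d_inode);
        d->d_inode = NULL;
    }
}
```

With two dentries attached to one inode (a hard link), releasing the first leaves the inode alive; only releasing the second drops the count to zero.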

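Every filesystem module has an initialization routine whose job is to register the filesystem with the VFS, that is, to fill in a data structure called file_system_type. The file_system_type structures of all registered filesystems form a linked list, which we call the registration list.

Figure: the registration list of file_system_type structures.

Every mount searches this registration list for the entry that matches the filesystem on the device, and takes from it the read_super() function (in the kernel code analyzed below, the corresponding hook is the mount() method) to obtain the device's superblock (a block stored on the device itself that records information about the storage device) and parse its contents. Because each filesystem type has its own superblock format and its own specific information, each filesystem must be parsed by its matching function; otherwise the kernel does not recognize the filesystem and cannot mount it. This is the point of registering filesystems.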
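The registration list and the search performed at mount time can be sketched as a singly linked list keyed by name. The names toy_fs_type, toy_file_systems, toy_register_filesystem, and toy_get_fs_type are hypothetical; the sketch only shows the idea of registering once and looking a type up by name later.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_fs_type {
    const char *name;
    struct toy_fs_type *next;
};

static struct toy_fs_type *toy_file_systems;    /* head of the registration list */

/* Append fs to the list; returns 0, or -1 if the name is already registered. */
static int toy_register_filesystem(struct toy_fs_type *fs)
{
    struct toy_fs_type **p;

    assert(fs != NULL && fs->name != NULL);
    for (p = &toy_file_systems; *p; p = &(*p)->next)
        if (strcmp((*p)->name, fs->name) == 0)
            return -1;
    fs->next = NULL;
    *p = fs;                                    /* link at the tail */
    return 0;
}

/* What a mount would do first: find the registered type by name. */
static struct toy_fs_type *toy_get_fs_type(const char *name)
{
    for (struct toy_fs_type *p = toy_file_systems; p; p = p->next)
        if (strcmp(p->name, name) == 0)
            return p;
    return NULL;                                /* unknown filesystem */
}
```

A lookup for a name that was never registered returns NULL, which corresponds to the case in the text where the kernel cannot mount a filesystem it does not recognize.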

With these concepts in place, the source code analysis follows, based on the Linux 3.5 code.

How the root directory is created.

Start from init_mount_tree() in fs/namespace.c.

static void __init init_mount_tree(void)
{
    struct vfsmount *mnt;
    struct mnt_namespace *ns;
    struct path root;

    mnt = do_kern_mount("rootfs", 0, "rootfs", NULL);
    if (IS_ERR(mnt))
        panic("Can't create rootfs");
    ns = create_mnt_ns(mnt);
    if (IS_ERR(ns))
        panic("Can't allocate initial namespace");

    init_task.nsproxy->mnt_ns = ns;
    get_mnt_ns(ns);

    root.mnt = ns->root;
    root.dentry = ns->root->mnt_root;

    set_fs_pwd(current->fs, &root);
    set_fs_root(current->fs, &root);
}

The do_kern_mount() function:

struct vfsmount *do_kern_mount(const char *fstype, int flags, const char *name, void *data)
{
    struct file_system_type *type = get_fs_type(fstype);
    struct vfsmount *mnt;

    if (!type)
        return ERR_PTR(-ENODEV);
    mnt = vfs_kern_mount(type, flags, name, data);
    if (!IS_ERR(mnt) && (type->fs_flags & FS_HAS_SUBTYPE) &&
        !mnt->mnt_sb->s_subtype)
        mnt = fs_set_subtype(mnt, fstype);
    put_filesystem(type);
    return mnt;
}

vfs_kern_mount():

struct vfsmount *
vfs_kern_mount(struct file_system_type *type, int flags, const char *name, void *data)
{
    struct vfsmount *mnt;
    struct dentry *root;

    if (!type)
        return ERR_PTR(-ENODEV);

    mnt = alloc_vfsmnt(name);
    if (!mnt)
        return ERR_PTR(-ENOMEM);

    if (flags & MS_KERNMOUNT)
        mnt->mnt_flags = MNT_INTERNAL;

    root = mount_fs(type, flags, name, data);
    if (IS_ERR(root)) {
        free_vfsmnt(mnt);
        return ERR_CAST(root);
    }

    mnt->mnt_root = root;
    mnt->mnt_sb = root->d_sb;
    mnt->mnt_mountpoint = mnt->mnt_root;
    mnt->mnt_parent = mnt;
    return mnt;
}

Note first the assignment mnt->mnt_root = root: root is the root dentry of the superblock, which from the user-space point of view is the root directory of that filesystem. Every mounted filesystem has a vfsmount structure. The vfsmount structures of the various filesystems are linked into a list, and the mnt_sb member of each one points to its own superblock.

mount_fs():

struct dentry *mount_fs(struct file_system_type *type, int flags, const char *name, void *data)
{
    struct dentry *root;
    struct super_block *sb;
    .................
    root = type->mount(type, flags, name, data);
    sb = root->d_sb;
    .....................
}

This calls back the function that the mount member was set to when the filesystem was registered. For a ramfs-based filesystem that function in turn calls mount_nodev(), passing ramfs_fill_super as the fill_super argument:

struct dentry *mount_nodev(struct file_system_type *fs_type,
    int flags, void *data,
    int (*fill_super)(struct super_block *, void *, int))
{
    int error;
    struct super_block *s = sget(fs_type, NULL, set_anon_super, NULL);    // the superblock is created here
    ..........................
    error = fill_super(s, data, flags & MS_SILENT ? 1 : 0);    // creates the root inode and root dentry
    ..........................
    return dget(s->s_root);    // take a reference on the root dentry
}

int ramfs_fill_super(struct super_block *sb, void *data, int silent)
{
    struct ramfs_fs_info *fsi;
    struct inode *inode = NULL;
    struct dentry *root;
    int err;
    ........................
    sb->s_maxbytes      = MAX_LFS_FILESIZE;     // maximum file size
    sb->s_blocksize     = PAGE_CACHE_SIZE;      // block size in bytes
    sb->s_blocksize_bits    = PAGE_CACHE_SHIFT; // log2 of the block size
    sb->s_magic     = RAMFS_MAGIC;              // magic number
    sb->s_op        = &ramfs_ops;               // superblock methods, used when handling inodes
    sb->s_time_gran     = 1;

    inode = ramfs_get_inode(sb, NULL, S_IFDIR | fsi->mount_opts.mode, 0);    // create the inode of the root directory
    if (!inode) {
        err = -ENOMEM;
        goto fail;
    }

    root = d_alloc_root(inode);    // create the dentry object of the root directory
    sb->s_root = root;             // point the superblock's root at the newly created directory '/'
    ........................
}

Creating the inode is an important step, because the inode supplies the methods for operating on the file (note that a directory is a file too):

struct inode *ramfs_get_inode(struct super_block *sb,
                const struct inode *dir, int mode, dev_t dev)
{
    struct inode * inode = new_inode(sb);    // allocate and initialize a new inode for this superblock

    if (inode) {
        inode->i_ino = get_next_ino();
        inode_init_owner(inode, dir, mode);
        inode->i_mapping->a_ops = &ramfs_aops;
        inode->i_mapping->backing_dev_info = &ramfs_backing_dev_info;
        mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
        mapping_set_unevictable(inode->i_mapping);
        inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
        switch (mode & S_IFMT) {
        default:
            init_special_inode(inode, mode, dev);    // methods for special files: character devices, block devices, and so on
            break;
        case S_IFREG:    // regular file
            inode->i_op = &ramfs_file_inode_operations;
            inode->i_fop = &ramfs_file_operations;
            break;
        case S_IFDIR:    // directory
            inode->i_op = &ramfs_dir_inode_operations;
            inode->i_fop = &simple_dir_operations;

            /* directory inodes start off with i_nlink == 2 (for "." entry) */
            inc_nlink(inode);
            break;
        case S_IFLNK:    // symbolic link
            inode->i_op = &page_symlink_inode_operations;
            break;
        }
    }
    return inode;
}

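From the special-file handling above one can guess how the drivers we usually write fit in: a character device driver, for example, defines its own file_operations structure, and it is through init_special_inode() that calls get redirected to it. This will be analyzed later.

Having an inode is not enough to find it, so a directory entry must be created; the inode can then be reached through the dentry. The next step is therefore to create the root directory object:

struct dentry * d_alloc_root(struct inode * root_inode)
{
    struct dentry *res = NULL;

    if (root_inode) {
        static const struct qstr name = { .name = "/", .len = 1 };    // the name of this directory is simply '/'

        res = d_alloc(NULL, &name);
        if (res) {
            res->d_sb = root_inode->i_sb;
            d_set_d_op(res, res->d_sb->s_d_op);
            res->d_parent = res;    // the root directory's parent is itself; a dentry whose parent is itself is a root
            d_instantiate(res, root_inode);    // bind inode and dentry, so the inode can later be found through the dentry
        }
    }
    return res;
}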

At this point both the dentry and the inode of the root directory have been created.
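The convention that a root dentry is its own parent gives a simple termination test when walking upward. The following sketch uses hypothetical names (toy_dentry, toy_is_root, toy_depth) to show that test and one use of it; it is an illustration, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

struct toy_dentry {
    const char *name;
    struct toy_dentry *d_parent;    /* a root points to itself */
};

/* A dentry is a root exactly when it is its own parent. */
static int toy_is_root(const struct toy_dentry *d)
{
    return d->d_parent == d;
}

/* Count how many steps separate d from the root of its tree. */
static int toy_depth(const struct toy_dentry *d)
{
    int depth = 0;

    assert(d != NULL);
    while (!toy_is_root(d)) {
        d = d->d_parent;
        depth++;
    }
    return depth;
}
```

For a chain / -> home -> lsc, toy_depth() of lsc is 2 and of the root is 0; the loop needs no NULL check because the self-reference at the root stops it.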

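Next, the process of creating a file below the root directory.

Take creating /home/lsc/hello.c as an example. The lookup was described earlier: find the inode associated with the lsc dentry, then call the appropriate inode->i_op method to create the file in memory or on disk. Note that cache space is limited. Right after a filesystem is mounted, the cache holds only the dentry and inode of the root directory; the dentries and inodes of other files are created in the cache only when they are needed. If the dentry and inode of a directory are in the cache, those of its parent directory are guaranteed to be in the cache as well.

First a directory named 'lsc' has to be created in its parent directory. The system call for creating a directory is:

SYSCALL_DEFINE2(mkdir, const char __user *, pathname, int, mode)
{
    return sys_mkdirat(AT_FDCWD, pathname, mode);
}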

It calls mkdirat:

SYSCALL_DEFINE3(mkdirat, int, dfd, const char __user *, pathname, int, mode)
{
    int error = 0;
    char * tmp;
    struct dentry *dentry;
    struct nameidata nd;

    error = user_path_parent(dfd, pathname, &nd, &tmp);    // walk the path down to the parent directory of the new entry
    ...................
    dentry = lookup_create(&nd, 1);    // look up the last component and obtain the dentry for the entry to be created
    ..........................
    error = vfs_mkdir(nd.path.dentry->d_inode, dentry, mode);    // this calls the mkdir method of the parent directory's inode
    ......................
}

int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
    int error = may_create(dir, dentry);

    if (error)
        return error;

    if (!dir->i_op->mkdir)
        return -EPERM;

    mode &= (S_IRWXUGO|S_ISVTX);
    error = security_inode_mkdir(dir, dentry, mode);
    if (error)
        return error;

    error = dir->i_op->mkdir(dir, dentry, mode);
    if (!error)
        fsnotify_mkdir(dir, dentry);
    return error;
}

As described above, the mkdir method of i_op is called to create the directory:

static int ramfs_mkdir(struct inode * dir, struct dentry * dentry, int mode)
{
    int retval = ramfs_mknod(dir, dentry, mode | S_IFDIR, 0);    // the mode says: create a directory
    if (!retval)
        inc_nlink(dir);
    return retval;
}

static int
ramfs_mknod(struct inode *dir, struct dentry *dentry, int mode, dev_t dev)
{
    struct inode * inode = ramfs_get_inode(dir->i_sb, dir, mode, dev);    // analyzed above
    int error = -ENOSPC;

    if (inode) {
        d_instantiate(dentry, inode);    // bind dentry and inode
        dget(dentry);   /* Extra count - pin the dentry in core */
        error = 0;
        dir->i_mtime = dir->i_ctime = CURRENT_TIME;    // update the parent directory's timestamps; as said earlier, times are recorded in the inode
    }
    return error;
}

Next, create the file hello.c in the lsc directory. The system call examined here is mknod (a regular file is more commonly created with open() and O_CREAT, but mknod also accepts regular files):

SYSCALL_DEFINE3(mknod, const char __user *, filename, int, mode, unsigned, dev)
{
    return sys_mknodat(AT_FDCWD, filename, mode, dev);
}

SYSCALL_DEFINE4(mknodat, int, dfd, const char __user *, filename, int, mode,
        unsigned, dev)
{
    int error;
    char *tmp;
    struct dentry *dentry;
    struct nameidata nd;

    if (S_ISDIR(mode))    // directories are rejected here; they are created with mkdir
        return -EPERM;

    error = user_path_parent(dfd, filename, &nd, &tmp);
    if (error)
        return error;

    dentry = lookup_create(&nd, 0);
    if (IS_ERR(dentry)) {
        error = PTR_ERR(dentry);
        goto out_unlock;
    }
    if (!IS_POSIXACL(nd.path.dentry->d_inode))
        mode &= ~current_umask();
    error = may_mknod(mode);
    if (error)
        goto out_dput;
    error = mnt_want_write(nd.path.mnt);
    if (error)
        goto out_dput;
    error = security_path_mknod(&nd.path, dentry, mode, dev);
    if (error)
        goto out_drop_write;
    switch (mode & S_IFMT) {
        case 0: case S_IFREG:    // regular file
            error = vfs_create(nd.path.dentry->d_inode,dentry,mode,&nd);
            break;
        case S_IFCHR: case S_IFBLK:    // character device or block device
            error = vfs_mknod(nd.path.dentry->d_inode,dentry,mode,
                    new_decode_dev(dev));
            break;
        case S_IFIFO: case S_IFSOCK:    // FIFO or socket file
            error = vfs_mknod(nd.path.dentry->d_inode,dentry,mode,0);
            break;
    }
    ................
}

We are creating a regular file:

int vfs_create(struct inode *dir, struct dentry *dentry, int mode,
        struct nameidata *nd)
{
    int error = may_create(dir, dentry);

    if (error)
        return error;

    if (!dir->i_op->create)
        return -EACCES; /* shouldn't it be ENOSYS? */
    mode &= S_IALLUGO;
    mode |= S_IFREG;
    error = security_inode_create(dir, dentry, mode);
    if (error)
        return error;
    error = dir->i_op->create(dir, dentry, mode, nd);
    if (!error)
        fsnotify_create(dir, dentry);
    return error;
}

This is much like creating a directory: it also ends in a call to an i_op method of the parent directory. With that, the file has been created.
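From user space, the two kernel paths traced above are reached with the mkdir() and mknod() library calls. The sketch below is only an illustration: the scratch location under /tmp and the function name create_demo_tree are assumptions, and it uses mknod() with S_IFREG simply because that is the system call followed in the text.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Creates <scratch>/lsc/hello.c, checks it, cleans up. Returns 0 on success. */
static int create_demo_tree(void)
{
    char base[] = "/tmp/vfs-demo-XXXXXX";    /* assumed scratch location */
    char dir[64], file[80];
    struct stat st;
    int ret = -1;

    if (!mkdtemp(base))                      /* unique scratch directory */
        return -1;
    snprintf(dir, sizeof(dir), "%s/lsc", base);
    snprintf(file, sizeof(file), "%s/hello.c", dir);

    if (mkdir(dir, 0755) == 0) {                     /* mkdir system call */
        if (mknod(file, S_IFREG | 0644, 0) == 0) {   /* mknod, regular file */
            if (stat(file, &st) == 0 && S_ISREG(st.st_mode))
                ret = 0;
            unlink(file);
        }
        rmdir(dir);
    }
    rmdir(base);
    return ret;
}
```

Calling create_demo_tree() from main() and checking for a zero return confirms that both the directory and the regular file were created; the function removes everything it created before returning.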

With the file created, next comes the process of opening a file.

The open system call is:

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, int, mode)
{
    long ret;

    if (force_o_largefile())
        flags |= O_LARGEFILE;

    ret = do_sys_open(AT_FDCWD, filename, flags, mode);
    /* avoid REGPARM breakage on x86: */
    asmlinkage_protect(3, ret, filename, flags, mode);
    return ret;
}

long do_sys_open(int dfd, const char __user *filename, int flags, int mode)
{
    struct open_flags op;
    int lookup = build_open_flags(flags, mode, &op);    // validate and process the open mode flags
    char *tmp = getname(filename);    // copy the path name from user space
    int fd = PTR_ERR(tmp);

    if (!IS_ERR(tmp)) {
        fd = get_unused_fd_flags(flags);    // allocate an unused fd
        if (fd >= 0) {
            struct file *f = do_filp_open(dfd, tmp, &op, lookup);
            if (IS_ERR(f)) {
                put_unused_fd(fd);
                fd = PTR_ERR(f);
            } else {
                fsnotify_open(f);     // kernel event notification; user space can watch the file's state
                fd_install(fd, f);    // store the new file structure at fdtable->fd[fd] of the current process, so it can later be found by fd
            }
        }
        putname(tmp);
    }
    return fd;
}

do_filp_open() calls path_openat():

static struct file *path_openat(int dfd, const char *pathname,
        struct nameidata *nd, const struct open_flags *op, int flags)
{
    struct file *base = NULL;
    struct file *filp;
    struct path path;
    int error;

    filp = get_empty_filp();    // allocate the file structure here
    if (!filp)
        return ERR_PTR(-ENFILE);

    filp->f_flags = op->open_flag;
    nd->intent.open.file = filp;
    nd->intent.open.flags = open_to_namei_flags(op->open_flag);
    nd->intent.open.create_mode = op->mode;

    error = path_init(dfd, pathname, flags | LOOKUP_PARENT, nd, &base);
    if (unlikely(error))
        goto out_filp;

    current->total_link_count = 0;
    error = link_path_walk(pathname, nd);    // walk the path, component by component, toward the file to open
    if (unlikely(error))
        goto out_filp;

    filp = do_last(nd, &path, op, pathname);    // eventually calls __dentry_open() to perform the open
    .........................
}

static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
                    struct file *f,
                    int (*open)(struct inode *, struct file *),
                    const struct cred *cred)
{
    static const struct file_operations empty_fops = {};
    struct inode *inode;
    int error;

    f->f_mode = OPEN_FMODE(f->f_flags) | FMODE_LSEEK |
                FMODE_PREAD | FMODE_PWRITE;

    if (unlikely(f->f_flags & O_PATH))
        f->f_mode = FMODE_PATH;

    inode = dentry->d_inode;
    if (f->f_mode & FMODE_WRITE) {
        error = __get_file_write_access(inode, mnt);
        if (error)
            goto cleanup_file;
        if (!special_file(inode->i_mode))
            file_take_write(f);
    }

    f->f_mapping = inode->i_mapping;
    f->f_path.dentry = dentry;
    f->f_path.mnt = mnt;
    f->f_pos = 0;    // initialize the read/write position to 0
    file_sb_list_add(f, inode->i_sb);

    if (unlikely(f->f_mode & FMODE_PATH)) {
        f->f_op = &empty_fops;
        return f;
    }

    f->f_op = fops_get(inode->i_fop);    // get the file methods from the inode

    error = security_dentry_open(f, cred);
    if (error)
        goto cleanup_all;

    if (!open && f->f_op)
        open = f->f_op->open;
    if (open) {
        error = open(inode, f);    // finally call the open method
        if (error)
            goto cleanup_all;
    }
    if ((f->f_mode & (FMODE_READ | FMODE_WRITE)) == FMODE_READ)
        i_readcount_inc(inode);

    f->f_flags &= ~(O_CREAT | O_EXCL | O_NOCTTY | O_TRUNC);

    file_ra_state_init(&f->f_ra, f->f_mapping->host->i_mapping);

    /* NB: we're sure to have correct a_ops only after f_op->open */
    if (f->f_flags & O_DIRECT) {
        if (!f->f_mapping->a_ops ||
            ((!f->f_mapping->a_ops->direct_IO) &&
            (!f->f_mapping->a_ops->get_xip_mem))) {
            fput(f);
            f = ERR_PTR(-EINVAL);
        }
    }

    return f;
    /* the cleanup_all and cleanup_file error paths are not shown in this excerpt */
}

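The open method called here comes from inode->i_fop. For a regular ramfs file that is the ramfs_file_operations structure installed by ramfs_get_inode() above (not ramfs_file_inode_operations, which is the i_op table); if the table has no open method, the call is simply skipped, as the if (open) test shows. For a character device, the open method of the file_operations written in the driver is called.

Reading and writing work much like the open flow above: given the fd, the file structure is found, and through it the corresponding read or write method in file->f_op.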

The read and write system calls are listed below:

SYSCALL_DEFINE3(read, unsigned int, fd, char __user *, buf, size_t, count)
{
    struct file *file;
    ssize_t ret = -EBADF;
    int fput_needed;

    file = fget_light(fd, &fput_needed);
    if (file) {
        loff_t pos = file_pos_read(file);
        ret = vfs_read(file, buf, count, &pos);
        file_pos_write(file, pos);
        fput_light(file, fput_needed);
    }

    return ret;
}

SYSCALL_DEFINE3(write, unsigned int, fd, const char __user *, buf,
        size_t, count)
{
    struct file *file;
    ssize_t ret = -EBADF;
    int fput_needed;

    file = fget_light(fd, &fput_needed);
    if (file) {
        loff_t pos = file_pos_read(file);
        ret = vfs_write(file, buf, count, &pos);
        file_pos_write(file, pos);
        fput_light(file, fput_needed);
    }

    return ret;
}
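Both system calls read the position from the file structure (file_pos_read) and store it back afterwards (file_pos_write), so the offset belongs to the open file, not to the inode. The user-space sketch below checks this with standard POSIX calls; the scratch file path and the function name offsets_are_per_open are assumptions made for the illustration.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Returns 0 if two opens of one file keep independent offsets, -1 otherwise. */
static int offsets_are_per_open(void)
{
    char path[] = "/tmp/vfs-rw-XXXXXX";    /* assumed scratch file */
    char buf[8] = {0};
    int ret = -1;

    int wfd = mkstemp(path);               /* first open of the file */
    if (wfd < 0)
        return -1;
    int rfd = open(path, O_RDONLY);        /* second, independent open */

    if (rfd >= 0 &&
        write(wfd, "hello", 5) == 5 &&     /* advances only wfd's position */
        lseek(wfd, 0, SEEK_CUR) == 5 &&
        lseek(rfd, 0, SEEK_CUR) == 0 &&    /* rfd is still at the start */
        read(rfd, buf, 5) == 5 &&
        memcmp(buf, "hello", 5) == 0 &&
        lseek(rfd, 0, SEEK_CUR) == 5)      /* the read advanced rfd only */
        ret = 0;

    if (rfd >= 0)
        close(rfd);
    close(wfd);
    unlink(path);
    return ret;
}
```

The write through the first descriptor leaves the second descriptor's position at 0, which is why the following read on it returns the data from the beginning of the file.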