Besides xlog, which operations may also need fsync?


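The Postgres2015 national user conference will be held on November 20 and 21 at the Liting Huayuan Hotel (丽亭华苑酒店) in Beijing. The lineup is strong: China's top PostgreSQL database experts will all attend, and database experts from Europe, Russia, Japan, the United States and other countries and regions have been specially invited:

  • SUZUKI Koichi, founder of the Postgres-XC project
  • Mason Sharp, founder of the Postgres-XL project
  • Tatsuo Ishii, author of pgpool
  • Kaigai Kohei, author of PG-Strom
  • Yao Yandong, R&D director at Greenplum
  • Zhou Zhengzhong (Digoal), one of the founders of the PostgreSQL China User Group
  • Wang Yang, manager of the database technology department at Ping An Technology
  • ……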


 
  • Registration for the 2015 PG conference: http://postgres2015.eventdove.com/
  • PostgreSQL China community: http://postgres.cn/
  • PostgreSQL professional group 1: 3336901 (full)
  • PostgreSQL professional group 2: 100910388
  • PostgreSQL professional group 3: 150657323



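  • We know that one important responsibility of xlog is to protect the durability of transactions that users have committed to the database.

    This means that after a user commits a transaction, the commit must first wait for the fsync of that transaction's XLOG to complete. So xlog involves continuous fsyncs (issued periodically by the WAL writer; a user process also flushes XLOG itself, for example at a synchronous commit whose record has not been flushed yet, or when it cannot obtain a free XLOG buffer)  (http://blog.163.com/digoal@126/blog/static/163877040201573564223/).
    XLOG also has another design goal: turning scattered IO into sequential IO, because XLOG files are preallocated and written sequentially.
    Without XLOG, a committing transaction would have to fsync every object it modified, which could involve a lot of scattered IO and would also make it harder for the operating system to merge IO.
    So the question is: besides xlog, do any other operations need fsync?
    The answer is yes, but such fsyncs are becoming rarer; at least in paths where response time matters, PostgreSQL tries hard to avoid fsyncs outside XLOG.
    Non-xlog fsyncs therefore remain mainly in operations where response time matters less.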
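To make the "one sequential log, one fsync per commit" idea concrete, here is a minimal sketch in plain POSIX C. It is not PostgreSQL code: `log_commit` and `write_all` are hypothetical names, and the log is a simple append-only file rather than a preallocated WAL segment.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write the whole buffer, retrying on short writes and EINTR. */
static int
write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL);
    while (len > 0)
    {
        ssize_t n = write(fd, p, len);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }
    return 0;
}

/*
 * Hypothetical commit path: append one record to a single log file and
 * make it durable with one fsync.  The data files the record describes
 * are not touched here, which is the point of logging first.
 * Returns 0 on success, -1 on error.
 */
int
log_commit(const char *log_path, const void *rec, size_t len)
{
    int fd = open(log_path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    int rc;

    if (fd < 0)
        return -1;
    rc = write_all(fd, rec, len);
    if (rc == 0 && fsync(fd) != 0)
        rc = -1;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

However many data pages a transaction touched, the commit in this sketch waits on one append and one fsync of the same file.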

    For example:
    1. initdb
    src/bin/initdb/initdb.c
    /*
     * Issue fsync recursively on PGDATA and all its contents.
     *
     * We fsync regular files and directories wherever they are, but we
     * follow symlinks only for pg_xlog and immediately under pg_tblspc.
     * Other symlinks are presumed to point at files we're not responsible
     * for fsyncing, and might not have privileges to write at all.
     *
     * Errors are reported but not considered fatal.
     */
    static void
    fsync_pgdata(void)
    {
            bool            xlog_is_symlink;
            char            pg_xlog[MAXPGPATH];
            char            pg_tblspc[MAXPGPATH];

            fputs(_("syncing data to disk ... "), stdout);
            fflush(stdout);

            snprintf(pg_xlog, MAXPGPATH, "%s/pg_xlog", pg_data);
            snprintf(pg_tblspc, MAXPGPATH, "%s/pg_tblspc", pg_data);

            /*
             * If pg_xlog is a symlink, we'll need to recurse into it separately,
             * because the first walkdir below will ignore it.
             */
            xlog_is_symlink = false;

    #ifndef WIN32
            {
                    struct stat st;

                    if (lstat(pg_xlog, &st) < 0)
                            fprintf(stderr, _("%s: could not stat file \"%s\": %s\n"),
                                            progname, pg_xlog, strerror(errno));
                    else if (S_ISLNK(st.st_mode))
                            xlog_is_symlink = true;
            }
    #else
            if (pgwin32_is_junction(pg_xlog))
                    xlog_is_symlink = true;
    #endif

            /*
             * If possible, hint to the kernel that we're soon going to fsync the data
             * directory and its contents.
             */
    #ifdef PG_FLUSH_DATA_WORKS
            walkdir(pg_data, pre_sync_fname, false);
            if (xlog_is_symlink)
                    walkdir(pg_xlog, pre_sync_fname, false);
            walkdir(pg_tblspc, pre_sync_fname, true);
    #endif

            /*
             * Now we do the fsync()s in the same order.
             *
             * The main call ignores symlinks, so in addition to specially processing
             * pg_xlog if it's a symlink, pg_tblspc has to be visited separately with
             * process_symlinks = true.  Note that if there are any plain directories
             * in pg_tblspc, they'll get fsync'd twice.  That's not an expected case
             * so we don't worry about optimizing it.
             */
            walkdir(pg_data, fsync_fname_ext, false);
            if (xlog_is_symlink)
                    walkdir(pg_xlog, fsync_fname_ext, false);
            walkdir(pg_tblspc, fsync_fname_ext, true);

            check_ok();
    }
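The helpers `walkdir` and `fsync_fname_ext` are not shown in the excerpt above. As a rough illustration of what a recursive fsync pass involves, the following standalone sketch walks a directory tree and fsyncs every regular file and directory it finds. The names `sync_tree` and `sync_path` are hypothetical, symlinks are simply skipped (unlike the special handling of pg_xlog and pg_tblspc above), and errors are counted rather than treated as fatal.

```c
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* fsync one path.  Directories are opened read-only, files read-write. */
static int
sync_path(const char *path, int is_dir)
{
    int fd = open(path, is_dir ? O_RDONLY : O_RDWR);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * Recursively fsync all regular files and directories below "dir", then
 * "dir" itself.  Symlinks and other file types are ignored.
 * Returns the number of paths that could not be synced.
 */
int
sync_tree(const char *dir)
{
    DIR *d;
    struct dirent *de;
    int failures = 0;

    assert(dir != NULL);
    d = opendir(dir);
    if (d == NULL)
        return 1;

    while ((de = readdir(d)) != NULL)
    {
        char path[4096];
        struct stat st;

        if (strcmp(de->d_name, ".") == 0 || strcmp(de->d_name, "..") == 0)
            continue;
        snprintf(path, sizeof(path), "%s/%s", dir, de->d_name);

        /* lstat, not stat: do not follow symlinks */
        if (lstat(path, &st) != 0)
        {
            failures++;
            continue;
        }
        if (S_ISDIR(st.st_mode))
            failures += sync_tree(path);
        else if (S_ISREG(st.st_mode) && sync_path(path, 0) != 0)
            failures++;
    }
    closedir(d);

    /* the directory entry list itself also has to reach disk */
    if (sync_path(dir, 1) != 0)
        failures++;
    return failures;
}
```

Unlike fsync_pgdata, this sketch has no pre-sync hinting pass, so the kernel gets no advance notice of the writes it is about to be asked to flush.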

    2. CREATE DATABASE or ALTER DATABASE ... SET TABLESPACE
    src/backend/commands/dbcommands.c
    copydir@src/backend/storage/file/copydir.c
    Every copied file has to be fsynced, so the volume of fsync calls is fairly large.
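The cost comes from paying one fsync per copied file. The sketch below shows that per-file step in isolation. It is an illustration only, not the code in copydir.c: `copy_file_durably` is a hypothetical name, and it fsyncs the destination immediately after writing it, which is a simplification.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy "src" to a new file "dst" and fsync the copy before returning.
 * Fails if "dst" already exists.  Returns 0 on success, -1 on error.
 */
int
copy_file_durably(const char *src, const char *dst)
{
    char buf[8192];
    ssize_t n;
    int in;
    int out;
    int rc = 0;

    assert(src != NULL && dst != NULL);
    in = open(src, O_RDONLY);
    if (in < 0)
        return -1;
    out = open(dst, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (out < 0)
    {
        close(in);
        return -1;
    }

    while (rc == 0 && (n = read(in, buf, sizeof(buf))) > 0)
    {
        ssize_t off = 0;

        /* handle short writes */
        while (off < n)
        {
            ssize_t w = write(out, buf + off, (size_t) (n - off));

            if (w < 0)
            {
                rc = -1;
                break;
            }
            off += w;
        }
    }
    if (rc == 0 && n < 0)
        rc = -1;

    /* one fsync per copied file: this is the expensive part */
    if (rc == 0 && fsync(out) != 0)
        rc = -1;

    close(in);
    if (close(out) != 0)
        rc = -1;
    return rc;
}
```

A database directory with thousands of relation files means thousands of such calls, which is why this path is acceptable for an administrative command but not for a commit.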

    3. Table rewrites, CREATE TABLE AS, COPY FROM a file, or REFRESH MATERIALIZED VIEW when wal_level=minimal.
    These call heap_sync:
    src/include/access/xlog.h:
    #define XLogIsNeeded() (wal_level >= WAL_LEVEL_ARCHIVE)

    ...
            if (!XLogIsNeeded())
                    myState->hi_options |= HEAP_INSERT_SKIP_WAL;
    ...
            /* If we skipped using WAL, must heap_sync before commit */
            if (myState->hi_options & HEAP_INSERT_SKIP_WAL)
                    heap_sync(myState->rel);

    4. 2PC state files
    This happens during WAL replay.
    RecreateTwoPhaseFile

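    5. Timeline history files
    When a new timeline history file has to be created, either because of a promote or because the walreceiver has received a timeline history file.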

    6. Replication slot files
    Creating a slot requires creating the corresponding file in the pg_replslot directory.
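Items 4 to 6 all come down to creating a small state file that must survive a crash. A common generic way to do that is to write a temporary file, fsync it, rename it into place, and then fsync the containing directory. The sketch below shows that pattern under stated assumptions: `durable_create` is a hypothetical helper, not a PostgreSQL function, and the ".tmp" suffix is chosen only for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Create dir/name with the given contents so that, after a crash, the
 * file is either absent or complete.  Returns 0 on success, -1 on error.
 */
int
durable_create(const char *dir, const char *name,
               const void *data, size_t len)
{
    char tmp[4096];
    char final[4096];
    int fd;
    int rc;

    assert(dir != NULL && name != NULL && data != NULL);
    snprintf(tmp, sizeof(tmp), "%s/%s.tmp", dir, name);
    snprintf(final, sizeof(final), "%s/%s", dir, name);

    /* 1. write the contents to a temporary file and fsync it */
    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0)
    {
        unlink(tmp);
        return -1;
    }

    /* 2. atomically move it to its final name */
    if (rename(tmp, final) != 0)
    {
        unlink(tmp);
        return -1;
    }

    /* 3. fsync the directory so the new name itself is durable */
    fd = open(dir, O_RDONLY);
    if (fd < 0)
        return -1;
    rc = fsync(fd);
    close(fd);
    return rc == 0 ? 0 : -1;
}
```

Two fsyncs per file is tolerable here because these files are created rarely, not once per transaction.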

    7. pg_clog, pg_multixact
    /*
     * SlruCtlData is an unshared structure that points to the active information
     * in shared memory.
     */
    typedef struct SlruCtlData
    {
            SlruShared      shared;

            /*
             * This flag tells whether to fsync writes (true for pg_clog and multixact
             * stuff, false for pg_subtrans and pg_notify).
             */
            bool            do_fsync;

            /*
             * Decide which of two page numbers is "older" for truncation purposes. We
             * need to use comparison of TransactionIds here in order to do the right
             * thing with wraparound XID arithmetic.
             */
            bool            (*PagePrecedes) (int, int);

            /*
             * Dir is set during SimpleLruInit and does not change thereafter. Since
             * it's always the same, it doesn't need to be in shared memory.
             */
            char            Dir[64];
    } SlruCtlData;
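The do_fsync flag lets one page-writing routine serve both kinds of data: pages that must be durable and pages that can be rebuilt after a crash. The following sketch shows how such a flag might gate the fsync. It is not the slru.c implementation: `SketchCtl`, `sketch_write_page`, and the one-file-per-store layout are all assumptions made for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_PAGE_SIZE 8192

/* Hypothetical control struct, loosely modeled on the do_fsync field. */
typedef struct SketchCtl
{
    int do_fsync;   /* nonzero: every page write is followed by fsync */
} SketchCtl;

/*
 * Write one page at its slot in the file at "path".  The fsync is issued
 * only when the control struct asks for it.
 * Returns 0 on success, -1 on error.
 */
int
sketch_write_page(const SketchCtl *ctl, const char *path,
                  long pageno, const char *page)
{
    off_t offset = (off_t) pageno * SKETCH_PAGE_SIZE;
    int rc = 0;
    int fd;

    assert(ctl != NULL && path != NULL && page != NULL && pageno >= 0);
    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    if (pwrite(fd, page, SKETCH_PAGE_SIZE, offset) != SKETCH_PAGE_SIZE)
        rc = -1;

    /* durable stores pay for the fsync, rebuildable ones skip it */
    if (rc == 0 && ctl->do_fsync && fsync(fd) != 0)
        rc = -1;

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

A caller would keep one control struct per store, with the flag set for data like pg_clog and cleared for data like pg_subtrans.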

    Others
    ......