don't shutdown before pg_stop_backup()

来源:互联网 发布:银行数据管理制度 编辑:程序博客网 时间:2024/06/07 04:08

Postgres2015全国用户大会将于11月20至21日在北京丽亭华苑酒店召开。本次大会嘉宾阵容强大,国内顶级PostgreSQL数据库专家将悉数到场,并特邀欧洲、俄罗斯、日本、美国等国家和地区的数据库方面专家助阵:

  • Postgres-XC项目的发起人铃木市一(SUZUKI Koichi)
  • Postgres-XL的项目发起人Mason Sharp
  • pgpool的作者石井达夫(Tatsuo Ishii)
  • PG-Strom的作者海外浩平(Kaigai Kohei)
  • Greenplum研发总监姚延栋
  • 周正中(德哥), PostgreSQL中国用户会创始人之一
  • 汪洋,平安科技数据库技术部经理
  • ……
 
  • 2015年度PG大象会报名地址:http://postgres2015.eventdove.com/
  • PostgreSQL中国社区: http://postgres.cn/
  • PostgreSQL专业1群: 3336901(已满)
  • PostgreSQL专业2群: 100910388
  • PostgreSQL专业3群: 150657323



  • 在创建备份或STANDBY时,我们通常可以有两种方式一种是通过pg_basebackup,另一种是使用pg_start_backup然后COPY文件的方式。
    在使用第二种方式时,正确的流程应该是:
    pg_start_backup
    COPY
    pg_stop_backup
    但是我们试想一下,如果在COPY完后,调用pg_stop_backup前,主库关闭了,然后又起来了。
    这个备份还有效吗?
    先来试一试:
    1. 执行pg_start_backup
    do_pg_start_backup@src/backend/access/transam/xlog.c被调用。

    postgres@digoal-> psql
    psql (9.4.4)
    Type "help" for help.
    postgres=# select pg_start_backup('a');

    2. 拷贝文件,并使用这个拷贝创建备库
    postgres@digoal-> rm -f postmaster.pid
    postgres@digoal-> pg_ctl start -D .
    server starting
    postgres@digoal-> LOG:  00000: redirecting log output to logging collector process
    HINT:  Future log output will appear in directory "/data03/pg_log_1922".
    LOCATION:  SysLogger_Start, syslogger.c:645


    postgres@digoal-> top -c -u postgres

    top - 15:36:48 up  7:19,  4 users,  load average: 0.00, 0.00, 0.00
    Tasks: 159 total,   1 running, 158 sleeping,   0 stopped,   0 zombie
    Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
    Mem:   1914436k total,   961196k used,   953240k free,   104048k buffers
    Swap:  1048572k total,        0k used,  1048572k free,   666808k cached
      PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                     
     4082 postgres  20   0 15040 1272  936 R 33.2  0.1   0:00.25 top -c -u postgres                                                                                      
     1908 postgres  20   0  105m 1948 1452 S  0.0  0.1   0:00.64 -bash                                                                                                          
     3850 postgres  20   0  105m 1960 1448 S  0.0  0.1   0:00.28 -bash                                                                                                          
     3987 postgres  20   0  488m  34m  32m S  0.0  1.8   0:00.21 /opt/pgsql9.4.4/bin/postgres                                                                    
     3989 postgres  20   0  186m 3112  708 S  0.0  0.2   0:00.00 postgres: logger process                                                                          
     3991 postgres  20   0  488m 5928 3492 S  0.0  0.3   0:00.00 postgres: checkpointer process                                                              
     3992 postgres  20   0  488m 5544 3112 S  0.0  0.3   0:00.05 postgres: writer process                                                                          
     3993 postgres  20   0  488m  19m  16m S  0.0  1.0   0:00.31 postgres: wal writer process                                                                    
     3994 postgres  20   0  188m 3380  800 S  0.0  0.2   0:00.00 postgres: stats collector process                                                            
     4069 postgres  20   0  488m  34m  32m S  0.0  1.8   0:00.11 /opt/pgsql9.4.4/bin/postgres -D .                                                            
     4071 postgres  20   0  186m 3116  708 S  0.0  0.2   0:00.00 postgres: logger process                                                                          
     4072 postgres  20   0  488m 4040 1572 S  0.0  0.2   0:00.01 postgres: startup process   recovering 000000060000002900000033                                                                                                                 
     4076 postgres  20   0  488m 3344  912 S  0.0  0.2   0:00.00 postgres: checkpointer process                                                              
     4077 postgres  20   0  488m 3432 1000 S  0.0  0.2   0:00.00 postgres: writer process                                                                            4078 postgres  20   0  495m 5380 2840 S  0.0  0.3   0:00.00 postgres: wal receiver process                                                              
     4080 postgres  20   0  489m 4828 2216 S  0.0  0.3   0:00.00 postgres: wal sender process replica 192.168.150.128(26168) streaming 29/33000100                                                                                               


    3. 关闭主库,注意这个时候还没有在主库执行 pg_stop_backup

    4. 备库CRASH掉了。并产生了一个CORE文件。
    日志如下:

    2015-09-10 15:36:56.065 CST,,,4072,,55f1330a.fe8,5,,2015-09-10 15:36:42 CST,1/0,0,PANIC,XX000,"online backup was canceled, recovery cannot continue",,,,,"xlog redo checkpoint: redo 29/33000100; tli 6; prev tli 6; fpw true; xid 0/639272964; oid 151900; multi 788; offset 1711; oldest xid 600048491 in DB 1; oldest multi 631 in DB 1; oldest running xid 0; shutdown",,,"xlog_redo, xlog.c:9310",""

    CORE文件

    [root@digoal pg_root_1922]# ll
    total 10540
    -rw------- 1 postgres postgres       193 Sep 10 15:36 backup_label.old
    drwx------ 6 postgres postgres      4096 Sep 10 15:36 base
    -rw------- 1 postgres postgres 318500864 Sep 10 15:36 core.4072
    drwx------ 2 postgres postgres      4096 Sep 10 15:36 global
    drwx------ 2 postgres postgres      4096 Sep 10 15:36 pg_clog

    分析这个CORE文件:
    # gdb -c ./core.4072 /opt/pgsql/bin/postgres
    截取最后一部分信息如下:

    Loaded symbols for /lib64/libnss_files.so.2
    Reading symbols from /lib/modules/2.6.32-504.el6.x86_64/vdso/vdso.so...Reading symbols from /usr/lib/debug/lib/modules/2.6.32-504.el6.x86_64/vdso/vdso.so.debug...done.
    done.
    Loaded symbols for /lib/modules/2.6.32-504.el6.x86_64/vdso/vdso.so
    Core was generated by `postgres: startup process   recovering 00000006000000'.
    Program terminated with signal 6, Aborted.
    #0  0x00000038b9032625 in raise () from /lib64/libc.so.6
    Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 cyrus-sasl-lib-2.1.23-15.el6.x86_64 glibc-2.12-1.149.el6_6.9.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-33.el6.x86_64 libcom_err-1.41.12-21.el6.x86_64 libselinux-2.0.94-5.8.el6.x86_64 libxml2-2.7.6-17.el6_6.1.x86_64 nspr-4.10.6-1.el6_5.x86_64 nss-3.16.1-14.el6.x86_64 nss-softokn-freebl-3.14.3-17.el6.x86_64 nss-util-3.16.1-3.el6.x86_64 openldap-2.4.39-8.el6.x86_64 openssl-1.0.1e-42.el6.x86_64 pam-1.1.1-20.el6.x86_64 zlib-1.2.3-29.el6.x86_64
    (gdb) bt
    #0  0x00000038b9032625 in raise () from /lib64/libc.so.6
    #1  0x00000038b9033e05 in abort () from /lib64/libc.so.6
    #2  0x0000000000b64a3a in errfinish (dummy=0) at elog.c:572
    #3  0x0000000000547058 in xlog_redo (lsn=176848634272, record=0x2311328) at xlog.c:9309
    #4  0x0000000000540909 in StartupXLOG () at xlog.c:6920
    #5  0x00000000008b7455 in StartupProcessMain () at startup.c:224
    #6  0x000000000055a61a in AuxiliaryProcessMain (argc=2, argv=0x7fffa1f9d130) at bootstrap.c:422
    #7  0x00000000008b57e3 in StartChildProcess (type=StartupProcess) at postmaster.c:5155
    #8  0x00000000008abc59 in PostmasterMain (argc=3, argv=0x22cb280) at postmaster.c:1237
    #9  0x00000000007a8f01 in main (argc=3, argv=0x22cb280) at main.c:228

    可以看到PG的错误日志是由xlog_redo 引起,相关的代码:
    xlog_redo是在数据恢复时被调用,XLOG_CHECKPOINT_SHUTDOWN是一笔数据库关闭的xlog.

            else if (info == XLOG_CHECKPOINT_SHUTDOWN)
            {
                    CheckPoint      checkPoint;

                    memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
                    /* In a SHUTDOWN checkpoint, believe the counters exactly */
                    LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
                    ShmemVariableCache->nextXid = checkPoint.nextXid;
                    LWLockRelease(XidGenLock);
                    LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
                    ShmemVariableCache->nextOid = checkPoint.nextOid;
                    ShmemVariableCache->oidCount = 0;
                    LWLockRelease(OidGenLock);
                    MultiXactSetNextMXact(checkPoint.nextMulti,
                                                              checkPoint.nextMultiOffset);
                    SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
                    SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
                    MultiXactSetSafeTruncate(checkPoint.oldestMulti);

                    /*
                     * If we see a shutdown checkpoint while waiting for an end-of-backup
                     * record, the backup was canceled and the end-of-backup record will
                     * never arrive.
                     */    此时ControlFile->backupStartPoint <>0 并且ControlFile->backupEndPoint = 0 说明备份未结束。
                    if (ArchiveRecoveryRequested &&
                            !XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
                            XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
                            ereport(PANIC,
                            (errmsg("online backup was canceled, recovery cannot continue")));

    我们可以看到,正常情况下是这样的:
    postgres@digoal-> pg_controldata |grep -i backup

    Backup start location:                0/0
    Backup end location:                  0/0
    End-of-backup record required:        no

    但是当前备库的情况是这样的:
    postgres@digoal-> pg_controldata /data01/pg_root_1922|grep -i back

    Minimum recovery ending location:     29/2D000028
    Min recovery ending loc's timeline:   6
    Backup start location:                29/2D000028
    Backup end location:                  0/0
    End-of-backup record required:        no


    如果我们调用了pg_stop_backend,那么会在XLOG写一笔XLOG_BACKUP_END的RECORD。
    do_pg_stop_backup@src/backend/access/transam/xlog.c

            stoppoint = XLogInsert(RM_XLOG_ID, XLOG_BACKUP_END, &rdata);
    ......
            /*
             * Force a switch to a new xlog segment file, so that the backup is valid
             * as soon as archiver moves out the current segment file.
             */
            RequestXLogSwitch();


    然后,在恢复时,如果遇到这个RECORD,会把 ControlFile->backupStartPoint 改成0,也就是我们看到的正常的数据。

    /*
     * XLOG resource manager's routines
     *
     * Definitions of info values are in include/catalog/pg_control.h, though
     * not all record types are related to control file updates.
     */
    void
    xlog_redo(XLogRecPtr lsn, XLogRecord *record)
    {
    ......
            else if (info == XLOG_BACKUP_END)
            {
    ......
                    if (ControlFile->backupStartPoint == startpoint)
                    {
    ......
                            ControlFile->backupStartPoint = InvalidXLogRecPtr;


    那么ControlFile->backupStartPoint == startpoint又是什么情况下会满足呢?
    其实是在启动STARTUP进程时,判断有没有一个LABEL文件,有的话,就会将ControlFile->backupStartPoint改为checkPoint.redo。

    /*
     * This must be called ONCE during postmaster or standalone-backend startup
     */
    void
    StartupXLOG(void)
    {
    ......
            /* REDO */
            if (InRecovery)
            {
    .......
                    /*
                     * Set backupStartPoint if we're starting recovery from a base backup.
                     *
                     * Also set backupEndPoint and use minRecoveryPoint as the backup end
                     * location if we're starting recovery from a base backup which was
                     * taken from a standby. In this case, the database system status in
                     * pg_control must indicate that the database was already in
                     * recovery. Usually that will be DB_IN_ARCHIVE_RECOVERY but also can
                     * be DB_SHUTDOWNED_IN_RECOVERY if recovery previously was interrupted
                     * before reaching this point; e.g. because restore_command or
                     * primary_conninfo were faulty.
                     *
                     * Any other state indicates that the backup somehow became corrupted
                     * and we can't sensibly continue with recovery.
                     */
                    if (haveBackupLabel)
                    {
                            ControlFile->backupStartPoint = checkPoint.redo;
                            ControlFile->backupEndRequired = backupEndRequired;


    [参考]
    1. src/backend/access/transam/xlog.c
    0 0
    原创粉丝点击