don't shutdown before pg_stop_backup()

来源：互联网发布：银行数据管理制度编辑：程序博客网时间：2024/06/07 04:08

Postgres2015全国用户大会将于11月20至21日在北京丽亭华苑酒店召开。本次大会嘉宾阵容强大，国内顶级PostgreSQL数据库专家将悉数到场，并特邀欧洲、俄罗斯、日本、美国等国家和地区的数据库方面专家助阵:

Postgres-XC项目的发起人铃木市一(SUZUKI Koichi)
Postgres-XL的项目发起人Mason Sharp
pgpool的作者石井达夫(Tatsuo Ishii)
PG-Strom的作者海外浩平(Kaigai Kohei)
Greenplum研发总监姚延栋
周正中(德哥), PostgreSQL中国用户会创始人之一
汪洋，平安科技数据库技术部经理
……

2015年度PG大象会报名地址：http://postgres2015.eventdove.com/
PostgreSQL中国社区： http://postgres.cn/
PostgreSQL专业1群： 3336901（已满）
PostgreSQL专业2群： 100910388
PostgreSQL专业3群： 150657323

在创建备份或STANDBY时，我们通常可以有两种方式一种是通过pg_basebackup，另一种是使用pg_start_backup然后COPY文件的方式。

在使用第二种方式时，正确的流程应该是：

pg_start_backup

COPY

pg_stop_backup

但是我们试想一下，如果在COPY完后，调用pg_stop_backup前，主库关闭了，然后又起来了。

这个备份还有效吗？

先来试一试：

1. 执行pg_start_backup

do_pg_start_backup@src/backend/access/transam/xlog.c被调用。


postgres@digoal-> psql
psql (9.4.4)
Type "help" for help.
postgres=# select pg_start_backup('a');

2. 拷贝文件，并使用这个拷贝创建备库

postgres@digoal-> rm -f postmaster.pid

postgres@digoal-> pg_ctl start -D .

server starting

postgres@digoal-> LOG: 00000: redirecting log output to logging collector process

HINT: Future log output will appear in directory "/data03/pg_log_1922".

LOCATION: SysLogger_Start, syslogger.c:645

postgres@digoal-> top -c -u postgres

top - 15:36:48 up 7:19, 4 users, load average: 0.00, 0.00, 0.00

Tasks: 159 total, 1 running, 158 sleeping, 0 stopped, 0 zombie

Cpu(s): 0.0%us, 0.0%sy, 0.0%ni,100.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st

Mem: 1914436k total, 961196k used, 953240k free, 104048k buffers

Swap: 1048572k total, 0k used, 1048572k free, 666808k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

4082 postgres 20 0 15040 1272 936 R 33.2 0.1 0:00.25 top -c -u postgres

1908 postgres 20 0 105m 1948 1452 S 0.0 0.1 0:00.64 -bash

3850 postgres 20 0 105m 1960 1448 S 0.0 0.1 0:00.28 -bash

3987 postgres 20 0 488m 34m 32m S 0.0 1.8 0:00.21 /opt/pgsql9.4.4/bin/postgres

3989 postgres 20 0 186m 3112 708 S 0.0 0.2 0:00.00 postgres: logger process

3991 postgres 20 0 488m 5928 3492 S 0.0 0.3 0:00.00 postgres: checkpointer process

3992 postgres 20 0 488m 5544 3112 S 0.0 0.3 0:00.05 postgres: writer process

3993 postgres 20 0 488m 19m 16m S 0.0 1.0 0:00.31 postgres: wal writer process

3994 postgres 20 0 188m 3380 800 S 0.0 0.2 0:00.00 postgres: stats collector process

4069 postgres 20 0 488m 34m 32m S 0.0 1.8 0:00.11 /opt/pgsql9.4.4/bin/postgres -D .

4071 postgres 20 0 186m 3116 708 S 0.0 0.2 0:00.00 postgres: logger process

4072 postgres 20 0 488m 4040 1572 S 0.0 0.2 0:00.01 postgres: startup process recovering 000000060000002900000033

4076 postgres 20 0 488m 3344 912 S 0.0 0.2 0:00.00 postgres: checkpointer process

4077 postgres 20 0 488m 3432 1000 S 0.0 0.2 0:00.00 postgres: writer process 4078 postgres 20 0 495m 5380 2840 S 0.0 0.3 0:00.00 postgres: wal receiver process

4080 postgres 20 0 489m 4828 2216 S 0.0 0.3 0:00.00 postgres: wal sender process replica 192.168.150.128(26168) streaming 29/33000100

3. 关闭主库，注意这个时候还没有在主库执行 pg_stop_backup

4. 备库CRASH掉了。并产生了一个CORE文件。

日志如下：

2015-09-10 15:36:56.065 CST,,,4072,,55f1330a.fe8,5,,2015-09-10 15:36:42 CST,1/0,0,PANIC,XX000,"online backup was canceled, recovery cannot continue",,,,,"xlog redo checkpoint: redo 29/33000100; tli 6; prev tli 6; fpw true; xid 0/639272964; oid 151900; multi 788; offset 1711; oldest xid 600048491 in DB 1; oldest multi 631 in DB 1; oldest running xid 0; shutdown",,,"xlog_redo, xlog.c:9310",""

CORE文件

[root@digoal pg_root_1922]# ll

total 10540

-rw------- 1 postgres postgres 193 Sep 10 15:36 backup_label.old

drwx------ 6 postgres postgres 4096 Sep 10 15:36 base

-rw------- 1 postgres postgres 318500864 Sep 10 15:36 core.4072

drwx------ 2 postgres postgres 4096 Sep 10 15:36 global

drwx------ 2 postgres postgres 4096 Sep 10 15:36 pg_clog

分析这个CORE文件：

# gdb -c ./core.4072 /opt/pgsql/bin/postgres

截取最后一部分信息如下：

Loaded symbols for /lib64/libnss_files.so.2

Reading symbols from /lib/modules/2.6.32-504.el6.x86_64/vdso/vdso.so...Reading symbols from /usr/lib/debug/lib/modules/2.6.32-504.el6.x86_64/vdso/vdso.so.debug...done.

done.

Loaded symbols for /lib/modules/2.6.32-504.el6.x86_64/vdso/vdso.so

Core was generated by `postgres: startup process recovering 00000006000000'.

Program terminated with signal 6, Aborted.

#0 0x00000038b9032625 in raise () from /lib64/libc.so.6

Missing separate debuginfos, use: debuginfo-install audit-libs-2.3.7-5.el6.x86_64 cyrus-sasl-lib-2.1.23-15.el6.x86_64 glibc-2.12-1.149.el6_6.9.x86_64 keyutils-libs-1.4-5.el6.x86_64 krb5-libs-1.10.3-33.el6.x86_64 libcom_err-1.41.12-21.el6.x86_64 libselinux-2.0.94-5.8.el6.x86_64 libxml2-2.7.6-17.el6_6.1.x86_64 nspr-4.10.6-1.el6_5.x86_64 nss-3.16.1-14.el6.x86_64 nss-softokn-freebl-3.14.3-17.el6.x86_64 nss-util-3.16.1-3.el6.x86_64 openldap-2.4.39-8.el6.x86_64 openssl-1.0.1e-42.el6.x86_64 pam-1.1.1-20.el6.x86_64 zlib-1.2.3-29.el6.x86_64

(gdb) bt

#0 0x00000038b9032625 in raise () from /lib64/libc.so.6

#1 0x00000038b9033e05 in abort () from /lib64/libc.so.6

#2 0x0000000000b64a3a in errfinish (dummy=0) at elog.c:572

#3 0x0000000000547058 in xlog_redo (lsn=176848634272, record=0x2311328) at xlog.c:9309

#4 0x0000000000540909 in StartupXLOG () at xlog.c:6920

#5 0x00000000008b7455 in StartupProcessMain () at startup.c:224

#6 0x000000000055a61a in AuxiliaryProcessMain (argc=2, argv=0x7fffa1f9d130) at bootstrap.c:422

#7 0x00000000008b57e3 in StartChildProcess (type=StartupProcess) at postmaster.c:5155

#8 0x00000000008abc59 in PostmasterMain (argc=3, argv=0x22cb280) at postmaster.c:1237

#9 0x00000000007a8f01 in main (argc=3, argv=0x22cb280) at main.c:228

可以看到PG的错误日志是由xlog_redo 引起，相关的代码：

xlog_redo是在数据恢复时被调用，XLOG_CHECKPOINT_SHUTDOWN是一笔数据库关闭的xlog.


        else if (info == XLOG_CHECKPOINT_SHUTDOWN)
        {
                CheckPoint      checkPoint;

                memcpy(&checkPoint, XLogRecGetData(record), sizeof(CheckPoint));
                /* In a SHUTDOWN checkpoint, believe the counters exactly */
                LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
                ShmemVariableCache->nextXid = checkPoint.nextXid;
                LWLockRelease(XidGenLock);
                LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
                ShmemVariableCache->nextOid = checkPoint.nextOid;
                ShmemVariableCache->oidCount = 0;
                LWLockRelease(OidGenLock);
                MultiXactSetNextMXact(checkPoint.nextMulti,
                                                          checkPoint.nextMultiOffset);
                SetTransactionIdLimit(checkPoint.oldestXid, checkPoint.oldestXidDB);
                SetMultiXactIdLimit(checkPoint.oldestMulti, checkPoint.oldestMultiDB);
                MultiXactSetSafeTruncate(checkPoint.oldestMulti);

                /*
                 * If we see a shutdown checkpoint while waiting for an end-of-backup
                 * record, the backup was canceled and the end-of-backup record will
                 * never arrive.
                 */    此时ControlFile->backupStartPoint <>0 并且ControlFile->backupEndPoint = 0 说明备份未结束。
                if (ArchiveRecoveryRequested &&
                        !XLogRecPtrIsInvalid(ControlFile->backupStartPoint) &&
                        XLogRecPtrIsInvalid(ControlFile->backupEndPoint))
                        ereport(PANIC,
                        (errmsg("online backup was canceled, recovery cannot continue")));

我们可以看到，正常情况下是这样的：

postgres@digoal-> pg_controldata |grep -i backup

Backup start location: 0/0

Backup end location: 0/0

End-of-backup record required: no

但是当前备库的情况是这样的：

postgres@digoal-> pg_controldata /data01/pg_root_1922|grep -i back

Minimum recovery ending location: 29/2D000028

Min recovery ending loc's timeline: 6

Backup start location: 29/2D000028

Backup end location: 0/0

End-of-backup record required: no

如果我们调用了pg_stop_backend，那么会在XLOG写一笔XLOG_BACKUP_END的RECORD。

do_pg_stop_backup@src/backend/access/transam/xlog.c

stoppoint = XLogInsert(RM_XLOG_ID, XLOG_BACKUP_END, &rdata);

......

* Force a switch to a new xlog segment file, so that the backup is valid

* as soon as archiver moves out the current segment file.

RequestXLogSwitch();

然后，在恢复时，如果遇到这个RECORD，会把 ControlFile->backupStartPoint 改成0，也就是我们看到的正常的数据。


/*
 * XLOG resource manager's routines
 *
 * Definitions of info values are in include/catalog/pg_control.h, though
 * not all record types are related to control file updates.
 */
void
xlog_redo(XLogRecPtr lsn, XLogRecord *record)
{
......
        else if (info == XLOG_BACKUP_END)
        {
......
                if (ControlFile->backupStartPoint == startpoint)
                {
......
                        ControlFile->backupStartPoint = InvalidXLogRecPtr;

那么ControlFile->backupStartPoint == startpoint又是什么情况下会满足呢？

其实是在启动STARTUP进程时，判断有没有一个LABEL文件，有的话，就会将ControlFile->backupStartPoint改为checkPoint.redo。

* This must be called ONCE during postmaster or standalone-backend startup

void

StartupXLOG(void)

{

......

/* REDO */

if (InRecovery)

{

.......

* Set backupStartPoint if we're starting recovery from a base backup.

* Also set backupEndPoint and use minRecoveryPoint as the backup end

* location if we're starting recovery from a base backup which was

* taken from a standby. In this case, the database system status in

* pg_control must indicate that the database was already in

* recovery. Usually that will be DB_IN_ARCHIVE_RECOVERY but also can

* be DB_SHUTDOWNED_IN_RECOVERY if recovery previously was interrupted

* before reaching this point; e.g. because restore_command or

* primary_conninfo were faulty.

* Any other state indicates that the backup somehow became corrupted

* and we can't sensibly continue with recovery.

if (haveBackupLabel)

{

ControlFile->backupStartPoint = checkPoint.redo;

ControlFile->backupEndRequired = backupEndRequired;

[参考]

1. src/backend/access/transam/xlog.c

0 0