不要滥用UNLOGGED table 和 hash index

来源：互联网发布：妙趣横生的算法 c 编辑：程序博客网时间：2024/04/29 14:03

注意PostgreSQL的unlogged table是不记录xlog的，所以在备库上没有unlogged table的数据记录。

同时，数据库在进入恢复状态时，为了保证数据的一致性，postgresql会自动清除unlogged table的数据文件。

那么问题来了，万一你不小心误创建了unlogged table，你可能一开始会没有感知，但是一旦发生以下情况，你会发现数据不见了。

1. 数据库crash掉之后，重启。

2. 主备切换，备库变成主库，主库变成备库。

以上两种情况，都是数据库在启动并进入recovery状态后，原来主库下的unlogged table的datafile都会被清空。

代码见：

src/backend/access/transam/xlog.c

* This must be called ONCE during postmaster or standalone-backend startup

void

StartupXLOG(void)

{

......

/* REDO */

if (InRecovery)

{

......

* We're in recovery, so unlogged relations may be trashed and must be

* reset. This should be done BEFORE allowing Hot Standby

* connections, so that read-only backends don't try to read whatever

* garbage is left over from before.

ResetUnloggedRelations(UNLOGGED_RELATION_CLEANUP);

......

* Reset unlogged relations to the contents of their INIT fork. This is

* done AFTER recovery is complete so as to include any unlogged relations

* created during recovery, but BEFORE recovery is marked as having

* completed successfully. Otherwise we'd not retry if any of the post

* end-of-recovery steps fail.

if (InRecovery)

ResetUnloggedRelations(UNLOGGED_RELATION_INIT);

......

backend/storage/file/reinit.c


/*
 * Reset unlogged relations from before the last restart.
 *
 * If op includes UNLOGGED_RELATION_CLEANUP, we remove all forks of any
 * relation with an "init" fork, except for the "init" fork itself.
 *
 * If op includes UNLOGGED_RELATION_INIT, we copy the "init" fork to the main
 * fork.
 */
void
ResetUnloggedRelations(int op)
{
        char            temp_path[MAXPGPATH];
        DIR                *spc_dir;
        struct dirent *spc_de;
        MemoryContext tmpctx,
                                oldctx;
        /* Log it. */
        elog(DEBUG1, "resetting unlogged relations: cleanup %d init %d",
                 (op & UNLOGGED_RELATION_CLEANUP) != 0,
                 (op & UNLOGGED_RELATION_INIT) != 0);
        /*
         * Just to be sure we don't leak any memory, let's create a temporary
         * memory context for this operation.
         */
        tmpctx = AllocSetContextCreate(CurrentMemoryContext,
                                                                   "ResetUnloggedRelations",
                                                                   ALLOCSET_DEFAULT_MINSIZE,
                                                                   ALLOCSET_DEFAULT_INITSIZE,
                                                                   ALLOCSET_DEFAULT_MAXSIZE);
        oldctx = MemoryContextSwitchTo(tmpctx);
        /*
         * First process unlogged files in pg_default ($PGDATA/base)
         */
        ResetUnloggedRelationsInTablespaceDir("base", op);
        /*
         * Cycle through directories for all non-default tablespaces.
         */
        spc_dir = AllocateDir("pg_tblspc");
        while ((spc_de = ReadDir(spc_dir, "pg_tblspc")) != NULL)
        {
                if (strcmp(spc_de->d_name, ".") == 0 ||
                        strcmp(spc_de->d_name, "..") == 0)
                        continue;
                snprintf(temp_path, sizeof(temp_path), "pg_tblspc/%s/%s",
                                 spc_de->d_name, TABLESPACE_VERSION_DIRECTORY);
                ResetUnloggedRelationsInTablespaceDir(temp_path, op);
        }
        FreeDir(spc_dir);
        /*
         * Restore memory context.
         */
        MemoryContextSwitchTo(oldctx);
        MemoryContextDelete(tmpctx);
}

src/include/common/relpath.h

* Stuff for fork names.

* The physical storage of a relation consists of one or more forks.

* The main fork is always created, but in addition to that there can be

* additional forks for storing various metadata. ForkNumber is used when

* we need to refer to a specific fork in a relation.

typedef enum ForkNumber

{

InvalidForkNumber = -1,

MAIN_FORKNUM = 0,

FSM_FORKNUM,

VISIBILITYMAP_FORKNUM,

INIT_FORKNUM

* NOTE: if you add a new fork, change MAX_FORKNUM and possibly

* FORKNAMECHARS below, and update the forkNames array in

* src/common/relpath.c

} ForkNumber;

那么问题来了，如果真的这样了，有办法恢复吗？

1. 首先，如果你在原来的主库上有基础备份，你可以从基础备份恢复。为什么需要主库的备份呢，因为备库上没有unlogged table的数据文件内容，所以在备库备份是备不到unlogged table的datafile的。

但是这种恢复方法也务必要小心，你需要在启动数据库前，先把_init的文件都删掉，这样启动数据库时数据文件就不会被清除。

2. 从主库的逻辑备份中恢复。

3. 从审计日志中回放SQL恢复。

4. 如果你没有主库的基础备份，那么可有从文件系统中去恢复删掉的数据文件。例如ext4文件系统的恢复方式如下：

http://blog.163.com/digoal@126/blog/static/16387704020142124032866/

例子：

创建一个unlogged table，并记录它的filenode


postgres=# create unlogged table utbl1(id int);
CREATE TABLE
postgres=# insert into utbl1 select generate_series(1,1000);
INSERT 0 1000
postgres=# select pg_relation_filepath('utbl1'::regclass);
 pg_relation_filepath 
----------------------
 base/151898/33822
(1 row)

查看到PG对unlogged table做了_init的后缀标记

http://www.postgresql.org/docs/9.4/static/storage-init.html

postgres@digoal-> cd $PGDATA

postgres@digoal-> ll base/151898/33822*

-rw------- 1 postgres postgres 40K Sep 26 11:39 base/151898/33822

-rw------- 1 postgres postgres 24K Sep 26 11:39 base/151898/33822_fsm

-rw------- 1 postgres postgres 0 Sep 26 11:38 base/151898/33822_init

现在，我们进入恢复模式

postgres@digoal-> mv recovery.done recovery.conf

postgres@digoal-> pg_ctl start

数据库启动后，其实已经将unlogged table的数据文件清理掉了，见前面的源码。


postgres=# select count(*) from utbl1 ;
ERROR:  cannot access temporary or unlogged relations during recovery

现在再起来，unlogged table就没有数据了。


postgres@digoal-> mv recovery.conf recovery.done
postgres@digoal-> pg_ctl start
postgres=# select count(*) from utbl1 ;
 count 
-------
     0
(1 row)

恢复过程，先停库，然后umount数据文件所在的文件系统。

pg_ctl stop -m fast

root@digoal-> umount /data01

假设我已经安装了extundelete

http://sourceforge.net/projects/extundelete/files/extundelete/

http://blog.163.com/digoal@126/blog/static/16387704020142124032866/

root@digoal-> cd /opt/extundelete/

root@digoal-> ll

total 4.0K

drwxr-xr-x 2 root root 4.0K Sep 26 13:43 bin

root@digoal-> cd bin/

root@digoal-> ll

total 1.2M

-rwxr-xr-x 1 root root 1.2M Sep 26 13:43 extundelete

查看数据文件所在的文件系统的块设备的inode信息。

root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01

./extundelete: unrecognized option '--ls'

NOTICE: Extended attributes are not restored.

Loading filesystem metadata ... 80 groups loaded.

Group: 0

Contents of inode 2:

0000 | ed 41 00 00 00 10 00 00 6d 08 fa 55 bc 08 fa 55 | .A......m..U...U

0010 | bc 08 fa 55 00 00 00 00 00 00 05 00 08 00 00 00 | ...U............

0020 | 00 00 00 00 02 00 00 00 a1 22 00 00 00 00 00 00 | ........."......

0030 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

0040 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

0050 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

0060 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

0070 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

0080 | 1c 00 00 00 88 90 cc 84 88 90 cc 84 00 00 00 00 | ................

0090 | 6d 08 fa 55 00 00 00 00 00 00 00 00 00 00 00 00 | m..U............

00a0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

00b0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

00c0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

00d0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

00e0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

00f0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | ................

Inode is Allocated

File mode: 16877

Low 16 bits of Owner Uid: 0

Size in bytes: 4096

Access time: 1442449517

Creation time: 1442449596

Modification time: 1442449596

Deletion Time: 0

Low 16 bits of Group Id: 0

Links count: 5

Blocks count: 8

File flags: 0

File version (for NFS): 0

File ACL: 0

Directory ACL: 0

Fragment address: 0

Direct blocks: 8865, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

Indirect block: 0

Double indirect block: 0

Triple indirect block: 0

File name | Inode number | Deleted status

. 2

.. 2

lost+found 11

pg_root_1921 131073

pg_root_1922 393217

我们的数据文件在pg_root_1921目录下，对应的inode=131073，根据这个inode继续查询下一级目录的inode

root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --inode 131073

找到了base的inode

base 131077

继续找，找到了unlogged table所在的database的inode

root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --inode 131077

151898

131078

继续找，找对应的_init文件，找到了它的前缀，根据前缀过滤

root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --inode 131078|grep _init

33822_init 131152

找到了被删除的文件

root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --inode 131078|grep 33822

./extundelete: unrecognized option '--ls'

33822 131116 Deleted

33822_fsm 131147 Deleted

33822_init 131152

使用inode恢复，恢复删除的datafile

root@digoal-> ./extundelete --ls --inode 2 /dev/mapper/vgdata01-lv01 --restore-inode 131116

恢复后放在这里

root@digoal-> ll RECOVERED_FILES/

total 40K

-rw-r--r-- 1 root root 40K Sep 26 13:50 file.131116

将文件拷贝到原来的位置

root@digoal-> mount /data01

root@digoal-> cp file.131116 /data01/pg_root_1921/base/151898/33822

root@digoal-> chown -R postgres:postgres /data01/

root@digoal-> su - postgres

启动数据库前，切记，删除_init后缀的文件，否则数据文件可能又会被清理掉。

cd $PGDATA

postgres@digoal-> rm -f base/151898/33822_init

启动数据库，数据已经回来了。


postgres@digoal-> pg_ctl start
postgres=# select count(*) from utbl1 ;
 count 
-------
  1000
(1 row)

大家千万要切记，不要滥用unlogged table，除非你可以为你的行为负责。

hash index也是这样，因为不写XLOG，所以如果你使用流复制来实施HA的话，在切换到备库后，如果走hash index scan，你会发现数据凭空"消失"，实际上是因为hash index的变更没有复制过去的原因。所以就不要使用hash index了吧。

0 0