A Brief Source Code Analysis of Partial Resynchronization in Redis 2.8 (Replication Partial Resynchronization)

Source: Internet. Editor: 程序博客网. Time: 2024/06/06 01:04

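The previous two articles, "Redis Master-Slave Synchronization Source Code Analysis: Master Side" and "Redis Master-Slave Synchronization Source Code Analysis: Slave Side", walked through the relevant implementation. They expose the biggest weakness of Redis master-slave replication, and the main obstacle to using it with large data sets: every time the connection drops, the slave has to reconnect to the master and perform a full resynchronization of all data. The cost of that is easy to imagine.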

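Long-lived connections drop frequently in production. If the whole RDB file has to be resynchronized each time, then for a file of tens of gigabytes the master must build the RDB snapshot and send the file to the slave, and the slave must load it into memory entry by entry. I estimate that this takes several hours, not to mention tree-shaped topologies such as master->slave->slave, where the load on the network cards and on the servers is frightening. With repair times that easily run to several hours or even a day, nobody would dare rely on Redis master-slave replication in production.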

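But good news is coming: version 2.8, due for release soon (third quarter of 2013), addresses this problem through replication partial resynchronization. I will simply call it "partial sync" here. Note that this is not the normal propagation of newly written commands.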

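For the details of the incremental sync feature, see the Redis author's initial thoughts (Designing Redis replication partial resync), his mid-way notes (Partial resyncs and synchronous replication.) and his final design (PSYNC), which explain Redis partial resync in depth. I will not repeat them; below is only a brief summary to make the code analysis easier to follow.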

Note that the code listed in this article is the latest code at the time of writing, not the code of the 2.8 release: https://github.com/antirez/redis

Part 0. Introduction to Partial Resynchronization

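To avoid a full RDB resync on every reconnect, Redis uses a backlog, similar in spirit to MySQL, that lets a slave perform a partial resync within a certain window: it fetches only the part it is missing and skips what it already has. Note that if an instance is restarted, a full resync is still required. That is a bit of a pity; I do not know whether this will be added later, but it should not be hard to implement.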

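In plain words, the conversation goes like this:

Slave: Brother master, I zoned out just now and lost the connection, so I need to sync with you again. If you are still the same replrunid as before, the position I had synced to is reploff; if there is still a chance, please send me the data I missed right away. Otherwise, please send me the whole RDB file.
Master: Slaves, if you lose the connection, please send me the runid you remembered and the sync position you computed. I will check whether you can sync only the part you missed; otherwise you have to do a full RDB resync.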

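Given this design, the master obviously has to keep a backlog of bounded size, that is, a record of the commands sent to its slaves over a recent period, together with its start and end positions. The slave has to remember how far it had synced when the connection dropped, and on reconnect it uses that position to ask the master whether it still has a chance to catch up.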

Part 1. The Slave Initiates a Partial Resync Request

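Most of the flow is the same as replication in version 2.6:

1. server.repl_state is set to REDIS_REPL_CONNECT;
2. the replicationCron timer task detects this state and calls connectWithMaster to connect to the master;
3. once connected, the slave calls syncWithMaster and sends PING;
4. the slave sends SYNC to tell the master to create an RDB snapshot;
5. the slave receives the RDB snapshot file from the master;
6. the slave loads the new data.

With partial resync in 2.8, step 4 above changes: the slave first sends PSYNC to attempt a partial resync, by calling slaveTryPartialResynchronization. If the master does not recognize the command, there is no alternative but to send SYNC and do a full resync.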

void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
    // ...
    /* Try a partial resynchronization. If we don't have a cached master
     * slaveTryPartialResynchronization() will at least try to use PSYNC
     * to start a full resynchronization so that we get the master run id
     * and the global offset, to try a partial resync at the next
     * reconnection attempt. */
    psync_result = slaveTryPartialResynchronization(fd);
    if (psync_result == PSYNC_CONTINUE) {
        redisLog(REDIS_NOTICE, "MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.");
        return;
    }

    /* Fall back to SYNC if needed. Otherwise psync_result == PSYNC_FULLRESYNC
     * and the server.repl_master_runid and repl_master_initial_offset are
     * already populated. */
    if (psync_result == PSYNC_NOT_SUPPORTED) {
        redisLog(REDIS_NOTICE,"Retrying with SYNC...");
        if (syncWrite(fd,"SYNC\r\n",6,server.repl_syncio_timeout*1000) == -1) {
            redisLog(REDIS_WARNING,"I/O error writing to MASTER: %s",
                strerror(errno));
            goto error;
        }
    }
    // ...
}

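slaveTryPartialResynchronization is where the slave sends PSYNC to the master. The syntax is: PSYNC runid psync_offset. The two arguments mean the following.

runid is a string the master gives to the slave to identify the master instance, so that nothing is synced incorrectly after the master restarts. The master tells the slave this value on its first sync, and it stays the same until the master restarts.

psync_offset is the position the slave wants to resume from. The slave tracks how much replication data it has received, in bytes, in c->reploff, and sends reploff+1 as psync_offset. The master uses it to decide whether an incremental sync is possible and which data to send. The master tells the slave the initial value on the first sync; from then on, every time the slave reads data from the master connection it adds the number of bytes read to c->reploff.
The code that sends PSYNC follows.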

int slaveTryPartialResynchronization(int fd) {
    char *psync_runid;
    char psync_offset[32];
    sds reply;

    if (server.cached_master) {
        psync_runid = server.cached_master->replrunid;
        snprintf(psync_offset,sizeof(psync_offset),"%lld", server.cached_master->reploff+1);
        redisLog(REDIS_NOTICE,"Trying a partial resynchronization (request %s:%s).", psync_runid, psync_offset);
    } else {
        redisLog(REDIS_NOTICE,"Partial resynchronization not possible (no cached master)");
        psync_runid = "?";
        memcpy(psync_offset,"-1",3);
    }

    /* Issue the PSYNC command */
    reply = sendSynchronousCommand(fd,"PSYNC",psync_runid,psync_offset,NULL);
    // ...
}
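
To make the two branches above concrete, here is a minimal standalone sketch of how the request line could be assembled. format_psync_request is a hypothetical helper of my own, not a Redis function, and it assumes the command goes out as a space-separated line terminated by \r\n.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of Redis): writes
 * "PSYNC <runid> <offset>\r\n" into buf.
 * have_cached_master == 0 mimics the "no cached master" branch, which
 * sends "?" and "-1" so that the master answers with a full resync.
 * Returns the number of characters written (excluding the NUL), or -1
 * if buf is too small. */
int format_psync_request(char *buf, size_t buflen, int have_cached_master,
                         const char *cached_runid, long long cached_reploff)
{
    int n;

    if (have_cached_master) {
        /* Ask for the first byte that has NOT been received yet. */
        n = snprintf(buf, buflen, "PSYNC %s %lld\r\n",
                     cached_runid, cached_reploff + 1);
    } else {
        n = snprintf(buf, buflen, "PSYNC ? -1\r\n");
    }
    if (n < 0 || (size_t)n >= buflen) return -1;
    return n;
}
```

With a cached runid of "abc" and a cached reploff of 99, the buffer would hold "PSYNC abc 100\r\n"; without a cached master it would hold "PSYNC ? -1\r\n".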

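On receiving PSYNC, the master replies "+CONTINUE" if it decides an incremental sync is possible, or "+FULLRESYNC" if a full resync is required; anything else is an error. The details come later, in the master section.

1. Only a full resync is possible

First, the case where a full resync is required. The master replies "+FULLRESYNC runid offset" to the slave. Even though the sync is full, the master still tells the slave its runid and its current backlog offset, so that the slave can attempt a partial resync next time. The two sides are, in effect, exchanging state.

After receiving "+FULLRESYNC", the slave stores the runid in server.repl_master_runid and the backlog offset in server.repl_master_initial_offset for later use by partial resync. Once the RDB file has been read, these values are copied into server.master->replrunid and server.master->reploff.

Note that when PSYNC ends in a full resync, the master creates the RDB snapshot on its own; the slave does not need to send SYNC again. See the code below: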

int slaveTryPartialResynchronization(int fd) {
    // ...
    if (!strncmp(reply,"+FULLRESYNC",11)) {
        char *runid = NULL, *offset = NULL;

        /* FULL RESYNC, parse the reply in order to extract the run id
         * and the replication offset. */
        runid = strchr(reply,' ');
        if (runid) {
            runid++;
            offset = strchr(runid,' ');
            if (offset) offset++;
        }
        if (!runid || !offset || (offset-runid-1) != REDIS_RUN_ID_SIZE) {
            redisLog(REDIS_WARNING,
                "Master replied with wrong +FULLRESYNC syntax.");
            /* This is an unexpected condition, actually the +FULLRESYNC
             * reply means that the master supports PSYNC, but the reply
             * format seems wrong. To stay safe we blank the master
             * runid to make sure next PSYNCs will fail. */
            memset(server.repl_master_runid,0,REDIS_RUN_ID_SIZE+1);
        } else {
            memcpy(server.repl_master_runid, runid, offset-runid-1);
            server.repl_master_runid[REDIS_RUN_ID_SIZE] = '\0';
            server.repl_master_initial_offset = strtoll(offset,NULL,10);
            redisLog(REDIS_NOTICE,"Full resync from master: %s:%lld",
                server.repl_master_runid,
                server.repl_master_initial_offset);
        }
        /* We are going to full resync, discard the cached master structure. */
        replicationDiscardCachedMaster();
        sdsfree(reply);
        return PSYNC_FULLRESYNC;
    }
    // ...
}

void readSyncBulkPayload(aeEventLoop *el, int fd, void *privdata, int mask) {
    // ...
        server.master->reploff = server.repl_master_initial_offset;
        memcpy(server.master->replrunid, server.repl_master_runid,
            sizeof(server.repl_master_runid));
    // ...
}
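
The parsing step is easier to experiment with outside the server. The following is a standalone sketch of the same idea; parse_fullresync and RUN_ID_SIZE are names I made up, and the value 40 is my assumption for REDIS_RUN_ID_SIZE.

```c
#include <stdlib.h>
#include <string.h>

#define RUN_ID_SIZE 40  /* assumption: same length as REDIS_RUN_ID_SIZE */

/* Hypothetical standalone parser (not a Redis function): extracts the run id
 * and the offset from a "+FULLRESYNC <runid> <offset>" reply.
 * Returns 0 on success, -1 if the reply does not have the expected shape. */
int parse_fullresync(const char *reply, char runid_out[RUN_ID_SIZE + 1],
                     long long *offset_out)
{
    const char *runid, *offset;

    if (strncmp(reply, "+FULLRESYNC", 11) != 0) return -1;
    runid = strchr(reply, ' ');
    if (runid == NULL) return -1;
    runid++;                              /* first character of the run id */
    offset = strchr(runid, ' ');
    if (offset == NULL) return -1;
    offset++;                             /* first character of the offset */
    /* The run id must be exactly RUN_ID_SIZE characters long. */
    if ((offset - runid - 1) != RUN_ID_SIZE) return -1;
    memcpy(runid_out, runid, RUN_ID_SIZE);
    runid_out[RUN_ID_SIZE] = '\0';
    *offset_out = strtoll(offset, NULL, 10);
    return 0;
}
```

Unlike the excerpt above, this sketch simply reports a syntax error to the caller instead of blanking a stored run id.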

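2. A partial resync is possible

If the master replies "+CONTINUE", a partial resync can proceed. This case is simple: the slave just keeps receiving the data that follows.

3. An error occurs

Here the master may be an old version that does not know PSYNC, or some other error occurred. The slave simply sends SYNC again and does a full resync.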
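
The three cases can be summarized as a small classifier. This is only a sketch: the enum reuses the PSYNC_* names seen in the excerpts, but the numeric values and both function names are my own.

```c
#include <string.h>

/* Result names mirror the PSYNC_* constants in the excerpts; the numeric
 * values here are arbitrary. */
enum psync_result { PSYNC_CONTINUE, PSYNC_FULLRESYNC, PSYNC_NOT_SUPPORTED };

/* Hypothetical helper: map the first reply line from the master to one of
 * the three cases. Anything that is neither +FULLRESYNC nor +CONTINUE is
 * treated as "PSYNC not supported". */
enum psync_result classify_psync_reply(const char *reply)
{
    if (strncmp(reply, "+FULLRESYNC", 11) == 0) return PSYNC_FULLRESYNC;
    if (strncmp(reply, "+CONTINUE", 9) == 0) return PSYNC_CONTINUE;
    return PSYNC_NOT_SUPPORTED;
}

/* Returns 1 only when the slave still has to send a plain SYNC. After
 * +FULLRESYNC the master starts the snapshot by itself, so no SYNC is sent. */
int needs_sync_fallback(enum psync_result r)
{
    return r == PSYNC_NOT_SUPPORTED;
}
```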

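That leaves a few questions. The first: when the connection drops, how does the slave remember the master's runid and its reploff position?

See replicationCacheMaster. When freeClient closes a connection, it checks whether it is the connection to the master; if so, it calls replicationCacheMaster, which caches the current state and disconnects the slaves chained below this slave.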

void replicationCacheMaster(redisClient *c) {
    // ...
    /* Save the master. Server.master will be set to null later by
     * replicationHandleMasterDisconnection(). */
    server.cached_master = server.master;
    // ...
    replicationHandleMasterDisconnection();
}

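The next question is how the slave's c->reploff stays in step with the master, since the two must agree exactly.

Both ends take part. Every byte of the command stream the master sends to the slave is counted on both sides: the slave adds the number of bytes read to the offset in readQueryFromClient, while the master, in replicationFeedSlaves, calls feedReplicationBacklogWithObject, which ultimately calls feedReplicationBacklog to update the offset and the backlog. That function is covered later.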

void readQueryFromClient(aeEventLoop *el, int fd, void *privdata, int mask) {
    // ...
    if (nread) {
        sdsIncrLen(c->querybuf,nread);
        c->lastinteraction = server.unixtime;
        if (c->flags & REDIS_MASTER) c->reploff += nread;
    }
    // ...
}
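
The bookkeeping on the two sides can be modeled with two counters. This is a toy model under my own assumptions (both sides start from the same offset; the struct and function names are invented), not Redis code.

```c
#include <stddef.h>

/* Toy model: each side keeps one byte counter. */
struct master_side { long long master_repl_offset; };
struct slave_side  { long long reploff; };

/* Master propagates len bytes of commands to its slaves. */
void master_feed(struct master_side *m, size_t len)
{
    m->master_repl_offset += (long long)len;
}

/* Slave reads nread bytes from the master connection. */
void slave_read(struct slave_side *s, size_t nread)
{
    s->reploff += (long long)nread;
}

/* Offset the slave would put into PSYNC: the first byte it lacks. */
long long slave_psync_offset(const struct slave_side *s)
{
    return s->reploff + 1;
}

/* Bytes the master has produced that the slave has not seen yet. */
long long bytes_missing(const struct master_side *m,
                        const struct slave_side *s)
{
    return m->master_repl_offset - s->reploff;
}
```

If both sides start at offset 1000, the slave reads the 30 bytes the master feeds, and the link then drops while the master feeds 50 more, the slave is 50 bytes behind and would ask to resume from offset 1031.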

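That concludes the slave side. Next, the master side.

Part 2. The Master Receives and Handles PSYNC

On the master, SYNC and PSYNC are both handled by syncCommand. A piece of code was added to detect PSYNC: if the command is PSYNC, masterTryPartialResynchronization is called to attempt a partial resync; if that is not possible, processing continues as for SYNC, that is, with a full resync. For that path, see "Redis Master-Slave Synchronization Source Code Analysis: Slave Side".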

void syncCommand(redisClient *c) {
    // ...
    /* Try a partial resynchronization if this is a PSYNC command.
     * If it fails, we continue with usual full resynchronization, however
     * when this happens masterTryPartialResynchronization() already
     * replied with:
     *
     * +FULLRESYNC <runid> <offset>
     *
     * So the slave knows the new runid and offset to try a PSYNC later
     * if the connection with the master is lost. */
    if (!strcasecmp(c->argv[0]->ptr,"psync")) {
        if (masterTryPartialResynchronization(c) == REDIS_OK) {
            server.stat_sync_partial_ok++;
            return; /* No full resync needed, return. */
        } else {
            char *master_runid = c->argv[1]->ptr;

            /* Increment stats for failed PSYNCs, but only if the
             * runid is not "?", as this is used by slaves to force a full
             * resync on purpose when they are not able to partially
             * resync. */
            if (master_runid[0] != '?') server.stat_sync_partial_err++;
        }
    } else {
        /* If a slave uses SYNC, we are dealing with an old implementation
         * of the replication protocol (like redis-cli --slave). Flag the client
         * so that we don't expect to receive REPLCONF ACK feedbacks. */
        c->flags |= REDIS_PRE_PSYNC_SLAVE;
    }
    // ...
}

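masterTryPartialResynchronization performs the checks for partial resync.

It first checks whether the runid matches. If not, the master has been restarted (or is a different instance) and a full resync is required, so it jumps to need_full_resync.

If psync_offset lies between server.repl_backlog_off and server.repl_backlog_off + server.repl_backlog_size, the position the slave has reached falls inside our backlog, which means the data it missed is still recorded there. Good: an incremental sync is possible.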

int masterTryPartialResynchronization(redisClient *c) {
    long long psync_offset, psync_len;
    char *master_runid = c->argv[1]->ptr;
    char buf[128];
    int buflen;

    /* Is the runid of this master the same advertised by the wannabe slave
     * via PSYNC? If runid changed this master is a different instance and
     * there is no way to continue. */
    if (strcasecmp(master_runid, server.runid)) {
        /* Run id "?" is used by slaves that want to force a full resync. */
        if (master_runid[0] != '?') {
            redisLog(REDIS_NOTICE,"Partial resynchronization not accepted: "
                "Runid mismatch (Client asked for '%s', I'm '%s')",
                master_runid, server.runid);
        } else {
            redisLog(REDIS_NOTICE,"Full resync requested by slave.");
        }
        goto need_full_resync;
    }

    /* We still have the data our slave is asking for? */
    if (getLongLongFromObjectOrReply(c,c->argv[2],&psync_offset,NULL) !=
       REDIS_OK) goto need_full_resync;
    if (!server.repl_backlog ||
        psync_offset < server.repl_backlog_off ||
        psync_offset >= (server.repl_backlog_off + server.repl_backlog_size))
    /* Author's note: the check on the line above looks slightly off to me.
     * repl_backlog_size should be replaced with repl_backlog_histlen, since
     * the latter is the actual length of the data. */
    {
        redisLog(REDIS_NOTICE,
            "Unable to partial resync with the slave for lack of backlog (Slave request was: %lld).", psync_offset);
        goto need_full_resync;
    }
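
As a sketch of the range check with repl_backlog_histlen in place of repl_backlog_size, as my note in the listing suggests, consider the standalone function below. The struct and function names are hypothetical, and accepting an offset exactly one past the last stored byte (a slave that is already caught up) is my own assumption.

```c
/* Hypothetical, simplified view of the backlog state needed for the check. */
struct backlog_state {
    int has_backlog;            /* non-zero if a backlog buffer exists */
    long long backlog_off;      /* replication offset of the oldest byte kept */
    long long backlog_histlen;  /* number of valid bytes kept */
};

/* Returns 1 if a slave asking to resume at psync_offset can be served from
 * the backlog, 0 if a full resync is needed.
 * The valid bytes cover offsets backlog_off .. backlog_off+histlen-1.
 * Assumption: an offset equal to backlog_off+histlen is also accepted; it
 * means the slave has everything and there is simply nothing to send. */
int can_partial_resync(const struct backlog_state *b, long long psync_offset)
{
    if (!b->has_backlog) return 0;
    if (psync_offset < b->backlog_off) return 0;
    if (psync_offset > b->backlog_off + b->backlog_histlen) return 0;
    return 1;
}
```

For a backlog holding 50 bytes starting at offset 101, requests for offsets 101 through 151 are accepted, while 100 (already overwritten) and 152 (beyond what the master has produced) are rejected.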

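The incremental sync then does the following: it adds the connection to server.slaves, sends "+CONTINUE\r\n" to the slave to tell it "no problem, you can still catch up", and uses addReplyReplicationBacklog to put the data the slave missed into its output buffer.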

    /* If we reached this point, we are able to perform a partial resync:
     * 1) Set client state to make it a slave.
     * 2) Inform the client we can continue with +CONTINUE
     * 3) Send the backlog data (from the offset to the end) to the slave. */
    c->flags |= REDIS_SLAVE;
    c->replstate = REDIS_REPL_ONLINE;
    c->repl_ack_time = server.unixtime;
    listAddNodeTail(server.slaves,c);
    /* We can't use the connection buffers since they are used to accumulate
     * new commands at this stage. But we are sure the socket send buffer is
     * empty so this write will never fail actually. */
    buflen = snprintf(buf,sizeof(buf),"+CONTINUE\r\n");
    if (write(c->fd,buf,buflen) != buflen) {
        freeClientAsync(c);
        return REDIS_OK;
    }
    psync_len = addReplyReplicationBacklog(c,psync_offset);

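After receiving "+CONTINUE\r\n", the slave receives data from the master as in the normal case and advances its c->reploff, and the partial resync is under way. A partial resync is really nothing more than placing the missed data in the output buffer and sending it to the slave.

I will not go into addReplyReplicationBacklog in detail. It handles the circular backlog: it locates the data the slave missed and uses addReplySds to queue it in the slave's output buffer.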
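
For readers who want to see the circular-buffer arithmetic, here is a self-contained sketch of writing to and reading from such a backlog. It illustrates the technique and is not the Redis implementation: the struct, the function names and the tiny 8-byte buffer are all my own choices, and the reader copies into a plain array instead of a client output buffer.

```c
#include <string.h>

#define BACKLOG_SIZE 8   /* tiny on purpose, so wraparound is easy to see */

struct backlog {
    char buf[BACKLOG_SIZE];
    long long idx;            /* next write position inside buf */
    long long histlen;        /* number of valid bytes in buf */
    long long off;            /* replication offset of the oldest valid byte */
    long long master_offset;  /* global replication offset */
};

/* Append len bytes, wrapping around at the end of the buffer. */
void backlog_feed(struct backlog *b, const char *p, size_t len)
{
    b->master_offset += (long long)len;
    while (len) {
        size_t thislen = (size_t)(BACKLOG_SIZE - b->idx);
        if (thislen > len) thislen = len;
        memcpy(b->buf + b->idx, p, thislen);
        b->idx += (long long)thislen;
        if (b->idx == BACKLOG_SIZE) b->idx = 0;
        len -= thislen;
        p += thislen;
        b->histlen += (long long)thislen;
    }
    if (b->histlen > BACKLOG_SIZE) b->histlen = BACKLOG_SIZE;
    b->off = b->master_offset - b->histlen + 1;
}

/* Copy everything from replication offset `offset` to the end of the
 * backlog into out (which must hold BACKLOG_SIZE bytes). The caller must
 * already have checked that offset is inside the backlog.
 * Returns the number of bytes copied. */
long long backlog_read_from(const struct backlog *b, long long offset,
                            char *out)
{
    long long skip, j, len, copied = 0;

    if (b->histlen == 0) return 0;
    skip = offset - b->off;                    /* bytes the slave already has */
    j = (b->idx + (BACKLOG_SIZE - b->histlen)) % BACKLOG_SIZE; /* oldest byte */
    j = (j + skip) % BACKLOG_SIZE;             /* first byte to send */
    len = b->histlen - skip;
    while (len > 0) {
        long long thislen = BACKLOG_SIZE - j;
        if (thislen > len) thislen = len;
        memcpy(out + copied, b->buf + j, (size_t)thislen);
        copied += thislen;
        len -= thislen;
        j = 0;                                 /* continue from the start */
    }
    return copied;
}
```

Feeding "abcdef" and then "ghij" into an empty backlog overwrites the two oldest bytes, so the oldest retained offset becomes 3; reading from offset 7 then yields the four bytes "ghij", two from the end of the array and two from its start.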

If a partial resync is not possible and only a full resync will do, the master also sends its current state to the slave, using the "+FULLRESYNC %s %lld\r\n" reply, as in the code below.

int masterTryPartialResynchronization(redisClient *c) {
    // ...
need_full_resync:
    /* We need a full resync for some reason... notify the client. */
    psync_offset = server.master_repl_offset;
    /* Add 1 to psync_offset if the replication backlog does not exist
     * as when it will be created later we'll increment the offset by one. */
    if (server.repl_backlog == NULL) psync_offset++;
    /* Again, we can't use the connection buffers (see above). */
    buflen = snprintf(buf,sizeof(buf),"+FULLRESYNC %s %lld\r\n",
                      server.runid,psync_offset);
    if (write(c->fd,buf,buflen) != buflen) {
        freeClientAsync(c);
        return REDIS_OK;
    }
    return REDIS_ERR;
}

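That is basically everything. As for addReplyReplicationBacklog, its job is to send the backlog data the slave missed to the slave. Its code is similar to feedReplicationBacklog, whose job is to write data into the backlog, so I will only use feedReplicationBacklog as the example for explaining the backlog.

The Redis backlog is a circular buffer, unlike MySQL's. Its data is stored in the buffer that server.repl_backlog points to. The key variables are described below: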

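struct redisServer {
    // ...
    // master_repl_offset records the position of all data the master has sent to its slaves, in bytes. Each propagated command advances it by the command's length.
    long long master_repl_offset;   /* Global replication offset */
    // The backlog data is stored in this array; its size is set by the repl-backlog-size configuration option.
    char *repl_backlog;             /* Replication backlog for partial syncs */
    // The configured repl-backlog-size, i.e. the size of the backlog buffer. The default is REDIS_DEFAULT_REPL_BACKLOG_SIZE = 1 MB, which is rather small.
    long long repl_backlog_size;    /* Backlog circular buffer size */
    // Amount of valid data in the backlog buffer. Initially smaller than repl_backlog_size; once the buffer fills up it stays equal to repl_backlog_size.
    long long repl_backlog_histlen; /* Backlog actual data length */
    // Current write position inside the backlog array, counted from 0: the next byte is written here. Once the buffer has wrapped around, this is also where the oldest valid byte sits.
    long long repl_backlog_idx;     /* Backlog circular buffer current offset */
    // Replication offset of the oldest byte still held in the backlog, computed as master_repl_offset - repl_backlog_histlen + 1.
    // If the offset a slave sends is not smaller than this value, the data it missed is still in the backlog; otherwise the slave has been disconnected for too long and the history is gone.
    long long repl_backlog_off;     /* Replication offset of first byte in the
                                       backlog buffer. */
    // ...
};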

With the field descriptions above, you can probably guess how feedReplicationBacklog writes the latest data destined for the slaves into the backlog.

/* Add data to the replication backlog.
 * This function also increments the global replication offset stored at
 * server.master_repl_offset, because there is no case where we want to feed
 * the backlog without incrementing the buffer. */
void feedReplicationBacklog(void *ptr, size_t len) {
    unsigned char *p = ptr;

    server.master_repl_offset += len;

    /* This is a circular buffer, so write as much data we can at every
     * iteration and rewind the "idx" index if we reach the limit. */
    while(len) {
        size_t thislen = server.repl_backlog_size - server.repl_backlog_idx;
        if (thislen > len) thislen = len;
        memcpy(server.repl_backlog+server.repl_backlog_idx,p,thislen);
        server.repl_backlog_idx += thislen;
        if (server.repl_backlog_idx == server.repl_backlog_size)
            server.repl_backlog_idx = 0;
        len -= thislen;
        p += thislen;
        server.repl_backlog_histlen += thislen;
    }
    if (server.repl_backlog_histlen > server.repl_backlog_size)
        server.repl_backlog_histlen = server.repl_backlog_size;
    /* Set the offset of the first byte we have in the backlog. */
    server.repl_backlog_off = server.master_repl_offset -
                              server.repl_backlog_histlen + 1;
}

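Note the update of server.master_repl_offset at the top of the function and the setting of server.repl_backlog_off at the end: master_repl_offset - repl_backlog_off + 1 equals the length of the valid data in the backlog. Every command the master sends to its slaves, apart from the basic handshake commands of the sync itself, increases this counter.

Likewise, on the slave, every piece of data received from the master increases c->reploff in readQueryFromClient. This keeps the master's and the slave's offsets consistent, which is what makes communicating through the backlog reliable.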

Part 3. Summary

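Very nice: master and slave can now resync partially, which removes the drawback of a full resync on every dropped connection.

Redis 2.8 has not been released yet, so this is only a preview of the feature. I expect the release this quarter; the code is already fairly stable.

There is one more feature that I personally think would be very useful: partial resync after a slave restart. At present a restarted slave must do a full resync. One possible implementation is to write an extra file whenever the RDB or AOF file is written, recording the runid and the c->reploff value held by the slave at that moment, and to read it back at startup.

By the same reasoning, the master could support restarting without forcing every slave to resync all data. It should not be hard to implement: persist the runid, offset and related data in a similar way. Otherwise, restarts are simply unavoidable in production.

With that feature it would be perfect. It is not hard to implement, but it will most likely not happen...

Some other day I will look at a feature Redis is currently developing: Redis Cluster.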

0 0