Redis Master-Slave Replication


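Redis supports simple, easy-to-use master-slave replication, which lets a slave server become an exact copy of a master server.
The key points about Redis replication are:
• Redis uses asynchronous replication. Starting with Redis 2.8, a slave reports its progress in processing the replication stream to the master once per second.
• A master can have multiple slaves.
• Slaves can have slaves of their own, so a group of servers can form a graph-like structure.
• Replication does not block the master: the master keeps serving commands even while one or more slaves are performing their initial synchronization.
• Replication does not block the slave either: with the appropriate setting in redis.conf (slave-serve-stale-data), a slave can answer queries from its old dataset while the initial synchronization is in progress. However, during the window in which the slave deletes the old dataset and loads the new one, incoming connections are blocked. A slave can also be configured to return an error to clients while its link to the master is down.
• Replication can be used purely for data redundancy, or for scalability by letting multiple slaves serve read-only commands. For example, heavy SORT commands can be sent to slaves.
• Replication can relieve the master of persistence work: disable persistence on the master and let a slave persist the data instead.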
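The settings mentioned in the list above can be sketched as a redis.conf fragment. This is only an illustration that assumes the Redis 2.8-era directive names; the address, port, and sizes are placeholder values, not recommendations.

```
# --- on the slave ---
# Replicate from this master (placeholder address).
slaveof 192.0.2.10 6379
# "yes": keep answering queries from the old dataset while syncing or
# while the link is down. "no": reply with an error instead.
slave-serve-stale-data yes
# Give up on the master after this many seconds without data or PING.
repl-timeout 60

# --- on the master ---
# Size of the in-memory backlog used for partial resynchronization.
repl-backlog-size 1mb
# Disable RDB snapshots and AOF if a slave handles persistence.
save ""
appendonly no
```

Comments are kept on their own lines because redis.conf treats everything on a directive line as arguments.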

How replication works

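Whether it is connecting for the first time or reconnecting, a newly established slave sends a SYNC command to the master.
On receiving SYNC, the master starts a BGSAVE and, while the save is running, buffers every new write command it executes. This buffer is the output buffer of the slave's client on the master, not repl_backlog: repl_backlog is the circular buffer that serves partial resynchronization, described below.
When the BGSAVE finishes, the master sends the resulting .rdb file to the slave, which receives it and loads the data into memory.
The master then sends everything accumulated in the write-command buffer to the slave, in the Redis command protocol format.

Even if several slaves send SYNC at the same time, the master needs to run BGSAVE only once to serve all of them.
A slave reconnects automatically when its link to the master drops. Before Redis 2.8, a reconnecting slave always had to perform a full resynchronization. Starting with Redis 2.8, the slave performs either a full or a partial resynchronization, depending on the state of the master.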

Partial resynchronization

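Starting with Redis 2.8, after a brief network failure the master and slave can try to continue the existing replication process instead of always performing a full resynchronization.
This feature requires the master to keep an in-memory backlog of the replication stream it has sent, and the master and every slave to keep track of a replication offset and a master run ID. When the link drops, the slave reconnects and asks the master to continue the previous replication process:
• If the master run ID recorded by the slave matches the run ID of the master it is connecting to, and the data at the slave's recorded offset is still in the master's backlog, the master sends the slave the data it missed while disconnected, and replication continues.
• Otherwise, the slave must perform a full resynchronization.
Partial resynchronization in Redis 2.8 relies on a new internal command, PSYNC, while versions before 2.8 have only SYNC. A slave running Redis 2.8 or later chooses between the two based on the master's version:
• If the master runs Redis 2.8 or later, the slave synchronizes with PSYNC.
• If the master runs a version earlier than 2.8, the slave synchronizes with SYNC.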
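The two conditions for a partial resynchronization can be sketched as a small predicate. This is a simplified illustration, not the Redis implementation: `MasterState` and `can_partial_resync` are hypothetical names, and the window check assumes the backlog holds `backlog_histlen` bytes starting at replication offset `backlog_off`.

```c
#include <string.h>

#define RUN_ID_SIZE 40

/* Hypothetical, simplified view of the master-side state. */
typedef struct {
    char runid[RUN_ID_SIZE + 1]; /* this master's run id */
    long long backlog_off;       /* replication offset of the first byte in the backlog */
    long long backlog_histlen;   /* number of valid bytes currently in the backlog */
    int has_backlog;             /* 0 if no backlog has been allocated */
} MasterState;

/* Returns 1 if "PSYNC <runid> <offset>" can be served from the backlog,
 * 0 if the slave has to fall back to a full resynchronization. */
int can_partial_resync(const MasterState *m, const char *runid, long long offset) {
    /* The slave must have been replicating from this very master instance. */
    if (strcmp(m->runid, runid) != 0) return 0;
    if (!m->has_backlog) return 0;
    /* The requested byte was already overwritten in the backlog. */
    if (offset < m->backlog_off) return 0;
    /* The slave asks for data beyond what the master has produced. */
    if (offset > m->backlog_off + m->backlog_histlen) return 0;
    return 1;
}
```

A slave without any recorded master sends the run ID "?" with offset -1, which never matches and therefore always leads to a full resynchronization.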

Implementation of replication

Before going into the implementation details, here are the relevant structures and fields.

struct redisServer {
    /* ... other fields omitted ... */

    /* Replication (master) */
    // DB selected by the last SELECT written to the replication stream
    int slaveseldb;                 /* Last SELECTed DB in replication output */

    // Global replication offset
    long long master_repl_offset;   /* Global replication offset */

    // Interval of the master -> slave heartbeat
    int repl_ping_slave_period;     /* Master pings the slave every N seconds */

    // Pointer to the backlog
    char *repl_backlog;             /* Replication backlog for partial syncs */

    // Size of the backlog
    long long repl_backlog_size;    /* Backlog circular buffer size */

    // Number of bytes of valid data currently held in the backlog
    long long repl_backlog_histlen; /* Backlog actual data length */

    // Position inside the buffer where the next write will start
    long long repl_backlog_idx;     /* Backlog circular buffer current offset */

    // Offset of the oldest byte in the backlog, expressed as a global
    // replication offset rather than a position inside the buffer
    long long repl_backlog_off;     /* Replication offset of first byte in the
                                       backlog buffer. */

    // How long the backlog is kept when no slave is connected
    time_t repl_backlog_time_limit; /* Time without slaves after the backlog
                                       gets released. */
};

struct redisClient {
    /* ... other fields omitted ... */

    // Client state flags
    int flags;              /* REDIS_SLAVE | REDIS_MONITOR | REDIS_MULTI ... */

    // Authentication state, used when server.requirepass is not NULL:
    // 0 means not authenticated, 1 means authenticated
    int authenticated;      /* when requirepass is non-NULL */

    // Replication state
    int replstate;          /* replication state if this is a slave */
    // File descriptor of the RDB file being sent to this slave
    int repldbfd;           /* replication DB file descriptor */

    // Current offset inside that RDB file
    off_t repldboff;        /* replication DB file offset */
    // Size of that RDB file
    off_t repldbsize;       /* replication DB file size */

    sds replpreamble;       /* replication DB preamble. */

    // Replication offset of the master
    long long reploff;      /* replication offset if this is our master */
    // Offset reported by the slave in its last REPLCONF ACK
    long long repl_ack_off; /* replication ack offset, if this is a slave */
    // Time of the slave's last REPLCONF ACK
    long long repl_ack_time;/* replication ack time, if this is a slave */
    // Run ID of the master, kept on the client so that a partial
    // resynchronization can be attempted later
    char replrunid[REDIS_RUN_ID_SIZE+1]; /* master run id if this is a master */
    // Port the slave is listening on
    int slave_listening_port; /* As configured with: REPLCONF listening-port */
};

These fields appear again in the code analysis below.
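To make the four backlog fields concrete, here is a minimal sketch of how a circular backlog could be fed. It is modelled on the field meanings listed above; `Backlog`, `backlog_init`, and `backlog_feed` are hypothetical names, and the convention that the first replicated byte has offset 1 is an assumption of this sketch.

```c
#include <string.h>

/* Hypothetical stand-in for the repl_backlog_* fields of redisServer. */
typedef struct {
    char *buf;               /* storage (repl_backlog) */
    long long size;          /* capacity (repl_backlog_size) */
    long long idx;           /* next write position (repl_backlog_idx) */
    long long histlen;       /* valid bytes stored (repl_backlog_histlen) */
    long long off;           /* global offset of oldest byte (repl_backlog_off) */
    long long master_offset; /* global offset of newest byte (master_repl_offset) */
} Backlog;

void backlog_init(Backlog *b, char *storage, long long size) {
    b->buf = storage;
    b->size = size;
    b->idx = 0;
    b->histlen = 0;
    b->master_offset = 0;
    b->off = 1; /* assumption: the first byte ever written has offset 1 */
}

/* Append len bytes, wrapping around and overwriting the oldest data. */
void backlog_feed(Backlog *b, const char *p, long long len) {
    b->master_offset += len;
    while (len > 0) {
        long long chunk = b->size - b->idx; /* room until the end of the buffer */
        if (chunk > len) chunk = len;
        memcpy(b->buf + b->idx, p, (size_t)chunk);
        b->idx += chunk;
        if (b->idx == b->size) b->idx = 0;  /* wrap around */
        p += chunk;
        len -= chunk;
        b->histlen += chunk;
    }
    if (b->histlen > b->size) b->histlen = b->size;
    /* Oldest byte still available, as a global replication offset. */
    b->off = b->master_offset - b->histlen + 1;
}
```

With an 8-byte buffer, feeding 5 bytes and then 6 more leaves the last 8 bytes in the buffer, with `idx` at 3 and `off` at 4: a slave asking for offset 3 or lower can no longer be served from the backlog.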

Slave-side replication

When a client sends SLAVEOF ip port to a server, the server runs the following command handler to put itself into the slave state:

void slaveofCommand(redisClient *c) {
    /* SLAVEOF is not allowed in cluster mode as replication is automatically
     * configured using the current address of the master node. */
    if (server.cluster_enabled) {
        addReplyError(c,"SLAVEOF not allowed in cluster mode.");
        return;
    }

    /* The special host/port combination "NO" "ONE" turns the instance
     * into a master. Otherwise the new master address is set. */
    if (!strcasecmp(c->argv[1]->ptr,"no") &&
        !strcasecmp(c->argv[2]->ptr,"one")) {
        if (server.masterhost) {
            // Stop replicating and become a master
            replicationUnsetMaster();
            redisLog(REDIS_NOTICE,"MASTER MODE enabled (user request)");
        }
    } else {
        long port;

        // Parse the port argument
        if ((getLongFromObjectOrReply(c, c->argv[2], &port, NULL) != REDIS_OK))
            return;

        /* Check if we are already attached to the specified master */
        // If host and port match the current master, reply +OK and do nothing else
        if (server.masterhost && !strcasecmp(server.masterhost,c->argv[1]->ptr)
            && server.masterport == port) {
            redisLog(REDIS_NOTICE,"SLAVE OF would result into synchronization with the master we are already connected with. No operation performed.");
            addReplySds(c,sdsnew("+OK Already connected to specified master\r\n"));
            return;
        }

        /* There was no previous master or the user specified a different one,
         * we can continue. */
        // Start replicating from the new master
        replicationSetMaster(c->argv[1]->ptr, port);
        redisLog(REDIS_NOTICE,"SLAVE OF %s:%d enabled (user request)",
            server.masterhost, server.masterport);
    }
    addReply(c,shared.ok);
}

/* Set replication to the specified master address and port. */
void replicationSetMaster(char *ip, int port) {

    // Free the previous master address, if any
    sdsfree(server.masterhost);

    // Record the new IP and port
    server.masterhost = sdsnew(ip);
    server.masterport = port;

    // Clear any state left over from a previous master

    // If there is a client for a previous master, free it
    if (server.master) freeClient(server.master);
    // Disconnect all slaves so that they resynchronize with us
    disconnectSlaves(); /* Force our slaves to resync with us as well. */
    // Drop the cached master: PSYNC with it is no longer possible
    replicationDiscardCachedMaster(); /* Don't try a PSYNC. */
    // Release the backlog for the same reason
    freeReplicationBacklog(); /* Don't allow our chained slaves to PSYNC. */
    // Cancel a replication handshake in progress, if any
    cancelReplicationHandshake();

    // Enter the CONNECT state (this is the key step)
    server.repl_state = REDIS_REPL_CONNECT;
    server.master_repl_offset = 0;
    server.repl_down_since = 0;
}

This function sets repl_state to REDIS_REPL_CONNECT.

After that, serverCron calls replicationCron once per second. replicationCron checks whether repl_state is REDIS_REPL_CONNECT and, if so, connects to the master.

/* --------------------------- REPLICATION CRON  ---------------------------- */

/* Replication cron function, called 1 time per second. */
void replicationCron(void) {

    /* Non blocking connection timeout? */
    // The connection attempt to the master timed out
    if (server.masterhost &&
        (server.repl_state == REDIS_REPL_CONNECTING ||
         server.repl_state == REDIS_REPL_RECEIVE_PONG) &&
        (time(NULL)-server.repl_transfer_lastio) > server.repl_timeout)
    {
        redisLog(REDIS_WARNING,"Timeout connecting to the MASTER...");
        // Abort the connection attempt
        undoConnectWithMaster();
    }

    /* Bulk transfer I/O timeout? */
    // The RDB transfer timed out
    if (server.masterhost && server.repl_state == REDIS_REPL_TRANSFER &&
        (time(NULL)-server.repl_transfer_lastio) > server.repl_timeout)
    {
        redisLog(REDIS_WARNING,"Timeout receiving bulk data from MASTER... If the problem persists try to set the 'repl-timeout' parameter in redis.conf to a larger value.");
        // Stop the transfer and delete the temp file
        replicationAbortSyncTransfer();
    }

    /* Timed out master when we are an already connected slave? */
    // The slave was connected to the master, but the link has timed out
    if (server.masterhost && server.repl_state == REDIS_REPL_CONNECTED &&
        (time(NULL)-server.master->lastinteraction) > server.repl_timeout)
    {
        redisLog(REDIS_WARNING,"MASTER timeout: no data nor PING received...");
        // Free the master client
        freeClient(server.master);
    }

    /* Check if we should connect to a MASTER */
    if (server.repl_state == REDIS_REPL_CONNECT) {
        redisLog(REDIS_NOTICE,"Connecting to MASTER %s:%d",
            server.masterhost, server.masterport);
        if (connectWithMaster() == REDIS_OK) {
            redisLog(REDIS_NOTICE,"MASTER <-> SLAVE sync started");
        }
    }

    // ... the rest of the function is omitted ...
}

connectWithMaster establishes the connection to the master:

// Connect to the master in non-blocking mode
int connectWithMaster(void) {
    int fd;

    // Start the connection
    fd = anetTcpNonBlockConnect(NULL,server.masterhost,server.masterport);
    if (fd == -1) {
        redisLog(REDIS_WARNING,"Unable to connect to MASTER: %s",
            strerror(errno));
        return REDIS_ERR;
    }

    // Watch the fd for both readable and writable events, with
    // syncWithMaster as the handler
    if (aeCreateFileEvent(server.el,fd,AE_READABLE|AE_WRITABLE,syncWithMaster,NULL) ==
            AE_ERR)
    {
        close(fd);
        redisLog(REDIS_WARNING,"Can't create readable event for SYNC");
        return REDIS_ERR;
    }

    // Initialize the bookkeeping fields
    server.repl_transfer_lastio = server.unixtime;
    server.repl_transfer_s = fd;

    // The connection is in progress, not yet established
    server.repl_state = REDIS_REPL_CONNECTING;

    return REDIS_OK;
}

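The function first starts the connection to the master and obtains the connection fd.

It then registers the fd's readable and writable events with the event loop, with syncWithMaster as the callback.

Finally, it sets repl_state to REDIS_REPL_CONNECTING.

The syncWithMaster callback registered above runs once the socket becomes ready. It is analyzed below.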

// Callback used by the slave to synchronize with the master
void syncWithMaster(aeEventLoop *el, int fd, void *privdata, int mask) {
    char tmpfile[256], *err;
    int dfd, maxtries = 5;
    int sockerr = 0, psync_result;
    socklen_t errlen = sizeof(sockerr);
    REDIS_NOTUSED(el);
    REDIS_NOTUSED(privdata);
    REDIS_NOTUSED(mask);

    /* If this event fired after the user turned the instance into a master
     * with SLAVEOF NO ONE we must just return ASAP. */
    if (server.repl_state == REDIS_REPL_NONE) {
        close(fd);
        return;
    }

    /* Check for errors in the socket. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &sockerr, &errlen) == -1)
        sockerr = errno;
    if (sockerr) {
        aeDeleteFileEvent(server.el,fd,AE_READABLE|AE_WRITABLE);
        redisLog(REDIS_WARNING,"Error condition on socket for SYNC: %s",
            strerror(sockerr));
        goto error;
    }

    /* If we were connecting, it's time to send a non blocking PING, we want to
     * make sure the master is able to reply before going into the actual
     * replication process where we have long timeouts in the order of
     * seconds (in the meantime the slave would block). */
    // The RDB transfer that follows is expensive, so first confirm that
    // the master is really reachable by sending it a PING
    if (server.repl_state == REDIS_REPL_CONNECTING) {
        redisLog(REDIS_NOTICE,"Non blocking connect for SYNC fired the event.");
        /* Delete the writable event so that the readable event remains
         * registered and we can wait for the PONG reply. */
        aeDeleteFileEvent(server.el,fd,AE_WRITABLE);
        // Update the state
        server.repl_state = REDIS_REPL_RECEIVE_PONG;
        /* Send the PING, don't check for errors at all, we have the timeout
         * that will take care about this. */
        syncWrite(fd,"PING\r\n",6,100);

        // Return and wait for the PONG to arrive
        return;
    }

    /* Receive the PONG command. */
    if (server.repl_state == REDIS_REPL_RECEIVE_PONG) {
        char buf[1024];

        /* Delete the readable event, we no longer need it now that there is
         * the PING reply to read. */
        aeDeleteFileEvent(server.el,fd,AE_READABLE);

        /* Read the reply with explicit timeout. */
        buf[0] = '\0';
        if (syncReadLine(fd,buf,sizeof(buf),
            server.repl_syncio_timeout*1000) == -1)
        {
            redisLog(REDIS_WARNING,
                "I/O error reading PING reply from master: %s",
                strerror(errno));
            goto error;
        }

        /* We accept only two replies as valid, a positive +PONG reply
         * (we just check for "+") or an authentication error.
         * Note that older versions of Redis replied with "operation not
         * permitted" instead of using a proper error code, so we test
         * both. */
        if (buf[0] != '+' &&
            strncmp(buf,"-NOAUTH",7) != 0 &&
            strncmp(buf,"-ERR operation not permitted",28) != 0)
        {
            // Neither +PONG nor an authentication error: give up
            redisLog(REDIS_WARNING,"Error reply to PING from master: '%s'",buf);
            goto error;
        } else {
            // +PONG, or an authentication error that AUTH below will resolve
            redisLog(REDIS_NOTICE,
                "Master replied to PING, replication can continue...");
        }
    }

    /* AUTH with the master if required. */
    if(server.masterauth) {
        err = sendSynchronousCommand(fd,"AUTH",server.masterauth,NULL);
        if (err[0] == '-') {
            redisLog(REDIS_WARNING,"Unable to AUTH to MASTER: %s",err);
            sdsfree(err);
            goto error;
        }
        sdsfree(err);
    }

    /* Set the slave port, so that Master's INFO command can list the
     * slave listening port correctly. */
    {
        sds port = sdsfromlonglong(server.port);
        err = sendSynchronousCommand(fd,"REPLCONF","listening-port",port,
                                         NULL);
        sdsfree(port);
        /* Ignore the error if any, not all the Redis versions support
         * REPLCONF listening-port. */
        if (err[0] == '-') {
            redisLog(REDIS_NOTICE,"(Non critical) Master does not understand REPLCONF listening-port: %s", err);
        }
        sdsfree(err);
    }

    /* Try a partial resynchronization. If we don't have a cached master
     * slaveTryPartialResynchronization() will at least try to use PSYNC
     * to start a full resynchronization so that we get the master run id
     * and the global offset, to try a partial resync at the next
     * reconnection attempt. */
    // The result decides between a partial and a full resync
    psync_result = slaveTryPartialResynchronization(fd);

    // The master accepted a partial resync
    if (psync_result == PSYNC_CONTINUE) {
        redisLog(REDIS_NOTICE, "MASTER <-> SLAVE sync: Master accepted a Partial Resynchronization.");
        return;
    }

    /* Fall back to SYNC if needed. Otherwise psync_result == PSYNC_FULLRESYNC
     * and the server.repl_master_runid and repl_master_initial_offset are
     * already populated. */
    // The master does not support PSYNC, so send SYNC
    if (psync_result == PSYNC_NOT_SUPPORTED) {
        redisLog(REDIS_NOTICE,"Retrying with SYNC...");
        if (syncWrite(fd,"SYNC\r\n",6,server.repl_syncio_timeout*1000) == -1) {
            redisLog(REDIS_WARNING,"I/O error writing to MASTER: %s",
                strerror(errno));
            goto error;
        }
    }

    // At this point psync_result is PSYNC_FULLRESYNC or PSYNC_NOT_SUPPORTED

    /* Prepare a suitable temp file for bulk transfer */
    // The RDB data sent by the master is written to this file
    while(maxtries--) {
        snprintf(tmpfile,256,
            "temp-%d.%ld.rdb",(int)server.unixtime,(long int)getpid());
        dfd = open(tmpfile,O_CREAT|O_WRONLY|O_EXCL,0644);
        if (dfd != -1) break;
        sleep(1);
    }
    if (dfd == -1) {
        redisLog(REDIS_WARNING,"Opening the temp file needed for MASTER <-> SLAVE synchronization: %s",strerror(errno));
        goto error;
    }

    /* Setup the non blocking download of the bulk file. */
    // Install a read handler that receives the RDB file from the master
    if (aeCreateFileEvent(server.el,fd, AE_READABLE,readSyncBulkPayload,NULL)
            == AE_ERR)
    {
        redisLog(REDIS_WARNING,
            "Can't create readable event for SYNC: %s (fd=%d)",
            strerror(errno),fd);
        goto error;
    }

    // Update the state
    server.repl_state = REDIS_REPL_TRANSFER;

    // Reset the transfer bookkeeping
    server.repl_transfer_size = -1;
    server.repl_transfer_read = 0;
    server.repl_transfer_last_fsync_off = 0;
    server.repl_transfer_fd = dfd;
    server.repl_transfer_lastio = server.unixtime;
    server.repl_transfer_tmpfile = zstrdup(tmpfile);

    return;

error:
    close(fd);
    server.repl_transfer_s = -1;
    server.repl_state = REDIS_REPL_CONNECT;
    return;
}

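The function does the following:

·Checks the socket for errors

·If repl_state is REDIS_REPL_CONNECTING, sets repl_state to REDIS_REPL_RECEIVE_PONG, sends a PING to the master to make sure it is reachable before the RDB transfer starts, and returns to wait for the reply

·Reads the reply to PING synchronously: +PONG means replication can continue, an authentication error (-NOAUTH) is also accepted because the AUTH step follows, and any other reply is treated as an error

·Authenticates with AUTH if masterauth is set, then reports the slave's listening port with REPLCONF listening-port

·Tries a partial resynchronization: psync_result = slaveTryPartialResynchronization(fd);

·If psync_result is PSYNC_CONTINUE, the master accepted the partial resynchronization and the function returns. If it is PSYNC_NOT_SUPPORTED, the master does not understand PSYNC, so the slave falls back to a full synchronization by synchronously sending SYNC. If it is PSYNC_FULLRESYNC, the master has already agreed to a full resynchronization and nothing more needs to be sent

·Opens a temporary file to store the RDB data sent by the master

·Registers the readable event of the connection fd with the event loop, with readSyncBulkPayload() as the callback

·Sets repl_state to REDIS_REPL_TRANSFER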
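The slave-side states touched so far can be summarized as a small state machine. This is a sketch for orientation only: the `ReplState` values mirror the REDIS_REPL_* constants in the listings, but the `ReplEvent` names and the `repl_next` function are hypothetical and do not exist in Redis.

```c
/* Mirrors the REDIS_REPL_* states used in the listings. */
typedef enum {
    REPL_NONE,          /* not a slave */
    REPL_CONNECT,       /* must connect to the master */
    REPL_CONNECTING,    /* non-blocking connect in progress */
    REPL_RECEIVE_PONG,  /* PING sent, waiting for the reply */
    REPL_TRANSFER,      /* receiving the .rdb file */
    REPL_CONNECTED      /* receiving the command stream */
} ReplState;

/* Hypothetical event names, introduced only for this sketch. */
typedef enum {
    EV_SLAVEOF,            /* SLAVEOF host port */
    EV_SLAVEOF_NO_ONE,     /* SLAVEOF NO ONE */
    EV_CONNECT_STARTED,    /* replicationCron called connectWithMaster */
    EV_SOCKET_READY,       /* syncWithMaster fired, PING sent */
    EV_PSYNC_CONTINUE,     /* master accepted a partial resync */
    EV_FULL_SYNC_STARTED,  /* full resync, temp file and handler ready */
    EV_RDB_LOADED,         /* readSyncBulkPayload finished loading */
    EV_ERROR_OR_TIMEOUT    /* any I/O error or timeout */
} ReplEvent;

/* Returns the state that follows s when event e occurs. */
ReplState repl_next(ReplState s, ReplEvent e) {
    if (e == EV_SLAVEOF_NO_ONE) return REPL_NONE;
    if (e == EV_SLAVEOF) return REPL_CONNECT;
    if (s == REPL_NONE) return REPL_NONE;
    /* Failures send the slave back to the start of the handshake. */
    if (e == EV_ERROR_OR_TIMEOUT) return REPL_CONNECT;
    switch (s) {
    case REPL_CONNECT:
        if (e == EV_CONNECT_STARTED) return REPL_CONNECTING;
        break;
    case REPL_CONNECTING:
        if (e == EV_SOCKET_READY) return REPL_RECEIVE_PONG;
        break;
    case REPL_RECEIVE_PONG:
        if (e == EV_PSYNC_CONTINUE) return REPL_CONNECTED;
        if (e == EV_FULL_SYNC_STARTED) return REPL_TRANSFER;
        break;
    case REPL_TRANSFER:
        if (e == EV_RDB_LOADED) return REPL_CONNECTED;
        break;
    default:
        break;
    }
    return s; /* events that do not apply leave the state unchanged */
}
```

The notable design point is that every failure path returns to REPL_CONNECT, so the once-per-second replicationCron simply retries the whole handshake.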

The function above calls slaveTryPartialResynchronization to attempt a partial resynchronization. Its code, together with the helper it uses, is as follows:

int slaveTryPartialResynchronization(int fd) {
    char *psync_runid;
    char psync_offset[32];
    sds reply;

    /* Initially set repl_master_initial_offset to -1 to mark the current
     * master run_id and offset as not valid. Later if we'll be able to do
     * a FULL resync using the PSYNC command we'll set the offset at the
     * right value, so that this information will be propagated to the
     * client structure representing the master into server.master. */
    server.repl_master_initial_offset = -1;

    if (server.cached_master) {
        // A cached master exists, so try a partial resync. (When the link
        // to the master is lost, the master client, including its run id,
        // is kept in cached_master for use at the next reconnection.)
        // The command is "PSYNC <master_run_id> <repl_offset>"
        psync_runid = server.cached_master->replrunid;
        snprintf(psync_offset,sizeof(psync_offset),"%lld", server.cached_master->reploff+1);
        redisLog(REDIS_NOTICE,"Trying a partial resynchronization (request %s:%s).", psync_runid, psync_offset);
    } else {
        // No cached master: send "PSYNC ? -1" to request a full resync
        redisLog(REDIS_NOTICE,"Partial resynchronization not possible (no cached master)");
        psync_runid = "?";
        memcpy(psync_offset,"-1",3);
    }

    /* Issue the PSYNC command */
    reply = sendSynchronousCommand(fd,"PSYNC",psync_runid,psync_offset,NULL);

    // +FULLRESYNC: perform a full resync
    if (!strncmp(reply,"+FULLRESYNC",11)) {
        char *runid = NULL, *offset = NULL;

        /* FULL RESYNC, parse the reply in order to extract the run id
         * and the replication offset. */
        runid = strchr(reply,' ');
        if (runid) {
            runid++;
            offset = strchr(runid,' ');
            if (offset) offset++;
        }
        // Validate the run id
        if (!runid || !offset || (offset-runid-1) != REDIS_RUN_ID_SIZE) {
            redisLog(REDIS_WARNING,
                "Master replied with wrong +FULLRESYNC syntax.");
            /* This is an unexpected condition, actually the +FULLRESYNC
             * reply means that the master supports PSYNC, but the reply
             * format seems wrong. To stay safe we blank the master
             * runid to make sure next PSYNCs will fail. */
            memset(server.repl_master_runid,0,REDIS_RUN_ID_SIZE+1);
        } else {
            // Save the run id
            memcpy(server.repl_master_runid, runid, offset-runid-1);
            server.repl_master_runid[REDIS_RUN_ID_SIZE] = '\0';
            // ... and the initial offset
            server.repl_master_initial_offset = strtoll(offset,NULL,10);
            redisLog(REDIS_NOTICE,"Full resync from master: %s:%lld",
                server.repl_master_runid,
                server.repl_master_initial_offset);
        }
        /* We are going to full resync, discard the cached master structure. */
        replicationDiscardCachedMaster();
        sdsfree(reply);

        return PSYNC_FULLRESYNC;
    }

    // +CONTINUE: perform a partial resync
    if (!strncmp(reply,"+CONTINUE",9)) {
        /* Partial resync was accepted, set the replication state accordingly */
        redisLog(REDIS_NOTICE,
            "Successful partial resynchronization with master.");
        sdsfree(reply);
        // Make the cached master the current master. This also installs
        // the read handler that receives the commands the master buffered.
        replicationResurrectCachedMaster(fd);

        return PSYNC_CONTINUE;
    }

    /* If we reach this point we received either an error since the master does
     * not understand PSYNC, or an unexpected reply from the master.
     * Return PSYNC_NOT_SUPPORTED to the caller in both cases. */

    if (strncmp(reply,"-ERR",4)) {
        /* If it's not an error, log the unexpected event. */
        redisLog(REDIS_WARNING,
            "Unexpected reply to PSYNC from master: %s", reply);
    } else {
        redisLog(REDIS_NOTICE,
            "Master does not support PSYNC or is in "
            "error state (reply: %s)", reply);
    }
    sdsfree(reply);
    replicationDiscardCachedMaster();

    // The master does not support PSYNC
    return PSYNC_NOT_SUPPORTED;
}

/* Turn the cached master into the current master, using the file descriptor
 * passed as argument as the socket for the new master.
 *
 * This function is called when successfully setup a partial resynchronization
 * so the stream of data that we'll receive will start from where this
 * master left. */
void replicationResurrectCachedMaster(int newfd) {

    // Promote the cached master to the current master
    server.master = server.cached_master;
    server.cached_master = NULL;

    server.master->fd = newfd;

    server.master->flags &= ~(REDIS_CLOSE_AFTER_REPLY|REDIS_CLOSE_ASAP);

    server.master->authenticated = 1;
    server.master->lastinteraction = server.unixtime;

    // Back to the connected state
    server.repl_state = REDIS_REPL_CONNECTED;

    /* Re-add to the list of clients. */
    listAddNodeTail(server.clients,server.master);
    // Watch the master's fd for readable events and set the handler
    if (aeCreateFileEvent(server.el, newfd, AE_READABLE,
                          readQueryFromClient, server.master)) {
        redisLog(REDIS_WARNING,"Error resurrecting the cached master, impossible to add the readable handler: %s", strerror(errno));
        freeClientAsync(server.master); /* Close ASAP. */
    }

    /* We may also need to install the write handler as well if there is
     * pending data in the write buffers. */
    if (server.master->bufpos || listLength(server.master->reply)) {
        if (aeCreateFileEvent(server.el, newfd, AE_WRITABLE,
                          sendReplyToClient, server.master)) {
            redisLog(REDIS_WARNING,"Error resurrecting the cached master, impossible to add the writable handler: %s", strerror(errno));
            freeClientAsync(server.master); /* Close ASAP. */
        }
    }
}

The functions above attempt a partial resynchronization. After connecting to the master, a slave always tries a partial resynchronization first.
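The parsing of the "+FULLRESYNC <runid> <offset>" reply can be isolated into a standalone function. The sketch below follows the same strchr-based approach as the listing, but `parse_fullresync` and `RUN_ID_SIZE` are hypothetical names, and the 40-character run ID length is assumed to match REDIS_RUN_ID_SIZE.

```c
#include <stdlib.h>
#include <string.h>

#define RUN_ID_SIZE 40 /* assumed to match REDIS_RUN_ID_SIZE */

/* Parses "+FULLRESYNC <runid> <offset>".
 * On success stores the run id and offset and returns 0.
 * Returns -1 if the reply is not a well-formed +FULLRESYNC. */
int parse_fullresync(const char *reply, char runid_out[RUN_ID_SIZE + 1],
                     long long *offset_out) {
    const char *runid, *offset;

    if (strncmp(reply, "+FULLRESYNC", 11) != 0) return -1;

    /* The run id starts after the first space. */
    runid = strchr(reply, ' ');
    if (!runid) return -1;
    runid++;

    /* The offset starts after the second space. */
    offset = strchr(runid, ' ');
    if (!offset) return -1;

    /* The run id must be exactly RUN_ID_SIZE characters long. */
    if (offset - runid != RUN_ID_SIZE) return -1;
    offset++;

    memcpy(runid_out, runid, RUN_ID_SIZE);
    runid_out[RUN_ID_SIZE] = '\0';
    *offset_out = strtoll(offset, NULL, 10);
    return 0;
}
```

Rejecting a run ID of the wrong length matters because the slave later sends this ID back in PSYNC, and a truncated ID would never match the master's.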

 

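Once these steps are done, the slave waits for the RDB file from the master. When data arrives, the readable-event callback registered earlier, readSyncBulkPayload(), is called. The listing below is abridged, and its field names (repl_transfer_left, server.replstate) differ from those in the syncWithMaster listing above, which suggests it was taken from a different Redis version.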

/* Abridged: declarations and error handling are elided ("…"). */
void readSyncBulkPayload(aeEventLoop *el, int fd, void *privdata, int mask) {
    /* The first message from the master, which carries the size of the
     * RDB file, has not been received yet. */
    if (server.repl_transfer_left == -1) {
        if (syncReadLine(fd,buf,1024,server.repl_syncio_timeout) == -1) {
            /* … I/O error … */
        }
        if (buf[0] == '-') {
            /* The master replied with an error. */
            /* … */
        } else if (buf[0] == '\0') {
            /* An empty line: a keepalive showing the connection is alive. */
            server.repl_transfer_lastio = time(NULL);
            return;
        } else if (buf[0] != '$') {
            /* Any other reply; see sendBulkToSlave on the master side. */
            /* … */
        }
        /* Number of bytes still to be received. */
        server.repl_transfer_left = strtol(buf+1,NULL,10);
        return;
    }

    /* Read bulk data: the actual payload. */
    readlen = (server.repl_transfer_left < (signed)sizeof(buf)) ?
        server.repl_transfer_left : (signed)sizeof(buf);
    nread = read(fd,buf,readlen);
    server.repl_transfer_lastio = time(NULL);
    /* Append the data to the temp file created earlier. */
    if (write(server.repl_transfer_fd,buf,nread) != nread) {
        /* … write error … */
    }
    server.repl_transfer_left -= nread;

    /* Check if the transfer is now complete */
    if (server.repl_transfer_left == 0) {
        if (rename(server.repl_transfer_tmpfile,server.dbfilename) == -1) {
            /* … */
        }
        redisLog(REDIS_NOTICE, "MASTER <-> SLAVE sync: Loading DB in memory");
        emptyDb();
        /* Remove this file event. */
        aeDeleteFileEvent(server.el,server.repl_transfer_s,AE_READABLE);
        /* Load the RDB file into memory. */
        if (rdbLoad(server.dbfilename) != REDIS_OK) {
            /* … */
        }
        zfree(server.repl_transfer_tmpfile);
        close(server.repl_transfer_fd);
        /* Create a client for this fd. createClient registers its file event:
         * aeCreateFileEvent(server.el,fd,AE_READABLE,readQueryFromClient,c) */
        server.master = createClient(server.repl_transfer_s);
        server.master->flags |= REDIS_MASTER;
        server.master->authenticated = 1;
        server.replstate = REDIS_REPL_CONNECTED;
        redisLog(REDIS_NOTICE, "MASTER <-> SLAVE sync: Finished with success");
        /* Rewrite the AOF file now that the dataset changed. */
        if (server.appendonly) rewriteAppendOnlyFileBackground();
    }
}

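The function has three phases: reading the initial length message, reading the payload, and, once the transfer completes, loading the RDB into memory, creating a new readable file event (readQueryFromClient), and moving the slave to REDIS_REPL_CONNECTED. From this point on, master and slave switch to incremental command propagation: the slave processes the write commands coming from the master as it would commands from an ordinary client, and it can serve clients itself.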
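The first phase, interpreting the line that precedes the payload, can be sketched on its own. The branch order follows the abridged listing above. `BulkLineKind`, `classify_bulk_line`, and `next_read_len` are hypothetical names, and the input is assumed to be a line with its trailing newline already stripped.

```c
#include <stdlib.h>

/* Hypothetical classification of the first line sent by the master. */
typedef enum {
    BULK_ERROR,      /* "-..." : the master reported an error */
    BULK_KEEPALIVE,  /* empty line: the master is still preparing the RDB */
    BULK_LENGTH,     /* "$<len>": the payload length follows the '$' */
    BULK_UNEXPECTED  /* anything else is a protocol error */
} BulkLineKind;

/* Classifies one line; stores the payload length in *len_out for BULK_LENGTH. */
BulkLineKind classify_bulk_line(const char *line, long *len_out) {
    if (line[0] == '-') return BULK_ERROR;
    if (line[0] == '\0') return BULK_KEEPALIVE;
    if (line[0] != '$') return BULK_UNEXPECTED;
    *len_out = strtol(line + 1, NULL, 10);
    return BULK_LENGTH;
}

/* Number of bytes to request from the next read(): never more than what
 * is left of the payload, so the command stream that follows the RDB
 * file is not consumed by mistake. */
long next_read_len(long remaining, long bufsize) {
    return remaining < bufsize ? remaining : bufsize;
}
```

Capping each read at the remaining payload size is what lets the same socket carry first the RDB file and then ordinary commands without a separator.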

Master-side replication

Seen from the master, replication starts with the following steps:

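·A client sends SLAVEOF hostname port to the slave, which gives the slave the master's IP address and port. The master itself does not send anything in this step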

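·The master accepts the connection from the slave and creates a redisClient structure for it, which is placed in the list that holds ordinary clients

·On receiving PING, it replies with PONG to tell the slave that the connection is healthy

·Authentication is performed

·It receives the SYNC (or PSYNC) command from the slave, and synchronization starts at this step

When SYNC or PSYNC is received, the syncCommand function is called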


/* SYNC and PSYNC command implementation. */
void syncCommand(redisClient *c) {

    /* ignore SYNC if already slave or in monitor mode */
    if (c->flags & REDIS_SLAVE) return;

    /* Refuse SYNC requests if we are a slave but the link with our master
     * is not ok... */
    if (server.masterhost && server.repl_state != REDIS_REPL_CONNECTED) {
        addReplyError(c,"Can't SYNC while not connected with my master");
        return;
    }

    /* SYNC can't be issued when the server has pending data to send to
     * the client about already issued commands. We need a fresh reply
     * buffer registering the differences between the BGSAVE and the current
     * dataset, so that we can copy to other slaves if needed. */
    if (listLength(c->reply) != 0 || c->bufpos != 0) {
        addReplyError(c,"SYNC and PSYNC are invalid with pending output");
        return;
    }

    redisLog(REDIS_NOTICE,"Slave asks for synchronization");

    /* Try a partial resynchronization if this is a PSYNC command.
     * If it fails, we continue with usual full resynchronization, however
     * when this happens masterTryPartialResynchronization() already
     * replied with:
     *
     * +FULLRESYNC <runid> <offset>
     *
     * So the slave knows the new runid and offset to try a PSYNC later
     * if the connection with the master is lost. */
    if (!strcasecmp(c->argv[0]->ptr,"psync")) {
        if (masterTryPartialResynchronization(c) == REDIS_OK) {
            // A partial resync is possible
            server.stat_sync_partial_ok++;
            return; /* No full resync needed, return. */
        } else {
            // A partial resync is not possible
            char *master_runid = c->argv[1]->ptr;

            /* Increment stats for failed PSYNCs, but only if the
             * runid is not "?", as this is used by slaves to force a full
             * resync on purpose when they are not able to partially
             * resync. */
            if (master_runid[0] != '?') server.stat_sync_partial_err++;
        }
    } else {
        /* If a slave uses SYNC, we are dealing with an old implementation
         * of the replication protocol (like redis-cli --slave). Flag the client
         * so that we don't expect to receive REPLCONF ACK feedbacks. */
        c->flags |= REDIS_PRE_PSYNC;
    }

    // From here on, this is the full resynchronization case

    /* Full resynchronization. */
    server.stat_sync_full++;

    /* Here we need to check if there is a background saving operation
     * in progress, or if it is required to start one */
    if (server.rdb_child_pid != -1) {
        /* Ok a background save is in progress. Let's check if it is a good
         * one for replication, i.e. if there is another slave that is
         * registering differences since the server forked to save */
        redisClient *slave;
        listNode *ln;
        listIter li;

        // If at least one slave is already waiting for this BGSAVE to
        // finish, the RDB it produces can be reused for the new slave
        listRewind(server.slaves,&li);
        while((ln = listNext(&li))) {
            slave = ln->value;
            if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_END) break;
        }

        if (ln) {
            /* Perfect, the server is already registering differences for
             * another slave. Set the right state, and copy the buffer. */
            // Copy that slave's pending output buffer to the new client
            copyClientOutputBuffer(c,slave);
            c->replstate = REDIS_REPL_WAIT_BGSAVE_END;
            redisLog(REDIS_NOTICE,"Waiting for end of BGSAVE for SYNC");
        } else {
            /* No such slave: the running BGSAVE was started by the master
             * itself rather than for another slave, so this client has to
             * wait for it to finish and for a new BGSAVE to be started. */
103:             /* No way, we need to wait for the next BGSAVE in order to
104:              * register differences */
105:             // 不好运的情况,必须等待下个 BGSAVE
106:             c->replstate = REDIS_REPL_WAIT_BGSAVE_START;
107:             redisLog(REDIS_NOTICE,"Waiting for next BGSAVE for SYNC");
108:         }
109:     } else {       
110:         /* Ok we don't have a BGSAVE in progress, let's start one */
111:         // 没有 BGSAVE 在进行,开始一个新的 BGSAVE
112:         redisLog(REDIS_NOTICE,"Starting BGSAVE for SYNC");
113:         if (rdbSaveBackground(server.rdb_filename) != REDIS_OK) {
114:             redisLog(REDIS_NOTICE,"Replication failed, can't BGSAVE");
115:             addReplyError(c,"Unable to perform background save");
116:             return;
117:         }
118:         // 设置状态
119:         c->replstate = REDIS_REPL_WAIT_BGSAVE_END;
120:         /* Flush the script cache for the new slave. */
121:         // 因为新 slave 进入,刷新复制脚本缓存
122:         replicationScriptCacheFlush();
123:     }
124: 
125:     if (server.repl_disable_tcp_nodelay)
126:         anetDisableTcpNoDelay(NULL, c->fd); /* Non critical if it fails. */
127: 
128:     c->repldbfd = -1;
129: 
130:     c->flags |= REDIS_SLAVE;
131: 
132:     server.slaveseldb = -1; /* Force to re-emit the SELECT command. */
133: 
134:     // 添加到 slave 列表中
135:     listAddNodeTail(server.slaves,c);
136:     // 如果是第一个 slave ,那么初始化 backlog
137:     if (listLength(server.slaves) == 1 && server.repl_backlog == NULL)
138:         createReplicationBacklog();
139:     return;
140: }

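This function does the following:
·If this server is itself a slave and is not connected to its master, it refuses to execute the SYNC command.

·If the command is PSYNC, it attempts a partial resynchronization by calling masterTryPartialResynchronization().

·If the partial resynchronization fails, or the command is a plain SYNC, it performs a full resynchronization:

(1) First check whether a child process is already running BGSAVE.

(2) If a BGSAVE child exists, scan the slave list for a slave whose replstate is REDIS_REPL_WAIT_BGSAVE_END. If one is found, copy that slave's output buffer to the new client and set the new client's replstate to REDIS_REPL_WAIT_BGSAVE_END, so both share the RDB file being produced. If none is found, set c->replstate to REDIS_REPL_WAIT_BGSAVE_START; the client must wait for the current BGSAVE to finish and for a new BGSAVE to start.

(3) If no child process is running BGSAVE, call rdbSaveBackground() to start one.

(4) When that call succeeds, set c->replstate to REDIS_REPL_WAIT_BGSAVE_END.

(5) Flag this redisClient as a slave and add it to the slave list; if it is the first slave and no backlog exists yet, create the backlog.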
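The full-resynchronization branch listed above boils down to a small decision table. The following is a minimal sketch of that table only, not the Redis code itself; the type names `repl_state` and `sync_action` and the function `plan_full_resync` are hypothetical and exist purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two replication states discussed above. */
typedef enum {
    WAIT_BGSAVE_START,  /* must wait for the next BGSAVE to start */
    WAIT_BGSAVE_END     /* can use the RDB file of the current BGSAVE */
} repl_state;

/* Hypothetical description of what the master has to do for the new slave. */
typedef enum {
    ACTION_NONE,          /* nothing to do now, just wait */
    ACTION_COPY_BUFFER,   /* copy the output buffer of a waiting slave */
    ACTION_START_BGSAVE   /* start a new background save */
} sync_action;

/* Decide the state of a new slave that needs a full resync.
 * bgsave_running:   non-zero when a BGSAVE child already exists.
 * peer_waiting_end: non-zero when another slave is in WAIT_BGSAVE_END. */
repl_state plan_full_resync(int bgsave_running, int peer_waiting_end,
                            sync_action *action)
{
    assert(action != NULL);
    if (!bgsave_running) {
        *action = ACTION_START_BGSAVE;   /* case (3) and (4) */
        return WAIT_BGSAVE_END;
    }
    if (peer_waiting_end) {
        *action = ACTION_COPY_BUFFER;    /* case (2), shared RDB */
        return WAIT_BGSAVE_END;
    }
    *action = ACTION_NONE;               /* case (2), must wait */
    return WAIT_BGSAVE_START;
}
```

Only the second case lets several slaves share one RDB file, which is why a single BGSAVE can serve many simultaneous SYNC requests.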

The masterTryPartialResynchronization() function attempts the partial resynchronization. Its main work is as follows:

/* This function handles the PSYNC command from the point of view of a
 * master receiving a request for partial resynchronization.
 *
 * On success return REDIS_OK, otherwise REDIS_ERR is returned and we proceed
 * with the usual full resync. */
int masterTryPartialResynchronization(redisClient *c) {
    long long psync_offset, psync_len;
    char *master_runid = c->argv[1]->ptr;
    char buf[128];
    int buflen;

    /* A partial resync is only possible when the run id sent by the slave
     * matches the run id of this server. */
    if (strcasecmp(master_runid, server.runid)) {
        /* Run id "?" is used by slaves that want to force a full resync. */
        if (master_runid[0] != '?') {
            redisLog(REDIS_NOTICE,"Partial resynchronization not accepted: "
                "Runid mismatch (Client asked for runid '%s', my runid is '%s')",
                master_runid, server.runid);
        } else {
            redisLog(REDIS_NOTICE,"Full resync requested by slave.");
        }
        goto need_full_resync;
    }

    /* We still have the data our slave is asking for? */
    if (getLongLongFromObjectOrReply(c,c->argv[2],&psync_offset,NULL) !=
       REDIS_OK) goto need_full_resync;

    /* A full resync is needed when there is no backlog, when the requested
     * offset is below repl_backlog_off (the data was already overwritten),
     * or when it lies beyond the data held in the backlog. */
    if (!server.repl_backlog ||
        psync_offset < server.repl_backlog_off ||
        psync_offset > (server.repl_backlog_off + server.repl_backlog_histlen))
    {
        redisLog(REDIS_NOTICE,
            "Unable to partial resync with the slave for lack of backlog (Slave request was: %lld).", psync_offset);
        if (psync_offset > server.master_repl_offset) {
            redisLog(REDIS_WARNING,
                "Warning: slave tried to PSYNC with an offset that is greater than the master replication offset.");
        }
        goto need_full_resync;
    }

    /* If we reached this point, we are able to perform a partial resync:
     * 1) Set client state to make it a slave.
     * 2) Inform the client we can continue with +CONTINUE
     * 3) Send the backlog data (from the offset to the end) to the slave. */
    c->flags |= REDIS_SLAVE;
    c->replstate = REDIS_REPL_ONLINE;
    c->repl_ack_time = server.unixtime;
    listAddNodeTail(server.slaves,c);
    /* We can't use the connection buffers since they are used to accumulate
     * new commands at this stage. But we are sure the socket send buffer is
     * empty so this write will never fail actually. */
    buflen = snprintf(buf,sizeof(buf),"+CONTINUE\r\n");
    if (write(c->fd,buf,buflen) != buflen) {
        freeClientAsync(c);
        return REDIS_OK;
    }
    /* Queue the part of the backlog that the slave is missing. */
    psync_len = addReplyReplicationBacklog(c,psync_offset);
    redisLog(REDIS_NOTICE,
        "Partial resynchronization request accepted. Sending %lld bytes of backlog starting from offset %lld.", psync_len, psync_offset);
    /* Note that we don't need to set the selected DB at server.slaveseldb
     * to -1 to force the master to emit SELECT, since the slave already
     * has this state from the previous connection with the master. */

    /* Refresh the count of slaves with acceptable lag. */
    refreshGoodSlavesCount();
    return REDIS_OK; /* The caller can return, no full resync needed. */

need_full_resync:
    /* We need a full resync for some reason... notify the client.
     * The offset reported is the current master replication offset. */
    psync_offset = server.master_repl_offset;
    /* Add 1 to psync_offset if the replication backlog does not exist,
     * as when it will be created later we'll increment the offset by one. */
    if (server.repl_backlog == NULL) psync_offset++;
    /* Again, we can't use the connection buffers (see above). */
    buflen = snprintf(buf,sizeof(buf),"+FULLRESYNC %s %lld\r\n",
                      server.runid,psync_offset);
    if (write(c->fd,buf,buflen) != buflen) {
        freeClientAsync(c);
        return REDIS_OK;
    }
    return REDIS_ERR;
}

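After connecting to the master, a slave sends the PSYNC command on its own initiative, supplying a master_runid and an offset. The master calls this function to check whether that master_runid and offset are still valid.
If the check passes, a partial resynchronization follows: the master replies +CONTINUE (on receiving it, the slave registers a read handler for the backlog data, which corresponds to slaveTryPartialResynchronization described earlier) and then sends the data held in the backlog. If the check fails, the master replies +FULLRESYNC <runid> <offset> and the caller falls back to a full resynchronization.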

 

Next, on each run of serverCron, the master checks whether the BGSAVE child process has finished:

    /* Check if a background saving or AOF rewrite in progress terminated. */
    if (server.rdb_child_pid != -1 || server.aof_child_pid != -1) {
        int statloc;
        pid_t pid;

        /* Collect the status of a terminated child without blocking
         * (WNOHANG makes wait3 return 0 when no child has exited yet). */
        if ((pid = wait3(&statloc,WNOHANG,NULL)) != 0) {
            int exitcode = WEXITSTATUS(statloc);
            int bysignal = 0;

            if (WIFSIGNALED(statloc)) bysignal = WTERMSIG(statloc);

            /* The BGSAVE child has terminated. */
            if (pid == server.rdb_child_pid) {
                backgroundSaveDoneHandler(exitcode,bysignal);
            }
            ......
        }
    }
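The non-blocking reap used in serverCron can be tried in isolation. This is a minimal sketch that swaps wait3 for the equivalent POSIX call waitpid with WNOHANG; the helper names `poll_child` and `run_and_reap` are hypothetical, and the polling loop with a short sleep only stands in for the periodic serverCron invocation.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Poll once for a terminated child without blocking.
 * Returns 1 and fills exitcode/bysignal when the child has terminated,
 * 0 when it is still running, -1 on error. */
int poll_child(pid_t child, int *exitcode, int *bysignal)
{
    int statloc;
    pid_t pid;

    assert(exitcode != NULL && bysignal != NULL);
    pid = waitpid(child, &statloc, WNOHANG);
    if (pid == 0) return 0;     /* child still running */
    if (pid == -1) return -1;   /* no such child, or another error */
    *exitcode = WIFEXITED(statloc) ? WEXITSTATUS(statloc) : 0;
    *bysignal = WIFSIGNALED(statloc) ? WTERMSIG(statloc) : 0;
    return 1;
}

/* Fork a child that exits with the given code and poll until it is reaped.
 * Returns the exit code of the child, or -1 on failure. */
int run_and_reap(int code)
{
    struct timespec pause = { 0, 1000000 };  /* 1 ms between polls */
    int exitcode = 0, bysignal = 0, r;
    pid_t child = fork();

    if (child == -1) return -1;
    if (child == 0) _exit(code);             /* child: terminate at once */
    while ((r = poll_child(child, &exitcode, &bysignal)) == 0)
        nanosleep(&pause, NULL);             /* parent stays free to do other work */
    if (r == -1 || bysignal) return -1;
    return exitcode;
}
```

Because the parent never blocks in the wait call, it can keep serving clients between polls, which is the property the master relies on while the BGSAVE child runs.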

backgroundSaveDoneHandler is the handler run once wait3 has reaped the child. Its last step is the call updateSlavesWaitingBgsave(exitcode == 0 ? REDIS_OK : REDIS_ERR):

/* This function is called at the end of every background saving.
 *
 * The argument bgsaveerr is REDIS_OK if the background saving succeeded
 * otherwise REDIS_ERR is passed to the function.
 *
 * The goal of this function is to handle slaves waiting for a successful
 * background saving in order to perform non-blocking synchronization.
 * It decides the next RDB-related step for every waiting slave. */
void updateSlavesWaitingBgsave(int bgsaveerr) {
    listNode *ln;
    int startbgsave = 0;
    listIter li;

    /* Walk through all the slaves. */
    listRewind(server.slaves,&li);
    while((ln = listNext(&li))) {
        redisClient *slave = ln->value;

        if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_START) {
            /* The RDB file just produced cannot be used by this slave:
             * schedule a new BGSAVE for it. */
            startbgsave = 1;
            slave->replstate = REDIS_REPL_WAIT_BGSAVE_END;
        } else if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_END) {
            /* This slave was waiting for the BGSAVE that just ended. */
            struct redis_stat buf;

            /* The BGSAVE failed: drop the slave. */
            if (bgsaveerr != REDIS_OK) {
                freeClient(slave);
                redisLog(REDIS_WARNING,"SYNC failed. BGSAVE child returned an error");
                continue;
            }

            /* Open the RDB file and read its size. */
            if ((slave->repldbfd = open(server.rdb_filename,O_RDONLY)) == -1 ||
                redis_fstat(slave->repldbfd,&buf) == -1) {
                freeClient(slave);
                redisLog(REDIS_WARNING,"SYNC failed. Can't open/stat DB after BGSAVE: %s", strerror(errno));
                continue;
            }

            /* Start sending from offset 0 and record the total size. */
            slave->repldboff = 0;
            slave->repldbsize = buf.st_size;
            slave->replstate = REDIS_REPL_SEND_BULK;

            slave->replpreamble = sdscatprintf(sdsempty(),"$%lld\r\n",
                (unsigned long long) slave->repldbsize);

            /* Remove the previous write handler and install sendBulkToSlave,
             * which sends the RDB file to the slave. */
            aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);
            if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE, sendBulkToSlave, slave) == AE_ERR) {
                freeClient(slave);
                continue;
            }
        }
    }

    /* A new BGSAVE is needed. */
    if (startbgsave) {
        /* Since we are starting a new background save for one or more slaves,
         * we flush the Replication Script Cache to use EVAL to propagate every
         * new EVALSHA for the first time, since all the new slaves don't know
         * about previous scripts. */
        replicationScriptCacheFlush();
        if (rdbSaveBackground(server.rdb_filename) != REDIS_OK) {
            listIter li;

            listRewind(server.slaves,&li);
            redisLog(REDIS_WARNING,"SYNC failed. BGSAVE failed");
            while((ln = listNext(&li))) {
                redisClient *slave = ln->value;

                if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_START)
                    freeClient(slave);
            }
        }
    }
}

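·The function iterates over the slave nodes in the slave list:

(1) If a slave's replstate is REDIS_REPL_WAIT_BGSAVE_START, the startbgsave flag is set to 1 and the slave moves to REDIS_REPL_WAIT_BGSAVE_END. After every slave has been processed, a new BGSAVE is started to produce a fresh RDB file.

(2) If a slave's replstate is REDIS_REPL_WAIT_BGSAVE_END and the BGSAVE succeeded, the RDB file is opened and the slave's replstate changes to REDIS_REPL_SEND_BULK. If the BGSAVE failed or the file cannot be opened, the slave is freed.

(3) The writable-event callback previously registered on the slave's fd is removed and sendBulkToSlave is installed in its place.

(4) If startbgsave was set to 1, rdbSaveBackground() is called once more.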
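The state changes applied when a BGSAVE ends can be modelled without any I/O. The sketch below is an illustration under simplifying assumptions: the `slave_state` enum and the `on_bgsave_done` function are hypothetical, freeing a slave is represented by a `ST_FREED` marker, and file or event-loop failures are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified set of slave states. */
typedef enum {
    ST_WAIT_BGSAVE_START,
    ST_WAIT_BGSAVE_END,
    ST_SEND_BULK,
    ST_ONLINE,
    ST_FREED            /* stands in for freeClient(slave) */
} slave_state;

/* Apply the end-of-BGSAVE transition to n slaves.
 * bgsave_ok is non-zero when the child process succeeded.
 * Returns 1 when a new BGSAVE must be started, 0 otherwise. */
int on_bgsave_done(slave_state *slaves, size_t n, int bgsave_ok)
{
    int startbgsave = 0;
    size_t i;

    assert(slaves != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (slaves[i] == ST_WAIT_BGSAVE_START) {
            /* Arrived too late for the finished dump: needs the next one. */
            startbgsave = 1;
            slaves[i] = ST_WAIT_BGSAVE_END;
        } else if (slaves[i] == ST_WAIT_BGSAVE_END) {
            /* Was waiting for the finished dump: send it, or give up. */
            slaves[i] = bgsave_ok ? ST_SEND_BULK : ST_FREED;
        }
        /* Slaves in any other state are left untouched. */
    }
    return startbgsave;
}
```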

Next, look at the callback function sendBulkToSlave:

/* Write-event handler used by the master to send the RDB file to a slave. */
void sendBulkToSlave(aeEventLoop *el, int fd, void *privdata, int mask) {
    redisClient *slave = privdata;
    REDIS_NOTUSED(el);
    REDIS_NOTUSED(mask);
    char buf[REDIS_IOBUF_LEN];
    ssize_t nwritten, buflen;

    /* Before sending the RDB file, we send the preamble as configured by the
     * replication process. Currently the preamble is just the bulk count of
     * the file in the form "$<length>\r\n". */
    if (slave->replpreamble) {
        nwritten = write(fd,slave->replpreamble,sdslen(slave->replpreamble));
        if (nwritten == -1) {
            redisLog(REDIS_VERBOSE,"Write error sending RDB preamble to slave: %s",
                strerror(errno));
            freeClient(slave);
            return;
        }
        sdsrange(slave->replpreamble,nwritten,-1);
        if (sdslen(slave->replpreamble) == 0) {
            sdsfree(slave->replpreamble);
            slave->replpreamble = NULL;
            /* fall through sending data. */
        } else {
            return;
        }
    }

    /* If the preamble was already transferred, send the RDB bulk data. */
    lseek(slave->repldbfd,slave->repldboff,SEEK_SET);
    /* Read the next chunk of the RDB file. */
    buflen = read(slave->repldbfd,buf,REDIS_IOBUF_LEN);
    if (buflen <= 0) {
        redisLog(REDIS_WARNING,"Read error sending DB to slave: %s",
            (buflen == 0) ? "premature EOF" : strerror(errno));
        freeClient(slave);
        return;
    }
    /* Write the chunk to the slave. */
    if ((nwritten = write(fd,buf,buflen)) == -1) {
        if (errno != EAGAIN) {
            redisLog(REDIS_WARNING,"Write error sending DB to slave: %s",
                strerror(errno));
            freeClient(slave);
        }
        return;
    }

    /* Advance repldboff by the bytes actually written, so that the next
     * call continues from there. */
    slave->repldboff += nwritten;

    /* The whole file has been sent. */
    if (slave->repldboff == slave->repldbsize) {
        /* Close the RDB file descriptor. */
        close(slave->repldbfd);
        slave->repldbfd = -1;
        /* Remove this write handler. */
        aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);
        /* The slave is now online; record the time of this interaction. */
        slave->replstate = REDIS_REPL_ONLINE;
        slave->repl_ack_time = server.unixtime;
        /* Install the normal reply handler, which sends the slave all the
         * output accumulated while the RDB file was saved and sent. */
        if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE,
            sendReplyToClient, slave) == AE_ERR) {
            redisLog(REDIS_WARNING,"Unable to register writable event for slave bulk transfer: %s", strerror(errno));
            freeClient(slave);
            return;
        }
        /* Refresh the count of slaves with acceptable lag. */
        refreshGoodSlavesCount();
        redisLog(REDIS_NOTICE,"Synchronization with slave succeeded");
    }
}

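This function sends the RDB file to the slave node one chunk per writable event. When the whole file has been sent, it registers a new file event (AE_WRITABLE, sendReplyToClient) so that the updates made after the RDB snapshot was taken can be delivered. As we can see, that callback is the ordinary one used to reply to client requests. At the same time the slave client enters the REDIS_REPL_ONLINE state.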
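The two ideas in this handler, a "$<length>\r\n" preamble followed by a transfer that resumes from a saved offset, can be shown with plain memory buffers. This is a sketch only: `format_preamble` and `send_next_chunk` are hypothetical helpers, and a memory copy replaces the read from the RDB file and the write to the socket.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format the bulk header "$<length>\r\n" into out.
 * Returns the number of characters written, or -1 if out is too small. */
int format_preamble(char *out, size_t cap, unsigned long long size)
{
    int n;

    assert(out != NULL);
    n = snprintf(out, cap, "$%llu\r\n", size);
    return (n < 0 || (size_t)n >= cap) ? -1 : n;
}

/* Transfer at most chunk bytes of payload, starting at *off, into sink.
 * *off plays the role of repldboff: it is advanced by the bytes transferred
 * so that the next call resumes where this one stopped.
 * Returns 1 when the whole payload has been transferred, 0 otherwise. */
int send_next_chunk(const char *payload, size_t size, size_t *off,
                    size_t chunk, char *sink)
{
    size_t n;

    assert(payload != NULL && off != NULL && sink != NULL && *off <= size);
    n = size - *off;
    if (n > chunk) n = chunk;
    memcpy(sink + *off, payload + *off, n);
    *off += n;
    return *off == size;   /* same test as repldboff == repldbsize */
}
```

Each call corresponds to one writable event: a large file never monopolizes the event loop, because the handler returns after every chunk.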

When are the subsequent updates sent to the slave?

/* Call() is the core of Redis execution of a command */
void call(redisClient *c, int flags) {
    ......
    /* Call the command. */
    c->flags &= ~(REDIS_FORCE_AOF|REDIS_FORCE_REPL);
    redisOpArrayInit(&server.also_propagate);

    /* Remember the dirty counter, to detect whether the command
     * modifies the dataset. */
    dirty = server.dirty;

    /* Run the function that implements the command. */
    c->cmd->proc(c);

    dirty = server.dirty-dirty;
    duration = ustime()-start;

    ......

    /* Propagate the command into the AOF and replication link */
    if (flags & REDIS_CALL_PROPAGATE) {
        int flags = REDIS_PROPAGATE_NONE;

        /* Replication was forced by the command. */
        if (c->flags & REDIS_FORCE_REPL) flags |= REDIS_PROPAGATE_REPL;

        /* AOF propagation was forced by the command. */
        if (c->flags & REDIS_FORCE_AOF) flags |= REDIS_PROPAGATE_AOF;

        /* The dataset was modified. */
        if (dirty)
            flags |= (REDIS_PROPAGATE_REPL | REDIS_PROPAGATE_AOF);

        if (flags != REDIS_PROPAGATE_NONE)
            propagate(c->cmd,c->db->id,c->argv,c->argc,flags);
    }
    ......
}

/* Propagate the specified command (in the context of the specified database id)
 * to AOF and Slaves.
 *
 * flags are a bitwise OR of:
 * + REDIS_PROPAGATE_NONE (no propagation of command at all)
 * + REDIS_PROPAGATE_AOF (propagate into the AOF file if is enabled)
 * + REDIS_PROPAGATE_REPL (propagate into the replication link)
 */
void propagate(struct redisCommand *cmd, int dbid, robj **argv, int argc,
               int flags)
{
    /* AOF must be enabled and the AOF flag set: append to the local file. */
    if (server.aof_state != REDIS_AOF_OFF && flags & REDIS_PROPAGATE_AOF)
        feedAppendOnlyFile(cmd,dbid,argv,argc);

    /* The replication flag is set: publish the update to the slaves. */
    if (flags & REDIS_PROPAGATE_REPL)
        replicationFeedSlaves(server.slaves,dbid,argv,argc);
}

/* Feed a command to the replication backlog and to the slaves. */
void replicationFeedSlaves(list *slaves, int dictid, robj **argv, int argc) {
    listNode *ln;
    listIter li;
    int j, len;
    char llstr[REDIS_LONGSTR_SIZE];

    /* If there aren't slaves, and there is no backlog buffer to populate,
     * we can return ASAP. */
    if (server.repl_backlog == NULL && listLength(slaves) == 0) return;

    /* We can't have slaves attached and no backlog. */
    redisAssert(!(listLength(slaves) != 0 && server.repl_backlog == NULL));

    /* Send SELECT command to every slave if needed. */
    if (server.slaveseldb != dictid) {
        robj *selectcmd;

        /* For a few DBs we have pre-computed SELECT command (database ids
         * below REDIS_SHARED_SELECT_CMDS use a shared object). */
        if (dictid >= 0 && dictid < REDIS_SHARED_SELECT_CMDS) {
            selectcmd = shared.select[dictid];
        } else {
            /* No shared object available: build the SELECT command. */
            int dictid_len;

            dictid_len = ll2string(llstr,sizeof(llstr),dictid);
            selectcmd = createObject(REDIS_STRING,
                sdscatprintf(sdsempty(),
                "*2\r\n$6\r\nSELECT\r\n$%d\r\n%s\r\n",
                dictid_len, llstr));
        }

        /* Why is the data both appended to the backlog and sent to every
         * slave, rather than only sent to the slaves? A slave may lose its
         * connection to the master (it keeps the master state it recorded
         * before disconnecting) and so miss some updates. The backlog lets
         * such a slave fetch what it missed once it reconnects. */

        /* Add the SELECT command into the backlog. */
        if (server.repl_backlog) feedReplicationBacklogWithObject(selectcmd);

        /* Send it to slaves. */
        listRewind(slaves,&li);
        while((ln = listNext(&li))) {
            redisClient *slave = ln->value;
            addReply(slave,selectcmd);
        }

        /* Release the object if it is not a shared one. */
        if (dictid < 0 || dictid >= REDIS_SHARED_SELECT_CMDS)
            decrRefCount(selectcmd);
    }

    /* Remember the database selected last in the replication stream. */
    server.slaveseldb = dictid;

    /* Write the command to the replication backlog if any. */
    if (server.repl_backlog) {
        char aux[REDIS_LONGSTR_SIZE+3];

        /* Add the multi bulk reply length (the number of arguments). */
        aux[0] = '*';
        len = ll2string(aux+1,sizeof(aux)-1,argc);
        aux[len+1] = '\r';
        aux[len+2] = '\n';
        feedReplicationBacklog(aux,len+3);

        /* Write the arguments one by one. */
        for (j = 0; j < argc; j++) {
            long objlen = stringObjectLen(argv[j]);

            /* We need to feed the buffer with the object as a bulk reply
             * not just as a plain string, so create the $..CRLF payload len
             * and add the final CRLF */
            aux[0] = '$';
            len = ll2string(aux+1,sizeof(aux)-1,objlen);
            aux[len+1] = '\r';
            aux[len+2] = '\n';

            /* For example, SET NAME Jhon is recorded as the lines
             *   *3
             *   $3
             *   SET
             *   $4
             *   NAME
             *   $4
             *   Jhon
             * each terminated by CRLF. */

            /* Length of the argument. */
            feedReplicationBacklog(aux,len+3);
            /* The argument itself. */
            feedReplicationBacklogWithObject(argv[j]);
            /* Trailing CRLF. */
            feedReplicationBacklog(aux+len+1,2);
        }
    }

    /* Write the command to every slave. */
    listRewind(slaves,&li);
    while((ln = listNext(&li))) {
        redisClient *slave = ln->value;

        /* Don't feed slaves that are still waiting for BGSAVE to start */
        if (slave->replstate == REDIS_REPL_WAIT_BGSAVE_START) continue;

        /* Feed slaves that are waiting for the initial SYNC (so these commands
         * are queued in the output buffer until the initial SYNC completes),
         * or are already in sync with the master. */

        /* Add the multi bulk length. */
        addReplyMultiBulkLen(slave,argc);

        /* Finally any additional argument that was not stored inside the
         * static buffer if any (from j to argc). */
        for (j = 0; j < argc; j++)
            addReplyBulk(slave,argv[j]);
    }
}
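The byte layout that replicationFeedSlaves writes for one command is the Redis multi bulk format: "*<argc>" followed by "$<length>" and the payload for each argument, every line ending in CRLF. The sketch below builds that layout into a caller-supplied buffer; `encode_command` is a hypothetical helper, and it assumes NUL-terminated arguments, whereas Redis arguments are binary safe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode argv[0..argc-1] as one multi bulk command into out.
 * Returns the number of bytes written (not counting the final NUL),
 * or -1 when out is too small. */
int encode_command(char *out, size_t cap, int argc, const char **argv)
{
    size_t used;
    int j, n;

    assert(out != NULL && argv != NULL && argc > 0);

    /* Header: number of arguments. */
    n = snprintf(out, cap, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap) return -1;
    used = (size_t)n;

    /* One "$<len>\r\n<payload>\r\n" group per argument. */
    for (j = 0; j < argc; j++) {
        n = snprintf(out + used, cap - used, "$%zu\r\n%s\r\n",
                     strlen(argv[j]), argv[j]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

For the arguments SET, NAME and Jhon this produces the 33 bytes "*3\r\n$3\r\nSET\r\n$4\r\nNAME\r\n$4\r\nJhon\r\n".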

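After every client request is executed, call() checks whether the dataset was modified. If it was, and the master has slaves in its slave list or a backlog to fill, replicationFeedSlaves is called to append the command to the reply buffer of each slave client, so that the update reaches every slave. Each master also maintains a char *repl_backlog buffer, which we call the backlog. Whenever a write command is received, it is not only sent to every slave but also recorded in repl_backlog. This buffer is what makes partial resynchronization possible when a slave reconnects after its connection to the master was lost.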
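How a fixed-size backlog supports partial resynchronization can be sketched with a small circular buffer. The code below is an illustration, not the Redis implementation: it assumes the backlog is a ring that overwrites its oldest bytes, as the repl_backlog_off and repl_backlog_histlen fields used earlier suggest, and the `backlog` struct and its functions are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define BACKLOG_SIZE 8   /* tiny on purpose, so that wrap-around is visible */

typedef struct {
    char buf[BACKLOG_SIZE];
    long long idx;            /* next write position inside buf */
    long long histlen;        /* valid bytes held, at most BACKLOG_SIZE */
    long long master_offset;  /* total number of bytes fed so far */
    long long first_offset;   /* replication offset of the oldest byte held */
} backlog;

void backlog_init(backlog *b)
{
    assert(b != NULL);
    memset(b, 0, sizeof(*b));
    b->first_offset = 1;      /* the first byte fed gets offset 1 */
}

/* Append len bytes, overwriting the oldest data once the ring is full. */
void backlog_feed(backlog *b, const char *p, size_t len)
{
    size_t i;

    assert(b != NULL && p != NULL);
    for (i = 0; i < len; i++) {
        b->buf[b->idx] = p[i];
        b->idx = (b->idx + 1) % BACKLOG_SIZE;
        if (b->histlen < BACKLOG_SIZE) b->histlen++;
    }
    b->master_offset += (long long)len;
    b->first_offset = b->master_offset - b->histlen + 1;
}

/* Copy every byte from replication offset `offset` to the end into out.
 * Returns the number of bytes copied, or -1 when the requested data is no
 * longer (or not yet) in the backlog and a full resync is needed. */
long long backlog_read_from(const backlog *b, long long offset, char *out)
{
    long long skip, len, start, i;

    assert(b != NULL && out != NULL);
    if (offset < b->first_offset ||
        offset > b->first_offset + b->histlen) return -1;
    skip = offset - b->first_offset;           /* bytes the slave already has */
    len = b->histlen - skip;                   /* bytes the slave is missing */
    start = (b->idx - b->histlen + skip + BACKLOG_SIZE) % BACKLOG_SIZE;
    for (i = 0; i < len; i++)
        out[i] = b->buf[(start + i) % BACKLOG_SIZE];
    return len;
}
```

A slave that was disconnected only briefly asks for an offset still inside the ring and receives just the missing bytes; one that was away for more than BACKLOG_SIZE bytes of traffic gets -1, which corresponds to the full resynchronization path.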

The overall Redis master-slave replication flow is shown in the following figure:
[Figure: redis-replication-interaction]
