Studying the Redis Source Alongside "Redis Design and Implementation" (Redis设计与实现), Part 20: Replication (replication.c)

Source: Internet. Published by: 域名主机记录. Editor: 程序博客网. Time: 2024/06/05 03:21

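In Redis, a user can make one server replicate another by running the SLAVEOF command or by setting the slaveof option. The server being replicated is called the master, and the server doing the replicating is called the slave.

How the Feature Is Implemented

The PSYNC command has two modes, full resynchronization and partial resynchronization:

1. Full resynchronization handles the first replication. The master generates an RDB file and sends it to the slave, and it buffers the write commands it executes from the moment it starts generating the file. Once the slave has loaded the RDB file, the master sends the buffered commands so that the slave's data matches the master's.
2. Partial resynchronization handles replication after a disconnection. When a slave reconnects to its master after losing the link, the master can, if conditions allow, send only the write commands it executed while the link was down. The slave receives and executes those commands and its database reaches the master's current state.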

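How Partial Resynchronization Is Implemented

1. Replication offset: both sides of a replication link maintain a replication offset. Each time the master propagates N bytes to its slaves, it adds N to its own offset. Each time a slave receives N bytes from the master, it adds N to its own offset.
Comparing the two offsets shows at once whether master and slave are consistent: equal offsets mean they are in the same state, and different offsets mean they are not.
2. Replication backlog: a fixed-length first-in, first-out queue maintained by the master, 1 MB by default. When the master propagates a write command, it sends the command to every slave and also appends it to the replication backlog. The backlog records the replication offset of every byte it holds.
When a slave reconnects, it reports its offset through PSYNC. If the data following that offset is still in the backlog, the master performs a partial resynchronization. Otherwise it performs a full resynchronization.
3. Server run ID: every Redis server has its own run ID, generated automatically at startup and made up of forty random hexadecimal characters (the same length as a SHA1 digest written in hex). During the first replication the master sends its run ID to the slave, and the slave stores it. When the slave reconnects to a master after a disconnection, it sends the stored run ID to that master. If it matches the master's own run ID, the master attempts a partial resynchronization. Otherwise it performs a full resynchronization.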
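The three pieces above (offset, backlog, run ID) can be put together in a small sketch. This is not the Redis implementation. The `backlog` struct and the functions `backlog_init`, `backlog_feed`, `backlog_can_continue`, and `can_partial_resync` are hypothetical names chosen for illustration, the buffer is only 8 bytes so that wrap-around is easy to see, and the `offset` argument is assumed to be the first byte the slave still needs, which is how the listing further down compares it against the backlog.

```c
#include <stddef.h>
#include <string.h>

#define BACKLOG_SIZE 8  /* tiny on purpose; the text cites a 1 MB default */

/* Hypothetical, simplified stand-in for the master's backlog state. */
typedef struct {
    char buf[BACKLOG_SIZE];
    size_t idx;            /* next write position inside buf */
    size_t histlen;        /* number of valid bytes currently stored */
    long long master_off;  /* replication offset of the last byte written */
    long long first_off;   /* replication offset of the oldest stored byte */
} backlog;

void backlog_init(backlog *b) {
    memset(b, 0, sizeof(*b));
    b->first_off = 1;  /* the first byte ever written will have offset 1 */
}

/* Append len bytes, wrapping around the end of the buffer when needed. */
void backlog_feed(backlog *b, const char *p, size_t len) {
    b->master_off += (long long)len;
    while (len > 0) {
        size_t chunk = BACKLOG_SIZE - b->idx;
        if (chunk > len) chunk = len;
        memcpy(b->buf + b->idx, p, chunk);
        b->idx = (b->idx + chunk) % BACKLOG_SIZE;
        p += chunk;
        len -= chunk;
        b->histlen += chunk;
    }
    if (b->histlen > BACKLOG_SIZE) b->histlen = BACKLOG_SIZE;
    b->first_off = b->master_off - (long long)b->histlen + 1;
}

/* 1 if every byte from `offset` onward is still held in the backlog. */
int backlog_can_continue(const backlog *b, long long offset) {
    return offset >= b->first_off &&
           offset <= b->first_off + (long long)b->histlen;
}

/* 1 for partial resync, 0 for full resync. */
int can_partial_resync(const char *asked_runid, const char *my_runid,
                       const backlog *b, long long offset) {
    if (strcmp(asked_runid, my_runid) != 0) return 0;  /* different master */
    return backlog_can_continue(b, offset);
}
```

Feeding "abcdef" and then "ghij" into this 8-byte buffer overwrites the two oldest bytes, so the oldest retained offset moves from 1 to 3. A slave asking to continue from offset 2 would then need a full resynchronization, while one asking from offset 3 could be served from the backlog.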

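How the PSYNC Command Is Implemented

How it is called:
1. If the slave has never replicated any master, or has previously run SLAVEOF no one, it sends PSYNC ? -1 to the master when it starts a new replication, explicitly requesting a full resynchronization.
2. If the slave has already replicated some master, it sends PSYNC <runid> <offset> when it starts a new replication, where runid is the run ID of the master it last replicated and offset is the slave's current replication offset. The receiving master decides which kind of resynchronization to perform.
The master's reply:
1. A +FULLRESYNC <runid> <offset> reply means the master will perform a full resynchronization with the slave. runid is the master's run ID, which the slave saves for later PSYNC calls, and offset is the master's current replication offset, which the slave adopts as its initial offset.
2. A +CONTINUE reply means the master will perform a partial resynchronization. The slave only has to wait for the master to send the data it is missing.
3. An -ERR reply means the master is older than Redis 2.8 and does not recognize PSYNC. The slave then sends SYNC to the master and performs a full resynchronization.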
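The three replies above can be told apart with a few string comparisons. The following is a minimal sketch of the slave side, assuming the whole reply line has already been read into a NUL-terminated string. The `psync_result` enum and `parse_psync_reply` function are hypothetical names for illustration, not functions from replication.c.

```c
#include <stdio.h>
#include <string.h>

typedef enum {
    PSYNC_FULLRESYNC,     /* +FULLRESYNC <runid> <offset> */
    PSYNC_CONTINUE,       /* +CONTINUE */
    PSYNC_NOT_SUPPORTED,  /* -ERR: fall back to SYNC */
    PSYNC_UNKNOWN         /* anything else, or a malformed reply */
} psync_result;

/* Classify the master's reply to PSYNC. For +FULLRESYNC the run ID
 * (at most 40 characters plus the terminator) and the offset are stored
 * through the output parameters; otherwise they are left untouched. */
psync_result parse_psync_reply(const char *reply, char runid[41],
                               long long *offset) {
    if (strncmp(reply, "+FULLRESYNC ", 12) == 0) {
        if (sscanf(reply + 12, "%40s %lld", runid, offset) == 2)
            return PSYNC_FULLRESYNC;
        return PSYNC_UNKNOWN;
    }
    if (strncmp(reply, "+CONTINUE", 9) == 0) return PSYNC_CONTINUE;
    if (strncmp(reply, "-ERR", 4) == 0) return PSYNC_NOT_SUPPORTED;
    return PSYNC_UNKNOWN;
}
```

On PSYNC_FULLRESYNC the caller would store the run ID and offset for the next reconnection, and on PSYNC_NOT_SUPPORTED it would retry with SYNC, as the list above describes.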

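How Replication Is Implemented

1. Set the master's address and port: when the slave executes slaveof ip port, it saves the values in the masterhost and masterport fields of redisServer. Once that succeeds it moves to the next step.
2. Establish the socket connection: the slave opens a socket to the master at the given ip and port. If the socket is created successfully, the slave attaches to it a file event handler dedicated to replication, which performs the remaining work such as receiving the RDB file and receiving the commands the master propagates. After accepting the connection, the master creates a client state for the socket and treats the slave as a client connected to it. At this point the slave is both a server and a client: it can send command requests to the master, and the master sends command replies back.
3. Send PING: this checks that the socket works and that the master can currently handle command requests.
If the master sends a reply but the slave cannot read it within the allowed time, the network is in poor shape and the remaining steps cannot proceed. The slave disconnects and creates a new TCP connection.
If the master returns an error, it cannot handle the slave's commands for the moment and the remaining steps cannot proceed. The slave disconnects and reconnects.
If the slave receives PONG, everything is fine and it continues.
4. Authentication: if the slave has the masterauth option set, it authenticates by sending the master an AUTH command whose argument is the value of masterauth. Otherwise it skips authentication.
If the master has no requirepass option and the slave has no masterauth option, the master goes on executing the slave's commands.
If the password in the slave's AUTH command matches the master's requirepass, the master goes on executing the slave's commands. Otherwise it returns an invalid password error.
If the master has requirepass set but the slave has no masterauth, the master returns a NOAUTH error. If the master has no requirepass but the slave has masterauth set, the master returns a no password is set error.
Any of these errors makes the slave disconnect and reconnect.
5. Send the port: after authentication the slave executes REPLCONF listening-port port to tell the master which port it listens on. The master stores the port in the slave_listening_port field of the slave's client state (redisClient in the book, client in the source listed below).
6. Synchronize: the slave sends PSYNC to the master and synchronization begins. From this step on the two are clients of each other, because the master must also be able to send commands to the slave.
7. Command propagation: once synchronization completes, master and slave enter the command propagation phase. The master keeps sending the write commands it executes to the slave, and the slave keeps receiving and executing them, which keeps the two consistent.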
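The commands exchanged in steps 3 to 6 (PING, AUTH, REPLCONF, PSYNC) travel over the socket as multi-bulk requests, the same `*<argc>\r\n$<len>\r\n<arg>\r\n` layout that the listing below builds for SELECT. The sketch here shows that encoding for a command such as REPLCONF listening-port 6380. `resp_encode` is a hypothetical helper written for this note and does not appear in replication.c.

```c
#include <stdio.h>
#include <string.h>

/* Encode argv[0..argc-1] as a multi-bulk request into out.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 if the buffer of `cap` bytes is too small. */
int resp_encode(char *out, size_t cap, int argc, const char **argv) {
    size_t used;
    int n = snprintf(out, cap, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap) return -1;
    used = (size_t)n;
    for (int i = 0; i < argc; i++) {
        /* Each argument is sent as $<length>, CRLF, payload, CRLF. */
        n = snprintf(out + used, cap - used, "$%zu\r\n%s\r\n",
                     strlen(argv[i]), argv[i]);
        if (n < 0 || (size_t)n >= cap - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Because it relies on strlen and %s, this sketch only handles arguments without embedded NUL bytes, which is enough for the handshake commands but not for arbitrary binary-safe values.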

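Heartbeat

During the command propagation phase the slave sends the following command to the master, by default once per second: REPLCONF ACK <replication_offset>, where replication_offset is the slave's current replication offset.
The command serves three purposes:
1. Checking the network connection between master and slave: if more than one second has passed since the master last received this command from a slave, something is wrong with the connection.
2. Supporting the min-slaves options: min-slaves-to-write (a number of slaves) and min-slaves-max-lag (a maximum lag in seconds) keep the master from executing write commands when it is unsafe to do so. The master refuses writes when fewer than min-slaves-to-write slaves have a lag within min-slaves-max-lag.
3. Detecting lost commands: if the offset the slave reports is smaller than the master's replication offset, the master uses the reported offset to find the missing data in the replication backlog and sends it again.

replication.c
The replication-related functions used on the server side are listed below.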
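The second and third uses of the heartbeat come down to simple arithmetic on the last acknowledged offset and the time it arrived. The sketch below assumes the master keeps one record per slave. `slave_ack`, `count_good_slaves`, `writes_allowed`, and `bytes_behind` are hypothetical names for illustration, and treating a lag exactly equal to the limit as still acceptable is an assumption of this sketch.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical per-slave record kept by the master. */
typedef struct {
    long long ack_off;  /* offset carried by the last REPLCONF ACK */
    time_t ack_time;    /* when that ACK arrived */
} slave_ack;

/* Number of slaves whose last ACK is at most max_lag seconds old. */
int count_good_slaves(const slave_ack *s, size_t n, time_t now, int max_lag) {
    int good = 0;
    for (size_t i = 0; i < n; i++)
        if (now - s[i].ack_time <= max_lag) good++;
    return good;
}

/* min-slaves check: 1 if the master may accept a write command. */
int writes_allowed(const slave_ack *s, size_t n, time_t now,
                   int min_slaves, int max_lag) {
    if (min_slaves <= 0) return 1;  /* feature disabled */
    return count_good_slaves(s, n, now, max_lag) >= min_slaves;
}

/* Lost-command detection: bytes the slave has not acknowledged yet. */
long long bytes_behind(const slave_ack *s, long long master_off) {
    return master_off > s->ack_off ? master_off - s->ack_off : 0;
}
```

A non-zero result from bytes_behind corresponds to the third case in the list: the missing range starts right after the acknowledged offset and can be resent from the backlog if it is still there.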

#include "server.h"
#include <sys/time.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/stat.h>

void replicationDiscardCachedMaster(void);
void replicationResurrectCachedMaster(int newfd);
void replicationSendAck(void);
void putSlaveOnline(client *slave);
int cancelReplicationHandshake(void);

/* --------------------------- Utility functions ---------------------------- */

/* Return the pointer to a string representing the slave ip:listening_port
 * pair. Mostly useful for logging, since we want to log a slave using its
 * IP address and its listening port which is more clear for the user, for
 * example: "Closing connection with slave 10.1.2.3:6380". */
char *replicationGetSlaveName(client *c) {
    static char buf[NET_PEER_ID_LEN];
    char ip[NET_IP_STR_LEN];

    ip[0] = '\0';
    buf[0] = '\0';
    if (c->slave_ip[0] != '\0' ||
        anetPeerToString(c->fd,ip,sizeof(ip),NULL) != -1)
    {
        /* Note that the 'ip' buffer is always larger than 'c->slave_ip' */
        if (c->slave_ip[0] != '\0') memcpy(ip,c->slave_ip,sizeof(c->slave_ip));

        if (c->slave_listening_port)
            anetFormatAddr(buf,sizeof(buf),ip,c->slave_listening_port);
        else
            snprintf(buf,sizeof(buf),"%s:<unknown-slave-port>",ip);
    } else {
        snprintf(buf,sizeof(buf),"client id #%llu",
            (unsigned long long) c->id);
    }
    return buf;
}

/* ---------------------------------- MASTER -------------------------------- */

/* Create the replication backlog. */
void createReplicationBacklog(void) {
    serverAssert(server.repl_backlog == NULL);
    server.repl_backlog = zmalloc(server.repl_backlog_size);
    server.repl_backlog_histlen = 0;
    server.repl_backlog_idx = 0;
    /* When a new backlog buffer is created, we increment the replication
     * offset by one to make sure we'll not be able to PSYNC with any
     * previous slave.
     * This is needed because we avoid incrementing the
     * master_repl_offset if no backlog exists nor slaves are attached. */
    server.master_repl_offset++;

    /* We don't have any data inside our buffer, but virtually the first
     * byte we have is the next byte that will be generated for the
     * replication stream. */
    server.repl_backlog_off = server.master_repl_offset+1;
}

/* This function is called when the user modifies the replication backlog
 * size at runtime. It is up to the function to both update the
 * server.repl_backlog_size and to resize the buffer and setup it so that
 * it contains the same data as the previous one (possibly less data, but
 * the most recent bytes, or the same data and more free space in case the
 * buffer is enlarged). */
void resizeReplicationBacklog(long long newsize) {
    if (newsize < CONFIG_REPL_BACKLOG_MIN_SIZE)
        newsize = CONFIG_REPL_BACKLOG_MIN_SIZE;
    if (server.repl_backlog_size == newsize) return;

    server.repl_backlog_size = newsize;
    if (server.repl_backlog != NULL) {
        /* What we actually do is to flush the old buffer and realloc a new
         * empty one. It will refill with new data incrementally.
         * The reason is that copying a few gigabytes adds latency and even
         * worse often we need to alloc additional space before freeing the
         * old buffer. */
        zfree(server.repl_backlog);
        server.repl_backlog = zmalloc(server.repl_backlog_size);
        server.repl_backlog_histlen = 0;
        server.repl_backlog_idx = 0;
        /* Next byte we have is... the next since the buffer is empty.
         */
        server.repl_backlog_off = server.master_repl_offset+1;
    }
}

/* Free the replication backlog. */
void freeReplicationBacklog(void) {
    serverAssert(listLength(server.slaves) == 0);
    zfree(server.repl_backlog);
    server.repl_backlog = NULL;
}

/* Add data to the replication backlog.
 * This function also increments the global replication offset stored at
 * server.master_repl_offset, because there is no case where we want to feed
 * the backlog without incrementing the buffer. */
void feedReplicationBacklog(void *ptr, size_t len) {
    unsigned char *p = ptr;

    server.master_repl_offset += len;

    /* This is a circular buffer, so write as much data we can at every
     * iteration and rewind the "idx" index if we reach the limit. */
    while(len) {
        size_t thislen = server.repl_backlog_size - server.repl_backlog_idx;
        if (thislen > len) thislen = len; /* more room than data: write only len */
        memcpy(server.repl_backlog+server.repl_backlog_idx,p,thislen);
        server.repl_backlog_idx += thislen;
        if (server.repl_backlog_idx == server.repl_backlog_size)
            server.repl_backlog_idx = 0; /* reached the end: wrap to the start */
        len -= thislen; /* much like draining a TCP buffer */
        p += thislen;
        server.repl_backlog_histlen += thislen;
    }
    if (server.repl_backlog_histlen > server.repl_backlog_size)
        server.repl_backlog_histlen = server.repl_backlog_size;
    /* Set the offset of the first byte we have in the backlog. */
    server.repl_backlog_off = server.master_repl_offset -
                              server.repl_backlog_histlen + 1;
}

/* Wrapper for feedReplicationBacklog() that takes Redis string objects
 * as input.
 */
void feedReplicationBacklogWithObject(robj *o) {
    char llstr[LONG_STR_SIZE];
    void *p;
    size_t len;

    if (o->encoding == OBJ_ENCODING_INT) {
        /* Integer-encoded objects are converted to a string first. */
        len = ll2string(llstr,sizeof(llstr),(long)o->ptr);
        p = llstr;
    } else {
        len = sdslen(o->ptr);
        p = o->ptr;
    }
    feedReplicationBacklog(p,len);
}

/* Propagate a write command to the backlog and to the slaves. */
void replicationFeedSlaves(list *slaves, int dictid, robj **argv, int argc) {
    listNode *ln;
    listIter li;
    int j, len;
    char llstr[LONG_STR_SIZE];

    /* If there aren't slaves, and there is no backlog buffer to populate,
     * we can return ASAP. */
    if (server.repl_backlog == NULL && listLength(slaves) == 0) return;

    /* We can't have slaves attached and no backlog. */
    serverAssert(!(listLength(slaves) != 0 && server.repl_backlog == NULL));

    /* Send SELECT command to every slave if needed. */
    if (server.slaveseldb != dictid) {
        robj *selectcmd;

        /* For a few DBs we have pre-computed SELECT command. */
        if (dictid >= 0 && dictid < PROTO_SHARED_SELECT_CMDS) {
            selectcmd = shared.select[dictid];
        } else {
            int dictid_len;

            dictid_len = ll2string(llstr,sizeof(llstr),dictid);
            selectcmd = createObject(OBJ_STRING,
                sdscatprintf(sdsempty(),
                "*2\r\n$6\r\nSELECT\r\n$%d\r\n%s\r\n",
                dictid_len, llstr));
        }

        /* Add the SELECT command into the backlog. */
        if (server.repl_backlog) feedReplicationBacklogWithObject(selectcmd);

        /* Send it to slaves.
         */
        listRewind(slaves,&li);
        while((ln = listNext(&li))) { /* walk the list of slaves */
            client *slave = ln->value;
            /* Skip slaves still waiting for the BGSAVE to start. */
            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) continue;
            addReply(slave,selectcmd); /* send the command */
        }

        /* Not one of the shared SELECT objects: drop our reference. */
        if (dictid < 0 || dictid >= PROTO_SHARED_SELECT_CMDS)
            decrRefCount(selectcmd);
    }
    server.slaveseldb = dictid;

    /* Write the command to the replication backlog if any. */
    if (server.repl_backlog) {
        char aux[LONG_STR_SIZE+3];

        /* Add the multi bulk reply length. */
        aux[0] = '*';
        len = ll2string(aux+1,sizeof(aux)-1,argc);
        aux[len+1] = '\r';
        aux[len+2] = '\n';
        feedReplicationBacklog(aux,len+3);

        for (j = 0; j < argc; j++) {
            long objlen = stringObjectLen(argv[j]);

            /* We need to feed the buffer with the object as a bulk reply
             * not just as a plain string, so create the $..CRLF payload len
             * and add the final CRLF */
            aux[0] = '$';
            len = ll2string(aux+1,sizeof(aux)-1,objlen);
            aux[len+1] = '\r';
            aux[len+2] = '\n';
            feedReplicationBacklog(aux,len+3);
            feedReplicationBacklogWithObject(argv[j]);
            feedReplicationBacklog(aux+len+1,2); /* the trailing \r\n */
        }
    }

    /* Write the command to every slave. */
    listRewind(server.slaves,&li);
    while((ln = listNext(&li))) {
        client *slave = ln->value;

        /* Don't feed slaves that are still waiting for BGSAVE to start */
        if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) continue;

        /* Feed slaves that are waiting for the initial SYNC (so these commands
         * are queued in the output buffer until the initial SYNC completes),
         * or are already in sync with the master. */

        /* Add the multi bulk length.
         */
        addReplyMultiBulkLen(slave,argc);

        /* Finally any additional argument that was not stored inside the
         * static buffer if any (from j to argc). */
        for (j = 0; j < argc; j++)
            addReplyBulk(slave,argv[j]);
    }
}

/* Send the executed command to every MONITOR client. */
void replicationFeedMonitors(client *c, list *monitors, int dictid, robj **argv, int argc) {
    listNode *ln;
    listIter li;
    int j;
    sds cmdrepr = sdsnew("+");
    robj *cmdobj;
    struct timeval tv;

    gettimeofday(&tv,NULL);
    cmdrepr = sdscatprintf(cmdrepr,"%ld.%06ld ",(long)tv.tv_sec,(long)tv.tv_usec);
    if (c->flags & CLIENT_LUA) {
        cmdrepr = sdscatprintf(cmdrepr,"[%d lua] ",dictid);
    } else if (c->flags & CLIENT_UNIX_SOCKET) {
        cmdrepr = sdscatprintf(cmdrepr,"[%d unix:%s] ",dictid,server.unixsocket);
    } else {
        cmdrepr = sdscatprintf(cmdrepr,"[%d %s] ",dictid,getClientPeerId(c));
    }

    for (j = 0; j < argc; j++) {
        if (argv[j]->encoding == OBJ_ENCODING_INT) {
            cmdrepr = sdscatprintf(cmdrepr, "\"%ld\"", (long)argv[j]->ptr);
        } else {
            cmdrepr = sdscatrepr(cmdrepr,(char*)argv[j]->ptr,
                        sdslen(argv[j]->ptr));
        }
        if (j != argc-1)
            cmdrepr = sdscatlen(cmdrepr," ",1);
    }
    cmdrepr = sdscatlen(cmdrepr,"\r\n",2);
    cmdobj = createObject(OBJ_STRING,cmdrepr);

    listRewind(monitors,&li);
    while((ln = listNext(&li))) {
        client *monitor = ln->value;
        addReply(monitor,cmdobj);
    }
    decrRefCount(cmdobj);
}

/* Feed the slave 'c' with the replication backlog starting from the
 * specified 'offset' up to the end of the backlog.
 */
long long addReplyReplicationBacklog(client *c, long long offset) {
    long long j, skip, len;

    serverLog(LL_DEBUG, "[PSYNC] Slave request offset: %lld", offset);

    /* Nothing has been fed to the backlog yet: the history length is 0. */
    if (server.repl_backlog_histlen == 0) {
        serverLog(LL_DEBUG, "[PSYNC] Backlog history len is zero");
        return 0;
    }

    serverLog(LL_DEBUG, "[PSYNC] Backlog size: %lld",
             server.repl_backlog_size);
    serverLog(LL_DEBUG, "[PSYNC] First byte: %lld",
             server.repl_backlog_off);
    serverLog(LL_DEBUG, "[PSYNC] History len: %lld",
             server.repl_backlog_histlen);
    serverLog(LL_DEBUG, "[PSYNC] Current index: %lld",
             server.repl_backlog_idx);

    /* Compute the amount of bytes we need to discard. */
    skip = offset - server.repl_backlog_off;
    serverLog(LL_DEBUG, "[PSYNC] Skipping: %lld", skip);

    /* Point j to the oldest byte, that is actually our
     * server.repl_backlog_off byte. */
    j = (server.repl_backlog_idx +
        (server.repl_backlog_size-server.repl_backlog_histlen)) %
        server.repl_backlog_size;
    serverLog(LL_DEBUG, "[PSYNC] Index of first byte: %lld", j);

    /* Discard the amount of data to seek to the specified 'offset'. */
    j = (j + skip) % server.repl_backlog_size;

    /* Feed slave with data. Since it is a circular buffer we have to
     * split the reply in two parts if we are cross-boundary. */
    len = server.repl_backlog_histlen - skip;
    serverLog(LL_DEBUG, "[PSYNC] Reply total length: %lld", len);
    while(len) {
        long long thislen =
            ((server.repl_backlog_size - j) < len) ?
            (server.repl_backlog_size - j) : len;

        serverLog(LL_DEBUG, "[PSYNC] addReply() length: %lld", thislen);
        addReplySds(c,sdsnewlen(server.repl_backlog + j, thislen));
        len -= thislen;
        j = 0;
    }
    return server.repl_backlog_histlen - skip;
}

/* Return the offset to provide as reply to the PSYNC command received
 * from the slave. The returned value is only valid immediately after
 * the BGSAVE process started and before executing any other command
 * from clients. */
long long getPsyncInitialOffset(void) {
    long long psync_offset = server.master_repl_offset;
    /* Add 1 to psync_offset if the replication backlog does not exist,
     * as when it will be created later we'll increment the offset by one. */
    if (server.repl_backlog == NULL) psync_offset++;
    return psync_offset;
}

/* Send a FULLRESYNC reply in the specific case of a full resynchronization,
 * as a side effect setup the slave for a full sync in different ways:
 *
 * 1) Remember, into the slave client structure, the offset we sent
 *    here, so that if new slaves will later attach to the same
 *    background RDB saving process (by duplicating this client output
 *    buffer), we can get the right offset from this slave.
 * 2) Set the replication state of the slave to WAIT_BGSAVE_END so that
 *    we start accumulating differences from this point.
 * 3) Force the replication stream to re-emit a SELECT statement so
 *    the new slave incremental differences will start selecting the
 *    right database number.
 *
 * Normally this function should be called immediately after a successful
 * BGSAVE for replication was started, or when there is one already in
 * progress that we attached our slave to.
 */
int replicationSetupSlaveForFullResync(client *slave, long long offset) {
    char buf[128];
    int buflen;

    slave->psync_initial_offset = offset;
    slave->replstate = SLAVE_STATE_WAIT_BGSAVE_END;
    /* We are going to accumulate the incremental changes for this
     * slave as well. Set slaveseldb to -1 in order to force to re-emit
     * a SELECT statement in the replication stream. */
    server.slaveseldb = -1;

    /* Don't send this reply to slaves that approached us with
     * the old SYNC command. */
    if (!(slave->flags & CLIENT_PRE_PSYNC)) {
        buflen = snprintf(buf,sizeof(buf),"+FULLRESYNC %s %lld\r\n",
                          server.runid,offset);
        if (write(slave->fd,buf,buflen) != buflen) {
            freeClientAsync(slave);
            return C_ERR;
        }
    }
    return C_OK;
}

/* This function handles the PSYNC command from the point of view of a
 * master receiving a request for partial resynchronization.
 *
 * On success return C_OK, otherwise C_ERR is returned and we proceed
 * with the usual full resync. */
int masterTryPartialResynchronization(client *c) {
    long long psync_offset, psync_len;
    char *master_runid = c->argv[1]->ptr;
    char buf[128];
    int buflen;

    /* Is the runid of this master the same advertised by the wannabe slave
     * via PSYNC? If runid changed this master is a different instance and
     * there is no way to continue. */
    if (strcasecmp(master_runid, server.runid)) {
        /* Run id "?" is used by slaves that want to force a full resync.
         */
        if (master_runid[0] != '?') {
            serverLog(LL_NOTICE,"Partial resynchronization not accepted: "
                "Runid mismatch (Client asked for runid '%s', my runid is '%s')",
                master_runid, server.runid);
        } else {
            serverLog(LL_NOTICE,"Full resync requested by slave %s",
                replicationGetSlaveName(c));
        }
        goto need_full_resync;
    }

    /* We still have the data our slave is asking for? */
    if (getLongLongFromObjectOrReply(c,c->argv[2],&psync_offset,NULL) !=
       C_OK) goto need_full_resync;
    if (!server.repl_backlog ||
        psync_offset < server.repl_backlog_off ||
        psync_offset > (server.repl_backlog_off + server.repl_backlog_histlen))
    {
        serverLog(LL_NOTICE,
            "Unable to partial resync with slave %s for lack of backlog (Slave request was: %lld).", replicationGetSlaveName(c), psync_offset);
        if (psync_offset > server.master_repl_offset) {
            serverLog(LL_WARNING,
                "Warning: slave %s tried to PSYNC with an offset that is greater than the master replication offset.", replicationGetSlaveName(c));
        }
        goto need_full_resync;
    }

    /* If we reached this point, we are able to perform a partial resync:
     * 1) Set client state to make it a slave.
     * 2) Inform the client we can continue with +CONTINUE
     * 3) Send the backlog data (from the offset to the end) to the slave. */
    c->flags |= CLIENT_SLAVE;
    c->replstate = SLAVE_STATE_ONLINE;
    c->repl_ack_time = server.unixtime;
    c->repl_put_online_on_ack = 0;
    listAddNodeTail(server.slaves,c);
    /* We can't use the connection buffers since they are used to accumulate
     * new commands at this stage. But we are sure the socket send buffer is
     * empty so this write will never fail actually.
     */
    buflen = snprintf(buf,sizeof(buf),"+CONTINUE\r\n");
    if (write(c->fd,buf,buflen) != buflen) {
        freeClientAsync(c);
        return C_OK;
    }
    psync_len = addReplyReplicationBacklog(c,psync_offset);
    serverLog(LL_NOTICE,
        "Partial resynchronization request from %s accepted. Sending %lld bytes of backlog starting from offset %lld.",
            replicationGetSlaveName(c),
            psync_len, psync_offset);
    /* Note that we don't need to set the selected DB at server.slaveseldb
     * to -1 to force the master to emit SELECT, since the slave already
     * has this state from the previous connection with the master. */

    refreshGoodSlavesCount();
    return C_OK; /* The caller can return, no full resync needed. */

need_full_resync:
    /* We need a full resync for some reason... Note that we can't
     * reply to PSYNC right now if a full SYNC is needed. The reply
     * must include the master offset at the time the RDB file we transfer
     * is generated, so we need to delay the reply to that moment. */
    return C_ERR;
}

/* Start a BGSAVE for replication goals, which is, selecting the disk or
 * socket target depending on the configuration, and making sure that
 * the script cache is flushed before to start.
 *
 * The mincapa argument is the bitwise AND among all the slaves capabilities
 * of the slaves waiting for this BGSAVE, so represents the slave capabilities
 * all the slaves support. Can be tested via SLAVE_CAPA_* macros.
 *
 * Side effects, other than starting a BGSAVE:
 *
 * 1) Handle the slaves in WAIT_START state, by preparing them for a full
 *    sync if the BGSAVE was successfully started, or sending them an error
 *    and dropping them from the list of slaves.
 *
 * 2) Flush the Lua scripting script cache if the BGSAVE was actually
 *    started.
 *
 * Returns C_OK on success or C_ERR otherwise. */
int startBgsaveForReplication(int mincapa) {
    int retval;
    int socket_target = server.repl_diskless_sync && (mincapa & SLAVE_CAPA_EOF);
    listIter li;
    listNode *ln;

    serverLog(LL_NOTICE,"Starting BGSAVE for SYNC with target: %s",
        socket_target ? "slaves sockets" : "disk");

    if (socket_target)
        retval = rdbSaveToSlavesSockets(); /* stream the RDB straight to the slave sockets */
    else
        retval = rdbSaveBackground(server.rdb_filename);

    /* If we failed to BGSAVE, remove the slaves waiting for a full
     * resynchronization from the list of slaves, inform them with
     * an error about what happened, close the connection ASAP. */
    if (retval == C_ERR) {
        serverLog(LL_WARNING,"BGSAVE for replication failed");
        listRewind(server.slaves,&li);
        while((ln = listNext(&li))) {
            client *slave = ln->value;

            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) {
                slave->flags &= ~CLIENT_SLAVE;
                listDelNode(server.slaves,ln);
                addReplyError(slave,
                    "BGSAVE failed, replication can't continue");
                slave->flags |= CLIENT_CLOSE_AFTER_REPLY;
            }
        }
        return retval;
    }

    /* If the target is socket, rdbSaveToSlavesSockets() already setup
     * the slaves for a full resync. Otherwise for disk target do it now.
     */
    if (!socket_target) {
        listRewind(server.slaves,&li);
        while((ln = listNext(&li))) {
            client *slave = ln->value;

            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) {
                    replicationSetupSlaveForFullResync(slave,
                            getPsyncInitialOffset());
            }
        }
    }

    /* Flush the script cache, since we need that slave differences are
     * accumulated without requiring slaves to match our cached scripts. */
    if (retval == C_OK) replicationScriptCacheFlush();
    return retval;
}

/* SYNC and PSYNC command implementation. */
void syncCommand(client *c) {
    /* ignore SYNC if already slave or in monitor mode */
    if (c->flags & CLIENT_SLAVE) return;

    /* Refuse SYNC requests if we are a slave but the link with our master
     * is not ok... */
    if (server.masterhost && server.repl_state != REPL_STATE_CONNECTED) {
        addReplyError(c,"Can't SYNC while not connected with my master");
        return;
    }

    /* SYNC can't be issued when the server has pending data to send to
     * the client about already issued commands. We need a fresh reply
     * buffer registering the differences between the BGSAVE and the current
     * dataset, so that we can copy to other slaves if needed. */
    if (clientHasPendingReplies(c)) {
        addReplyError(c,"SYNC and PSYNC are invalid with pending output");
        return;
    }

    serverLog(LL_NOTICE,"Slave %s asks for synchronization",
        replicationGetSlaveName(c));

    /* Try a partial resynchronization if this is a PSYNC command.
     * If it fails, we continue with usual full resynchronization, however
     * when this happens masterTryPartialResynchronization() already
     * replied with:
     *
     * +FULLRESYNC <runid> <offset>
     *
     * So the slave knows the new runid and offset to try a PSYNC later
     * if the connection with the master is lost. */
    if (!strcasecmp(c->argv[0]->ptr,"psync")) {
        if (masterTryPartialResynchronization(c) == C_OK) {
            server.stat_sync_partial_ok++;
            return; /* No full resync needed, return. */
        } else {
            char *master_runid = c->argv[1]->ptr;

            /* Increment stats for failed PSYNCs, but only if the
             * runid is not "?", as this is used by slaves to force a full
             * resync on purpose when they are not able to partially
             * resync. */
            if (master_runid[0] != '?') server.stat_sync_partial_err++;
        }
    } else {
        /* If a slave uses SYNC, we are dealing with an old implementation
         * of the replication protocol (like redis-cli --slave). Flag the client
         * so that we don't expect to receive REPLCONF ACK feedbacks. */
        c->flags |= CLIENT_PRE_PSYNC;
    }

    /* Full resynchronization. */
    server.stat_sync_full++;

    /* Setup the slave as one waiting for BGSAVE to start. The following code
     * paths will change the state if we handle the slave differently. */
    c->replstate = SLAVE_STATE_WAIT_BGSAVE_START;
    if (server.repl_disable_tcp_nodelay)
        anetDisableTcpNoDelay(NULL, c->fd); /* Non critical if it fails. */
    c->repldbfd = -1;
    c->flags |= CLIENT_SLAVE;
    listAddNodeTail(server.slaves,c);

    /* CASE 1: BGSAVE is in progress, with disk target.
     */
    if (server.rdb_child_pid != -1 &&
        server.rdb_child_type == RDB_CHILD_TYPE_DISK)
    {
        /* Ok a background save is in progress. Let's check if it is a good
         * one for replication, i.e. if there is another slave that is
         * registering differences since the server forked to save. */
        client *slave;
        listNode *ln;
        listIter li;

        listRewind(server.slaves,&li);
        while((ln = listNext(&li))) {
            slave = ln->value;
            if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_END) break;
        }
        /* To attach this slave, we check that it has at least all the
         * capabilities of the slave that triggered the current BGSAVE. */
        if (ln && ((c->slave_capa & slave->slave_capa) == slave->slave_capa)) {
            /* Perfect, the server is already registering differences for
             * another slave. Set the right state, and copy the buffer. */
            copyClientOutputBuffer(c,slave);
            replicationSetupSlaveForFullResync(c,slave->psync_initial_offset);
            serverLog(LL_NOTICE,"Waiting for end of BGSAVE for SYNC");
        } else {
            /* No way, we need to wait for the next BGSAVE in order to
             * register differences. */
            serverLog(LL_NOTICE,"Can't attach the slave to the current BGSAVE. Waiting for next BGSAVE for SYNC");
        }

    /* CASE 2: BGSAVE is in progress, with socket target. */
    } else if (server.rdb_child_pid != -1 &&
               server.rdb_child_type == RDB_CHILD_TYPE_SOCKET)
    {
        /* There is an RDB child process but it is writing directly to
         * children sockets. We need to wait for the next BGSAVE
         * in order to synchronize.
         */
        serverLog(LL_NOTICE,"Current BGSAVE has socket target. Waiting for next BGSAVE for SYNC");

    /* CASE 3: There is no BGSAVE in progress. */
    } else {
        if (server.repl_diskless_sync && (c->slave_capa & SLAVE_CAPA_EOF)) {
            /* Diskless replication RDB child is created inside
             * replicationCron() since we want to delay its start a
             * few seconds to wait for more slaves to arrive. */
            if (server.repl_diskless_sync_delay)
                serverLog(LL_NOTICE,"Delay next BGSAVE for diskless SYNC");
        } else {
            /* Target is disk (or the slave is not capable of supporting
             * diskless replication) and we don't have a BGSAVE in progress,
             * let's start one. */
            if (server.aof_child_pid == -1) {
                startBgsaveForReplication(c->slave_capa);
            } else {
                serverLog(LL_NOTICE,
                    "No BGSAVE in progress, but an AOF rewrite is active. "
                    "BGSAVE for replication delayed");
            }
        }
    }

    if (listLength(server.slaves) == 1 && server.repl_backlog == NULL)
        createReplicationBacklog();
    return;
}

/* REPLCONF <option> <value> <option> <value> ...
 * This command is used by a slave in order to configure the replication
 * process before starting it with the SYNC command.
 *
 * Currently the only use of this command is to communicate to the master
 * what is the listening port of the Slave redis instance, so that the
 * master can accurately list slaves and their listening ports in
 * the INFO output.
 *
 * In the future the same command can be used in order to configure
 * the replication to initiate an incremental replication instead of a
 * full resync.
 *
 * Annotation: this is the implementation of the REPLCONF command. */
void replconfCommand(client *c) {
    int j;

    if ((c->argc % 2) == 0) {
        /* Number of arguments must be odd to make sure that every
         * option has a corresponding value. */
        addReply(c,shared.syntaxerr);
        return;
    }

    /* Process every option-value pair. */
    for (j = 1; j < c->argc; j+=2) {
        if (!strcasecmp(c->argv[j]->ptr,"listening-port")) {
            long port;

            if ((getLongFromObjectOrReply(c,c->argv[j+1],
                    &port,NULL) != C_OK))
                return;
            c->slave_listening_port = port;
        } else if (!strcasecmp(c->argv[j]->ptr,"ip-address")) {
            sds ip = c->argv[j+1]->ptr;
            if (sdslen(ip) < sizeof(c->slave_ip)) {
                memcpy(c->slave_ip,ip,sdslen(ip)+1);
            } else {
                addReplyErrorFormat(c,"REPLCONF ip-address provided by "
                    "slave instance is too long: %zd bytes", sdslen(ip));
                return;
            }
        } else if (!strcasecmp(c->argv[j]->ptr,"capa")) {
            /* Ignore capabilities not understood by this master. */
            if (!strcasecmp(c->argv[j+1]->ptr,"eof"))
                c->slave_capa |= SLAVE_CAPA_EOF;
        } else if (!strcasecmp(c->argv[j]->ptr,"ack")) {
            /* REPLCONF ACK is used by slave to inform the master the amount
             * of replication stream that it processed so far. It is an
             * internal only command that normal clients should never use.
             */
            long long offset;

            if (!(c->flags & CLIENT_SLAVE)) return;
            if ((getLongLongFromObject(c->argv[j+1], &offset) != C_OK))
                return;
            if (offset > c->repl_ack_off)
                c->repl_ack_off = offset;
            c->repl_ack_time = server.unixtime;
            /* If this was a diskless replication, we need to really put
             * the slave online when the first ACK is received (which
             * confirms slave is online and ready to get more data). */
            if (c->repl_put_online_on_ack && c->replstate == SLAVE_STATE_ONLINE)
                putSlaveOnline(c);
            /* Note: this command does not reply anything! */
            return;
        } else if (!strcasecmp(c->argv[j]->ptr,"getack")) {
            /* REPLCONF GETACK is used in order to request an ACK ASAP
             * to the slave. */
            if (server.masterhost && server.master) replicationSendAck();
            /* Note: this command does not reply anything! */
        } else {
            addReplyErrorFormat(c,"Unrecognized REPLCONF option: %s",
                (char*)c->argv[j]->ptr);
            return;
        }
    }
    addReply(c,shared.ok);
}

/* This function puts a slave in the online state, and should be called just
 * after a slave received the RDB file for the initial synchronization, and
 * we are finally ready to send the incremental stream of commands.
 *
 * It does a few things:
 *
 * 1) Put the slave in ONLINE state (useless when the function is called
 *    because state is already ONLINE but repl_put_online_on_ack is true).
 * 2) Make sure the writable event is re-installed, since calling the SYNC
 *    command disables it, so that we can accumulate output buffer without
 *    sending it to the slave.
 * 3) Update the count of good slaves. */
void putSlaveOnline(client *slave) {
    slave->replstate = SLAVE_STATE_ONLINE;
    slave->repl_put_online_on_ack = 0;
    slave->repl_ack_time = server.unixtime; /* Prevent false timeout. */
    if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE,
        sendReplyToClient, slave) == AE_ERR) {
        serverLog(LL_WARNING,"Unable to register writable event for slave bulk transfer: %s", strerror(errno));
        freeClient(slave);
        return;
    }
    refreshGoodSlavesCount();
    serverLog(LL_NOTICE,"Synchronization with slave %s succeeded",
        replicationGetSlaveName(slave));
}

void sendBulkToSlave(aeEventLoop *el, int fd, void *privdata, int mask) {
    client *slave = privdata;
    UNUSED(el);
    UNUSED(mask);
    char buf[PROTO_IOBUF_LEN];
    ssize_t nwritten, buflen;

    /* Before sending the RDB file, we send the preamble as configured by the
     * replication process. Currently the preamble is just the bulk count of
     * the file in the form "$<length>\r\n". */
    if (slave->replpreamble) {
        nwritten = write(fd,slave->replpreamble,sdslen(slave->replpreamble));
        if (nwritten == -1) {
            serverLog(LL_VERBOSE,"Write error sending RDB preamble to slave: %s",
                strerror(errno));
            freeClient(slave);
            return;
        }
        server.stat_net_output_bytes += nwritten;
        sdsrange(slave->replpreamble,nwritten,-1);
        if (sdslen(slave->replpreamble) == 0) {
            sdsfree(slave->replpreamble);
            slave->replpreamble = NULL;
            /* fall through sending data. */
        } else {
            return;
        }
    }

    /* If the preamble was already transfered, send the RDB bulk data.
     */
    lseek(slave->repldbfd,slave->repldboff,SEEK_SET);
    buflen = read(slave->repldbfd,buf,PROTO_IOBUF_LEN);
    if (buflen <= 0) {
        serverLog(LL_WARNING,"Read error sending DB to slave: %s",
            (buflen == 0) ? "premature EOF" : strerror(errno));
        freeClient(slave);
        return;
    }
    if ((nwritten = write(fd,buf,buflen)) == -1) {
        if (errno != EAGAIN) {
            serverLog(LL_WARNING,"Write error sending DB to slave: %s",
                strerror(errno));
            freeClient(slave);
        }
        return;
    }
    slave->repldboff += nwritten;
    server.stat_net_output_bytes += nwritten;
    if (slave->repldboff == slave->repldbsize) {
        close(slave->repldbfd);
        slave->repldbfd = -1;
        aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);
        putSlaveOnline(slave);
    }
}

/* This function is called at the end of every background saving,
 * or when the replication RDB transfer strategy is modified from
 * disk to socket or the other way around.
 *
 * The goal of this function is to handle slaves waiting for a successful
 * background saving in order to perform non-blocking synchronization, and
 * to schedule a new BGSAVE if there are slaves that attached while a
 * BGSAVE was in progress, but it was not a good one for replication (no
 * other slave was accumulating differences).
 *
 * The argument bgsaveerr is C_OK if the background saving succeeded
 * otherwise C_ERR is passed to the function.
 * The 'type' argument is the type of the child that terminated
 * (if it had a disk or socket target).
 */
void updateSlavesWaitingBgsave(int bgsaveerr, int type) {
    listNode *ln;
    int startbgsave = 0;
    int mincapa = -1;
    listIter li;

    listRewind(server.slaves,&li);
    while((ln = listNext(&li))) {
        client *slave = ln->value;

        if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_START) {
            startbgsave = 1;
            mincapa = (mincapa == -1) ? slave->slave_capa :
                                        (mincapa & slave->slave_capa);
        } else if (slave->replstate == SLAVE_STATE_WAIT_BGSAVE_END) {
            struct redis_stat buf;

            /* If this was an RDB on disk save, we have to prepare to send
             * the RDB from disk to the slave socket. Otherwise if this was
             * already an RDB -> Slaves socket transfer, used in the case of
             * diskless replication, our work is trivial, we can just put
             * the slave online. */
            if (type == RDB_CHILD_TYPE_SOCKET) {
                serverLog(LL_NOTICE,
                    "Streamed RDB transfer with slave %s succeeded (socket). Waiting for REPLCONF ACK from slave to enable streaming",
                        replicationGetSlaveName(slave));
                /* Note: we wait for a REPLCONF ACK message from slave in
                 * order to really put it online (install the write handler
                 * so that the accumulated data can be transfered). However
                 * we change the replication state ASAP, since our slave
                 * is technically online now. */
                slave->replstate = SLAVE_STATE_ONLINE;
                slave->repl_put_online_on_ack = 1;
                slave->repl_ack_time = server.unixtime; /* Timeout otherwise. */
            } else {
                if (bgsaveerr != C_OK) {
                    freeClient(slave);
                    serverLog(LL_WARNING,"SYNC failed. BGSAVE child returned an error");
                    continue;
                }
                if ((slave->repldbfd = open(server.rdb_filename,O_RDONLY)) == -1 ||
                    redis_fstat(slave->repldbfd,&buf) == -1) {
                    freeClient(slave);
                    serverLog(LL_WARNING,"SYNC failed. Can't open/stat DB after BGSAVE: %s", strerror(errno));
                    continue;
                }
                slave->repldboff = 0;
                slave->repldbsize = buf.st_size;
                slave->replstate = SLAVE_STATE_SEND_BULK;
                slave->replpreamble = sdscatprintf(sdsempty(),"$%lld\r\n",
                    (unsigned long long) slave->repldbsize);

                aeDeleteFileEvent(server.el,slave->fd,AE_WRITABLE);
                if (aeCreateFileEvent(server.el, slave->fd, AE_WRITABLE, sendBulkToSlave, slave) == AE_ERR) {
                    freeClient(slave);
                    continue;
                }
            }
        }
    }
    if (startbgsave) startBgsaveForReplication(mincapa);
}
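
The two capability checks in the listing above, the attach test in the SYNC path and the mincapa accumulation in updateSlavesWaitingBgsave, are both plain bitmask operations. The following is only a minimal standalone sketch of that idea, not Redis code: the CAPA_* constants (including the invented second bit) and the function names can_attach and min_capa are hypothetical and exist purely for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capability bits, mirroring the idea of SLAVE_CAPA_EOF. */
#define CAPA_NONE  0
#define CAPA_EOF   (1 << 0)
#define CAPA_OTHER (1 << 1)  /* invented second bit, for illustration only */

/* A new replica may share a running BGSAVE only if it supports at least
 * every capability of the replica that triggered it (superset test). */
static int can_attach(int new_capa, int trigger_capa) {
    return (new_capa & trigger_capa) == trigger_capa;
}

/* Intersection of the capabilities of all waiting replicas, or -1 when
 * nobody is waiting (the same sentinel the listing uses for mincapa). */
static int min_capa(const int *capa, size_t n) {
    int mincapa = -1;
    assert(capa != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        mincapa = (mincapa == -1) ? capa[i] : (mincapa & capa[i]);
    return mincapa;
}
```

Under these assumptions, can_attach(CAPA_NONE, CAPA_EOF) is 0, so a replica lacking a capability that the triggering replica has must wait for the next BGSAVE, while min_capa over { CAPA_EOF | CAPA_OTHER, CAPA_EOF } yields CAPA_EOF, the set every waiting replica can handle.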

Client-related functions

To be continued
