Redis Source Code Analysis (3): The ziplist Implementation


1. ziplist overview and usage

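A ziplist is a "compressed list"; as the name suggests, it is designed to save space. Compare it with Redis's other list type, the doubly linked list: every node there needs prev and next pointers to its neighbours (16 bytes of pointers on a 64-bit platform), so if each value occupies only 1 byte, storing many such values wastes a great deal of space. Redis therefore provides a more space-efficient list, the ziplist. A ziplist entry has no such pointers. Instead, each entry starts with a header holding two fields: the length of the previous entry, and the encoding (type and length) of the current entry. This header takes as little as 2 bytes, which saves a lot of space compared with a doubly linked list.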
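To put rough numbers on that comparison, here is a minimal sketch in C. The `dl_node` struct and the helper names are hypothetical and only illustrate a generic prev/next/value node; they are not taken from the Redis source.

```c
#include <stddef.h>

/* Hypothetical doubly linked list node: two link pointers plus a
 * pointer to the value (an illustration, not Redis's definition). */
typedef struct dl_node {
    struct dl_node *prev;
    struct dl_node *next;
    void *value;
} dl_node;

/* Smallest possible ziplist entry header: 1 byte for the previous
 * entry's length plus 1 byte for this entry's encoding/length. */
enum { ZL_MIN_HEADER_BYTES = 2 };

/* Bytes spent purely on linkage (prev + next) in each list node. */
static size_t dl_link_overhead(void) {
    return offsetof(dl_node, value);
}

/* Bytes saved per entry when the 2-byte header replaces the pointers. */
static size_t link_bytes_saved(void) {
    return dl_link_overhead() - ZL_MIN_HEADER_BYTES;
}
```

On a typical 64-bit build `dl_link_overhead()` is 16, so the minimum ziplist header is 14 bytes smaller per entry, before even counting allocator overhead for each separately allocated node.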

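The ziplist is one of the underlying implementations of list keys and hash keys. When a list key holds only a small number of items, and each item is either a small integer or a short string, Redis uses a ziplist as the implementation of that list key.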

2. An example

Although the source comments explain the design of the ziplist in detail, the code itself is hard to follow, so let's start with an example:

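Step 1: create an empty ziplist.

Step 2: add the 1st entry, "abcde".

Step 3: add the 2nd entry, "ABC" (appended at the tail).
Step 4: add the 3rd entry, "10" (pushed at the head).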
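The four steps can be traced byte by byte with a small toy model. This is a sketch under loud assumptions: all `toy_*` names are hypothetical, the buffer is caller-provided and large enough, every entry is shorter than 254 bytes, strings are at most 63 bytes, integers are in the range 0 to 12, and the list holds fewer than 256 entries. It is not the Redis implementation, only an illustration of the layout `<zlbytes><zltail><zllen><entry>...<zlend>`.

```c
#include <stdint.h>
#include <string.h>

#define TOY_HDR 10   /* zlbytes (4) + zltail (4) + zllen (2) */
#define TOY_END 0xFF /* end-of-list marker */

/* Little-endian 32-bit store/load, independent of host byte order. */
static void put32(unsigned char *p, uint32_t v) {
    p[0] = v & 0xFF;         p[1] = (v >> 8) & 0xFF;
    p[2] = (v >> 16) & 0xFF; p[3] = (v >> 24) & 0xFF;
}
static uint32_t get32(const unsigned char *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

static uint32_t toy_bytes(const unsigned char *zl) { return get32(zl); }
static uint32_t toy_tail(const unsigned char *zl)  { return get32(zl + 4); }
static unsigned toy_len(const unsigned char *zl)   { return zl[8] | (zl[9] << 8); }

/* Step 1: an empty list is the 10-byte header plus the end marker. */
static void toy_new(unsigned char *zl) {
    put32(zl, TOY_HDR + 1);
    put32(zl + 4, TOY_HDR);
    zl[8] = 0; zl[9] = 0;
    zl[TOY_HDR] = TOY_END;
}

/* Steps 2 and 3: append a short string (slen <= 63) at the tail. */
static void toy_push_tail_str(unsigned char *zl, const char *s, unsigned slen) {
    uint32_t bytes = toy_bytes(zl), tail = toy_tail(zl);
    unsigned char *end = zl + bytes - 1;  /* current position of TOY_END */
    /* Length of the old tail entry, or 0 when the list is empty. */
    unsigned prevlen = (zl[tail] == TOY_END) ? 0 : (unsigned)(end - (zl + tail));
    end[0] = (unsigned char)prevlen;      /* 1-byte prevlen field */
    end[1] = (unsigned char)slen;         /* |00pppppp| string header */
    memcpy(end + 2, s, slen);
    end[2 + slen] = TOY_END;
    put32(zl + 4, (uint32_t)(end - zl));  /* the new entry is the tail */
    put32(zl, bytes + 2 + slen);
    zl[8]++;                              /* toy: fewer than 256 entries */
}

/* Step 4: push an immediate integer (0..12) at the head. */
static void toy_push_head_imm(unsigned char *zl, unsigned v) {
    uint32_t bytes = toy_bytes(zl), tail = toy_tail(zl);
    unsigned char *first = zl + TOY_HDR;
    int empty = (first[0] == TOY_END);
    memmove(first + 2, first, bytes - TOY_HDR); /* shift entries + end marker */
    first[0] = 0;                               /* no previous entry */
    first[1] = (unsigned char)(0xF0 | (v + 1)); /* |1111xxxx|, xxxx = v + 1 */
    if (!empty) {
        first[2] = 2;                           /* old head now follows a 2-byte entry */
        put32(zl + 4, tail + 2);
    }
    put32(zl, bytes + 2);
    zl[8]++;
}
```

Running the four steps on a 64-byte buffer gives the following states (entry bytes in hex, characters quoted):

    empty:          zlbytes=11 zltail=10 zllen=0 | FF
    + "abcde":      zlbytes=18 zltail=10 zllen=1 | 00 05 "abcde" | FF
    + "ABC" (tail): zlbytes=23 zltail=17 zllen=2 | 00 05 "abcde" | 07 03 "ABC" | FF
    + "10" (head):  zlbytes=25 zltail=19 zllen=3 | 00 FB | 02 05 "abcde" | 07 03 "ABC" | FF

Note that the string "10" is stored as the integer 10 in a single encoding byte (0xFB), and that pushing it at the head forces the "abcde" entry's prevlen field to change from 0 to 2.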

3. The ziplist implementation

First, here is the author's own explanation of the ziplist, taken from the header comment of the source file:

/* The ziplist is a specially encoded dually linked list that is designed
 * to be very memory efficient. It stores both strings and integer values,
 * where integers are encoded as actual integers instead of a series of
 * characters. It allows push and pop operations on either side of the list
 * in O(1) time. However, because every operation requires a reallocation of
 * the memory used by the ziplist, the actual complexity is related to the
 * amount of memory used by the ziplist.
 *
 * ----------------------------------------------------------------------------
 *
 * ZIPLIST OVERALL LAYOUT:
 * The general layout of the ziplist is as follows:
 * <zlbytes><zltail><zllen><entry><entry><zlend>
 *
 * <zlbytes> is an unsigned integer to hold the number of bytes that the
 * ziplist occupies. This value needs to be stored to be able to resize the
 * entire structure without the need to traverse it first.
 *
 * <zltail> is the offset to the last entry in the list. This allows a pop
 * operation on the far side of the list without the need for full traversal.
 *
 * <zllen> is the number of entries. When this value is larger than 2**16-2,
 * we need to traverse the entire list to know how many items it holds.
 *
 * <zlend> is a single byte special value, equal to 255, which indicates the
 * end of the list.
 *
 * ZIPLIST ENTRIES:
 * Every entry in the ziplist is prefixed by a header that contains two pieces
 * of information. First, the length of the previous entry is stored to be
 * able to traverse the list from back to front. Second, the encoding with an
 * optional string length of the entry itself is stored.
 *
 * The length of the previous entry is encoded in the following way:
 * If this length is smaller than 254 bytes, it will only consume a single
 * byte that takes the length as value. When the length is greater than or
 * equal to 254, it will consume 5 bytes. The first byte is set to 254 to
 * indicate a larger value is following. The remaining 4 bytes take the
 * length of the previous entry as value.
 *
 * The other header field of the entry itself depends on the contents of the
 * entry. When the entry is a string, the first 2 bits of this header will hold
 * the type of encoding used to store the length of the string, followed by the
 * actual length of the string. When the entry is an integer the first 2 bits
 * are both set to 1. The following 2 bits are used to specify what kind of
 * integer will be stored after this header. An overview of the different
 * types and encodings is as follows:
 *
 * |00pppppp| - 1 byte
 *      String value with length less than or equal to 63 bytes (6 bits).
 * |01pppppp|qqqqqqqq| - 2 bytes
 *      String value with length less than or equal to 16383 bytes (14 bits).
 * |10______|qqqqqqqq|rrrrrrrr|ssssssss|tttttttt| - 5 bytes
 *      String value with length greater than or equal to 16384 bytes.
 * |11000000| - 1 byte
 *      Integer encoded as int16_t (2 bytes).
 * |11010000| - 1 byte
 *      Integer encoded as int32_t (4 bytes).
 * |11100000| - 1 byte
 *      Integer encoded as int64_t (8 bytes).
 * |11110000| - 1 byte
 *      Integer encoded as 24 bit signed (3 bytes).
 * |11111110| - 1 byte
 *      Integer encoded as 8 bit signed (1 byte).
 * |1111xxxx| - (with xxxx between 0001 and 1101) immediate 4 bit integer.
 *      Unsigned integer from 0 to 12. The encoded value is actually from
 *      1 to 13 because 0000 and 1111 can not be used, so 1 should be
 *      subtracted from the encoded 4 bit value to obtain the right value.
 *      For example, the bits 0001 = 1 decode to the value 1 - 1 = 0.
 * |11111111| - End of ziplist.
 *
 * All the integers are represented in little endian byte order.
 *
 * ----------------------------------------------------------------------------
 *
 * Copyright (c) 2009-2012, Pieter Noordhuis <pcnoordhuis at gmail dot com>
 * Copyright (c) 2009-2012, Salvatore Sanfilippo <antirez at gmail dot com>
 * All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 *   * Redistributions of source code must retain the above copyright notice,
 *     this list of conditions and the following disclaimer.
 *   * Redistributions in binary form must reproduce the above copyright
 *     notice, this list of conditions and the following disclaimer in the
 *     documentation and/or other materials provided with the distribution.
 *   * Neither the name of Redis nor the names of its contributors may be used
 *     to endorse or promote products derived from this software without
 *     specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
 * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */
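The two header fields described in the comment can be expressed as a few small helpers. The following is a sketch that only restates the rules above in code; the `toy_*` names are hypothetical and are not the functions used in the Redis source.

```c
#include <stdint.h>

/* Bytes needed for the "previous entry length" field. */
static unsigned toy_prevlen_size(uint32_t prevlen) {
    return prevlen < 254 ? 1 : 5;
}

/* Write the prevlen field at p and return the number of bytes written.
 * Large values use the marker byte 254 followed by 4 little-endian bytes. */
static unsigned toy_prevlen_encode(unsigned char *p, uint32_t prevlen) {
    if (prevlen < 254) {
        p[0] = (unsigned char)prevlen;
        return 1;
    }
    p[0] = 254;
    p[1] = prevlen & 0xFF;
    p[2] = (prevlen >> 8) & 0xFF;
    p[3] = (prevlen >> 16) & 0xFF;
    p[4] = (prevlen >> 24) & 0xFF;
    return 5;
}

/* Read the prevlen field back. */
static uint32_t toy_prevlen_decode(const unsigned char *p) {
    if (p[0] < 254) return p[0];
    return (uint32_t)p[1] | ((uint32_t)p[2] << 8) |
           ((uint32_t)p[3] << 16) | ((uint32_t)p[4] << 24);
}

/* Header bytes needed to encode a string of the given length:
 * |00pppppp| (1), |01pppppp|qqqqqqqq| (2) or |10______| + 4 bytes (5). */
static unsigned toy_strlen_size(uint32_t len) {
    if (len <= 63) return 1;
    if (len <= 16383) return 2;
    return 5;
}

/* Payload size in bytes for an integer encoding byte.
 * Returns 0 for immediate 4-bit integers (the value lives in the
 * encoding byte itself) and -1 if enc is not an integer encoding. */
static int toy_int_payload_size(unsigned char enc) {
    switch (enc) {
    case 0xC0: return 2; /* int16_t */
    case 0xD0: return 4; /* int32_t */
    case 0xE0: return 8; /* int64_t */
    case 0xF0: return 3; /* 24-bit signed */
    case 0xFE: return 1; /* 8-bit signed */
    }
    if (enc >= 0xF1 && enc <= 0xFD) return 0; /* |1111xxxx| immediate */
    return -1; /* string encoding or the 0xFF end marker */
}

/* Value of an immediate integer: the low 4 bits minus 1 (range 0..12). */
static int toy_imm_value(unsigned char enc) {
    return (enc & 0x0F) - 1;
}
```

For example, a previous entry of 300 bytes is stored as the five bytes FE 2C 01 00 00, and the encoding byte 0xFB decodes to the immediate value 10.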

The ziplist entry structure:

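/*
 * Structure that holds the decoded information of a ziplist entry
 */
typedef struct zlentry {

    // prevrawlen: length of the previous entry
    // prevrawlensize: number of bytes needed to encode prevrawlen
    unsigned int prevrawlensize, prevrawlen;

    // len: length of the current entry's value
    // lensize: number of bytes needed to encode len
    unsigned int lensize, len;

    // Size of the current entry's header,
    // equal to prevrawlensize + lensize
    unsigned int headersize;

    // Encoding used by the current entry's value
    unsigned char encoding;

    // Pointer to the start of the current entry
    unsigned char *p;

} zlentry;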
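To show how these fields relate to the raw bytes, here is a hedged sketch of a parser that fills a similar structure. `toy_entry` and `toy_entry_parse` are hypothetical names, not the source's `zipEntry`. It assumes that multi-byte string lengths are stored high byte first and that the 4-byte prevlen value is little endian; that is how I understand the Redis encoding, but check it against the version you are reading.

```c
#include <stddef.h>

/* Hypothetical mirror of zlentry, with a const pointer for read-only use. */
typedef struct toy_entry {
    unsigned prevrawlensize, prevrawlen;
    unsigned lensize, len;
    unsigned headersize;
    unsigned char encoding;
    const unsigned char *p;
} toy_entry;

/* Decode the header of the entry starting at p. */
static toy_entry toy_entry_parse(const unsigned char *p) {
    toy_entry e;
    const unsigned char *q;

    e.p = p;
    /* Field 1: length of the previous entry (1 or 5 bytes). */
    if (p[0] < 254) {
        e.prevrawlensize = 1;
        e.prevrawlen = p[0];
    } else {
        e.prevrawlensize = 5;
        e.prevrawlen = (unsigned)p[1] | ((unsigned)p[2] << 8) |
                       ((unsigned)p[3] << 16) | ((unsigned)p[4] << 24);
    }

    /* Field 2: encoding, plus the value length for strings. */
    q = p + e.prevrawlensize;
    if (q[0] < 0xC0) {
        /* String: the top two bits select the length format. */
        e.encoding = (unsigned char)(q[0] & 0xC0);
        if (e.encoding == 0x00) {
            e.lensize = 1;
            e.len = q[0] & 0x3F;
        } else if (e.encoding == 0x40) {
            e.lensize = 2;
            e.len = ((unsigned)(q[0] & 0x3F) << 8) | q[1];
        } else {
            e.lensize = 5;
            e.len = ((unsigned)q[1] << 24) | ((unsigned)q[2] << 16) |
                    ((unsigned)q[3] << 8) | (unsigned)q[4];
        }
    } else {
        /* Integer: the whole byte is the encoding. */
        e.encoding = q[0];
        e.lensize = 1;
        switch (q[0]) {
        case 0xC0: e.len = 2; break;
        case 0xD0: e.len = 4; break;
        case 0xE0: e.len = 8; break;
        case 0xF0: e.len = 3; break;
        case 0xFE: e.len = 1; break;
        default:   e.len = 0; break; /* immediate 4-bit integer */
        }
    }
    e.headersize = e.prevrawlensize + e.lensize;
    return e;
}

/* Total bytes occupied by the entry: header plus value. */
static unsigned toy_entry_rawlen(const toy_entry *e) {
    return e->headersize + e->len;
}
```

Parsing the bytes `00 05 'a' 'b' 'c' 'd' 'e'` gives prevrawlen 0, len 5, headersize 2 and a raw length of 7, which is exactly the value the next entry must store in its own prevlen field.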

The insert operation:

/* Insert item at "p".
 *
 * Insert the string s of length slen into zl at the position given by p.
 *
 * Returns the ziplist after the insertion.
 *
 * T = O(N^2)
 */
static unsigned char *__ziplistInsert(unsigned char *zl, unsigned char *p, unsigned char *s, unsigned int slen) {
    // Record the current size of the ziplist in bytes
    size_t curlen = intrev32ifbe(ZIPLIST_BYTES(zl)), reqlen, prevlen = 0;
    size_t offset;
    int nextdiff = 0;
    unsigned char encoding = 0;
    long long value = 123456789; /* initialized to avoid warning. Using a value
                                    that is easy to see if for some reason
                                    we use it uninitialized. */
    zlentry entry, tail;

    /* Find out prevlen for the entry that is inserted. */
    if (p[0] != ZIP_END) {
        // p does not point at the end marker, so the list is not empty and
        // p points at one of its entries.
        // Decode that entry into `entry` and take its prevrawlen as prevlen:
        // the new entry is inserted in front of the entry at p, so it
        // inherits that entry's predecessor (and the entry at p becomes the
        // new entry's successor).
        // T = O(1)
        entry = zipEntry(p);
        prevlen = entry.prevrawlen;
    } else {
        // p points at the end marker, so check whether the list is empty:
        // 1) if ptail also points at ZIP_END, the list is empty;
        // 2) otherwise ptail points at the last entry of the list.
        unsigned char *ptail = ZIPLIST_ENTRY_TAIL(zl);
        if (ptail[0] != ZIP_END) {
            // The tail entry is the predecessor of the new entry,
            // so take the raw length of the tail entry
            // T = O(1)
            prevlen = zipRawEntryLength(ptail);
        }
    }

    /* See if the entry can be encoded */
    // Try to convert the input string to an integer. On success:
    // 1) value holds the converted integer
    // 2) encoding holds the encoding that fits value
    // Whichever encoding is used, reqlen holds the length of the entry's value
    // T = O(N)
    if (zipTryEncoding(s,slen,&value,&encoding)) {
        /* 'encoding' is set to the appropriate integer encoding */
        reqlen = zipIntSize(encoding);
    } else {
        /* 'encoding' is untouched, however zipEncodeLength will use the
         * string length to figure out how to encode it. */
        reqlen = slen;
    }

    /* We need space for both the length of the previous entry and
     * the length of the payload. */
    // Add the bytes needed to encode the previous entry's length
    // T = O(1)
    reqlen += zipPrevEncodeLength(NULL,prevlen);
    // Add the bytes needed to encode the current entry's encoding/length
    // T = O(1)
    reqlen += zipEncodeLength(NULL,encoding,slen);

    /* When the insert position is not equal to the tail, we need to
     * make sure that the next entry can hold this entry's length in
     * its prevlen field. */
    // nextdiff is the difference in bytes between the prevlen field the
    // entry at p needs for the new entry and the field it has now.
    // If it is greater than 0, the header of the entry at p must grow.
    // T = O(1)
    nextdiff = (p[0] != ZIP_END) ? zipPrevLenByteDiff(p,reqlen) : 0;

    /* Store offset because a realloc may change the address of zl. */
    // Remember the offset of p from zl before reallocating,
    // then restore p from that offset afterwards
    offset = p-zl;
    // curlen is the original size of the ziplist
    // reqlen is the total length of the new entry
    // nextdiff is the change in size of the successor's header
    // (-4, 0 or 4 bytes)
    // T = O(N)
    zl = ziplistResize(zl,curlen+reqlen+nextdiff);
    p = zl+offset;

    /* Apply memory move when necessary and update tail offset. */
    if (p[0] != ZIP_END) {
        // There are entries after the new one, so they have to be adjusted

        /* Subtract one because of the ZIP_END bytes */
        // Move the existing entries to make room for the new entry
        // T = O(N)
        memmove(p+reqlen,p-nextdiff,curlen-offset-1+nextdiff);

        /* Encode this entry's raw length in the next entry. */
        // p+reqlen is the position of the successor,
        // reqlen is the length of the new entry
        // T = O(1)
        zipPrevEncodeLength(p+reqlen,reqlen);

        /* Update offset for tail */
        // Add the length of the new entry to the tail offset
        ZIPLIST_TAIL_OFFSET(zl) =
            intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+reqlen);

        /* When the tail contains more than one entry, we need to take
         * "nextdiff" in account as well. Otherwise, a change in the
         * size of prevlen doesn't have an effect on the *tail* offset. */
        // T = O(1)
        tail = zipEntry(p+reqlen);
        if (p[reqlen+tail.headersize+tail.len] != ZIP_END) {
            ZIPLIST_TAIL_OFFSET(zl) =
                intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+nextdiff);
        }
    } else {
        /* This element will be the new tail. */
        ZIPLIST_TAIL_OFFSET(zl) = intrev32ifbe(p-zl);
    }

    /* When nextdiff != 0, the raw length of the next entry has changed, so
     * we need to cascade the update throughout the ziplist */
    if (nextdiff != 0) {
        offset = p-zl;
        // T = O(N^2)
        zl = __ziplistCascadeUpdate(zl,p+reqlen);
        p = zl+offset;
    }

    /* Write the entry */
    // Write the previous entry's length into the new entry's header
    p += zipPrevEncodeLength(p,prevlen);
    // Write the encoding/length of the value into the new entry's header
    p += zipEncodeLength(p,encoding,slen);
    // Write the value itself
    if (ZIP_IS_STR(encoding)) {
        // T = O(N)
        memcpy(p,s,slen);
    } else {
        // T = O(1)
        zipSaveInteger(p,value,encoding);
    }

    // Update the entry counter of the list
    // T = O(1)
    ZIPLIST_INCR_LENGTH(zl,1);

    return zl;
}
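The size arithmetic in the insert path (reqlen, nextdiff, and the cascade that follows) is easier to see in isolation. The sketch below models it with plain numbers. All `toy_*` names are hypothetical, and `toy_cascade` is a simplification that assumes every entry in the array currently has a 1-byte prevlen field; it is not `__ziplistCascadeUpdate`.

```c
#include <stddef.h>

/* Bytes needed to store a previous-entry length. */
static int toy_prevlen_size(size_t len) { return len < 254 ? 1 : 5; }

/* Bytes needed for the string encoding/length header. */
static int toy_strlen_size(size_t slen) {
    if (slen <= 63) return 1;
    if (slen <= 16383) return 2;
    return 5;
}

/* reqlen for a string entry: value + prevlen field + encoding field. */
static size_t toy_reqlen_str(size_t prevlen, size_t slen) {
    return slen + (size_t)toy_prevlen_size(prevlen) + (size_t)toy_strlen_size(slen);
}

/* nextdiff: how much the successor's prevlen field must change so that
 * it can store reqlen. `next` points at the successor's first byte.
 * The result is -4, 0 or 4. */
static int toy_nextdiff(const unsigned char *next, size_t reqlen) {
    int current = next[0] < 254 ? 1 : 5;
    return toy_prevlen_size(reqlen) - current;
}

/* Simplified cascade: rawlen[i] is the total length of the i-th entry
 * after the insertion point, each assumed to have a 1-byte prevlen
 * field. newlen is the length of the entry placed in front of
 * rawlen[0]. An entry grows by 4 bytes when its predecessor is 254
 * bytes or longer, which may in turn push it over the limit for the
 * entry after it. Returns the number of entries that grew. */
static size_t toy_cascade(size_t *rawlen, size_t n, size_t newlen) {
    size_t i, grown = 0, prev = newlen;
    for (i = 0; i < n; i++) {
        if (prev < 254) break; /* a 1-byte field is still enough: stop */
        rawlen[i] += 4;        /* prevlen field grows from 1 to 5 bytes */
        grown++;
        prev = rawlen[i];
    }
    return grown;
}
```

Inserting a 300-byte string after a 7-byte entry gives reqlen = 300 + 1 + 2 = 303, so a successor with a 1-byte prevlen field needs nextdiff = 4. If the following entries have raw lengths 253, 253, 100 and 253, the first three each grow by 4 bytes and the cascade stops at the fourth. Since each growth step may require a reallocation and a memory move in the real code, this chain reaction is where the O(N^2) worst case comes from.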

The delete operation:

/* Delete "num" entries, starting at "p". Returns pointer to the ziplist.
 *
 * T = O(N^2)
 */
static unsigned char *__ziplistDelete(unsigned char *zl, unsigned char *p, unsigned int num) {
    unsigned int i, totlen, deleted = 0;
    size_t offset;
    int nextdiff = 0;
    zlentry first, tail;

    // Walk over the entries to delete, counting how many there are
    // and advancing p past them
    // T = O(N)
    first = zipEntry(p);
    for (i = 0; p[0] != ZIP_END && i < num; i++) {
        p += zipRawEntryLength(p);
        deleted++;
    }

    // totlen is the total number of bytes occupied by the deleted entries
    totlen = p-first.p;
    if (totlen > 0) {
        if (p[0] != ZIP_END) {
            // There are still entries after the deleted range

            /* Storing `prevrawlen` in this entry may increase or decrease the
             * number of bytes required compare to the current `prevrawlen`.
             * There always is room to store this, because it was previously
             * stored by an entry that is now being deleted. */
            // The first entry after the deleted range gets a new
            // predecessor, so compute the size difference between the
            // prevlen field it needs and the one it has
            // T = O(1)
            nextdiff = zipPrevLenByteDiff(p,first.prevrawlen);
            // If needed, move p back by nextdiff bytes to make room
            // for the new header
            p -= nextdiff;
            // Encode the length of first's predecessor into p
            // T = O(1)
            zipPrevEncodeLength(p,first.prevrawlen);

            /* Update offset for tail */
            // T = O(1)
            ZIPLIST_TAIL_OFFSET(zl) =
                intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))-totlen);

            /* When the tail contains more than one entry, we need to take
             * "nextdiff" in account as well. Otherwise, a change in the
             * size of prevlen doesn't have an effect on the *tail* offset. */
            // T = O(1)
            tail = zipEntry(p);
            if (p[tail.headersize+tail.len] != ZIP_END) {
                ZIPLIST_TAIL_OFFSET(zl) =
                   intrev32ifbe(intrev32ifbe(ZIPLIST_TAIL_OFFSET(zl))+nextdiff);
            }

            /* Move tail to the front of the ziplist */
            // Move the remaining data towards the head,
            // overwriting the deleted entries
            // T = O(N)
            memmove(first.p,p,
                intrev32ifbe(ZIPLIST_BYTES(zl))-(p-zl)-1);
        } else {
            // There are no entries left after the deleted range

            /* The entire tail was deleted. No need to move memory. */
            // T = O(1)
            ZIPLIST_TAIL_OFFSET(zl) =
                intrev32ifbe((first.p-zl)-first.prevrawlen);
        }

        /* Resize and update length */
        offset = first.p-zl;
        zl = ziplistResize(zl, intrev32ifbe(ZIPLIST_BYTES(zl))-totlen+nextdiff);
        ZIPLIST_INCR_LENGTH(zl,-deleted);
        p = zl+offset;

        /* When nextdiff != 0, the raw length of the next entry has changed, so
         * we need to cascade the update throughout the ziplist */
        // Check that all entries after p still satisfy the encoding rules
        // T = O(N^2)
        if (nextdiff != 0)
            zl = __ziplistCascadeUpdate(zl,p);
    }
    return zl;
}
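The first loop of the delete path, which measures the span to remove, can be sketched on its own. The `toy_*` helpers below are hypothetical and use the same encoding assumptions as the header comment; they only compute totlen and the number of deleted entries and do not modify the buffer.

```c
#include <stddef.h>

#define TOY_END 0xFF /* end-of-list marker */

/* Raw length (header + value) of the entry starting at p. */
static size_t toy_rawlen(const unsigned char *p) {
    size_t prevsize = p[0] < 254 ? 1 : 5;
    const unsigned char *q = p + prevsize;
    unsigned char e = q[0];

    if ((e & 0xC0) == 0x00) return prevsize + 1 + (e & 0x3F);
    if ((e & 0xC0) == 0x40) return prevsize + 2 + (((size_t)(e & 0x3F) << 8) | q[1]);
    if ((e & 0xC0) == 0x80)
        return prevsize + 5 + (((size_t)q[1] << 24) | ((size_t)q[2] << 16) |
                               ((size_t)q[3] << 8) | (size_t)q[4]);
    switch (e) {
    case 0xC0: return prevsize + 1 + 2;
    case 0xD0: return prevsize + 1 + 4;
    case 0xE0: return prevsize + 1 + 8;
    case 0xF0: return prevsize + 1 + 3;
    case 0xFE: return prevsize + 1 + 1;
    default:   return prevsize + 1; /* immediate integer: no payload */
    }
}

/* Walk at most num entries starting at p, stopping early at the end
 * marker. Returns the number of bytes covered (totlen) and stores the
 * number of entries covered in *deleted. */
static size_t toy_delete_span(const unsigned char *p, unsigned num, unsigned *deleted) {
    const unsigned char *first = p;
    unsigned i;

    *deleted = 0;
    for (i = 0; p[0] != TOY_END && i < num; i++) {
        p += toy_rawlen(p);
        (*deleted)++;
    }
    return (size_t)(p - first);
}
```

With the three entries from the example in section 2 (`00 FB`, `02 05 "abcde"`, `07 03 "ABC"`, then `FF`), deleting 2 entries from the head covers 2 + 7 = 9 bytes. Asking to delete 10 entries starting at the "ABC" entry covers only 5 bytes and 1 entry, because the walk stops at the end marker. The new total size is then the old zlbytes minus totlen plus nextdiff, as in the `ziplistResize` call above.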

0 0