DPDK ring performance test

Source: Internet | Published by: arm单片机介绍 | Editor: 程序博客网 | Time: 2024/06/04 20:15

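When people ask why DPDK is fast, the usual answers are DMA, zero copy, hugepages, PMD polling, and lock-free data structures. So how fast is the lock-free structure, exactly?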

How DPDK implements the lock-free structure

In DPDK, the basic idea behind the lock-free structure looks like this:

    # operate on a burst of n elements
    do {
        copy r->head_ptr -> local_head
        local_head + n -> local_next
        success = atomic32_cas(&r->head_ptr, local_head, local_next)
    } while (success == 0)
    // operate on the n-element burst

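This is the core of the lock-free implementation in the DPDK ring. It does three things:
1. Copy the ring's head index into a local variable.
2. Advance the local copy by n to work out where the index should point once this caller has claimed its slots.
3. Perform an atomic CAS (compare and set): compare the local copy with the ring's index; if they are equal, set the ring's index to local_next. Otherwise go back to step 1.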

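As you can see, this still behaves like a lock, one that sits on the CAS operation. If the atomic operation fails, another thread modified the ring during steps 1 to 3, and the caller has to go back and retry. It is a lock of the smallest possible granularity, which keeps waiting time to a minimum under heavy concurrency.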

DPDK lock-free test results

DPDK ships a test program for this.
The program is $RTE_SDK/$RTE_TARGET/app/test.
Run test, then enter ring_perf_autotest at the prompt.
Result:

    RTE>>ring_perf_autotest
    ### Testing single element and burst enq/deq ###
    SP/SC single enq/dequeue: 11
    MP/MC single enq/dequeue: 47
    SP/SC burst enq/dequeue (size: 8): 4
    MP/MC burst enq/dequeue (size: 8): 9
    SP/SC burst enq/dequeue (size: 32): 3
    MP/MC burst enq/dequeue (size: 32): 4
    ### Testing empty dequeue ###
    SC empty dequeue: 2.41
    MC empty dequeue: 3.30
    ### Testing using a single lcore ###
    SP/SC bulk enq/dequeue (size: 8): 4.64
    MP/MC bulk enq/dequeue (size: 8): 9.54
    SP/SC bulk enq/dequeue (size: 32): 3.45
    MP/MC bulk enq/dequeue (size: 32): 4.62
    ### Testing using two hyperthreads ###
    SP/SC bulk enq/dequeue (size: 8): 15.40
    MP/MC bulk enq/dequeue (size: 8): 24.20
    SP/SC bulk enq/dequeue (size: 32): 6.89
    MP/MC bulk enq/dequeue (size: 32): 7.90
    ### Testing using two physical cores ###
    SP/SC bulk enq/dequeue (size: 8): 30.26
    MP/MC bulk enq/dequeue (size: 8): 64.44
    SP/SC bulk enq/dequeue (size: 32): 12.70
    MP/MC bulk enq/dequeue (size: 32): 20.16
    ### Testing using two NUMA nodes ###
    SP/SC bulk enq/dequeue (size: 8): 55.61
    MP/MC bulk enq/dequeue (size: 8): 183.63
    SP/SC bulk enq/dequeue (size: 32): 21.95
    MP/MC bulk enq/dequeue (size: 32): 51.76
    Test OK

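The source for this test is in $RTE_SDK/app/test/test_ring_perf.c.
The test uses rdtsc to measure how long the operations take.
It iterates 1<<24 times, measures the total elapsed cycles t, and shifts that total right (t >> 24) to get the average cost of one iteration. When each iteration handles a burst, the result is further divided by burst_size. The reported number is therefore the CPU cycles spent enqueuing and then dequeuing a single element.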
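The test machine is a server with a 2 GHz CPU; one second corresponds to about 2000230640 rdtsc ticks, and the NUMA memory value is set to 2048.
The program used to measure rdtsc is: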

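    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Read the x86 time-stamp counter. RDTSC returns the low 32 bits in EAX
     * and the high 32 bits in EDX; the "=A" constraint only pairs them
     * correctly on 32-bit x86, so the two halves are read separately. */
    static __inline__ unsigned long long RDTSC(void)
    {
        unsigned int lo, hi;
        __asm__ volatile ("rdtsc" : "=a" (lo), "=d" (hi));
        return ((unsigned long long)hi << 32) | lo;
    }

    int main()
    {
        unsigned long long int st, end;

        st = RDTSC();
        sleep(1);
        end = RDTSC();
        /* Ticks elapsed during roughly one second. */
        printf("%llu\n", end - st);
        return 0;
    }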

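Analysis:
1. Single-element and burst enqueue/dequeue: SP/SC is faster than MP/MC, and burst operations have a lower average per-element cost than single-element operations. With multiple producers and consumers at burst size 32, the best per-element rate is about 500 million/s.
2. Empty-queue dequeue: SC beats MC, with a cost ratio of about 2:3.
3. Single lcore: SP/SC is faster than MP/MC, but the gap narrows as the burst grows. The cost of the atomic operation in the MP/MC path is fixed per call, so the more elements one call moves, the smaller the share each element carries. With multiple producers and consumers at burst size 32, the best per-element rate is about 430 million/s.
4. Two hyperthreads (two hardware threads sharing one physical core): the two threads now genuinely run in parallel and contend for the ring, so the cost is roughly 2 to 3 times that of the single-lcore test. SP/SC still beats MP/MC, and again the gap narrows as the burst grows, down to about 1 CPU cycle. With multiple producers and consumers at burst size 32, the best per-element rate is about 250 million/s.
5. Two physical cores: the cost is higher than on a single physical core, because memory accesses now have to cross between cores. Throughput is roughly half that of case 4. With multiple producers and consumers at burst size 32, the best per-element rate is about 100 million/s.
6. Two NUMA nodes: the cost is roughly 2 to 3 times that of the two-physical-core case. The memory used by the two CPUs may no longer sit on the same memory bank, so accesses have to cross the interconnect, and performance drops even further. With multiple producers and consumers at burst size 32, the best per-element rate is about 40 million/s.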

Comparison with a mutex

What happens if a mutex is used instead?
To find out, replace the do-while loop in the lock-free code with a mutex and measure how much the mutex costs.

The changes are as follows: github link
Add mutexes to the ring structure.

    pthread_mutex_t mut_p;//add by nachtz, for test mutex
    pthread_mutex_t mut_c;//add by nachtz, for test mutex

Initialize the locks where the ring is initialized, and in the MC and MP dequeue/enqueue functions use the lock in place of the do-while loop.

    static inline int __attribute__((always_inline))
    __rte_ring_mc_do_dequeue(struct rte_ring *r, void **obj_table,
             unsigned n, enum rte_ring_queue_behavior behavior)
    {
        uint32_t cons_head, prod_tail;
        uint32_t cons_next, entries;
        const unsigned max = n;
        int success;
        unsigned i, rep = 0;
        uint32_t mask = r->prod.mask;

        /* Avoid the unnecessary cmpset operation below, which is also
         * potentially harmful when n equals 0. */
        if (n == 0)
            return 0;

        /* move cons.head atomically */
        //do { by nachtz
        pthread_mutex_lock(&r->mut_c);//add by nacht, for test mutex
            /* Restore n as it may change every loop */
            n = max;

            cons_head = r->cons.head;
            prod_tail = r->prod.tail;
            /* The subtraction is done between two unsigned 32bits value
             * (the result is always modulo 32 bits even if we have
             * cons_head > prod_tail). So 'entries' is always between 0
             * and size(ring)-1. */
            entries = (prod_tail - cons_head);

            /* Set the actual entries for dequeue */
            if (n > entries) {
                if (behavior == RTE_RING_QUEUE_FIXED) {
                    __RING_STAT_ADD(r, deq_fail, n);
                    pthread_mutex_unlock(&r->mut_c);//add by nacht, for test mutex
                    return -ENOENT;
                }
                else {
                    if (unlikely(entries == 0)){
                        __RING_STAT_ADD(r, deq_fail, n);
                        pthread_mutex_unlock(&r->mut_c);//add by nacht, for test mutex
                        return 0;
                    }
                    n = entries;
                }
            }

            cons_next = cons_head + n;
            success = rte_atomic32_cmpset(&r->cons.head, cons_head,
                              cons_next);
        //} while (unlikely(success == 0));//by nachtz
        pthread_mutex_unlock(&r->mut_c);//add by nacht, for test mutex

        /* copy in table */
        DEQUEUE_PTRS();
        rte_smp_rmb();

        /*
         * If there are other dequeues in progress that preceded us,
         * we need to wait for them to complete
         */
        while (unlikely(r->cons.tail != cons_head)) {
            rte_pause();

            /* Set RTE_RING_PAUSE_REP_COUNT to avoid spin too long waiting
             * for other thread finish. It gives pre-empted thread a chance
             * to proceed and finish with ring dequeue operation. */
            if (RTE_RING_PAUSE_REP_COUNT &&
                ++rep == RTE_RING_PAUSE_REP_COUNT) {
                rep = 0;
                sched_yield();
            }
        }
        __RING_STAT_ADD(r, deq_success, n);
        r->cons.tail = cons_next;

        return behavior == RTE_RING_QUEUE_FIXED ? 0 : n;
    }

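The above is the change for MC; the MP change is similar. The complete change is on GitHub; replace the two files of the same name in dpdk-16.04 with the modified ones.
In this change the lock covers steps 1 to 3 of the lock-free algorithm described earlier, so only one thread at a time can be executing those steps. This is the finest granularity at which a mutex can be applied here. Note that the rte_atomic32_cmpset call is still executed inside the critical section, so the measured cost includes both the mutex and one CAS.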

Performance of the modified version

Results for the modified version:

    RTE>>ring_perf_autotest
    ### Testing single element and burst enq/deq ###
    SP/SC single enq/dequeue: 11
    MP/MC single enq/dequeue: 173
    SP/SC burst enq/dequeue (size: 8): 4
    MP/MC burst enq/dequeue (size: 8): 25
    SP/SC burst enq/dequeue (size: 32): 3
    MP/MC burst enq/dequeue (size: 32): 8
    ### Testing empty dequeue ###
    SC empty dequeue: 2.42
    MC empty dequeue: 69.72
    ### Testing using a single lcore ###
    SP/SC bulk enq/dequeue (size: 8): 4.64
    MP/MC bulk enq/dequeue (size: 8): 25.60
    SP/SC bulk enq/dequeue (size: 32): 3.41
    MP/MC bulk enq/dequeue (size: 32): 8.59
    ### Testing using two hyperthreads ###
    SP/SC bulk enq/dequeue (size: 8): 13.73
    MP/MC bulk enq/dequeue (size: 8): 53.37
    SP/SC bulk enq/dequeue (size: 32): 6.91
    MP/MC bulk enq/dequeue (size: 32): 16.79
    ### Testing using two physical cores ###
    SP/SC bulk enq/dequeue (size: 8): 30.18
    MP/MC bulk enq/dequeue (size: 8): 141.87
    SP/SC bulk enq/dequeue (size: 32): 12.66
    MP/MC bulk enq/dequeue (size: 32): 37.86
    ### Testing using two NUMA nodes ###
    SP/SC bulk enq/dequeue (size: 8): 60.75
    MP/MC bulk enq/dequeue (size: 8): 422.54
    SP/SC bulk enq/dequeue (size: 32): 21.97
    MP/MC bulk enq/dequeue (size: 32): 113.99
    Test OK

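As the results show, with the mutex the MP/MC cost rises in every test, roughly doubling or worse, while the unmodified SP/SC paths stay about the same. The gap in absolute cycles is largest in the multi-threaded tests, where contention for the ring is frequent. The lock-free structure therefore brings a substantial performance gain.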

Conclusion

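The DPDK lock-free ring is about twice as fast as the mutex version, or better. Looking closer, DPDK also applies loop unrolling in its enqueue/dequeue macros. In practice, larger bursts give better performance: in the results above, MP/MC at burst size 32 costs less than half as much per element as at burst size 8. A burst of 64 is a common choice; in any case prefer a multiple of 4, because DPDK's unrolled loops handle elements in groups of 4, so a multiple of 4 keeps the number of loop checks down.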
