The impact of false sharing in multithreaded code, and some optimizations

Source: Internet. Editor: 程序博客网. Time: 2024/06/04 23:35

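Recently, while benchmarking a lock-free queue in a production project, I found that with one thread pushing and another thread popping integers, throughput was about the same as std::queue + spinlock, or even worse. That shows none of the advantage of going lock-free, so I decided to look for the cause.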

The queue is implemented with two pointers: data is pushed through the tail pointer and popped through the head pointer.

template<typename T>
class LockFreeQueue {
    struct Node {
        T value;
        Node *next;
    };
    ...
    bool Push(const T & data) {
        // touches only the tail member
        ...
    }
    bool Pop(T &data) {
        // touches only the head member
        ...
    }
    ...
private:
    Node * head;
    Node * tail;
};

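In use, one thread only calls the queue's push method and another thread only calls its pop method.
push accesses only the tail member to enqueue; pop accesses only the head member to dequeue.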

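Notice that the head and tail pointers are very likely to sit on the same L1 cache line, which causes false sharing. When the push thread writes tail, the cache coherence protocol invalidates the copy of that line held by the CPU running the pop thread, so that the caches of the processors stay coherent. When the pop thread writes head, the same thing happens in the other direction. Under heavy load this access pattern hurts performance badly, as the following example shows.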
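To make the "same cache line" condition concrete, here is a minimal sketch of an address check. The helper same_cache_line, the constant kAssumedLineSize and the struct TwoPointers are hypothetical names introduced for illustration, and the 64-byte line size is an assumption about the target machine, not something the queue code defines.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is common on x86-64 but not universal.
constexpr std::size_t kAssumedLineSize = 64;

// Returns true when both addresses fall into the same line-sized,
// line-aligned block of memory.
inline bool same_cache_line(const void* a, const void* b,
                            std::size_t line = kAssumedLineSize) {
    auto ia = reinterpret_cast<std::uintptr_t>(a);
    auto ib = reinterpret_cast<std::uintptr_t>(b);
    return ia / line == ib / line;
}

// Hypothetical stand-in for the queue's two adjacent data members.
struct TwoPointers {
    void* head;
    void* tail;
};
```

Calling same_cache_line(&q.head, &q.tail) on a TwoPointers object that starts on a line boundary returns true, which is exactly the layout that lets the push and pop threads disturb each other.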

Example program P1

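/* Build without optimization (e.g. gcc -O0 -pthread); an optimizing
   compiler may keep A and B in registers or collapse the loops. */
#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 1000000000  /* 1e9, written as an integer constant */

/* Two globals that will very likely end up on the same cache line. */
int A;
int B;

/* Worker thread: toggles A and never touches B. */
static void *thread_func(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        if (A == 1) {
            A = 0;
        } else {
            A = 1;
        }
    }
    return NULL;
}

int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, thread_func, NULL);
    /* Main thread: toggles B and never touches A. */
    for (int i = 0; i < ITERATIONS; i++) {
        if (B == 1) {
            B = 0;
        } else {
            B = 1;
        }
    }
    pthread_join(tid, NULL);  /* wait for the worker before exiting */
    return 0;
}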

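This simple program shows the performance cost of false sharing: the two threads access different addresses, but those addresses are on the same cache line. The perf tool on Linux is used here to collect the program's runtime counters.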

>> perf stat -e instructions -e cache-references -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-stores -e L1-dcache-store-misses -e LLC-loads -e LLC-load-misses -e LLC-prefetches -e cycles -e cs ./cacheline_unaligned

 Performance counter stats for './cacheline_unaligned':

    19,006,842,755 instructions              #    0.97  insns per cycle         [36.43%]   <-- note this line
        78,160,551 cache-references                                             [45.54%]
            24,661 cache-misses              #    0.032 % of all cache refs     [45.54%]
     7,990,941,413 L1-dcache-loads                                              [45.54%]
        78,220,759 L1-dcache-load-misses     #    0.98% of all L1-dcache hits   [45.54%]   <-- note this line
     4,009,177,234 L1-dcache-stores                                             [36.31%]
        69,339,246 L1-dcache-store-misses                                       [36.54%]   <-- note this line
         8,499,780 LLC-loads                                                    [36.49%]
             5,684 LLC-load-misses           #    0.07% of all LL-cache hits    [36.45%]
        76,310,160 LLC-prefetches                                               [18.20%]
    19,623,169,444 cycles                                                       [27.27%]
               612 cs

       3.438158407 seconds time elapsed

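A small change to the code gives example program P2: padding is added between the variables so that A and B live on separate cache lines. The cache line on this machine is 64 bytes; on Linux the size can be read with getconf LEVEL1_DCACHE_LINESIZE.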

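/* Build without optimization (e.g. gcc -O0 -pthread); an optimizing
   compiler may keep A and B in registers or collapse the loops. */
#include <pthread.h>
#include <stdint.h>  /* int32_t */
#include <stdio.h>

#define ITERATIONS 1000000000  /* 1e9, written as an integer constant */

int A;
int32_t __padding__[16];  /* 16 * 4 = 64 bytes of padding, so that A and B sit on separate cache lines */
int B;

/* Worker thread: toggles A and never touches B. */
static void *thread_func(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++) {
        if (A == 1) {
            A = 0;
        } else {
            A = 1;
        }
    }
    return NULL;
}

int main() {
    pthread_t tid;
    pthread_create(&tid, NULL, thread_func, NULL);
    /* Main thread: toggles B and never touches A. */
    for (int i = 0; i < ITERATIONS; i++) {
        if (B == 1) {
            B = 0;
        } else {
            B = 1;
        }
    }
    pthread_join(tid, NULL);  /* wait for the worker before exiting */
    return 0;
}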
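The 64-byte figure is specific to the test machine. As a hedged sketch, the same value that getconf LEVEL1_DCACHE_LINESIZE reports can also be queried from inside a program on glibc-based Linux through sysconf. The function name l1_line_size and the 64-byte fallback are assumptions made for this illustration; _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension and may be missing or report 0 on other systems.

```cpp
#include <unistd.h>
#include <cstddef>

// Returns the L1 data cache line size in bytes.
// Falls back to an assumed 64 bytes when the system does not report one.
inline std::size_t l1_line_size() {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  // glibc extension
    if (n > 0) {
        return static_cast<std::size_t>(n);
    }
#endif
    return 64;  // assumed fallback, not a measured value
}
```

A runtime value like this cannot size a static padding array, so it is mostly useful for checking that a compile-time constant such as 64 matches the machine the program runs on.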

Now the results after the change.

>> perf stat -e instructions -e cache-references -e cache-misses -e L1-dcache-loads -e L1-dcache-load-misses -e L1-dcache-stores -e L1-dcache-store-misses -e LLC-loads -e LLC-load-misses -e LLC-prefetches -e cycles -e cs ./cacheline_aligned

 Performance counter stats for './cacheline_aligned':

    18,128,002,993 instructions              #    1.58  insns per cycle         [36.49%]   <-- note this line
           180,503 cache-references                                             [45.61%]
            17,691 cache-misses              #    9.801 % of all cache refs     [45.61%]
     7,640,076,865 L1-dcache-loads                                              [45.61%]
           208,122 L1-dcache-load-misses     #    0.00% of all L1-dcache hits   [45.61%]   <-- note this line
     3,841,727,742 L1-dcache-stores                                             [36.27%]
            16,237 L1-dcache-store-misses                                       [36.65%]   <-- note this line
            53,850 LLC-loads                                                    [36.58%]
             3,879 LLC-load-misses           #    7.20% of all LL-cache hits    [36.51%]
             4,254 LLC-prefetches                                               [18.22%]
    11,479,745,606 cycles                                                       [27.27%]
               379 cs

       2.030437774 seconds time elapsed

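After the change, L1-dcache-load-misses dropped to about 0.27% of the unpadded version's count and L1-dcache-store-misses to about 0.023%, while IPC rose from 0.97 to 1.58 and elapsed time fell from 3.44 to 2.03 seconds. The benefit is obvious: an L1 miss stalls the CPU pipeline while the data is fetched from the next cache level, which lowers IPC (instructions per cycle), an important metric.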
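The ratios can be re-derived directly from the two perf reports. The sketch below only repeats that arithmetic; the struct PerfRun and the names kUnpadded, kPadded, ipc and ratio are hypothetical, and the numbers are copied from the perf output for cacheline_unaligned and cacheline_aligned.

```cpp
// Counter values copied from the two perf stat reports.
struct PerfRun {
    double instructions;
    double cycles;
    double l1_load_misses;
    double l1_store_misses;
};

constexpr PerfRun kUnpadded{19006842755.0, 19623169444.0, 78220759.0, 69339246.0};
constexpr PerfRun kPadded{18128002993.0, 11479745606.0, 208122.0, 16237.0};

// Instructions per cycle for one run.
inline double ipc(const PerfRun& r) { return r.instructions / r.cycles; }

// Generic ratio, used to compare a padded counter with its unpadded counterpart.
inline double ratio(double padded, double unpadded) { return padded / unpadded; }
```

With these values, ipc(kUnpadded) is roughly 0.97 and ipc(kPadded) roughly 1.58, matching the "insns per cycle" column that perf prints. The load-miss ratio comes out near 0.0027 and the store-miss ratio near 0.00023.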

The lock-free queue above has exactly this memory access pattern, so the same optimization applies naturally.

template<typename T>
class LockFreeQueue {
    struct Node {
        T value;
        Node *next;
    };
    ...
    bool Push(const T & data) {
        // touches only the tail member
        ...
    }
    bool Pop(T &data) {
        // touches only the head member
        ...
    }
    ...
private:
    Node * head;
    // pushes tail a full cache line away from head
    char __padding__[CACHELINE_SIZE - sizeof(Node *)];
    Node * tail;
};
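As an alternative to a hand-sized char array, C++11 alignas can ask the compiler to place each pointer on its own line. This is a sketch of that variant, not the layout used in the queue above; QueueEnds and kCacheLine are hypothetical names, and 64 bytes is again an assumed line size.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size

struct Node;  // hypothetical forward declaration; only pointers are stored here

// Each member starts on its own cache-line boundary, so the compiler
// inserts the padding that would otherwise be written by hand.
struct QueueEnds {
    alignas(kCacheLine) Node* head;
    alignas(kCacheLine) Node* tail;
};

static_assert(alignof(QueueEnds) == kCacheLine,
              "the whole struct is aligned to a cache line");
static_assert(offsetof(QueueEnds, tail) - offsetof(QueueEnds, head) >= kCacheLine,
              "head and tail are at least one cache line apart");
```

The trade-off is size: the struct grows to two full lines (128 bytes under the 64-byte assumption). Note also that heap-allocating over-aligned types with plain new is only guaranteed to respect the alignment from C++17 onward.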

In testing, the padded queue was roughly 5~60% faster than the previous implementation, which I am quite happy with!

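This simple example shows that false sharing should be avoided as far as possible in multithreaded code. My own rule of thumb: use padding so that the variables owned exclusively by one thread get a cache line of their own, shared with no variable that another thread writes.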

0 0