An experiment on how the CPU cache line affects performance

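1. The cache line concept

The CPU does not move data between the cache and memory one byte at a time. The smallest unit of transfer is a fixed-size block called a cache line. For details, see the Wikipedia article:
http://en.wikipedia.org/wiki/CPU_cache#Cache_entry_structure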

2. How to check the cache line size

An earlier post, 《cpu cache信息查看》 ("Viewing CPU cache information"), showed how to look up the cache line size:
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64

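3. How the cache line affects performance

Igor Ostrovsky has written an excellent article on how the CPU cache affects performance:
http://igoro.com/archive/gallery-of-processor-cache-effects/

This post tries to verify the points made in that article with the following example program: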
cacheline.c

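#include <stdio.h>
#include <stdlib.h>   /* atoi */

#define BUF_SIZE 8388608   /* 8 MiB */
#define LOOPS 16

/* Align the buffer to a 64-byte cache line boundary. */
char arr[BUF_SIZE] __attribute__((__aligned__((64)),__section__(".data.cacheline_aligned")));

int main(int argc, char **argv)
{
  int step;
  int i = 0;
  int j = 0;
  int iter = 0;

  if (argc < 2) {
    fprintf(stderr, "usage: %s step\n", argv[0]);
    return 1;
  }
  step = atoi(argv[1]);
  if (step <= 0) {   /* a non-positive step would never leave the inner loop */
    fprintf(stderr, "step must be a positive integer\n");
    return 1;
  }

  /* Write one byte every `step` bytes, LOOPS times over the whole buffer. */
  for (i = 0; i < LOOPS; i++){
    for (j = 0; j < BUF_SIZE; j += step){
      iter++;
      arr[j] = 3;
    }
  }

  printf("%d\n", iter);
  return 0;
}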

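Compile it: gcc -O0 -o cacheline cacheline.c

Now let's see how the cache line affects the program's performance. From the definition of a cache line, we can predict that for any step from 1 to 64 the number of cache lines loaded is the same, and that increasing step beyond 64 reduces the number of cache lines loaded.

Results: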
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 1
134217728

Performance counter stats for './cacheline 1':

2,352,446 L1-dcache-loads-misses # 0.35% of all L1-dcache hits
673,338,076 L1-dcache-load
1,041,209,909 cycles # 0.000 GHz

0.433421077 seconds time elapsed

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 2
67108864

Performance counter stats for './cacheline 2':

2,326,564 L1-dcache-loads-misses # 0.69% of all L1-dcache hits
337,577,957 L1-dcache-load
524,684,462 cycles # 0.000 GHz

0.254773008 seconds time elapsed

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 4
33554432

Performance counter stats for './cacheline 4':

2,309,318 L1-dcache-loads-misses # 1.36% of all L1-dcache hits
169,703,215 L1-dcache-load
255,623,966 cycles # 0.000 GHz

0.154640897 seconds time elapsed

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 64
2097152

Performance counter stats for './cacheline 64':

2,292,510 L1-dcache-loads-misses # 18.64% of all L1-dcache hits
12,299,250 L1-dcache-load
55,040,163 cycles # 0.000 GHz

0.034769960 seconds time elapsed

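The results show that:
i) as step goes from 1 to 64, the L1 cache miss counts stay very close (about 2.3 million each)
ii) execution time does not depend on cache misses alone but on many other factors as well (for example CPU cycles, which fall from about 1.04 billion to about 55 million here while the miss count barely changes)

Increasing step further: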
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 128
1048576

Performance counter stats for './cacheline 128':

1,308,532 L1-dcache-loads-misses # 18.56% of all L1-dcache hits
7,048,673 L1-dcache-load
38,773,055 cycles # 0.000 GHz

0.024586981 seconds time elapsed

perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 1024
131072

Performance counter stats for './cacheline 1024':

442,176 L1-dcache-loads-misses # 18.21% of all L1-dcache hits
2,427,631 L1-dcache-load
17,618,913 cycles # 0.000 GHz

0.011433279 seconds time elapsed

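L1 cache misses drop markedly: from about 2.29 million at step 64 to about 1.31 million at step 128 and about 0.44 million at step 1024.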

0 0