Experiment: how the CPU cache line affects performance
Source: Internet  Editor: 程序博客网  Time: 2024/06/15 17:06
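1. The cache line concept
The smallest unit of data exchanged between the CPU cache and main memory is not a byte but a fixed-size block called a cache line. For details, see the Wikipedia article: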
http://en.wikipedia.org/wiki/CPU_cache#Cache_entry_structure
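To make the granularity concrete, here is a minimal sketch in C. It assumes the caller passes the line size (64 bytes on the machine used in this article; other CPUs may differ), and the helper names line_index, line_offset and same_line are made up for illustration. Two addresses with the same line index sit in the same cache line, so accessing the second one needs no additional transfer from memory while the line stays cached.

```c
#include <assert.h>
#include <stdint.h>

/* Index of the cache line that contains addr (hypothetical helper). */
static inline uintptr_t line_index(uintptr_t addr, unsigned line_size)
{
    assert(line_size > 0);
    return addr / line_size;
}

/* Byte offset of addr inside its cache line (hypothetical helper). */
static inline unsigned line_offset(uintptr_t addr, unsigned line_size)
{
    assert(line_size > 0);
    return (unsigned)(addr % line_size);
}

/* Nonzero when both addresses fall in the same cache line. */
static inline int same_line(uintptr_t a, uintptr_t b, unsigned line_size)
{
    return line_index(a, line_size) == line_index(b, line_size);
}
```

For example, with a 64-byte line, addresses 0 and 63 share line 0, while address 64 starts line 1.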
2. How to check the cache line size
An earlier post, "cpu cache信息查看" (Viewing CPU cache information), showed how to check the cache line size:
cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
64
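The same value can be read from inside a program. The following is a minimal sketch, assuming a Linux system that exposes the sysfs file shown above; the function name read_line_size is hypothetical, and the function simply returns -1 when the file is missing or unreadable.

```c
#include <assert.h>
#include <stdio.h>

/* Read a cache line size in bytes from a sysfs-style text file.
 * Returns -1 if the file cannot be opened or parsed. */
static inline long read_line_size(const char *path)
{
    long size = -1;
    FILE *f;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &size) != 1)
        size = -1;      /* file did not start with a number */
    fclose(f);
    return size;
}
```

Calling read_line_size("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size") on the machine used in this article should give 64, matching the cat output.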
3. How the cache line affects performance
Igor Ostrovsky has written an excellent article on how the CPU cache affects performance:
http://igoro.com/archive/gallery-of-processor-cache-effects/
This post tries to verify the points made in that article, using the following sample program:
cacheline.c
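#include <stdio.h>
#include <stdlib.h>   /* atoi */

#define BUF_SIZE 8388608   /* 8 MiB buffer */
#define LOOPS 16           /* number of passes over the buffer */

/* Align the buffer to a 64-byte cache line boundary. */
char arr[BUF_SIZE] __attribute__((__aligned__((64)),__section__(".data.cacheline_aligned")));

int main(int argc, char **argv)
{
    int step;
    int i = 0;
    int j = 0;
    int iter = 0;

    /* A missing or non-positive step would crash or loop forever. */
    if (argc < 2 || (step = atoi(argv[1])) <= 0) {
        fprintf(stderr, "usage: %s <step>\n", argv[0]);
        return 1;
    }

    for (i = 0; i < LOOPS; i++) {
        /* Write one byte every `step` bytes. */
        for (j = 0; j < BUF_SIZE; j += step) {
            iter++;
            arr[j] = 3;
        }
    }
    printf("%d\n", iter);   /* total number of writes */
    return 0;
}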
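Compile it: gcc -O0 -o cacheline cacheline.c
Now let's look at how the cache line affects the program's performance. By the definition of a cache line, we can predict that for any step from 1 to 64 the number of cache lines loaded is the same, and that increasing step beyond 64 reduces the number of cache lines loaded.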
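The prediction can be written down as simple arithmetic. The sketch below is only a model, under two assumptions that are not stated in the original: the line size is 64 bytes, and the 8 MiB buffer is too large to stay in L1, so each pass loads every touched line again. The function names lines_per_pass and lines_total are hypothetical.

```c
#include <assert.h>

#define BUF_SIZE  8388608L  /* same buffer size as cacheline.c */
#define LOOPS     16L       /* same number of passes as cacheline.c */
#define LINE_SIZE 64L       /* assumed cache line size in bytes */

/* Distinct cache lines written in one pass over the buffer. */
static inline long lines_per_pass(long step)
{
    assert(step > 0);
    if (step <= LINE_SIZE)
        return BUF_SIZE / LINE_SIZE;      /* every line is hit at least once */
    return (BUF_SIZE + step - 1) / step;  /* each write lands in its own line */
}

/* Cache line loads summed over all passes, assuming nothing survives
 * in L1 from one pass to the next. */
static inline long lines_total(long step)
{
    return LOOPS * lines_per_pass(step);
}
```

Under these assumptions the model gives 2,097,152 line loads for every step from 1 to 64, 1,048,576 for step 128, and 131,072 for step 1024. Real counters also include misses from the rest of the process, so measured values can be expected to sit somewhat above these numbers.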
Here are the results:
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 1
134217728
Performance counter stats for './cacheline 1':
2,352,446 L1-dcache-loads-misses # 0.35% of all L1-dcache hits
673,338,076 L1-dcache-load
1,041,209,909 cycles # 0.000 GHz
0.433421077 seconds time elapsed
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 2
67108864
Performance counter stats for './cacheline 2':
2,326,564 L1-dcache-loads-misses # 0.69% of all L1-dcache hits
337,577,957 L1-dcache-load
524,684,462 cycles # 0.000 GHz
0.254773008 seconds time elapsed
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 4
33554432
Performance counter stats for './cacheline 4':
2,309,318 L1-dcache-loads-misses # 1.36% of all L1-dcache hits
169,703,215 L1-dcache-load
255,623,966 cycles # 0.000 GHz
0.154640897 seconds time elapsed
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 64
2097152
Performance counter stats for './cacheline 64':
2,292,510 L1-dcache-loads-misses # 18.64% of all L1-dcache hits
12,299,250 L1-dcache-load
55,040,163 cycles # 0.000 GHz
0.034769960 seconds time elapsed
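From these runs we can see:
i) as step goes from 1 to 64, the L1 cache miss counts are very close (about 2.3 million in each run);
ii) execution time depends on more than cache misses; other factors matter too, such as CPU cycles, which fall from about 1.04 billion at step 1 to 55 million at step 64 as the loop runs fewer iterations.
Now increase step further: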
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 128
1048576
Performance counter stats for './cacheline 128':
1,308,532 L1-dcache-loads-misses # 18.56% of all L1-dcache hits
7,048,673 L1-dcache-load
38,773,055 cycles # 0.000 GHz
0.024586981 seconds time elapsed
perf stat -e L1-dcache-loads-misses -e L1-dcache-load -e cycles ./cacheline 1024
131072
Performance counter stats for './cacheline 1024':
442,176 L1-dcache-loads-misses # 18.21% of all L1-dcache hits
2,427,631 L1-dcache-load
17,618,913 cycles # 0.000 GHz
0.011433279 seconds time elapsed
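The L1 cache miss count drops markedly: 1,308,532 at step 128 and 442,176 at step 1024, compared with about 2.3 million for steps 1 through 64.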
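Where perf is not available, the time side of the experiment can be roughly cross-checked from inside the program. This is a sketch rather than the method used above: it measures CPU time with the standard clock() function instead of reading hardware counters, so it says nothing about miss counts directly. The names buf and timed_run are hypothetical.

```c
#include <assert.h>
#include <time.h>

#define BUF_SIZE 8388608   /* same buffer size as cacheline.c */
#define LOOPS    16        /* same number of passes as cacheline.c */

static char buf[BUF_SIZE] __attribute__((__aligned__(64)));

/* Run the strided-write loop for one step value.
 * Returns the number of writes and stores the CPU time used,
 * in seconds, in *seconds. */
static inline int timed_run(int step, double *seconds)
{
    volatile char *p = buf;   /* volatile keeps the stores at any -O level */
    int iter = 0;
    clock_t start, end;

    assert(step > 0 && seconds != NULL);
    start = clock();
    for (int i = 0; i < LOOPS; i++) {
        for (int j = 0; j < BUF_SIZE; j += step) {
            iter++;
            p[j] = 3;
        }
    }
    end = clock();
    *seconds = (double)(end - start) / CLOCKS_PER_SEC;
    return iter;
}
```

Calling timed_run for steps 1, 2, 4, 64, 128 and 1024 and printing the returned count next to the elapsed time gives a table comparable to the "seconds time elapsed" lines above, although the absolute numbers will differ from machine to machine.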