Cache Optimization (Using Matrix Multiplication as an Example)
Source: Internet. Editor: 程序博客网. Time: 2024/06/06 09:10
2 * (blockSize)^2 * wordSize = L1 cache size.
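The relation above gives no context, so the following is only a sketch under one assumption: that it is the usual rule of thumb for blocked (tiled) matrix multiplication, where two blockSize x blockSize tiles should fit in the L1 cache together. The names `blockSizeFor` and `blockedMultiply` are hypothetical, and the matrices are stored as flat row-major arrays for simplicity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest block edge b such that 2 * b^2 * wordSize <= l1Bytes.
std::size_t blockSizeFor(std::size_t l1Bytes, std::size_t wordSize) {
    std::size_t b = static_cast<std::size_t>(
        std::sqrt(static_cast<double>(l1Bytes) / (2.0 * wordSize)));
    // Correct for any floating-point rounding in either direction.
    while (2 * (b + 1) * (b + 1) * wordSize <= l1Bytes) ++b;
    while (b > 1 && 2 * b * b * wordSize > l1Bytes) --b;
    return std::max<std::size_t>(b, 1);
}

// C += A * B for n x n row-major matrices, processed in bs x bs tiles
// so that each tile is reused while it is still in cache.
void blockedMultiply(const std::vector<float>& A, const std::vector<float>& B,
                     std::vector<float>& C, std::size_t n, std::size_t bs) {
    for (std::size_t ii = 0; ii < n; ii += bs)
        for (std::size_t kk = 0; kk < n; kk += bs)
            for (std::size_t jj = 0; jj < n; jj += bs) {
                const std::size_t iEnd = std::min(ii + bs, n);
                const std::size_t kEnd = std::min(kk + bs, n);
                const std::size_t jEnd = std::min(jj + bs, n);
                for (std::size_t i = ii; i < iEnd; ++i)
                    for (std::size_t k = kk; k < kEnd; ++k) {
                        const float aik = A[i * n + k];
                        for (std::size_t j = jj; j < jEnd; ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
            }
}
```

For a 32 KiB L1 cache and 4-byte floats, `blockSizeFor(32 * 1024, 4)` returns 64, since 2 * 64^2 * 4 = 32768. In practice the best block size is usually found by measurement, because the cache is shared with other data.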
Degrees of Latency
The latency of data access becomes greater with each cache level. Latency of memory access is best measured in CPU clock cycles. One cycle occupies from 4 to 6 nanoseconds, depending on the CPU clock speed. The latencies to the different levels of the memory hierarchy are as follows:
CPU Register: 0 cycles.
L1 cache hit: 2 or 3 cycles.
L1 cache miss satisfied by L2 cache hit: 8 to 10 cycles.
L2 cache miss satisfied from main memory, no TLB miss: 75 to 250 cycles; that is, 300 to 1100 nanoseconds, depending on the node where the memory resides (see Table 1-3).
TLB miss requiring only reload of the TLB to refer to a virtual page already in memory: approximately 2000 cycles.
TLB miss requiring virtual page to load from backing store: hundreds of millions of cycles; that is, tens to hundreds of milliseconds.
A miss at each level of the memory hierarchy multiplies the latency by an order of magnitude or more. Clearly a program can sustain high performance only by achieving a very high ratio of cache hits at every level. Fortunately, hit ratios of 95% and higher are commonly achieved.
Cache Efficient Matrix Multiplication
Chapter 6. Optimizing Cache Utilization
test_std_faster takes just over 1 second.
test_std_mul takes over 7 seconds.
#include "../../profile.h"
#include <stdio.h>
#include <light/memory.h>
#include <light/profiler.h>

// Multiply every K-th element of arr by 3 (stride-K access).
void test(int *arr, const int N, const int K)
{
    for(int i=0; i<N; i+=K)
    {
        arr[i]*=3;
    }
}

// Faster ordering: the innermost loop (j) walks c[i][...] and b[k][...]
// along a row, so consecutive iterations touch neighbouring elements.
template<typename T>
void test_std_faster(T** a, T** b, T** c, int n)
{
    int i,j,k;
    LIGHT_PROFILE_FUNCTION_SCOPE();
    for (k = 0; k < n; k++)
    {
        for (i = 0; i < n; i++)
        {
            for (j = 0; j < n; j++)
            {
                c[i][j] = c[i][j] + a[i][k]*b[k][j];
            }
        }
    }
}

// Slower ordering: the innermost loop (j) reads b[j][i], i.e. it walks
// down a column of b and jumps to a different row on every iteration.
template<typename T>
void test_std_mul(T** a, T** b, T** c, int n)
{
    int i,j,k;
    LIGHT_PROFILE_FUNCTION_SCOPE();
    for (k = 0; k < n; k++)
    {
        for (i = 0; i < n; i++)
        {
            for (j = 0; j < n; j++)
            {
                c[k][i] = c[k][i] + a[k][j]*b[j][i];
            }
        }
    }
}

int main()
{
    using namespace light;
    typedef float real;
    typedef int large;
    large m=1000, n=1000;
    real** A=new_array2d<real>(m,n);
    real** B=new_array2d<real>(m,n);
    real** C=new_array2d<real>(m,n);
    // C is not reset between the two calls; only the timings matter here.
    test_std_mul(A,B,C,n);
    test_std_faster(A,B,C,n);
    delete_array2d(A);
    delete_array2d(B);
    delete_array2d(C);

    const int dimX=64, dimY=1024, dimZ=1024;
    const int len=dimX*dimY*dimZ;
    int *arr=new int[dimX*dimY*dimZ];
    {
        // First pass over the freshly allocated array, stride 1.
        int K=1;
        PROFILE_SCOPE("test1");
        test(arr, len, K);
    }
    // Repeat the pass for odd strides K = 1, 3, ..., 127.
    for(int k=1; k<128; k+=2){
        char str[20];
        sprintf(str, "K=%d", k);
        {
            PROFILE_SCOPE(str);
            test(arr, len, k);
        }
    }
    delete []arr;
    return 0;
}
Detailed results:
In [test_std_mul]:7.69271374 seconds
In [test_std_faster]:1.06887197 seconds
In [test1]:0.126983577 seconds
In [K=1]:0.071336323 seconds
In [K=3]:0.070501934 seconds
In [K=5]:0.070173198 seconds
In [K=7]:0.069914513 seconds
In [K=9]:0.070830328 seconds
In [K=11]:0.069679724 seconds
In [K=13]:0.070218294 seconds
In [K=15]:0.070343338 seconds
In [K=17]:0.06793199 seconds
In [K=19]:0.065443525 seconds
In [K=21]:0.063027865 seconds
In [K=23]:0.059502513 seconds
In [K=25]:0.055603195 seconds
In [K=27]:0.05297714 seconds
In [K=29]:0.050074194 seconds
In [K=31]:0.048622758 seconds
In [K=33]:0.042584798 seconds
In [K=35]:0.0476709 seconds
In [K=37]:0.034328566 seconds
In [K=39]:0.046996806 seconds
In [K=41]:0.039327118 seconds
In [K=43]:0.028445512 seconds
In [K=45]:0.03679236 seconds
In [K=47]:0.039664869 seconds
In [K=49]:0.040579371 seconds
In [K=51]:0.03415459 seconds
In [K=53]:0.034461676 seconds
In [K=55]:0.022190094 seconds
In [K=57]:0.031438485 seconds
In [K=59]:0.031826945 seconds
In [K=61]:0.031029584 seconds
In [K=63]:0.030291529 seconds
In [K=65]:0.029730277 seconds
In [K=67]:0.029219664 seconds
In [K=69]:0.027173476 seconds
In [K=71]:0.026998234 seconds
In [K=73]:0.019982384 seconds
In [K=75]:0.027318676 seconds
In [K=77]:0.026787412 seconds
In [K=79]:0.026784303 seconds
In [K=81]:0.024683844 seconds
In [K=83]:0.023558035 seconds
In [K=85]:0.015945637 seconds
In [K=87]:0.023418765 seconds
In [K=89]:0.018921708 seconds
In [K=91]:0.016292988 seconds
In [K=93]:0.022331199 seconds
In [K=95]:0.012433372 seconds
In [K=97]:0.021208601 seconds
In [K=99]:0.01638945 seconds
In [K=101]:0.01571856 seconds
In [K=103]:0.019857388 seconds
In [K=105]:0.010748321 seconds
In [K=107]:0.01605354 seconds
In [K=109]:0.013885589 seconds
In [K=111]:0.015515276 seconds
In [K=113]:0.014337689 seconds
In [K=115]:0.012188802 seconds
In [K=117]:0.015207584 seconds
In [K=119]:0.009342447 seconds
In [K=121]:0.018276332 seconds
In [K=123]:0.008979882 seconds
In [K=125]:0.015199975 seconds
In [K=127]:0.010937606 seconds
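The program above depends on the `light` headers, which may not be available. As a rough way to repeat the matrix part of the experiment with only the standard library, here is a minimal sketch. It is not the original code: it stores each matrix as one flat row-major `std::vector` instead of an array of row pointers, and the names `multiplyColumnWalk`, `multiplyRowWalk`, and `secondsTaken` are made up for this illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using Matrix = std::vector<float>;  // n x n, row-major: element (r, c) is m[r * n + c]

// Same access pattern as test_std_mul: the innermost loop reads B
// down a column, so each step jumps n elements ahead in memory.
void multiplyColumnWalk(const Matrix& A, const Matrix& B, Matrix& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// Same access pattern as test_std_faster: the innermost loop walks
// C and B along a row, touching neighbouring elements.
void multiplyRowWalk(const Matrix& A, const Matrix& B, Matrix& C, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t j = 0; j < n; ++j)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
}

// Wall-clock time, in seconds, spent running work().
template <typename F>
double secondsTaken(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}
```

A call such as `secondsTaken([&] { multiplyRowWalk(A, B, C, n); })` with n = 1000 can then be compared against the column-walking version. The absolute times will differ from the figures above depending on the machine and compiler flags.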
Reduce demands on memory bandwidth by pre-loading into local variables
while( … ) {
    *res++ = filter[0]*signal[0] + filter[1]*signal[1] + filter[2]*signal[2];
    signal++;
}

float f0 = filter[0];
float f1 = filter[1];
float f2 = filter[2];
while( … ) {
    *res++ = f0*signal[0] + f1*signal[1] + f2*signal[2];
    signal++;
}
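The two fragments above leave the loop condition elided. As a runnable sketch of the same idea, the version below assumes a fixed output count and a 3-tap filter. The function names and the `count` parameter are not from the original. One likely reason the second form helps is that `res` might alias `filter`, so the compiler may have to reload the coefficients after every store unless they are first copied into locals.

```cpp
#include <cstddef>

// Writes count outputs; signal must hold at least count + 2 samples.
// The coefficients are read through the pointer on every iteration.
void filterDirect(const float* filter, const float* signal,
                  float* res, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i)
        res[i] = filter[0] * signal[i]
               + filter[1] * signal[i + 1]
               + filter[2] * signal[i + 2];
}

// Same computation, with the coefficients pre-loaded into local
// variables so they can stay in registers for the whole loop.
void filterPreloaded(const float* filter, const float* signal,
                     float* res, std::size_t count) {
    const float f0 = filter[0];
    const float f1 = filter[1];
    const float f2 = filter[2];
    for (std::size_t i = 0; i < count; ++i)
        res[i] = f0 * signal[i] + f1 * signal[i + 1] + f2 * signal[i + 2];
}
```

Both functions produce identical output. Whether the second is measurably faster depends on the compiler and optimization level, so it is worth timing on the target machine.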
Expose instruction-level parallelism
float f0 = filter[0], f1 = filter[1], f2 = filter[2];
float s0 = signal[0], s1 = signal[1], s2 = signal[2];
*res++ = f0*s0 + f1*s1 + f2*s2;
do {
    signal += 3;
    s0 = signal[0];
    res[0] = f0*s1 + f1*s2 + f2*s0;
    s1 = signal[1];
    res[1] = f0*s2 + f1*s0 + f2*s1;
    s2 = signal[2];
    res[2] = f0*s0 + f1*s1 + f2*s2;
    res += 3;
} while( … );
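The unrolled loop above also leaves its termination condition elided. The sketch below is one possible complete version, assuming an explicit output count and adding a plain loop for the leftover outputs. The name `filterUnrolled` and the `count` parameter are hypothetical. Each signal sample is loaded once and kept in a local variable, and the three outputs in one iteration are independent of each other, which gives the processor more work it can overlap.

```cpp
#include <cstddef>

// Writes count outputs of a 3-tap filter; signal must hold at least
// count + 2 samples.
void filterUnrolled(const float* filter, const float* signal,
                    float* res, std::size_t count) {
    if (count == 0) return;
    const float f0 = filter[0], f1 = filter[1], f2 = filter[2];
    float s0 = signal[0], s1 = signal[1], s2 = signal[2];
    res[0] = f0 * s0 + f1 * s1 + f2 * s2;

    std::size_t i = 1;  // index of the next output to write
    // At the top of each iteration: s0, s1, s2 hold
    // signal[i - 1], signal[i], signal[i + 1].
    for (; i + 3 <= count; i += 3) {
        s0 = signal[i + 2];
        res[i]     = f0 * s1 + f1 * s2 + f2 * s0;
        s1 = signal[i + 3];
        res[i + 1] = f0 * s2 + f1 * s0 + f2 * s1;
        s2 = signal[i + 4];
        res[i + 2] = f0 * s0 + f1 * s1 + f2 * s2;
    }
    // Leftover outputs when (count - 1) is not a multiple of 3.
    for (; i < count; ++i)
        res[i] = f0 * signal[i] + f1 * signal[i + 1] + f2 * signal[i + 2];
}
```

The rotation of `s0`, `s1`, `s2` follows the original fragment exactly. Only the bounds handling is new.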
Copy input operands or blocks
Reduce cache conflicts
Constant array offsets for fixed size blocks
Expose page-level locality