Cache Optimization (Matrix Multiplication as an Example)

Source: Internet; Published by: qq代理公布软件; Editor: 程序博客网; Time: 2024/06/06 09:10

2 * (blockSize)^2 * wordSize = L1 cache size

-O4 -funroll-loops

2 * (blockSize)^2 * wordSize = L2 cache size
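The block-size rule above can be turned into a small calculation. The following is a minimal sketch, assuming the formula budgets the cache for two blockSize x blockSize tiles; the names block_size_for and blocked_multiply, the flat row-major layout, and the i-k-j order inside a tile are illustrative choices, not taken from the source.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest blockSize that satisfies 2 * blockSize^2 * wordSize <= cacheBytes.
std::size_t block_size_for(std::size_t cacheBytes, std::size_t wordSize) {
    assert(wordSize > 0);
    std::size_t b = static_cast<std::size_t>(
        std::sqrt(static_cast<double>(cacheBytes) / (2.0 * wordSize)));
    // Correct for floating-point rounding in either direction.
    while (2 * (b + 1) * (b + 1) * wordSize <= cacheBytes) ++b;
    while (b > 0 && 2 * b * b * wordSize > cacheBytes) --b;
    return b;
}

// C += A * B for n x n row-major matrices, processed in bs x bs tiles
// (hypothetical helper; edge tiles are clipped with std::min).
void blocked_multiply(const std::vector<double>& A, const std::vector<double>& B,
                      std::vector<double>& C, std::size_t n, std::size_t bs) {
    assert(bs > 0);
    for (std::size_t ii = 0; ii < n; ii += bs) {
        const std::size_t iEnd = std::min(ii + bs, n);
        for (std::size_t kk = 0; kk < n; kk += bs) {
            const std::size_t kEnd = std::min(kk + bs, n);
            for (std::size_t jj = 0; jj < n; jj += bs) {
                const std::size_t jEnd = std::min(jj + bs, n);
                for (std::size_t i = ii; i < iEnd; ++i)
                    for (std::size_t k = kk; k < kEnd; ++k) {
                        const double aik = A[i * n + k];
                        // Innermost loop is unit stride in both B and C.
                        for (std::size_t j = jj; j < jEnd; ++j)
                            C[i * n + j] += aik * B[k * n + j];
                    }
            }
        }
    }
}
```

For example, block_size_for(32 * 1024, 8) returns 45 for a 32 KB cache and 8-byte words; the 32 KB figure is only an assumed cache size, so substitute the value for the actual machine.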

Degrees of Latency

The latency of data access becomes greater with each cache level. Latency of memory access is best measured in CPU clock cycles. One cycle occupies from 4 to 6 nanoseconds, depending on the CPU clock speed. The latencies to the different levels of the memory hierarchy are as follows:

CPU Register: 0 cycles.
L1 cache hit: 2 or 3 cycles.
L1 cache miss satisfied by L2 cache hit: 8 to 10 cycles.
L2 cache miss satisfied from main memory, no TLB miss: 75 to 250 cycles; that is, 300 to 1100 nanoseconds, depending on the node where the memory resides (see Table 1-3).
TLB miss requiring only reload of the TLB to refer to a virtual page already in memory: approximately 2000 cycles.
TLB miss requiring virtual page to load from backing store: hundreds of millions of cycles; that is, tens to hundreds of milliseconds.

A miss at each level of the memory hierarchy multiplies the latency by an order of magnitude or more. Clearly a program can sustain high performance only by achieving a very high ratio of cache hits at every level. Fortunately, hit ratios of 95% and higher are commonly achieved.
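To see why the hit ratio matters so much, the latencies listed above can be combined into an average cost per access. This is a rough sketch under stated assumptions: it ignores TLB misses, treats each listed latency as the total cost of that case, and the function name average_access_cycles is hypothetical.

```cpp
#include <cassert>

// Average cycles per memory access for two cache levels in front of main memory.
// l1Hit is the L1 hit ratio; l2Hit is the L2 hit ratio among L1 misses.
// l1Cycles, l2Cycles and memCycles are the total latencies of each outcome.
double average_access_cycles(double l1Hit, double l2Hit,
                             double l1Cycles, double l2Cycles, double memCycles) {
    assert(l1Hit >= 0.0 && l1Hit <= 1.0);
    assert(l2Hit >= 0.0 && l2Hit <= 1.0);
    const double missCost = l2Hit * l2Cycles + (1.0 - l2Hit) * memCycles;
    return l1Hit * l1Cycles + (1.0 - l1Hit) * missCost;
}
```

Taking the upper figures from the list (3, 10 and 250 cycles) with a 95% hit ratio at both levels, average_access_cycles(0.95, 0.95, 3, 10, 250) comes to 3.95 cycles, close to the L1 latency. With both hit ratios at 0 the same call returns the full 250 cycles.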

Cache Efficient Matrix Multiplication

Chapter 6. Optimizing Cache Utilization

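The key point: the two inner loops of test_std_faster can run directly out of the L2 cache. Its innermost loop walks c[i][j] and b[k][j] along a row with unit stride, whereas the innermost loop of test_std_mul reads b[j][i] down a column and lands in a different row on every iteration.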

test_std_faster

Takes just over 1 second.

test_std_mul

Takes just over 7 seconds.

#include "../../profile.h"
#include <stdio.h>
#include <light/memory.h>
#include <light/profiler.h>

// Multiply every K-th element of arr by 3 (a stride-K sweep over N ints).
void test(int *arr, const int N, const int K)
{
    for(int i=0; i<N; i+=K)
    {
        arr[i]*=3;
    }
}

// Reduction index k is outermost; the innermost j loop walks one row of c
// and one row of b with unit stride.
template<typename T>
void test_std_faster(T** a, T** b, T** c, int n)
{
    int i,j,k;
    LIGHT_PROFILE_FUNCTION_SCOPE();
    for (k = 0; k < n; k++)
    {
        for (i = 0; i < n; i++)
        {
            for (j = 0; j < n; j++)
            {
                c[i][j] = c[i][j] + a[i][k]*b[k][j];
            }
        }
    }
}

// Reduction index j is innermost; b[j][i] is read down a column, so every
// iteration touches a different row of b.
template<typename T>
void test_std_mul(T** a, T** b, T** c, int n)
{
    int i,j,k;
    LIGHT_PROFILE_FUNCTION_SCOPE();
    for (k = 0; k < n; k++)
    {
        for (i = 0; i < n; i++)
        {
            for (j = 0; j < n; j++)
            {
                c[k][i] = c[k][i] + a[k][j]*b[j][i];
            }
        }
    }
}

int main()
{
    using namespace light;
    typedef float real;
    typedef int large;

    // Part 1: the two loop orders on 1000 x 1000 float matrices.
    large m=1000, n=1000;
    real** A=new_array2d<real>(m,n);
    real** B=new_array2d<real>(m,n);
    real** C=new_array2d<real>(m,n);
    test_std_mul(A,B,C,n);
    test_std_faster(A,B,C,n);
    delete_array2d(A);
    delete_array2d(B);
    delete_array2d(C);

    // Part 2: stride-K sweeps over a 64 * 1024 * 1024 int array.
    const int dimX=64, dimY=1024, dimZ=1024;
    const int len=dimX*dimY*dimZ;
    int *arr=new int[dimX*dimY*dimZ];
    {
        int K=1;
        PROFILE_SCOPE("test1");
        test(arr, len, K);
    }
    for(int k=1; k<128; k+=2){
        char str[20];
        sprintf(str, "K=%d", k);
        {
            PROFILE_SCOPE(str);
            test(arr, len, k);
        }
    }
    delete []arr;
    return 0;
}

Detailed results:

In [test_std_mul]:7.69271374 seconds
In [test_std_faster]:1.06887197 seconds
In [test1]:0.126983577 seconds
In [K=1]:0.071336323 seconds
In [K=3]:0.070501934 seconds
In [K=5]:0.070173198 seconds
In [K=7]:0.069914513 seconds
In [K=9]:0.070830328 seconds
In [K=11]:0.069679724 seconds
In [K=13]:0.070218294 seconds
In [K=15]:0.070343338 seconds
In [K=17]:0.06793199 seconds
In [K=19]:0.065443525 seconds
In [K=21]:0.063027865 seconds
In [K=23]:0.059502513 seconds
In [K=25]:0.055603195 seconds
In [K=27]:0.05297714 seconds
In [K=29]:0.050074194 seconds
In [K=31]:0.048622758 seconds
In [K=33]:0.042584798 seconds
In [K=35]:0.0476709 seconds
In [K=37]:0.034328566 seconds
In [K=39]:0.046996806 seconds
In [K=41]:0.039327118 seconds
In [K=43]:0.028445512 seconds
In [K=45]:0.03679236 seconds
In [K=47]:0.039664869 seconds
In [K=49]:0.040579371 seconds
In [K=51]:0.03415459 seconds
In [K=53]:0.034461676 seconds
In [K=55]:0.022190094 seconds
In [K=57]:0.031438485 seconds
In [K=59]:0.031826945 seconds
In [K=61]:0.031029584 seconds
In [K=63]:0.030291529 seconds
In [K=65]:0.029730277 seconds
In [K=67]:0.029219664 seconds
In [K=69]:0.027173476 seconds
In [K=71]:0.026998234 seconds
In [K=73]:0.019982384 seconds
In [K=75]:0.027318676 seconds
In [K=77]:0.026787412 seconds
In [K=79]:0.026784303 seconds
In [K=81]:0.024683844 seconds
In [K=83]:0.023558035 seconds
In [K=85]:0.015945637 seconds
In [K=87]:0.023418765 seconds
In [K=89]:0.018921708 seconds
In [K=91]:0.016292988 seconds
In [K=93]:0.022331199 seconds
In [K=95]:0.012433372 seconds
In [K=97]:0.021208601 seconds
In [K=99]:0.01638945 seconds
In [K=101]:0.01571856 seconds
In [K=103]:0.019857388 seconds
In [K=105]:0.010748321 seconds
In [K=107]:0.01605354 seconds
In [K=109]:0.013885589 seconds
In [K=111]:0.015515276 seconds
In [K=113]:0.014337689 seconds
In [K=115]:0.012188802 seconds
In [K=117]:0.015207584 seconds
In [K=119]:0.009342447 seconds
In [K=121]:0.018276332 seconds
In [K=123]:0.008979882 seconds
In [K=125]:0.015199975 seconds
In [K=127]:0.010937606 seconds
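One possible reading of the stride timings: they stay near 0.07 seconds up to K=15 and only then start to fall, which would be consistent with a 64-byte cache line holding 16 four-byte ints, since any stride of 16 or less still touches every line. The source does not state the line size, so the 64-byte figure is an assumption. The sketch below (lines_touched is a hypothetical helper) counts how many distinct lines a stride-K sweep touches.

```cpp
#include <cassert>
#include <cstddef>

// Number of distinct cache lines touched by `for (i = 0; i < n; i += stride)`
// over an array of elemSize-byte elements. Assumes the array starts on a
// cache-line boundary and that indices are visited in increasing order.
std::size_t lines_touched(std::size_t n, std::size_t stride,
                          std::size_t elemSize, std::size_t lineBytes) {
    assert(stride > 0 && elemSize > 0 && lineBytes > 0);
    std::size_t count = 0;
    std::size_t lastLine = static_cast<std::size_t>(-1);  // sentinel: no line seen yet
    for (std::size_t i = 0; i < n; i += stride) {
        const std::size_t line = (i * elemSize) / lineBytes;
        if (line != lastLine) {
            ++count;
            lastLine = line;
        }
    }
    return count;
}
```

With 1024 ints and 64-byte lines, strides 1, 15 and 16 all touch 64 lines, while stride 32 touches only 32. Under this assumption the memory traffic, and so the running time, should only begin to drop once K exceeds 16.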

Reduce demands on memory bandwidth by pre-loading into local variables

while( … ) {
   *res++ = filter[0]*signal[0]
            + filter[1]*signal[1]
            + filter[2]*signal[2];
   signal++;
}

float f0 = filter[0];
float f1 = filter[1];
float f2 = filter[2];
while( … ) {
   *res++ = f0*signal[0]
            + f1*signal[1]
            + f2*signal[2];
   signal++;
}
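The fragments above leave the loop condition elided. The following is a self-contained sketch of the same idea with an explicit bound, assuming a signal of at least three samples; the function name filter3_preloaded and the use of std::vector are illustrative. Copying the coefficients into locals helps because the compiler usually cannot prove that writes through res never alias filter, so without the locals it may reload filter[0..2] on every iteration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 3-tap filter with coefficients pre-loaded into local variables.
// Produces signal.size() - 2 outputs: res[n] = f0*x[n] + f1*x[n+1] + f2*x[n+2].
std::vector<float> filter3_preloaded(const float* filter,
                                     const std::vector<float>& signal) {
    assert(signal.size() >= 3);
    const std::size_t count = signal.size() - 2;
    std::vector<float> res(count);

    // Locals cannot alias the output buffer, so they can stay in registers.
    const float f0 = filter[0];
    const float f1 = filter[1];
    const float f2 = filter[2];

    const float* s = signal.data();
    float* out = res.data();
    for (std::size_t i = 0; i < count; ++i) {
        *out++ = f0 * s[0] + f1 * s[1] + f2 * s[2];
        ++s;
    }
    return res;
}
```

For instance, coefficients {1, 2, 3} applied to the signal {1, 2, 3, 4} yield the two outputs 14 and 20.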

Expose instruction-level parallelism

float f0 = filter[0], f1 = filter[1], f2 = filter[2];
float s0 = signal[0], s1 = signal[1], s2 = signal[2];
*res++ = f0*s0 + f1*s1 + f2*s2;
do {
   signal += 3;
   s0 = signal[0];
   res[0] = f0*s1 + f1*s2 + f2*s0;
   s1 = signal[1];
   res[1] = f0*s2 + f1*s0 + f2*s1;
   s2 = signal[2];
   res[2] = f0*s0 + f1*s1 + f2*s2;
   res += 3;
} while( … );
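The unrolled fragment above emits one output before the loop and three per pass, so it only covers output counts of the form 1 + 3m. Below is a sketch of the same register-rotation scheme with explicit bounds and a remainder loop for other lengths; the name filter3_unrolled, the indexing by n, and the remainder handling are assumptions added for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 3-tap filter unrolled by three. Each pass loads one new sample per output
// and rotates it through s0, s1, s2, so every sample is read from memory once.
std::vector<float> filter3_unrolled(const float* filter,
                                    const std::vector<float>& signal) {
    assert(signal.size() >= 3);
    const std::size_t count = signal.size() - 2;
    std::vector<float> res(count);
    const float f0 = filter[0], f1 = filter[1], f2 = filter[2];
    const float* x = signal.data();

    float s0 = x[0], s1 = x[1], s2 = x[2];
    res[0] = f0 * s0 + f1 * s1 + f2 * s2;

    std::size_t n = 1;  // index of the next output
    // Invariant at the top of each pass: s0 = x[n-1], s1 = x[n], s2 = x[n+1].
    for (; n + 3 <= count; n += 3) {
        s0 = x[n + 2];
        res[n] = f0 * s1 + f1 * s2 + f2 * s0;
        s1 = x[n + 3];
        res[n + 1] = f0 * s2 + f1 * s0 + f2 * s1;
        s2 = x[n + 4];
        res[n + 2] = f0 * s0 + f1 * s1 + f2 * s2;
    }
    // Remainder: up to two outputs computed the straightforward way.
    for (; n < count; ++n)
        res[n] = f0 * x[n] + f1 * x[n + 1] + f2 * x[n + 2];
    return res;
}
```

The three statements inside a pass write to different outputs and depend on each other only through the freshly loaded sample, which gives the processor independent multiply-add chains to overlap.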

Copy input operands or blocks
Reduce cache conflicts
Constant array offsets for fixed size blocks
Expose page-level locality
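As an illustration of the first and third items in the list above, a tile of a large matrix can be copied into a small contiguous buffer whose dimensions are compile-time constants. This is only a sketch: kBlock = 4 and the name copy_tile are hypothetical, and a real tile size would be chosen from the cache size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlock = 4;  // hypothetical fixed tile size

// Copy the kBlock x kBlock tile whose top-left corner is (row0, col0) out of
// a row-major n x n matrix into a contiguous fixed-size buffer.
void copy_tile(const std::vector<double>& M, std::size_t n,
               std::size_t row0, std::size_t col0,
               double (&tile)[kBlock][kBlock]) {
    assert(row0 + kBlock <= n && col0 + kBlock <= n);
    for (std::size_t i = 0; i < kBlock; ++i)
        for (std::size_t j = 0; j < kBlock; ++j)
            tile[i][j] = M[(row0 + i) * n + (col0 + j)];
}
```

Inside the tile, the distance between rows is kBlock elements rather than n, so the offsets are constants known to the compiler, and rows of the tile no longer sit n elements apart in memory, which is the spacing that tends to cause cache conflicts when n is a large power of two.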