Parallel Reduction --- (4) Free Loops


Abstract
This post discusses how to free (i.e., unroll) unnecessary loops in CUDA code.

1. Free Loops on CPU
When talking about "unrolling" or "freeing loops", the first thing that comes to mind is #pragma unroll, a compiler optimization directive (a sketch of its use appears at the end of this section). From Wikipedia we learn that loop unrolling is a technique that attempts to optimize a program's execution speed at the expense of its binary size. For example, the original code looks like this:

for (int i = 0; i < 4; i++) {
    cout << "hello world" << endl;
}

After unrolling, the unnecessary loop-control instructions are removed in pursuit of higher performance, at the unavoidable cost of more memory space; the code then looks like this:

cout << "hello world" << endl;
cout << "hello world" << endl;
cout << "hello world" << endl;
cout << "hello world" << endl;

Remember that there is no free lunch. Freeing loops is an instance of the space-time tradeoff.
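
As an aside on the #pragma unroll directive mentioned above: instead of duplicating statements by hand, we can ask the compiler to do the unrolling. Below is a minimal CUDA sketch; the kernel scale4 and its data layout are my own illustration, not taken from the original post.

// #pragma unroll asks nvcc to replicate the loop body and remove the
// loop-control instructions; here the trip count is a compile-time
// constant, so the loop can be fully unrolled.
__global__ void scale4(float *v)
{
    int base = threadIdx.x * 4;   // each thread handles four elements
    #pragma unroll
    for (int i = 0; i < 4; i++) {
        v[base + i] *= 2.0f;
    }
}

The generated code is equivalent to writing the four multiply statements out by hand, just like the cout example above.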

2. Free Loops on GPU
The key code of the unrolled reduction on CUDA is shown as follows.

// reduction
if (tid < 512) { data[tid] += data[tid + 512]; }
__syncthreads();
if (tid < 256) { data[tid] += data[tid + 256]; }
__syncthreads();
if (tid < 128) { data[tid] += data[tid + 128]; }
__syncthreads();
if (tid < 64)  { data[tid] += data[tid + 64]; }
__syncthreads();
// the final steps execute within a single warp, so no __syncthreads()
// is needed between them; note that data must be declared (or re-cast
// as) volatile so the compiler does not cache the partial sums in
// registers
if (tid < 32) {
    data[tid] += data[tid + 32];
    data[tid] += data[tid + 16];
    data[tid] += data[tid + 8];
    data[tid] += data[tid + 4];
    data[tid] += data[tid + 2];
    data[tid] += data[tid + 1];
}
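
For comparison, the loop that the code above unrolls would look something like the following sketch. This is my reconstruction, assuming the same shared array data, the thread index tid, and a block of 1024 threads; the post itself only shows the unrolled form.

// the loop-based reduction that the unrolled code replaces: every
// iteration pays loop-control overhead plus a __syncthreads(), even
// once the active threads all fit within a single warp
for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        data[tid] += data[tid + s];
    }
    __syncthreads();
}

Unrolling replaces this loop with the explicit steps shown above and drops __syncthreads() for the final warp-sized steps, which is where the extra speedup comes from.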

3. Experimental Results
[Figure: experimental results]

Further improved performance can be achieved by freeing, or unrolling, the loops. That's really cool, isn't it?

4. More Details
The complete CUDA-based reduction code can be viewed on GitHub.
