Parallel Reduction --- (2) Remove Unnecessary Modular Arithmetic

Source: Internet  Published: 一角书屋知乎  Editor: 程序博客网  Time: 2024/05/29 18:05

Abstract
This post tries to improve the performance of the previous reduction algorithm. Specifically, it discusses the strategy of removing unnecessary modular arithmetic.

1. Modular arithmetic
According to the wiki, the modulo operation is costly. To avoid that cost, we may ask: is it possible to implement our algorithm in a more efficient way?
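For reference, a modulo-based reduction loop usually looks something like the sketch below. The kernel from the previous post is not reproduced here, so this is only an assumed baseline: the names `reduceModulo` and `sumWithModuloKernel`, the `float` element type, the fixed size of 1024 elements, and the single-block launch are all chosen for illustration.

```cuda
#include <cuda_runtime.h>

// Hypothetical baseline kernel: interleaved addressing with a modulo test.
// Assumes one block of 1024 threads and exactly 1024 input elements.
__global__ void reduceModulo(const float *in, float *out)
{
    __shared__ float data[1024];
    unsigned int tid = threadIdx.x;
    data[tid] = in[tid];
    __syncthreads();

    for (unsigned int i = 1; i < 1024; i *= 2) {
        // Every thread evaluates the % test in every iteration,
        // but only threads whose index is a multiple of 2*i do the addition.
        if (tid % (2 * i) == 0) {
            data[tid] += data[tid + i];
        }
        __syncthreads();
    }

    if (tid == 0) {
        *out = data[0];   // thread 0 holds the final sum
    }
}

// Hypothetical host helper: copies 1024 floats to the GPU,
// runs the kernel once, and returns the sum. Error checks omitted for brevity.
float sumWithModuloKernel(const float *h_in)
{
    float *d_in = nullptr, *d_out = nullptr, h_out = 0.0f;
    cudaMalloc(&d_in, 1024 * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 1024 * sizeof(float), cudaMemcpyHostToDevice);
    reduceModulo<<<1, 1024>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

In this form the `%` is executed by all 1024 threads in each of the 10 iterations, even though only 1024 / (2 * i) threads do useful work.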

2. Another Way to Implement
In this case, the answer is yes. We reorganize the key code as follows.

    // reduction
    for (int i = 1; i < 1024; i *= 2) {
        int ntid = 2 * i * tid;
        if (ntid < 1024) {
            data[ntid] += data[ntid + i];
        }
        __syncthreads();
    }

The following operation diagram illustrates this code.

[Figure: operation diagram of the reduction]
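To see the loop in context, here is a minimal sketch of a complete kernel built around it. Only the loop body comes from this post. The kernel name `reduceStrided`, the host helper `sumWithStridedKernel`, the `float` element type, and the single-block launch with 1024 threads are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

// Sketch of a kernel around the reduction loop shown above.
// Assumes one block of 1024 threads and exactly 1024 input elements.
__global__ void reduceStrided(const float *in, float *out)
{
    __shared__ float data[1024];
    int tid = threadIdx.x;
    data[tid] = in[tid];
    __syncthreads();

    // reduction: thread tid handles the pair (ntid, ntid + i),
    // so no modulo test is needed to pick the active threads.
    for (int i = 1; i < 1024; i *= 2) {
        int ntid = 2 * i * tid;
        if (ntid < 1024) {
            data[ntid] += data[ntid + i];
        }
        __syncthreads();
    }

    if (tid == 0) {
        *out = data[0];   // the final sum ends up in data[0]
    }
}

// Hypothetical host helper: copies 1024 floats to the GPU,
// runs the kernel once, and returns the sum. Error checks omitted for brevity.
float sumWithStridedKernel(const float *h_in)
{
    float *d_in = nullptr, *d_out = nullptr, h_out = 0.0f;
    cudaMalloc(&d_in, 1024 * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_in, h_in, 1024 * sizeof(float), cudaMemcpyHostToDevice);
    reduceStrided<<<1, 1024>>>(d_in, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

With i = 1, threads 0 to 511 update data[0], data[2], ..., data[1022]. With i = 2, threads 0 to 255 update data[0], data[4], ..., data[1020]. The active threads are always the lowest-numbered ones, and the `%` test is replaced by one multiplication and one comparison.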

3. Experimental Results
[Figure: experimental results]

The experimental results show that the CUDA kernel takes 12.304 ms in total, which is faster than the previous implementation.

4. More
The source code is available on GitHub.
