Parallel Reduction --- (1) Original Implementation


Abstract
This post implements an original (basic) version of parallel reduction in CUDA.

1. Key Code

    // reduction on a CUDA block
    for (int i = 1; i < 1024; i *= 2) {
        if ((tid % (2 * i)) == 0) {
            data[tid] += data[tid + i];
        }
        __syncthreads();
    }
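The loop is only the core of the kernel. As a minimal sketch of how it might sit inside a complete program, the block below assumes `int` data, a single block of 1024 threads, and a shared-memory array for `data`. The names `reduceNaive`, `reduceOnGpu`, and `kBlockSize` are hypothetical and do not come from the original source; error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockSize = 1024;  // matches the loop bound in the post

// Hypothetical kernel: each block sums kBlockSize ints into out[blockIdx.x].
__global__ void reduceNaive(const int* in, int* out) {
    __shared__ int data[kBlockSize];  // assumption: data lives in shared memory
    const int tid = threadIdx.x;

    // Stage one element per thread, then wait until the whole block has loaded.
    data[tid] = in[blockIdx.x * kBlockSize + tid];
    __syncthreads();

    // reduction on a CUDA block (the loop from the post)
    for (int i = 1; i < kBlockSize; i *= 2) {
        if ((tid % (2 * i)) == 0) {
            data[tid] += data[tid + i];
        }
        __syncthreads();  // every thread reaches this barrier, active or not
    }

    // Thread 0 holds the block total after the last step.
    if (tid == 0) {
        out[blockIdx.x] = data[0];
    }
}

// Hypothetical host helper: sums kBlockSize ints with a single block.
int reduceOnGpu(const int* hostIn) {
    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, kBlockSize * sizeof(int));
    cudaMalloc(&dOut, sizeof(int));
    cudaMemcpy(dIn, hostIn, kBlockSize * sizeof(int), cudaMemcpyHostToDevice);

    reduceNaive<<<1, kBlockSize>>>(dIn, dOut);

    // This blocking copy also waits for the kernel to finish.
    int result = 0;
    cudaMemcpy(&result, dOut, sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Calling `reduceOnGpu` on an array holding 0, 1, ..., 1023 should return 523776, which is 1023 · 1024 / 2. Note that `__syncthreads()` is kept outside the `if`, so all threads of the block reach the barrier at every step.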

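So, what does this loop do? At step i, every thread whose index is a multiple of 2i adds the element i positions to its right, and the stride i doubles until data[0] holds the total. To explain it visually, see the operation diagram below.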

[Figure: operation diagram of the reduction]
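To make the steps concrete, here is a small host-only sketch (plain C++ that also compiles in a `.cu` file) which replays the same loop on a short array and records the array after every step. It is a simulation for illustration, not GPU code, and the function names `traceReduction` and `printTrace` are hypothetical. It assumes the array length is a power of two.

```cuda
#include <cstdio>
#include <vector>

// Host-only simulation of the kernel loop. Returns the array state after
// each step, i.e. the state every thread would see after __syncthreads().
std::vector<std::vector<int>> traceReduction(std::vector<int> data) {
    const int n = static_cast<int>(data.size());  // assumed to be a power of two
    std::vector<std::vector<int>> steps;
    for (int i = 1; i < n; i *= 2) {
        // Each loop iteration over tid plays the role of one GPU thread.
        // Active threads only read elements that no active thread writes,
        // so running them one after another gives the same result.
        for (int tid = 0; tid < n; ++tid) {
            if (tid % (2 * i) == 0) {
                data[tid] += data[tid + i];
            }
        }
        steps.push_back(data);
    }
    return steps;
}

// Prints one line per step, labelled with the stride i used in that step.
void printTrace(const std::vector<std::vector<int>>& steps) {
    int stride = 1;
    for (const auto& state : steps) {
        std::printf("i=%d:", stride);
        for (int v : state) {
            std::printf(" %d", v);
        }
        std::printf("\n");
        stride *= 2;
    }
}
```

For the input 0, 1, ..., 7 the trace has three steps: after i=1 the array is 1 1 5 3 9 5 13 7, after i=2 it is 6 1 5 3 22 5 13 7, and after i=4 it is 28 1 5 3 22 5 13 7, so element 0 ends up holding the sum 28. Only the threads at indices 0, 2, 4, 6 work in the first step, then 0 and 4, then 0 alone.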

2. Experimental Results
[Figure: experimental results]

In this first version of the experiment, we implement the basic reduction in CUDA. The kernel runs on an NVIDIA GTX 780 Ti in a machine with an Intel Core i7 CPU and Windows 7. The results show that computing the reduction 0+1+2+…+1023 one thousand times takes 13.417 ms, or about 13.4 µs per reduction on average.
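The post does not say how the time was measured. One common way to time repeated kernel launches is with CUDA events, sketched below under several assumptions: a hypothetical kernel `reduceNaive` that wraps the loop from Section 1 around a shared-memory array, a single block of 1024 threads, and a hypothetical helper `timeReductionMs`. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockSize = 1024;

// Hypothetical kernel: stages the input in shared memory, then runs the
// reduction loop from the post and writes the total to *out.
__global__ void reduceNaive(const int* in, int* out) {
    __shared__ int data[kBlockSize];
    const int tid = threadIdx.x;
    data[tid] = in[tid];
    __syncthreads();
    for (int i = 1; i < kBlockSize; i *= 2) {
        if ((tid % (2 * i)) == 0) {
            data[tid] += data[tid + i];
        }
        __syncthreads();
    }
    if (tid == 0) {
        *out = data[0];
    }
}

// Launches the kernel `repeats` times back to back and returns the elapsed
// GPU time in milliseconds. The last computed sum is stored in *sumOut.
float timeReductionMs(int repeats, int* sumOut) {
    int hostIn[kBlockSize];
    for (int k = 0; k < kBlockSize; ++k) {
        hostIn[k] = k;  // 0, 1, ..., 1023
    }

    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, sizeof(hostIn));
    cudaMalloc(&dOut, sizeof(int));
    cudaMemcpy(dIn, hostIn, sizeof(hostIn), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int r = 0; r < repeats; ++r) {
        reduceNaive<<<1, kBlockSize>>>(dIn, dOut);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all launches have finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaMemcpy(sumOut, dOut, sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dIn);
    cudaFree(dOut);
    return ms;
}
```

Calling `timeReductionMs(1000, &sum)` should leave `sum` at 523776, which is 1023 · 1024 / 2. The returned time depends on the GPU, driver, and launch overhead, so it should not be expected to match the figure reported above.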

3. More Details
For more details, see my source code on GitHub. Anyone interested in this project is warmly welcome to contribute.
