Parallel Reduction --- (3) Free Strides


Abstract
This post tries to improve the performance of the previous reduction algorithm by eliminating strides from its memory accesses.

1. Strides
Strided access directly lowers load and store efficiency. With a stride of two, for example, half of the data fetched in each memory transaction is never used by any thread, which is wasted bandwidth. Making sure that as much data as possible is fetched without strides is therefore an important part of optimizing memory accesses.
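
For contrast, a strided reduction might look like the sketch below. The previous post's kernel is not reproduced here, so this interleaved-addressing form is an assumption. The names reduceStrided and sumStrided are hypothetical, and the sketch assumes a single block whose size is a power of two, working on an array of the same length in global memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical strided (interleaved-addressing) reduction, shown for contrast.
// Assumes one block whose size is a power of two and equals the array length.
__global__ void reduceStrided(float *data)
{
    unsigned int tid = threadIdx.x;
    for (unsigned int s = 1; s < blockDim.x; s <<= 1) {
        // Active threads touch elements 2*s apart, so each load uses
        // only part of the data that the memory transaction fetches.
        unsigned int idx = 2 * s * tid;
        if (idx < blockDim.x) {
            data[idx] += data[idx + s];
        }
        __syncthreads();  // every thread reaches the barrier, active or not
    }
}

// Hypothetical host helper: copies the input to the device, runs one block,
// and returns data[0]. Error checking is omitted for brevity.
float sumStrided(const std::vector<float> &host)
{
    const size_t bytes = host.size() * sizeof(float);
    float *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    reduceStrided<<<1, static_cast<unsigned int>(host.size())>>>(dev);
    float result = 0.0f;
    cudaMemcpy(&result, dev, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return result;
}
```

In the first step, thread 0 reads elements 0 and 1, thread 1 reads elements 2 and 3, and so on, so neighbouring threads are already two elements apart, and the gap doubles with every step.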

2. Key Code

// reduction
for (int i = 1024 / 2; i > 0; i >>= 1) {
    if (tid < i) {
        data[tid] += data[tid + i];
    }
    __syncthreads();
}
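
To show the loop in context, here is a minimal sketch of a complete kernel built around it. The kernel name reduceContiguous, the host helper sumContiguous, the float element type, and the choice of a single 1024-thread block working on global memory are all assumptions, not taken from the original source.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 1024;  // array length; assumed equal to the block size

// Stride-free reduction: in every step the active threads 0..i-1 read
// two contiguous ranges, data[0..i-1] and data[i..2i-1].
__global__ void reduceContiguous(float *data)
{
    int tid = threadIdx.x;
    for (int i = N / 2; i > 0; i >>= 1) {
        if (tid < i) {
            data[tid] += data[tid + i];
        }
        __syncthreads();  // finish this step before the next one reads
    }
}

// Hypothetical host helper: expects exactly N elements and returns their sum.
// Error checking is omitted for brevity.
float sumContiguous(const std::vector<float> &host)
{
    const size_t bytes = N * sizeof(float);
    float *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    reduceContiguous<<<1, N>>>(dev);  // one block of N threads
    float result = 0.0f;
    cudaMemcpy(&result, dev, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return result;
}
```

The __syncthreads() call sits outside the if so that every thread in the block reaches the barrier, including the threads that no longer have work to do.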

The following operation diagram illustrates this code.

[Figure: operation diagram of the reduction loop]

3. Experimental Results
[Figure: experimental results]

The experimental results show that eliminating strides gives much higher performance. It is a basic but indispensable strategy when optimizing CUDA code.

4. More
The source code can be viewed on GitHub.
