Parallel Reduction --- (5) Question: How Many Threads Do We Actually Need?

Source: Internet | Editor: 程序博客网 | Date: 2024/05/08

Abstract
This post shows how to further optimize CUDA reduction performance by exploring how many threads should actually be launched.

1. How Many Threads Do We Need?
The first question is: how many threads do we actually need? Some would say as many as possible (thread-level parallelism, TLP), while others would say as few as possible (instruction-level parallelism, ILP). I would say this is a trade-off!
In the previous post, our reduction implementation was based purely on the TLP strategy. It is very likely that the previous performance can be further improved by combining the TLP and ILP strategies.

2. Use Fewer Threads (512 Threads vs. 1024 Threads)

// Load data into shared memory, pre-adding one pair of elements per
// thread so that 512 threads cover 1024 inputs
__shared__ uint64_t data[512];
if (tid < 512) {
    data[tid] = data_gpu[tid] + data_gpu[tid + 512];
}
__syncthreads();

// Tree reduction in shared memory
if (tid < 256) {
    data[tid] += data[tid + 256];
}
__syncthreads();
if (tid < 128) {
    data[tid] += data[tid + 128];
}
__syncthreads();
if (tid < 64) {
    data[tid] += data[tid + 64];
}
__syncthreads();

// Final warp: unrolled, no __syncthreads() needed within a single warp,
// but shared memory must be accessed through a volatile pointer so the
// compiler does not cache intermediate values in registers
if (tid < 32) {
    volatile uint64_t *vdata = data;
    vdata[tid] += vdata[tid + 32];
    vdata[tid] += vdata[tid + 16];
    vdata[tid] += vdata[tid + 8];
    vdata[tid] += vdata[tid + 4];
    vdata[tid] += vdata[tid + 2];
    vdata[tid] += vdata[tid + 1];
}

// Write the root node (data[0]) back to global memory
if (tid == 0) {
    data_gpu[0] = data[0];
}

3. Experimental Results

The experimental results show that even higher performance can be achieved by balancing the TLP and ILP strategies.

4. More Details

The whole project has been published on GitHub, where anyone interested in CUDA is warmly welcome to help develop it. Big thanks to all of you!
