How do I choose grid and block dimensions for CUDA kernels?


The answers above point out how the block size can impact performance and suggest a common heuristic for choosing it based on occupancy maximization. Without wanting to provide *the* criterion for choosing the block size, it is worth mentioning that CUDA 6.5 (currently in Release Candidate) includes several new runtime functions to aid occupancy calculations and launch configuration; see

CUDA Pro Tip: Occupancy API Simplifies Launch Configuration

One of the useful functions is cudaOccupancyMaxPotentialBlockSize, which heuristically calculates a block size that achieves maximum occupancy. The values returned by that function can then be used as the starting point for a manual optimization of the launch parameters. Below is a small example.

#include <stdio.h>
#include <cuda_runtime.h>

/************************/
/* TEST KERNEL FUNCTION */
/************************/
__global__ void MyKernel(int *a, int *b, int *c, int N)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) { c[idx] = a[idx] + b[idx]; }
}

/********/
/* MAIN */
/********/
int main()
{
    const int N = 1000000;

    int blockSize;      // The launch configurator returned block size
    int minGridSize;    // The minimum grid size needed to achieve the maximum occupancy for a full device launch
    int gridSize;       // The actual grid size needed, based on input size

    // Host allocations: two inputs, the device result, and a host reference result
    int* h_vec1 = (int*) malloc(N * sizeof(int));
    int* h_vec2 = (int*) malloc(N * sizeof(int));
    int* h_vec3 = (int*) malloc(N * sizeof(int));
    int* h_vec4 = (int*) malloc(N * sizeof(int));

    int* d_vec1; cudaMalloc((void**)&d_vec1, N * sizeof(int));
    int* d_vec2; cudaMalloc((void**)&d_vec2, N * sizeof(int));
    int* d_vec3; cudaMalloc((void**)&d_vec3, N * sizeof(int));

    // Initialize the inputs and compute the reference result on the host
    for (int i = 0; i < N; i++) {
        h_vec1[i] = 10;
        h_vec2[i] = 20;
        h_vec4[i] = h_vec1[i] + h_vec2[i];
    }

    cudaMemcpy(d_vec1, h_vec1, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_vec2, h_vec2, N * sizeof(int), cudaMemcpyHostToDevice);

    float time;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the occupancy-based launch configuration
    cudaEventRecord(start, 0);
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, N);
    // Round up according to array size
    gridSize = (N + blockSize - 1) / blockSize;
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("Occupancy calculator elapsed time:  %3.3f ms \n", time);

    // Time the kernel launched with the suggested configuration
    cudaEventRecord(start, 0);
    MyKernel<<<gridSize, blockSize>>>(d_vec1, d_vec2, d_vec3, N);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("Kernel elapsed time:  %3.3f ms \n", time);
    printf("Blocksize %i\n", blockSize);

    // Verify the device result against the host reference
    cudaMemcpy(h_vec3, d_vec3, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) {
        if (h_vec3[i] != h_vec4[i]) {
            printf("Error at i = %i! Host = %i; Device = %i\n", i, h_vec4[i], h_vec3[i]);
            return 1;
        }
    }
    printf("Test passed\n");
    return 0;
}
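For completeness, the example compiles and runs with nvcc in the usual way (the file name here is arbitrary):

nvcc -o occupancy_test occupancy_test.cu
./occupancy_test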

EDIT

cudaOccupancyMaxPotentialBlockSize is defined in the cuda_runtime.h header file as follows:

template<class T>
__inline__ __host__ CUDART_DEVICE cudaError_t cudaOccupancyMaxPotentialBlockSize(
    int    *minGridSize,
    int    *blockSize,
    T       func,
    size_t  dynamicSMemSize = 0,
    int     blockSizeLimit = 0)
{
    return cudaOccupancyMaxPotentialBlockSizeVariableSMem(minGridSize, blockSize, func,
               __cudaOccupancyB2DHelper(dynamicSMemSize), blockSizeLimit);
}

The meanings of the parameters are the following:

minGridSize     = Suggested minimum grid size to achieve a full machine launch.
blockSize       = Suggested block size to achieve maximum occupancy.
func            = Kernel function.
dynamicSMemSize = Size of the dynamically allocated shared memory. Of course, it is known at runtime before any kernel launch. The size of the statically allocated shared memory is not needed, as it is inferred from the properties of func.
blockSizeLimit  = Maximum size for each block. In the case of 1D kernels, it can coincide with the number of input elements.
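To illustrate the dynamicSMemSize parameter, here is a minimal sketch (the kernel, the fixed 1024-int buffer, and the problem size are my own assumptions, not part of the API) of a kernel requesting a fixed amount of dynamic shared memory; the same byte count is passed both to the occupancy calculator and to the launch:

#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical kernel staging its output through dynamic shared memory.
// The buffer is sized at launch time via the third <<<...>>> argument.
__global__ void MyDynamicKernel(int *a, int N)
{
    extern __shared__ int buf[];
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N) {
        buf[threadIdx.x] = 2 * idx;       // stage through shared memory
        a[idx] = buf[threadIdx.x];
    }
}

int main()
{
    const int N = 1000000;
    // Fixed per-block budget; 1024 ints covers the maximum block size of
    // current devices, so buf[threadIdx.x] above stays in bounds.
    const size_t smemBytes = 1024 * sizeof(int);

    int *d_a; cudaMalloc((void**)&d_a, N * sizeof(int));

    int minGridSize, blockSize;
    // Passing smemBytes lets the calculator account for the dynamic shared
    // memory requirement; blockSizeLimit = N caps the block size.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyDynamicKernel, smemBytes, N);

    int gridSize = (N + blockSize - 1) / blockSize;
    // The same byte count must be supplied as the third launch parameter.
    MyDynamicKernel<<<gridSize, blockSize, smemBytes>>>(d_a, N);
    cudaDeviceSynchronize();

    printf("blockSize = %d, gridSize = %d\n", blockSize, gridSize);
    cudaFree(d_a);
    return 0;
}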

Note that, as of CUDA 6.5, one needs to compute one's own 2D/3D block dimensions from the 1D block size suggested by the API.
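For instance, here is a minimal sketch of one possible way to do that (the 2D kernel, the image size, and the power-of-two factoring are my own assumptions; the API itself only returns a 1D thread count):

#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical 2D kernel operating on a width x height image.
__global__ void My2DKernel(float *img, int width, int height)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    if (x < width && y < height) img[y * width + x] *= 2.0f;
}

int main()
{
    const int width = 1920, height = 1080;
    float *d_img; cudaMalloc((void**)&d_img, width * height * sizeof(float));

    // The API still reasons about a 1D block of blockSize threads...
    int minGridSize, blockSize;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, My2DKernel, 0, 0);

    // ...so factor blockSize into a roughly square 2D shape with the same
    // total thread count (e.g. 256 -> 16 x 16). This assumes blockSize is a
    // multiple of 32, which the calculator returns in practice.
    int bx = 1;
    while (bx * bx < blockSize) bx <<= 1;
    int by = blockSize / bx;

    dim3 block(bx, by);
    dim3 grid((width  + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    My2DKernel<<<grid, block>>>(d_img, width, height);
    cudaDeviceSynchronize();

    printf("block = %d x %d, grid = %d x %d\n", block.x, block.y, grid.x, grid.y);
    cudaFree(d_img);
    return 0;
}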
