How do I choose grid and block dimensions for CUDA kernels?
The answers above point out how the block size can impact performance and suggest a common heuristic for choosing it based on occupancy maximization. Without wanting to provide *the* criterion for choosing the block size, it is worth mentioning that CUDA 6.5 (at the time of writing, in Release Candidate) includes several new runtime functions to aid occupancy calculations and launch configuration; see
CUDA Pro Tip: Occupancy API Simplifies Launch Configuration
One of the useful functions is cudaOccupancyMaxPotentialBlockSize, which heuristically calculates a block size that achieves maximum occupancy. The values provided by that function could then be used as the starting point of a manual optimization of the launch parameters. Below is a little example.
    #include <stdio.h>
    #include <stdlib.h>

    /************************/
    /* TEST KERNEL FUNCTION */
    /************************/
    __global__ void MyKernel(int *a, int *b, int *c, int N)
    {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        if (idx < N) { c[idx] = a[idx] + b[idx]; }
    }

    /********/
    /* MAIN */
    /********/
    int main()
    {
        const int N = 1000000;

        int blockSize;      // The launch configurator returned block size
        int minGridSize;    // The minimum grid size needed to achieve the maximum occupancy for a full device launch
        int gridSize;       // The actual grid size needed, based on input size

        int* h_vec1 = (int*) malloc(N*sizeof(int));
        int* h_vec2 = (int*) malloc(N*sizeof(int));
        int* h_vec3 = (int*) malloc(N*sizeof(int));
        int* h_vec4 = (int*) malloc(N*sizeof(int));

        int* d_vec1; cudaMalloc((void**)&d_vec1, N*sizeof(int));
        int* d_vec2; cudaMalloc((void**)&d_vec2, N*sizeof(int));
        int* d_vec3; cudaMalloc((void**)&d_vec3, N*sizeof(int));

        for (int i=0; i<N; i++) {
            h_vec1[i] = 10;
            h_vec2[i] = 20;
            h_vec4[i] = h_vec1[i] + h_vec2[i];   // reference result computed on the host
        }

        cudaMemcpy(d_vec1, h_vec1, N*sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(d_vec2, h_vec2, N*sizeof(int), cudaMemcpyHostToDevice);

        float time;
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, N);
        // Round up according to array size
        gridSize = (N + blockSize - 1) / blockSize;
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);
        printf("Occupancy calculator elapsed time: %3.3f ms \n", time);

        cudaEventRecord(start, 0);
        MyKernel<<<gridSize, blockSize>>>(d_vec1, d_vec2, d_vec3, N);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&time, start, stop);
        printf("Kernel elapsed time: %3.3f ms \n", time);
        printf("Blocksize %i\n", blockSize);

        cudaMemcpy(h_vec3, d_vec3, N*sizeof(int), cudaMemcpyDeviceToHost);

        for (int i=0; i<N; i++) {
            if (h_vec3[i] != h_vec4[i]) {
                printf("Error at i = %i! Host = %i; Device = %i\n", i, h_vec4[i], h_vec3[i]);
                return 1;
            }
        }

        printf("Test passed\n");
        return 0;
    }
EDIT
The cudaOccupancyMaxPotentialBlockSize function is defined in the cuda_runtime.h header file as follows:
    template<class T>
    __inline__ __host__ CUDART_DEVICE cudaError_t cudaOccupancyMaxPotentialBlockSize(
        int    *minGridSize,
        int    *blockSize,
        T       func,
        size_t  dynamicSMemSize = 0,
        int     blockSizeLimit  = 0)
    {
        return cudaOccupancyMaxPotentialBlockSizeVariableSMem(minGridSize, blockSize, func,
                   __cudaOccupancyB2DHelper(dynamicSMemSize), blockSizeLimit);
    }
The meanings of the parameters are the following:

minGridSize = Suggested minimum grid size to achieve a full machine launch.
blockSize = Suggested block size to achieve maximum occupancy.
func = Kernel function.
dynamicSMemSize = Size of dynamically allocated shared memory. Of course, it is known at runtime before any kernel launch. The size of the statically allocated shared memory is not needed, as it is inferred from the properties of func.
blockSizeLimit = Maximum size for each block. In the case of 1D kernels, it can coincide with the number of input elements.
Note that, as of CUDA 6.5, one needs to compute one's own 2D/3D block dimensions from the 1D block size suggested by the API.