CUDA Program Optimization Tips


The main techniques are the following.

1. Use shared memory to reduce global-memory reads

Avoid repeatedly reading the same data from global memory. There is a lot to this: the thread organization has to be designed so that shared memory is used as much as possible; see the classic matrix-multiplication optimization for an example.
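
A minimal sketch of the idea, in the spirit of the matrix-multiplication example mentioned above (TILE, the matrix size n, and the assumption that n is a multiple of TILE are all illustrative): each block stages the tiles of A and B it needs in shared memory, so every global-memory element is read once per tile instead of once per output element.

#define TILE 16

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];   // load one tile of A
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col]; // load one tile of B
        __syncthreads();                       // wait until the tile is fully loaded
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                       // finish with this tile before overwriting it
    }
    C[row * n + col] = acc;
}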

2. Bind global memory to a texture
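
On current CUDA versions the old texture-reference binding API is deprecated, so the sketch below uses the texture object API instead; it wraps an existing global-memory buffer so kernels read it through the texture cache. The names d_data and N are illustrative.

cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypeLinear;              // linear global-memory buffer
resDesc.res.linear.devPtr = d_data;
resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
resDesc.res.linear.sizeInBytes = N * sizeof(float);

cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;

cudaTextureObject_t tex;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

// Inside a kernel, read through the texture path with:
//     float v = tex1Dfetch<float>(tex, i);

cudaDestroyTextureObject(tex);                          // release when done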

3. Reduce bank conflicts; have different threads read consecutive memory

If the global memory accessed by a half-warp of threads starts at a multiple of 16 bytes (the precise rule is spelled out in the note after item 5), the accesses can be coalesced into a single transaction; this requires consecutive threads to access consecutive memory.

For shared memory, see http://blog.csdn.net/smsmn/article/details/6336060. (In a CPU program the analogous concern would be improving the cache hit rate, but here it does not seem to be a cache-hit-rate issue.)
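
A common way to remove shared-memory bank conflicts is to pad the shared array. Below is a minimal transpose-style sketch, assuming 32x32 thread blocks and a square matrix whose side n is a multiple of 32 (all names illustrative): without the +1 padding, every element of a tile column falls into the same bank and the column-wise read becomes a 32-way conflict.

__global__ void transpose32(const float *in, float *out, int n)
{
    __shared__ float tile[32][32 + 1];          // +1 column of padding avoids bank conflicts

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * n + x];        // coalesced, row-wise write
    __syncthreads();

    int tx = blockIdx.y * 32 + threadIdx.x;     // swapped block indices for the transpose
    int ty = blockIdx.x * 32 + threadIdx.y;
    out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];     // column-wise read, conflict-free
}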

4. Size and alignment requirements: keep memory aligned

The memory controller on the GPU reaches peak efficiency only when reads start at an address that is a multiple of some fixed size (e.g. a multiple of 16 bytes). When allocating, use cudaMallocPitch instead of cudaMalloc, and correspondingly cudaMemcpy2D instead of cudaMemcpy. (The principle is similar to the coalescing requirement above.) This applies mainly to global memory.
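
A minimal sketch of the pitched allocation and 2D copy mentioned above, assuming a width x height float image and a plain host array h_img (all names illustrative):

float *d_img;
size_t pitch;                                       // row stride in bytes, chosen by the driver
cudaMallocPitch((void **)&d_img, &pitch, width * sizeof(float), height);

cudaMemcpy2D(d_img, pitch,                          // destination and its pitch
             h_img, width * sizeof(float),          // source and its (unpadded) pitch
             width * sizeof(float), height,         // width in bytes, number of rows
             cudaMemcpyHostToDevice);

// Inside a kernel, row y starts at: (float *)((char *)d_img + y * pitch)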

5. Coalesced access

Accesses from different threads are merged: if adjacent threads access adjacent data, the accesses can be coalesced.
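
A minimal sketch of the difference, with illustrative names: in the first kernel consecutive threads touch consecutive addresses, so each warp's loads coalesce into a few transactions; in the second, consecutive threads are 'stride' elements apart and the accesses cannot be merged.

// Coalesced: thread i reads element i, so a warp reads one contiguous segment.
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Not coalesced: consecutive threads read addresses 'stride' elements apart,
// so one warp touches many separate memory segments.
__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}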


Note:

Global-memory accesses should be as contiguous as possible; this is a consequence of how DRAM is accessed. More precisely, global-memory accesses need to be "coalesced". Coalesced means that, besides being contiguous, the starting address must be a multiple of 16 times the size accessed by each thread. For example, if every thread reads 32 bits of data, the address read by the first thread must be a multiple of 16*4 = 64 bytes.


6. Use stream parallelism

Code 1:

void my_jpc_colgr_gpu(int **data_a, int height, int width, int current_lvls) { // called when height is greater than 512
    // dev_a, height_tmp and a_pitch come from surrounding code that is not shown here
    cudaStream_t stream[3];
    dim3 blocks(width);   // width thread blocks
    dim3 threads(256);    // one block computes one column; needs another scheme when height > 1024
    for (int i = 0; i < 3; i++) {
        cudaStreamCreate(&stream[i]);   // create the streams
    }
    for (int i = 0; i < 3; i++) {
        // launch each kernel in its own stream (the fourth launch-configuration argument)
        jpc_colgr_ker<<<blocks, threads, sizeof(int)*(height+1), stream[i]>>>(
            dev_a + height_tmp*i*a_pitch/4, height, width, a_pitch/4, height & 1, height_tmp*i);
    }
    cudaDeviceSynchronize();
    for (int i = 0; i < 3; i++) {
        cudaStreamDestroy(stream[i]);   // release the streams
    }
}

Stream parallelism is task-level parallelism: when we have several independent tasks, we can write a kernel for each and, if resources allow, launch these kernels on different streams, achieving a coarser-grained form of parallelism.

In experiments, stream parallelism turned out to be exactly as efficient as adding a dimension to the thread grid. The code above can therefore also be written by simply increasing the number of thread blocks.

Code 2:

void my_jpc_colgr_gpu(int **data_a, int height, int width, int current_lvls) { // called when height is greater than 512
    dim3 blocks(width, 3);  // width x 3 thread blocks replace the three streams
    dim3 threads(256);      // one block computes one column; needs another scheme when height > 1024
    jpc_colgr_ker<<<blocks, threads, sizeof(int)*(height+1)>>>(
        dev_a, height, width, a_pitch/4, height & 1, height_tmp);
}

Comparing the two code fragments above shows that their efficiency is the same.

Streams in CUDA are represented by the cudaStream_t type. The relevant API functions are: cudaStreamCreate(cudaStream_t *s) to create a stream, cudaStreamDestroy(cudaStream_t s) to destroy one, cudaStreamSynchronize() to synchronize a single stream, cudaDeviceSynchronize() to synchronize all streams on the device, and cudaStreamQuery() to query whether the work in a stream has completed. See the API reference for details.
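
A minimal sketch showing these calls together (the kernel work_ker, d_data, blocks and threads are illustrative):

cudaStream_t s;
cudaStreamCreate(&s);                            // create a stream

work_ker<<<blocks, threads, 0, s>>>(d_data);     // enqueue a kernel on that stream

if (cudaStreamQuery(s) == cudaSuccess) {
    // all work previously issued to s has finished
}
cudaStreamSynchronize(s);                        // block the host until s is done
cudaStreamDestroy(s);                            // destroy the stream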


7. Use streams to overlap data transfers with computation

Asynchronous Transfers and Overlapping Transfers with Computation

Data transfers between the host and the device using cudaMemcpy() are blocking transfers; that is, control is returned to the host thread only after the data transfer is complete. The cudaMemcpyAsync() function is a non-blocking variant of cudaMemcpy() in which control is returned immediately to the host thread. In contrast with cudaMemcpy(), the asynchronous transfer version requires pinned host memory (see Pinned Memory), and it contains an additional argument, a stream ID. A stream is simply a sequence of operations that are performed in order on the device. Operations in different streams can be interleaved and in some cases overlapped; this property can be used to hide data transfers between the host and the device.

Asynchronous transfers enable overlap of data transfers with computation in two different ways. On all CUDA-enabled devices, it is possible to overlap host computation with asynchronous data transfers and with device computations. For example, Overlapping computation and data transfers demonstrates how host computation in the routine cpuFunction() is performed while data is transferred to the device and a kernel using the device is executed.

Overlapping computation and data transfers

cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cpuFunction();

The last argument to the cudaMemcpyAsync() function is the stream ID, which in this case uses the default stream, stream 0. The kernel also uses the default stream, and it will not begin execution until the memory copy completes; therefore, no explicit synchronization is needed. Because the memory copy and the kernel both return control to the host immediately, the host function cpuFunction() overlaps their execution.

In Overlapping computation and data transfers, the memory copy and kernel execution occur sequentially. On devices that are capable of “concurrent copy and execute,” it is possible to overlap kernel execution on the device with data transfers between the host and the device. Whether a device has this capability is indicated by the deviceOverlap field of a cudaDeviceProp variable (or listed in the output of the deviceQuery SDK sample). On devices that have this capability, the overlap once again requires pinned host memory, and, in addition, the data transfer and kernel must use different, non-default streams (streams with non-zero stream IDs). Non-default streams are required for this overlap because memory copy, memory set functions, and kernel calls that use the default stream begin only after all preceding calls on the device (in any stream) have completed, and no operation on the device (in any stream) commences until they are finished.

Concurrent copy and execute illustrates the basic technique.

Concurrent copy and execute

cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(otherData_d);

In this code, two streams are created and used in the data transfer and kernel executions as specified in the last arguments of the cudaMemcpyAsync call and the kernel's execution configuration.

Concurrent copy and execute demonstrates how to overlap kernel execution with asynchronous data transfer. This technique could be used when the data dependency is such that the data can be broken into chunks and transferred in multiple stages, launching multiple kernels to operate on each chunk as it arrives. Sequential copy and execute and Staged concurrent copy and execute demonstrate this. They produce equivalent results. The first segment shows the reference sequential implementation, which transfers and operates on an array of N floats (where N is assumed to be evenly divisible by nThreads).

Sequential copy and execute

cudaMemcpy(a_d, a_h, N*sizeof(float), dir);
kernel<<<N/nThreads, nThreads>>>(a_d);

Staged concurrent copy and execute shows how the transfer and kernel execution can be broken up into nStreams stages. This approach permits some overlapping of the data transfer and execution.

Staged concurrent copy and execute

size = N*sizeof(float)/nStreams;
for (i = 0; i < nStreams; i++) {
    offset = i*N/nStreams;
    cudaMemcpyAsync(a_d+offset, a_h+offset, size, dir, stream[i]);
}
for (i = 0; i < nStreams; i++) {
    offset = i*N/nStreams;
    kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d+offset);
}

(In Staged concurrent copy and execute, it is assumed that N is evenly divisible by nThreads*nStreams.) Because execution within a stream occurs sequentially, none of the kernels will launch until the data transfers in their respective streams complete. Current hardware can simultaneously process an asynchronous data transfer and execute kernels. (It should be mentioned that it is not possible to overlap a blocking transfer with an asynchronous transfer, because the blocking transfer occurs in the default stream, and so it will not begin until all previous CUDA calls complete. It will not allow any other CUDA call to begin until it has completed.) A diagram depicting the timeline of execution for the two code segments is shown in Figure 1, and nStreams=4 for Staged concurrent copy and execute is shown in the bottom half.

Figure 1. Timeline Comparison for Sequential (top) and Concurrent (bottom) Copy and Kernel Execution

For this example, it is assumed that the data transfer and kernel execution times are comparable. In such cases, and when the execution time (tE) exceeds the transfer time (tT), a rough estimate for the overall time is tE + tT/nStreams for the staged version versus tE + tT for the sequential version. If the transfer time exceeds the execution time, a rough estimate for the overall time is tT + tE/nStreams.

8. Use pinned memory to speed up copies between host memory and device memory

One effective approach is to register existing CPU memory as page-locked (pinned) memory with cudaHostRegister().
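
A minimal sketch of registering an existing host buffer as pinned memory and then using it in an asynchronous copy (bytes, d_buf and stream are illustrative):

float *h_buf = (float *)malloc(bytes);
cudaHostRegister(h_buf, bytes, cudaHostRegisterDefault);    // page-lock the buffer

cudaMemcpyAsync(d_buf, h_buf, bytes, cudaMemcpyHostToDevice, stream);
// ... kernels on the same stream run after the copy; other streams can overlap ...

cudaHostUnregister(h_buf);                                  // unpin before freeing
free(h_buf);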

Pinned Memory

Page-locked or pinned memory transfers attain the highest bandwidth between the host and the device. On PCIe ×16 Gen2 cards, for example, pinned memory can attain greater than 5 GBps transfer rates.

Pinned memory is allocated using the cudaMallocHost() or cudaHostAlloc() functions in the Runtime API. The bandwidthTest.cu program in the CUDA SDK shows how to use these functions as well as how to measure memory transfer performance.

Pinned memory should not be overused. Excessive use can reduce overall system performance because pinned memory is a scarce resource. How much is too much is difficult to tell in advance, so as with all optimizations, test the applications and the systems they run on for optimal performance parameters.
