CUDA Summary: Synchronization

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 18:48

Synchronization behavior of the CUDA Runtime API

From the CUDA Runtime API documentation, section 2, "API synchronization behavior":
The API provides memcpy/memset functions in both synchronous and asynchronous forms, the latter having an “Async” suffix. This is a misnomer as each function may exhibit synchronous or asynchronous behavior depending on the arguments passed to the function. In the reference documentation, each memcpy function is categorized as synchronous or asynchronous, corresponding to the definitions below.

Memcpy

Synchronous
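Synchronous copies. These are the default copy functions, i.e. the ones without the "Async" suffix, such as cudaMemcpy.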
All transfers involving Unified Memory regions are fully synchronous with respect to the host.

  • For transfers from pageable host memory to device memory, a stream sync is performed before the copy is initiated. The function will return once the pageable buffer has been copied to the staging memory for DMA transfer to device memory, but the DMA to final destination may not have completed.

  • For transfers from pinned host memory to device memory, the function is synchronous with respect to the host.

  • For transfers from device to either pageable or pinned host memory, the function returns only once the copy has completed.

  • For transfers from device memory to device memory, no host-side synchronization is performed.

  • For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.
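The rules above can be tried out with a small round trip. This is only a minimal sketch: the helper name roundTripMismatches and the CHECK macro are made up for illustration, and the buffer size is arbitrary. It uses pageable memory from malloc and the plain cudaMemcpy in both directions.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t e_ = (call);                                       \
        if (e_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",         \
                         cudaGetErrorString(e_), __LINE__);            \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// Round-trips a pageable buffer through device memory using only the
// synchronous cudaMemcpy and returns the number of mismatching elements.
int roundTripMismatches(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *h_src = static_cast<int *>(std::malloc(bytes));  // pageable
    int *h_dst = static_cast<int *>(std::malloc(bytes));  // pageable
    if (!h_src || !h_dst) std::exit(EXIT_FAILURE);
    for (int i = 0; i < n; ++i) { h_src[i] = i; h_dst[i] = -1; }

    int *d_buf = nullptr;
    CHECK(cudaMalloc((void **)&d_buf, bytes));

    // Pageable host -> device: the call returns once h_src has been staged,
    // even if the DMA into d_buf is still in flight.
    CHECK(cudaMemcpy(d_buf, h_src, bytes, cudaMemcpyHostToDevice));
    h_src[0] = 12345;  // safe: the call above has already read h_src

    // Device -> host: returns only once the copy has completed, so h_dst
    // can be read right away without any extra synchronization.
    CHECK(cudaMemcpy(h_dst, d_buf, bytes, cudaMemcpyDeviceToHost));

    int mismatches = 0;
    for (int i = 0; i < n; ++i) mismatches += (h_dst[i] != i);

    CHECK(cudaFree(d_buf));
    std::free(h_src);
    std::free(h_dst);
    return mismatches;
}

int main() {
    std::printf("mismatches: %d\n", roundTripMismatches(1 << 20));
    return 0;
}
```

Both copies go through the default stream, so the second copy is ordered after the first even though the first may return before its DMA has finished.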

Asynchronous
Asynchronous copies. These require an explicit call to the Async variant of the memcpy function, such as cudaMemcpyAsync.

  • For transfers from device memory to pageable host memory, the function will return only once the copy has completed.

  • For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.

  • For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread.
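As a rough illustration of the asynchronous case, the sketch below uses pinned host buffers so that both copies fall under "all other transfers" and are fully asynchronous. The function name asyncRoundTripMismatches and the CHECK macro are hypothetical; the stream and buffer sizes are arbitrary choices.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t e_ = (call);                                       \
        if (e_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",         \
                         cudaGetErrorString(e_), __LINE__);            \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// Copies pinned host data to the device and back with cudaMemcpyAsync,
// then returns the number of mismatching elements.
int asyncRoundTripMismatches(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *h_in = nullptr, *h_out = nullptr, *d_buf = nullptr;
    CHECK(cudaMallocHost((void **)&h_in, bytes));   // pinned host memory
    CHECK(cudaMallocHost((void **)&h_out, bytes));  // pinned host memory
    CHECK(cudaMalloc((void **)&d_buf, bytes));
    for (int i = 0; i < n; ++i) { h_in[i] = i; h_out[i] = -1; }

    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));

    // Both calls only enqueue work and return to the host immediately.
    // Operations in the same stream still run in issue order.
    CHECK(cudaMemcpyAsync(d_buf, h_in, bytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaMemcpyAsync(h_out, d_buf, bytes, cudaMemcpyDeviceToHost, stream));

    // h_in must not be modified and h_out must not be read before this point.
    CHECK(cudaStreamSynchronize(stream));

    int mismatches = 0;
    for (int i = 0; i < n; ++i) mismatches += (h_out[i] != i);

    CHECK(cudaStreamDestroy(stream));
    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return mismatches;
}

int main() {
    std::printf("mismatches: %d\n", asyncRoundTripMismatches(1 << 20));
    return 0;
}
```

If h_out came from malloc instead, the device-to-pageable-host rule above would apply and the second call would return only once the copy had completed.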

Memset
The synchronous memset functions are asynchronous with respect to the host except when the target is pinned host memory or a Unified Memory region, in which case they are fully synchronous. The Async versions are always asynchronous with respect to the host.
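For the synchronous memset functions, only pinned host memory and Unified Memory targets are handled synchronously; all other targets are asynchronous. The Async versions are always asynchronous. In practice, then, memset is asynchronous, which needs care: if the CPU depends on the result, it must wait for the memset to complete.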
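A minimal sketch of waiting for a memset on ordinary device memory follows. The function name countUnsetBytes, the fill value 0xAB, and the CHECK macro are illustrative assumptions, not part of the CUDA API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t e_ = (call);                                       \
        if (e_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",         \
                         cudaGetErrorString(e_), __LINE__);            \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// Fills a device buffer with 0xAB and returns how many bytes did not
// end up with that value after being copied back to the host.
int countUnsetBytes(size_t bytes) {
    unsigned char *d_buf = nullptr;
    CHECK(cudaMalloc((void **)&d_buf, bytes));

    // Target is plain device memory, so this call may return to the host
    // before the fill has actually finished.
    CHECK(cudaMemset(d_buf, 0xAB, bytes));

    // Explicit host-side wait. Needed whenever the CPU must know that the
    // memset is done, for example before stopping a timer.
    CHECK(cudaDeviceSynchronize());

    unsigned char *h_buf = static_cast<unsigned char *>(std::malloc(bytes));
    if (!h_buf) std::exit(EXIT_FAILURE);
    CHECK(cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDeviceToHost));

    int unset = 0;
    for (size_t i = 0; i < bytes; ++i) unset += (h_buf[i] != 0xAB);

    std::free(h_buf);
    CHECK(cudaFree(d_buf));
    return unset;
}

int main() {
    std::printf("bytes not set: %d\n", countUnsetBytes(1 << 20));
    return 0;
}
```

Work issued later to the same stream, such as the cudaMemcpy here, is already ordered after the memset; the explicit wait matters for host-side logic that does not go through the stream.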

Kernel Launches
Kernel launches are asynchronous with respect to the host. Details of concurrent kernel execution and data transfers can be found in the CUDA Programmers Guide.
Kernel launches are, of course, asynchronous.

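Note: "asynchronous" above always means asynchronous with respect to the CPU. After calling the CUDA function, the CPU immediately moves on to its next instruction and does not wait for the CUDA function to finish.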
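A small sketch of what this means for a kernel launch is given below. The kernel addOne, the helper launchAndWait, and the CHECK macro are hypothetical names chosen for this example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t e_ = (call);                                       \
        if (e_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",         \
                         cudaGetErrorString(e_), __LINE__);            \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// Illustrative kernel: adds 1 to every element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

// Launches addOne on a zeroed buffer, waits for it, and returns the number
// of elements that are not equal to 1 afterwards.
int launchAndWait(int n) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int *d_data = nullptr;
    CHECK(cudaMalloc((void **)&d_data, bytes));
    CHECK(cudaMemset(d_data, 0, bytes));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    addOne<<<blocks, threads>>>(d_data, n);  // returns to the host at once
    CHECK(cudaGetLastError());               // reports launch errors only

    // The CPU is free to do unrelated work here while the GPU runs.

    // Block the host until all previously issued device work has finished.
    CHECK(cudaDeviceSynchronize());

    int *h_data = static_cast<int *>(std::malloc(bytes));
    if (!h_data) std::exit(EXIT_FAILURE);
    CHECK(cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost));

    int wrong = 0;
    for (int i = 0; i < n; ++i) wrong += (h_data[i] != 1);

    std::free(h_data);
    CHECK(cudaFree(d_data));
    return wrong;
}

int main() {
    std::printf("wrong elements: %d\n", launchAndWait(1 << 20));
    return 0;
}
```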


Synchronization functions
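__syncthreads() synchronizes all threads within a block. In effect it synchronizes the warps of the block, because the threads inside one warp traditionally execute in lockstep. (On Volta and later GPUs, independent thread scheduling means lockstep execution within a warp is no longer guaranteed; __syncwarp() provides explicit warp-level synchronization.)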
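A typical use of __syncthreads() is a shared-memory reduction inside one block. The sketch below assumes a block size of 256 (a power of two); the kernel name blockSum, the helper sumWithBlockReduction, and the CHECK macro are made up for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t e_ = (call);                                       \
        if (e_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",         \
                         cudaGetErrorString(e_), __LINE__);            \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

constexpr int kThreads = 256;  // assumed block size, must be a power of two

// Each block sums its slice of `in` and writes one partial sum.
__global__ void blockSum(const int *in, int *blockSums, int n) {
    __shared__ int cache[kThreads];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    cache[tid] = (i < n) ? in[i] : 0;
    __syncthreads();  // all loads into shared memory are done

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) cache[tid] += cache[tid + stride];
        // Placed outside the `if` so every thread of the block reaches it.
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = cache[0];
}

// Sums n ones on the GPU (per block) and finishes the sum on the host.
int sumWithBlockReduction(int n) {
    const int blocks = (n + kThreads - 1) / kThreads;
    int *h_in = static_cast<int *>(std::malloc(n * sizeof(int)));
    int *h_sums = static_cast<int *>(std::malloc(blocks * sizeof(int)));
    if (!h_in || !h_sums) std::exit(EXIT_FAILURE);
    for (int i = 0; i < n; ++i) h_in[i] = 1;

    int *d_in = nullptr, *d_sums = nullptr;
    CHECK(cudaMalloc((void **)&d_in, n * sizeof(int)));
    CHECK(cudaMalloc((void **)&d_sums, blocks * sizeof(int)));
    CHECK(cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice));

    blockSum<<<blocks, kThreads>>>(d_in, d_sums, n);
    CHECK(cudaGetLastError());
    // The device-to-host copy returns only after the kernel and copy are done.
    CHECK(cudaMemcpy(h_sums, d_sums, blocks * sizeof(int),
                     cudaMemcpyDeviceToHost));

    int total = 0;
    for (int b = 0; b < blocks; ++b) total += h_sums[b];

    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_sums));
    std::free(h_in);
    std::free(h_sums);
    return total;
}

int main() {
    std::printf("sum: %d\n", sumWithBlockReduction(1 << 16));
    return 0;
}
```

The per-block partial sums are combined on the host here precisely because __syncthreads() only covers a single block.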
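__syncthreads() does not reach across blocks; synchronizing between blocks requires atomic operations on global memory.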
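One way to build such coordination, shown here only as a sketch, is a "last block done" pattern: every block publishes its result, takes a ticket with atomicAdd, and the block that draws the final ticket combines all results. The names blocksDone, sumOfBlockResults, and sumOfBlockIds are hypothetical, and the per-block "result" is just blockIdx.x + 1 as a stand-in for real work.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a runtime call fails.
#define CHECK(call)                                                    \
    do {                                                               \
        cudaError_t e_ = (call);                                       \
        if (e_ != cudaSuccess) {                                       \
            std::fprintf(stderr, "CUDA error: %s (line %d)\n",         \
                         cudaGetErrorString(e_), __LINE__);            \
            std::exit(EXIT_FAILURE);                                   \
        }                                                              \
    } while (0)

// Hypothetical ticket counter shared by all blocks of the grid.
__device__ unsigned int blocksDone = 0;

__global__ void sumOfBlockResults(volatile int *partial, int *total) {
    if (threadIdx.x != 0) return;  // one thread per block does the bookkeeping

    partial[blockIdx.x] = blockIdx.x + 1;  // stand-in for a real block result
    __threadfence();  // publish the result before taking a ticket

    // Atomically take a ticket; exactly one block sees the last value.
    unsigned int ticket = atomicAdd(&blocksDone, 1u);
    if (ticket == gridDim.x - 1) {
        // Every other block has already published its partial result.
        int sum = 0;
        for (unsigned int b = 0; b < gridDim.x; ++b) sum += partial[b];
        *total = sum;
        blocksDone = 0;  // reset so the kernel can be launched again
    }
}

// Launches numBlocks blocks and returns 1 + 2 + ... + numBlocks as computed
// by whichever block finished last.
int sumOfBlockIds(int numBlocks) {
    int *d_partial = nullptr, *d_total = nullptr;
    CHECK(cudaMalloc((void **)&d_partial, numBlocks * sizeof(int)));
    CHECK(cudaMalloc((void **)&d_total, sizeof(int)));
    CHECK(cudaMemset(d_total, 0, sizeof(int)));

    sumOfBlockResults<<<numBlocks, 32>>>(d_partial, d_total);
    CHECK(cudaGetLastError());

    int h_total = 0;
    CHECK(cudaMemcpy(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost));

    CHECK(cudaFree(d_partial));
    CHECK(cudaFree(d_total));
    return h_total;
}

int main() {
    std::printf("total: %d\n", sumOfBlockIds(64));
    return 0;
}
```

No block ever waits on another in this scheme, which avoids deadlock when not all blocks are resident on the GPU at the same time. A true grid-wide barrier built from spinning on an atomic counter would need that residency guarantee.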
