CUDA Concurrency (Stream Concurrency, Host-Device Concurrency)
In CUDA, the following operations can execute concurrently with one another:
Computation on the host
Computation on the device
Memory transfers from the host to the device
Memory transfers from the device to the host
Memory transfers within the memory of a single device
Memory transfers between devices
Note that these six kinds of operations can run concurrently with each other; within any single category there is no concurrency (except for the second item: computation on the device can be made concurrent through streams). Not every NVIDIA GPU supports all of the concurrency above, so query the device's capabilities before relying on it, as in the sketch below.
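A minimal query sketch, assuming device 0 (concurrentKernels and asyncEngineCount are fields of cudaDeviceProp in the runtime API):

#include <stdio.h>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // assumes device 0

    // 1 when kernels from different streams can run concurrently
    printf("concurrentKernels: %d\n", prop.concurrentKernels);

    // 0: no copy/kernel overlap; 1: one copy direction can overlap kernels;
    // 2: copies in both directions can overlap kernels and each other
    printf("asyncEngineCount:  %d\n", prop.asyncEngineCount);
    return 0;
}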
Concurrency between host and device. The following operations return control to the host before the device has finished them:
Kernel launches
Memory copies within the memory of a single device
Host-to-device copies of memory blocks of 64 KB or less
Memory copies performed by functions with the Async suffix
Memory set function calls (e.g., setting a block of memory to zero)
Finally, kernel launches can be made synchronous by setting the environment variable CUDA_LAUNCH_BLOCKING to 1; this feature is intended for debugging only.
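A minimal sketch of host/device overlap, with a placeholder kernel and placeholder host work:

#include <cuda_runtime.h>

__global__ void busy_kernel(float *d_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_out[i] = 2.0f * i;   // placeholder device work
}

void do_host_work() { /* placeholder CPU work */ }

int main()
{
    const int n = 256;
    float *d_out;
    cudaMalloc(&d_out, n * sizeof(float));

    busy_kernel<<<1, n>>>(d_out, n);  // returns to the host immediately
    do_host_work();                   // runs on the CPU while the kernel executes
    cudaDeviceSynchronize();          // block the host until the device is done

    cudaFree(d_out);
    return 0;
}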
Concurrency between kernels:
Devices of compute capability 2.0 and higher support concurrent kernel execution. This can be checked by querying the device property concurrentKernels; a value of 1 means the device supports it. Also note the following:
A kernel from one CUDA context cannot execute concurrently with a kernel from another CUDA context.
Kernels that make heavy use of textures or local memory are less likely to execute concurrently with other kernels.
Concurrency between data transfers and kernel execution:
Check the device property asyncEngineCount; its value is greater than 0 when the device supports this overlap.
Concurrent data transfers (i.e., a host-to-device copy overlapping a device-to-host copy):
Query asyncEngineCount; its value equals 2 when this is supported. Note that the host memory involved must be page-locked for copies to overlap; see the sketch below.
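A minimal sketch of overlapping transfers and kernel execution across two streams (the kernel and the two-chunk split are illustrative; cudaMallocHost provides the page-locked host buffer that async copies require):

#include <cuda_runtime.h>

__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main()
{
    const int N = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));   // page-locked host memory
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // split the work in two halves; copies and kernels issued to
    // different streams may overlap on devices that support it
    for (int i = 0; i < 2; i++)
    {
        int off = i * N / 2;
        cudaMemcpyAsync(d + off, h + off, N / 2 * sizeof(float), cudaMemcpyHostToDevice, s[i]);
        scale<<<(N / 2 + 255) / 256, 256, 0, s[i]>>>(d + off, N / 2);
        cudaMemcpyAsync(h + off, d + off, N / 2 * sizeof(float), cudaMemcpyDeviceToHost, s[i]);
    }

    cudaDeviceSynchronize();
    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}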
Streams
All of the concurrency above is managed through streams: different streams can execute concurrently with one another, while the operations inside a single stream execute in order.
Stream synchronization:
// Wait until all streams have finished all work
cudaDeviceSynchronize()
// Wait until all tasks in the given stream have finished
cudaStreamSynchronize(cudaStream_t stream)
// Make a stream wait for an event before proceeding; the sample at the end of this article uses this function. Note that this call itself does not block the host.
cudaStreamWaitEvent(cudaStream_t stream, cudaEvent_t event, unsigned int flags)
// Query whether all tasks in a stream have completed
cudaStreamQuery(cudaStream_t stream)
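A minimal sketch of in-stream ordering and the non-blocking query (the two kernels are placeholders):

#include <cuda_runtime.h>

__global__ void step1(int *d) { d[0] = 1; }
__global__ void step2(int *d) { d[0] += 1; }  // sees step1's result: same stream, in-order

int main()
{
    int *d;
    cudaMalloc(&d, sizeof(int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    step1<<<1, 1, 0, stream>>>(d);   // executes first
    step2<<<1, 1, 0, stream>>>(d);   // executes only after step1 completes

    // non-blocking poll: returns cudaSuccess when the stream is idle,
    // cudaErrorNotReady while work is still pending
    while (cudaStreamQuery(stream) == cudaErrorNotReady)
    {
        /* the host could do other work here */
    }

    cudaStreamDestroy(stream);
    cudaFree(d);
    return 0;
}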
The above is explicit stream synchronization; implicit synchronization happens in the following cases. Commands from two different streams cannot run concurrently if the host thread issues any of the following operations between them:
a page-locked host memory allocation
a device memory allocation
a device memory set
a memory copy between two addresses to the same device memory
any command issued to the default stream
An illustrative pitfall is sketched below.
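A hypothetical sketch of the pitfall: a cudaMalloc issued between two launches serializes kernels that could otherwise overlap (the kernels are placeholders):

#include <cuda_runtime.h>

__global__ void kernelA(float *d) { d[0] = 1.0f; }
__global__ void kernelB(float *d) { d[0] = 2.0f; }

int main()
{
    float *dA, *dB, *dTmp;
    cudaStream_t s1, s2;
    cudaMalloc(&dA, sizeof(float));
    cudaMalloc(&dB, sizeof(float));
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    kernelA<<<1, 1, 0, s1>>>(dA);

    // a device memory allocation issued here by the host thread
    // prevents kernelA and kernelB from running concurrently
    cudaMalloc(&dTmp, 1024);

    kernelB<<<1, 1, 0, s2>>>(dB);

    cudaDeviceSynchronize();
    cudaFree(dA); cudaFree(dB); cudaFree(dTmp);
    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    return 0;
}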
Adding a callback to a stream:
cudaStreamAddCallback(cudaStream_t stream, cudaStreamCallback_t callback, void *userData, unsigned int flags)
Note the third parameter: it is the user data passed through to the callback. The definition of cudaStreamCallback_t:
typedef void (CUDART_CB *cudaStreamCallback_t)(cudaStream_t stream, cudaError_t status, void *userData);
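A minimal usage sketch (the callback body and the userData string are illustrative; the runtime requires flags to be 0, and the callback must not call any CUDA API functions):

#include <stdio.h>
#include <cuda_runtime.h>

// called on the host once all preceding work in the stream has completed
void CUDART_CB myCallback(cudaStream_t stream, cudaError_t status, void *userData)
{
    printf("stream finished, status=%d, tag=%s\n", status, (const char *)userData);
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // ... enqueue kernels / async copies into stream here ...

    // userData is handed back to the callback unchanged; flags must be 0
    cudaStreamAddCallback(stream, myCallback, (void *)"batch-0", 0);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    return 0;
}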
Stream priorities. A stream's priority can be specified when it is created; the functions involved are:
cudaStreamCreateWithPriority()
cudaDeviceGetStreamPriorityRange()
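A minimal sketch (in CUDA, numerically lower priority values mean higher priority, so the value returned through greatestPriority is the lowest number):

#include <cuda_runtime.h>

int main()
{
    int leastPriority, greatestPriority;
    // returns the valid priority range for this device
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    cudaStream_t highPrio, lowPrio;
    cudaStreamCreateWithPriority(&highPrio, cudaStreamDefault, greatestPriority);
    cudaStreamCreateWithPriority(&lowPrio,  cudaStreamDefault, leastPriority);

    // pending work in highPrio is preferred by the scheduler over work in lowPrio

    cudaStreamDestroy(highPrio);
    cudaStreamDestroy(lowPrio);
    return 0;
}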
Events
Events serve two purposes.
The first is to measure the elapsed time of operations in a stream. Create an event with cudaEventCreate(cudaEvent_t *event) and record a time point in a given stream with cudaEventRecord(cudaEvent_t event, cudaStream_t stream). Since the host prints the timing information, it has to wait for the event to complete, so an event synchronization is also required: cudaEventSynchronize(cudaEvent_t event). A standard timing sequence:
cudaEvent_t start, end;
cudaEventCreate(&start);
cudaEventCreate(&end);
cudaEventRecord(start, stream);
// some stream operations
cudaEventRecord(end, stream);
cudaEventSynchronize(end);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, end);  // elapsed time in milliseconds
The second is to act as a marker placed after an operation in a stream, signaling that the operation has completed.
In that case, create the event with cudaEventCreateWithFlags(cudaEvent_t *event, unsigned int flags),
where flags should be set to cudaEventDisableTiming; flags has other values as well, see the manual.
Synchronous calls; related function:
cudaSetDeviceFlags(unsigned int flags): controls how the host thread waits for the device during synchronous calls; see the manual for details.
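An illustrative sketch with one of those flags, cudaDeviceScheduleBlockingSync (the call must come before the device context is first initialized):

#include <cuda_runtime.h>

int main()
{
    // must be set before any call that initializes the device context
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    cudaSetDevice(0);

    // from here on, synchronous calls such as cudaDeviceSynchronize()
    // block the host thread instead of spin-waiting, trading a little
    // latency for much lower CPU usage
    return 0;
}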
Concurrent kernel execution. In the sample below, note that streams synchronize with one another through events: one stream waits until another stream has completed a given operation before it proceeds. In this case, the events should be created with:
checkCudaErrors(cudaEventCreateWithFlags(&(kernelEvent[i]), cudaEventDisableTiming));
and used so that one stream (here, the last one) waits for the events recorded in the other streams before it continues:
for (int i = 0; i < nkernels; ++i)
{
    clock_block<<<1,1,0,streams[i]>>>(&d_a[i], time_clocks);
    total_clocks += time_clocks;
    checkCudaErrors(cudaEventRecord(kernelEvent[i], streams[i]));

    // make the last stream wait for the kernel event to be recorded
    checkCudaErrors(cudaStreamWaitEvent(streams[nstreams-1], kernelEvent[i], 0));
}

The complete sample:
/*
 * Copyright 1993-2015 NVIDIA Corporation. All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 */

//
// This sample demonstrates the use of streams for concurrent execution. It also illustrates how to
// introduce dependencies between CUDA streams with the new cudaStreamWaitEvent function introduced
// in CUDA 3.2.
//
// Devices of compute capability 1.x will run the kernels one after another
// Devices of compute capability 2.0 or higher can overlap the kernels
//

#include <stdio.h>
#include <helper_functions.h>
#include <helper_cuda.h>

// This is a kernel that does no real work but runs at least for a specified number of clocks
__global__ void clock_block(clock_t *d_o, clock_t clock_count)
{
    unsigned int start_clock = (unsigned int) clock();
    clock_t clock_offset = 0;

    while (clock_offset < clock_count)
    {
        unsigned int end_clock = (unsigned int) clock();

        // The code below should work like
        // this (thanks to modular arithmetics):
        //
        // clock_offset = (clock_t) (end_clock > start_clock ?
        //                           end_clock - start_clock :
        //                           end_clock + (0xffffffffu - start_clock));
        //
        // Indeed, let m = 2^32 then
        // end - start = end + m - start (mod m).

        clock_offset = (clock_t)(end_clock - start_clock);
    }

    d_o[0] = clock_offset;
}

// Single warp reduction kernel
__global__ void sum(clock_t *d_clocks, int N)
{
    __shared__ clock_t s_clocks[32];

    clock_t my_sum = 0;

    for (int i = threadIdx.x; i < N; i += blockDim.x)
    {
        my_sum += d_clocks[i];
    }

    s_clocks[threadIdx.x] = my_sum;
    __syncthreads();

    for (int i = 16; i > 0; i /= 2)
    {
        if (threadIdx.x < i)
        {
            s_clocks[threadIdx.x] += s_clocks[threadIdx.x + i];
        }

        __syncthreads();
    }

    d_clocks[0] = s_clocks[0];
}

int main(int argc, char **argv)
{
    int nkernels = 8;                          // number of concurrent kernels
    int nstreams = nkernels + 1;               // use one more stream than concurrent kernel
    int nbytes = nkernels * sizeof(clock_t);   // number of data bytes
    float kernel_time = 10;                    // time the kernel should run in ms
    float elapsed_time;                        // timing variables
    int cuda_device = 0;

    printf("[%s] - Starting...\n", argv[0]);

    // get number of kernels if overridden on the command line
    if (checkCmdLineFlag(argc, (const char **)argv, "nkernels"))
    {
        nkernels = getCmdLineArgumentInt(argc, (const char **)argv, "nkernels");
        nstreams = nkernels + 1;
    }

    // use command-line specified CUDA device, otherwise use device with highest Gflops/s
    cuda_device = findCudaDevice(argc, (const char **)argv);

    cudaDeviceProp deviceProp;
    checkCudaErrors(cudaGetDevice(&cuda_device));
    checkCudaErrors(cudaGetDeviceProperties(&deviceProp, cuda_device));

    if ((deviceProp.concurrentKernels == 0))
    {
        printf("> GPU does not support concurrent kernel execution\n");
        printf("  CUDA kernel runs will be serialized\n");
    }

    printf("> Detected Compute SM %d.%d hardware with %d multi-processors\n",
           deviceProp.major, deviceProp.minor, deviceProp.multiProcessorCount);

    // allocate host memory
    clock_t *a = 0;   // pointer to the array data in host memory
    checkCudaErrors(cudaMallocHost((void **)&a, nbytes));

    // allocate device memory
    clock_t *d_a = 0;   // pointers to data and init value in the device memory
    checkCudaErrors(cudaMalloc((void **)&d_a, nbytes));

    // allocate and initialize an array of stream handles
    cudaStream_t *streams = (cudaStream_t *) malloc(nstreams * sizeof(cudaStream_t));

    for (int i = 0; i < nstreams; i++)
    {
        checkCudaErrors(cudaStreamCreate(&(streams[i])));
    }

    // create CUDA event handles
    cudaEvent_t start_event, stop_event;
    checkCudaErrors(cudaEventCreate(&start_event));
    checkCudaErrors(cudaEventCreate(&stop_event));

    // the events are used for synchronization only and hence do not need to record timings
    // this also makes events not introduce global sync points when recorded which is critical to get overlap
    cudaEvent_t *kernelEvent;
    kernelEvent = (cudaEvent_t *) malloc(nkernels * sizeof(cudaEvent_t));

    for (int i = 0; i < nkernels; i++)
    {
        checkCudaErrors(cudaEventCreateWithFlags(&(kernelEvent[i]), cudaEventDisableTiming));
    }

    //////////////////////////////////////////////////////////////////////
    // time execution with nkernels streams
    clock_t total_clocks = 0;
#if defined(__arm__) || defined(__aarch64__)
    // the kernel takes more time than the channel reset time on arm archs, so to prevent hangs reduce time_clocks.
    clock_t time_clocks = (clock_t)(kernel_time * (deviceProp.clockRate / 1000));
#else
    clock_t time_clocks = (clock_t)(kernel_time * deviceProp.clockRate);
#endif

    cudaEventRecord(start_event, 0);

    // queue nkernels in separate streams and record when they are done
    for (int i = 0; i < nkernels; ++i)
    {
        clock_block<<<1,1,0,streams[i]>>>(&d_a[i], time_clocks);
        total_clocks += time_clocks;
        checkCudaErrors(cudaEventRecord(kernelEvent[i], streams[i]));

        // make the last stream wait for the kernel event to be recorded
        checkCudaErrors(cudaStreamWaitEvent(streams[nstreams-1], kernelEvent[i], 0));
    }

    // queue a sum kernel and a copy back to host in the last stream.
    // the commands in this stream get dispatched as soon as all the kernel events have been recorded
    sum<<<1,32,0,streams[nstreams-1]>>>(d_a, nkernels);
    checkCudaErrors(cudaMemcpyAsync(a, d_a, sizeof(clock_t), cudaMemcpyDeviceToHost, streams[nstreams-1]));

    // at this point the CPU has dispatched all work for the GPU and can continue processing other tasks in parallel
    // in this sample we just wait until the GPU is done
    checkCudaErrors(cudaEventRecord(stop_event, 0));
    checkCudaErrors(cudaEventSynchronize(stop_event));
    checkCudaErrors(cudaEventElapsedTime(&elapsed_time, start_event, stop_event));

    printf("Expected time for serial execution of %d kernels = %.3fs\n", nkernels, nkernels * kernel_time/1000.0f);
    printf("Expected time for concurrent execution of %d kernels = %.3fs\n", nkernels, kernel_time/1000.0f);
    printf("Measured time for sample = %.3fs\n", elapsed_time/1000.0f);

    bool bTestResult = (a[0] > total_clocks);

    // release resources
    for (int i = 0; i < nkernels; i++)
    {
        cudaStreamDestroy(streams[i]);
        cudaEventDestroy(kernelEvent[i]);
    }

    free(streams);
    free(kernelEvent);

    cudaEventDestroy(start_event);
    cudaEventDestroy(stop_event);
    cudaFreeHost(a);
    cudaFree(d_a);

    if (!bTestResult)
    {
        printf("Test failed!\n");
        exit(EXIT_FAILURE);
    }

    printf("Test passed\n");
    exit(EXIT_SUCCESS);
}