Nsight Debugging Notes
Problem 1: Low Memcpy/Compute Overlap
The percentage of time when memcpy is being performed in parallel with compute is low.
From the manual, Chapter 9: Memory Optimizations
9.1 Data Transfer Between Host and Device
High Priority:
1. Minimize data transfer between the host and the device, even if that means running some kernels on the device that show no speedup compared with running them on the host.
2. Create intermediate data structures in device memory, operate on them there, and remember to destroy them without ever copying them back to the host.
3. Use pinned memory (i.e., page-locked, non-pageable memory), but don't overuse it.
Functions: cudaHostAlloc(), cudaHostRegister() (for regions of system memory that have already been pre-allocated)
4. Use cudaMemcpyAsync() instead of cudaMemcpy().
Example:
    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
    kernel<<<grid, block>>>(a_d);
    cpuFunction();
cpuFunction() overlaps the kernel execution (host-side code runs concurrently with device-side code).
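As a sketch of item 3 above, here are the two pinned-memory APIs side by side. This is illustrative only (the buffer size is made up, error checking is omitted, and it needs a machine with the CUDA toolkit installed):

```cuda
// Sketch only: pinned (page-locked) host memory, two ways.
#include <cuda_runtime.h>
#include <stdlib.h>

int main(void) {
    size_t nBytes = 1 << 20;              // illustrative size (1 MiB)

    // (a) Allocate pinned memory directly.
    float *a_h;
    cudaHostAlloc((void**)&a_h, nBytes, cudaHostAllocDefault);

    // (b) Pin a region that was already allocated with malloc().
    float *b_h = (float*)malloc(nBytes);
    cudaHostRegister(b_h, nBytes, cudaHostRegisterDefault);

    // ... pinned buffers are what make cudaMemcpyAsync() truly asynchronous ...

    cudaHostUnregister(b_h);
    free(b_h);
    cudaFreeHost(a_h);
    return 0;
}
```

cudaHostRegister() is the one to reach for when the buffer comes from code you don't control; otherwise cudaHostAlloc() is simpler.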
However, CUDA supports a further optimization: overlapping host-device data transfer with kernel computation, so that compute and memcpy proceed at the same time. This is where streams come in. Without further ado, the code:
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
    kernel<<<grid, block, 0, stream2>>>(otherData_d);  // note: otherData, not a_d
This code uses stream1 for the memcpy and stream2 for the kernel, killing two birds with one stone. (By contrast, cudaMemcpy() forces the copy to complete before the kernel can launch.)
Use case: this technique applies when the data dependency is such that the data can be broken into chunks and transferred in multiple stages, launching multiple kernels to operate on each chunk as it arrives.
I'll explain that statement after the example below.
    // Sequential copy and execute
    cudaMemcpy(a_d, a_h, N*sizeof(float), cudaMemcpyHostToDevice);
    kernel<<<N/nThreads, nThreads>>>(a_d);
    // Staged concurrent copy and execute
    size = N*sizeof(float)/nStreams;
    for (int i = 0; i < nStreams; i++) {
        offset = i*N/nStreams;
        cudaMemcpyAsync(a_d+offset, a_h+offset, size, cudaMemcpyHostToDevice, stream[i]);
        kernel<<<N/(nThreads*nStreams), nThreads, 0, stream[i]>>>(a_d+offset);
    }
Of the two snippets above, the first uses no streams, so the memcpy and the kernel run strictly in sequence; the second uses streams to stage the work asynchronously. The performance gain is not dramatic (nothing like the n-fold speedups some programs boast), but the idea is elegant.
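One detail the staged snippet leaves out: before the host uses the results, it must wait for the asynchronous work, and the streams should eventually be released. A minimal sketch, assuming the `stream[i]` handles were created earlier with cudaStreamCreate():

```cuda
for (int i = 0; i < nStreams; i++) {
    cudaStreamSynchronize(stream[i]);  // wait for this stream's copy + kernel
    cudaStreamDestroy(stream[i]);
}
// Or simply cudaDeviceSynchronize() to wait for all outstanding device work.
```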
The manual gives a formula for estimating the total time. Let tE = execution time and tT = transfer time:

    if tE > tT:  total time ≈ tE + tT/nStreams
    if tE < tT:  total time ≈ tT + tE/nStreams

In other words, the longer phase is fully exposed while all but the first chunk of the shorter phase is hidden, so the overlap saves roughly min(tE, tT)·(1 - 1/nStreams) compared with the sequential tE + tT; the benefit is greatest when tE and tT are comparable.
---------------------------------------------------------------------------------
Above we saw how to use streams to overlap memcpy with kernel execution and hide latency.
Below we use zero copy (requires CUDA version >= 2.2) to achieve the same goal.
As before, code first:
    float *a_h, *a_map;
    ...
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) exit(0);
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc(&a_h, nBytes, cudaHostAllocMapped);
    cudaHostGetDevicePointer(&a_map, a_h, 0);
    kernel<<<gridSize, blockSize>>>(a_map);
Even without streams, this overlaps CPU-GPU memory transfer with computation: the kernel reads the mapped host memory on demand while it runs.