CUDA Program Analysis


NVIDIA Visual Profiler (nvprof)

usage

nvprof [options] [application] [application-arguments]

usage modes

  • summary mode
nvprof matrixMul
  • track gpu trace
nvprof --print-gpu-trace matrixMul
  • track API trace

    nvprof --print-api-trace matrixMul

    Note: API trace can be turned off, if not needed, by using --profile-api-trace none. This reduces some of the profiling overhead, especially when the kernels are short.

  • Event/metric Summary Mode

nvprof --events warps_launched,local_load --metrics ipc matrixMul
  • some useful events and metrics
  • events

    • gld_inst_32bit
    • gst_*
    • global_load
    • global_store
    • local_*
    • warps_launched
    • active_cycles
    • *_warps/ctas
    • tex0_cache_sector_queries
    • tex1_*
  • metrics

    • ipc
    • gld_transactions_per_request
    • gst_*
    • gld_efficiency
    • sm_efficiency
    • l2_read_transactions
    • l2_tex_transactions
    • l2_utilization
  • example

nvprof --events warps_launched,local_load --metrics ipc matrixMul
  • Event/metric Trace Mode
nvprof --aggregate-mode off --events local_load --print-gpu-trace matrixMul

other important options

  • --dependency-analysis
  • Timeline
nvprof --export-profile timeline.prof <app> <app args>
  • Metrics And Events
    The second use case is to collect events or metrics for all kernels in an application for
    which you have already collected a timeline. Collecting events or metrics for all kernels
    significantly changes the overall performance characteristics of the application, because
    all kernel executions are serialized on the GPU. Even though overall application performance
    is changed, the event or metric values for individual kernels remain correct, so you can merge
    the collected values onto a previously collected timeline to get an accurate picture of the
    application's behavior.
  nvprof --metrics achieved_occupancy,executed_ipc -o metrics.prof <app> <app args>
  • Analysis For Individual Kernel
nvprof --kernels <kernel specifier> --analysis-metrics -o analysis.prof <app> <app args>

metric reference

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference

simple compile flow

.cu -> .ptx -> .cubin -> exe

you can use "nvcc -keep" to preserve the intermediate compiled files

"--ptxas-options=-v" shows verbose compilation output, including:

- number of registers used
- shared memory in bytes
- local memory in bytes
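As a sketch of the two flags above (assuming the CUDA toolkit is on PATH; the source file name matrixMul.cu is hypothetical):

```shell
# keep intermediate files (.ptx, .cubin, ...) alongside the build
nvcc -keep -o matrixMul matrixMul.cu

# per-kernel resource report from ptxas: registers used,
# shared/constant/local memory in bytes, spill loads/stores
nvcc --ptxas-options=-v -o matrixMul matrixMul.cu
```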

cuobjdump

a disassembly tool

shows the static instructions (SASS) embedded in a binary
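For example (binary name hypothetical; the -sass and -ptx flags are cuobjdump's standard options):

```shell
# disassemble embedded device code to SASS (the static instructions)
cuobjdump -sass matrixMul

# dump the embedded PTX instead
cuobjdump -ptx matrixMul
```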

cuda timing functions

e.g.

cudaEvent_t start, stop;
float elapsed;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
fool_kernel<<<grid, block>>>();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
// returns time in ms; accurate to about 0.5 ms, so not very precise
cudaEventElapsedTime(&elapsed, start, stop);
printf("elapsed time %f (seconds)\n", elapsed / 1000);
cudaEventDestroy(start);
cudaEventDestroy(stop);