CUDA Program Analysis
Source: Internet · Editor: 程序博客网 · Date: 2024/04/28 22:23
nvprof (NVIDIA command-line profiler)
usage
nvprof [options] [application] [application-arguments]
usage modes
- summary mode
nvprof matrixMul
- GPU trace mode
nvprof --print-gpu-trace matrixMul
- API trace mode
nvprof --print-api-trace matrixMul
Note: API trace can be turned off, if not needed, by using --profile-api-trace none. This reduces some of the profiling overhead, especially when the kernels are short.
Event/metric Summary Mode
nvprof --events warps_launched,local_load --metrics ipc matrixMul
- some useful events and metrics
events
- gld_inst_32bit
- gst_*
- global_load
- global_store
- local_*
- warps_launched
- active_cycles
- *_warps/ctas
- tex0_cache_sector_queries
- tex1_*
metrics
- ipc
- gld_transactions_per_request
- gst_*
- gld_efficiency
- sm_efficiency
- l2_read_transactions
- l2_tex_transactions
- l2_utilization
- Event/metric Trace Mode
nvprof --aggregate-mode off --events local_load --print-gpu-trace matrixMul
other important options
- --dependency-analysis
- Timeline
nvprof --export-profile timeline.prof <app> <app args>
- Metrics And Events
The second use case is to collect events or metrics for all kernels in an application for
which you have already collected a timeline. Collecting events or metrics for all kernels significantly changes the overall
performance characteristics of the application, because all kernel executions are serialized on the GPU.
Even though overall application performance changes, the event and metric values
for individual kernels remain correct, so you can merge the collected event and metric values
onto the previously collected timeline to get an accurate picture of the application's behavior.
nvprof --metrics achieved_occupancy,executed_ipc -o metrics.prof <app> <app args>
- Analysis for an individual kernel
nvprof --kernels <kernel specifier> --analysis-metrics -o analysis.prof <app> <app args>
metric reference
http://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference
simple compile flow
.cu -> .ptx -> .cubin -> exe
You can use "nvcc -keep" to preserve the intermediate compiled files, and "--ptxas-options=-v" to see verbose compilation output, which reports per kernel:
- number of registers used
- shared memory in bytes
- local memory in bytes
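As a concrete sketch of what ptxas reports, consider a kernel that statically allocates shared memory (the kernel and file names below are made up for illustration; the exact register count in the output depends on the GPU architecture and compiler version):

```cuda
// Compile with: nvcc --ptxas-options=-v -c block_sum.cu
// ptxas then prints, per kernel, a line resembling:
//   ptxas info : Used N registers, 1024 bytes smem, ...
__global__ void blockSum(const float *in, float *out)
{
    __shared__ float buf[256];   // 256 * 4 = 1024 bytes of static shared memory
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // tree reduction within the block, entirely in shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];
}
```

The 1024 bytes of smem in the report come directly from the `__shared__` array; spills to local memory would show up as nonzero "bytes lmem".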
cuobjdump
a disassembly tool for static inspection of compiled instructions; e.g. "cuobjdump -sass matrixMul" dumps the SASS assembly of the binary
cuda timing functions
e.g.
cudaEvent_t start, stop;
float elapsed;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
fool_kernel<<<grid, block>>>();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed, start, stop); // returns elapsed time in ms; resolution is around 0.5 us per the CUDA docs, so very short kernels are not timed precisely
printf("elapsed time %f (seconds)\n", elapsed / 1000);