CUDA Program Analysis


NVIDIA Visual Profiler (nvprof)

usage

nvprof [options] [application] [application-arguments]

usage modes

  • summary mode
nvprof matrixMul
  • track gpu trace
nvprof --print-gpu-trace matrixMul
  • track API trace

    nvprof --print-api-trace matrixMul

    Note: API trace can be turned off, if not needed, by using --profile-api-trace none. This reduces some of the profiling overhead, especially when the kernels are short.

  • Event/metric Summary Mode

nvprof --events warps_launched,local_load --metrics ipc matrixMul
  • some useful events and metrics
  • events

    • gld_inst_32bit
    • gst_*
    • global_load
    • global_store
    • local_*
    • warps_launched
    • active_cycles
    • *_warps/ctas
    • tex0_cache_sector_queries
    • tex1_*
  • metrics

    • ipc
    • gld_transactions_per_request
    • gst_*
    • gld_efficiency
    • sm_efficiency
    • l2_read_transactions
    • l2_tex_transactions
    • l2_utilization
  • example

nvprof --events warps_launched,local_load --metrics ipc matrixMul
  • Event/metric Trace Mode
nvprof --aggregate-mode off --events local_load --print-gpu-trace matrixMul

other important options

  • --dependency-analysis
  • Timeline
nvprof --export-profile timeline.prof <app> <app args>
  • Metrics And Events
    The second use case is to collect events or metrics for all kernels in an application for
    which you have already collected a timeline. Collecting events or metrics for all kernels
    significantly changes the overall performance characteristics of the application, because
    all kernel executions are serialized on the GPU. Even though overall application performance
    is changed, the event or metric values for individual kernels remain correct, so you can merge
    the collected values onto a previously collected timeline to get an accurate picture of the
    application's behavior.
  nvprof --metrics achieved_occupancy,executed_ipc -o metrics.prof <app> <app args>
  • Analysis For Individual Kernel
nvprof --kernels <kernel specifier> --analysis-metrics -o analysis.prof <app> <app args>

metric reference

http://docs.nvidia.com/cuda/profiler-users-guide/index.html#metrics-reference

simple compile flow

.cu -> .ptx -> .cubin -> exe

you can use "nvcc -keep" to preserve the intermediate compiled files

"--ptxas-options=-v" shows verbose compilation output, including:

- number of registers used
- shared memory in bytes
- local memory in bytes
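As a sketch of the two flags above (assuming the CUDA toolkit is on PATH; the source file name matrixMul.cu is hypothetical):

```shell
# keep intermediate files (.ptx, .cubin, ...) alongside the build
nvcc -keep -o matrixMul matrixMul.cu

# per-kernel resource report from ptxas: registers used,
# shared/constant/local memory in bytes, spill loads/stores
nvcc --ptxas-options=-v -o matrixMul matrixMul.cu
```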

cuobjdump

a disassembly tool

shows the static instructions (SASS) embedded in a binary
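For example (binary name hypothetical; the -sass and -ptx flags are cuobjdump's standard options):

```shell
# disassemble embedded device code to SASS (the static instructions)
cuobjdump -sass matrixMul

# dump the embedded PTX instead
cuobjdump -ptx matrixMul
```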

cuda timing functions

e.g.

cudaEvent_t start, stop;
float elapsed;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
fool_kernel<<<grid, block>>>();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
// returns time in ms; accurate to about 0.5 ms, so not very precise
cudaEventElapsedTime(&elapsed, start, stop);
printf("elapsed time %f (seconds)\n", elapsed / 1000);
cudaEventDestroy(start);
cudaEventDestroy(stop);