Some of our existing and upcoming blog posts illustrate how to get the best performance from C++ AMP, with a focus on today’s DirectX 11 GPUs and on our v1 capabilities. This post serves as an index into current and future blog posts that can help you with performance tuning.

Measuring Performance of C++ AMP computations

Here are some links to help you measure performance characteristics accurately (a minimal timing sketch follows the list):

  • Analyze your parallel algorithms with the Concurrency Visualizer
  • Learn how to measure the performance of C++ AMP algorithms
  • Perform Data warm up when measuring performance
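
Tying those links together, here is a minimal sketch of the measurement pattern they describe: run the kernel once to warm up (JIT compilation and the initial copy-in), drain the accelerator’s queue with accelerator_view::wait(), and only then start the clock. The doubling kernel is a hypothetical placeholder.

```cpp
#include <amp.h>
#include <chrono>
#include <iostream>
#include <vector>
using namespace concurrency;

int main() {
    std::vector<float> v(1 << 20, 1.0f);
    array_view<float, 1> data(static_cast<int>(v.size()), v);
    accelerator_view av = accelerator().default_view;

    // Hypothetical kernel used only for illustration.
    auto run_kernel = [&] {
        array_view<float, 1> d = data;  // array_views are captured by value
        parallel_for_each(d.extent, [=](index<1> i) restrict(amp) {
            d[i] = d[i] * 2.0f + 1.0f;
        });
    };

    run_kernel();  // warm-up: triggers JIT compilation and the copy-in
    av.wait();     // drain the queue so warm-up work is excluded from timing

    auto t0 = std::chrono::high_resolution_clock::now();
    run_kernel();
    av.wait();     // measure kernel execution only, not the copy-back
    auto t1 = std::chrono::high_resolution_clock::now();

    std::cout << std::chrono::duration<double, std::milli>(t1 - t0).count()
              << " ms\n";
}
```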

Optimizing Command Submission

A major performance concern in GPU computing is the coordination between the CPU and the GPU. A large part of this cost, especially on discrete GPUs, comes from transferring data to and from the accelerator. This includes understanding how various CPU operations, copies, and kernels may occur concurrently with each other. Another overhead is the fixed launch cost that each kernel invocation incurs on the CPU side. Some tips in that area can be found in these blog posts (a minimal sketch follows the list):

  • Choose the right Queuing Mode for your accelerator_view
  • Utilize the CPU with asynchronous data transfers and continuations
  • Use Staging Arrays when appropriate
  • Avoid unnecessary copies of read-only or write-only data
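
As a minimal sketch of two of the items above (with a placeholder doubling kernel), the snippet below declares the input as array_view<const T> so it is never copied back, calls discard_data() on the output so its stale contents are never copied in, and uses synchronize_async() so the CPU can do other work while the copy-back is in flight.

```cpp
#include <amp.h>
#include <vector>
using namespace concurrency;

int main() {
    const int n = 1 << 20;
    std::vector<float> in(n, 1.0f), out(n);

    array_view<const float, 1> src(n, in);  // read-only: never copied back
    array_view<float, 1> dst(n, out);
    dst.discard_data();                     // write-only: skip the copy-in

    parallel_for_each(dst.extent, [=](index<1> i) restrict(amp) {
        dst[i] = src[i] * 2.0f;             // placeholder computation
    });

    // Kick off the copy-back and keep the CPU busy instead of blocking.
    completion_future done = dst.synchronize_async();
    // ... unrelated CPU work could run here ...
    done.get();                             // results are now in 'out'
}
```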

Understanding Compute-bound versus Memory-bound kernels

Simplistically, every thread in a kernel reads some data, does some arithmetic, and writes some results. With many threads operating concurrently, these basic steps overlap, so that while some threads are doing arithmetic, others are reading data. We say that a kernel is memory bound when it spends more net time reading and writing memory than doing arithmetic. Otherwise, we say it is compute bound.

In the memory bound case, you will want to focus your tuning first on doing less reading and writing. Even a kernel that looks compute bound may be limited by memory access patterns that use memory inefficiently. If you make a change to reduce the amount of arithmetic and do not see a performance gain, the kernel is likely memory bound, which suggests you should focus on memory usage and efficiency first. The sketch below shows one way to run this experiment.
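
One way to run that experiment, sketched below with a hypothetical saxpy-style kernel, is to add arithmetic while holding memory traffic constant: if the variant with 64 extra multiply-adds per element is barely slower, the memory system, not the ALUs, is the bottleneck.

```cpp
#include <amp.h>
using namespace concurrency;

// Baseline: roughly one multiply-add per two 4-byte loads and one store,
// a low arithmetic intensity that is typically memory bound.
void saxpy(float a, array_view<const float, 1> x, array_view<float, 1> y) {
    parallel_for_each(y.extent, [=](index<1> i) restrict(amp) {
        y[i] = a * x[i] + y[i];
    });
}

// Same memory traffic, much more arithmetic. Comparing its runtime with
// the baseline separates the cost of memory accesses from ALU work.
void saxpy_heavy(float a, array_view<const float, 1> x, array_view<float, 1> y) {
    parallel_for_each(y.extent, [=](index<1> i) restrict(amp) {
        float v = x[i];
        for (int k = 0; k < 64; ++k)  // extra arithmetic, no extra loads
            v = v * a + 0.5f;
        y[i] = v + y[i];
    });
}
```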

Optimizing compute-bound kernels

Here are some resources to help you with optimizing compute bound kernels (a sketch of the aliasing pitfall follows the list):

  • Avoid Aliased Invocation of parallel_for_each
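
Based on our reading of the post above, the pitfall is giving a single kernel more than one view over the same underlying data, which forces the compiler to assume the accesses may overlap. The sketch below is a hypothetical illustration of that pattern and its aliasing-free alternative.

```cpp
#include <amp.h>
using namespace concurrency;

// Two overlapping views of the same data inside one kernel: the compiler
// must assume data[i] and alias[i] may refer to the same element, which
// can inhibit optimization.
void scale_aliased(array_view<float, 1> data) {
    array_view<float, 1> alias = data.section(0, data.extent[0]);
    parallel_for_each(data.extent, [=](index<1> i) restrict(amp) {
        data[i] = alias[i] * 2.0f;
    });
}

// Preferred: a single view per underlying resource inside the kernel.
void scale(array_view<float, 1> data) {
    parallel_for_each(data.extent, [=](index<1> i) restrict(amp) {
        data[i] *= 2.0f;
    });
}
```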

Optimizing memory-bound kernels

Here are some resources to help you optimize memory bound kernels (a tiling sketch follows the list):

  • Learn about tiling
  • Use Constant Memory when possible
  • Take advantage of row major arrays
  • Avoid bank conflicts on tile_static memory
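
To make the tiling and bank-conflict items above concrete, here is a minimal tiled matrix-transpose sketch (it assumes dimensions that are exact multiples of the tile size; real code would pad or truncate the tiled extent). Each tile stages a block in tile_static memory so that both the global reads and the global writes stay contiguous, and the extra padding column sidesteps bank conflicts on the transposed accesses.

```cpp
#include <amp.h>
using namespace concurrency;

static const int TS = 16;  // illustrative tile size

void transpose_tiled(array_view<const float, 2> in, array_view<float, 2> out) {
    out.discard_data();  // out is fully overwritten; skip the copy-in
    parallel_for_each(in.extent.tile<TS, TS>(),
        [=](tiled_index<TS, TS> t_idx) restrict(amp) {
            // +1 column of padding so the transposed reads below do not
            // hit the same tile_static memory bank.
            tile_static float block[TS][TS + 1];

            // Cooperative, contiguous load: one element per thread.
            block[t_idx.local[0]][t_idx.local[1]] = in[t_idx.global];
            t_idx.barrier.wait();  // all loads finish before any reuse

            // Swap the tile origin so the global writes stay contiguous.
            index<2> origin = t_idx.tile_origin;
            index<2> dst(origin[1] + t_idx.local[0], origin[0] + t_idx.local[1]);
            out[dst] = block[t_idx.local[1]][t_idx.local[0]];
        });
}
```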

Performance Oriented Samples

You are encouraged to visit our official and comprehensive C++ AMP samples list and download them all. Here we re-list the samples that are more performance-tuning oriented than the others:

  • Parallel Reduction using C++ AMP shows techniques for efficient use of memory and for coordination of threads within tiles.
  • Chunking data across multiple C++ AMP kernels illustrates optimization of memory copies between host and accelerators and between multiple accelerators.
  • Matrix transpose using C++ AMP highlights the importance of effective use of memory and the use of tiling to achieve it.
  • Matrix Multiplication Sample shows how to use tile memory to avoid redundant global memory loads by multiple threads in a tile.
  • Convolution Sample also shows how to use tile memory to avoid redundant memory accesses by multiple threads in a tile.

Forum Discussion Threads with Performance Related Answers

The best place to ask questions is our MSDN Forum, and occasionally the questions we receive relate to C++ AMP performance. In case some of these questions and scenarios match the ones you are facing, you may want to visit these discussion threads:

  • Choosing tile size for tiled parallel_for_each invocation
  • ShaderResourceView vs Constant memory vs Texture - usage and performance
  • Emulating strided indices in parallel_for_each
  • Selecting good tile size for tiled IDCT kernel (computation similar to multiplying small matrices)
  • Overheads of using small array and computation size
  • Performance of accessing 2D array_view on CPU
  • Performance of copying between the CPU and an accelerator and array_view CPU access