The Meaning of CUDA Compute Capability


When learning GPU programming we often come across the term Compute Capability. So what exactly is compute capability?

Compute Capability

Compute capability is not an absolute measure of how powerful a GPU device is; it is a relative notion. Strictly speaking, it is the version number of an architecture. Nor is it the version number of the CUDA software platform (e.g., CUDA 7.0, CUDA 8.0).

Take the TX1, for example, whose compute capability is 5.3. This actually means:

5: the SM major version number, identifying the Maxwell architecture;

3: the SM minor version number, indicating incremental features and optimizations on top of that architecture.

As the official documentation puts it (Section 2.5):

Compute Capability

The compute capability of a device is represented by a version number, also sometimes called its "SM version". This version number identifies the features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the present GPU.

The compute capability comprises a major revision number X and a minor revision number Y and is denoted by X.Y.

Devices with the same major revision number are of the same core architecture. The major revision number is 7 for devices based on the Volta architecture, 6 for devices based on the Pascal architecture, 5 for devices based on the Maxwell architecture, 3 for devices based on the Kepler architecture, 2 for devices based on the Fermi architecture, and 1 for devices based on the Tesla architecture.

The minor revision number corresponds to an incremental improvement to the core architecture, possibly including new features.

CUDA-Enabled GPUs lists of all CUDA-enabled devices along with their compute capability. Compute Capabilities gives the technical specifications of each compute capability.

Note: The compute capability version of a particular GPU should not be confused with the CUDA version (e.g., CUDA 7.5, CUDA 8, CUDA 9), which is the version of the CUDA software platform. The CUDA platform is used by application developers to create applications that run on many generations of GPU architectures, including future GPU architectures yet to be invented. While new versions of the CUDA platform often add native support for a new GPU architecture by supporting the compute capability version of that architecture, new versions of the CUDA platform typically also include software features that are independent of hardware generation. The Tesla and Fermi architectures are no longer supported starting with CUDA 7.0 and CUDA 9.0, respectively.

Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
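
At runtime, an application can query a device's compute capability through the runtime API's cudaGetDeviceProperties(), whose cudaDeviceProp structure exposes the major and minor revision numbers. A minimal sketch (not from the original post):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // prop.major and prop.minor together form the compute capability X.Y
        printf("Device %d (%s): compute capability %d.%d\n",
               dev, prop.name, prop.major, prop.minor);
    }
    return 0;
}

On a TX1 this would report compute capability 5.3.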

Features of Different SM Versions

With the above, the meaning of compute capability should be clear. So what do these version numbers actually represent?

Each compute capability has its own set of features. What exactly do the major and minor version numbers change in terms of hardware details?

As shown in the table below, the differences in feature support (such as floating-point capabilities) are:

[Figure: screenshot of the "Feature Support per Compute Capability" table from the CUDA C Programming Guide]

In terms of technical specifications, the differences are as follows (the bottom of the table was not fully captured in the screenshot; the specifications below that point are identical across all versions):

[Figure: screenshot of the "Technical Specifications per Compute Capability" table from the CUDA C Programming Guide]
The differences between compute capabilities can thus be seen at a glance.
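
Many of the technical specifications in that table can also be read programmatically from the same cudaDeviceProp structure. A short sketch (not from the original post; the fields shown are just a small selection):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0
    printf("Max threads per block:      %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block:    %zu bytes\n", prop.sharedMemPerBlock);
    printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    printf("Multiprocessors:            %d\n", prop.multiProcessorCount);
    return 0;
}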

Examples of Typical Versions

Compute Capability 3.x

Architecture

A multiprocessor consists of:
- 192 CUDA cores for arithmetic operations (see Arithmetic Instructions for throughputs of arithmetic operations),
- 32 special function units for single-precision floating-point transcendental functions,
- 4 warp schedulers.

When a multiprocessor is given warps to execute, it first distributes them among the four schedulers. Then, at every instruction issue time, each scheduler issues two independent instructions for one of its assigned warps that is ready to execute, if any.

A multiprocessor has a read-only constant cache that is shared by all functional units and speeds up reads from the constant memory space, which resides in device memory.

There is an L1 cache for each multiprocessor and an L2 cache shared by all multiprocessors. The L1 cache is used to cache accesses to local memory, including temporary register spills. The L2 cache is used to cache accesses to local and global memory. The cache behavior (e.g., whether reads are cached in both L1 and L2 or in L2 only) can be partially configured on a per-access basis using modifiers to the load or store instruction. Some devices of compute capability 3.5 and devices of compute capability 3.7 allow opt-in to caching of global memory in both L1 and L2 via compiler options.

The same on-chip memory is used for both L1 and shared memory: It can be configured as 48 KB of shared memory and 16 KB of L1 cache or as 16 KB of shared memory and 48 KB of L1 cache or as 32 KB of shared memory and 32 KB of L1 cache, using cudaFuncSetCacheConfig()/cuFuncSetCacheConfig():

// Device code
__global__ void MyKernel()
{
    ...
}

// Host code

// Runtime API
// cudaFuncCachePreferShared: shared memory is 48 KB
// cudaFuncCachePreferEqual: shared memory is 32 KB
// cudaFuncCachePreferL1: shared memory is 16 KB
// cudaFuncCachePreferNone: no preference
cudaFuncSetCacheConfig(MyKernel, cudaFuncCachePreferShared);

Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

The L1 cache is private to each multiprocessor and shares its on-chip storage with shared memory, while the L2 cache is shared by all multiprocessors and caches accesses to local and global memory. The split between L1 cache and shared memory is configurable, as the code above shows.
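
In addition to the per-kernel cudaFuncSetCacheConfig() shown above, the runtime API also provides a device-wide preference, cudaDeviceSetCacheConfig(). A minimal sketch (not from the original post):

// Prefer 48 KB of L1 cache and 16 KB of shared memory for all kernels
// launched on the current device; individual kernels can still override
// this with cudaFuncSetCacheConfig().
cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);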

I have not yet fully understood the subsequent material on memory and caching.

Compute Capability 5.x

Architecture

A multiprocessor consists of:
- 128 CUDA cores for arithmetic operations (see Arithmetic Instructions for throughputs of arithmetic operations),
- 32 special function units for single-precision floating-point transcendental functions,
- 4 warp schedulers.

Read more at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html


The up-to-date compute capability table can be found here:

https://developer.nvidia.com/cuda-gpus
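
Once you have looked up your GPU's compute capability there, you pass it to nvcc when compiling. For example, targeting the TX1's compute capability 5.3 might look like this (a sketch; my_kernel.cu is a hypothetical source file, and the value should match your own GPU):

nvcc -gencode arch=compute_53,code=sm_53 my_kernel.cu -o my_kernel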

