【并行计算-CUDA开发】CUDA bank conflict in shared memory

来源：互联网发布：无间道3解析知乎编辑：程序博客网时间：2024/04/28 11:15

http://hi.baidu.com/pengkuny/item/c8070b388d75d481b611db7a

以前以为 shared memory 是一个万能的 L1 cache，速度很快，只要数据的 size 够小，能够放到 shared memory，剩下的事情我就不用操心啦。实际上不是这样，bank conflict 是一个绕不过去的问题，否则，性能会降得很低，很低，很低。。。

---------------------------------------------------------------

为什么 shared memory 存在 bank conflict，而 global memory 不存在？因为访问 global memory 的只能是 block，而访问 shared memory 的却是同一个 half-warp 中的任意线程。
http://stackoverflow.com/questions/3843032/why-arent-there-bank-conflicts-in-global-memory-for-cuda-opencl

----------------------------------------------------------------

摘自这个要翻墙的网页：Introduction to GPU Programming (HPC Summer Institute at Rice University) http://davidmedinawiki.wordpress.com/2012/06/08/introduction-to-gpu-programming/

Now that we know a little about shared memory, we need to see how this memory is accessed within the thread block. There are these memory managers called “memory banks” that are in charge of distributing the memory they manage.

Ok, that sentence sounded like it just stated the obvious but that is what memory banks do. The question now is, what memory are they in charge of?

Memory Bank Architecture (From the HPC Session)

Above is a diagram that shows how a GPU with 8 memory banks would store shared memory. Using basic math we get the following equation:

// mem is the memory location
bank = mem/8;

So why are these memory banks so important?

Well, the memory banks distribute data stored in their bank of shared memory one call at a time. This means that a parallel code can easily be turned into serial code due to bank conflicts (when each thread accesses from the same bank at the same time). There is, however, one exception to bank conflicts” which is when threads access the same memory from the same memory bank.
Here are some examples that show good and bad uses of bank memory (Images taken from the HPC Session):

---------------------------------------------------------------

下面的文字来自：http://hi.baidu.com/dwdxdy/item/e5d66f40168f852810ee1ef7

共享存储器被组织为16个bank，每个bank拥有32bit的宽度。
一个warp中的线程对共享存储器的访问请求会被划分为两个half-warp的访问请求。
无 bank conflict 时，一个half-warp内的线程可以在一个内核周期中并行访问
对同一 bank 的同时访问导致 bank conflict 只能顺序处理　访存效率降低
如果half-warp的线程访问同一地址时，会产生一次广播，不会产生 bank conflict

__shared__ float shared[256];
float foo = shared[threadIdx.x];
没有访问冲突

__shared__ float shared[256];
float foo = shared[threadIdx.x * 2];
产生2路访问冲突

__shared__ float shared[256];
float foo = shared[threadIdx.x*8];
产生8路访问冲突

---------------------------------------------------------------

Number of shared memory banks
来源：http://en.wikipedia.org/wiki/CUDA
GPU device 1.x : 16
GPU device 2.x : 32

---------------------------------------------------------------

书上说：“每个 bank 的宽度固定为 32 bit，相邻的 32 bit 字被组织在相邻的 bank 中，每个 bank 在每个时钟周期可以提供 32 bit 的带宽。”

由上面这句话可以看出来：将 shared memory 看成一个二维存储空间，每个 bank 就是一列，每一行就是 16（或 32）个 banks。要么，尽量让一个 half-warp（或 full warp）中的线程分散访问不同列（即访问不同的 bank，同行不同行没有关系）；要么，让一个 half-warp（或 full warp）中的线程务必全部访问同一列且同一行（即访问同一个地址，仅对读操作有效）。

对于计算能力 1.0 的设备，前个 half-warp 和后个 half-warp 不存在 bank conflict；
对于计算能力 2.0 的设备，前个 half-warp 和后个 half-warp 可能存在 bank conflict，因为 shared memory 可以同时让 32 个 bank 响应请求；

如果是写操作，一个 half-warp（或 full warp）中所有线程访问同一地址的时候，此时会产生不确定的结果（也只需要一个 clock cycle，不确定哪个线程会胜出），发生这种情况时应使用原子操作——但是原子操作对性能影响太大。

“Shared memory features a broadcast mechanism whereby a 32-bit word can be read and broadcast to several threadssimultaneously when servicing one memory read request. ”——从这个描述来看，只要是多个线程访问同一地址都可以产生一次广播，多个线程访问同一地址将有效减少 bank conflict 的数量。若 half-warp（或 full warp）中所有线程都要访问同一地址，则完全没有 bank conflict。

对于大于 32 bit 的 struct 来说，对它的访问将编译成多个独立的存储器访问。– “Share memory only supports 32 bit reads/writes”

因此，shared memory 的写操作的 bank conflict 是一个很头疼的问题。

0 0