Nvidia-OpenCL-SDK-Code-Samples的学习[4]

来源：互联网发布：淘宝层级划分编辑：程序博客网时间：2024/06/15 04:42

这个工程是Matrix Transpose：以为是我之前看过的那个，结果不是，更高级：

#define BLOCK_DIM 16// This kernel is optimized to ensure all global reads and writes are coalesced,// and to avoid bank conflicts in shared memory.  This kernel is up to 11x faster// than the naive kernel below.  Note that the shared memory array is sized to // (BLOCK_DIM+1)*BLOCK_DIM.  This pads each row of the 2D block in shared memory // so that bank conflicts do not occur when threads address the array column-wise.__kernel void transpose(__global float *odata, __global float *idata, int offset, int width, int height, __local float* block){// read the matrix tile into shared memoryunsigned int xIndex = get_global_id(0);unsigned int yIndex = get_global_id(1);if((xIndex + offset < width) && (yIndex < height)){unsigned int index_in = yIndex * width + xIndex + offset;block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)] = idata[index_in];}barrier(CLK_LOCAL_MEM_FENCE);// write the transposed matrix tile to global memoryxIndex = get_group_id(1) * BLOCK_DIM + get_local_id(0);yIndex = get_group_id(0) * BLOCK_DIM + get_local_id(1);if((xIndex < height) && (yIndex + offset < width))        {unsigned int index_out = yIndex * height + xIndex;odata[index_out] = block[get_local_id(0)*(BLOCK_DIM+1)+get_local_id(1)];}}// This naive transpose kernel suffers from completely non-coalesced writes.// It can be up to 10x slower than the kernel above for large matrices.__kernel void transpose_naive(__global float *odata, __global float* idata, int offset, int width, int height){    unsigned int xIndex = get_global_id(0);    unsigned int yIndex = get_global_id(1);        if (xIndex + offset < width && yIndex < height)    {        unsigned int index_in  = xIndex + offset + width * yIndex;        unsigned int index_out = yIndex + height * xIndex;        odata[index_out] = idata[index_in];     }}//just copy from src to dst,not transpose!__kernel void simple_copy(__global float *odata, __global float* idata, int offset, int width, int height){    unsigned int xIndex = get_global_id(0);    unsigned int yIndex = get_global_id(1);        if (xIndex + offset < width && yIndex < height)    {        unsigned int index_in  = xIndex + offset + width * yIndex;        odata[index_in] = idata[index_in];     }}__kernel void uncoalesced_copy(__global float *odata, __global float* idata, int offset, int width, int height){    unsigned int xIndex = get_global_id(0);    unsigned int yIndex = get_global_id(1);        if (xIndex + offset < width && yIndex < height)    {        unsigned int index_in  = yIndex + height * (xIndex+ offset);        odata[index_in] = idata[index_in];     }}//just copy using local variable and LDS to make it faster,not transpose!__kernel void shared_copy(__global float *odata, __global float *idata, int offset, int width, int height, __local float* block){// read the matrix tile into shared memoryunsigned int xIndex = get_global_id(0);unsigned int yIndex = get_global_id(1);    unsigned int index_in = yIndex * width + xIndex + offset;if((xIndex + offset< width) && (yIndex < height)){block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)] = idata[index_in];}barrier(CLK_LOCAL_MEM_FENCE);if((xIndex < height) && (yIndex+ offset < width))       {odata[index_in] = block[get_local_id(1)*(BLOCK_DIM+1)+get_local_id(0)];}}

要转置的矩阵是2048X2048大小的。前2个kernel是转置，后3个是复制矩阵。这几个kernel的globalsize都是2048X2048 localsize都是16X16，但本来第1个和第5个我不明白为什么要用17X16的local变量做中间结果，为什么不用和localsize一样大小的16X16，这和我以前看过的例子都不同，这里说是为了防止“bank conflict”，难道我以前看的例子都是有这个问题的？于是去查，明白一点点了：http://blog.csdn.net/endlch/article/details/47043069

http://www.cnblogs.com/waytofall/archive/2013/02/19/2916996.html

所以这里不用16X16而用17X16哦！

还有个问题，就是“访存合并”的问题，这上面说第1种kernel写法，读写访存都是合并的，而第2种就是写访存不合并！
我记得所谓访存合并就是：一个warp里的线程都要访问连续的地址。我看了下，第2种kernel里不仅写访存不合并，读访存也不合并啊！？第1种kernel里对idata的读和对odata的写也都是访存不合并的，只是block里读写是合并的，难道这个访存合并是对block而言？！

大神的回答是：

合并访存并不是对于work-group说的，而在N卡和A卡上分别针对warp和wavefront说的。这点没有疑义的。

其次，虽然说尽量要对warp或者wave整理进行连续地址上的访问，但在实际中，只要能拆分成多个64B整体，甚至是32B就足够可以高效了（需要其他条件限定其实。N卡建议在普通非pascal的卡上，使用至少32B连续(L2读取），或者128B连续（L1, 但L1本身高效，哪怕放宽到64B也挺好）。A卡建议至少64B连续，同时需要每1/4个wave至少应当能合并成64B. 等等）

第三，在这个放宽的条件下，你会看到naive的版本代码，在转置的时候，无论怎么读写，只要是直接的转置，要么你能读的时候能连续或者说合并，但是写的时候不能；要么写的时候能，但是读的时候不能，这就很尴尬了。（注意一般情况下我们总是行优先存储的）。

第四，通过使用local memory, 则可以整体搬运一个方块区域，保存在local memory上，这种搬运可以合并或者至少像刚才说的哪种放宽条件的合并，同时利用local memory的可以跳着或者说错开的读写能力，进一步的可以完成最终的global memory上的写入也合并。注意这是在说无bank conflict的情况下。

我以前记住的是每32个线程就是一个wave或者warp，原来记住这种放宽政策就好了。意思用画图表示就是：

这样当然第1个kernel会访存合并，第2个这样读就是写不合并，竖着读就是读不合并。

 如果存的是char类型，char占8位。 那在这种拆分成64B要连续的情况下  也就是一个group每64个线程要访问连续地址，才是访存合并。 （不是float，float是32位，所以这里就要求每64B/32=16个线程连续访问才是访存合并。）  
之前不放宽条件，书上说的好像是每32个线程是一个wave或者warp(好像是这样 忘记了)，所以我就算这0--31线程没有连续 中间空了个16下标。 原来以后记住放宽的条件就好了。

大神又提醒我说：访存合并只是一个global memory上的概念， 在local memory上，只要不导致bank conflict冲突，怎么读都可以的。
包括竖着读，分散着读等等。如果无bank conflict, 在local memory上连续的读取和分散或者跳着读或者竖着，都一样效果的。

这是为何能利用local memory进行转置的原因。因为local memory不怕它，但是global memory害怕，所以通过local memory中间不怕的一参合， 让global上的都变成合并的了）这样就很安全了。
大神还说了第5点：关于额外增加一列防止bank conflict的：N卡和A卡现在大部分都是32个bank的local memory, 至少等于访问A[id * s + b], 如果元素是4B的时候，只要s能是奇数，那么就无bank conflict。这个容易理解，你可以写写画画图复制理解以下，也可以直接尝试证明它，易证的，或者干脆记住结论。 而距离16个，最近的奇数是17了，所以这里可以选择17. 请注意刚才的公式里面和b其实无关的。只和s是否是奇数有关，如果id是连续的线程编号的话。

看到 http://bbs.gpuworld.cn/thread-10060-1-1.html  这个人问的关于conflict的，是占8个bank这个我知道，thread0、thread1、thread2和thread3一起访问bank0对吧，这个我也知道，但我不知道怎么看这4个threads是否访问的bank0的同深度还是不同深度项目？大神回答的意思应该是访问的bank0里同深度的，所以无conflict。

大神的回答是： 如果一个地址p,它的低128B, 其中的每连续的4B, 决定在哪个bank上，高位地址决定了深度。 例如地址128, 和地址256,都对应的bank 0的，但是深度不一样，这种情况下（同一bank的不同深度）不能同时访问。但是128，和132这两处，则是可以的。（正好对应bank0和bank1）类似的，128和4也可以同时访问（不同bank的不同深度）。
然后我就问他：不同的地址就是深度，你说了同一bank中不同深度的情况(嗯这样会conflict)  还说了不同bank不同深度的(这样不会conflict)，这两种大概知道了。  还有一种同一bank的相同深度的(不会conflict)这种没有说。而且我想问的是thread0访问Sdata[0]、thread1访问Sdata[1]、thread2访问Sdata[2]、thread3访问Sdata[3],而Sdata0--3都在bank0中，这种情况是不同线程访问同一bank的不同深度，应该要发生bank conflict，为什么没有？！
然后大神开始火了，心情不好了，不再回答了。

阅读全文

0 0