Nvidia-OpenCL-SDK-Code-Samples: Study Notes [2]
First, a correction to an earlier mistake: in the examples I read before, the options argument of clBuildProgram() (third from the end) was passed as NULL. But the guru says not to do that — when there are no options, pass " ", never NULL!
I also noticed that even for a buffer created with CL_MEM_WRITE_ONLY, which will only be written when the kernel runs, it is still recommended to initialize it first with clEnqueueWriteBuffer(). In this example, a CL_MEM_WRITE_ONLY buffer is actually filled through clEnqueueWriteBuffer() to set up a kernel argument, just like a CL_MEM_READ_ONLY one! That's the first time I've seen that...
1. Continuing from last time: this project is matrix multiplication — in effect, a matrix M of roughly 100000x1100 multiplied by a 1100x1 vector V.
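To pin down what every kernel variant in this sample computes, here is a minimal plain-Python reference; the tiny sizes are my own choice, standing in for the sample's roughly 100000x1100 matrix:

```python
# Plain-Python reference for what every kernel variant computes:
# W = M x V, where M is a height-by-width matrix and V has length width.
# Tiny sizes (my choice) stand in for the sample's ~100000 x 1100 matrix.
height, width = 4, 6

M = [[float(y * width + x) for x in range(width)] for y in range(height)]
V = [float(x + 1) for x in range(width)]

# One output element per matrix row: W[y] = dot(M[y], V)
W = [sum(M[y][x] * V[x] for x in range(width)) for y in range(height)]
```

All six kernels below produce exactly this W; they differ only in how the per-row dot product is distributed over work-items.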
It first queries the device's maximum allowed global work size and uses that as the baseline for setting the matrix dimensions. The .cl file involved contains several kernel functions:
// Matrix multiplication kernel called by MatrixMul()
// The basic kernel.
__kernel void MatVecMulUncoalesced0(const __global float* M,
                                    const __global float* V,
                                    uint width, uint height,
                                    __global float* W)
{
    // Row index
    uint y = get_global_id(0);
    if (y < height) {
        // Row pointer
        const __global float* row = M + y * width;

        // Compute dot product
        float dotProduct = 0;
        for (int x = 0; x < width; ++x)
            dotProduct += row[x] * V[x];

        // Write result to global memory
        W[y] = dotProduct;
    }
}

// Matrix multiplication kernel called by MatrixMul()
__kernel void MatVecMulUncoalesced1(const __global float* M,
                                    const __global float* V,
                                    uint width, uint height,
                                    __global float* W)
{
    // Each work-item handles as many matrix rows as necessary
    for (uint y = get_global_id(0); y < height; y += get_global_size(0)) {
        // Row pointer
        const __global float* row = M + y * width;

        // Compute dot product
        float dotProduct = 0;
        for (uint x = 0; x < width; ++x)
            dotProduct += row[x] * V[x];

        // Write result to global memory
        W[y] = dotProduct;
    }
}

// Matrix multiplication kernel called by MatrixMul()
__kernel void MatVecMulCoalesced0(const __global float* M,
                                  const __global float* V,
                                  uint width, uint height,
                                  __global float* W,
                                  __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        // Row pointer
        const __global float* row = M + y * width;

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float sum = 0;
        for (uint x = get_local_id(0); x < width; x += get_local_size(0))
            sum += row[x] * V[x];

        // Each partial dot product is stored in shared memory
        partialDotProduct[get_local_id(0)] = sum;

        // Synchronize to make sure each work-item is done updating
        // shared memory; this is necessary because in the next step,
        // the first work-item needs to read from shared memory
        // the partial dot products written by the other work-items
        barrier(CLK_LOCAL_MEM_FENCE);

        // The first work-item in the work-group adds all partial
        // dot products together and writes the result to global memory
        if (get_local_id(0) == 0) {
            float dotProduct = 0;
            for (uint t = 0; t < get_local_size(0); ++t)
                dotProduct += partialDotProduct[t];
            W[y] = dotProduct;
        }

        // Synchronize to make sure the first work-item is done with
        // reading partialDotProduct
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

// Matrix multiplication kernel called by MatrixMul()
__kernel void MatVecMulCoalesced1(const __global float* M,
                                  const __global float* V,
                                  uint width, uint height,
                                  __global float* W,
                                  __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        // Row pointer
        const __global float* row = M + y * width;

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float sum = 0;
        for (uint x = get_local_id(0); x < width; x += get_local_size(0))
            sum += row[x] * V[x];

        // Each partial dot product is stored in shared memory
        partialDotProduct[get_local_id(0)] = sum;

        // Perform parallel reduction to add each work-item's
        // partial dot product together
        for (uint stride = 1; stride < get_local_size(0); stride *= 2) {
            // Synchronize to make sure each work-item is done updating
            // shared memory; this is necessary because work-items read
            // results that have been written by other work-items
            barrier(CLK_LOCAL_MEM_FENCE);

            // Index into the "partialDotProduct" array where
            // the work-item will write during this step
            uint index = 2 * stride * get_local_id(0);

            // Check for valid indices
            if (index < get_local_size(0)) {
                // Add two elements from the "partialDotProduct" array
                // and store the result in partialDotProduct[index]
                partialDotProduct[index] += partialDotProduct[index + stride];
            }
        }

        // Write the result of the reduction to global memory
        if (get_local_id(0) == 0)
            W[y] = partialDotProduct[0];

        // Synchronize to make sure the first work-item is done with
        // reading partialDotProduct
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

// Matrix multiplication kernel called by MatrixMul()
__kernel void MatVecMulCoalesced2(const __global float* M,
                                  const __global float* V,
                                  uint width, uint height,
                                  __global float* W,
                                  __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        // Row pointer
        const __global float* row = M + y * width;

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float sum = 0;
        for (uint x = get_local_id(0); x < width; x += get_local_size(0))
            sum += row[x] * V[x];

        // Each partial dot product is stored in shared memory
        partialDotProduct[get_local_id(0)] = sum;

        // Perform parallel reduction to add each work-item's
        // partial dot product together
        for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            // Synchronize to make sure each work-item is done updating
            // shared memory; this is necessary because work-items read
            // results that have been written by other work-items
            barrier(CLK_LOCAL_MEM_FENCE);

            // Only the first work-items in the work-group add elements together
            if (get_local_id(0) < stride) {
                // Add two elements from the "partialDotProduct" array
                // and store the result in partialDotProduct[index]
                partialDotProduct[get_local_id(0)] +=
                    partialDotProduct[get_local_id(0) + stride];
            }
        }

        // Write the result of the reduction to global memory
        if (get_local_id(0) == 0)
            W[y] = partialDotProduct[0];

        // Synchronize to make sure the first work-item is done with
        // reading partialDotProduct
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

#define WARP_SIZE 32
__kernel void MatVecMulCoalesced3(const __global float* M,
                                  const __global float* V,
                                  uint width, uint height,
                                  __global float* W,
                                  __local float* partialDotProduct)
{
    // Each work-group computes multiple elements of W
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        const __global float* row = M + y * width;

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float sum = 0;
        for (uint x = get_local_id(0); x < width; x += get_local_size(0))
            sum += row[x] * V[x];

        // Each partial dot product is stored in shared memory
        partialDotProduct[get_local_id(0)] = sum;

        // Perform parallel reduction to add each work-item's
        // partial dot product together

        // Synchronize to make sure each work-item is done writing to
        // partialDotProduct
        barrier(CLK_LOCAL_MEM_FENCE);

        // Thread local ID within a warp
        uint id = get_local_id(0) & (WARP_SIZE - 1);

        // Each warp reduces 64 consecutive elements
        float warpResult = 0.0f;
        if (get_local_id(0) < get_local_size(0) / 2) {
            volatile __local float* p = partialDotProduct + 2 * get_local_id(0) - id;
            p[0] += p[32];
            p[0] += p[16];
            p[0] += p[8];
            p[0] += p[4];
            p[0] += p[2];
            p[0] += p[1];
            warpResult = p[0];
        }

        // Synchronize to make sure each warp is done reading
        // partialDotProduct before it is overwritten in the next step
        barrier(CLK_LOCAL_MEM_FENCE);

        // The first thread of each warp stores the result of the reduction
        // at the beginning of partialDotProduct
        if (id == 0)
            partialDotProduct[get_local_id(0) / WARP_SIZE] = warpResult;

        // Synchronize to make sure each warp is done writing to
        // partialDotProduct before it is read in the next step
        barrier(CLK_LOCAL_MEM_FENCE);

        // Number of remaining elements after the first reduction
        uint size = get_local_size(0) / (2 * WARP_SIZE);

        // get_local_size(0) is less or equal to 512 on NVIDIA GPUs, so
        // only a single warp is needed for the following last reduction
        // step
        if (get_local_id(0) < size / 2) {
            volatile __local float* p = partialDotProduct + get_local_id(0);
            if (size >= 8) p[0] += p[4];
            if (size >= 4) p[0] += p[2];
            if (size >= 2) p[0] += p[1];
        }

        // Write the result of the reduction to global memory
        if (get_local_id(0) == 0)
            W[y] = partialDotProduct[0];

        // Synchronize to make sure the first work-item is done with
        // reading partialDotProduct
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
This example actually contains 6 kernels, and they show an evolution in both coding and thinking. When I saw the second one I assumed it was faster than the first because of the parallelism. With the third, I thought I understood how it was written, but I kept wondering: why doesn't the outermost loop use for (uint y = get_global_id(0); y < height; y += get_global_size(0)) instead of insisting on the group ID? I asked the guru last Friday; he explained it twice, but I still didn't truly understand... and got chewed out mercilessly... I went off and cried for a while. The trigger was that what he said hurt, but really I was angry at myself for just not getting it. Once I had pulled myself together I asked again; out of options, he told me to first go read the AMD-OpenCL-Programming-User-Guide, so I downloaded it and read it through. He also told me to understand the concept of "memory access coalescing"...
I kept asking, and it wasn't until last night that I truly understood why my way of thinking was wrong. The question that had me stuck for so long was finally resolved.
Then I kept reading and ran into another problem. I assumed that for the first group, all work-items compute the dot product at the same time, with sum starting at 0. Say A[0][0]*B[0][0]=V[0]; that is one product, so at that point sum=V[0]. But at the same moment the second thread is computing A[0][1]*B[1][0]=V[1] into another sum. When that thread does sum+=V[1], is its sum 0 or V[0]? Wouldn't that be a data race? But after thinking it over I understood it myself: sum is a private variable, so every work-item accumulates into its own copy, and there is nothing to race on. A diagram shows it:
That is why adding up the 256 local-size partial results gives the total for one row. Now it makes sense. I don't know if getting scolded by the guru sharpened my brain; honestly, every time I ask him a question I have to summon enormous courage, and I'm a little afraid of him... But at least I figured it out. It should gradually get easier once I completely and thoroughly understand how things run inside a kernel. Hopefully things get lighter from here.
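The point above can be sketched by simulating one work-group of kernel 3's first half in plain Python (the sizes and data here are mine, chosen small for illustration): each work-item strides through the row accumulating into its own private sum, so no two work-items ever write the same variable, and the local_size partial sums add up to the full row dot product.

```python
# One simulated work-group of MatVecMulCoalesced0's first half: each of
# local_size work-items strides through the row, accumulating into its own
# PRIVATE sum, so there is no data race. (Small sizes chosen for clarity.)
local_size = 4
width = 10
row = [float(x) for x in range(width)]
V = [1.0] * width

partial = [0.0] * local_size            # stands in for local memory
for local_id in range(local_size):
    s = 0.0                             # the private "sum" of this work-item
    for x in range(local_id, width, local_size):
        s += row[x] * V[x]
    partial[local_id] = s               # each work-item writes its own slot

# The local_size partial sums together give the full row dot product.
dot = sum(partial)
```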
The rest of the third kernel could be skipped, but if I had written it I might not have written if (get_local_id(0) == 0). The author is really smart and thorough. For the first group of 256 threads, only thread 0 totals the group's result, i.e., the dot product of the first row. Without that if, all 256 threads would each total the group's result: the same work repeated 255 extra times per group, for nothing. So the if is wise and considerate. It is also why another barrier follows: the other 255 threads do no totaling while thread 0 does, so those 255 have to wait until thread 0 is finished. Honestly, writing this myself I might have overlooked it. That is all because I am still unfamiliar with how threads and kernel execution work internally; I should keep building up this way of thinking until it is deeply rooted and comes naturally... no idea how long that will take.
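The tail of kernel 3 can be sketched the same way (plain Python, one simulated work-group; the guard is the whole point): every work-item executes the same statements, and without the if only one of them should perform the serial sum.

```python
# Simulate the tail of MatVecMulCoalesced0 for one work-group: every
# work-item runs the same code, but only local id 0 performs the serial
# sum over the group's partial dot products.
local_size = 8
partial = [float(i) for i in range(local_size)]  # stand-in partial sums

result = None
work_done = 0                        # count how many threads actually sum
for local_id in range(local_size):   # "all work-items execute this code"
    if local_id == 0:                # the guard: only thread 0 totals
        result = sum(partial)
        work_done += 1
```

Without the guard, work_done would be local_size: every thread would redo the identical sum, which is exactly the 255 wasted repetitions described above.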
*************************************************************************
The fourth kernel improves on the third in its second half. In kernel 3, thread 0 of each group has to total the group's 256 partial sums one by one, over two hundred additions. Kernel 4 improves exactly here: at each step, each active thread adds its own element and the one a stride away, in pairs. That is much faster. A diagram makes it clearer:
I only drew up to the second iteration of the for loop; the rest follows the same pattern.
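The diagram can also be written down as a simulation. Here is kernel 4's reduction for one work-group in plain Python (local_size is my choice); each pass of the while loop is one "parallel step", and in the kernel barrier(CLK_LOCAL_MEM_FENCE) sits between consecutive steps:

```python
# Kernel 4's reduction for one simulated work-group: at each step, the
# work-item with local id i targets index 2 * stride * i and adds in the
# element `stride` away, giving log2(local_size) steps instead of roughly
# local_size serial additions by thread 0.
local_size = 8
partial = [float(i + 1) for i in range(local_size)]   # 1..8, total 36

stride = 1
while stride < local_size:
    # One "parallel step": in the kernel, a barrier separates the strides.
    for local_id in range(local_size):
        index = 2 * stride * local_id
        if index < local_size:
            partial[index] += partial[index + stride]
    stride *= 2

# partial[0] now holds the whole group's sum.
```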
The fifth kernel is actually much like the fourth:
But the guru says that although kernel 5 resembles the previous one, its for loop has an explicit /= 2, a log2-style reduction, so you can see at a glance, without any working-out, that the idle threads are all in the back half and that their number doubles each step. That too is an optimization.
Also, try not to leave "holes" among the threads: ideally the working threads are all in one contiguous part and the idle ones in another, not interleaved working/idle/working/idle. If the idle ones are all bunched at the tail, the hardware can simply not schedule them and give the precious compute resources (e.g., the SPs) to the warps/waves that are doing work.
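Kernel 5's variant, sketched the same way as before: stride halves each step and a work-item is active only while its id is below stride, so the working threads are always the contiguous front block 0..stride-1 and the idle ones all sit together at the tail.

```python
# Kernel 5's reduction for one simulated work-group: stride halves each
# step and work-item i is active only while i < stride, so the working
# threads stay contiguous at the front and the idle ones cluster at the
# tail, where the hardware need not schedule them.
local_size = 8
partial = [float(i + 1) for i in range(local_size)]   # 1..8, total 36

stride = local_size // 2
while stride > 0:
    for local_id in range(local_size):      # one "parallel step"
        if local_id < stride:
            partial[local_id] += partial[local_id + stride]
    stride //= 2

# partial[0] again holds the whole group's sum.
```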
But for the sixth kernel, after thinking it over myself I still hadn't understood the second half. The second warp does indeed start from element 64, but that can't be right, the additions above would be adding the same elements twice!?
During my nap just now I figured it out. The crux: I knew that the += inside the for runs in parallel, and that separate statements run in parallel across threads too, but I was treating the several statements inside that if as serial for the group, and that is what led my thinking astray. They run in parallel! It is like playing Honor of Kings: whether we attack the enemy heroes or their minions, my teammates and I all go in together. The enemy heroes and minions are the statements in the kernel, one by one, and the hero I control, plus the heroes my teammates control, are the threads. Everyone goes in together. Parallel!
I drew a diagram of my idea; seen this way, it is clear:
Looking at the three diagrams together, it all makes sense.
In effect, each of threads 0-255 originally holds one sum, and we need to add the 256 sums together to get the group's total, which is one row times the vector. The first row of the diagram is the result of p[0] += p[32]: at this step every thread is doing real work; only 128 threads are needed, each adding two values, which exactly folds the 256 sums into 128 threads as new sums. And the accesses are contiguous: 0-31, 32-63, 64-95, 96-127, 128-159, 160-191, 192-223, 224-255 (see that column in my diagram). On the second step, p[0] += p[16], the last 16 threads of each warp are actually adding garbage, but we don't need their results, so we just let them. What we need are 0-15 and 16-31, the second warp's 64-79 and 80-95, the third warp's 128-143 and 144-159, and the fourth warp's 192-207 and 208-223; together those fold the previous 128 new sums into 64 newer ones... I'll skip the remaining steps, which follow the same pattern, until all the sums collect in the first thread of each warp, which you can think of as that warp's team leader.
Once gathered, there are 4 final sums, and the rest of the kernel is easy to understand:
if (id == 0)
    partialDotProduct[get_local_id(0) / WARP_SIZE] = warpResult;

This goes to the 4 team leaders for their groups' results and puts them at positions [0][1][2][3] of partialDotProduct. The final if statement then adds those 4 results together, much like the if above, completing this whole group's result.
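The whole second half of kernel 6 can be checked with a plain-Python simulation of one work-group (local_size = 256, WARP_SIZE = 32). I model every kernel statement as one lockstep step where all reads happen before all writes, which is exactly the in-parallel behaviour I had been getting wrong:

```python
# Simulate the second half of MatVecMulCoalesced3 for one work-group with
# local_size = 256 and WARP_SIZE = 32. Each kernel statement is one
# lockstep step: ALL reads happen before ALL writes.
WARP_SIZE = 32
local_size = 256
partial = [float(i + 1) for i in range(local_size)]   # per-work-item sums
expected = sum(partial)                               # 1 + 2 + ... + 256

# "p = partialDotProduct + 2 * lid - id", with id = lid & (WARP_SIZE - 1):
# the 128 work-items with lid < 128 form 4 active warps, and warp w
# reduces the 64 consecutive elements partial[64w .. 64w+63].
def base(lid):
    return 2 * lid - (lid & (WARP_SIZE - 1))

warp_result = [0.0] * local_size
for offset in (32, 16, 8, 4, 2, 1):                   # p[0] += p[offset]
    reads = [partial[base(lid) + offset] for lid in range(local_size // 2)]
    for lid in range(local_size // 2):
        partial[base(lid)] += reads[lid]
for lid in range(local_size // 2):
    warp_result[lid] = partial[base(lid)]             # warpResult = p[0]

# "if (id == 0) partialDotProduct[lid / WARP_SIZE] = warpResult;"
# lane 0 of each warp (the "team leader") stores its warp's sum up front.
for lid in range(local_size):
    if lid & (WARP_SIZE - 1) == 0:
        partial[lid // WARP_SIZE] = warp_result[lid]

# Last step: size = 4 warp sums remain, folded by just 2 work-items
# (the "size >= 8" branch is skipped because size == 4).
size = local_size // (2 * WARP_SIZE)
for lid in range(size // 2):
    partial[lid] += partial[lid + 2]                  # "if (size >= 4)"
for lid in range(size // 2):
    partial[lid] += partial[lid + 1]                  # "if (size >= 2)"

# partial[0] now holds the whole group's dot product.
```

Note how the tail lanes of each warp do add garbage during the smaller offsets, exactly as described above, yet partial[0] still ends up correct because only lane 0's result is ever used.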
Really, once I got past the one step I had wrong, the rest was easy.
One more question: understanding how this kernel works is one thing, but where exactly does the sixth improve on the fifth? It achieves coalesced access and no thread holes, but so does kernel 5. And this sixth version keeps 128 threads working throughout, while in the fifth the number of idle threads doubles each step and the working ones keep shrinking. So where is the optimization in version 6?