Learning the Nvidia-OpenCL-SDK-Code-Samples [2]


First, a correction: in the examples I had looked at before, the options parameter of clBuildProgram() (the third argument from the end) was passed as NULL. The expert told me not to do that: when there are no build options, pass an empty string "", never NULL!
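A minimal host-side sketch of that (the helper name buildWithEmptyOptions is mine, not from the sample; error handling is trimmed to just dumping the build log):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

// Build a program with an empty options string ("") rather than NULL,
// and print the build log if the build fails.
static cl_int buildWithEmptyOptions(cl_program program, cl_device_id device)
{
    cl_int err = clBuildProgram(program, 1, &device,
                                "",           /* options: empty string, NOT NULL */
                                NULL, NULL);  /* no notification callback */
    if (err != CL_SUCCESS) {
        size_t logSize = 0;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);
        char* log = (char*)malloc(logSize + 1);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, logSize, log, NULL);
        log[logSize] = '\0';
        printf("build log:\n%s\n", log);
        free(log);
    }
    return err;
}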

I also noticed that even for a buffer created with CL_MEM_WRITE_ONLY, which will only be written with results when the kernel runs, it is still recommended to initialize it first with clEnqueueWriteBuffer(). In this sample, a CL_MEM_WRITE_ONLY buffer is actually filled through clEnqueueWriteBuffer(), just like a CL_MEM_READ_ONLY kernel argument would be! That was the first time I had seen that...
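A small sketch of what that looks like (the helper name createAndClearOutput and the zero-fill are my placeholders, not the sample's actual host code):

#include <stdlib.h>
#include <CL/cl.h>

// Create the output buffer and zero-fill it from the host, even though the
// kernel itself will only ever write to it.
static cl_mem createAndClearOutput(cl_context context, cl_command_queue queue,
                                   size_t height, cl_int* err)
{
    // CL_MEM_WRITE_ONLY only restricts how the *kernel* accesses the buffer;
    // the host may still fill it with clEnqueueWriteBuffer().
    cl_mem W = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                              height * sizeof(cl_float), NULL, err);

    cl_float* zeros = (cl_float*)calloc(height, sizeof(cl_float));
    *err = clEnqueueWriteBuffer(queue, W, CL_TRUE,           /* blocking write */
                                0, height * sizeof(cl_float),
                                zeros, 0, NULL, NULL);
    free(zeros);
    return W;
}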

1. Continuing from last time: this project is matrix multiplication, in practice something like a 100000×1100 matrix M multiplied by a 1100×1 vector V.

The host code first checks the maximum GLOBAL_SIZE the device allows and uses that as the baseline for setting the matrix height and width (a rough sketch of that kind of query follows below). The sample also shows that a single .cl file can contain several kernel functions, listed after the sketch:
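I did not copy the sample's host code into this post, so this is only a rough sketch of such a device query using the standard clGetDeviceInfo parameters; the exact value the sample reads may differ:

#include <stdio.h>
#include <CL/cl.h>

// Query the usual work-size limits that NDRange sizes are derived from.
static void printWorkSizeLimits(cl_device_id device)
{
    size_t maxWorkGroupSize = 0;
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);

    size_t maxItemSizes[3] = {0, 0, 0};
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                    sizeof(maxItemSizes), maxItemSizes, NULL);

    printf("max work-group size: %zu\n", maxWorkGroupSize);
    printf("max work-item sizes: %zu x %zu x %zu\n",
           maxItemSizes[0], maxItemSizes[1], maxItemSizes[2]);
}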

// Matrix multiplication kernel called by MatrixMul()
// the basical kernel
__kernel void MatVecMulUncoalesced0(const __global float* M,
                                    const __global float* V,
                                    uint width, uint height,
                                    __global float* W)
{
    // Row index
    uint y = get_global_id(0);
    if (y < height) {
        // Row pointer
        const __global float* row = M + y * width;

        // Compute dot product
        float dotProduct = 0;
        for (int x = 0; x < width; ++x)
            dotProduct += row[x] * V[x];

        // Write result to global memory
        W[y] = dotProduct;
    }
}

// Matrix multiplication kernel called by MatrixMul()
__kernel void MatVecMulUncoalesced1(const __global float* M,
                                    const __global float* V,
                                    uint width, uint height,
                                    __global float* W)
{
    // Each work-item handles as many matrix rows as necessary
    for (uint y = get_global_id(0);
         y < height;
         y += get_global_size(0))
    {
        // Row pointer
        const __global float* row = M + y * width;

        // Compute dot product
        float dotProduct = 0;
        for (uint x = 0; x < width; ++x)
            dotProduct += row[x] * V[x];

        // Write result to global memory
        W[y] = dotProduct;
    }
}

// Matrix multiplication kernel called by MatrixMul()
__kernel void MatVecMulCoalesced0(const __global float* M,
                                  const __global float* V,
                                  uint width, uint height,
                                  __global float* W,
                                  __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        // Row pointer
        const __global float* row = M + y * width;

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float sum = 0;
        for (uint x = get_local_id(0); x < width; x += get_local_size(0))
            sum += row[x] * V[x];

        // Each partial dot product is stored in shared memory
        partialDotProduct[get_local_id(0)] = sum;

        // Synchronize to make sure each work-item is done updating
        // shared memory; this is necessary because in the next step,
        // the first work-item needs to read from shared memory
        // the partial dot products written by the other work-items
        barrier(CLK_LOCAL_MEM_FENCE);

        // The first work-item in the work-group adds all partial
        // dot products together and writes the result to global memory
        if (get_local_id(0) == 0) {
            float dotProduct = 0;
            for (uint t = 0; t < get_local_size(0); ++t)
                dotProduct += partialDotProduct[t];
            W[y] = dotProduct;
        }

        // Synchronize to make sure the first work-item is done with
        // reading partialDotProduct
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

// Matrix multiplication kernel called by MatrixMul()
__kernel void MatVecMulCoalesced1(const __global float* M,
                                  const __global float* V,
                                  uint width, uint height,
                                  __global float* W,
                                  __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        // Row pointer
        const __global float* row = M + y * width;

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float sum = 0;
        for (uint x = get_local_id(0); x < width; x += get_local_size(0))
            sum += row[x] * V[x];

        // Each partial dot product is stored in shared memory
        partialDotProduct[get_local_id(0)] = sum;

        // Perform parallel reduction to add each work-item's
        // partial dot product together
        for (uint stride = 1; stride < get_local_size(0); stride *= 2) {
            // Synchronize to make sure each work-item is done updating
            // shared memory; this is necessary because work-items read
            // results that have been written by other work-items
            barrier(CLK_LOCAL_MEM_FENCE);

            // Index into the "partialDotProduct" array where
            // the work-item will write during this step
            uint index = 2 * stride * get_local_id(0);

            // Check for valid indices
            if (index < get_local_size(0)) {
                // Add two elements from the "partialDotProduct" array
                // and store the result in partialDotProduct[index]
                partialDotProduct[index] += partialDotProduct[index + stride];
            }
        }

        // Write the result of the reduction to global memory
        if (get_local_id(0) == 0)
            W[y] = partialDotProduct[0];

        // Synchronize to make sure the first work-item is done with
        // reading partialDotProduct
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

// Matrix multiplication kernel called by MatrixMul()
__kernel void MatVecMulCoalesced2(const __global float* M,
                                  const __global float* V,
                                  uint width, uint height,
                                  __global float* W,
                                  __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        // Row pointer
        const __global float* row = M + y * width;

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float sum = 0;
        for (uint x = get_local_id(0); x < width; x += get_local_size(0))
            sum += row[x] * V[x];

        // Each partial dot product is stored in shared memory
        partialDotProduct[get_local_id(0)] = sum;

        // Perform parallel reduction to add each work-item's
        // partial dot product together
        for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            // Synchronize to make sure each work-item is done updating
            // shared memory; this is necessary because work-items read
            // results that have been written by other work-items
            barrier(CLK_LOCAL_MEM_FENCE);

            // Only the first work-items in the work-group add elements together
            if (get_local_id(0) < stride) {
                // Add two elements from the "partialDotProduct" array
                // and store the result in partialDotProduct[index]
                partialDotProduct[get_local_id(0)] += partialDotProduct[get_local_id(0) + stride];
            }
        }

        // Write the result of the reduction to global memory
        if (get_local_id(0) == 0)
            W[y] = partialDotProduct[0];

        // Synchronize to make sure the first work-item is done with
        // reading partialDotProduct
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

#define WARP_SIZE 32
__kernel void MatVecMulCoalesced3(const __global float* M,
                                  const __global float* V,
                                  uint width, uint height,
                                  __global float* W,
                                  __local float* partialDotProduct)
{
   // Each work-group computes multiple elements of W
   for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
      const __global float* row = M + y * width;

      // Each work-item accumulates as many products as necessary
      // into local variable "sum"
      float sum = 0;
      for (uint x = get_local_id(0); x < width; x += get_local_size(0))
         sum += row[x] * V[x];

      // Each partial dot product is stored in shared memory
      partialDotProduct[get_local_id(0)] = sum;

      // Perform parallel reduction to add each work-item's
      // partial dot product together

      // Synchronize to make sure each work-item is done writing to
      // partialDotProduct
      barrier(CLK_LOCAL_MEM_FENCE);

      // Thread local ID within a warp
      uint id = get_local_id(0) & (WARP_SIZE - 1);

      // Each warp reduces 64 consecutive elements
      float warpResult = 0.0f;
      if (get_local_id(0) < get_local_size(0)/2 )
      {
          volatile __local float* p = partialDotProduct + 2 * get_local_id(0) - id;
          p[0] += p[32];
          p[0] += p[16];
          p[0] += p[8];
          p[0] += p[4];
          p[0] += p[2];
          p[0] += p[1];
          warpResult = p[0];
      }

      // Synchronize to make sure each warp is done reading
      // partialDotProduct before it is overwritten in the next step
      barrier(CLK_LOCAL_MEM_FENCE);

      // The first thread of each warp stores the result of the reduction
      // at the beginning of partialDotProduct
      if (id == 0)
         partialDotProduct[get_local_id(0) / WARP_SIZE] = warpResult;

      // Synchronize to make sure each warp is done writing to
      // partialDotProduct before it is read in the next step
      barrier(CLK_LOCAL_MEM_FENCE);

      // Number of remaining elements after the first reduction
      uint size = get_local_size(0) / (2 * WARP_SIZE);

      // get_local_size(0) is less or equal to 512 on NVIDIA GPUs, so
      // only a single warp is needed for the following last reduction
      // step
      if (get_local_id(0) < size / 2) {
         volatile __local float* p = partialDotProduct + get_local_id(0);
         if (size >= 8)
            p[0] += p[4];
         if (size >= 4)
            p[0] += p[2];
         if (size >= 2)
            p[0] += p[1];
      }

      // Write the result of the reduction to global memory
      if (get_local_id(0) == 0)
         W[y] = partialDotProduct[0];

      // Synchronize to make sure the first work-item is done with
      // reading partialDotProduct
      barrier(CLK_LOCAL_MEM_FENCE);
   }
}

For this sample there are actually 6 kernels in the file, and they show how the writing and the thinking evolve step by step. When I got to the 2nd one I assumed it was faster than the 1st because of the parallelism; when I got to the 3rd, I thought I understood how it was written, but I kept wondering: why doesn't the outer loop just use for (uint y = get_global_id(0); y < height; y += get_global_size(0)), why does it have to use the group ID? I had been asking the expert since last Friday; he explained it twice, but I still didn't really get it... and got chewed out badly... I cried over it for a while; the trigger was that what he said hurt, but what really got to me was being angry at myself for just not understanding. After patching up my mood I asked again; out of options, he told me to first go read the AMD-OpenCL-Programming-User-Guide, so I downloaded it and read it through. He also told me to understand the concept of "memory access coalescing"...

I asked again later, and it wasn't until last night that I truly understood why thinking of it that way is wrong: if the outer loop uses get_global_id(0), each work-item walks an entire row by itself, so at any given moment neighbouring work-items are reading addresses a whole row apart; with get_group_id(0), one work-group shares one row and neighbouring work-items read neighbouring elements of it, which is exactly the coalesced access pattern the expert wanted me to understand.

The question that had me stuck for so long is finally resolved.
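To spell out the two addressing patterns side by side, here is a stripped-down comparison (the kernel names rowPerWorkItem / rowPerWorkGroup are mine; the second one leaves out the reduction and just dumps per-work-item partial sums so it stays short):

// One work-item per row: at a fixed x, work-item y reads M[y * width + x],
// so neighbouring work-items read addresses `width` floats apart -> uncoalesced.
__kernel void rowPerWorkItem(const __global float* M, const __global float* V,
                             uint width, uint height, __global float* W)
{
    for (uint y = get_global_id(0); y < height; y += get_global_size(0)) {
        float dot = 0;
        for (uint x = 0; x < width; ++x)
            dot += M[y * width + x] * V[x];
        W[y] = dot;
    }
}

// One work-group per row: at each step, work-item lid reads M[y * width + x]
// with x = x0 + lid, so neighbouring work-items read neighbouring floats
// -> coalesced into a few wide memory transactions.
__kernel void rowPerWorkGroup(const __global float* M, const __global float* V,
                              uint width, uint height,
                              __global float* partialSums)  // height * localSize floats
{
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        float sum = 0;
        for (uint x = get_local_id(0); x < width; x += get_local_size(0))
            sum += M[y * width + x] * V[x];
        partialSums[y * get_local_size(0) + get_local_id(0)] = sum;
    }
}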

Then I kept reading and ran into another problem. I thought that, for the first group, all the work-items compute the dot product at the same time, and sum starts at 0. So, for example, A[0][0]*B[0][0] = V[0], i.e. one sum, at which point sum = V[0]; but at the same time the second thread is computing A[0][1]*B[1][0] = V[1], another sum. When that thread computes sum += V[1], is its sum 0 or V[0]? Wouldn't that create a data race? But after thinking it over, I figured it out myself. Here is a diagram to show it:
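Since the diagram isn't reproduced here, the same point straight from the kernel code, with a toy trace (local size 4 and width 8 are made-up numbers; the sample uses 256 and 1100):

float sum = 0;                              // "sum" is private to THIS work-item
for (uint x = get_local_id(0); x < width; x += get_local_size(0))
    sum += row[x] * V[x];                   // accumulates only into my own register
partialDotProduct[get_local_id(0)] = sum;   // and each work-item writes its own slot

// Toy trace, local size 4, width 8:
//   work-item 0: row[0]*V[0] + row[4]*V[4] -> partialDotProduct[0]
//   work-item 1: row[1]*V[1] + row[5]*V[5] -> partialDotProduct[1]
//   work-item 2: row[2]*V[2] + row[6]*V[6] -> partialDotProduct[2]
//   work-item 3: row[3]*V[3] + row[7]*V[7] -> partialDotProduct[3]
// No two work-items ever share a "sum" or a slot, so there is no data race.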

That is why adding up the 256 local-size results gives the total for one row. Now it makes sense. I don't know whether getting scolded by the expert made my brain sharper; honestly, every time I ask him something I have to work up a lot of courage, and I'm a little afraid... but at least I figured it out. It should gradually get easier from here, once I completely and thoroughly understand how things run inside a kernel. Hopefully it gets easier.

Although the second half of the 3rd kernel could be skimmed, if I had written it myself I might not have written if (get_local_id(0) == 0). The author was really smart and thorough: for the first group, say 256 work-items, only work-item 0 totals up the group's result, i.e. the dot product of the first row. Without that if, all 256 work-items would total up their group's result, repeating the same work 255 times, in every group, for nothing. So adding the if is wise and considerate. It is also why there has to be another barrier afterwards: the other 255 work-items do no totaling while work-item 0 does, so the 255 have to wait until work-item 0 is finished before anyone moves on. Honestly, if I wrote this myself I would probably miss that. That is because I'm still not familiar enough with how threads and kernel execution work; I need to keep building this way of thinking until it is deeply rooted and comes naturally... I don't know how long that will take.
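Here are the relevant lines of MatVecMulCoalesced0 again, with that reasoning written out as comments:

barrier(CLK_LOCAL_MEM_FENCE);       // everyone has finished writing partialDotProduct[lid]

if (get_local_id(0) == 0) {         // exactly ONE designated work-item...
    float dotProduct = 0;
    for (uint t = 0; t < get_local_size(0); ++t)
        dotProduct += partialDotProduct[t];
    W[y] = dotProduct;              // ...computes and writes the row result once
}
// Without the if, all 256 work-items would compute the same total and write
// the same W[y]: 255 redundant copies of the work per row.

barrier(CLK_LOCAL_MEM_FENCE);       // the other 255 work-items wait here, so none of
                                    // them moves on to the next row and overwrites
                                    // partialDotProduct while work-item 0 is still reading it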

*************************************************************************

Where the 4th kernel improves on the 3rd is in the second half. In the 3rd kernel, work-item 0 of each group has to total up its group's 256 partial sums, adding them one at a time, more than two hundred additions in a row; the 4th kernel improves exactly that part by having, step by step, each active work-item add its own value and the next one's together. That is much faster. A picture makes it clearer:

I only drew up to the 2nd iteration of the for loop; the rest follows the same pattern.
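A worked trace of that reduction in MatVecMulCoalesced1 for a toy local size of 8 (the sample uses 256, so there are log2(256) = 8 steps instead of 3):

// p = partialDotProduct; a work-item is valid when index = 2 * stride * lid < local size
//
// stride = 1: lids 0,1,2,3 valid (index 0,2,4,6)
//             p[0]+=p[1]  p[2]+=p[3]  p[4]+=p[5]  p[6]+=p[7]
// stride = 2: lids 0,1 valid (index 0,4)
//             p[0]+=p[2]  p[4]+=p[6]
// stride = 4: lid 0 valid (index 0)
//             p[0]+=p[4]
//
// After log2(8) = 3 steps p[0] holds the whole group's sum, instead of
// one work-item doing all 7 additions serially.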

The 5th kernel is really much like the 4th:

But the expert said that although the 5th kernel is similar to the previous one, its for loop has an explicit /= 2, a log2-style reduction, so you can see at a glance, without any reasoning, that the idle work-items are all in the back part and that their number doubles every step. That is also an optimization.
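The same toy trace for MatVecMulCoalesced2 makes the difference visible at a glance (again local size 8; p = partialDotProduct, a work-item is active while lid < stride):

// stride = 4: lids 0-3 active   p[0]+=p[4]  p[1]+=p[5]  p[2]+=p[6]  p[3]+=p[7]
// stride = 2: lids 0-1 active   p[0]+=p[2]  p[1]+=p[3]
// stride = 1: lid  0  active    p[0]+=p[1]
//
// The active work-items are always one contiguous block at the front
// (0..stride-1), so whole warps drop out together instead of leaving
// "holes" of idle work-items scattered between busy ones.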

Also, try not to leave "holes" between threads: ideally the ones doing work are all in one part and the idle ones are all in another part, rather than working and idle threads alternating. If the idle ones are all thrown to the tail, the hardware can simply not schedule them and give the precious compute resources (e.g. the SPs) to the warps/waves that are actually working.

(By contrast, if you accidentally leave holes in the middle, both the working and the idle threads have to be scheduled.)

But for the 6th kernel, I thought about it myself and still didn't understand the second part: the second warp does indeed start computing from element 64, but that can't be right, wouldn't the additions above then be duplicated!?

I figured it out during my nap just now. The sticking point was that I only ever knew that the += inside the for loop runs in parallel, and that separate statements run in parallel too, but I had been treating the few statements inside this if as serial, and that is what sent my thinking in the wrong direction. They are executed in parallel as well! It's like playing Honor of Kings (王者荣耀): whether we are attacking the enemy heroes or the enemy minions, everyone on my team goes in together! The enemy heroes and minions are the statements in the kernel; for each statement, my hero and my teammates' heroes are the work-items. Everyone goes in together. In parallel!

I drew a picture of what I mean; looking at it this way is much clearer:

Look at the three pictures together and it all makes sense.


Basically, each of the original 256 work-items (0 to 255) holds one sum, and now the 0th through 255th sums have to be added together; that total is the group's result, i.e. one row times the vector. The first row of my picture is the result of p[0] += p[32]. In this step every active work-item is doing useful work: only 128 work-items are needed, each adding two values, which exactly folds the 256 sums into 128 new sums. And the accesses are contiguous: 0--31, 32--63, 64--95, 96--127, 128--159, 160--191, 192--223, 224--255; you can see that column in my picture. In the second step, p[0] += p[16], the last 16 work-items of each warp are actually adding garbage, but we don't need their results anyway, so let them. What we need is 0--15 plus 16--31, the second warp's 64--79 plus 80--95, the third warp's 128--143 plus 144--159, and the fourth warp's 192--207 plus 208--223; together that again folds the previous 128 new sums into 64 newer sums... I won't walk through the remaining steps, it is the same pattern, until all the sums have gathered into the first work-item of each warp, which you can think of as that warp's team leader.
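Working out the pointer arithmetic for the sample's local size of 256 and WARP_SIZE 32 shows the same thing, and also answers my earlier worry about the second warp:

// p = partialDotProduct + 2 * get_local_id(0) - id,   id = get_local_id(0) & 31
//
//   warp 0: lids   0- 31, id = lid      -> p = base + lid        (slots   0- 31)
//   warp 1: lids  32- 63, id = lid - 32 -> p = base + lid + 32   (slots  64- 95)
//   warp 2: lids  64- 95, id = lid - 64 -> p = base + lid + 64   (slots 128-159)
//   warp 3: lids  96-127, id = lid - 96 -> p = base + lid + 96   (slots 192-223)
//
// Only lids < get_local_size(0)/2 = 128 are active, i.e. exactly warps 0-3.
// Since p[0] += p[32] reaches 32 slots further, each warp owns its own
// 64-slot block; warp 1 really does start at element 64 and nothing is added twice.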

Once the gathering is done there are 4 final sums, and the rest of the kernel is easy to understand:

if (id == 0)
    partialDotProduct[get_local_id(0) / WARP_SIZE] = warpResult;

This just goes to the 4 team leaders, collects each group's result, and puts them into partialDotProduct[0], [1], [2], [3]; the final if statement then adds those 4 results together, much like the earlier if. That completes the result for this whole group.
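Tracing that last if with the sample's numbers:

// Last reduction step in MatVecMulCoalesced3, for local size 256:
//   size = get_local_size(0) / (2 * WARP_SIZE) = 256 / 64 = 4
//   active work-items: lid < size / 2 = 2, with p = partialDotProduct + lid
//
//   size >= 8 ? no  -> p[0] += p[4] skipped
//   size >= 4 ? yes -> pDP[0] += pDP[2]   pDP[1] += pDP[3]
//   size >= 2 ? yes -> pDP[0] += pDP[1]   (pDP[1] += pDP[2] also runs, harmless garbage)
//
// Afterwards partialDotProduct[0] = pDP[0]+pDP[1]+pDP[2]+pDP[3], and
// work-item 0 writes it to W[y].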

Really, once you get past the step I had wrong above, the rest is easy.

One more question: understanding how this kernel works is one thing, but where exactly does the 6th kernel improve on the 5th? It has coalesced memory access and no holes between threads, but the 5th has both of those too; and this 6th version always has 128 work-items working, while in the 5th version the number of idle work-items doubles each step and the number of working ones keeps shrinking. So where is the optimization in the 6th version?
