Nvidia-OpenCL-SDK-Code-Samples: Study Notes [2]
First, a correction to an earlier mistake: in the examples I read before, the options argument of clBuildProgram() (third from the end) was passed as NULL. But the guru says not to do that — when there are no options, pass " ", never NULL!
I also noticed that even for a buffer created with CL_MEM_WRITE_ONLY, which will only be written when the kernel runs, it is still recommended to initialize it first with clEnqueueWriteBuffer(). In this example, a CL_MEM_WRITE_ONLY buffer is actually filled through clEnqueueWriteBuffer() to set up a kernel argument, just like a CL_MEM_READ_ONLY one! That's the first time I've seen that...
1. Continuing from last time: this project is matrix multiplication — in effect, a matrix M of roughly 100000x1100 multiplied by a 1100x1 vector V.
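To pin down what every kernel variant in this sample computes, here is a minimal plain-Python reference; the tiny sizes are my own choice, standing in for the sample's roughly 100000x1100 matrix:

```python
# Plain-Python reference for what every kernel variant computes:
# W = M x V, where M is a height-by-width matrix and V has length width.
# Tiny sizes (my choice) stand in for the sample's ~100000 x 1100 matrix.
height, width = 4, 6

M = [[float(y * width + x) for x in range(width)] for y in range(height)]
V = [float(x + 1) for x in range(width)]

# One output element per matrix row: W[y] = dot(M[y], V)
W = [sum(M[y][x] * V[x] for x in range(width)) for y in range(height)]
```

All six kernels below produce exactly this W; they differ only in how the per-row dot product is distributed over work-items.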
It first queries the device's maximum allowed global work size and uses that as the baseline for setting the matrix dimensions. The .cl file involved contains several kernel functions:
// Matrix multiplication kernel called by MatrixMul()
// The basic kernel.
__kernel void MatVecMulUncoalesced0(const __global float* M,
                                    const __global float* V,
                                    uint width, uint height,
                                    __global float* W)
{
    // Row index
    uint y = get_global_id(0);
    if (y < height) {
        // Row pointer
        const __global float* row = M + y * width;

        // Compute dot product
        float dotProduct = 0;
        for (int x = 0; x < width; ++x)
            dotProduct += row[x] * V[x];

        // Write result to global memory
        W[y] = dotProduct;
    }
}

// Matrix multiplication kernel called by MatrixMul()
__kernel void MatVecMulUncoalesced1(const __global float* M,
                                    const __global float* V,
                                    uint width, uint height,
                                    __global float* W)
{
    // Each work-item handles as many matrix rows as necessary
    for (uint y = get_global_id(0); y < height; y += get_global_size(0)) {
        // Row pointer
        const __global float* row = M + y * width;

        // Compute dot product
        float dotProduct = 0;
        for (uint x = 0; x < width; ++x)
            dotProduct += row[x] * V[x];

        // Write result to global memory
        W[y] = dotProduct;
    }
}

// Matrix multiplication kernel called by MatrixMul()
__kernel void MatVecMulCoalesced0(const __global float* M,
                                  const __global float* V,
                                  uint width, uint height,
                                  __global float* W,
                                  __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        // Row pointer
        const __global float* row = M + y * width;

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float sum = 0;
        for (uint x = get_local_id(0); x < width; x += get_local_size(0))
            sum += row[x] * V[x];

        // Each partial dot product is stored in shared memory
        partialDotProduct[get_local_id(0)] = sum;

        // Synchronize to make sure each work-item is done updating
        // shared memory; this is necessary because in the next step,
        // the first work-item needs to read from shared memory
        // the partial dot products written by the other work-items
        barrier(CLK_LOCAL_MEM_FENCE);

        // The first work-item in the work-group adds all partial
        // dot products together and writes the result to global memory
        if (get_local_id(0) == 0) {
            float dotProduct = 0;
            for (uint t = 0; t < get_local_size(0); ++t)
                dotProduct += partialDotProduct[t];
            W[y] = dotProduct;
        }

        // Synchronize to make sure the first work-item is done with
        // reading partialDotProduct
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

// Matrix multiplication kernel called by MatrixMul()
__kernel void MatVecMulCoalesced1(const __global float* M,
                                  const __global float* V,
                                  uint width, uint height,
                                  __global float* W,
                                  __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        // Row pointer
        const __global float* row = M + y * width;

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float sum = 0;
        for (uint x = get_local_id(0); x < width; x += get_local_size(0))
            sum += row[x] * V[x];

        // Each partial dot product is stored in shared memory
        partialDotProduct[get_local_id(0)] = sum;

        // Perform parallel reduction to add each work-item's
        // partial dot product together
        for (uint stride = 1; stride < get_local_size(0); stride *= 2) {
            // Synchronize to make sure each work-item is done updating
            // shared memory; this is necessary because work-items read
            // results that have been written by other work-items
            barrier(CLK_LOCAL_MEM_FENCE);

            // Index into the "partialDotProduct" array where
            // the work-item will write during this step
            uint index = 2 * stride * get_local_id(0);

            // Check for valid indices
            if (index < get_local_size(0)) {
                // Add two elements from the "partialDotProduct" array
                // and store the result in partialDotProduct[index]
                partialDotProduct[index] += partialDotProduct[index + stride];
            }
        }

        // Write the result of the reduction to global memory
        if (get_local_id(0) == 0)
            W[y] = partialDotProduct[0];

        // Synchronize to make sure the first work-item is done with
        // reading partialDotProduct
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

// Matrix multiplication kernel called by MatrixMul()
__kernel void MatVecMulCoalesced2(const __global float* M,
                                  const __global float* V,
                                  uint width, uint height,
                                  __global float* W,
                                  __local float* partialDotProduct)
{
    // Each work-group handles as many matrix rows as necessary
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        // Row pointer
        const __global float* row = M + y * width;

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float sum = 0;
        for (uint x = get_local_id(0); x < width; x += get_local_size(0))
            sum += row[x] * V[x];

        // Each partial dot product is stored in shared memory
        partialDotProduct[get_local_id(0)] = sum;

        // Perform parallel reduction to add each work-item's
        // partial dot product together
        for (uint stride = get_local_size(0) / 2; stride > 0; stride /= 2) {
            // Synchronize to make sure each work-item is done updating
            // shared memory; this is necessary because work-items read
            // results that have been written by other work-items
            barrier(CLK_LOCAL_MEM_FENCE);

            // Only the first work-items in the work-group add elements together
            if (get_local_id(0) < stride) {
                // Add two elements from the "partialDotProduct" array
                // and store the result in partialDotProduct[index]
                partialDotProduct[get_local_id(0)] +=
                    partialDotProduct[get_local_id(0) + stride];
            }
        }

        // Write the result of the reduction to global memory
        if (get_local_id(0) == 0)
            W[y] = partialDotProduct[0];

        // Synchronize to make sure the first work-item is done with
        // reading partialDotProduct
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}

#define WARP_SIZE 32
__kernel void MatVecMulCoalesced3(const __global float* M,
                                  const __global float* V,
                                  uint width, uint height,
                                  __global float* W,
                                  __local float* partialDotProduct)
{
    // Each work-group computes multiple elements of W
    for (uint y = get_group_id(0); y < height; y += get_num_groups(0)) {
        const __global float* row = M + y * width;

        // Each work-item accumulates as many products as necessary
        // into local variable "sum"
        float sum = 0;
        for (uint x = get_local_id(0); x < width; x += get_local_size(0))
            sum += row[x] * V[x];

        // Each partial dot product is stored in shared memory
        partialDotProduct[get_local_id(0)] = sum;

        // Perform parallel reduction to add each work-item's
        // partial dot product together

        // Synchronize to make sure each work-item is done writing to
        // partialDotProduct
        barrier(CLK_LOCAL_MEM_FENCE);

        // Thread local ID within a warp
        uint id = get_local_id(0) & (WARP_SIZE - 1);

        // Each warp reduces 64 consecutive elements
        float warpResult = 0.0f;
        if (get_local_id(0) < get_local_size(0) / 2) {
            volatile __local float* p = partialDotProduct + 2 * get_local_id(0) - id;
            p[0] += p[32];
            p[0] += p[16];
            p[0] += p[8];
            p[0] += p[4];
            p[0] += p[2];
            p[0] += p[1];
            warpResult = p[0];
        }

        // Synchronize to make sure each warp is done reading
        // partialDotProduct before it is overwritten in the next step
        barrier(CLK_LOCAL_MEM_FENCE);

        // The first thread of each warp stores the result of the reduction
        // at the beginning of partialDotProduct
        if (id == 0)
            partialDotProduct[get_local_id(0) / WARP_SIZE] = warpResult;

        // Synchronize to make sure each warp is done writing to
        // partialDotProduct before it is read in the next step
        barrier(CLK_LOCAL_MEM_FENCE);

        // Number of remaining elements after the first reduction
        uint size = get_local_size(0) / (2 * WARP_SIZE);

        // get_local_size(0) is less or equal to 512 on NVIDIA GPUs, so
        // only a single warp is needed for the following last reduction
        // step
        if (get_local_id(0) < size / 2) {
            volatile __local float* p = partialDotProduct + get_local_id(0);
            if (size >= 8) p[0] += p[4];
            if (size >= 4) p[0] += p[2];
            if (size >= 2) p[0] += p[1];
        }

        // Write the result of the reduction to global memory
        if (get_local_id(0) == 0)
            W[y] = partialDotProduct[0];

        // Synchronize to make sure the first work-item is done with
        // reading partialDotProduct
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}
This example actually contains 6 kernels, and they show an evolution in both coding and thinking. When I saw the second one I assumed it was faster than the first because of the parallelism. With the third, I thought I understood how it was written, but I kept wondering: why doesn't the outermost loop use for (uint y = get_global_id(0); y < height; y += get_global_size(0)) instead of insisting on the group ID? I asked the guru last Friday; he explained it twice, but I still didn't truly understand... and got chewed out mercilessly... I went off and cried for a while. The trigger was that what he said hurt, but really I was angry at myself for just not getting it. Once I had pulled myself together I asked again; out of options, he told me to first go read the AMD-OpenCL-Programming-User-Guide, so I downloaded it and read it through. He also told me to understand the concept of "memory access coalescing"...
I kept asking, and it wasn't until last night that I truly understood why my way of thinking was wrong. The question that had me stuck for so long was finally resolved.
Then I kept reading and ran into another problem. I assumed that for the first group, all work-items compute the dot product at the same time, with sum starting at 0. Say A[0][0]*B[0][0]=V[0]; that is one product, so at that point sum=V[0]. But at the same moment the second thread is computing A[0][1]*B[1][0]=V[1] into another sum. When that thread does sum+=V[1], is its sum 0 or V[0]? Wouldn't that be a data race? But after thinking it over I understood it myself: sum is a private variable, so every work-item accumulates into its own copy, and there is nothing to race on. A diagram shows it:
That is why adding up the 256 local-size partial results gives the total for one row. Now it makes sense. I don't know if getting scolded by the guru sharpened my brain; honestly, every time I ask him a question I have to summon enormous courage, and I'm a little afraid of him... But at least I figured it out. It should gradually get easier once I completely and thoroughly understand how things run inside a kernel. Hopefully things get lighter from here.
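The point above can be sketched by simulating one work-group of kernel 3's first half in plain Python (the sizes and data here are mine, chosen small for illustration): each work-item strides through the row accumulating into its own private sum, so no two work-items ever write the same variable, and the local_size partial sums add up to the full row dot product.

```python
# One simulated work-group of MatVecMulCoalesced0's first half: each of
# local_size work-items strides through the row, accumulating into its own
# PRIVATE sum, so there is no data race. (Small sizes chosen for clarity.)
local_size = 4
width = 10
row = [float(x) for x in range(width)]
V = [1.0] * width

partial = [0.0] * local_size            # stands in for local memory
for local_id in range(local_size):
    s = 0.0                             # the private "sum" of this work-item
    for x in range(local_id, width, local_size):
        s += row[x] * V[x]
    partial[local_id] = s               # each work-item writes its own slot

# The local_size partial sums together give the full row dot product.
dot = sum(partial)
```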
The rest of the third kernel could be skipped, but if I had written it I might not have written if (get_local_id(0) == 0). The author is really smart and thorough. For the first group of 256 threads, only thread 0 totals the group's result, i.e., the dot product of the first row. Without that if, all 256 threads would each total the group's result: the same work repeated 255 extra times per group, for nothing. So the if is wise and considerate. It is also why another barrier follows: the other 255 threads do no totaling while thread 0 does, so those 255 have to wait until thread 0 is finished. Honestly, writing this myself I might have overlooked it. That is all because I am still unfamiliar with how threads and kernel execution work internally; I should keep building up this way of thinking until it is deeply rooted and comes naturally... no idea how long that will take.
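The tail of kernel 3 can be sketched the same way (plain Python, one simulated work-group; the guard is the whole point): every work-item executes the same statements, and without the if only one of them should perform the serial sum.

```python
# Simulate the tail of MatVecMulCoalesced0 for one work-group: every
# work-item runs the same code, but only local id 0 performs the serial
# sum over the group's partial dot products.
local_size = 8
partial = [float(i) for i in range(local_size)]  # stand-in partial sums

result = None
work_done = 0                        # count how many threads actually sum
for local_id in range(local_size):   # "all work-items execute this code"
    if local_id == 0:                # the guard: only thread 0 totals
        result = sum(partial)
        work_done += 1
```

Without the guard, work_done would be local_size: every thread would redo the identical sum, which is exactly the 255 wasted repetitions described above.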
*************************************************************************
The fourth kernel improves on the third in its second half. In kernel 3, thread 0 of each group has to total the group's 256 partial sums one by one, over two hundred additions. Kernel 4 improves exactly here: at each step, each active thread adds its own element and the one a stride away, in pairs. That is much faster. A diagram makes it clearer:
I only drew up to the second iteration of the for loop; the rest follows the same pattern.
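The diagram can also be written down as a simulation. Here is kernel 4's reduction for one work-group in plain Python (local_size is my choice); each pass of the while loop is one "parallel step", and in the kernel barrier(CLK_LOCAL_MEM_FENCE) sits between consecutive steps:

```python
# Kernel 4's reduction for one simulated work-group: at each step, the
# work-item with local id i targets index 2 * stride * i and adds in the
# element `stride` away, giving log2(local_size) steps instead of roughly
# local_size serial additions by thread 0.
local_size = 8
partial = [float(i + 1) for i in range(local_size)]   # 1..8, total 36

stride = 1
while stride < local_size:
    # One "parallel step": in the kernel, a barrier separates the strides.
    for local_id in range(local_size):
        index = 2 * stride * local_id
        if index < local_size:
            partial[index] += partial[index + stride]
    stride *= 2

# partial[0] now holds the whole group's sum.
```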
The fifth kernel is actually much like the fourth:
But the guru says that although kernel 5 resembles the previous one, its for loop has an explicit /= 2, a log2-style reduction, so you can see at a glance, without any working-out, that the idle threads are all in the back half and that their number doubles each step. That too is an optimization.
Also, try not to leave "holes" among the threads: ideally the working threads are all in one contiguous part and the idle ones in another, not interleaved working/idle/working/idle. If the idle ones are all bunched at the tail, the hardware can simply not schedule them and give the precious compute resources (e.g., the SPs) to the warps/waves that are doing work.
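Kernel 5's variant, sketched the same way as before: stride halves each step and a work-item is active only while its id is below stride, so the working threads are always the contiguous front block 0..stride-1 and the idle ones all sit together at the tail.

```python
# Kernel 5's reduction for one simulated work-group: stride halves each
# step and work-item i is active only while i < stride, so the working
# threads stay contiguous at the front and the idle ones cluster at the
# tail, where the hardware need not schedule them.
local_size = 8
partial = [float(i + 1) for i in range(local_size)]   # 1..8, total 36

stride = local_size // 2
while stride > 0:
    for local_id in range(local_size):      # one "parallel step"
        if local_id < stride:
            partial[local_id] += partial[local_id + stride]
    stride //= 2

# partial[0] again holds the whole group's sum.
```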
But for the sixth kernel, after thinking it over myself I still hadn't understood the second half. The second warp does indeed start from element 64, but that can't be right, the additions above would be adding the same elements twice!?
During my nap just now I figured it out. The crux: I knew that the += inside the for runs in parallel, and that separate statements run in parallel across threads too, but I was treating the several statements inside that if as serial for the group, and that is what led my thinking astray. They run in parallel! It is like playing Honor of Kings: whether we attack the enemy heroes or their minions, my teammates and I all go in together. The enemy heroes and minions are the statements in the kernel, one by one, and the hero I control, plus the heroes my teammates control, are the threads. Everyone goes in together. Parallel!
I drew a diagram of my idea; seen this way, it is clear:
Looking at the three diagrams together, it all makes sense.
In effect, each of threads 0-255 originally holds one sum, and we need to add the 256 sums together to get the group's total, which is one row times the vector. The first row of the diagram is the result of p[0] += p[32]: at this step every thread is doing real work; only 128 threads are needed, each adding two values, which exactly folds the 256 sums into 128 threads as new sums. And the accesses are contiguous: 0-31, 32-63, 64-95, 96-127, 128-159, 160-191, 192-223, 224-255 (see that column in my diagram). On the second step, p[0] += p[16], the last 16 threads of each warp are actually adding garbage, but we don't need their results, so we just let them. What we need are 0-15 and 16-31, the second warp's 64-79 and 80-95, the third warp's 128-143 and 144-159, and the fourth warp's 192-207 and 208-223; together those fold the previous 128 new sums into 64 newer ones... I'll skip the remaining steps, which follow the same pattern, until all the sums collect in the first thread of each warp, which you can think of as that warp's team leader.
Once gathered, there are 4 final sums, and the rest of the kernel is easy to understand:
if (id == 0)
    partialDotProduct[get_local_id(0) / WARP_SIZE] = warpResult;

This goes to the 4 team leaders for their groups' results and puts them at positions [0][1][2][3] of partialDotProduct. The final if statement then adds those 4 results together, much like the if above, completing this whole group's result.
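The whole second half of kernel 6 can be checked with a plain-Python simulation of one work-group (local_size = 256, WARP_SIZE = 32). I model every kernel statement as one lockstep step where all reads happen before all writes, which is exactly the in-parallel behaviour I had been getting wrong:

```python
# Simulate the second half of MatVecMulCoalesced3 for one work-group with
# local_size = 256 and WARP_SIZE = 32. Each kernel statement is one
# lockstep step: ALL reads happen before ALL writes.
WARP_SIZE = 32
local_size = 256
partial = [float(i + 1) for i in range(local_size)]   # per-work-item sums
expected = sum(partial)                               # 1 + 2 + ... + 256

# "p = partialDotProduct + 2 * lid - id", with id = lid & (WARP_SIZE - 1):
# the 128 work-items with lid < 128 form 4 active warps, and warp w
# reduces the 64 consecutive elements partial[64w .. 64w+63].
def base(lid):
    return 2 * lid - (lid & (WARP_SIZE - 1))

warp_result = [0.0] * local_size
for offset in (32, 16, 8, 4, 2, 1):                   # p[0] += p[offset]
    reads = [partial[base(lid) + offset] for lid in range(local_size // 2)]
    for lid in range(local_size // 2):
        partial[base(lid)] += reads[lid]
for lid in range(local_size // 2):
    warp_result[lid] = partial[base(lid)]             # warpResult = p[0]

# "if (id == 0) partialDotProduct[lid / WARP_SIZE] = warpResult;"
# lane 0 of each warp (the "team leader") stores its warp's sum up front.
for lid in range(local_size):
    if lid & (WARP_SIZE - 1) == 0:
        partial[lid // WARP_SIZE] = warp_result[lid]

# Last step: size = 4 warp sums remain, folded by just 2 work-items
# (the "size >= 8" branch is skipped because size == 4).
size = local_size // (2 * WARP_SIZE)
for lid in range(size // 2):
    partial[lid] += partial[lid + 2]                  # "if (size >= 4)"
for lid in range(size // 2):
    partial[lid] += partial[lid + 1]                  # "if (size >= 2)"

# partial[0] now holds the whole group's dot product.
```

Note how the tail lanes of each warp do add garbage during the smaller offsets, exactly as described above, yet partial[0] still ends up correct because only lane 0's result is ever used.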
Really, once I got past the one step I had wrong, the rest was easy.
One more question: understanding how this kernel works is one thing, but where exactly does the sixth improve on the fifth? It achieves coalesced access and no thread holes, but so does kernel 5. And this sixth version keeps 128 threads working throughout, while in the fifth the number of idle threads doubles each step and the working ones keep shrinking. So where is the optimization in version 6?