CUDA Fundamentals and Concepts




This article covers the following topics:

  • CUDA Fundamentals and Concepts
    • Introduction
    • GPU Programming Languages
    • Multidimensional Grids
    • Example: RGB to Grayscale Conversion Code
    • Example: Image Blurring
    • Registers, Shared Memory, Global Memory
    • Example: 1D Stencil
    • Example: Edge Detection Algorithm
    • Memory Coalescing


Introduction


Scalability and Portability

Scalability:

  The same application runs efficiently on a single core.
  The same application runs efficiently on a larger number of the same cores.

Portability:

  The same application runs efficiently on different types of cores.
  The same application runs efficiently on systems with different organizations and interfaces.


GPU Programming Languages


  A CUDA kernel executes as a grid of threads. All threads in the grid run the same kernel code (Single Program, Multiple Data), and each thread has indices that it uses to compute memory addresses and make control-flow decisions.


Block:

  A block can be one-, two-, or three-dimensional, and each block can hold 1 to 512 threads; all threads in a block execute the same thread program. Maximum number of threads per block: 512 (the limit on the early hardware this material targets; newer GPUs allow 1024).
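For instance, block shapes are declared with dim3; unspecified dimensions default to 1 (a minimal sketch; the variable names are illustrative):

    dim3 block1D(256);      // 256 threads along x
    dim3 block2D(16, 16);   // 16 x 16 = 256 threads
    dim3 block3D(8, 8, 4);  // 8 x 8 x 4 = 256 threads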

Warps

  Today's mainstream Streaming Multiprocessors are organized around 32 threads: each thread block is partitioned into warps of 32 threads, i.e., every 32 consecutive threads form one warp.
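As a minimal sketch (the kernel and array names are ours, not from the original), a thread can derive its warp and lane from its index using the built-in warpSize:

    __global__ void warpInfo(int *warpOf, int *laneOf)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        warpOf[i] = threadIdx.x / warpSize;  // which warp of its block this thread belongs to
        laneOf[i] = threadIdx.x % warpSize;  // lane within the warp (0-31)
    }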

  The thread array is partitioned into blocks; threads within a block cooperate through shared memory, while threads in different blocks do not interact.

    // Compute vector sum C = A + B
    void vecAdd(float *h_A, float *h_B, float *h_C, int n)
    {
        int i;
        for (i = 0; i < n; i++) h_C[i] = h_A[i] + h_B[i];
    }

    int main()
    {
        // Memory allocation for h_A, h_B, and h_C
        // I/O to read h_A and h_B, N elements
        …
        vecAdd(h_A, h_B, h_C, N);
    }

  The program above is a vector-sum program that runs on the CPU; next we port it to the GPU.

cudaMalloc(): allocates an object in the GPU's global memory, i.e., reserves a region of device memory the same size as the source data so it can be written and read back.

  • Two parameters: the address of a pointer to the allocated object; the size of the allocated object in bytes

cudaFree(): releases an object from global memory

  • One parameter: a pointer to the object being freed

cudaMemcpy(): memory data transfer

  • Four parameters: pointer to the destination; pointer to the source; number of bytes to copy; the type/direction of the transfer

cudaThreadSynchronize(): synchronize, i.e., wait for the GPU kernel to finish (newer CUDA releases deprecate this call in favor of cudaDeviceSynchronize())
__global__: GPU kernel code; launched by the CPU, must return void, and cannot be called from other kernels
__device__: a function that runs on the GPU; it can only be called from device code (a kernel or another __device__ function)
__host__: a function that executes on the CPU
__shared__: the variable resides in shared memory
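A minimal sketch of how these qualifiers are used together (the function names are illustrative):

    __device__ float square(float x) { return x * x; }  // callable only from device code

    __host__ __device__ float clampf(float x)  // compiled for both CPU and GPU
    {
        return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
    }

    __global__ void squareAll(float *data, int n)  // kernel: launched from the host, returns void
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = square(data[i]);
    }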

    // vecAdd CUDA Host Code
    #include <cuda.h>

    __global__ void vecAddKernel(float* A, float* B, float* C, int n);  // forward declaration

    void vecAdd(float *h_A, float *h_B, float *h_C, int n)
    {
        int size = n * sizeof(float);
        float *d_A, *d_B, *d_C;

        // Part 1
        // Allocate device memory for A, B, and C
        // copy A and B to device memory
        cudaMalloc((void **) &d_A, size);                    // allocate an object in global memory
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);  // memory data transfer
        cudaMalloc((void **) &d_B, size);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
        cudaMalloc((void **) &d_C, size);

        // Part 2
        // Kernel launch code: the device performs the actual vector addition
        dim3 DimGrid((n-1)/256 + 1, 1, 1);  // three dimensions available; only one is used here
        dim3 DimBlock(256, 1, 1);           // 256 threads per block
        vecAddKernel<<<DimGrid, DimBlock>>>(d_A, d_B, d_C, n);  // launch the kernel with its arguments
        cudaThreadSynchronize();            // synchronize: wait for the kernel to finish

        // Part 3
        // copy C from the device memory
        // Free device vectors
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);  // memory data transfer
        cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);         // release objects from global memory
    }

    __global__  // kernel definition
    void vecAddKernel(float* A, float* B, float* C, int n)  // device code must return void
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;  // this thread's global index
        if (i < n) C[i] = A[i] + B[i];  // the guard prevents out-of-bounds access: the last block may run past n
    }
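In real code, each of these API calls returns a cudaError_t that should be checked; a common pattern is a small wrapper macro (the CHECK macro below is our own convention, not part of the CUDA API):

    #include <cstdio>
    #include <cstdlib>

    #define CHECK(call)                                               \
        do {                                                          \
            cudaError_t err = (call);                                 \
            if (err != cudaSuccess) {                                 \
                fprintf(stderr, "CUDA error %s at %s:%d\n",           \
                        cudaGetErrorString(err), __FILE__, __LINE__); \
                exit(1);                                              \
            }                                                         \
        } while (0)

    // Usage: CHECK(cudaMalloc((void **) &d_A, size));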

Multidimensional Grids



  Note: on the device, x runs horizontally and y vertically; keep this mapping in mind when converting between rows/columns and thread indices.

    ...
    dim3 DimGrid((n-1)/16 + 1, (m-1)/16 + 1, 1);
    dim3 DimBlock(16, 16, 1);
    PictureKernel<<<DimGrid, DimBlock>>>(d_Pin, d_Pout, m, n);
    ...

    __global__
    void PictureKernel(float* d_Pin, float* d_Pout, int height, int width)
    {
        // Calculate the row # of the d_Pin and d_Pout element
        int Row = blockIdx.y*blockDim.y + threadIdx.y;  // row position in the source image
        // Calculate the column # of the d_Pin and d_Pout element
        int Col = blockIdx.x*blockDim.x + threadIdx.x;  // column position in the source image
        // each thread computes one element of d_Pout if in range
        if ((Row < height) && (Col < width)) {
            d_Pout[Row*width + Col] = 2.0*d_Pin[Row*width + Col];  // flatten (Row, Col) to a 1D index
        }
    }

Example: RGB to Grayscale Conversion Code


    #define CHANNELS 3  // we have 3 channels corresponding to RGB
    // The input image is encoded as unsigned characters [0, 255]
    __global__ void colorConvert(unsigned char * grayImage, unsigned char * rgbImage, int width, int height)
    {
        int x = threadIdx.x + blockIdx.x * blockDim.x;
        int y = threadIdx.y + blockIdx.y * blockDim.y;
        if (x < width && y < height)
        {
            // get 1D coordinate for the grayscale image
            int grayOffset = y*width + x;
            // one can think of the RGB image as having
            // CHANNELS times as many columns as the grayscale image
            int rgbOffset = grayOffset*CHANNELS;
            unsigned char r = rgbImage[rgbOffset    ];  // red value for pixel
            unsigned char g = rgbImage[rgbOffset + 1];  // green value for pixel
            unsigned char b = rgbImage[rgbOffset + 2];  // blue value for pixel
            // perform the rescaling and store it
            // We multiply by floating point constants
            grayImage[grayOffset] = 0.21f*r + 0.71f*g + 0.07f*b;
        }
    }
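A host-side launch for this kernel, sketched under the assumption that d_grayImage and d_rgbImage have already been allocated and filled with cudaMalloc/cudaMemcpy as in the vecAdd example:

    dim3 DimBlock(16, 16, 1);                              // a 16 x 16 tile of pixels per block
    dim3 DimGrid((width-1)/16 + 1, (height-1)/16 + 1, 1);  // enough blocks to cover the image
    colorConvert<<<DimGrid, DimBlock>>>(d_grayImage, d_rgbImage, width, height);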

Example: Image Blurring


    #define BLUR_SIZE 1  // blur radius (the original leaves this constant undefined; 1 gives a 3x3 box)

    __global__
    void blurKernel(unsigned char * in, unsigned char * out, int w, int h)
    {
        int Col = blockIdx.x * blockDim.x + threadIdx.x;
        int Row = blockIdx.y * blockDim.y + threadIdx.y;
        if (Col < w && Row < h)
        {
            int pixVal = 0;
            int pixels = 0;
            // Get the average of the surrounding (2*BLUR_SIZE+1) x (2*BLUR_SIZE+1) box
            for (int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow)
            {
                for (int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol)
                {
                    int curRow = Row + blurRow;
                    int curCol = Col + blurCol;
                    // Verify we have a valid image pixel
                    if (curRow > -1 && curRow < h && curCol > -1 && curCol < w)
                    {
                        pixVal += in[curRow * w + curCol];
                        pixels++;  // Keep track of the number of pixels in the accumulated total
                    }
                }
            }
            // Write our new pixel value out
            out[Row * w + Col] = (unsigned char)(pixVal / pixels);
        }
    }

Registers, Shared Memory, Global Memory


  Allocating CUDA memory sensibly speeds up data access and, with it, the whole program.

Registers

  Data stored in register memory is visible only to the thread that wrote it and lasts only for that thread's lifetime.

Global memory

  Global memory is visible to all threads in the application, and its lifetime matches the host-side allocation. It is large but slow.

Shared memory

  Shared memory is visible to all threads within a block and lives as long as the block does. It lets those threads communicate and share data with one another. It is very fast: absent bank conflicts, as fast as registers, but its maximum size is only 48 KB. It pays off when data is read repeatedly; if data is read only once, shared memory brings no real improvement.

CUDA Memories

  Registers hold automatic variables. Data can usually be staged in shared memory, which offers higher-speed access (in both latency and throughput) than global memory. Scope of access and sharing: the thread block. Lifetime: the thread block; the contents disappear once the block's threads finish executing.
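A minimal sketch of how the declaration site decides where a variable lives (the kernel is illustrative and assumes at most 256 threads per block):

    __global__ void scopeDemo(float *g)  // g points into global memory
    {
        float x = g[threadIdx.x];        // automatic variable: kept in a register, private to this thread
        __shared__ float tile[256];      // shared memory: visible to every thread in this block,
                                         // gone when the block finishes
        tile[threadIdx.x] = x;
    }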

    // A Basic Matrix Multiplication
    __global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
    {
        // Calculate the row index of the P element and M
        int Row = blockIdx.y*blockDim.y + threadIdx.y;
        // Calculate the column index of P and N
        int Col = blockIdx.x*blockDim.x + threadIdx.x;
        if ((Row < Width) && (Col < Width))
        {
            float Pvalue = 0;
            // each thread computes one element of the block sub-matrix
            for (int k = 0; k < Width; ++k) {
                Pvalue += M[Row*Width+k] * N[k*Width+Col];
            }
            P[Row*Width+Col] = Pvalue;
        }
    }

  The program above fetches every operand from global memory, which is slow: each thread issues two global loads per multiply-add, so every element of M and N is read from global memory Width times.

Tiled Matrix Multiplication

  Matrix multiplication forms each output element from one row times one column. Because shared memory is small, each phase of a thread block's data accesses is confined to one tile of M and one tile of N, each tile holding BLOCK_SIZE (here, TILE_WIDTH x TILE_WIDTH) elements.

__syncthreads(): synchronizes all threads in the block, i.e., waits for all of the block's outstanding memory operations, guarding against RAW/WAR/WAW hazards.

    __global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
    {
        __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
        __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
        int bx = blockIdx.x;  int by = blockIdx.y;
        int tx = threadIdx.x; int ty = threadIdx.y;
        int Row = by * blockDim.y + ty;  // this thread's global row index
        int Col = bx * blockDim.x + tx;  // this thread's global column index
        float Pvalue = 0;
        // Loop over the M and N tiles required to compute the P element
        for (int p = 0; p < Width/TILE_WIDTH; ++p)
        {
            // Collaborative loading of M and N tiles into shared memory
            ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];    // each iteration loads one tile
            ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
            __syncthreads();  // make sure the whole tile is loaded
            for (int i = 0; i < TILE_WIDTH; ++i)
                Pvalue += ds_M[ty][i] * ds_N[i][tx];  // partial dot product from this tile
            __syncthreads();  // make sure the tile is no longer needed before overwriting it
        }
        P[Row*Width+Col] = Pvalue;
    }

  This version demonstrates only the use of shared memory; it assumes Width is a multiple of TILE_WIDTH and omits boundary checks. The matching host-side launch is sketched below.
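For reference, a matching host-side launch might look like this (a sketch under the same divisibility assumption; d_M, d_N, and d_P are device pointers allocated as before):

    dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);
    dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
    MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);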


Example: 1D Stencil


  Given a one-dimensional array, each output element is the sum of the input elements within a fixed radius; here the radius is 3.

    __global__ void stencil_1d(int *in, int *out)
    {
        __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
        int gindex = threadIdx.x + blockIdx.x * blockDim.x + RADIUS;  // global index (offset past the halo)
        int lindex = threadIdx.x + RADIUS;                            // local index within shared memory
        // Read input elements into shared memory
        temp[lindex] = in[gindex];
        if (threadIdx.x < RADIUS)
        {
            temp[lindex - RADIUS] = in[gindex - RADIUS];          // left halo
            temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];  // right halo
        }
        // Synchronize (ensure all the data is available)
        __syncthreads();
        // Apply the stencil
        int result = 0;
        for (int offset = -RADIUS; offset <= RADIUS; offset++)
            result += temp[lindex + offset];
        // Store the result
        out[gindex - RADIUS] = result;
    }

  Since the kernel sums a neighborhood around each element, it introduces RADIUS; note how that definition shifts the indexing throughout the program.
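Because gindex is offset by RADIUS, the kernel expects the input array to carry a halo of RADIUS elements on each side. A host-side sketch of that setup, assuming N is a multiple of BLOCK_SIZE (the variable names are ours):

    int *d_in, *d_out;
    cudaMalloc((void **) &d_in,  (N + 2*RADIUS) * sizeof(int));  // input padded with halos
    cudaMalloc((void **) &d_out, N * sizeof(int));
    // ... copy the padded input into d_in ...
    stencil_1d<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out);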


Example: Edge Detection Algorithm


  In the program below, the host launches the GPU kernel, so the function is declared __global__; and since the operation is two-dimensional, both the blocks and the threads are defined in two dimensions.

    dim3 blocksPerGrid(N/TILE_WIDTH, N/TILE_WIDTH, 1);  // three dimensions available; two are used here
    dim3 threadsPerBlock(TILE_WIDTH, TILE_WIDTH, 1);    // threads are likewise laid out in two dimensions
    inverseEdgeDetect2D<<<blocksPerGrid, threadsPerBlock>>>(d_output, d_input, d_edge);  // launch the GPU kernel with its arguments

  In the GPU kernel, note how shared memory is used now that the data is two-dimensional, and, because the function takes differences with neighboring pixels, how the RADIUS definition affects the indexing. Compare the one-dimensional and two-dimensional versions.

    __global__ void inverseEdgeDetect2D(float *d_output, float *d_input,
                                        float *d_edge)
    {
        __shared__ float ds_input[TILE_WIDTH + 2*RADIUS][TILE_WIDTH + 2*RADIUS];
        int bx = blockIdx.x;  int by = blockIdx.y;
        int tx = threadIdx.x + RADIUS; int ty = threadIdx.y + RADIUS;
        int Row = by * blockDim.y + ty;
        int Col = bx * blockDim.x + tx;
        int idx;
        int numcols = N + 2;
        // Read input elements into shared memory
        ds_input[ty][tx] = d_input[Row*numcols + Col];
        if (threadIdx.x < RADIUS)
        {
            ds_input[ty][tx - RADIUS] = d_input[Row*numcols + Col - RADIUS];            // left halo
            ds_input[ty][tx + TILE_WIDTH] = d_input[Row*numcols + Col + TILE_WIDTH];    // right halo
        }
        if (threadIdx.y < RADIUS)
        {
            ds_input[ty - RADIUS][tx] = d_input[(Row - RADIUS)*numcols + Col];          // top halo
            ds_input[ty + TILE_WIDTH][tx] = d_input[(Row + TILE_WIDTH)*numcols + Col];  // bottom halo
        }
        // Synchronize (ensure all the data is available)
        __syncthreads();
        // Apply the stencil: average the four neighbors and subtract the edge term
        idx = Row * numcols + Col;
        d_output[idx] = (ds_input[ty+RADIUS][tx] + ds_input[ty-RADIUS][tx]
                       + ds_input[ty][tx+RADIUS] + ds_input[ty][tx-RADIUS]
                       - d_edge[idx]) * 0.25;
    }

Memory Coalescing


  When accessing global memory, peak bandwidth utilization is achieved when all threads access consecutive memory locations. Therefore, make memory accesses as contiguous as possible.
[Figure: coalesced (contiguous) versus uncoalesced (non-contiguous) global memory accesses]
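A minimal sketch of the difference (illustrative kernels, launched with one thread per element of a row): in the first, consecutive threads of a warp touch consecutive addresses, so the hardware merges them into a few transactions; in the second, consecutive threads sit Width elements apart and each access costs its own transaction.

    // Coalesced: thread i reads address i (e.g., walking along a row of a row-major matrix)
    __global__ void readCoalesced(float *in, float *out, int Width)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < Width) out[i] = in[i];
    }

    // Not coalesced: thread i reads address i*Width (e.g., walking down a column)
    __global__ void readStrided(float *in, float *out, int Width)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < Width) out[i] = in[i * Width];
    }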
