how to use cudaMallocPitch


by Steven Mark Ford

http://www.stevenmarkford.com/allocating-2d-arrays-in-cuda/

Allocating 2D arrays in CUDA can be a little confusing at first. There are a couple of mistakes you may make while trying to allocate your first 2D array.

Wrong Way #1:

int rowCount = 10;
float** d_array = 0; // array on device
cudaMalloc((void**)&d_array, rowCount * sizeof(float*)); // allocate the pointer table on the device
for (int i = 0; i < rowCount; i++)
{
    // this results in the error "Access violation writing location":
    // &d_array[i] dereferences a device pointer on the host
    cudaMalloc((void**)&d_array[i], (i + 1) * sizeof(float)); // column length increases with "i" here
}

The problem with the above is that cudaMalloc allocates memory on the device, and once memory lives on the device the host thread loses direct access to it; it can only be touched from within kernels or through the CUDA copy functions. So when you try to call cudaMalloc on the 2nd dimension of the array, the expression &d_array[i] dereferences a device pointer on the host, which triggers the "Access violation writing location" error.

Wrong Way #2:

int rowCount = 10;
float** d_array = (float**)malloc(rowCount * sizeof(float*)); // malloc 1st dimension
for (int i = 0; i < rowCount; i++)
{
    cudaMalloc(&d_array[i], (i + 1) * sizeof(float)); // cudaMalloc 2nd dimension
}

The issue with the above code is that the 1st dimension of the array now belongs to the host (we used malloc) while the 2nd dimension belongs to the device (we used cudaMalloc). The host-side pointer table cannot be dereferenced from a kernel, so this mixture also leads to access violations on the host and other problems inside kernels. If you really do need a pointer-per-row layout, see the sketch below.
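If you genuinely need a jagged, pointer-per-row layout, the usual workaround, shown here as a minimal sketch rather than code from the original article, is to build the row pointers in a host-side staging array and then copy that pointer table into device memory:

// Minimal sketch (not from the original article): a device-resident pointer
// table whose entries are filled in on the host, then copied to the device.
int rowCount = 10;

// host-side staging array that will hold the device row pointers
float** h_rows = (float**)malloc(rowCount * sizeof(float*));
for (int i = 0; i < rowCount; i++)
{
    cudaMalloc((void**)&h_rows[i], (i + 1) * sizeof(float)); // each row lives on the device
}

// device-side pointer table, filled by copying the host staging array
float** d_array = 0;
cudaMalloc((void**)&d_array, rowCount * sizeof(float*));
cudaMemcpy(d_array, h_rows, rowCount * sizeof(float*), cudaMemcpyHostToDevice);

// d_array can now be passed to a kernel and indexed as d_array[row][column]

Each cudaMalloc here writes a device address into host memory (which is legal), and the single cudaMemcpy moves the whole pointer table to the device. Even so, flattened or pitched allocations are usually the better choice, as the rest of the article explains.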

The Semi-Correct Way:
Flatten your 2D array into a 1D array and use index arithmetic to access the element you want (see the sketch below). This works, but you may lose performance due to classical data structure alignment issues (for the theory see Wikipedia's article: http://en.wikipedia.org/wiki/Data_structure_alignment). The CUDA documentation also discusses memory alignment for optimal performance. For CUDA-specific alignment and padding requirements see the "CUDA C Programming Guide" Version 4.0, page 94.
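As a quick illustration (a sketch, not code from the original article), the flattened layout is a single contiguous allocation indexed as row * columnCount + column:

// Sketch of the flattened approach: one contiguous allocation,
// indexed with row * columnCount + column.
int columnCount = 15;
int rowCount = 10;

float* d_flat = 0;
cudaMalloc((void**)&d_flat, rowCount * columnCount * sizeof(float));

// inside a kernel, element (row, column) is then accessed as:
//   float value = d_flat[row * columnCount + column];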

The Recommended Way:
Use the built-in CUDA array allocation functions, e.g. cudaMallocPitch() and cudaMalloc3D(). These are also optimised for performance.
A quote from the "CUDA C Programming Guide" Version 4.0, page 21: "These functions are recommended for allocations of 2D or 3D arrays as it makes sure that the allocation is appropriately padded to meet the alignment requirements described in Section 5.3.2.1, therefore ensuring best performance when accessing the row addresses or performing copies between 2D arrays and other regions of device memory."

Example:

#include <iostream>
using namespace std;

// kernel which copies data from d_array to destinationArray
__global__ void CopyData(float* d_array,
                         float* destinationArray,
                         size_t pitch,
                         int columnCount,
                         int rowCount)
{
  for (int row = 0; row < rowCount; row++)
  {
    // update the pointer to point to the beginning of the next row
    float* rowData = (float*)(((char*)d_array) + (row * pitch));

    for (int column = 0; column < columnCount; column++)
    {
      rowData[column] = 123.0; // make every value in the array 123.0
      destinationArray[(row * columnCount) + column] = rowData[column];
    }
  }
}

int main(int argc, char** argv)
{
  int columnCount = 15;
  int rowCount = 10;
  float* d_array;            // the device array which memory will be allocated to
  float* d_destinationArray; // the device array

  // allocate memory on the host
  float* h_array = new float[columnCount * rowCount];

  // the pitch value assigned by cudaMallocPitch
  // (which ensures correct data structure alignment)
  size_t pitch;

  // allocate the device memory for the source array
  cudaMallocPitch(&d_array, &pitch, columnCount * sizeof(float), rowCount);

  // allocate the device memory for the destination array
  cudaMalloc(&d_destinationArray, columnCount * rowCount * sizeof(float));

  // call the kernel which copies values from d_array to d_destinationArray
  CopyData<<<100, 512>>>(d_array, d_destinationArray, pitch, columnCount, rowCount);

  // copy the data back to the host memory
  cudaMemcpy(h_array,
             d_destinationArray,
             columnCount * rowCount * sizeof(float),
             cudaMemcpyDeviceToHost);

  // print out the values (all the values are 123.0)
  for (int i = 0; i < rowCount; i++)
  {
    for (int j = 0; j < columnCount; j++)
    {
      cout << "h_array[" << (i * columnCount) + j << "]=" << h_array[(i * columnCount) + j] << endl;
    }
  }

  return 0;
}
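Note that the example copies the flat d_destinationArray back with a plain cudaMemcpy. If you want to copy the pitched allocation itself to or from the host, cudaMemcpy2D() handles the per-row padding for you. A minimal sketch (not from the original article), reusing the variables from the example above:

// Sketch: copy the pitched device allocation d_array back into the densely
// packed host array h_array; cudaMemcpy2D strips the per-row padding.
cudaMemcpy2D(h_array,                      // destination (host, tightly packed)
             columnCount * sizeof(float),  // destination pitch in bytes
             d_array,                      // source (pitched device allocation)
             pitch,                        // source pitch returned by cudaMallocPitch
             columnCount * sizeof(float),  // width of the data in each row, in bytes
             rowCount,                     // number of rows
             cudaMemcpyDeviceToHost);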

The original article ends with a picture explaining the meaning of pitch (with the caveat that the numbers in it aren't realistic); the image is not reproduced here. The idea is that each row of the allocation occupies pitch bytes, of which only the first columnCount * sizeof(float) bytes hold data and the remainder is alignment padding.

