CUDA by Example, Chapter 3 — Partial Translation Notes: Querying GPU Device Parameters


This book covers a lot of ground, and much of it overlaps with other CUDA books, so I will only translate the key points — time is money. Let's learn CUDA together. Corrections are welcome.


I haven't had time to read Chapters 1 and 2 carefully, so we start with Chapter 3.

I prefer not to depend on the book's header file, so I rewrite every program myself; some programs are too trivial to bother with.

// hello.cu
#include <stdio.h>
#include <cuda.h>

int main(void) {
    printf("Hello, World!\n");
    return 0;
}

This first program is not really a CUDA program — it merely includes the CUDA header. Compile it with: nvcc hello.cu -o hello

Run it with: ./hello

It does not execute any work on the GPU at all.


The second program:


#include <stdio.h>
#include <cuda.h>

__global__ void kernel(void) {}

int main(void) {
    kernel<<<1,1>>>();
    printf("Hello, World!\n");
    return 0;
}

This program calls a kernel function. The __global__ qualifier means the function is called from the CPU (host) but executed on the GPU (device).

As for what the parameters inside the triple angle brackets mean — see the next chapter.
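As a preview (this is ahead of the book's explanation): the two values in <<<...>>> are the number of thread blocks and the number of threads per block. A minimal sketch, assuming a device with compute capability >= 2.0 so that device-side printf works:

```cuda
#include <stdio.h>
#include <cuda.h>

__global__ void whoami(void) {
    // Each thread can identify itself via built-in index variables.
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    whoami<<<2, 4>>>();        // 2 blocks x 4 threads = 8 threads in total
    cudaDeviceSynchronize();   // wait for the kernel (and its printf output) to finish
    return 0;
}
```

Build with nvcc as before; the 8 lines may appear in any order, since the threads run concurrently.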


#include <stdio.h>
#include <cuda.h>

__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main(void)
{
    int c;
    int *dev_c;
    cudaMalloc((void**)&dev_c, sizeof(int));
    add<<<1,1>>>(2, 7, dev_c);
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);
    cudaFree(dev_c);
    return 0;
}

cudaMalloc() allocates memory on the GPU. cudaMemcpy() copies data between host and device: with cudaMemcpyDeviceToHost it copies results from the GPU back to the CPU, and with cudaMemcpyHostToDevice it copies input data from the CPU to the GPU.

cudaFree() releases GPU memory; it plays the same role as free() on the CPU, just for a different memory space.
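The example above ignores return values, but every CUDA runtime call actually returns a cudaError_t. A minimal checking wrapper (my own helper macro, not from the book) might look like this:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Hypothetical helper: abort with a readable message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main(void) {
    int c, *dev_c;
    CHECK(cudaMalloc((void**)&dev_c, sizeof(int)));
    CHECK(cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(dev_c));
    printf("all runtime calls succeeded\n");
    return 0;
}
```

Wrapping every call this way makes failures (e.g. out-of-memory, invalid device pointer) surface immediately instead of silently corrupting results.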


The important part of this chapter (for me) is 3.3, Querying Devices.


The idea: if you don't have the spec sheet for your GPU, or don't want to open the machine up to look, or you want your program to adapt to different hardware environments, you can query the GPU's parameters programmatically.


You can read the filler yourselves; I'll only cover what matters.


Many machines today have more than one GPU, especially systems built for GPU computing, so we first obtain the number of CUDA devices:

int count;

cudaGetDeviceCount(&count);


Then the cudaDeviceProp structure gives the device's properties.

The listing below is from CUDA 3.0.

This structure is defined by the CUDA runtime; you can use it in your own program directly without defining it yourself.


struct cudaDeviceProp {
    char name[256];               // device name
    size_t totalGlobalMem;        // total global memory, in bytes
    size_t sharedMemPerBlock;     // maximum shared memory usable by a thread block, in bytes; shared by all blocks resident on a multiprocessor
    int regsPerBlock;             // maximum number of 32-bit registers per block; shared by all blocks resident on a multiprocessor
    int warpSize;                 // warp size, in threads
    size_t memPitch;              // maximum pitch allowed for memory copies via cudaMallocPitch(), in bytes
    int maxThreadsPerBlock;       // maximum number of threads per block
    int maxThreadsDim[3];         // maximum size of each block dimension
    int maxGridSize[3];           // maximum size of each grid dimension
    size_t totalConstMem;         // total constant memory, in bytes
    int major;                    // major compute capability number
    int minor;                    // minor compute capability number
    int clockRate;                // clock frequency
    size_t textureAlignment;      // alignment requirement for textures
    int deviceOverlap;            // whether the device can execute cudaMemcpy() concurrently with a kernel
    int multiProcessorCount;      // number of multiprocessors on the device
    int kernelExecTimeoutEnabled; // whether there is a runtime limit on kernel execution
    int integrated;               // whether the GPU is integrated
    int canMapHostMemory;         // whether the device can map host memory into the CUDA device address space
    int computeMode;              // compute mode
    int maxTexture1D;             // maximum size of 1D textures
    int maxTexture2D[2];          // maximum dimensions of 2D textures
    int maxTexture3D[3];          // maximum dimensions of 3D textures
    int maxTexture2DArray[3];     // maximum dimensions of 2D texture arrays
    int concurrentKernels;        // whether the GPU supports executing multiple kernels concurrently
};


Example program:


#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main()
{
    int i;
    int count;
    cudaGetDeviceCount(&count);
    printf("The count of CUDA devices:%d\n", count);

    cudaDeviceProp prop;
    for (i = 0; i < count; i++)
    {
        cudaGetDeviceProperties(&prop, i);
        printf("\n---General Information for device %d---\n", i);
        printf("Name of the cuda device: %s\n", prop.name);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Clock rate: %d\n", prop.clockRate);
        printf("Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution):  ");
        if (prop.deviceOverlap)
            printf("Enabled\n");
        else
            printf("Disabled\n");
        printf("Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): ");
        if (prop.kernelExecTimeoutEnabled)
            printf("Enabled\n");
        else
            printf("Disabled\n");

        printf("\n---Memory Information for device %d ---\n", i);
        printf("Total global mem in bytes: %ld\n", prop.totalGlobalMem);
        printf("Total constant Mem: %ld\n", prop.totalConstMem);
        printf("Max mem pitch for memory copies in bytes: %ld\n", prop.memPitch);
        printf("Texture Alignment: %ld\n", prop.textureAlignment);

        printf("\n---MP Information for device %d---\n", i);
        printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
        printf("Shared mem per mp(block): %ld\n", prop.sharedMemPerBlock);
        printf("Registers per mp(block):%d\n", prop.regsPerBlock);
        printf("Threads in warp:%d\n", prop.warpSize);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max thread dimensions in a block:(%d,%d,%d)\n", prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max blocks dimensions in a grid:(%d,%d,%d)\n", prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("\n");

        printf("\nIs the device an integrated GPU:");
        if (prop.integrated)
            printf("Yes!\n");
        else
            printf("No!\n");

        printf("Whether the device can map host memory into CUDA device address space:");
        if (prop.canMapHostMemory)
            printf("Yes!\n");
        else
            printf("No!\n");

        printf("Device's computing mode:%d\n", prop.computeMode);

        printf("\n The maximum size for 1D textures:%d\n", prop.maxTexture1D);
        printf("The maximum dimensions for 2D textures:(%d,%d)\n", prop.maxTexture2D[0], prop.maxTexture2D[1]);
        printf("The maximum dimensions for 3D textures:(%d,%d,%d)\n", prop.maxTexture3D[0], prop.maxTexture3D[1], prop.maxTexture3D[2]);
        // printf("The maximum dimensions for 2D texture arrays:(%d,%d,%d)\n", prop.maxTexture2DArray[0], prop.maxTexture2DArray[1], prop.maxTexture2DArray[2]);

        printf("Whether the device supports executing multiple kernels within the same context simultaneously:");
        if (prop.concurrentKernels)
            printf("Yes!\n");
        else
            printf("No!\n");
    }
    return 0;
}

Output:

The count of CUDA devices:1

---General Information for device 0---
Name of the cuda device: GeForce GTX 470
Compute capability: 2.0
Clock rate: 1215000
Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution):  Enabled
Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Enabled

---Memory Information for device 0 ---
Total global mem in bytes: 1341325312
Total constant Mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture Alignment: 512

---MP Information for device 0---
Multiprocessor count: 14
Shared mem per mp(block): 49152
Registers per mp(block):32768
Threads in warp:32
Max threads per block: 1024
Max thread dimensions in a block:(1024,1024,64)
Max blocks dimensions in a grid:(65535,65535,65535)


Is the device an integrated GPU:No!
Whether the device can map host memory into CUDA device address space:Yes!
Device's computing mode:0

 The maximum size for 1D textures:65536
The maximum dimensions for 2D textures:(65536,65535)
The maximum dimensions for 3D textures:(2048,2048,2048)
Whether the device supports executing multiple kernels within the same context simultaneously:Yes!


Sample machine 2:

The count of CUDA devices:2

---General Information for device 0---
Name of the cuda device: Tesla K20c
Compute capability: 3.5
Clock rate: 705500
Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution):  Enabled
Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Disabled

---Memory Information for device 0 ---
Total global mem in bytes: 5032706048
Total constant Mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture Alignment: 512

---MP Information for device 0---
Multiprocessor count: 13
Shared mem per mp(block): 49152
Registers per mp(block):65536
Threads in warp:32
Max threads per block: 1024
Max thread dimensions in a block:(1024,1024,64)
Max blocks dimensions in a grid:(2147483647,65535,65535)


Is the device an integrated GPU:No!
Whether the device can map host memory into CUDA device address space:Yes!
Device's computing mode:0

 The maximum size for 1D textures:65536
The maximum dimensions for 2D textures:(65536,65536)
The maximum dimensions for 3D textures:(4096,4096,4096)
Whether the device supports executing multiple kernels within the same context simultaneously:Yes!

---General Information for device 1---
Name of the cuda device: GeForce GTX 480
Compute capability: 2.0
Clock rate: 1401000
Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution):  Enabled
Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Enabled

---Memory Information for device 1 ---
Total global mem in bytes: 1610153984
Total constant Mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture Alignment: 512

---MP Information for device 1---
Multiprocessor count: 15
Shared mem per mp(block): 49152
Registers per mp(block):32768
Threads in warp:32
Max threads per block: 1024
Max thread dimensions in a block:(1024,1024,64)
Max blocks dimensions in a grid:(65535,65535,65535)


Is the device an integrated GPU:No!
Whether the device can map host memory into CUDA device address space:Yes!
Device's computing mode:0

 The maximum size for 1D textures:65536
The maximum dimensions for 2D textures:(65536,65535)
The maximum dimensions for 3D textures:(2048,2048,2048)
Whether the device supports executing multiple kernels within the same context simultaneously:Yes!
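Once you can read these properties, you can also let the runtime pick a suitable device for you instead of looping over all of them by hand. This chapter of the book does it with cudaChooseDevice(); a sketch, requesting a device with compute capability at least 1.3:

```cuda
#include <stdio.h>
#include <string.h>
#include <cuda.h>

int main(void) {
    cudaDeviceProp prop;
    int dev;

    // Zero the property struct, then state only what we require.
    memset(&prop, 0, sizeof(cudaDeviceProp));
    prop.major = 1;
    prop.minor = 3;

    // Ask the runtime for the closest matching device and make it current.
    cudaChooseDevice(&dev, &prop);
    printf("ID of the device closest to revision 1.3: %d\n", dev);
    cudaSetDevice(dev);
    return 0;
}
```

On sample machine 2 above, this would select the Tesla K20c rather than the GTX 480 only if the requested properties favored it; cudaChooseDevice() returns the best match, which may not satisfy every requested field exactly.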



Reference: CUDA by Example
