CUDA by Example, Chapter 3 — Partial Translation Notes: Querying GPU Device Parameters


This book covers a lot of ground, and much of it overlaps with other CUDA books, so I will only translate the key points — time is money. Let's learn CUDA together. Corrections are welcome.


I haven't had time to read Chapters 1 and 2 carefully, so we start with Chapter 3.

I prefer not to depend on the book's header file, so I rewrite every program myself; some programs are too trivial to bother with.

// hello.cu
#include <stdio.h>
#include <cuda.h>

int main(void) {
    printf("Hello, World!\n");
    return 0;
}

This first program is not really a CUDA program — it merely includes the CUDA header. Compile it with: nvcc hello.cu -o hello

Run it with: ./hello

It does not execute any work on the GPU at all.


The second program:


#include <stdio.h>
#include <cuda.h>

__global__ void kernel(void) {}

int main(void) {
    kernel<<<1,1>>>();
    printf("Hello, World!\n");
    return 0;
}

This program calls a kernel function. The __global__ qualifier means the function is called from the CPU (host) but executed on the GPU (device).

As for what the parameters inside the triple angle brackets mean — see the next chapter.
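As a preview (this is ahead of the book's explanation): the two values in <<<...>>> are the number of thread blocks and the number of threads per block. A minimal sketch, assuming a device with compute capability >= 2.0 so that device-side printf works:

```cuda
#include <stdio.h>
#include <cuda.h>

__global__ void whoami(void) {
    // Each thread can identify itself via built-in index variables.
    printf("block %d, thread %d\n", blockIdx.x, threadIdx.x);
}

int main(void) {
    whoami<<<2, 4>>>();        // 2 blocks x 4 threads = 8 threads in total
    cudaDeviceSynchronize();   // wait for the kernel (and its printf output) to finish
    return 0;
}
```

Build with nvcc as before; the 8 lines may appear in any order, since the threads run concurrently.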


#include <stdio.h>
#include <cuda.h>

__global__ void add(int a, int b, int *c) {
    *c = a + b;
}

int main(void)
{
    int c;
    int *dev_c;
    cudaMalloc((void**)&dev_c, sizeof(int));
    add<<<1,1>>>(2, 7, dev_c);
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("2 + 7 = %d\n", c);
    cudaFree(dev_c);
    return 0;
}

cudaMalloc() allocates memory on the GPU. cudaMemcpy() copies data between host and device: with cudaMemcpyDeviceToHost it copies results from the GPU back to the CPU, and with cudaMemcpyHostToDevice it copies input data from the CPU to the GPU.

cudaFree() releases GPU memory; it plays the same role as free() on the CPU, just for a different memory space.
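The example above ignores return values, but every CUDA runtime call actually returns a cudaError_t. A minimal checking wrapper (my own helper macro, not from the book) might look like this:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

// Hypothetical helper: abort with a readable message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

int main(void) {
    int c, *dev_c;
    CHECK(cudaMalloc((void**)&dev_c, sizeof(int)));
    CHECK(cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost));
    CHECK(cudaFree(dev_c));
    printf("all runtime calls succeeded\n");
    return 0;
}
```

Wrapping every call this way makes failures (e.g. out-of-memory, invalid device pointer) surface immediately instead of silently corrupting results.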


The important part of this chapter (for me) is 3.3, Querying Devices.


The idea: if you don't have the spec sheet for your GPU, or don't want to open the machine up to look, or you want your program to adapt to different hardware environments, you can query the GPU's parameters programmatically.


You can read the filler yourselves; I'll only cover what matters.


Many machines today have more than one GPU, especially systems built for GPU computing, so we first obtain the number of CUDA devices:

int count;

cudaGetDeviceCount(&count);


Then the cudaDeviceProp structure gives the device's properties.

The listing below is from CUDA 3.0.

This structure is defined by the CUDA runtime; you can use it in your own program directly without defining it yourself.


struct cudaDeviceProp {
    char name[256];               // device name
    size_t totalGlobalMem;        // total global memory, in bytes
    size_t sharedMemPerBlock;     // maximum shared memory usable by a thread block, in bytes; shared by all blocks resident on a multiprocessor
    int regsPerBlock;             // maximum number of 32-bit registers per block; shared by all blocks resident on a multiprocessor
    int warpSize;                 // warp size, in threads
    size_t memPitch;              // maximum pitch allowed for memory copies via cudaMallocPitch(), in bytes
    int maxThreadsPerBlock;       // maximum number of threads per block
    int maxThreadsDim[3];         // maximum size of each block dimension
    int maxGridSize[3];           // maximum size of each grid dimension
    size_t totalConstMem;         // total constant memory, in bytes
    int major;                    // major compute capability number
    int minor;                    // minor compute capability number
    int clockRate;                // clock frequency
    size_t textureAlignment;      // alignment requirement for textures
    int deviceOverlap;            // whether the device can execute cudaMemcpy() concurrently with a kernel
    int multiProcessorCount;      // number of multiprocessors on the device
    int kernelExecTimeoutEnabled; // whether there is a runtime limit on kernel execution
    int integrated;               // whether the GPU is integrated
    int canMapHostMemory;         // whether the device can map host memory into the CUDA device address space
    int computeMode;              // compute mode
    int maxTexture1D;             // maximum size of 1D textures
    int maxTexture2D[2];          // maximum dimensions of 2D textures
    int maxTexture3D[3];          // maximum dimensions of 3D textures
    int maxTexture2DArray[3];     // maximum dimensions of 2D texture arrays
    int concurrentKernels;        // whether the GPU supports executing multiple kernels concurrently
};


Example program:


#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main()
{
    int i;
    int count;
    cudaGetDeviceCount(&count);
    printf("The count of CUDA devices:%d\n", count);

    cudaDeviceProp prop;
    for (i = 0; i < count; i++)
    {
        cudaGetDeviceProperties(&prop, i);
        printf("\n---General Information for device %d---\n", i);
        printf("Name of the cuda device: %s\n", prop.name);
        printf("Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("Clock rate: %d\n", prop.clockRate);
        printf("Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution):  ");
        if (prop.deviceOverlap)
            printf("Enabled\n");
        else
            printf("Disabled\n");
        printf("Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): ");
        if (prop.kernelExecTimeoutEnabled)
            printf("Enabled\n");
        else
            printf("Disabled\n");

        printf("\n---Memory Information for device %d ---\n", i);
        printf("Total global mem in bytes: %ld\n", prop.totalGlobalMem);
        printf("Total constant Mem: %ld\n", prop.totalConstMem);
        printf("Max mem pitch for memory copies in bytes: %ld\n", prop.memPitch);
        printf("Texture Alignment: %ld\n", prop.textureAlignment);

        printf("\n---MP Information for device %d---\n", i);
        printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
        printf("Shared mem per mp(block): %ld\n", prop.sharedMemPerBlock);
        printf("Registers per mp(block):%d\n", prop.regsPerBlock);
        printf("Threads in warp:%d\n", prop.warpSize);
        printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
        printf("Max thread dimensions in a block:(%d,%d,%d)\n", prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
        printf("Max blocks dimensions in a grid:(%d,%d,%d)\n", prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        printf("\n");

        printf("\nIs the device an integrated GPU:");
        if (prop.integrated)
            printf("Yes!\n");
        else
            printf("No!\n");

        printf("Whether the device can map host memory into CUDA device address space:");
        if (prop.canMapHostMemory)
            printf("Yes!\n");
        else
            printf("No!\n");

        printf("Device's computing mode:%d\n", prop.computeMode);

        printf("\n The maximum size for 1D textures:%d\n", prop.maxTexture1D);
        printf("The maximum dimensions for 2D textures:(%d,%d)\n", prop.maxTexture2D[0], prop.maxTexture2D[1]);
        printf("The maximum dimensions for 3D textures:(%d,%d,%d)\n", prop.maxTexture3D[0], prop.maxTexture3D[1], prop.maxTexture3D[2]);
        // printf("The maximum dimensions for 2D texture arrays:(%d,%d,%d)\n", prop.maxTexture2DArray[0], prop.maxTexture2DArray[1], prop.maxTexture2DArray[2]);

        printf("Whether the device supports executing multiple kernels within the same context simultaneously:");
        if (prop.concurrentKernels)
            printf("Yes!\n");
        else
            printf("No!\n");
    }
    return 0;
}

Output:

The count of CUDA devices:1

---General Information for device 0---
Name of the cuda device: GeForce GTX 470
Compute capability: 2.0
Clock rate: 1215000
Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution):  Enabled
Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Enabled

---Memory Information for device 0 ---
Total global mem in bytes: 1341325312
Total constant Mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture Alignment: 512

---MP Information for device 0---
Multiprocessor count: 14
Shared mem per mp(block): 49152
Registers per mp(block):32768
Threads in warp:32
Max threads per block: 1024
Max thread dimensions in a block:(1024,1024,64)
Max blocks dimensions in a grid:(65535,65535,65535)


Is the device an integrated GPU:No!
Whether the device can map host memory into CUDA device address space:Yes!
Device's computing mode:0

 The maximum size for 1D textures:65536
The maximum dimensions for 2D textures:(65536,65535)
The maximum dimensions for 3D textures:(2048,2048,2048)
Whether the device supports executing multiple kernels within the same context simultaneously:Yes!


Sample machine 2:

The count of CUDA devices:2

---General Information for device 0---
Name of the cuda device: Tesla K20c
Compute capability: 3.5
Clock rate: 705500
Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution):  Enabled
Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Disabled

---Memory Information for device 0 ---
Total global mem in bytes: 5032706048
Total constant Mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture Alignment: 512

---MP Information for device 0---
Multiprocessor count: 13
Shared mem per mp(block): 49152
Registers per mp(block):65536
Threads in warp:32
Max threads per block: 1024
Max thread dimensions in a block:(1024,1024,64)
Max blocks dimensions in a grid:(2147483647,65535,65535)


Is the device an integrated GPU:No!
Whether the device can map host memory into CUDA device address space:Yes!
Device's computing mode:0

 The maximum size for 1D textures:65536
The maximum dimensions for 2D textures:(65536,65536)
The maximum dimensions for 3D textures:(4096,4096,4096)
Whether the device supports executing multiple kernels within the same context simultaneously:Yes!

---General Information for device 1---
Name of the cuda device: GeForce GTX 480
Compute capability: 2.0
Clock rate: 1401000
Device copy overlap(simultaneously perform a cudaMemcpy() and kernel execution):  Enabled
Kernel execution timeout(whether there is a runtime limit for kernels executed on this device): Enabled

---Memory Information for device 1 ---
Total global mem in bytes: 1610153984
Total constant Mem: 65536
Max mem pitch for memory copies in bytes: 2147483647
Texture Alignment: 512

---MP Information for device 1---
Multiprocessor count: 15
Shared mem per mp(block): 49152
Registers per mp(block):32768
Threads in warp:32
Max threads per block: 1024
Max thread dimensions in a block:(1024,1024,64)
Max blocks dimensions in a grid:(65535,65535,65535)


Is the device an integrated GPU:No!
Whether the device can map host memory into CUDA device address space:Yes!
Device's computing mode:0

 The maximum size for 1D textures:65536
The maximum dimensions for 2D textures:(65536,65535)
The maximum dimensions for 3D textures:(2048,2048,2048)
Whether the device supports executing multiple kernels within the same context simultaneously:Yes!
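Once you can read these properties, you can also let the runtime pick a suitable device for you instead of looping over all of them by hand. This chapter of the book does it with cudaChooseDevice(); a sketch, requesting a device with compute capability at least 1.3:

```cuda
#include <stdio.h>
#include <string.h>
#include <cuda.h>

int main(void) {
    cudaDeviceProp prop;
    int dev;

    // Zero the property struct, then state only what we require.
    memset(&prop, 0, sizeof(cudaDeviceProp));
    prop.major = 1;
    prop.minor = 3;

    // Ask the runtime for the closest matching device and make it current.
    cudaChooseDevice(&dev, &prop);
    printf("ID of the device closest to revision 1.3: %d\n", dev);
    cudaSetDevice(dev);
    return 0;
}
```

On sample machine 2 above, this would select the Tesla K20c rather than the GTX 480 only if the requested properties favored it; cudaChooseDevice() returns the best match, which may not satisfy every requested field exactly.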



Reference: CUDA by Example
