CUDA: Increase Performance with Vectorized Memory Loads


Reposted from https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-increase-performance-with-vectorized-memory-access/

Note the following caveat from the post:
In almost all cases vectorized loads are preferable to scalar loads. Note however that using vectorized loads increases register pressure and reduces overall parallelism. So if you have a kernel that is already register limited or has very low parallelism, you may want to stick to scalar loads. Also, as discussed earlier, if your pointer is not aligned or your data type size in bytes is not a power of two you cannot use vectorized loads.

So this is a trade-off you have to evaluate for your own kernel.
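To make the trade-off concrete, here is a minimal sketch of a vectorized copy kernel in the style described by the linked post. The kernel name and the assumptions that N is a multiple of 4 and that the pointers come from cudaMalloc (and are therefore 16-byte aligned) are mine, not the post's:

__global__ void copy_vec4(int *d_out, const int *d_in, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    // Reinterpret the int pointers as int4 so each iteration issues one
    // 128-bit load and one 128-bit store instead of four 32-bit ones.
    for (int i = idx; i < N / 4; i += stride) {
        reinterpret_cast<int4 *>(d_out)[i] = reinterpret_cast<const int4 *>(d_in)[i];
    }
    // Assumes N % 4 == 0; otherwise the remainder must be copied with scalar loads.
}

Each thread now moves 16 bytes per load/store pair, cutting the instruction count for the same bandwidth, at the cost of slightly higher register pressure.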


CUDA's built-in vector types are automatically aligned.
These are vector types derived from the basic integer and floating-point types. They are structures, and the 1st, 2nd, 3rd, and 4th components are accessible through the fields x, y, z, and w, respectively. They all come with a constructor function of the form make_<type name>; for example

int2 make_int2(int x, int y);

which creates a vector of type int2 with value (x, y).
In other words, int2 is essentially

struct int2 {
    int x;
    int y;
};  // the real definition in vector_types.h also carries an 8-byte alignment attribute

The other vector types (float2, int4, double2, and so on) are similar.
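For completeness, a tiny host-side usage sketch (a hypothetical example, compiled with nvcc, which pulls in the vector types automatically) showing the constructor function and the field access described above:

#include <cstdio>

int main() {
    // make_int2 is the constructor function mentioned above; x and y are the fields.
    int2 p = make_int2(3, 7);
    printf("p = (%d, %d)\n", p.x, p.y);  // prints: p = (3, 7)
    return 0;
}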
