PyCUDA Study Notes -- pagelocked memory

Source: Internet. Editor: 程序博客网. Date: 2024/06/02 17:36

PyCUDA: pagelocked memory

In GPU programming, data has to be transferred from the CPU (host) to the GPU (device), and that transfer can take a while.

The usual way of transferring data:

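    import pycuda.driver as cuda
    import pycuda.autoinit  # creates a CUDA context on import
    from pycuda.compiler import SourceModule
    import numpy as np

    a = np.random.randn(256, 256)
    a = a.astype(np.float32)  # dtype is cut off in the source; float32 is assumed here

    a_gpu = cuda.mem_alloc(a.nbytes)  # allocate device memory of the same size
    cuda.memcpy_htod(a_gpu, a)        # copy from host (pageable) memory to the device

    # following code ...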

Transferring page-locked host memory from host to device (i.e. from CPU to GPU)

The code above can take quite a long time to transfer the data, so we try page-locked memory instead.

Background on page-locked (pinned) host memory

CPU data allocations are pageable by default. The GPU cannot access data directly from pageable host memory, so when a data transfer from pageable host memory to device memory is invoked, the CUDA driver must first allocate a temporary page-locked, or “pinned”, host array, copy the data to the pinned array, and then transfer the data from the pinned array to device memory.
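The extra staging copy described above can be made visible by timing the same host-to-device copy from a pageable array and from a page-locked one. The following is a minimal sketch, not a rigorous benchmark: the helper names (bandwidth_gb_per_s, time_htod, compare) are hypothetical, the array size is arbitrary, and pycuda is imported inside the functions so that the file still loads on a machine without a CUDA device.

```python
import time

import numpy as np


def bandwidth_gb_per_s(nbytes, seconds):
    """Convert a transfer of `nbytes` bytes that took `seconds` into GB/s."""
    return nbytes / seconds / 1e9


def time_htod(host_array, repeats=10):
    """Average seconds for one host-to-device copy of `host_array`.

    Needs a CUDA-capable GPU; pycuda is imported lazily (an assumption of
    this sketch) so the module can be loaded without one.
    """
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
    import pycuda.driver as cuda

    dev = cuda.mem_alloc(host_array.nbytes)
    cuda.memcpy_htod(dev, host_array)  # warm-up copy, not timed
    cuda.Context.synchronize()
    start = time.perf_counter()
    for _ in range(repeats):
        cuda.memcpy_htod(dev, host_array)
    cuda.Context.synchronize()  # make sure all copies have finished
    elapsed = (time.perf_counter() - start) / repeats
    dev.free()
    return elapsed


def compare(shape=(4096, 4096)):
    """Copy the same values from a pageable and from a page-locked array."""
    import pycuda.autoinit  # noqa: F401
    import pycuda.driver as cuda

    pageable = np.random.randn(*shape).astype(np.float32)
    pinned = cuda.pagelocked_empty(shape, np.float32)
    pinned[:] = pageable  # fill the pinned buffer with the same values
    for name, arr in (("pageable", pageable), ("pinned", pinned)):
        seconds = time_htod(arr)
        print(f"{name}: {bandwidth_gb_per_s(arr.nbytes, seconds):.2f} GB/s")
```

Calling compare() on a machine with a GPU prints one bandwidth figure per array type; the actual numbers depend on the hardware and driver.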

Device Interface:

pycuda.driver.pagelocked_empty(shape, dtype, order="C", mem_flags=0)

  • Allocates a page-locked numpy.ndarray with the given shape, dtype and order.
  • mem_flags: may be one of the values in host_alloc_flags. It may only be non-zero on CUDA 2.2 and newer.
    The default (0) corresponds to cudaMallocHost(void **ptr, size_t size) in CUDA, which allocates size bytes of host memory that is page-locked and accessible to the device.

    • PORTABLE:
      The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.

    • DEVICEMAP:
      Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().

    • WRITECOMBINED:
      Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host-to-device transfers.
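As an illustration of the mem_flags parameter, here is a small sketch that picks one of the flags listed above by name and passes it to pagelocked_empty. The helpers check_flag_name and alloc_pinned are hypothetical names introduced for this example; only pagelocked_empty and host_alloc_flags come from pycuda.driver. The allocation itself needs a CUDA device, so pycuda is imported inside the function.

```python
import numpy as np

# The flag names listed above; each is an attribute of
# pycuda.driver.host_alloc_flags.
HOST_ALLOC_FLAG_NAMES = ("PORTABLE", "DEVICEMAP", "WRITECOMBINED")


def check_flag_name(name):
    """Return the upper-cased flag name, or raise ValueError if unknown."""
    upper = name.upper()
    if upper not in HOST_ALLOC_FLAG_NAMES:
        raise ValueError(f"unknown host_alloc_flags value: {name!r}")
    return upper


def alloc_pinned(shape, dtype=np.float32, flag=None):
    """Allocate a page-locked array, optionally with one host_alloc_flags value.

    Needs a CUDA-capable GPU; pycuda is imported lazily (an assumption of
    this sketch) so the module can be loaded without one.
    """
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
    import pycuda.driver as cuda

    mem_flags = 0  # default: plain page-locked memory
    if flag is not None:
        mem_flags = getattr(cuda.host_alloc_flags, check_flag_name(flag))
    return cuda.pagelocked_empty(shape, dtype, mem_flags=mem_flags)
```

For example, alloc_pinned((256, 256), flag="writecombined") would request a write-combined buffer, which suits data that the CPU only writes and the GPU only reads.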

    cuda.pagelocked_empty(shape, dtype, order="C")
    cuda.pagelocked_zeros( .. )
    cuda.pagelocked_empty_like( array )
    cuda.pagelocked_zeros_like( array )
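Putting these allocation functions to use, a page-locked version of the earlier transfer might look like the sketch below. It assumes the data is first produced in an ordinary NumPy array and then copied into a pinned buffer; prepare_host_data and upload_pinned are hypothetical helper names, and the asynchronous copy on a stream is one possible choice rather than a requirement.

```python
import numpy as np


def prepare_host_data(rows, cols, seed=0):
    """Create float32 test data in ordinary (pageable) host memory."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((rows, cols)).astype(np.float32)


def upload_pinned(data):
    """Copy `data` into a page-locked buffer, then transfer it to the GPU.

    Needs a CUDA-capable GPU; pycuda is imported lazily (an assumption of
    this sketch) so the module can be loaded without one.
    Returns the device allocation and the pinned host array.
    """
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
    import pycuda.driver as cuda

    pinned = cuda.pagelocked_empty_like(data)  # same shape and dtype as data
    pinned[:] = data                           # CPU copy into pinned memory

    dev = cuda.mem_alloc(pinned.nbytes)
    stream = cuda.Stream()
    cuda.memcpy_htod_async(dev, pinned, stream)  # async copy needs pinned memory
    stream.synchronize()                         # wait until the copy is done
    return dev, pinned
```

A typical call would be dev, pinned = upload_pinned(prepare_host_data(256, 256)). Keeping a reference to the pinned array matters, because the host buffer must stay alive while the asynchronous copy is in flight.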

However, when I try to access the page-locked memory from the CPU, it appears to be very slow.
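One way to check this observation is to time the same CPU-side read on a pageable array and on page-locked arrays. The sketch below is only a measurement aid and does not claim any particular result: time_cpu_sum and compare_cpu_access are hypothetical names, and the WRITECOMBINED case is included only because the flag description above says such memory cannot be read efficiently by most CPUs.

```python
import time

import numpy as np


def time_cpu_sum(arr, repeats=20):
    """Average seconds for the CPU to read every element of `arr` once."""
    arr.sum()  # warm-up pass, not timed
    start = time.perf_counter()
    for _ in range(repeats):
        arr.sum()
    return (time.perf_counter() - start) / repeats


def compare_cpu_access(shape=(4096, 4096)):
    """Time CPU reads of pageable, pinned and write-combined pinned arrays.

    Needs a CUDA-capable GPU; pycuda is imported lazily (an assumption of
    this sketch) so the module can be loaded without one.
    """
    import pycuda.autoinit  # noqa: F401  (creates a CUDA context)
    import pycuda.driver as cuda

    wc_flag = cuda.host_alloc_flags.WRITECOMBINED
    candidates = {
        "pageable": np.empty(shape, dtype=np.float32),
        "pinned (default)": cuda.pagelocked_empty(shape, np.float32),
        "pinned (WRITECOMBINED)": cuda.pagelocked_empty(
            shape, np.float32, mem_flags=wc_flag
        ),
    }
    for name, arr in candidates.items():
        arr.fill(1.0)  # write the same values into every buffer
        print(f"{name}: {time_cpu_sum(arr) * 1e3:.3f} ms per read pass")
```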
