CUDA Pinned memory
When computational codes are accelerated by parallel computing on graphics processing units (GPUs), the data to be processed must be transferred from system (host) memory to the graphics card's (device) memory, and the results must be copied back from device memory to host memory. In a code accelerated by general-purpose GPU (GPGPU) computing, such transfers can occur many times and may affect overall performance, so it matters that they are carried out as fast as possible.
To allow programmers to use a virtual address space larger than the RAM actually available, CPUs (or hosts, in GPGPU terminology) implement a virtual memory system. Ordinary host allocations are pageable (non-locked / non-pinned) memory: a physical memory page may be swapped out to disk, and when the host needs that page again, it loads it back in from disk. The drawback for CPU⟷GPU memory transfers is that transfers from pageable memory are slower, i.e., the bandwidth of the PCI-E bus connecting CPU and GPU is not fully exploited. PCI-E transfers are carried out only by Direct Memory Access (DMA), and DMA needs the pages to stay at fixed physical addresses. Pageable memory is not guaranteed to reside in main memory (e.g. it can be in swap), so the driver has to access every single page of it, copy it into a pinned buffer and pass that buffer to the DMA engine (a synchronous, page-by-page copy). Accordingly, when a “normal” transfer is issued, a block of page-locked memory must be allocated, the data copied by the host from regular memory into the page-locked / pinned block, the transfer performed and waited for, and the page-locked / pinned block released. This consumes precious host time, which is avoided when page-locked memory is used directly.
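The cost of the staged copy described above can be observed with a small timing experiment. The following is only a minimal sketch, not a rigorous benchmark: the 64 MiB buffer size, the helper name timeH2D and the CUDA_CHECK macro are assumptions made for illustration, and the measured figures will depend on the GPU, the bus and the driver. It times the same host-to-device copy once from a malloc'ed (pageable) buffer and once from a page-locked buffer obtained with cudaMallocHost.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: time one host-to-device copy and return milliseconds.
static float timeH2D(void *d_dst, const void *h_src, size_t bytes) {
    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));
    CUDA_CHECK(cudaEventRecord(start));
    CUDA_CHECK(cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaEventRecord(stop));
    CUDA_CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));
    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    return ms;
}

int main() {
    const size_t bytes = 64u << 20;  // 64 MiB, an arbitrary illustrative size

    void *h_pageable = malloc(bytes);  // ordinary pageable host memory
    if (h_pageable == nullptr) return EXIT_FAILURE;
    void *h_pinned = nullptr;          // page-locked host memory
    void *d_buf = nullptr;             // device memory
    CUDA_CHECK(cudaMallocHost(&h_pinned, bytes));
    CUDA_CHECK(cudaMalloc(&d_buf, bytes));

    // Touch both host buffers so their pages are actually backed by RAM.
    memset(h_pageable, 1, bytes);
    memset(h_pinned, 1, bytes);

    // Warm-up copy so one-time initialization cost is not measured.
    CUDA_CHECK(cudaMemcpy(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice));

    const float msPageable = timeH2D(d_buf, h_pageable, bytes);
    const float msPinned = timeH2D(d_buf, h_pinned, bytes);

    // Bandwidth in GB/s = bytes / seconds / 1e9.
    printf("pageable: %.3f ms (%.2f GB/s)\n", msPageable,
           bytes / (msPageable * 1e-3) / 1e9);
    printf("pinned:   %.3f ms (%.2f GB/s)\n", msPinned,
           bytes / (msPinned * 1e-3) / 1e9);

    CUDA_CHECK(cudaFree(d_buf));
    CUDA_CHECK(cudaFreeHost(h_pinned));
    free(h_pageable);
    return 0;
}
```

On many systems the pinned copy is expected to report the higher bandwidth, but the size of the gap is hardware dependent and should be measured rather than assumed.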
However, with today's memory sizes, many applications fit entirely within host RAM and do not need their pages swapped out to disk. In all these cases, it is more convenient to use page-locked / pinned memory, which enables a DMA engine on the GPU to carry out transfers to and from host memory without the involvement of the CPU. In other words, locked memory is guaranteed to stay in physical memory (RAM), so the device can fetch it without the help of the host (asynchronous copy).
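Because the DMA engine can read pinned memory on its own, a copy from a pinned buffer can be issued asynchronously with cudaMemcpyAsync on a stream, leaving the host free until it synchronizes. The sketch below illustrates this under stated assumptions: the kernel scale, the array size and the CUDA_CHECK macro are hypothetical names and values chosen only for the example.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with a message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative kernel: multiply every element of x by a.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;                 // arbitrary element count
    const size_t bytes = n * sizeof(float);

    float *h_data = nullptr;               // pinned host buffer
    float *d_data = nullptr;               // device buffer
    CUDA_CHECK(cudaHostAlloc((void **)&h_data, bytes, cudaHostAllocDefault));
    CUDA_CHECK(cudaMalloc((void **)&d_data, bytes));
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    cudaStream_t stream;
    CUDA_CHECK(cudaStreamCreate(&stream));

    // Copy in, run the kernel, copy out: all queued on the same stream,
    // so they execute in order on the device while the host continues.
    CUDA_CHECK(cudaMemcpyAsync(d_data, h_data, bytes,
                               cudaMemcpyHostToDevice, stream));
    scale<<<(n + 255) / 256, 256, 0, stream>>>(d_data, n, 2.0f);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaMemcpyAsync(h_data, d_data, bytes,
                               cudaMemcpyDeviceToHost, stream));

    // The host could do unrelated work here; h_data must not be touched
    // until the stream has finished.
    CUDA_CHECK(cudaStreamSynchronize(stream));

    // Verify that every element was doubled.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_data[i] != 2.0f) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    CUDA_CHECK(cudaStreamDestroy(stream));
    CUDA_CHECK(cudaFree(d_data));
    CUDA_CHECK(cudaFreeHost(h_data));
    return mismatches == 0 ? 0 : 1;
}
```

If h_data were allocated with malloc instead, the same calls would still be accepted, but the copies would generally fall back to the staged path and would not be expected to overlap with host work in the same way.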
GPU memory is automatically allocated as page-locked, since GPU memory does not support swapping to disk. To allocate page-locked memory on the host in CUDA, one can use cudaHostAlloc (or cudaMallocHost); memory obtained this way must be released with cudaFreeHost rather than free.
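As a minimal sketch of the allocation life cycle, the example below allocates a pinned host buffer with cudaHostAlloc, sends it to the device and back with plain cudaMemcpy, and releases it with cudaFreeHost. The buffer size and the variable names are assumptions made for illustration, and error handling is reduced to a single hypothetical check function.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: exit with a message if a CUDA runtime call failed.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

int main() {
    const int n = 1024;                    // arbitrary element count
    const size_t bytes = n * sizeof(int);

    int *h_in = nullptr, *h_out = nullptr; // pinned host buffers
    int *d_buf = nullptr;                  // device buffer

    // cudaHostAllocDefault requests plain page-locked memory.
    check(cudaHostAlloc((void **)&h_in, bytes, cudaHostAllocDefault),
          "cudaHostAlloc(h_in)");
    check(cudaHostAlloc((void **)&h_out, bytes, cudaHostAllocDefault),
          "cudaHostAlloc(h_out)");
    check(cudaMalloc((void **)&d_buf, bytes), "cudaMalloc");

    for (int i = 0; i < n; ++i) h_in[i] = i;

    // Round trip: host -> device -> host.
    check(cudaMemcpy(d_buf, h_in, bytes, cudaMemcpyHostToDevice), "H2D copy");
    check(cudaMemcpy(h_out, d_buf, bytes, cudaMemcpyDeviceToHost), "D2H copy");

    // The data read back should match what was sent.
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (h_out[i] != h_in[i]) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    // Pinned memory is released with cudaFreeHost, not free().
    check(cudaFree(d_buf), "cudaFree");
    check(cudaFreeHost(h_in), "cudaFreeHost(h_in)");
    check(cudaFreeHost(h_out), "cudaFreeHost(h_out)");
    return mismatches == 0 ? 0 : 1;
}
```

Since page-locked pages cannot be swapped out, pinning very large buffers reduces the RAM available to the rest of the system, so it is usually reserved for buffers that actually take part in transfers.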