CUDA:Memory

Data Transfer

Overlap Data Transfers

Notes on asynchronous data transfers.

When a stream is passed to a cudaMemcpyAsync call, you can insert an event into that stream after the asynchronous transfer and then use cudaEventSynchronize to synchronize on that event.

This guarantees that the copy has completed, without relying on the device being idle or the stream being empty.
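The pattern described above can be sketched roughly as follows. This is a minimal illustration, not code from the referenced slides: the helper name `copyAndWait`, the `CHECK` macro, the buffer names, and the 1 MiB size are all assumptions made for the example. It uses pinned host memory (`cudaMallocHost`), since a copy from pageable memory may not be truly asynchronous.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper macro: abort with a message if a CUDA call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Hypothetical helper: copy `bytes` to the device on a stream and wait
// for that copy only. Returns 1 if the device data matches the host data.
int copyAndWait(size_t bytes)
{
    unsigned char *h_a, *h_check, *d_a;
    cudaStream_t stream1;
    cudaEvent_t copyDone;

    CHECK(cudaMallocHost((void **)&h_a, bytes));  // pinned host buffer
    CHECK(cudaMalloc((void **)&d_a, bytes));
    h_check = (unsigned char *)malloc(bytes);
    if (h_check == NULL) exit(EXIT_FAILURE);
    CHECK(cudaStreamCreate(&stream1));
    CHECK(cudaEventCreateWithFlags(&copyDone, cudaEventDisableTiming));

    for (size_t i = 0; i < bytes; ++i) h_a[i] = (unsigned char)(i & 0xff);

    // Queue the copy, then record an event right behind it in the same stream.
    CHECK(cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, stream1));
    CHECK(cudaEventRecord(copyDone, stream1));

    // The host could do unrelated work here while the copy is in flight.

    // Blocks until the event, and therefore the copy queued before it, is done.
    CHECK(cudaEventSynchronize(copyDone));

    // Read the data back with a blocking copy and compare.
    CHECK(cudaMemcpy(h_check, d_a, bytes, cudaMemcpyDeviceToHost));
    int ok = (memcmp(h_a, h_check, bytes) == 0);

    CHECK(cudaEventDestroy(copyDone));
    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaFree(d_a));
    CHECK(cudaFreeHost(h_a));
    free(h_check);
    return ok;
}

int main(void)
{
    printf("copy verified: %d\n", copyAndWait((size_t)1 << 20));
    return 0;
}
```

Waiting on the event rather than on the whole device means other streams can keep running while the host waits for this one copy.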

APIs

cudaMemcpyAsync

result = cudaMemcpyAsync(d_a, a, N, cudaMemcpyHostToDevice, stream1);
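As a rough sketch of how this call is typically used to overlap transfers with computation, the example below splits one buffer into chunks and issues copy, kernel, and copy-back for each chunk on its own stream. The kernel `addOne`, the helper `runOverlapped`, the `CHECK` macro, and the sizes are hypothetical names and values chosen for illustration. Whether the chunks actually overlap depends on the device (for example, on its number of copy engines), so treat this as a pattern rather than a guaranteed speedup.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper macro: abort with a message if a CUDA call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

#define N        (1 << 20)  // total number of floats (illustrative)
#define NSTREAMS 4          // N must be divisible by NSTREAMS

// Illustrative kernel: add 1 to each of the n elements.
__global__ void addOne(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

// Returns the number of elements that did not end up as 2.0f.
int runOverlapped(void)
{
    const int chunk = N / NSTREAMS;
    const size_t chunkBytes = chunk * sizeof(float);
    float *h_a, *d_a;
    cudaStream_t streams[NSTREAMS];

    CHECK(cudaMallocHost((void **)&h_a, N * sizeof(float)));  // pinned host memory
    CHECK(cudaMalloc((void **)&d_a, N * sizeof(float)));
    for (int i = 0; i < N; ++i) h_a[i] = 1.0f;
    for (int s = 0; s < NSTREAMS; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Each stream handles one chunk: copy in, compute, copy out.
    for (int s = 0; s < NSTREAMS; ++s) {
        const int offset = s * chunk;
        CHECK(cudaMemcpyAsync(d_a + offset, h_a + offset, chunkBytes,
                              cudaMemcpyHostToDevice, streams[s]));
        addOne<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_a + offset, chunk);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h_a + offset, d_a + offset, chunkBytes,
                              cudaMemcpyDeviceToHost, streams[s]));
    }

    // Wait for every stream to finish before reading the results on the host.
    for (int s = 0; s < NSTREAMS; ++s) {
        CHECK(cudaStreamSynchronize(streams[s]));
        CHECK(cudaStreamDestroy(streams[s]));
    }

    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (h_a[i] != 2.0f) ++mismatches;

    CHECK(cudaFree(d_a));
    CHECK(cudaFreeHost(h_a));
    return mismatches;
}

int main(void)
{
    printf("mismatches: %d\n", runOverlapped());
    return 0;
}
```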

Obtaining the memory alignment value

Gist - investigation on cudaMalloc alignment => aligned to at least 512 bytes

#include <sys/time.h>
#include <cuda_runtime.h>
#include <stdio.h>

void test(int size)
{
    float *d1, *d2;

    cudaMalloc(&d1, size);
    cudaMalloc(&d2, size);
    // Byte distance between two consecutive allocations (assumes d2 > d1).
    printf("Alignment: %zu\n", (size_t)(d2 - d1) * sizeof(float));
    cudaFree(d1);
    cudaFree(d2);
}

int main(int argc, char **argv)
{
    // set up device
    int dev = 0;
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, dev);
    printf("Using Device %d: %s\n", dev, deviceProp.name);
    cudaSetDevice(dev);

    test(1);

    return(0);
}

Result:

Using Device 0: GeForce GTX TITAN X
Alignment: 512

On AWS p2.xlarge (K80):

Using Device 0: Tesla K80
Alignment: 512

The CUDA C Best Practices Guide (https://www.classes.cs.uchicago.edu/archive/2011/winter/32102-1/reading/CUDA_C_Best_Practices_Guide.pdf) states:

cudaMalloc(), is guaranteed to be aligned to at least 256 bytes

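In the runs above, however, consecutive allocations are spaced 512 bytes apart, so the allocations appear to be aligned to at least 512 bytes. This still satisfies the "at least 256 bytes" guarantee, but it suggests that the document is old.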

ref. https://stackoverflow.com/questions/36534599/cuda-malloc-minimum-and-typical-actual-alignment

256 on Fermi, 512 on the Kepler, Maxwell

So the alignment was 256 bytes on older GPUs such as Fermi, but it is 512 bytes on more recent ones such as Kepler and Maxwell.
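A more direct way to check this is to look at the returned addresses themselves instead of the distance between two allocations. The sketch below is an illustration under stated assumptions: `lowestSetBit` and `minObservedAlignment` are hypothetical helper names, and the count of 16 one-byte allocations is arbitrary. It keeps several small allocations alive at once and reports the smallest power-of-two alignment seen among their addresses. The result is only an observation on one device and driver, not a documented guarantee.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

// Largest power of two that divides addr (addr must be non-zero).
static uintptr_t lowestSetBit(uintptr_t addr)
{
    return addr & (~addr + 1);
}

// Hypothetical helper: smallest power-of-two alignment observed over up to
// `count` (at most 64) simultaneous 1-byte allocations. Returns 0 if no
// allocation succeeded.
uintptr_t minObservedAlignment(int count)
{
    void *ptrs[64];
    uintptr_t minAlign = 0;
    int n = 0;

    // Keep all allocations alive so each call returns a distinct address.
    for (int i = 0; i < count && i < 64; ++i) {
        if (cudaMalloc(&ptrs[n], 1) != cudaSuccess) break;
        uintptr_t a = lowestSetBit((uintptr_t)ptrs[n]);
        if (minAlign == 0 || a < minAlign) minAlign = a;
        ++n;
    }

    for (int i = 0; i < n; ++i) cudaFree(ptrs[i]);
    return minAlign;
}

int main(void)
{
    printf("Smallest observed alignment: %lu bytes\n",
           (unsigned long)minObservedAlignment(16));
    return 0;
}
```

Because the first allocation often lands on a much larger boundary, taking the minimum over several live allocations gives a more representative value than inspecting a single pointer.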

References

CUDA_C_and_CPP_-Streams_and_Concurrency-Steve_Rennich-_NVIDIA.pdf