CUDA:Memory
Data Transfer
- How to Optimize Data Transfers in CUDA C/C++ | NVIDIA Technical Blog
- How to Implement Performance Metrics in CUDA C/C++ | NVIDIA Technical Blog
Overlap Data Transfers
- How to Overlap Data Transfers in CUDA C/C++ | NVIDIA Technical Blog
- cuda - How do I know that cudaMemcpyAsync is done reading host memory? - Stack Overflow
- Slide 1 - StreamsAndConcurrencyWebinar.pdf [1]
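Notes on asynchronous data transfer.
When a stream is passed to the cudaMemcpyAsync call, you can insert an event into that stream right after the asynchronous transfer and then call cudaEventSynchronize to wait on that event.
This guarantees that the copy has completed, without relying on the device being idle or the stream being empty.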
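The pattern described above can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the helper name `copyAndWait`, the `CHECK` macro, and the buffer size are assumptions made for the example, and only the runtime calls themselves (cudaMemcpyAsync, cudaEventRecord, cudaEventSynchronize) come from the notes.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative macro (an assumption of this sketch): abort if a runtime call fails.
#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err_ = (call);                               \
        if (err_ != cudaSuccess) {                               \
            fprintf(stderr, "%s failed: %s\n", #call,            \
                    cudaGetErrorString(err_));                   \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

// Hypothetical helper: enqueue a host-to-device copy on `stream`, record an
// event directly behind it, and block the host until that event has fired.
void copyAndWait(void *dst, const void *src, size_t bytes, cudaStream_t stream)
{
    cudaEvent_t copied;
    // Timing is not needed here, so disable it to keep the event lightweight.
    CHECK(cudaEventCreateWithFlags(&copied, cudaEventDisableTiming));
    CHECK(cudaMemcpyAsync(dst, src, bytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaEventRecord(copied, stream));
    // Waits for the copy only; other streams and later work are not involved.
    CHECK(cudaEventSynchronize(copied));
    CHECK(cudaEventDestroy(copied));
}

int main(void)
{
    const size_t n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_buf = NULL, *d_buf = NULL;
    cudaStream_t stream;

    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMallocHost(&h_buf, bytes));   // pinned memory, so the copy can be truly asynchronous
    CHECK(cudaMalloc(&d_buf, bytes));
    for (size_t i = 0; i < n; ++i) h_buf[i] = 1.0f;

    copyAndWait(d_buf, h_buf, bytes, stream);

    // The copy has finished reading h_buf, so the host may now reuse it.
    h_buf[0] = 2.0f;
    printf("copy complete, host buffer reused\n");

    CHECK(cudaFree(d_buf));
    CHECK(cudaFreeHost(h_buf));
    CHECK(cudaStreamDestroy(stream));
    return 0;
}
```

In real code the host would usually do other work between cudaEventRecord and cudaEventSynchronize; the two calls are adjacent here only to keep the sketch short.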
APIs
cudaMemcpyAsync
Obtaining the memory alignment value
Investigation of cudaMalloc alignment => aligned to at least 512 bytes
#include <cuda_runtime.h>
#include <stdio.h>

// Allocate two small buffers back to back and print the byte distance
// between them, which reflects the allocator's minimum spacing.
void test(size_t size)
{
    float *d1, *d2;
    cudaMalloc(&d1, size);
    cudaMalloc(&d2, size);
    // Compare as byte pointers; %td matches the ptrdiff_t result.
    printf("Alignment: %td\n", (char *)d2 - (char *)d1);
    cudaFree(d1);
    cudaFree(d2);
}

int main(int argc, char **argv)
{
    // set up device
    int dev = 0;
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, dev);
    printf("Using Device %d: %s\n", dev, deviceProp.name);
    cudaSetDevice(dev);

    test(1);
    return 0;
}
Result:
Using Device 0: GeForce GTX TITAN X
Alignment: 512
On AWS p2.xlarge (K80):
Using Device 0: Tesla K80
Alignment: 512
The CUDA C Best Practices Guide (https://www.classes.cs.uchicago.edu/archive/2011/winter/32102-1/reading/CUDA_C_Best_Practices_Guide.pdf) says:
cudaMalloc(), is guaranteed to be aligned to at least 256 bytes
In the tests above, however, allocations were spaced 512 bytes apart. This does not contradict the "at least 256 bytes" guarantee, but that copy of the guide is old and appears to describe earlier hardware.
ref. https://stackoverflow.com/questions/36534599/cuda-malloc-minimum-and-typical-actual-alignment
256 on Fermi, 512 on Kepler and Maxwell
So the observed value was 256 bytes on older GPUs such as Fermi, and is 512 bytes on the Kepler and Maxwell GPUs tested here.
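The test above infers alignment from the distance between two allocations, which strictly speaking measures the allocator's spacing. A complementary check, sketched below, looks at each returned address directly and reports the largest power of two that divides it. The helper name `addressAlignment` and the list of sizes are assumptions made for this example. The printed values depend on the GPU and driver, so none are shown.

```cuda
#include <cuda_runtime.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical helper: largest power of two that divides the address.
// Returns 0 for a null pointer.
static size_t addressAlignment(const void *p)
{
    uintptr_t a = (uintptr_t)p;
    return (size_t)(a & (~a + 1));   // isolates the lowest set bit
}

int main(void)
{
    const size_t sizes[] = {1, 100, 513, 4096};   // arbitrary sizes for illustration
    const int count = sizeof(sizes) / sizeof(sizes[0]);
    void *ptrs[4] = {NULL, NULL, NULL, NULL};

    for (int i = 0; i < count; ++i) {
        cudaError_t err = cudaMalloc(&ptrs[i], sizes[i]);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        printf("size %zu -> %p, address alignment %zu\n",
               sizes[i], ptrs[i], addressAlignment(ptrs[i]));
    }

    for (int i = 0; i < count; ++i) cudaFree(ptrs[i]);
    return 0;
}
```

An individual address can happen to be aligned more strictly than the allocator guarantees, so the smallest value seen across many allocations is the more meaningful number.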
References
[1] CUDA_C_and_CPP_-Streams_and_Concurrency-Steve_Rennich-_NVIDIA.pdf