CUDA:SharedMemory

Configuring the amount of shared memory

[Recommended] Using Shared Memory in CUDA C/C++ ¹
- Google translate (en -> ko): Using_Shared_Memory_in_CUDA_C_Cpp_-_ko.pdf

On devices of compute capability 2.x and 3.x, each multiprocessor has 64KB of on-chip memory that can be partitioned between L1 cache and shared memory. For devices of compute capability 2.x, there are two settings: 48KB shared memory / 16KB L1 cache, and 16KB shared memory / 48KB L1 cache. By default the 48KB shared memory setting is used. The split can be configured from the host through the runtime API, either for all kernels using cudaDeviceSetCacheConfig() or on a per-kernel basis using cudaFuncSetCacheConfig(). These accept one of three options: cudaFuncCachePreferNone, cudaFuncCachePreferShared, and cudaFuncCachePreferL1. The driver will honor the specified preference except when a kernel requires more shared memory per thread block than is available in the specified configuration. Devices of compute capability 3.x allow a third setting of 32KB shared memory / 32KB L1 cache, which can be obtained using the option cudaFuncCachePreferEqual.
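The splits described above can be written down as a small lookup. The following is only a host-side C++ sketch of that description, not the CUDA runtime API: the names CachePref, Split, splitFor and preferenceFits are hypothetical, and the check simply restates the rule that a preference cannot be honored when a kernel needs more shared memory per thread block than the chosen split provides.

```cpp
#include <cassert>

// Hypothetical model of the 64KB on-chip split; these names only mirror the
// CUDA options and are NOT part of the CUDA runtime API.
enum class CachePref { PreferShared, PreferL1, PreferEqual };

struct Split {
    int sharedKB;  // shared memory size in KB
    int l1KB;      // L1 cache size in KB
};

// Returns the shared memory / L1 split that each preference asks for.
Split splitFor(CachePref pref) {
    switch (pref) {
        case CachePref::PreferShared: return {48, 16};
        case CachePref::PreferL1:     return {16, 48};
        case CachePref::PreferEqual:  return {32, 32};  // compute capability 3.x only
    }
    return {48, 16};  // not reached; mirrors the default setting
}

// True when a kernel that needs bytesPerBlock bytes of shared memory per
// thread block fits into the preferred split, so the preference can be honored.
bool preferenceFits(CachePref pref, int bytesPerBlock) {
    assert(bytesPerBlock >= 0);
    const Split s = splitFor(pref);
    assert(s.sharedKB + s.l1KB == 64);  // the two parts always share 64KB
    return bytesPerBlock <= s.sharedKB * 1024;
}
```

For example, preferenceFits(CachePref::PreferL1, 20 * 1024) is false, because 20KB per block does not fit into the 16KB shared memory setting. In an actual CUDA program the preference itself would be requested with a call such as cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1), where myKernel stands for a hypothetical kernel.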

Bank Conflict

One of the difficulties in SIMT execution is memory access. Because the GPU processes many data elements at once, the threads in a block are allowed to access shared memory simultaneously. To make this possible, the GPU divides shared memory into a fixed number of memory banks, and different banks can be accessed at the same time. A bank conflict is the problem that occurs when the access pattern of a program makes different threads access the same bank (at different addresses) at the same time. Bank conflicts arise only on accesses to the GPU's shared memory.

For example, devices of compute capability 1.x have 16 memory banks and devices of compute capability 2.x have 32. This means that 16 or 32 banks, respectively, can be accessed at the same time (in other words, a single read can fetch up to 32 words at once instead of one word at a time).

If different threads access the same bank, the accesses to that bank are processed one after another and the parallelism is lost, so you have to check whether an access pattern causes bank conflicts.

Details

[Recommended] Using Shared Memory in CUDA C/C++ ²
- Google translate (en -> ko): Using_Shared_Memory_in_CUDA_C_Cpp_-_ko.pdf

To achieve high memory bandwidth for concurrent accesses, shared memory is divided into equally sized memory modules (banks) that can be accessed simultaneously. Therefore, any memory load or store of n addresses that spans b distinct memory banks can be serviced simultaneously, yielding an effective bandwidth that is b times as high as the bandwidth of a single bank.

However, if multiple threads’ requested addresses map to the same memory bank, the accesses are serialized. The hardware splits a conflicting memory request into as many separate conflict-free requests as necessary, decreasing the effective bandwidth by a factor equal to the number of colliding memory requests. An exception is the case where all threads in a warp address the same shared memory address, resulting in a broadcast. Devices of compute capability 2.0 and higher have the additional ability to multicast shared memory accesses, meaning that multiple accesses to the same location by any number of threads within a warp are served simultaneously.
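The serialization rule can be illustrated with a small host-side model. This is a minimal C++ sketch, not CUDA code: conflictDegree and stridedWarp are hypothetical helper names, the model assumes that successive 32-bit words map to successive banks, and it treats every repeated access to the same word as free (the multicast behavior of compute capability 2.0 and higher). The returned number is the count of conflict-free requests the hardware would need for one warp request.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Returns the largest number of DISTINCT 32-bit words that fall into the same
// bank. Threads that read the same word are counted once (broadcast/multicast).
// wordIndex[i] is the index of the 32-bit word accessed by thread i.
int conflictDegree(const std::vector<int>& wordIndex, int numBanks) {
    assert(numBanks > 0);
    std::map<int, std::set<int>> wordsPerBank;  // bank -> distinct words hit
    for (int w : wordIndex) {
        assert(w >= 0);
        wordsPerBank[w % numBanks].insert(w);
    }
    std::size_t worst = 0;
    for (const auto& entry : wordsPerBank) {
        worst = std::max(worst, entry.second.size());
    }
    return static_cast<int>(worst);
}

// Builds the pattern "thread t accesses word t * stride" for a 32-thread warp.
std::vector<int> stridedWarp(int stride) {
    std::vector<int> words(32);
    for (int t = 0; t < 32; ++t) {
        words[t] = t * stride;
    }
    return words;
}
```

With 32 banks, conflictDegree(stridedWarp(1), 32) is 1 (no conflict), stride 2 gives 2 (a two-way conflict), stride 32 gives 32 (every thread hits bank 0 at a different word), and stride 0 gives 1 because all threads read the same word.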

To minimize bank conflicts, it is important to understand how memory addresses map to memory banks. Shared memory banks are organized such that successive 32-bit words are assigned to successive banks and the bandwidth is 32 bits per bank per clock cycle. For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16. A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads.

For devices of compute capability 2.0, the warp size is 32 threads and the number of banks is also 32. A shared memory request for a warp is not split as with devices of compute capability 1.x, meaning that bank conflicts can occur between threads in the first half of a warp and threads in the second half of the same warp.
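The difference between the two generations can be made concrete with a host-side model. The sketch below is plain C++ under stated assumptions, not CUDA code: all function names are hypothetical, the 1.x model services each half warp separately against 16 banks, and the 2.x model services the whole warp against 32 banks. The chosen pattern uses only distinct words, so broadcast does not play a role.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Largest number of distinct 32-bit words that map to one bank.
int conflictDegree(const std::vector<int>& wordIndex, int numBanks) {
    assert(numBanks > 0);
    std::map<int, std::set<int>> wordsPerBank;
    for (int w : wordIndex) {
        wordsPerBank[w % numBanks].insert(w);
    }
    std::size_t worst = 0;
    for (const auto& entry : wordsPerBank) {
        worst = std::max(worst, entry.second.size());
    }
    return static_cast<int>(worst);
}

// Compute capability 1.x model: 16 banks, one request per half warp.
int conflictDegreeCc1x(const std::vector<int>& warpWords) {
    assert(warpWords.size() == 32);
    const std::vector<int> firstHalf(warpWords.begin(), warpWords.begin() + 16);
    const std::vector<int> secondHalf(warpWords.begin() + 16, warpWords.end());
    return std::max(conflictDegree(firstHalf, 16), conflictDegree(secondHalf, 16));
}

// Compute capability 2.x model: 32 banks, one request for the whole warp.
int conflictDegreeCc2x(const std::vector<int>& warpWords) {
    assert(warpWords.size() == 32);
    return conflictDegree(warpWords, 32);
}

// Threads 0..15 access words 0..15, threads 16..31 access words 32..47.
std::vector<int> crossHalfPattern() {
    std::vector<int> words(32);
    for (int t = 0; t < 32; ++t) {
        words[t] = (t < 16) ? t : t + 16;
    }
    return words;
}
```

For crossHalfPattern(), conflictDegreeCc1x returns 1 because each half warp touches 16 different banks, while conflictDegreeCc2x returns 2 because thread t and thread t + 16 hit the same bank at different words.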

Devices of compute capability 3.x have configurable bank size, which can be set using cudaDeviceSetSharedMemConfig() to either four bytes (cudaSharedMemBankSizeFourByte, the default) or eight bytes (cudaSharedMemBankSizeEightByte). Setting the bank size to eight bytes can help avoid shared memory bank conflicts when accessing double precision data.
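To see why the bank size matters for double precision data, it helps to compute which banks an 8-byte element occupies. The following host-side C++ sketch only does that arithmetic; banksTouched and doubleOffset are hypothetical names, 32 banks are assumed, and the exact conflict rules of real compute capability 3.x hardware are more detailed than this simplified mapping.

```cpp
#include <cassert>
#include <set>

// Returns the banks covered by an access of accessBytes bytes starting at
// byteOffset, when each bank is bankBytes wide and banks repeat cyclically.
std::set<int> banksTouched(int byteOffset, int accessBytes, int bankBytes,
                           int numBanks = 32) {
    assert(byteOffset >= 0 && accessBytes > 0 && bankBytes > 0 && numBanks > 0);
    std::set<int> banks;
    const int firstUnit = byteOffset / bankBytes;
    const int lastUnit = (byteOffset + accessBytes - 1) / bankBytes;
    for (int u = firstUnit; u <= lastUnit; ++u) {
        banks.insert(u % numBanks);
    }
    return banks;
}

// Byte offset of element i in a shared array of double (8 bytes each).
int doubleOffset(int i) {
    assert(i >= 0);
    return i * 8;
}
```

With four-byte banks, element 0 and element 16 of a double array both cover banks 0 and 1, so threads 0 and 16 of a warp compete for the same banks in this model. With eight-byte banks, element 0 lies entirely in bank 0 and element 16 in bank 16, so the 32 consecutive doubles read by a warp each land in their own bank.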