CUDA

CUDA ("Compute Unified Device Architecture", 쿠다)는 그래픽 처리 장치(GPU)에서 수행하는 (병렬 처리) 알고리즘을 C 프로그래밍 언어를 비롯한 산업 표준 언어를 사용하여 작성할 수 있도록 하는 GPGPU 기술이다. CUDA는 엔비디아가 개발해오고 있으며 이 아키텍처를 사용하려면 엔비디아 GPU와 특별한 스트림 처리 드라이버가 필요하다. CUDA는 G8X GPU로 구성된 지포스 8 시리즈급 이상에서 동작한다. CUDA는 CUDA GPU 안의 명령셋과 대용량 병렬 처리 메모리를 접근할 수 있도록 해 준다.

Version 확인 방법

nvidia-smi 를 사용한 방법:

nvidia-smi -q | grep "CUDA Version" | awk -F: '{print $2}' | tr -d ' '

nvcc 를 사용한 방법:

nvcc --version

Install

Meta Packages

CUDA Installation Guide for Linux # Meta Packages

Meta packages are RPM/Deb/Conda packages which contain no (or few) files but have multiple dependencies. They are used to install many CUDA packages when you may not know the details of the packages you want. The following table lists the meta packages.

Meta Package	Purpose
cuda	Installs all CUDA Toolkit and Driver packages. Handles upgrading to the next version of the cuda package when it’s released.
cuda-12-6	Installs all CUDA Toolkit and Driver packages. Remains at version 12.5 until an additional version of CUDA is installed.
cuda-toolkit-12-6	Installs all CUDA Toolkit packages required to develop CUDA applications. Does not include the driver.
cuda-toolkit-16	Installs all CUDA Toolkit packages required to develop applications. Will not upgrade beyond the 12.x series toolkits. Does not include the driver.
cuda-toolkit	Installs all CUDA Toolkit packages required to develop applications. Handles upgrading to the next 12.x version of CUDA when it’s released. Does not include the driver.
cuda-tools-12-6	Installs all CUDA command line and visual tools.
cuda-runtime-12-6	Installs all CUDA Toolkit packages required to run CUDA applications, as well as the Driver packages.
cuda-compiler-12-6	Installs all CUDA compiler packages.
cuda-libraries-12-6	Installs all runtime CUDA Library packages.
cuda-libraries-dev-12-6	Installs all development CUDA Library packages.

Kernel

CPU가 GPU에게 큰 load의 작업을 병렬수행하도록 시키는 코드, 이것을 만드는 것이 바로 쿠다 프로그래밍이고 gpgpu이다.

Difference between host and device

Host: CPU
Device: GPU

Kind of function type

함수 종류 지정. 함수의 이름 앞에 다음을 선언하여 실행 방식을 결정한다.

__global__: cpu(host)가 요청하여, gpu(device)에서 실행하는 함수. 쿠다 프로그래밍에서 가장 많이 쓰이는 핵심 함수이다.
__device__: gpu가 요청하여, gpu에서 실행하는 함수
__host__: cpu가 요청하여, cpu가 실행. 일반적인 C/C++함수로 생략 가능하다.

keywords

Throughput: 간단하게 정리해서 말하면, throughput이란 multiprocessor의 clock cycle당 처리할 수 있는 operation의 수를 의미한다.

warps: 몇개의 병렬 스레드들의 실행 그룹을 이야기하는 것 같다.(warp는 weaving에서 기원한다.); The multiprocessor SIMT unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.

SIMT: Single instruction Multiple-thread의 약어로 하나의 instruction을 여러개의 thread에서 동시에 돌린다는 의미인 것 같다.; 즉, 하나의 프로그램을 여러개의 thread에서 동시에 돌린다는 의미로 생각하면 될 듯. CUDA는 SIMT모델인 듯.; warp의 크기가 32이므로, 하나의 instruction은 한 warp안에서는 32개의 operation으로 구성된다.; (SIMT모델이고, 한 warp의 크기가 32라는 것은, 32개의 thread가 있다는 이야기이므로, 하나의 instruction은 32개의 thread에서 32번 동시에 실행되게 된다. 그것이 32 operations의 의미)

Bank conflict: SIMT를 실행시킬때 문제중의 하나가 memory access이다. GPU에서는 동시에 여러개의 데이터를 처리해야하기 때문에, 동시에 여러개의 데이터에 access를 허용한다. 이것을 하기 위해서 GPU는 shared memory를 각 warp마다 일정 갯수의 memory bank로 나누어 두었는데, 각각의 bank는 bank단위로 동시에 접근할 수 있다. 이때 bank conflict란 프로그래밍 잘못으로 동시에 서로 다른 thread가 특정 bank를 access할때 발생하는 문제이다.

Execution Configuration

kernel block 과 thread

<<<>>>의 역할은 __global__ 함수를 실행시킨 후 블록과 쓰레드로 나누어 병렬로 실행시키는 역할을 한다.

VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

kernel 에서 blocksPerGrid는 Grid당 Block의 갯수
threadsPerBlock는 Block당 thread의 갯수를 나타낸다. 이 값은 디바이스 속성 구조체의 maxThreadsPerBlock멤버의 값을 초과하면 안된다.

Gencode

nvcc 사용시 -gencode 옵션은 -arch 옵션과 -code 옵셥을 합친 것이다.

-arch: -arch=compute_20과 같이 사용; 생성될 PTX 코드의 버전을 지정 (compute_XX는 Virtual architecture를 의미함); PTX 코드는 GPU 드라이버에서 JIT (Just-In-Time) 컴파일을 통해 실행이 가능함; SDK가 필요 없다는 장점이 있음

-code: -code=sm_20와 같이 사용; 생성될 binary 코드(SASS)의 버전을 지정 (sm_XX는 real architecture를 의미함); binary 코드는 별도의 JIT 컴파일 과정 없이 지정된 architecture에서 곧바로 실행이 가능; 컴파일 시간이 짧다는 이점이 있음

-gencode: -gencode arch=compute_20,code=sm_20과 같이 사용; 한번에 여러 -gencode 옵션을 줄 때의 이점은 여러 버전의 PTX, binary 코드를 미리 생성해두어 상위 아키텍쳐에 대한 '호환성'을 높일 수 있다는 것임.; 대신 옵션 수에 비례하여 실행 코드의 크기가 커진다는 단점도 있음.

Select CUDA Device

Stackoverflow: How to choose device when running a CUDA executable?

The canonical way to select a device in the runtime API is using cudaSetDevice function.

Compute modes

Programming Guide :: CUDA Toolkit Documentation - 3.4. Compute Modes

On Tesla solutions running Windows Server 2008 and later or Linux, one can set any device in a system in one of the three following modes using NVIDIA's System Management Interface (nvidia-smi), which is a tool distributed as part of the driver:

Default compute mode: Multiple host threads can use the device (by calling cudaSetDevice() on this device, when using the runtime API, or by making current a context associated to the device, when using the driver API) at the same time.
Exclusive-process compute mode: Only one CUDA context may be created on the device across all processes in the system and that context may be current to as many threads as desired within the process that created that context.
Exclusive-process-and-thread compute mode: Only one CUDA context may be created on the device across all processes in the system and that context may only be current to one thread at a time.
Prohibited compute mode: No CUDA context can be created on the device.

cudaComputeMode

CUDA device compute modes Enumerator:

cudaComputeModeDefault: Default compute mode (Multiple threads can use cudaSetDevice() with this device)
cudaComputeModeExclusive: Compute-exclusive mode (Only one thread will be able to use cudaSetDevice() with this device)
cudaComputeModeProhibited: Compute-prohibited mode (No threads can use cudaSetDevice() with this device)

Running more than one CUDA applications on one GPU

[추천] Stackoverflow - Running more than one CUDA applications on one GPU

CUDA activity from independent host processes will normally create independent CUDA contexts, one for each process. Thus, the CUDA activity launched from separate host processes will take place in separate CUDA contexts, on the same device.

CUDA activity in separate contexts will be serialized. The GPU will execute the activity from one process, and when that activity is idle, it can and will context-switch to another context to complete the CUDA activity launched from the other process. The detailed inter-context scheduling behavior is not specified. (Running multiple contexts on a single GPU also cannot normally violate basic GPU limits, such as memory availability for device allocations.)

The "exception" to this case (serialization of GPU activity from independent host processes) would be the CUDA Multi-Process Server. In a nutshell, the MPS acts as a "funnel" to collect CUDA activity emanating from several host processes, and run that activity as if it emanated from a single host process. The principal benefit is to avoid the serialization of kernels which might otherwise be able to run concurrently. The canonical use-case would be for launching multiple MPI ranks that all intend to use a single GPU resource.

Note that the above description applies to GPUs which are in the "Default" compute mode. GPUs in "Exclusive Process" or "Exclusive Thread" compute modes will reject any attempts to create more than one process/context on a single device. In one of these modes, attempts by other processes to use a device already in use will result in a CUDA API reported failure. The compute mode is modifiable in some cases using the nvidia-smi utility.

built-in variable

CUDA 내장변수는 아래와 같다.

blockIdx

블록의 INDEX를 확인할 수 있다. 확인 방법은 아래와 같다.

__global__ void add(int * a, int * b, int * c) {
    int tid = blockIdx.x; // 이 인덱스의 데이터를 처리한다.
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}

threadIdx

스레드의 INDEX를 가리킨다.

__global__ void add(int * a, int * b, int * c) {
    int tid = threadIdx.x;
    if (tid < N) {
        c[tid] = a[tid] + b[tid];
    }
}
// ...
add<<<1, N>>>(deva, devb, devc);

dim3

커널의 실행 크기를 명시하는 데 사용될 3차원 요소들의 집합을 나타낸다.

#define DIM 100
// ...
dim3 grid(DIM, DIM); // 와 같이 하면 마지막, 3차원 인자는 1로 처리된다. 즉, 2차원 100x100 Block의 GRID가 만들어진다.
// ...
kernel<<<grid, 1>>>(a); // 와 같은 방식으로 grid를 전달하면 된다.

gridDim

모든 블록에 대하여 하나의 상수로 제공되며, 단순히 개시된 그리드의 Block 크기를 보유하고 있다.

__global__ void kernel(char * ptr) { // ptr이 2차원 배열이라 가정했을 경우...
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x; // 이와 같이 ptr의 offset이 계산된다.
    // ...
    ptr[offset] = 100;
}

blockDim

모든 블록에 대해 고정된 값이며, 블록의 각 차원에 대한 스레드 개수를 보관한다.

How to create library project

WDDM vs TCC mode

NVIDIA high-end GPUs (Tesla, Quadro, etc) can be configured to run in either Tesla Compute Cluster (TCC) mode or Windows Display Driver Model (WDDM) mode. The difference between the two is that in TCC mode, the cards dedicate themselves completely to compute and are not meant to have a local display. In WDDM mode, they act as both a compute card as well as a GPU for displaying local graphics.

You can force the cards into either mode by navigating to the default directory, c:\program files\NVIDIA corporation\nvsmi, and running nvidia-smi -g {card number} -dm {0 or 1}. Mode 0 is WDDM mode and mode 1 is TCC mode.

Python 에서 Binding 된 메모리 해제 방법

PyTorch

import torch

# CUDA 메모리 해제
torch.cuda.empty_cache()

nvidia-smi로 확인하면, 일부는 아직 남아있더라...

TensorFlow

import tensorflow as tf

# TensorFlow 세션 초기화 해제
tf.keras.backend.clear_session()

nvidia-smi로 확인해 봐야함. (현재 미확인)

Numba

from numba import cuda

# 현재 사용 중인 디바이스 해제
cuda.select_device(0)
cuda.close()

nvidia-smi로 확인해보니 잘 제거되어 있더라

PyCUDA

PyCUDA 항목 확인.

Compile cuda code for CPU

Project

GPU Ocelot

GPU Ocelot is an open-source dynamic JIT compilation framework for GPU compute applications targetinga range of GPU and non-GPU execution targets.

CUDA PTX 코드를 에뮬레이션 하는듯?

cuda-waste: https://code.google.com/archive/p/cuda-waste/; CUDA Waste is a wrapper for emulation of CUDA programs.

NVEmulate: https://developer.nvidia.com/nvemulate; NVemulate allows you to emulate the functionality of newer NVIDIA GPUs (sometimes very slowly) in software. In addition, you can use it to control behavior of the driver's OpenGL Shading Language (GLSL) implementation.

GPU-ACCELERATED LIBRARIES

NVIDIA CUDA ZONE: GPU-ACCELERATED LIBRARIES

Adding GPU-acceleration to your application can be as easy as simply calling a library function. Check out the extensive list of accelerated, high performance libraries available today.

MAJOR	AmgX, cuDNN, cuFFT, cuBLAS, cuBLAS-XT, NPP, CHOLMOD, CULA Tools, MAGMA, IMSL Fortran Numerical Library, cuSOLVER, cuSPARSE, ArrayFire, cuRAND, CUDA Math Library, Thrust, NVBIO, NVIDIA VIDEO CODEC SDK (NVENCODE, NVDECODE), HiPLAR, OpenCV, Geometry Performance Primatives(GPP), Paralution, Triton Ocean SDK
MINOR	CUVI, NCCL

Favorite site

NVIDIA Home > 기술소개 > CUDA 병렬 컴퓨팅 플랫폼
CUDA Download
[추천] Wikipedia (en) CUDA에 대한 설명 ²
[추천] NVIDIA GPU ARCHITECTURE & CUDA PROGRAMMING ENVIRONMENT ³
Code GPU with CUDA - SIMT

Online Tools

[추천] LeetGPU - Only platform to write and run CUDA code. Without a GPU. For Free. (온라인에서 무료로 CUDA 코드 돌리는 사이트)

How to install

Developer

Kernel Driver

NVIDIA, 오픈소스 Linux GPU 커널 모듈로 완전 전환 | GeekNews
- [원문] NVIDIA Transitions Fully Towards Open-Source GPU Kernel Modules | NVIDIA Technical Blog

Develop guide & issue

NVIDIA CUDA GETTING STARTED GUIDE FOR MICROSOFT WINDOWS
CUDA occupancy (최적화에 관련된 내용)
[추천] CUDA (용어 및 블록 최적화 등) ⁴
CUDA소개 (컴파일 워크플로우 등 CUDA 아키텍처에 대한 전반적인 소개)
CUDA example chap5 :: syncthread()를 써야하는 예 및 자문자답
cuda 내장변수
CUDA Programming: Thread와 Block의 개념, GPU 구조
CUDA - Thread Index 계산하기 :: 지니's Story
CUDA PROGRAMMING MODEL 프로그래밍 모델
CUDA Thread Execution Model
CUDA에서 grid 와 thread 의 갯수에 대한 짧은생각
JCUDA(CUDA) grid, block?
[추천] Threads and blocks and grids, oh my!
Stackoverflow: CUDA determining threads per block, blocks per grid
Shader Programming vs CUDA
I have two functions for wavelet transform. Could somebody help me to imlement them in CUDA?
Using Separate Compilation in CUDA (라이브러리 생성 방법 등)
Setting Up GPU Telemetry with NVIDIA Data Center GPU Manager
GTC 2020: CUDA C++ in Jupyter: Adding CUDA Runtime Support to Cling
Introducing Low-Level GPU Virtual Memory Management

Memory

Stream

yesdevelop.tistory.com - 스트림

Tip

OpenCL

GTC09 OpenCL Optimization - San Jose | 10/2/2009 | Peng Wang, NVIDIA ⁷

Beginners tutorial

[추천] Introduction to CUDA C
CUDA C/C++ Basics Supercomputing 2011 Tutorial
CUDA Thread Basics
Github - GPU Puzzles - 퍼즐 풀며 CUDA 배우기
[추천] 파이썬 개발자를 위한 CUDA 프로그래밍 입문 | GeekNews
- [원문] Introduction to CUDA Programming for Python Developers | PySpur - AI Agent Builder

Other libraries

cuDNN

Docker

haanjack - Docker를 활용한 deep learning 개발 환경 구축

References

NVIDIA_CUDA_Programming_Guide_2.0.pdf ↩
CUDA_-_Wikipedia.pdf (2019-10-17) ↩
NVIDIA_GPU_Architecture_and_CUDA_Programming_Environment_-_Alan_Tatourian.pdf ↩
Blog.naver.com_-sysganda-_CUDA_block_optimization.pdf ↩
NVIDIA_GPU_Memory_types_–_MKBlog.pdf ↩
Porting_CUDA_to_OpenCL_-_Documentation.pdf ↩
1068_GTC09.pdf ↩