CuPy

CuPy is a NumPy/SciPy-compatible array library for GPU-accelerated computing with Python.

CuPy acts as a drop-in replacement to run existing NumPy/SciPy code on NVIDIA CUDA or AMD ROCm platforms.

Install

Installation — CuPy 13.2.0 documentation

Requirements

NVIDIA CUDA GPU with the Compute Capability 3.0 or larger.

CUDA Toolkit: v11.2 / v11.3 / v11.4 / v11.5 / v11.6 / v11.7 / v11.8 / v12.0 / v12.1 / v12.2 / v12.3 / v12.4
- If you have multiple versions of CUDA Toolkit installed, CuPy will automatically choose one of the CUDA installations. See Working with Custom CUDA Installation for details.
- This requirement is optional if you install CuPy from conda-forge. However, you still need to have a compatible driver installed for your GPU. See Installing CuPy from Conda-Forge for details.
Python: v3.9 / v3.10 / v3.11 / v3.12

NOTE
Currently, CuPy is tested against Ubuntu 20.04 LTS / 22.04 LTS (x86_64), CentOS 7 / 8 (x86_64) and Windows Server 2016 (x86_64).

Python Dependencies

NumPy/SciPy-compatible API in CuPy v13 is based on NumPy 1.26 and SciPy 1.11, and has been tested against the following versions:

NumPy: v1.22 / v1.23 / v1.24 / v1.25 / v1.26 / v2.0
SciPy (optional): v1.7 / v1.8 / v1.9 / v1.10 / v1.11
- Required only when copying sparse matrices from GPU to CPU (see Sparse matrices (cupyx.scipy.sparse)).
Optuna (optional): v3.x
- Required only when using Automatic Kernel Parameters Optimizations (cupyx.optimizing).

NOTE
SciPy and Optuna are optional dependencies and will not be installed automatically.

NOTE
Before installing CuPy, we recommend upgrading setuptools and pip: `python -m pip install -U setuptools pip`

Additional CUDA Libraries

Part of the CUDA features in CuPy will be activated only when the corresponding libraries are installed.

cuTENSOR: v2.0
- The library to accelerate tensor operations. See Environment variables for the details.
NCCL: v2.16 / v2.17
- The library to perform collective multi-GPU / multi-node computations.
cuDNN: v8.8
- The library to accelerate deep neural network computations.
cuSPARSELt: v0.2.0
- The library to accelerate sparse matrix-matrix multiplication.

Installing CuPy from PyPI

Wheels (precompiled binary packages) are available for Linux and Windows. Package names are different depending on your CUDA Toolkit version.

CUDA	Command
v11.2 ~ 11.8 (x86_64 / aarch64)	`pip install cupy-cuda11x`
v12.x (x86_64 / aarch64)	`pip install cupy-cuda12x`
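The table above amounts to a simple mapping from CUDA Toolkit version to wheel name. The helper below is a hypothetical sketch (the name `cupy_wheel_for_cuda` is not part of CuPy) that encodes only the two rows listed here and expects a "major.minor" version string:

```python
def cupy_wheel_for_cuda(cuda_version: str) -> str:
    """Return the PyPI wheel name for a CUDA Toolkit version such as "11.8".

    Hypothetical helper: it only encodes the two rows of the table above.
    """
    major, minor = (int(part) for part in cuda_version.split(".")[:2])
    if major == 11 and 2 <= minor <= 8:
        return "cupy-cuda11x"
    if major == 12:
        return "cupy-cuda12x"
    raise ValueError(f"no wheel listed for CUDA {cuda_version}")


print(cupy_wheel_for_cuda("11.8"))  # cupy-cuda11x
print(cupy_wheel_for_cuda("12.4"))  # cupy-cuda12x
```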

To enable features provided by additional CUDA libraries (cuTENSOR / NCCL / cuDNN), you need to install them manually. If you installed CuPy via wheels and do not already have these libraries installed, you can use the installer command below to set them up:

$ python -m cupyx.tools.install_library --cuda 11.x --library cutensor

NOTE
Append `--pre -U -fhttps://pip.cupy.dev/pre` options to install pre-releases (e.g., `pip install cupy-cuda11x --pre -U -fhttps://pip.cupy.dev/pre`).

Building CuPy for ROCm From Source

Installation — CuPy 13.2.0 documentation # Building CuPy for ROCm From Source

To build CuPy from source, set the CUPY_INSTALL_USE_HIP, ROCM_HOME, and HCC_AMDGPU_TARGET environment variables. (HCC_AMDGPU_TARGET is the ISA name supported by your GPU. Run rocminfo and use the value displayed in Name: line (e.g., gfx900). You can specify a comma-separated list of ISAs if you have multiple GPUs of different architectures.)

$ export CUPY_INSTALL_USE_HIP=1
$ export ROCM_HOME=/opt/rocm
$ export HCC_AMDGPU_TARGET=gfx906
$ pip install cupy

NOTE
If you don’t specify the `HCC_AMDGPU_TARGET` environment variable, CuPy will be built for the GPU architectures available on the build host. This behavior is specific to ROCm builds; when building CuPy for NVIDIA CUDA, the build result is not affected by the host configuration.

Device Pointer

import cupy as cp

# Create a GPU array
array = cp.arange(1000000).reshape(1000, 1000)

# Get the device pointer
dev_ptr = array.data.ptr

# Print the device pointer
print(f'Device pointer: {dev_ptr}')

Checking Device Overlap

To check whether CUDA:Stream is supported, inspect the Device Overlap property:

cupy.cuda.runtime.getDeviceProperties(0)['deviceOverlap']

Asynchronous APIs

class cupy.cuda.MemoryAsync(size_t size, stream)
- Asynchronous memory allocation on a CUDA device.
- This class provides an RAII interface of the CUDA memory allocation.
- Parameters:
  - size (int) – Size of the memory allocation in bytes.
  - stream (Stream) – The stream on which the memory is allocated and freed.
cupy.cuda.runtime.memcpyAsync(intptr_t dst, intptr_t src, size_t size, int kind, intptr_t stream)

Any API that has a blocking parameter uses asynchronous functions under the hood:

cupy.ndarray
cupy.array
cupy.asarray
cupy.asnumpy

* blocking (bool) - If set to False, the copy runs asynchronously on the given stream (if provided) or on the current stream, and the user is responsible for ensuring the stream order.
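As a minimal sketch of a non-blocking copy, the block below assumes CuPy v13 or later (where cupy.asnumpy accepts blocking) and a usable GPU; without one it does nothing. The function name `copy_to_host_async` is made up for illustration.

```python
try:
    import cupy as cp
    HAS_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy missing, or no usable CUDA driver/device
    cp = None
    HAS_GPU = False


def copy_to_host_async(n=1_000_000):
    """Copy a device array to the host without blocking, then synchronize."""
    if not HAS_GPU:
        return None  # nothing to demonstrate without a GPU
    stream = cp.cuda.Stream(non_blocking=True)
    with stream:  # work below is enqueued on `stream`
        x_gpu = cp.arange(n, dtype=cp.float32)
        # blocking=False: the call may return before the copy has finished
        x_cpu = cp.asnumpy(x_gpu, blocking=False)
    # The caller is responsible for stream ordering: wait before reading x_cpu.
    stream.synchronize()
    return x_cpu


result = copy_to_host_async()
if result is not None:
    print(result[:3])
```

Reading x_cpu before stream.synchronize() returns would be a race, which is the "stream order" responsibility mentioned above.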

cudaGetDeviceCount

Gets the number of CUDA devices.

import cupy as cp

num_devices = cp.cuda.runtime.getDeviceCount()

print(f"Number of CUDA devices: {num_devices}")

Selecting a Device

with cp.cuda.Device(1):
    x_on_gpu1 = cp.array([1, 2, 3, 4, 5])
x_on_gpu0 = cp.array([1, 2, 3, 4, 5])

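If the array's device and the current device differ, CuPy functions try to establish Peer-to-Peer Memory Access (P2P) between the two devices so that the current device can read the array directly from the other device. P2P is available only when the topology permits it. If P2P is unavailable, such an attempt fails with a ValueError.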

The N-dimensional array (ndarray)

The N-dimensional array (ndarray) — CuPy 14.0.0a1 documentation

cupy.ndarray

cupy.array

cupy.asarray

cupy.asnumpy

cupy.get_array_module

cupyx.scipy.get_array_module

Data Transfer

Basics of CuPy — CuPy 13.2.0 documentation # Data Transfer

Moving Arrays to a Device

cupy.asarray() can be used to move a numpy.ndarray, a list, or any object that can be passed to numpy.array() to the current device:

import numpy as np
import cupy as cp

x_cpu = np.array([1, 2, 3])
x_gpu = cp.asarray(x_cpu)  # move the data to the current device.

cupy.asarray() can accept cupy.ndarray, which means we can transfer the array between devices with this function.

with cp.cuda.Device(0):
    x_gpu_0 = cp.array([1, 2, 3])  # create an array in GPU 0

with cp.cuda.Device(1):
    x_gpu_1 = cp.asarray(x_gpu_0)  # move the array to GPU 1

Note

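cupy.asarray() does not copy the input array if possible. So if you pass an array that is already on the current device, the input object itself is returned.

If you do need a copy of the array in this situation, pass copy=True to cupy.array(). In fact, cupy.asarray() is equivalent to cupy.array(arr, dtype, copy=False).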
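The difference between returning the input object and making a copy can be checked directly. The sketch below falls back to NumPy when CuPy or a GPU is unavailable (an assumption made only so the snippet runs anywhere); the asarray/array behavior shown is the same in both libraries.

```python
import numpy as np

try:
    import cupy as xp
    xp.zeros(1)  # fails early if there is no usable GPU
except Exception:
    xp = np  # CPU fallback with the same asarray/array semantics

x = xp.array([1, 2, 3])
same = xp.asarray(x)             # no copy: the input object is returned
copied = xp.array(x, copy=True)  # explicit copy

print(same is x)    # True
print(copied is x)  # False

copied[0] = 99      # modifying the copy leaves the original untouched
print(int(x[0]))    # 1
```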

Moving Arrays from a Device to the Host

Moving a device array to the host can be done by cupy.asnumpy() as follows:

x_gpu = cp.array([1, 2, 3])  # create an array in the current device

x_cpu = cp.asnumpy(x_gpu)  # move the array to the host.

We can also use cupy.ndarray.get():

x_cpu = x_gpu.get()

Note

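If you use Chainer, you can use to_cpu() and to_gpu() to move arrays back and forth between a device and the host, or between different devices.

to_gpu() has a device option that specifies the device the array is transferred to.

Translator's note: Chainer is a framework that supports dynamic computational graphs; CuPy was split off from Chainer.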

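Writing CPU/GPU-Agnostic Code

Because CuPy is compatible with NumPy, you can write code that works on both CPU and GPU. The cupy.get_array_module() function makes this easy: it returns the numpy or cupy module based on its arguments. A CPU/GPU-agnostic function is defined and used as follows.

# Stable implementation of log(1 + exp(x))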
def softplus(x):
    xp = cp.get_array_module(x)
    return xp.maximum(0, x) + xp.log1p(xp.exp(-abs(x)))
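As a usage sketch, the same softplus can be called with a NumPy array and, when a GPU is present, with a CuPy array. The fallback `get_array_module` defined in the except branch is a stand-in introduced here so the snippet also runs on machines without CuPy; it is not part of either library.

```python
import numpy as np

try:
    import cupy as cp
    cp.zeros(1)  # fails early if there is no usable GPU
    get_array_module = cp.get_array_module
except Exception:
    cp = None

    def get_array_module(*args):
        # Stand-in used only when CuPy is unavailable: everything is NumPy.
        return np


def softplus(x):
    xp = get_array_module(x)
    return xp.maximum(0, x) + xp.log1p(xp.exp(-abs(x)))


x_cpu = np.array([-1000.0, 0.0, 1000.0])
print(softplus(x_cpu))  # computed with NumPy, no overflow at +/-1000

if cp is not None:
    print(softplus(cp.asarray(x_cpu)))  # same code, computed on the GPU
```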

Sometimes you may need to convert explicitly to a host or device array.

cupy.asarray() and cupy.asnumpy() can be used in agnostic implementations to convert between CuPy and NumPy arrays.

x_cpu = np.array([1, 2, 3])
y_cpu = np.array([4, 5, 6])
x_cpu + y_cpu  # array([5, 7, 9])
x_gpu = cp.asarray(x_cpu)
x_gpu + y_cpu
"""
Traceback (most recent call last):
...
TypeError: Unsupported type <class 'numpy.ndarray'>
"""

cp.asnumpy(x_gpu) + y_cpu  # array([5, 7, 9])
cp.asnumpy(x_gpu) + cp.asnumpy(y_cpu)  # array([5, 7, 9])
x_gpu + cp.asarray(y_cpu)  # array([5, 7, 9])
cp.asarray(x_gpu) + cp.asarray(y_cpu)  # array([5, 7, 9])

Allocating a CUDA MemoryPointer

import cupy as cp
import numpy as np

size = 1024  # number of elements in the array
memory = cp.cuda.alloc(size * cp.float32().itemsize)
array = cp.ndarray((size,), dtype=cp.float32, memptr=memory)

Wrapping a Raw CUDA Device Pointer (GPU) Directly

Convert CUDA Vectors or Device ptr to cupy arrays? · Issue #3202 · cupy/cupy

mem = cp.cuda.UnownedMemory(ptr, size, owning_obj)
mem_ptr = cp.cuda.MemoryPointer(mem, 0)
arr = cp.ndarray(..., memptr= mem_ptr, ...)  # make sure you interpret the array shape/dtype/strides correctly
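To make the three lines above concrete, the sketch below wraps a pointer whose shape and dtype are known. It assumes a usable GPU (otherwise it does nothing) and uses another CuPy array as a stand-in for memory allocated elsewhere; the function name `wrap_raw_pointer` is made up for illustration.

```python
try:
    import cupy as cp
    HAS_GPU = cp.cuda.runtime.getDeviceCount() > 0
except Exception:  # CuPy missing, or no usable CUDA driver/device
    cp = None
    HAS_GPU = False


def wrap_raw_pointer():
    if not HAS_GPU:
        return None
    # Stand-in for memory allocated elsewhere (e.g. by another CUDA library).
    owner = cp.arange(8, dtype=cp.float32)
    ptr, nbytes = owner.data.ptr, owner.nbytes

    # Passing `owner` keeps the original allocation alive while it is wrapped.
    mem = cp.cuda.UnownedMemory(ptr, nbytes, owner)
    memptr = cp.cuda.MemoryPointer(mem, 0)
    view = cp.ndarray((8,), dtype=cp.float32, memptr=memptr)

    view[0] = 100.0  # writes through to the same device memory
    return float(owner[0])


print(wrap_raw_pointer())
```

With a GPU available, the returned value is 100.0, since view and owner share the same device memory.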

Pinned Memory

Mapped memory functionality (zero-copy) · Issue #3452 · cupy/cupy

Synchronizing with Pinned Memory

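import cupy as cp
import numpy as np


def pinned_array(array):
    # Allocate pinned (page-locked) host memory and expose it as a NumPy array
    mem = cp.cuda.alloc_pinned_memory(array.nbytes)
    src = np.frombuffer(mem, array.dtype, array.size).reshape(array.shape)
    src[...] = array  # copy the data into the pinned buffer
    return src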

a_cpu = np.ones((10000, 10000), dtype=np.float32)
b_cpu = np.ones((10000, 10000), dtype=np.float32)
# np.ndarray with pinned memory
a_cpu = pinned_array(a_cpu)
b_cpu = pinned_array(b_cpu)

a_stream = cp.cuda.Stream(non_blocking=True)
b_stream = cp.cuda.Stream(non_blocking=True)

a_gpu = cp.empty_like(a_cpu)
b_gpu = cp.empty_like(b_cpu)

a_gpu.set(a_cpu, stream=a_stream)
b_gpu.set(b_cpu, stream=b_stream)

# wait until a_cpu is copied in a_gpu
a_stream.synchronize()
# This line runs parallel to b_gpu.set()
a_gpu *= 2

Why the Allocated Size Does Not Exactly Match the Requested Size

Pinned memory allocation returns odd size · Issue #3625 · cupy/cupy

Because the size is rounded up to fit the memory alignment:

# Round up the memory size to fit memory alignment of cudaHostAlloc 
unit = self._allocation_unit_size 
size = internal.clp2(((size + unit - 1) // unit) * unit)
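The effect of this rounding can be reproduced in plain Python. In the sketch below, `clp2` is a hypothetical stand-in for internal.clp2 (assumed here to return the smallest power of two not less than its argument), and the unit of 512 bytes is an assumption rather than a value taken from the snippet above.

```python
def clp2(n: int) -> int:
    """Smallest power of two >= n (hypothetical stand-in for internal.clp2)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def pinned_alloc_size(size: int, unit: int = 512) -> int:
    """Round up to a multiple of `unit`, then to a power of two."""
    rounded = ((size + unit - 1) // unit) * unit
    return clp2(rounded)


print(pinned_alloc_size(400))               # 512
print(pinned_alloc_size(1025))              # 2048 (1536 rounded up to 2**11)
print(pinned_alloc_size(10000 * 10000 * 4)) # 536870912 (2**29)
```

Under these assumptions, a 400 MB pinned request ends up occupying 512 MiB, which explains why reported sizes can be much larger than the array itself.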

The Memory Alignment Code

# cudaMalloc() is aligned to at least 512 bytes
# cf. https://gist.github.com/sonots/41daaa6432b1c8b27ef782cd14064269
DEF ALLOCATION_UNIT_SIZE = 512
# for test
_allocation_unit_size = ALLOCATION_UNIT_SIZE


cpdef size_t _round_size(size_t size):
    """Rounds up the memory size to fit memory alignment of cudaMalloc."""
    # avoid 0 div checking
    size = (size + ALLOCATION_UNIT_SIZE - 1) // ALLOCATION_UNIT_SIZE
    return size * ALLOCATION_UNIT_SIZE

cpdef size_t _bin_index_from_size(size_t size):
    """Returns appropriate bins index from the memory size."""
    # avoid 0 div checking
    return (size - 1) // ALLOCATION_UNIT_SIZE
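The two Cython functions above are plain integer arithmetic, so they can be mirrored in Python to see which sizes share a bin. This is a re-implementation for illustration only (the names drop the leading underscore), not the code CuPy runs.

```python
ALLOCATION_UNIT_SIZE = 512


def round_size(size: int) -> int:
    """Round up to the next multiple of ALLOCATION_UNIT_SIZE."""
    return ((size + ALLOCATION_UNIT_SIZE - 1) // ALLOCATION_UNIT_SIZE) * ALLOCATION_UNIT_SIZE


def bin_index_from_size(size: int) -> int:
    """Bin index for a (positive) size."""
    return (size - 1) // ALLOCATION_UNIT_SIZE


for size in (1, 400, 512, 513, 4096):
    print(size, round_size(size), bin_index_from_size(size))
# 1 512 0
# 400 512 0
# 512 512 0
# 513 1024 1
# 4096 4096 7
```

The 400-byte case matches the Memory Pool example in this page, where a 400-byte array reports 512 used bytes.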

Benchmarking

Performance Best Practices — CuPy 13.2.0 documentation

Use cupyx.profiler.benchmark. It runs a few warm-up iterations to reduce timing fluctuation and to exclude the overhead of first invocations.

Synopsis:

cupyx.profiler.benchmark(func, args=(), kwargs={}, n_repeat=10000, *, name=None, n_warmup=10, max_duration=inf, devices=None)

Usage example:

import cupy as cp
from cupyx.profiler import benchmark

def my_func(a):
    return cp.sqrt(cp.sum(a**2, axis=-1))

a = cp.random.random((256, 1024))
print(benchmark(my_func, (a,), n_repeat=20))

The output looks like this:

my_func             :    CPU:   44.407 us   +/- 2.428 (min:   42.516 / max:   53.098) us     GPU-0:  181.565 us   +/- 1.853 (min:  180.288 / max:  188.608) us

Example of implementing the measurement by hand:

import time
start_gpu = cp.cuda.Event()
end_gpu = cp.cuda.Event()

start_gpu.record()
start_cpu = time.perf_counter()
out = my_func(a)
end_cpu = time.perf_counter()
end_gpu.record()
end_gpu.synchronize()
t_gpu = cp.cuda.get_elapsed_time(start_gpu, end_gpu)
t_cpu = end_cpu - start_cpu

Basics

[Recommended] CuPy manual : Naver Blog [^0]

Low-level CUDA support

Low-level CUDA support — CuPy 13.2.0 documentation

cupy.cuda.Event

cupy.cuda.Event — CuPy 13.2.0 documentation

A CUDA event, which is a synchronization point of CUDA streams.

This class handles the CUDA event handle in an RAII way: when an Event instance is destroyed by the GC, its handle is destroyed as well.

class cupy.cuda.Event(block=False, disable_timing=False, interprocess=False)
- Parameters:
  - block (bool) – If True, the event blocks on the .synchronize() method.
  - disable_timing (bool) – If True, the event does not prepare timing data.
  - interprocess (bool) – If True, the event can be passed to other processes.
- .synchronize() Method
  - Synchronizes all device work to the event.
  - If the event was created as a blocking event, it also blocks the CPU thread until the event is done.

IPC

cupy.cuda.runtime.ipcGetMemHandle(intptr_t devPtr)
cupy.cuda.runtime.ipcOpenMemHandle(bytes handle, unsigned int flags=cudaIpcMemLazyEnablePeerAccess)
cupy.cuda.runtime.ipcCloseMemHandle(intptr_t devPtr)
cupy.cuda.runtime.ipcGetEventHandle(intptr_t event)
cupy.cuda.runtime.ipcOpenEventHandle(bytes handle)

See the CuPy:IPC entry for an implementation example.

Memory Pool

Memory Management — CuPy 13.2.0 documentation

import cupy
import numpy

mempool = cupy.get_default_memory_pool()
pinned_mempool = cupy.get_default_pinned_memory_pool()

# Create an array on CPU.
# NumPy allocates 400 bytes in CPU (not managed by CuPy memory pool).
a_cpu = numpy.ndarray(100, dtype=numpy.float32)
print(a_cpu.nbytes)                      # 400

# You can access statistics of these memory pools.
print(mempool.used_bytes())              # 0
print(mempool.total_bytes())             # 0
print(pinned_mempool.n_free_blocks())    # 0

# Transfer the array from CPU to GPU.
# This allocates 400 bytes from the device memory pool, and another 400
# bytes from the pinned memory pool.  The allocated pinned memory will be
# released just after the transfer is complete.  Note that the actual
# allocation size may be rounded to larger value than the requested size
# for performance.
a = cupy.array(a_cpu)
print(a.nbytes)                          # 400
print(mempool.used_bytes())              # 512
print(mempool.total_bytes())             # 512
print(pinned_mempool.n_free_blocks())    # 1

# When the array goes out of scope, the allocated device memory is released
# and kept in the pool for future reuse.
a = None  # (or `del a`)
print(mempool.used_bytes())              # 0
print(mempool.total_bytes())             # 512
print(pinned_mempool.n_free_blocks())    # 1

# You can clear the memory pool by calling `free_all_blocks`.
mempool.free_all_blocks()
pinned_mempool.free_all_blocks()
print(mempool.used_bytes())              # 0
print(mempool.total_bytes())             # 0
print(pinned_mempool.n_free_blocks())    # 0

Interoperability

Interoperability — CuPy 13.2.0 documentation

NumPy

cupy.ndarray implements the __array_ufunc__ interface.
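In practice this means a NumPy ufunc applied to a CuPy array is dispatched to CuPy and returns a CuPy array. The sketch below falls back to NumPy when CuPy or a GPU is unavailable (an assumption made only so the snippet runs anywhere); in both cases the result has the same type as the input.

```python
import numpy as np

try:
    import cupy as xp
    xp.zeros(1)  # fails early if there is no usable GPU
except Exception:
    xp = np  # CPU fallback

x = xp.arange(4, dtype=xp.float32)

# A NumPy ufunc call; for a CuPy array it is routed through __array_ufunc__.
y = np.sqrt(x)

print(type(y) is type(x))  # True
```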

Numba

Numba is a Python JIT compiler with NumPy support.

cupy.ndarray implements __cuda_array_interface__, the CUDA array interchange interface compatible with Numba v0.39.0 or later (see CUDA Array Interface for details).

This means you can pass CuPy arrays to kernels JIT-compiled with Numba. The following is a simple example taken from numba/numba#2860.

import cupy
from numba import cuda

@cuda.jit
def add(x, y, out):
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, x.shape[0], stride):
        out[i] = x[i] + y[i]

a = cupy.arange(10)
b = a * 2
out = cupy.zeros_like(a)

print(out)  # => [0 0 0 0 0 0 0 0 0 0]

add[1, 32](a, b, out)

print(out)  # => [ 0  3  6  9 12 15 18 21 24 27]

In addition, cupy.asarray() supports zero-copy conversion from a Numba CUDA array to a CuPy array.

import numpy
import numba
import cupy

x = numpy.arange(10)  # type: numpy.ndarray
x_numba = numba.cuda.to_device(x)  # type: numba.cuda.cudadrv.devicearray.DeviceNDArray
x_cupy = cupy.asarray(x_numba)  # type: cupy.ndarray

mpi4py

PyTorch

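PyTorch is a machine learning framework that provides high-performance, differentiable tensor operations.

PyTorch also supports __cuda_array_interface__, so zero-copy data exchange between CuPy and PyTorch can be achieved at no cost.

The only caveat is that PyTorch by default creates CPU tensors, which do not have the __cuda_array_interface__ property defined, so users must make sure the tensor is already on the GPU before exchanging.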

import cupy as cp
import torch

# convert a torch tensor to a cupy array
a = torch.rand((4, 4), device='cuda')
b = cp.asarray(a)
b *= b

print(b)
"""
array([[0.8215962 , 0.82399917, 0.65607935, 0.30354425],
       [0.422695  , 0.8367199 , 0.00208597, 0.18545236],
       [0.00226746, 0.46201342, 0.6833052 , 0.47549972],
       [0.5208748 , 0.6059282 , 0.1909013 , 0.5148635 ]], dtype=float32)
"""

print(a)
"""
tensor([[0.8216, 0.8240, 0.6561, 0.3035],
        [0.4227, 0.8367, 0.0021, 0.1855],
        [0.0023, 0.4620, 0.6833, 0.4755],
        [0.5209, 0.6059, 0.1909, 0.5149]], device='cuda:0')
"""

# check the underlying memory pointer is the same
assert a.__cuda_array_interface__['data'][0] == b.__cuda_array_interface__['data'][0]

# convert a cupy array to a torch tensor
a = cp.arange(10)
b = torch.as_tensor(a, device='cuda')
b += 3

print(b)
"""tensor([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12], device='cuda:0')"""

print(a)
"""array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12])"""

assert a.__cuda_array_interface__['data'][0] == b.__cuda_array_interface__['data'][0]

PyTorch also supports zero-copy data exchange through DLPack (see the DLPack section below).

import cupy
import torch

# Create a PyTorch tensor.
tx1 = torch.randn(1, 2, 3, 4).cuda()

# Convert it into a CuPy array.
cx = cupy.from_dlpack(tx1)

# Convert it back to a PyTorch tensor.
tx2 = torch.from_dlpack(cx)

Passing a torch.Tensor directly to cupy.from_dlpack() is supported only by the (new) DLPack data exchange protocol added in CuPy v10+ and PyTorch 1.10+.

For earlier versions, you need to wrap the Tensor with torch.utils.dlpack.to_dlpack() before passing it.

RMM

DLPack

DLPack is a specification of tensor structure for sharing tensors among frameworks.

CuPy supports importing from and exporting to the DLPack data structure (cupy.from_dlpack() and cupy.ndarray.toDlpack()).

import cupy

# Create a CuPy array.
cx1 = cupy.random.randn(1, 2, 3, 4).astype(cupy.float32)

# Convert it into a DLPack tensor.
dx = cx1.toDlpack()

# Convert it back to a CuPy array.
cx2 = cupy.from_dlpack(dx)

Favorite site

CuPy – NumPy & SciPy for GPU — CuPy 13.2.0 documentation