PyTorch
Categories
- torch.autograd - a tape-based automatic differentiation library that supports all differentiable Tensor operations in Torch
- torch.device - switching between CPU and GPU
- torch.nn - a neural network library deeply integrated with autograd, designed for maximum flexibility - https://pytorch.org/docs/stable/nn.html
- torch.legacy(.nn/.optim) - legacy code ported over from Torch for backward compatibility
Libraries
- torchaudio
- torchtext
- torchvision
- TorchElastic
- TorchServe - TorchServe is a flexible and easy-to-use tool for serving PyTorch models.
- PyTorch on XLA Devices
- PyTorch vs Tensorflow
Projects
- Github - roytseng-tw/Detectron.pytorch
- Github - facebookresearch/maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch
- CppRl - PyTorch C++ Reinforcement Learning
- PyTorch C++ Reinforcement Learning
- https://github.com/Omegastick/pytorch-cpp-rl
- Reddit - CppRl: A C++ reinforcement learning library using the new PyTorch C++ frontend : MachineLearning
- higher
- higher is a pytorch library allowing users to obtain higher order gradients over losses spanning training loops rather than individual training steps.
- https://github.com/facebookresearch/higher
- Kornia
- Open Source Differentiable Computer Vision Library for PyTorch
- https://github.com/arraiyopensource/kornia
- pytorch-struct
- https://github.com/harvardnlp/pytorch-struct
- Fast, general, and tested differentiable structured prediction in PyTorch
- A library of tested, GPU implementations of core structured prediction algorithms for deep learning applications (or an implementation of "Inside-Outside and Forward-Backward Algorithms Are Just Backprop"1)
- cvxpylayers
- https://github.com/cvxgrp/cvxpylayers
- Differentiable convex optimization layers
- cvxpylayers is a Python library for constructing differentiable convex optimization layers in PyTorch and TensorFlow using CVXPY. A convex optimization layer solves a parametrized convex optimization problem in the forward pass to produce a solution. It computes the derivative of the solution with respect to the parameters in the backward pass.
- pycls
- https://github.com/facebookresearch/pycls
- Codebase for Image Classification Research, written in PyTorch.
- pycls is an image classification codebase, written in PyTorch. The codebase was originally developed for a project that led to the On Network Design Spaces for Visual Recognition work. pycls has since matured into a general image classification codebase that has been adopted by a number of representation learning projects at Facebook AI Research.
PyTorch Visualization
- Tensorboard-PyTorch
Distributed Applications
Download and Requirements
For example, to install v2.3.0:
# ROCM 6.0 (Linux only)
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/rocm6.0
# CUDA 11.8
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
# CUDA 12.1
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu121
# CPU only
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cpu
Once you have picked one of these, you can pin it in a requirements.txt file like the following (this particular example pins the CUDA 11.7 build of an earlier release, v2.0.1):
--index-url https://download.pytorch.org/whl/cu117
torch==2.0.1+cu117
torchvision==0.15.2+cu117
torchaudio==2.0.2+cu117
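After installation, a quick sanity check confirms which build was installed and whether CUDA is usable:

import torch

print(torch.__version__)          # e.g. 2.3.0+cu121
print(torch.cuda.is_available())  # True only if this is a CUDA build and a working driver is found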
torch.no_grad()
When code is wrapped in a no_grad() with statement like this, PyTorch turns off the autograd engine, which means gradients are no longer tracked automatically. This raises a natural question: if we never run backpropagation via loss.backward(), does it really matter whether gradients are computed or not? True, and that is exactly the point: the main purpose of torch.no_grad() is to reduce memory usage and speed up computation by turning autograd off. The gradients would never be used anyway, so there is no reason to compute them during inference. This is why inference code is normally wrapped in a torch.no_grad() with statement, as in the sketch below.
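A minimal sketch of the usual inference pattern (the nn.Linear model here is a hypothetical stand-in):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # hypothetical model, for illustration only
x = torch.randn(1, 10)

with torch.no_grad():      # autograd off: no graph is built
    y = model(x)

print(y.requires_grad)     # False - the output is detached from any graph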
model.eval()
Back to the original question: isn't torch.no_grad() alone enough, now that gradients are no longer computed? That is true as far as gradients go, but model.eval() plays a slightly different role. As of this writing (2019), some layers behave differently between training and inference. For example, a Dropout layer should be active during training but not during inference, and BatchNorm behaves similarly. model.eval() is used precisely to switch such layers into inference (eval) mode. Therefore, to get the model behavior we usually want, both calls should be used together, as sketched below.
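A small sketch showing why both calls matter, using a hypothetical model that contains Dropout:

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5))
x = torch.ones(1, 10)

model.train()
print(model(x))            # dropout active: repeated calls give different outputs

model.eval()               # switch Dropout/BatchNorm into inference behavior
with torch.no_grad():      # and skip gradient tracking
    print(model(x))        # deterministic output, no graph built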
Device Pointer
(Naturally) a device pointer can only be obtained with tensor.data_ptr() when the tensor has been allocated in CUDA memory with device='cuda'.
import torch

# Create a CUDA tensor
tensor = torch.arange(1000000, device='cuda').reshape(1000, 1000)
# Get the device pointer
dev_ptr = tensor.data_ptr()
# Print the device pointer
print(f'Device pointer: {dev_ptr}')
Deep Reinforcement Learning Algorithms with PyTorch
- Deep Q Learning (DQN) (Mnih et al. 2013)
- DQN with Fixed Q Targets (Mnih et al. 2013)
- Double DQN (DDQN) (Hado van Hasselt et al. 2015)
- DDQN with Prioritised Experience Replay (Schaul et al. 2016)
- Dueling DDQN (Wang et al. 2016)
- REINFORCE (Williams et al. 1992)
- Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al. 2016)
- Twin Delayed Deep Deterministic Policy Gradients (TD3) (Fujimoto et al. 2018)
- Soft Actor-Critic (SAC & SAC-Discrete) (Haarnoja et al. 2018)
- Asynchronous Advantage Actor Critic (A3C) (Mnih et al. 2016)
- Synchronous Advantage Actor Critic (A2C)
- Proximal Policy Optimisation (PPO) (Schulman et al. 2017)
- DQN with Hindsight Experience Replay (DQN-HER) (Andrychowicz et al. 2018)
- DDPG with Hindsight Experience Replay (DDPG-HER) (Andrychowicz et al. 2018)
- Hierarchical-DQN (h-DQN) (Kulkarni et al. 2016)
- Stochastic NNs for Hierarchical Reinforcement Learning (SNN-HRL) (Florensa et al. 2017)
- Diversity Is All You Need (DIAYN) (Eysenbach et al. 2018)
Memory management
- pytorch - RuntimeError: CUDA out of memory. How setting max_split_size_mb? - Stack Overflow
- CUDA semantics — PyTorch 1.13 documentation
PyTorch uses a caching memory allocator to speed up memory allocations. This allows fast memory deallocation without device synchronizations. However, the unused memory managed by the allocator will still show as if used in nvidia-smi. You can use memory_allocated() and max_memory_allocated() to monitor memory occupied by tensors, and use memory_reserved() and max_memory_reserved() to monitor the total amount of memory managed by the caching allocator. Calling empty_cache() releases all unused cached memory from PyTorch so that those can be used by other GPU applications. However, the occupied GPU memory by tensors will not be freed so it can not increase the amount of GPU memory available for PyTorch.
For more advanced users, we offer more comprehensive memory benchmarking via memory_stats(). We also offer the capability to capture a complete snapshot of the memory allocator state via memory_snapshot(), which can help you understand the underlying allocation patterns produced by your code.
Use of a caching allocator can interfere with memory checking tools such as cuda-memcheck. To debug memory errors using cuda-memcheck, set PYTORCH_NO_CUDA_MEMORY_CACHING=1 in your environment to disable caching.
The behavior of caching allocator can be controlled via environment variable PYTORCH_CUDA_ALLOC_CONF. The format is PYTORCH_CUDA_ALLOC_CONF=<option>:<value>,<option2>:<value2>... Available options:
- max_split_size_mb prevents the allocator from splitting blocks larger than this size (in MB). This can help prevent fragmentation and may allow some borderline workloads to complete without running out of memory. Performance cost can range from 'zero' to 'substantial' depending on allocation patterns. Default value is unlimited, i.e. all blocks can be split. The memory_stats() and memory_summary() methods are useful for tuning. This option should be used as a last resort for a workload that is aborting due to 'out of memory' and showing a large amount of inactive split blocks.
- roundup_power2_divisions helps with rounding the requested allocation size to the nearest power-2 division, making better use of the blocks. In the current CUDACachingAllocator, sizes are rounded up in multiples of the block size of 512, so this works fine for smaller sizes. However, this can be inefficient for large nearby allocations, as each will go to a different block size and re-use of those blocks is minimized. This might create lots of unused blocks and waste GPU memory capacity. This option enables rounding of the allocation size to the nearest power-2 division. For example, if we need to round up a size of 1200 and the number of divisions is 4: the size 1200 lies between 1024 and 2048, and 4 divisions between them give the values 1024, 1280, 1536, and 1792. So an allocation size of 1200 is rounded to 1280 as the nearest ceiling of a power-2 division.
- roundup_bypass_threshold_mb bypasses rounding of the requested allocation size for allocation requests larger than the threshold value (in MB). This can help reduce the memory footprint when making large allocations that are expected to be persistent or long-lived.
- garbage_collection_threshold helps actively reclaim unused GPU memory to avoid triggering an expensive sync-and-reclaim-all operation (release_cached_blocks), which can be unfavorable to latency-critical GPU applications (e.g., servers). Upon setting this threshold (e.g., 0.8), the allocator starts reclaiming GPU memory blocks if GPU memory usage exceeds the threshold (i.e., 80% of the total memory allocated to the GPU application). The algorithm prefers to free old and unused blocks first, to avoid freeing blocks that are actively being reused. The threshold value should be greater than 0.0 and less than 1.0.
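A minimal sketch of tuning and monitoring the allocator (the option values below are illustrative, not recommendations):

import os

# Must take effect before CUDA is initialized, so set it before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128,garbage_collection_threshold:0.8"

import torch

x = torch.randn(4096, 4096, device="cuda")
print(f"allocated: {torch.cuda.memory_allocated() / 2**20:.1f} MiB")  # live tensors
print(f"reserved:  {torch.cuda.memory_reserved() / 2**20:.1f} MiB")   # held by the caching allocator

del x
torch.cuda.empty_cache()            # release unused cached blocks back to the driver
print(torch.cuda.memory_summary())  # human-readable allocator statistics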
Troubleshooting
PyTorch multiprocessing
When using PyTorch together with the Python multiprocessing package, the following error can occur:
RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
As stated in the PyTorch documentation, the best practice for multiprocessing is to use torch.multiprocessing instead of multiprocessing. Be aware that sharing CUDA tensors between processes is supported only in Python 3, with either spawn or forkserver as the start method. Without touching the rest of your code, a workaround for the error above is to replace your multiprocessing imports with:
from torch.multiprocessing import Pool, Process, set_start_method

try:
    set_start_method('spawn')
except RuntimeError:
    pass
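For completeness, a sketch of the recommended pattern (assuming a machine with a CUDA device): torch.multiprocessing.spawn starts workers with the spawn method for you.

import torch
import torch.multiprocessing as mp

def worker(rank: int) -> None:
    # Each spawned process can safely initialize CUDA on its own.
    x = torch.ones(2, 2, device='cuda')
    print(f'worker {rank}: {x.sum().item()}')

if __name__ == '__main__':
    mp.spawn(worker, nprocs=2)  # spawn (not fork) is required for CUDA children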
Library not loaded libomp.dylib
Traceback (most recent call last):
File "setup.py", line 6, in <module>
import torch
File "/usr/local/c2core/lib/python3.7/site-packages/torch/__init__.py", line 79, in <module>
from torch._C import *
ImportError: dlopen(/usr/local/c2core/lib/python3.7/site-packages/torch/_C.cpython-37m-darwin.so, 9): Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
Referenced from: /usr/local/c2core/lib/python3.7/site-packages/torch/lib/libshm.dylib
Reason: image not found
Install OpenMP to fix this. On macOS, use brew install libomp.
Found no NVIDIA driver on your system
The following error can occur when building CUDA-related PyTorch code:
Traceback (most recent call last):
File "setup.py", line 68, in <module>
cmdclass={"build_ext": torch.utils.cpp_extension.BuildExtension},
File "/usr/local/c2core/lib/python3.7/site-packages/setuptools/__init__.py", line 145, in setup
return distutils.core.setup(**attrs)
File "/usr/local/c2core/lib/python3.7/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/local/c2core/lib/python3.7/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/usr/local/c2core/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/c2core/lib/python3.7/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/usr/local/c2core/lib/python3.7/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/local/c2core/lib/python3.7/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/c2core/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 78, in run
_build_ext.run(self)
File "/usr/local/c2core/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 186, in run
_build_ext.build_ext.run(self)
File "/usr/local/c2core/lib/python3.7/distutils/command/build_ext.py", line 340, in run
self.build_extensions()
File "/usr/local/c2core/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 372, in build_extensions
build_ext.build_extensions(self)
File "/usr/local/c2core/lib/python3.7/site-packages/Cython/Distutils/old_build_ext.py", line 195, in build_extensions
_build_ext.build_ext.build_extensions(self)
File "/usr/local/c2core/lib/python3.7/distutils/command/build_ext.py", line 449, in build_extensions
self._build_extensions_serial()
File "/usr/local/c2core/lib/python3.7/distutils/command/build_ext.py", line 474, in _build_extensions_serial
self.build_extension(ext)
File "/usr/local/c2core/lib/python3.7/site-packages/setuptools/command/build_ext.py", line 199, in build_extension
_build_ext.build_extension(self, ext)
File "/usr/local/c2core/lib/python3.7/distutils/command/build_ext.py", line 534, in build_extension
depends=ext.depends)
File "/usr/local/c2core/lib/python3.7/distutils/ccompiler.py", line 574, in compile
self._compile(obj, src, ext, cc_args, extra_postargs, pp_opts)
File "/usr/local/c2core/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 288, in unix_wrap_compile
"'-fPIC'"] + cflags + _get_cuda_arch_flags(cflags)
File "/usr/local/c2core/lib/python3.7/site-packages/torch/utils/cpp_extension.py", line 1013, in _get_cuda_arch_flags
capability = torch.cuda.get_device_capability()
File "/usr/local/c2core/lib/python3.7/site-packages/torch/cuda/__init__.py", line 320, in get_device_capability
prop = get_device_properties(device)
File "/usr/local/c2core/lib/python3.7/site-packages/torch/cuda/__init__.py", line 325, in get_device_properties
_lazy_init() # will define _get_device_properties and _CudaDeviceProperties
File "/usr/local/c2core/lib/python3.7/site-packages/torch/cuda/__init__.py", line 196, in _lazy_init
_check_driver()
File "/usr/local/c2core/lib/python3.7/site-packages/torch/cuda/__init__.py", line 101, in _check_driver
http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
This happens because of PyTorch's runtime driver check. If you want to pass options to nvcc directly, set the TORCH_CUDA_ARCH_LIST environment variable to a list of CUDA architectures separated by ;.
TORCH_CUDA_ARCH_LIST="Pascal;Volta;Turing" FORCE_CUDA=1 /usr/local/c2core/bin/python3.7 setup.py build
See the maskrcnn-benchmark#Force cuda build section.
Why results keep changing
INFORMATION |
One-line summary: the most important thing is to call model.eval() when predicting. |
When predicting with a model in PyTorch, the results may keep changing due to several factors:
The model is in training mode
- A PyTorch model has a train() mode and an eval() mode. In train() mode, layers such as dropout and batch normalization are active and can produce different results on every call.
- In eval() mode, by contrast, these layers are disabled and predictions become deterministic.
- Therefore, call model.eval() to switch to evaluation mode before running predictions.
Random Number Generator (RNG)
Random numbers are used in model initialization and data sampling. If you did not set a seed, each run can produce different results. Fixing the seed gives reproducible results, as in the sketch below.
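A minimal sketch of fixing the seeds (set_seed is a hypothetical helper name):

import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    # Fix every RNG commonly involved in a PyTorch run.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Optional: trade speed for deterministic cuDNN kernels.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed(42)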
Dropout
Dropout prevents overfitting by randomly disabling certain neurons during training. While dropout is active in training mode, each forward pass can produce a different prediction. Switching to model.eval() disables dropout.
Data sampling
If the DataLoader samples randomly from the dataset, different data may be fed to the model on each run, which can also change predictions. To fix this, seed the RNG used for data sampling, for example as sketched below.
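A sketch of reproducible shuffling, assuming a simple TensorDataset for illustration:

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100).float())

g = torch.Generator()
g.manual_seed(42)  # the sampler draws its shuffle order from this generator

loader = DataLoader(dataset, batch_size=10, shuffle=True, generator=g)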
Multiprocessing
When using multiprocessing, subtle differences between workers can also lead to different prediction results.
Check whether a scaler is used
There are many situations where the data needs rescaling:
- In analysis, when variables are on very different scales - when differences in units make the numeric scales diverge widely
- In neural network training, when dataset values fluctuate wildly or are very large, the cost can diverge and training fails
A scaler is what handles this; if one was used, it should be removed.
from numpy.typing import NDArray
from pandas import DataFrame
from sklearn.preprocessing import StandardScaler
# ...
def rescale_standard(
df: DataFrame,
*,
copy=True,
with_mean=True,
with_std=True,
) -> NDArray:
scaler = StandardScaler(copy=copy, with_mean=with_mean, with_std=with_std)
return scaler.fit_transform(df)
# ...
df_without_rr = drop_rate(df)
if args.use_scaler:
scaled_features = rescale_standard(df_without_rr)
if args.debug and args.verbose >= VERBOSE_LEVEL_2:
logger.debug(f"Scaled features:\n{scaled_features}")
else:
scaled_features = df_without_rr.to_numpy().astype(float)
Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
- (ML)(PyTorch) Fixing the "Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!" error
- (What do I do with this) optim in step: expected device cpu but got device cuda:0 :: Blue collar Developer
Use Model().cuda() or Model().to(device), for example as sketched below.
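A minimal sketch (with a hypothetical model) of keeping the model and its inputs on the same device:

import torch
import torch.nn as nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(10, 2).to(device)     # move parameters to the target device
x = torch.randn(1, 10, device=device)   # allocate inputs on the same device
y = model(x)                            # no cross-device mismatch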
If the problem occurred after calling load_state_dict on the optimizer:
optim_state = torch.load(CHECKPOINT_PATH)["optimizer"]  # CHECKPOINT_PATH: path to your checkpoint
optim.load_state_dict(optim_state)
# Move every tensor in the optimizer state onto the target device.
for state in optim.state.values():
    for k, v in state.items():
        if torch.is_tensor(v):
            state[k] = v.to(device)
WARNING |
It didn't work for me, though... needs further investigation later. |
See also
Favorite site
Tutorials
- Welcome to PyTorch Tutorials (Korean)
- Post list for the 'EDA Study/PyTorch' category - Deep learning basics with PyTorch
Article
- PyTorch is dead. Long live JAX | GeekNews (JAX, XLA) - worth reading, but rather extreme.
Guide
- [Recommended] PyTorch usage - 03. How to Use PyTorch
- A Taste of PyTorch C++ frontend API
- [Recommended] OpenCL Support · Issue #488 · pytorch/pytorch - a long discussion of OpenCL support in PyTorch.
Example
- Use the script below and run it on each machine, changing only the Node value - Distributed Training
Awesome
References
1. Eisner.spnlp16.pdf ↩