CUDA:Troubleshooting

Notes on how to resolve CUDA-related problems.

C++11 support

nvcc: C++11 standard in CUDA frontend? (dependencies, gcc, Windows vs. Linux)

When nvcc's --std c++11 option is used, an error like the following can occur.

[your@server down]$ nvcc --std c++11 main.cpp 
nvcc warning : The -c++11 flag is not supported with the configured host compiler. Flag will be ignored.
In file included from /usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/thread:35,
                 from main.cpp:1:
/usr/lib/gcc/x86_64-redhat-linux/4.4.7/../../../../include/c++/4.4.7/c++0x_warning.h:31:2: error:
    #error This file requires compiler and library support for the upcoming ISO C++ standard,
    C++0x. This support is currently experimental, and must be enabled with the -std=c++0x or -std=gnu++0x compiler options.

In this case the option has to be passed directly to the host compiler, as follows.

nvcc -Xcompiler -std=c++0x main.cu

Explicitly specify the language

When a .cpp file is passed, nvcc does not treat it as CUDA source. To force it to be treated as CUDA, use the following.

nvcc --x cu test.cpp

Compiling CUDA in QT with MinGW

Stackoverflow: Compiling CUDA in QT with MinGW

CUDA support for MinGw is not available. CUDA is only supported by Visual C++ environment in Windows. Using CUDA with QT under MinGw was not an option for me after this.

CUDA unavailable over Remote Desktop

cudaGetDeviceCount

Over Remote Desktop, GeForce-series devices cannot be detected properly, so the check has to be done directly on the machine itself.

For reference, in this situation cudaGetDeviceCount() returns cudaErrorNoDevice.

invalid permissions

At runtime, the following error message may be printed.

memory access violation at address: 0x703cc0000: invalid permissions

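This error occurs when host code and device code touch memory that belongs to the other side. A typical example is accessing memory allocated with cudaMalloc directly from the host.

In that case, use cudaMemcpy or a similar call to copy host data into device memory instead of accessing the pointer directly.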

Undefined __float128

The following error may occur while compiling CUDA code.

/usr/include/c++/4.9/type_traits(279): error: identifier "__float128" is undefined

If -std=gnu++11 was added to the NVCC compiler options, change it to -std=c++11. (The test environment was Ubuntu 14.04, GCC 4.9, NVCC CUDA 7.5.17.)

If Boost is used, check Boost Ticket 11852 and apply its patch. The relevant header is boost/config/compiler/gcc.hpp, and the change is as follows.

 include/boost/config/compiler/gcc.hpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/boost/config/compiler/gcc.hpp b/include/boost/config/compiler/gcc.hpp
index d9dd59d..48605eb 100644

Index: include/boost/config/compiler/gcc.hpp
===================================================================
--- a/include/boost/config/compiler/gcc.hpp
+++ b/include/boost/config/compiler/gcc.hpp
@@ -153,7 +153,7 @@
 #else
 #include <stddef.h>
 #endif
-#if defined(_GLIBCXX_USE_FLOAT128) && !defined(__STRICT_ANSI__)
+#if defined(_GLIBCXX_USE_FLOAT128) && !defined(__STRICT_ANSI__) && !defined(__CUDACC__)
 # define BOOST_HAS_FLOAT128
 #endif

Without Monitor

CUDA may not work properly when no monitor is connected. On Linux, this can be resolved by installing the X Window System xserver-xorg-video-dummy driver.

Alternatively, on Windows, switch between the WDDM and TCC driver modes.

Stop on reboot

nvidia-smi stops working after reboot ununtu 18.04 - Graphics / Linux / Linux - NVIDIA Developer Forums

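Notes on the issue where nvidia-smi stops working after a reboot.

August 22, 2018: do not use the .run installer. Install the graphics driver from the PPA first, then install from the .deb file.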

Compiling dynamic parallelism

Add -rdc=true to the compile options, and link against the cudadevrt library.
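As a rough sketch of the two requirements above, the Python helper below assembles an nvcc command line containing -rdc=true and -lcudadevrt. The helper name `build_nvcc_command`, the file names, and the optional `arch` argument are assumptions made for illustration; only the two flags come from the text.

```python
def build_nvcc_command(sources, output, arch=None):
    """Assemble an nvcc command line for code that uses dynamic parallelism.

    Hypothetical helper: only -rdc=true and -lcudadevrt come from the text.
    The optional arch value (for example "sm_35") is an assumption.
    """
    cmd = ["nvcc", "-rdc=true"]        # generate relocatable device code
    if arch:
        cmd.append("-arch=" + arch)    # target architecture, only if given
    cmd += ["-o", output]
    cmd += list(sources)
    cmd.append("-lcudadevrt")          # link the CUDA device runtime library
    return cmd


command = build_nvcc_command(["main.cu"], "app")
print(" ".join(command))
```

The list form can be handed to `subprocess.run` unchanged, which avoids shell quoting problems with unusual file names.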

CUDA dynamic parallelism debugging is not supported in preemption mode

Stackoverflow - Can't debug CUDA: CUDA dynamic parallelism debugging is not supported in preemption mode

Right-click the Nsight Monitor tray icon and open "Options\CUDA\Debugger". Except for TCC GPUs, all GPUs are forced to use "Software Preemption" by default.

You can set "Desktop GPUS must use Software Preemption" and "Headless GPUs must use software preemption" to false. Also make sure that in Visual Studio the setting "Nsight\Options\CUDA\Preemption Preference" is "Prefer no Software Preemption".

cudaErrorLaunchOutOfResources

This indicates that a launch did not occur because it did not have appropriate resources. Although this error is similar to cudaErrorInvalidConfiguration, this error usually indicates that the user has attempted to pass too many arguments to the device kernel, or the kernel launch specifies too many threads for the kernel's register count.

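This is one of the CUDA error types. If this error occurs, check the following.

Check the number of threads passed to the kernel launch.
Check the number of registers used per thread (and limit it if needed).
Check the amount of heap memory allocated.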
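The first two checks amount to simple arithmetic, sketched below in Python. The function name and the default limits (1024 threads per block, 65536 registers per block) are assumptions that match many recent GPUs but not all. The real values come from the device properties, and the sketch ignores register allocation granularity.

```python
def launch_fits(threads_per_block, regs_per_thread,
                max_threads_per_block=1024, regs_per_block=65536):
    """Rough pre-check for cudaErrorLaunchOutOfResources.

    Assumed defaults: 1024 threads and 65536 registers per block.
    Query the actual device properties for real limits.
    """
    if threads_per_block > max_threads_per_block:
        return False
    # Every thread in the block needs its own set of registers.
    return threads_per_block * regs_per_thread <= regs_per_block


print(launch_fits(1024, 64))   # 65536 registers needed
print(launch_fits(1024, 80))   # 81920 registers needed
print(launch_fits(512, 80))    # 40960 registers needed
```

When the register product is too large, either launch fewer threads per block or cap the register usage per thread, for example with nvcc's --maxrregcount option or a __launch_bounds__ qualifier on the kernel.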

Depends: cuda-11-5 (>= 11.5.0) but it is not going to be installed

After downloading and installing the latest CUDA Toolkit, the following issue may occur.

$ sudo apt-get -y install cuda
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 cuda : Depends: cuda-11-5 (>= 11.5.0) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

Checking with dpkg shows:

$ dpkg -l | grep cuda
ii  cuda-command-line-tools-11-0                      11.0.3-1                                        amd64        CUDA command-line tools
ii  cuda-compiler-11-0                                11.0.3-1                                        amd64        CUDA compiler
ii  cuda-cudart-11-0                                  11.0.221-1                                      amd64        CUDA Runtime native Libraries
ii  cuda-cudart-dev-11-0                              11.0.221-1                                      amd64        CUDA Runtime native dev links, headers
ii  cuda-cuobjdump-11-0                               11.0.221-1                                      amd64        CUDA cuobjdump
ii  cuda-cupti-11-0                                   11.0.221-1                                      amd64        CUDA profiling tools runtime libs.
ii  cuda-cupti-dev-11-0                               11.0.221-1                                      amd64        CUDA profiling tools interface.
ii  cuda-documentation-11-0                           11.0.228-1                                      amd64        CUDA documentation
ii  cuda-driver-dev-11-0                              11.0.221-1                                      amd64        CUDA Driver native dev stub library
ii  cuda-gdb-11-0                                     11.0.221-1                                      amd64        CUDA-GDB
ii  cuda-libraries-11-0                               11.0.3-1                                        amd64        CUDA Libraries 11.0 meta-package
ii  cuda-libraries-dev-11-0                           11.0.3-1                                        amd64        CUDA Libraries 11.0 development meta-package
ii  cuda-memcheck-11-0                                11.0.221-1                                      amd64        CUDA-MEMCHECK
ii  cuda-nsight-11-0                                  11.0.221-1                                      amd64        CUDA nsight
ii  cuda-nsight-compute-11-0                          11.0.3-1                                        amd64        NVIDIA Nsight Compute
ii  cuda-nsight-systems-11-0                          11.0.3-1                                        amd64        NVIDIA Nsight Systems
ii  cuda-nvcc-11-0                                    11.0.221-1                                      amd64        CUDA nvcc
ii  cuda-nvdisasm-11-0                                11.0.221-1                                      amd64        CUDA disassembler
ii  cuda-nvml-dev-11-0                                11.0.167-1                                      amd64        NVML native dev links, headers
ii  cuda-nvprof-11-0                                  11.0.221-1                                      amd64        CUDA Profiler tools
ii  cuda-nvprune-11-0                                 11.0.221-1                                      amd64        CUDA nvprune
ii  cuda-nvrtc-11-0                                   11.0.221-1                                      amd64        NVRTC native runtime libraries
ii  cuda-nvrtc-dev-11-0                               11.0.221-1                                      amd64        NVRTC native dev links, headers
ii  cuda-nvtx-11-0                                    11.0.167-1                                      amd64        NVIDIA Tools Extension
ii  cuda-nvvp-11-0                                    11.0.221-1                                      amd64        CUDA Profiler tools
ii  cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01 1.0-1                                           amd64        cuda repository configuration files
ii  cuda-repo-ubuntu1804-11-4-local                   11.4.2-470.57.02-1                              amd64        cuda repository configuration files
ii  cuda-repo-ubuntu1804-11-5-local                   11.5.0-495.29.05-1                              amd64        cuda repository configuration files
ii  cuda-samples-11-0                                 11.0.221-1                                      amd64        CUDA example applications
ii  cuda-sanitizer-11-0                               11.0.221-1                                      amd64        CUDA Sanitizer
ii  cuda-toolkit-11-0                                 11.0.3-1                                        amd64        CUDA Toolkit 11.0 meta-package
ii  cuda-tools-11-0                                   11.0.3-1                                        amd64        CUDA Tools meta-package
ii  cuda-visual-tools-11-0                            11.0.3-1                                        amd64        CUDA visual tools

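The cause is that apt update was run against a previously downloaded local-install CUDA toolkit repository, but a newer version had been released in the meantime, which led to the conflict above.

Remove the previously installed CUDA packages: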

$ sudo apt purge cuda
$ sudo apt autoremove

Whatever apt cannot find can be looked up directly with dpkg:

$ dpkg -l | grep cuda
rc  cuda-cudart-11-0                                  11.0.221-1                                      amd64        CUDA Runtime native Libraries
ii  cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01 1.0-1                                           amd64        cuda repository configuration files
ii  cuda-repo-ubuntu1804-11-4-local                   11.4.2-470.57.02-1                              amd64        cuda repository configuration files
ii  cuda-repo-ubuntu1804-11-5-local                   11.5.0-495.29.05-1                              amd64        cuda repository configuration files
rc  cuda-toolkit-11-0                                 11.0.3-1                                        amd64        CUDA Toolkit 11.0 meta-package
rc  cuda-visual-tools-11-0                            11.0.3-1                                        amd64        CUDA visual tools

Remove every cuda-repo- package except the one for the version you need.

$ sudo dpkg -P cuda-repo-ubuntu1804-11-4-local
$ sudo dpkg -P cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01
$ dpkg -l | grep cuda
rc  cuda-cudart-11-0                           11.0.221-1                                      amd64        CUDA Runtime native Libraries
ii  cuda-repo-ubuntu1804-11-5-local            11.5.0-495.29.05-1                              amd64        cuda repository configuration files
rc  cuda-toolkit-11-0                          11.0.3-1                                        amd64        CUDA Toolkit 11.0 meta-package
rc  cuda-visual-tools-11-0                     11.0.3-1                                        amd64        CUDA visual tools
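Picking out the cuda-repo- packages to purge can be sketched in Python as follows. The function names are hypothetical, the sample rows are shortened from the dpkg listing above, and the script only prints the commands rather than running them.

```python
def parse_dpkg_lines(text):
    """Return (status, package, version) tuples from `dpkg -l` style rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Package rows start with a two-letter status such as "ii" or "rc".
        if len(fields) >= 3 and fields[0] in ("ii", "rc"):
            rows.append((fields[0], fields[1], fields[2]))
    return rows


def stale_repo_packages(rows, keep):
    """Installed cuda-repo-* packages other than the one to keep."""
    return [name for status, name, _ in rows
            if status == "ii" and name.startswith("cuda-repo-") and name != keep]


sample = """\
rc  cuda-cudart-11-0                                  11.0.221-1          amd64  CUDA Runtime native Libraries
ii  cuda-repo-ubuntu1804-10-2-local-10.2.89-440.33.01 1.0-1               amd64  cuda repository configuration files
ii  cuda-repo-ubuntu1804-11-4-local                   11.4.2-470.57.02-1  amd64  cuda repository configuration files
ii  cuda-repo-ubuntu1804-11-5-local                   11.5.0-495.29.05-1  amd64  cuda repository configuration files
"""

rows = parse_dpkg_lines(sample)
stale = stale_repo_packages(rows, keep="cuda-repo-ubuntu1804-11-5-local")
for name in stale:
    print("sudo dpkg -P", name)
```

Rows marked rc are already removed with only configuration files left, so the sketch skips them and lists only installed (ii) repository packages.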

Then install cuda:

$ sudo apt install cuda
Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following additional packages will be installed:
  cuda-11-5 cuda-cccl-11-5 cuda-command-line-tools-11-5 cuda-compiler-11-5 cuda-cudart-11-5 cuda-cudart-dev-11-5 cuda-cuobjdump-11-5 cuda-cupti-11-5 cuda-cupti-dev-11-5 cuda-cuxxfilt-11-5
  cuda-demo-suite-11-5 cuda-documentation-11-5 cuda-driver-dev-11-5 cuda-drivers cuda-drivers-495 cuda-gdb-11-5 cuda-libraries-11-5 cuda-libraries-dev-11-5 cuda-memcheck-11-5 cuda-nsight-11-5
  cuda-nsight-compute-11-5 cuda-nsight-systems-11-5 cuda-nvcc-11-5 cuda-nvdisasm-11-5 cuda-nvml-dev-11-5 cuda-nvprof-11-5 cuda-nvprune-11-5 cuda-nvrtc-11-5 cuda-nvrtc-dev-11-5 cuda-nvtx-11-5
  cuda-nvvp-11-5 cuda-runtime-11-5 cuda-samples-11-5 cuda-sanitizer-11-5 cuda-toolkit-11-5 cuda-toolkit-11-5-config-common cuda-toolkit-11-config-common cuda-toolkit-config-common cuda-tools-11-5
  cuda-visual-tools-11-5 dkms gds-tools-11-5 libcublas-11-5 libcublas-dev-11-5 libcufft-11-5 libcufft-dev-11-5 libcufile-11-5 libcufile-dev-11-5 libcurand-11-5 libcurand-dev-11-5 libcusolver-11-5
  libcusolver-dev-11-5 libcusparse-11-5 libcusparse-dev-11-5 libnpp-11-5 libnpp-dev-11-5 libnvidia-cfg1-495 libnvidia-common-495 libnvidia-compute-495 libnvidia-decode-495 libnvidia-encode-495
  libnvidia-extra-495 libnvidia-fbc1-495 libnvidia-gl-495 libnvjpeg-11-5 libnvjpeg-dev-11-5 libopengl0 liburcu6 libxnvctrl0 nsight-compute-2021.3.0 nsight-systems-2021.3.3 nvidia-compute-utils-495
  nvidia-dkms-495 nvidia-driver-495 nvidia-kernel-common-495 nvidia-kernel-source-495 nvidia-modprobe nvidia-prime nvidia-settings nvidia-utils-495 screen-resolution-extra xserver-xorg-video-nvidia-495
Suggested packages:
  menu
Recommended packages:
  libnvidia-compute-495:i386 libnvidia-decode-495:i386 libnvidia-encode-495:i386 libnvidia-fbc1-495:i386 libnvidia-gl-495:i386
The following NEW packages will be installed:
  cuda cuda-11-5 cuda-cccl-11-5 cuda-command-line-tools-11-5 cuda-compiler-11-5 cuda-cudart-11-5 cuda-cudart-dev-11-5 cuda-cuobjdump-11-5 cuda-cupti-11-5 cuda-cupti-dev-11-5 cuda-cuxxfilt-11-5
  cuda-demo-suite-11-5 cuda-documentation-11-5 cuda-driver-dev-11-5 cuda-drivers cuda-drivers-495 cuda-gdb-11-5 cuda-libraries-11-5 cuda-libraries-dev-11-5 cuda-memcheck-11-5 cuda-nsight-11-5
  cuda-nsight-compute-11-5 cuda-nsight-systems-11-5 cuda-nvcc-11-5 cuda-nvdisasm-11-5 cuda-nvml-dev-11-5 cuda-nvprof-11-5 cuda-nvprune-11-5 cuda-nvrtc-11-5 cuda-nvrtc-dev-11-5 cuda-nvtx-11-5
  cuda-nvvp-11-5 cuda-runtime-11-5 cuda-samples-11-5 cuda-sanitizer-11-5 cuda-toolkit-11-5 cuda-toolkit-11-5-config-common cuda-toolkit-11-config-common cuda-toolkit-config-common cuda-tools-11-5
  cuda-visual-tools-11-5 dkms gds-tools-11-5 libcublas-11-5 libcublas-dev-11-5 libcufft-11-5 libcufft-dev-11-5 libcufile-11-5 libcufile-dev-11-5 libcurand-11-5 libcurand-dev-11-5 libcusolver-11-5
  libcusolver-dev-11-5 libcusparse-11-5 libcusparse-dev-11-5 libnpp-11-5 libnpp-dev-11-5 libnvidia-cfg1-495 libnvidia-common-495 libnvidia-compute-495 libnvidia-decode-495 libnvidia-encode-495
  libnvidia-extra-495 libnvidia-fbc1-495 libnvidia-gl-495 libnvjpeg-11-5 libnvjpeg-dev-11-5 libopengl0 liburcu6 libxnvctrl0 nsight-compute-2021.3.0 nsight-systems-2021.3.3 nvidia-compute-utils-495
  nvidia-dkms-495 nvidia-driver-495 nvidia-kernel-common-495 nvidia-kernel-source-495 nvidia-modprobe nvidia-prime nvidia-settings nvidia-utils-495 screen-resolution-extra xserver-xorg-video-nvidia-495
0 upgraded, 83 newly installed, 0 to remove and 0 not upgraded.
Need to get 2,532 MB of archives.
After this operation, 5,709 MB of additional disk space will be used.
Do you want to continue? [Y/n] Y

RuntimeError: CUDA out of memory

(Resolving CUDA errors) RuntimeError: CUDA error: out of memory / For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

First error

RuntimeError: CUDA out of memory. Tried to allocate 94.00 MiB (GPU 1; 23.65 GiB total capacity; 0 bytes already allocated; 30.31 MiB free; 0 bytes reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

If you plan to adjust the PYTORCH_CUDA_ALLOC_CONF environment variable, see the PyTorch#Memory management section.
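A minimal sketch of setting the max_split_size_mb option named in the error message, assuming the variable is set from Python rather than from the shell. The value 128 is purely illustrative, and the safest place for this code is before PyTorch is imported in the process.

```python
import os

# Assumption: 128 is only an illustrative value; tune it for the workload.
MAX_SPLIT_SIZE_MB = 128

# Set the variable before PyTorch sets up its CUDA allocator, which in
# practice means before `import torch` appears in the same process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:%d" % MAX_SPLIT_SIZE_MB

print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

This option targets fragmentation only. In the message above just 30.31 MiB of the GPU is free while PyTorch has reserved nothing, which suggests other processes are holding the memory, so the setting alone may not help in that case.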

Second error

RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

This is simply the GPU running out of memory. Free up more GPU memory before running again.

Failed to initialize NVML: Driver/library version mismatch

(Ubuntu) Resolving Failed to initialize NVML: Driver/library version mismatch

Running nvidia-smi may print the following error.

Failed to initialize NVML: Driver/library version mismatch

If everything had been working and this message suddenly appears one day without any changes on your part, suspect an automatic update.
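The message generally means that the NVIDIA kernel module currently loaded and the user-space driver libraries on disk come from different driver versions, which an unattended package update can cause. The Python sketch below shows one way to compare the two. The sample text only approximates what /proc/driver/nvidia/version may contain and is an assumption, as are the function names. The library version would have to be obtained separately, for example from the installed package version.

```python
import re


def kernel_module_version(proc_text):
    """Extract the driver version from /proc/driver/nvidia/version style text.

    Assumes a line of the form '... Kernel Module  <version> ...'.
    Returns None when no version is found.
    """
    match = re.search(r"Kernel Module\s+(\d+(?:\.\d+)+)", proc_text)
    return match.group(1) if match else None


def versions_match(kernel_version, library_version):
    """True only when both versions are known and identical."""
    return kernel_version is not None and kernel_version == library_version


# Assumed sample text; on a real machine read /proc/driver/nvidia/version.
sample = "NVRM version: NVIDIA UNIX x86_64 Kernel Module  470.57.02"
loaded = kernel_module_version(sample)

# Hypothetical case: the libraries were updated to 495.29.05 on disk.
print(loaded, versions_match(loaded, "495.29.05"))
```

When the versions differ, rebooting (or reloading the NVIDIA kernel modules) normally brings the loaded module back in line with the installed libraries.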

cudaErrorInitializationError: initialization error

If this error occurs when using multiprocessing in Python, try adding the following code before running.

import multiprocessing

multiprocessing.set_start_method("spawn", force=True)
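The likely reason this helps is that a forked child inherits the parent's already initialized CUDA state, which CUDA does not support, whereas a spawned child starts from a fresh interpreter. As an alternative to changing the global start method, the sketch below uses a spawn context for a single pool. The `square` worker is a hypothetical stand-in for real GPU work.

```python
import multiprocessing


def square(x):
    # Stand-in for work that would initialize and use CUDA in the child.
    return x * x


def run_with_spawn(values, processes=2):
    """Map `square` over values in a pool whose workers are spawned, not forked."""
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=processes) as pool:
        return pool.map(square, values)
```

Call `run_with_spawn([1, 2, 3])` from under an `if __name__ == "__main__":` guard in a script file, since spawned workers re-import the main module and must be able to find `square` there.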