Failed to initialize NVML: Unknown Error
Error Management 2024. 10. 7. 22:18
[Issue]
The GPUs were detected fine on the host machine, but inside a docker container I got these errors: "Failed to initialize NVML: Unknown Error" and "Failed to detect NVIDIA driver version."
[host machine]
- NVIDIA GeForce RTX 4090 x4
- nvidia-driver-560 (recommended) installed
- cuda 12.6
[docker container]
- nvcr.io/nvidia/pytorch:22.11-py3
- pytorch 1.13
- cuda 11.8
[Solution]
1. Check docker info. If Cgroup Driver says systemd, that is the problem.
docker info
Client: Docker Engine - Community
 Version:    27.3.1
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.17.1
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.7
    Path:     /usr/libexec/docker/cli-plugins/docker-compose

Server:
 Containers: 1
  Running: 0
  Paused: 0
  Stopped: 1
 Images: 2
 Server Version: 27.3.1
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 nvidia runc
 Default Runtime: nvidia
 Init Binary: docker-init
 containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
 runc version: v1.1.14-0-g2c9f560
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 6.8.0-45-generic
 Operating System: Ubuntu 24.04.1 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 64
 Total Memory: 251.6GiB
 Name: julia
 ID: d2e7a1f5-d280-4e3b-8870-e7da3a081c0c
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false
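Instead of reading through the whole output, the one relevant field can be extracted directly. The following is a minimal sketch, assuming a POSIX shell; the helper name `cgroup_driver_from_info` is hypothetical (not part of Docker) and it only parses text in the `docker info` layout shown above.

```shell
# Hypothetical helper: read `docker info` text on stdin and
# print only the value of the "Cgroup Driver:" line.
cgroup_driver_from_info() {
    sed -n 's/^[[:space:]]*Cgroup Driver:[[:space:]]*//p' | head -n 1
}

# Usage (needs a running Docker daemon):
#   docker info 2>/dev/null | cgroup_driver_from_info
# Docker can also print the field itself through its Go-template formatter:
#   docker info --format '{{.CgroupDriver}}'
```

If this prints systemd, the steps below apply.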
2. To switch from systemd to cgroupfs, edit two files.
sudo vi /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "exec-opts": ["native.cgroupdriver=cgroupfs"],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
sudo vi /etc/nvidia-container-runtime/config.toml
no-cgroups = false # true
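Before restarting, it may be worth confirming that the edit actually landed, since a syntax error in daemon.json keeps the Docker daemon from starting. A small sketch, assuming a POSIX shell; `uses_cgroupfs` is a hypothetical helper name and it only does a text match, not a full JSON parse.

```shell
# Hypothetical helper: succeed if the given daemon.json file
# contains the cgroupfs exec-opt added in step 2.
uses_cgroupfs() {
    grep -q 'native\.cgroupdriver=cgroupfs' "$1"
}

# Usage:
#   uses_cgroupfs /etc/docker/daemon.json && echo "cgroupfs configured"
# Optional JSON syntax check (assumes python3 is installed on the host):
#   python3 -m json.tool /etc/docker/daemon.json
```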
3. Restart docker.
sudo systemctl restart docker
4. Run docker info again and Cgroup Driver now shows cgroupfs. Newly created docker containers no longer produce the errors above, and the GPU devices are detected correctly.
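The final check can also be scripted. This is only a sketch, assuming a POSIX shell, the image used in this post, and that `--gpus all` suits your setup; `nvml_ok` is a hypothetical helper that just looks for the error string from this post in whatever output it is given.

```shell
# Hypothetical helper: read command output on stdin and
# fail if it contains the NVML initialization error.
nvml_ok() {
    ! grep -q 'Failed to initialize NVML'
}

# Usage (needs Docker and the NVIDIA container runtime):
#   docker run --rm --gpus all nvcr.io/nvidia/pytorch:22.11-py3 nvidia-smi 2>&1 \
#       | nvml_ok && echo "GPU visible in container"
```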