  • Failed to initialize NVML: Unknown Error
    Error Management 2024. 10. 7. 22:18

     

    [Issue]

    The GPUs were detected fine on the host machine, but inside the docker container I got these errors: "Failed to initialize NVML: Unknown Error" and "Failed to detect NVIDIA driver version."

     

    [host machine]

    - NVIDIA GeForce RTX 4090 x4

    - nvidia-driver-560 (recommended) installed

    - cuda 12.6

     

    [docker container]

    - nvcr.io/nvidia/pytorch:22.11-py3

    - pytorch 1.13

    - cuda 11.8

     

    [Solution]

    1. Check docker info. If Cgroup Driver says systemd, that is the problem.

    docker info
    Client: Docker Engine - Community
     Version:    27.3.1
     Context:    default
     Debug Mode: false
     Plugins:
      buildx: Docker Buildx (Docker Inc.)
        Version:  v0.17.1
        Path:     /usr/libexec/docker/cli-plugins/docker-buildx
      compose: Docker Compose (Docker Inc.)
        Version:  v2.29.7
        Path:     /usr/libexec/docker/cli-plugins/docker-compose
    
    Server:
     Containers: 1
      Running: 0
      Paused: 0
      Stopped: 1
     Images: 2
     Server Version: 27.3.1
     Storage Driver: overlay2
      Backing Filesystem: extfs
      Supports d_type: true
      Using metacopy: false
      Native Overlay Diff: true
      userxattr: false
     Logging Driver: json-file
     Cgroup Driver: systemd
     Cgroup Version: 2
     Plugins:
      Volume: local
      Network: bridge host ipvlan macvlan null overlay
      Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
     Swarm: inactive
     Runtimes: io.containerd.runc.v2 nvidia runc
     Default Runtime: nvidia
     Init Binary: docker-init
     containerd version: 7f7fdf5fed64eb6a7caf99b3e12efcf9d60e311c
     runc version: v1.1.14-0-g2c9f560
     init version: de40ad0
     Security Options:
      apparmor
      seccomp
       Profile: builtin
      cgroupns
     Kernel Version: 6.8.0-45-generic
     Operating System: Ubuntu 24.04.1 LTS
     OSType: linux
     Architecture: x86_64
     CPUs: 64
     Total Memory: 251.6GiB
     Name: julia
     ID: d2e7a1f5-d280-4e3b-8870-e7da3a081c0c
     Docker Root Dir: /var/lib/docker
     Debug Mode: false
     Experimental: false
     Insecure Registries:
      127.0.0.0/8
     Live Restore Enabled: false
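
The docker info output is long, so here is a minimal sketch that pulls out only the Cgroup Driver line. The function name cgroup_driver_from_info is hypothetical (it is not part of docker); it just reads the text output shown above on stdin.

```shell
#!/usr/bin/env bash
# Sketch: print only the "Cgroup Driver" value from `docker info` text.
# cgroup_driver_from_info is a hypothetical helper name, not a docker command.
cgroup_driver_from_info() {
  # Split each line on ": " and print the value of the first matching line.
  awk -F': ' '/^[[:space:]]*Cgroup Driver:/ { print $2; exit }'
}

# Usage (on the machine that runs docker):
#   docker info 2>/dev/null | cgroup_driver_from_info
# Docker can also print the field directly:
#   docker info --format '{{.CgroupDriver}}'
```

If this prints systemd, the setup matches the situation described in this post.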

     

    2. To switch from systemd to cgroupfs, edit two files.

    sudo vi /etc/docker/daemon.json
    {
        "default-runtime": "nvidia",
        "exec-opts": [
            "native.cgroupdriver=cgroupfs"
        ],
        "runtimes": {
            "nvidia": {
                "args": [],
                "path": "nvidia-container-runtime"
            }
        }
    }
    sudo vi /etc/nvidia-container-runtime/config.toml
    no-cgroups = false # true
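
The two edits above can also be made without opening an editor. This is only a rough sketch: the function names write_daemon_json and set_no_cgroups_false are hypothetical, it assumes GNU sed (as on Ubuntu), and it assumes daemon.json holds nothing besides the keys shown above, because the file is overwritten (a .bak copy is kept).

```shell
#!/usr/bin/env bash
# Sketch: apply the two edits from step 2 non-interactively.
# Both function names are hypothetical helpers; paths are passed as arguments.

# Overwrite the given daemon.json with the content from this post.
# WARNING: any other existing settings are lost (a .bak copy is kept).
write_daemon_json() {
  local path="$1"
  if [ -f "$path" ]; then
    cp "$path" "$path.bak"
  fi
  cat > "$path" <<'EOF'
{
    "default-runtime": "nvidia",
    "exec-opts": [
        "native.cgroupdriver=cgroupfs"
    ],
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    }
}
EOF
}

# Set no-cgroups = false in the given config.toml, whether the existing
# line is commented out or set to true. Fails if no such line exists,
# since appending it blindly could land in the wrong TOML section.
set_no_cgroups_false() {
  local path="$1"
  local pattern='^[[:space:]]*#?[[:space:]]*no-cgroups[[:space:]]*='
  if ! grep -Eq "$pattern" "$path"; then
    echo "no-cgroups line not found in $path" >&2
    return 1
  fi
  # GNU sed: -i edits in place, -E enables extended regular expressions.
  sed -i -E "s/${pattern}.*/no-cgroups = false/" "$path"
}

# Usage (as root):
#   write_daemon_json /etc/docker/daemon.json
#   set_no_cgroups_false /etc/nvidia-container-runtime/config.toml
```

If daemon.json already contains other settings, editing it by hand as shown above is the safer route.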

     

    3. Restart docker.

    sudo systemctl restart docker

     

    4. Now run docker info again and Cgroup Driver should show cgroupfs. After this, newly created docker containers no longer hit the error above and the GPU devices are detected properly.
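
The check in step 4 could be scripted roughly like this. It is a sketch under a few assumptions: verify_gpu_fix is a hypothetical name, the --gpus all flag is my guess at how the container is started (the post does not show the docker run command), and the default image is the one listed under [docker container].

```shell
#!/usr/bin/env bash
# Sketch: confirm that the cgroup driver changed and GPUs are visible
# inside a container. verify_gpu_fix is a hypothetical helper name.
verify_gpu_fix() {
  local image="${1:-nvcr.io/nvidia/pytorch:22.11-py3}"
  local driver

  # Ask docker for the cgroup driver field only.
  driver="$(docker info --format '{{.CgroupDriver}}')" || return 1
  if [ "$driver" != "cgroupfs" ]; then
    echo "Cgroup Driver is still '$driver'" >&2
    return 1
  fi

  # nvidia-smi -L prints one line per GPU and exits non-zero when NVML
  # cannot be initialized. --gpus all is an assumption (see above).
  if ! docker run --rm --gpus all "$image" nvidia-smi -L; then
    echo "GPUs are not visible inside the container" >&2
    return 1
  fi
}

# Usage:
#   verify_gpu_fix && echo "fix confirmed"
```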
