Failed to initialize NVML: Unknown Error in Docker after a few hours

I am having an interesting and weird issue.

When I start a Docker container with GPU support, it works fine and I can see all the GPUs inside the container. However, a few hours or a few days later, I can no longer use the GPUs in Docker.

When I run nvidia-smi inside the container, I see this message:

"Failed to initialize NVML: Unknown Error"

However, on the host machine, nvidia-smi shows all the GPUs. Also, when I restart the container, everything works fine again and all GPUs show up.

My inference container needs to stay up all the time and run inference depending on server requests. Does anyone have the same issue or a solution for this problem?

Transpacific answered 11/7, 2022 at 1:28 Comment(1)
Adding --privileged to the command-line options of docker run helped me. – Gadoid
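
For reference, a hedged sketch of that suggestion; the plain ubuntu image mirrors the test command used elsewhere in this thread, with nvidia-smi injected by the NVIDIA container runtime:

docker run --rm --gpus all --privileged ubuntu nvidia-smi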

I selected Method 1 of my findings below, which is:

  1. Edit /etc/nvidia-container-runtime/config.toml (sudo vim /etc/nvidia-container-runtime/config.toml), change no-cgroups to false, and save; see the snippet below the steps.

  2. Restart the Docker daemon: sudo systemctl restart docker, then test by running sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi
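
For reference, a minimal sketch of what the edited file should contain; the other keys in config.toml vary by nvidia-container-toolkit version, so only the relevant section is shown:

# /etc/nvidia-container-runtime/config.toml
[nvidia-container-cli]
# ... other keys unchanged ...
no-cgroups = false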

Based on:

  1. https://bobcares.com/blog/docker-failed-to-initialize-nvml-unknown-error/
  2. https://bbs.archlinux.org/viewtopic.php?id=266915
Panto answered 10/3, 2024 at 22:22 Comment(3)
I got into this state by configuring the rootless version of the NVIDIA Container Toolkit when I should have used the normal one. After I changed the config to the normal one, it would still fail. This solved the problem. – Chincapin
Thanks. Confirming Method 1 (no-cgroups = false) works; the service still has GPU access after 42 hrs! For those looking to set up services with GPU access in Swarm, check out: gist.github.com/coltonbh/374c415517dbeb4a6aa92f462b9eb287 – Roundshouldered
Update: welp... I am still getting the "Failed to initialize NVML" error. I resorted to adding nvidia-smi as part of a healthcheck. – Roundshouldered

There is a workaround that I tried and found to work. Please check this link if you need the full details: https://github.com/NVIDIA/nvidia-docker/issues/1730

I summarize the cause of the problem and elaborate on a solution here for your convenience.

Cause:
The host performs daemon-reload (or a similar activity). If the container uses systemd to manage cgroups, daemon-reload "triggers reloading any Unit files that have references to NVIDIA GPUs." Then, your container loses access to the reloaded GPU device references.

How to check if your problem is caused by the issue:
When your container still has GPU access, open a "host" terminal and run

sudo systemctl daemon-reload

Then, go back to your container. If nvidia-smi in the container fails right away, this is your issue and you can use the workarounds below.
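
Condensed into commands, the check might look like this; the container name gpu_container is only an example:

# on the host: trigger the reload that reproduces the failure
sudo systemctl daemon-reload

# then check from the host whether the container just lost GPU access
docker exec gpu_container nvidia-smi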

Workarounds:
Although I saw in one discussion that NVIDIA planned to release a formal fix in mid-June, as of July 8, 2023, I have not seen it yet. So, this should still be useful for you, especially if you just can't update your container stack.

The easiest workaround is to switch Docker away from the systemd cgroup driver to cgroupfs through daemon.json. If that change does not hurt your setup, here are the steps. Everything is done on the host system.

sudo nano /etc/docker/daemon.json 

Then, within the file, add this parameter setting.

"exec-opts": ["native.cgroupdriver=cgroupfs"] 

Do not forget the comma before this parameter setting; JSON requires it between entries, which is easy to miss. Here is an example of the edited file from my machine.

{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}

As the last step, restart the Docker service on the host.

sudo service docker restart
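
To verify that the change took effect, something like the following should work; the test container mirrors the command used elsewhere in this thread:

# confirm Docker now reports the cgroupfs driver
docker info | grep -i "cgroup driver"

# run a throwaway GPU container; nvidia-smi should initialize NVML again
sudo docker run --rm --runtime=nvidia --gpus all ubuntu nvidia-smi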

Note: if your container runs its own NVIDIA driver, the above steps will not work, but the reference link has more detail on dealing with that case. I elaborate only on the simple solution that I expect many people will find useful.

V2 answered 9/7, 2023 at 10:17 Comment(0)

I had the same error. I used Docker's healthcheck as a temporary solution: when nvidia-smi fails, the container is marked unhealthy and restarted by willfarrell/autoheal.

Docker Compose version:

services:
  gpu_container:
    ...
    healthcheck:
      test: ["CMD-SHELL", "test -s `which nvidia-smi` && nvidia-smi || exit 1"]
      start_period: 1s
      interval: 20s
      timeout: 5s
      retries: 2
    labels:
      - autoheal=true
      - autoheal.stop.timeout=1
    restart: always
  autoheal:
    image: willfarrell/autoheal
    environment:
      - AUTOHEAL_CONTAINER_LABEL=all
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    restart: always

Dockerfile Version:

# autoheal reads these labels; HEALTHCHECK itself only accepts
# --interval, --timeout, --start-period and --retries
LABEL autoheal=true
LABEL autoheal.stop.timeout=1

HEALTHCHECK \
    --start-period=60s \
    --interval=20s \
    --timeout=10s \
    --retries=2 \
    CMD nvidia-smi || exit 1

With the autoheal daemon:

docker run -d \
    --name autoheal \
    --restart=always \
    -e AUTOHEAL_CONTAINER_LABEL=all \
    -v /var/run/docker.sock:/var/run/docker.sock \
    willfarrell/autoheal
Goose answered 13/9, 2022 at 14:5 Comment(1)
Disabling cgroups as proposed in other workarounds didn't work for me. This did the trick. – Unbounded

I had the same weird issue. According to your description, it is most likely related to this issue on the official nvidia-docker repo:

https://github.com/NVIDIA/nvidia-docker/issues/1618

I plan to try the solution mentioned in the related thread, which suggests upgrading the cgroup version on the host machine from v1 to v2.

P.S.: We have verified this solution in our production environment and it really works! Unfortunately, it needs at least Linux kernel 4.5. If upgrading the kernel is not possible, the method mentioned by sih4sing5hog5 (the Docker healthcheck approach above) could also be a workaround. A sketch of the cgroup v2 switch follows.
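
As a rough sketch, assuming a systemd host that boots with GRUB (e.g., Ubuntu); paths and the exact workflow vary by distribution:

# check which cgroup version the host currently uses
# prints "cgroup2fs" for v2, "tmpfs" for a v1/hybrid setup
stat -fc %T /sys/fs/cgroup/

# to enable the unified cgroup v2 hierarchy, append this kernel parameter
# to GRUB_CMDLINE_LINUX in /etc/default/grub:
#   systemd.unified_cgroup_hierarchy=1
sudo update-grub
sudo reboot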

Australia answered 13/10, 2022 at 4:3 Comment(1)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Guillen

I faced the same error without any changes to my container, just after starting it anew. Simply restarting the container again solved the problem.

Moral: before going deeper, try the simplest solution first.

Erechtheum answered 3/1, 2024 at 8:10 Comment(0)

Slightly different, but for other people who might stumble upon this.

For me, the GPUs were unavailable right after starting the Docker container with nvidia-docker; nvidia-smi only showed Failed to initialize NVML: Unknown Error.

After some hours of looking for a solution, I stumbled upon the similar error Failed to initialize NVML: Driver/library version mismatch, and one suggestion was to simply reboot the host machine. I did that, and it now works.

This happened after I had upgraded both Ubuntu 20 -> 22 and Docker 19 -> 20, along with the NVIDIA driver (525.116.04).
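
If you suspect this kind of mismatch before rebooting, a rough check, assuming the NVIDIA kernel module is installed for the running kernel, is to compare the loaded module version with the one on disk:

# driver version currently loaded in the kernel
cat /proc/driver/nvidia/version

# driver version installed on disk for the running kernel
modinfo nvidia | grep ^version

If the two differ, the loaded module is stale from the upgrade, and rebooting the host (or reloading the nvidia modules) brings them back in sync.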

Ichthyic answered 12/7, 2023 at 12:15 Comment(0)

In my experience, when the container can no longer run nvidia-smi, restarting the container helps.

Sexology answered 12/3, 2024 at 2:19 Comment(0)

I had the same issue. I just ran watch -n 1 nvidia-smi in a screen session inside the container, and now it keeps working continuously.

Anglophobia answered 21/8, 2022 at 17:58 Comment(0)
