There is a workaround that I tried and found to work.
Please check this link if you need the full details:
https://github.com/NVIDIA/nvidia-docker/issues/1730
I summarize the cause of the problem and elaborate on a solution here for your convenience.
Cause:
The host performs a daemon-reload (or a similar activity). If the container uses systemd to manage cgroups, daemon-reload "triggers reloading any Unit files that have references to NVIDIA GPUs." Your container then loses access to the reloaded GPU references.
How to check if your problem is caused by the issue:
While your container still has GPU access, open a terminal on the host and run
sudo systemctl daemon-reload
Then, go back to your container. If nvidia-smi in the container fails right away, your problem is caused by this issue and you can go on to the workarounds.
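If you want to script the check inside the container, here is a minimal Python sketch. It assumes that nvidia-smi exits with a nonzero status when it cannot reach the GPU; the function name `gpu_visible` is my own and not part of any NVIDIA tool.

```python
import subprocess


def gpu_visible(cmd=("nvidia-smi",)):
    """Return True if the given command runs and exits with status 0.

    Assumption: nvidia-smi returns a nonzero exit status when the
    container has lost access to the GPU.
    """
    try:
        result = subprocess.run(list(cmd), capture_output=True, text=True, timeout=30)
    except (FileNotFoundError, subprocess.TimeoutExpired):
        # The tool is missing or hung; treat both as "GPU not visible".
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("GPU visible:", gpu_visible())
```

Run it in the container once before and once after the daemon-reload on the host. If the result flips from True to False, you are likely hitting this issue.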
Workarounds:
Although I saw in one discussion that NVIDIA planned to release a formal fix in mid-June, as of July 8, 2023, I have not seen it yet. So this should still be useful for you, especially if you cannot update your container stack.
The easiest way is to stop Docker from using systemd to manage cgroups by switching it to the cgroupfs driver in Docker's daemon.json. If that does not hurt you, here are the steps. Everything is done on the host system.
sudo nano /etc/docker/daemon.json
Then, within the file, add this parameter setting.
"exec-opts": ["native.cgroupdriver=cgroupfs"]
Do not forget to add a comma before this parameter setting when it follows another entry. This is standard JSON syntax, but some may not be familiar with it. Here is the edited file from my machine as an example.
{
    "runtimes": {
        "nvidia": {
            "args": [],
            "path": "nvidia-container-runtime"
        }
    },
    "exec-opts": ["native.cgroupdriver=cgroupfs"]
}
As the last step, restart the Docker service on the host.
sudo service docker restart
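If you would rather not edit the JSON by hand (and risk a missing comma), the daemon.json change can be sketched in Python. This is only an illustration: the helper name `with_cgroupfs_driver` is my own, and it works on the file's text, so you still need to read and write /etc/docker/daemon.json yourself with root permissions and restart Docker afterwards.

```python
import json


def with_cgroupfs_driver(daemon_json_text):
    """Return daemon.json text with native.cgroupdriver set to cgroupfs.

    Existing keys (such as "runtimes") are preserved. An empty input is
    treated as an empty configuration.
    """
    config = json.loads(daemon_json_text) if daemon_json_text.strip() else {}
    # Keep unrelated exec-opts, but drop any previous cgroup driver choice.
    opts = [
        opt for opt in config.get("exec-opts", [])
        if not opt.startswith("native.cgroupdriver=")
    ]
    opts.append("native.cgroupdriver=cgroupfs")
    config["exec-opts"] = opts
    return json.dumps(config, indent=4) + "\n"


if __name__ == "__main__":
    example = '{"runtimes": {"nvidia": {"args": [], "path": "nvidia-container-runtime"}}}'
    print(with_cgroupfs_driver(example))
```

Because the result goes through json.dumps, the commas and brackets are always valid, which avoids the most common hand-editing mistake.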
Note: if your container runs its own NVIDIA driver, the above steps will not work, but the reference link has more detail on dealing with that case. I elaborate only on a simple solution that I expect many people will find useful.
Adding --privileged to the command-line options of docker run helped me. – Gadoid