How to unload an NVIDIA kernel module 'nvidia' for new driver installation?

I needed to upgrade my NVIDIA driver, so I tried running the NVIDIA-Linux-x86_64.run file.

However, I was seeing the following message:

ERROR: An NVIDIA kernel module 'nvidia' appears to already be loaded in your kernel.  This may be because it is in use (for example, by an X server, a CUDA program, or the NVIDIA Persistence Daemon), but this may also happen if your kernel was configured without support for module unloading.  Please be sure to exit any programs that may be using the GPU(s) before attempting to upgrade your driver.  If no GPU-based programs are running, you know that your kernel supports module unloading, and you still receive this message, then an error may have occured that has corrupted an NVIDIA kernel module's usage count, for which the simplest remedy is to reboot your computer.

I have already unloaded nvidia-drm, but when I tried to unload nvidia:

$ sudo modprobe -r nvidia
modprobe: FATAL: Module nvidia is in use.

Can anyone guide me on installing this new driver without any issues?

Thanks

Humanoid asked 1/5, 2020 at 13:16 Comment(1)
Hi Brandon Lee. Your question sounds like a system administration question rather than a programming question. You might have more luck asking at superuser.com or, if this is truly an Ubuntu-specific question, at askubuntu.com. – Pix

I just removed the existing driver and reinstalled it.
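
The exact commands depend on how the old driver was installed; a rough sketch (the package pattern and installer file name below are only placeholders for whatever applies on your system):

# if the old driver came from a previous .run installer
$ sudo nvidia-uninstall

# or, if it came from the distribution's packages (Ubuntu/Debian example)
$ sudo apt purge 'nvidia-*'

# then reboot and run the new installer from a text console
$ sudo sh NVIDIA-Linux-x86_64.run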

Humanoid answered 7/5, 2020 at 23:14 Comment(1)
Seems to be the best solution. Worked for me too while upgrading the NVIDIA driver. – Experimental

Use lsof /dev/nvidia* to find the processes that are using the old driver. In my case it was "nvidia-persistenced". Just kill the process by PID and retry the installer NVIDIA-***.run:

# lsof /dev/nvidia*
COMMAND    PID                USER   FD   TYPE  DEVICE SIZE/OFF NODE NAME
nvidia-pe 1334 nvidia-persistenced    2u   CHR 195,255      0t0  420 /dev/nvidiactl
nvidia-pe 1334 nvidia-persistenced    3u   CHR   195,0      0t0  421 /dev/nvidia0
nvidia-pe 1334 nvidia-persistenced    5u   CHR   195,0      0t0  421 /dev/nvidia0
nvidia-pe 1334 nvidia-persistenced    6u   CHR   195,0      0t0  421 /dev/nvidia0
nvidia-pe 1334 nvidia-persistenced    7u   CHR   195,0      0t0  421 /dev/nvidia0
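
With the output above, for example, something along these lines should free the devices before retrying (1334 is the PID from the lsof listing; the .run file name is a placeholder for whatever installer you downloaded):

$ sudo kill 1334
$ sudo sh NVIDIA-Linux-x86_64.run
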
Annihilator answered 23/9, 2020 at 17:49 Comment(1)
Might need sudo preceding lsof. – Cranwell

I wrote a Python script for this:

unload_all_nvidia_modules.py

from subprocess import run, getoutput
from shlex import split
import re

def get_all_nvidia_modules():
    # Parse `lsmod` output and collect every module whose name or
    # "Used by" list mentions "nvidia".
    all_modules = getoutput("lsmod").splitlines()
    modules_to_unload = set()
    for m in all_modules:
        m = m.strip()
        m_splitted = re.split(r"\s+", m)

        module_name = m_splitted[0]
        if len(m_splitted) == 4:
            deps = m_splitted[-1].split(",")
        else:
            deps = []

        if "nvidia" in module_name or any("nvidia" in d for d in deps):
            modules_to_unload.add(module_name)
            for d in deps:
                modules_to_unload.add(d)

    return modules_to_unload

def get_usage_pids(pattern):
    # Return the PIDs of processes whose open files (as listed by `lsof`)
    # match the given pattern.
    all_files = getoutput("lsof").splitlines()
    pids = set()
    for f in all_files:
        if pattern in f:
            # the second column of lsof output is the PID
            pid = re.split(r"\s+", f.strip())[1]
            pids.add(pid)

    return pids



def unload_all_nvidia_modules():
    # rmmod fails while a module still has users, so keep killing the
    # processes that use each module and retrying (at most 100 passes).
    cnt = 100
    while cnt > 0:
        cnt -= 1
        modules = get_all_nvidia_modules()

        if len(modules) == 0:
            break

        for m in modules:
            pids = get_usage_pids(m)
            for pid in pids:
                # force-kill each process that still holds the module open
                run(split(f"kill -9 {pid}"))

            run(split(f"rmmod {m}"))

if __name__ == "__main__":
    unload_all_nvidia_modules()

usage:
sudo python3 unload_all_nvidia_modules.py

WARNING:

Use this script at your own risk. Save your work and all open documents, because this script will kill all processes that use the NVIDIA driver (GUI programs, for example).

Domineering answered 23/2, 2023 at 17:37 Comment(2)
Killing a bunch of processes without giving the user a chance to review and confirm is a recipe for lost work, in case one of the processes is the user's X session or anything equivalent. Similarly, before sending SIGKILL, one should send SIGTERM, wait a few seconds (for example, the systemd default of 90 seconds), and only then send SIGKILL to any processes that failed to shut down gracefully. Sending SIGKILL immediately increases the chance of lost work or data corruption for any applications that aren't careful enough about their data. – Imbibe
@SimonRuggier Thanks for the comment. I'm aware of what you wrote. I created this script for my own use cases and shared it here in case someone finds it useful. I use it on headless machines, so there are no X processes. If you want to send SIGTERM first and wait, go for it. – Domineering
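
For anyone who does want the graceful variant described in the comments above, a minimal shell sketch (assuming $pid holds a PID found by the script; the 90-second wait mirrors the systemd default mentioned in the comment):

kill -TERM "$pid"                                # ask the process to exit cleanly
sleep 90                                         # give it time to shut down
kill -0 "$pid" 2>/dev/null && kill -KILL "$pid"  # force-kill only if it is still running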
