GPU RAM occupied but no PIDs
nvidia-smi shows the following: 3771 MiB is in use on GPU 0, but no processes are listed for GPU 0:

(base) ~/.../fast-autoaugment$ nvidia-smi
Fri Dec 20 13:48:12 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  TITAN Xp            Off  | 00000000:03:00.0 Off |                  N/A |
| 23%   34C    P8     9W / 250W |   3771MiB / 12196MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN Xp            Off  | 00000000:84:00.0  On |                  N/A |
| 38%   62C    P8    24W / 250W |   2295MiB / 12188MiB |      8%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    1      1910      G   /usr/lib/xorg/Xorg                           105MiB |
|    1      2027      G   /usr/bin/gnome-shell                          51MiB |
|    1      3086      G   /usr/lib/xorg/Xorg                          1270MiB |
|    1      3237      G   /usr/bin/gnome-shell                         412MiB |
|    1     30593      G   /proc/self/exe                               286MiB |
|    1     31849      G   ...quest-channel-token=4371017438329004833   164MiB |
+-----------------------------------------------------------------------------+

Similarly, nvtop shows the same GPU RAM usage. The processes it lists have TYPE=Compute, but trying to kill any of the PIDs it shows fails with an error:

(base) ~/.../fast-autoaugment$ kill 27761
bash: kill: (27761) - No such process

How can I reclaim GPU RAM occupied by these apparent ghost processes?

Retroversion answered 20/12, 2019 at 21:59 Comment(0)
R
26

Use the following command to find the ghost processes occupying GPU RAM:

sudo fuser -v /dev/nvidia*

In my case, output is:

(base) ~/.../fast-autoaugment$ sudo fuser -v /dev/nvidia*
                     USER        PID ACCESS COMMAND
/dev/nvidia0:        shitals     517 F.... nvtop
                     root       1910 F...m Xorg
                     gdm        2027 F.... gnome-shell
                     root       3086 F...m Xorg
                     shitals    3237 F.... gnome-shell
                     shitals   27808 F...m python
                     shitals   27809 F...m python
                     shitals   27813 F...m python
                     shitals   27814 F...m python
                     shitals   28091 F...m python
                     shitals   28092 F...m python
                     shitals   28096 F...m python

This lists processes that both nvidia-smi and nvtop fail to show. After I killed all of the python processes, the GPU RAM was freed.
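
The table above can also be turned into records for filtering by user or command. The following is a minimal sketch, not part of the original answer: `DeviceUser` and `parse_fuser_table` are hypothetical names, and the parser assumes it receives the full table text exactly as printed above. fuser writes the PIDs to stdout and the rest of the table to stderr, so capturing that text from a script would need both streams merged, for example with `stderr=subprocess.STDOUT`.

```python
import re
from typing import List, NamedTuple


class DeviceUser(NamedTuple):
    """One row of the `fuser -v` table (hypothetical helper type)."""
    device: str
    user: str
    pid: int
    access: str
    command: str


# Optional "device:" prefix, then USER, PID, ACCESS and COMMAND columns.
_ROW = re.compile(r'^(?:(\S+):)?\s+(\S+)\s+(\d+)\s+(\S+)\s+(.*)$')


def parse_fuser_table(text: str) -> List[DeviceUser]:
    """Parse the table printed by `fuser -v` into a list of rows."""
    rows = []
    device = ''
    for line in text.splitlines():
        match = _ROW.match(line)
        if match is None:
            continue  # header line, blank line, or a row without a numeric PID
        if match.group(1):
            device = match.group(1)  # later rows belong to the same device
        rows.append(DeviceUser(device, match.group(2), int(match.group(3)),
                               match.group(4), match.group(5).strip()))
    return rows
```

For the output shown above, `[r.pid for r in parse_fuser_table(text) if r.command == 'python']` would give the PIDs of the python processes holding the device.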

Another option is to reset the GPU. Note that nvidia-smi generally refuses a reset while any application is still using the device:

sudo nvidia-smi --gpu-reset -i 0
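
To check whether the memory was actually released, the per-GPU usage can be read before and after either step. This is a sketch under stated assumptions rather than part of the original answer: `parse_memory_csv` and `gpu_memory_usage` are hypothetical names, and the values are assumed to be plain integers in MiB, which is what `--format=csv,noheader,nounits` normally prints.

```python
import subprocess
from typing import Dict, Tuple

# Ask nvidia-smi for one CSV line per GPU: index, used MiB, total MiB.
QUERY = ['nvidia-smi',
         '--query-gpu=index,memory.used,memory.total',
         '--format=csv,noheader,nounits']


def parse_memory_csv(text: str) -> Dict[int, Tuple[int, int]]:
    """Map GPU index to (used MiB, total MiB) from the CSV text."""
    usage = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(',')]
        # Skip blank lines and rows with non-numeric values such as "[N/A]".
        if len(fields) != 3 or not all(f.isdigit() for f in fields):
            continue
        index, used, total = (int(f) for f in fields)
        usage[index] = (used, total)
    return usage


def gpu_memory_usage() -> Dict[int, Tuple[int, int]]:
    """Run nvidia-smi and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, stdout=subprocess.PIPE, check=True)
    return parse_memory_csv(result.stdout.decode('utf-8'))
```

Calling `gpu_memory_usage()` before and after killing the processes should show the used value for GPU 0 dropping if the memory was reclaimed.
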
Retroversion answered 20/12, 2019 at 21:59 Comment(0)
M
1

Here is a way to do this programmatically in Python if all the processes you want to kill share a common substring in their names. Use caution, and add sudo if needed.

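import subprocess

# Substring identifying the processes to kill. 'python' matches every
# Python process that holds the device, so choose it carefully.
process_name_substring = 'python'

# fuser prints the PIDs on stdout; the rest of the -v table goes to stderr.
result = subprocess.run(['fuser', '-v', '/dev/nvidia0'], stdout=subprocess.PIPE)

# Decode the bytes before splitting: str(bytes) would leave a quote stuck
# to the last PID, so that PID would be silently skipped.
process_ids = [int(i) for i in result.stdout.decode('utf-8').split() if i.isdigit()]

for process_id in process_ids:
    # 'comm=' prints only the command name, without a header line.
    pid_info = subprocess.run(['ps', '-p', str(process_id), '-o', 'comm='], stdout=subprocess.PIPE)

    if process_name_substring in pid_info.stdout.decode('utf-8'):
        subprocess.run(['kill', '-9', str(process_id)])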
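
A gentler variant is to send SIGTERM first and fall back to SIGKILL only if the process does not exit, which gives it a chance to release the GPU cleanly. The sketch below is an illustration, not part of the original answer: `stop_process` and its `grace_seconds` parameter are hypothetical, and a `PermissionError` is left to propagate when the process belongs to another user.

```python
import os
import signal
import time


def stop_process(pid: int, grace_seconds: float = 5.0) -> str:
    """Send SIGTERM, wait up to grace_seconds, then send SIGKILL."""
    try:
        os.kill(pid, signal.SIGTERM)
    except ProcessLookupError:
        return 'not found'  # same case as "kill: No such process"

    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 only checks that the PID exists
        except ProcessLookupError:
            return 'terminated'
        time.sleep(0.1)

    try:
        os.kill(pid, signal.SIGKILL)
    except ProcessLookupError:
        return 'terminated'  # it exited just before the SIGKILL
    return 'killed'
```

In the loop above, `stop_process(process_id)` could take the place of the `kill -9` call.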

Mark answered 1/12, 2023 at 4:2 Comment(0)
I
1

As Shital Shah mentioned, you can use the fuser command to inspect which zombie processes are running on your GPUs:

sudo fuser -v /dev/nvidia*

If you do not want to reset your GPU because other processes are running on it, I've used Chaitanya's script as a basis and added a bit more functionality so that it kills only processes owned by a specific user on a specific GPU.

Usage:

python3 script.py gpu_number process_name user_name

Code:

import argparse
import subprocess


def get_num_gpus():
    result = subprocess.run(['nvidia-smi', '-L'], stdout=subprocess.PIPE)
    gpu_info = result.stdout.decode('utf-8').strip()
    num_gpus = len(gpu_info.split('\n'))  # each line corresponds to one GPU
    return num_gpus


def main(args):
    process_name_to_kill = args.process_name
    user_name = args.user_name
    gpu_number = args.gpu_number

    num_gpus = get_num_gpus()
    print(f'Number of GPUs: {num_gpus}')
    if gpu_number >= num_gpus or gpu_number < 0:
        print(f'GPU #{gpu_number} does not exist')
        return

    print(f'Scanning GPU #{gpu_number} processes')
    result = subprocess.run(['fuser', '/dev/nvidia' + str(gpu_number), '-v'], stdout=subprocess.PIPE)
    result_str = result.stdout.decode('utf-8').strip()

    process_ids = [int(i) for i in result_str.split() if i.isdigit()]
    for process_id in process_ids:
        pid_info = subprocess.run(['ps', '-p', str(process_id), '-o', 'comm=', '-o', 'user='], stdout=subprocess.PIPE)
        process_info = pid_info.stdout.decode('utf-8').strip().split()
        if len(process_info) < 2:
            continue  # the process exited between the fuser and ps calls
        # The user is the last column; the command name may contain spaces.
        process_name = ' '.join(process_info[:-1])
        owner = process_info[-1]
        print(f"PID: {process_id}, Process name: {process_name}, User: {owner}")
        if process_name_to_kill in process_name and owner == user_name:
            subprocess.run(['kill', '-9', str(process_id)])
            print(f'killed process with id {process_id}')


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('gpu_number', type=int, help='GPU number to scan')
    parser.add_argument('process_name', help='Name of the process to kill')
    parser.add_argument('user_name', help='Name of the user who owns the process')
    args = parser.parse_args()
    main(args)
Ironsmith answered 21/12, 2023 at 23:10 Comment(1)
Thank you, this way I could kill all the zombie processes. Hope you have a peaceful day! - Ulcerate
