GPU memory is empty, but CUDA out of memory error occurs
While training this code with Ray Tune (one GPU per trial), a CUDA out-of-memory error started occurring on GPU 0 and GPU 1 after a few hours of training (about 20 trials). Even after I terminated the training process, those GPUs still raise an out-of-memory error.

nvidia-smi result

As shown above, all of my GPUs are currently empty, and there is no other Python process running except these two.

import torch
torch.rand(1, 2).to('cuda:0') # cuda out of memory error
torch.rand(1, 2).to('cuda:1') # cuda out of memory error
torch.rand(1, 2).to('cuda:2') # working
torch.rand(1, 2).to('cuda:3') # working

torch.cuda.device_count() # 4
torch.cuda.memory_reserved() # 0
torch.cuda.is_available() # True
# error message of GPU 0, 1
RuntimeError: CUDA error: out of memory

However, GPU 0 and GPU 1 keep giving out-of-memory errors. If I reboot the computer (Ubuntu 18.04.3), everything returns to normal, but as soon as I train again, the same problem occurs.

How can I debug this problem, or resolve it without rebooting?
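For reference, the per-GPU memory that nvidia-smi reports can also be queried programmatically through NVML, which does not create a CUDA context the way torch.rand(...).to('cuda:N') does. Below is only a minimal sketch using the pynvml bindings (an extra package, not part of the original setup):

import pynvml  # pip install nvidia-ml-py

# Query each GPU through NVML (the same interface nvidia-smi uses);
# this can still report memory even when CUDA context creation fails.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)   # .total / .free / .used, in bytes
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    print(f"GPU {i}: used={mem.used / 2**20:.0f} MiB, "
          f"free={mem.free / 2**20:.0f} MiB, compute processes={len(procs)}")
pynvml.nvmlShutdown()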

Environment:
  • Ubuntu 18.04.3
  • 4× RTX 2080 Ti
  • CUDA version 10.2
  • NVIDIA driver version: 460.27.04
  • cuDNN 7.6.4.38
  • Python 3.8.4
  • PyTorch 1.7.0, 1.9.0, 1.9.0+cu111
  • CPU: AMD Ryzen Threadripper 2950X (16 cores / 32 threads)
  • Memory: 125 GB
  • Power supply: 2000 W
  • dmesg result (no "GPU has fallen off the bus" errors):
dmesg | grep -i -e nvidia -e nvrm
[    5.946174] nvidia: loading out-of-tree module taints kernel.
[    5.946181] nvidia: module license 'NVIDIA' taints kernel.
[    5.956595] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    5.968280] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[    5.970485] nvidia 0000:09:00.0: enabling device (0000 -> 0003)
[    5.970571] nvidia 0000:09:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    6.015145] nvidia 0000:0a:00.0: enabling device (0000 -> 0003)
[    6.015394] nvidia 0000:0a:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    6.064993] nvidia 0000:42:00.0: enabling device (0000 -> 0003)
[    6.065072] nvidia 0000:42:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    6.115778] nvidia 0000:43:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    6.164680] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  460.27.04  Fri Dec 11 23:35:05 UTC 2020
[    6.174137] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  460.27.04  Fri Dec 11 23:24:19 UTC 2020
[    6.176472] [drm] [nvidia-drm] [GPU ID 0x00000900] Loading driver
[    6.176567] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:09:00.0 on minor 0
[    6.176635] [drm] [nvidia-drm] [GPU ID 0x00000a00] Loading driver
[    6.176636] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0a:00.0 on minor 1
[    6.176709] [drm] [nvidia-drm] [GPU ID 0x00004200] Loading driver
[    6.176710] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:42:00.0 on minor 2
[    6.176760] [drm] [nvidia-drm] [GPU ID 0x00004300] Loading driver
[    6.176761] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:43:00.0 on minor 3
[    6.189768] nvidia-uvm: Loaded the UVM driver, major device number 511.
[    6.744582] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input12
[    6.744664] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input15
[    6.744755] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input17
[    6.744852] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input19
[    6.744952] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input11
[    6.745301] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input16
[    6.745739] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input18
[    6.746280] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input20
[    7.117377] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input9
[    7.117453] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input10
[    7.117505] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input13
[    7.117559] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input14
[    7.117591] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input21
[    7.117650] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input22
[    7.117683] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input23
[    7.117720] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input24
[    9.462521] caller os_map_kernel_space.part.8+0x74/0x90 [nvidia] mapping multiple BARs
  • Numba and TensorFlow have the same problem, so it does not seem to be specific to PyTorch (traceback below, followed by a minimal driver-level reproduction).
>>> from numba import cuda
>>> device = cuda.get_current_device()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/api.py", line 460, in get_current_device
    return current_context().device
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 212, in get_context
    return _runtime.get_or_create_context(devnum)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 138, in get_or_create_context
    return self._get_or_create_context_uncached(devnum)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 153, in _get_or_create_context_uncached
    return self._activate_context_for(0)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 169, in _activate_context_for
    newctx = gpu.get_primary_context()
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 542, in get_primary_context
    driver.cuDevicePrimaryCtxRetain(byref(hctx), self.id)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 302, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 342, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_OUT_OF_MEMORY
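The traceback shows the failure happens when retaining the device's primary context (cuDevicePrimaryCtxRetain, error code 2 = CUDA_ERROR_OUT_OF_MEMORY), i.e. at the driver level rather than in any particular framework. A minimal ctypes sketch that reproduces the same call directly (it assumes libcuda.so.1 is on the loader path):

import ctypes

# Driver API sketch: retain the primary context on each device and print the
# raw CUresult. On the affected GPUs this is expected to return 2
# (CUDA_ERROR_OUT_OF_MEMORY), matching the Numba traceback above.
cuda = ctypes.CDLL("libcuda.so.1")
CUDA_SUCCESS = 0

assert cuda.cuInit(0) == CUDA_SUCCESS
count = ctypes.c_int()
assert cuda.cuDeviceGetCount(ctypes.byref(count)) == CUDA_SUCCESS

for i in range(count.value):
    dev = ctypes.c_int()
    assert cuda.cuDeviceGet(ctypes.byref(dev), i) == CUDA_SUCCESS
    ctx = ctypes.c_void_p()
    ret = cuda.cuDevicePrimaryCtxRetain(ctypes.byref(ctx), dev)
    print(f"GPU {i}: cuDevicePrimaryCtxRetain -> {ret}")
    if ret == CUDA_SUCCESS:
        cuda.cuDevicePrimaryCtxRelease(dev)  # release so the check has no side effects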

Update

It seems the problem stopped happening after rebooting and upgrading PyTorch to 1.9.1+cu111.

Update 2

It is happening again, but I cannot run the commands suggested in the comments because I do not have root access.

Murat answered 3/9, 2021 at 6:15 Comment(9)
In the failing state, if you search the system logs (e.g. dmesg), do you see any error messages that read, in part, "GPU has fallen off the bus"? What's the wattage of the power supply unit in this machine, what CPU does it use, and how much system memory does it have?Canute
CUDA reports the last encountered (and uncleared) error, so maybe you were doing something that did take up all memory, but didn't clear the error indication? Just speculating here.Cruzeiro
nvidia-smi shows the maximum CUDA version the installed driver can support. So no red flag in that point.Canute
A relatively common HW configuration issue for deep learning systems is that the power supply is underdimensioned for the number of GPUs installed. The system here has a nominal TDP power draw of around 1280W, which requires about 2000W nominal PSU to be resilient against power spikes (i.e. instantaneous power) from GPUs and CPU. A typical indication of insufficient power supply are "GPU has fallen off the bus" errors reported in system logs, however none seem to occur here, so my hypothesis does not seem applicable.Canute
@Canute I am checking the PSU but having some problems; I guess it is 2000W. I will update the post if I figure it out. Really, thanks for the comments.Murat
If you have root access to the system, and you don't mind interrupting work on all 4 GPUs, you could try unloading and then reloading the GPU driver. First, make sure nvidia-smi reports "no running processes found." The specific command for this may vary depending on GPU driver, but try something like sudo rmmod nvidia-uvm nvidia-drm nvidia-modeset nvidia. After that, if you get errors of the form "rmmod: ERROR: Module nvidiaXYZ is not currently loaded", those are not an actual problem and can be ignored. After that, as an ordinary user, run nvidia-smi and then run your tests again.Priscilapriscilla
The other suggestion you've gotten here is to use the nvidia-smi -r functionality.Priscilapriscilla
It could be memory fragmentation: at some point in the flow, an allocation request may require a contiguous block of memory larger than any available, even if the total amount of free memory is higher.Plage
It looks like your NVIDIA driver is ancient, but that may not be the problem. Could you debug the root cause: why do you run out of memory in the first place? This looks like the XY problem.Permanent

I believe this could be due to memory fragmentation, which can occur in certain cases in CUDA when memory is repeatedly allocated and deallocated.

Try calling torch.cuda.empty_cache() after model training, or set PYTORCH_NO_CUDA_MEMORY_CACHING=1 in your environment to disable caching; this may help reduce GPU memory fragmentation in certain cases.

For debugging memory usage, you can follow the PyTorch CUDA memory-management documentation: https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management
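A minimal sketch of how this could look around one training trial (the model, data, and optimizer here are placeholders, not taken from the original code):

import torch
import torch.nn as nn

# To disable the caching allocator entirely, the script can instead be launched as:
#   PYTORCH_NO_CUDA_MEMORY_CACHING=1 python your_train_script.py

# Placeholder "trial": a tiny model and one optimization step.
model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss = model(torch.randn(32, 16, device='cuda')).mean()
loss.backward()
optimizer.step()

# After the trial: drop references, then return cached blocks to the driver
# so the next trial starts from a less fragmented pool.
del model, optimizer, loss
torch.cuda.empty_cache()

print(torch.cuda.memory_allocated())  # bytes held by live tensors
print(torch.cuda.memory_reserved())   # bytes still reserved by the caching allocator
print(torch.cuda.memory_summary())    # detailed allocator report for debugging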

Inexpugnable answered 5/12, 2023 at 3:29 Comment(0)
