GPU memory is empty, but CUDA out of memory error occurs
While training this code with Ray Tune (one GPU per trial), a CUDA out-of-memory error started occurring on GPU 0 and GPU 1 after a few hours of training (about 20 trials). Even after I terminated the training process, those GPUs still raise an out-of-memory error.

nvidia-smi result

As shown above, all of my GPUs are currently empty, and there is no other Python process running except these two.

import torch
torch.rand(1, 2).to('cuda:0') # cuda out of memory error
torch.rand(1, 2).to('cuda:1') # cuda out of memory error
torch.rand(1, 2).to('cuda:2') # working
torch.rand(1, 2).to('cuda:3') # working

torch.cuda.device_count() # 4
torch.cuda.memory_reserved() # 0
torch.cuda.is_available() # True
# error message of GPU 0, 1
RuntimeError: CUDA error: out of memory

However, GPU 0 and GPU 1 keep giving out-of-memory errors. If I reboot the computer (Ubuntu 18.04.3), everything returns to normal, but as soon as I train again, the same problem occurs.

How can I debug this problem, or resolve it without rebooting?
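For reference, the per-GPU memory that nvidia-smi reports can also be queried programmatically through NVML, which does not create a CUDA context the way torch.rand(...).to('cuda:N') does. Below is only a minimal sketch using the pynvml bindings (an extra package, not part of the original setup):

import pynvml  # pip install nvidia-ml-py

# Query each GPU through NVML (the same interface nvidia-smi uses);
# this can still report memory even when CUDA context creation fails.
pynvml.nvmlInit()
for i in range(pynvml.nvmlDeviceGetCount()):
    handle = pynvml.nvmlDeviceGetHandleByIndex(i)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)   # .total / .free / .used, in bytes
    procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
    print(f"GPU {i}: used={mem.used / 2**20:.0f} MiB, "
          f"free={mem.free / 2**20:.0f} MiB, compute processes={len(procs)}")
pynvml.nvmlShutdown()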

Environment:
  • Ubuntu 18.04.3
  • 4× RTX 2080 Ti
  • CUDA version 10.2
  • NVIDIA driver version: 460.27.04
  • cuDNN 7.6.4.38
  • Python 3.8.4
  • PyTorch 1.7.0, 1.9.0, 1.9.0+cu111
  • CPU: AMD Ryzen Threadripper 2950X (16 cores / 32 threads)
  • Memory: 125 GB
  • Power supply: 2000 W
  • dmesg result (no "GPU has fallen off the bus" errors):
dmesg | grep -i -e nvidia -e nvrm
[    5.946174] nvidia: loading out-of-tree module taints kernel.
[    5.946181] nvidia: module license 'NVIDIA' taints kernel.
[    5.956595] nvidia: module verification failed: signature and/or required key missing - tainting kernel
[    5.968280] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[    5.970485] nvidia 0000:09:00.0: enabling device (0000 -> 0003)
[    5.970571] nvidia 0000:09:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    6.015145] nvidia 0000:0a:00.0: enabling device (0000 -> 0003)
[    6.015394] nvidia 0000:0a:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    6.064993] nvidia 0000:42:00.0: enabling device (0000 -> 0003)
[    6.065072] nvidia 0000:42:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    6.115778] nvidia 0000:43:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[    6.164680] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  460.27.04  Fri Dec 11 23:35:05 UTC 2020
[    6.174137] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  460.27.04  Fri Dec 11 23:24:19 UTC 2020
[    6.176472] [drm] [nvidia-drm] [GPU ID 0x00000900] Loading driver
[    6.176567] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:09:00.0 on minor 0
[    6.176635] [drm] [nvidia-drm] [GPU ID 0x00000a00] Loading driver
[    6.176636] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:0a:00.0 on minor 1
[    6.176709] [drm] [nvidia-drm] [GPU ID 0x00004200] Loading driver
[    6.176710] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:42:00.0 on minor 2
[    6.176760] [drm] [nvidia-drm] [GPU ID 0x00004300] Loading driver
[    6.176761] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:43:00.0 on minor 3
[    6.189768] nvidia-uvm: Loaded the UVM driver, major device number 511.
[    6.744582] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input12
[    6.744664] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input15
[    6.744755] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input17
[    6.744852] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:03.1/0000:43:00.1/sound/card4/input19
[    6.744952] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input11
[    6.745301] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input16
[    6.745739] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input18
[    6.746280] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:40/0000:40:01.3/0000:42:00.1/sound/card3/input20
[    7.117377] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input9
[    7.117453] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input10
[    7.117505] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input13
[    7.117559] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.3/0000:09:00.1/sound/card0/input14
[    7.117591] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input21
[    7.117650] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input22
[    7.117683] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input23
[    7.117720] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:03.1/0000:0a:00.1/sound/card1/input24
[    9.462521] caller os_map_kernel_space.part.8+0x74/0x90 [nvidia] mapping multiple BARs
  • Numba and TensorFlow have the same problem, so it does not seem to be specific to PyTorch (traceback below, followed by a minimal driver-level reproduction).
>>> from numba import cuda
>>> device = cuda.get_current_device()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/api.py", line 460, in get_current_device
    return current_context().device
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 212, in get_context
    return _runtime.get_or_create_context(devnum)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 138, in get_or_create_context
    return self._get_or_create_context_uncached(devnum)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 153, in _get_or_create_context_uncached
    return self._activate_context_for(0)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/devices.py", line 169, in _activate_context_for
    newctx = gpu.get_primary_context()
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 542, in get_primary_context
    driver.cuDevicePrimaryCtxRetain(byref(hctx), self.id)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 302, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/home/user_name/.pyenv/versions/tensorflow/lib/python3.8/site-packages/numba/cuda/cudadrv/driver.py", line 342, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuDevicePrimaryCtxRetain results in CUDA_ERROR_OUT_OF_MEMORY
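The traceback shows the failure happens when retaining the device's primary context (cuDevicePrimaryCtxRetain, error code 2 = CUDA_ERROR_OUT_OF_MEMORY), i.e. at the driver level rather than in any particular framework. A minimal ctypes sketch that reproduces the same call directly (it assumes libcuda.so.1 is on the loader path):

import ctypes

# Driver API sketch: retain the primary context on each device and print the
# raw CUresult. On the affected GPUs this is expected to return 2
# (CUDA_ERROR_OUT_OF_MEMORY), matching the Numba traceback above.
cuda = ctypes.CDLL("libcuda.so.1")
CUDA_SUCCESS = 0

assert cuda.cuInit(0) == CUDA_SUCCESS
count = ctypes.c_int()
assert cuda.cuDeviceGetCount(ctypes.byref(count)) == CUDA_SUCCESS

for i in range(count.value):
    dev = ctypes.c_int()
    assert cuda.cuDeviceGet(ctypes.byref(dev), i) == CUDA_SUCCESS
    ctx = ctypes.c_void_p()
    ret = cuda.cuDevicePrimaryCtxRetain(ctypes.byref(ctx), dev)
    print(f"GPU {i}: cuDevicePrimaryCtxRetain -> {ret}")
    if ret == CUDA_SUCCESS:
        cuda.cuDevicePrimaryCtxRelease(dev)  # release so the check has no side effects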

Update

It seems the problem stopped happening after rebooting and upgrading PyTorch to 1.9.1+cu111.

Update 2

It is happening again, but I cannot run the commands suggested in the comments because I do not have root access.

Murat answered 3/9, 2021 at 6:15 Comment(9)
In the failing state, if you search the system logs (e.g. dmesg), do you see any error messages that read, in part, "GPU has fallen off the bus"? What's the wattage of the power supply unit in this machine, what CPU does it use, and how much system memory does it have?Canute
CUDA reports the last encountered (and uncleared) error, so maybe you were doing something that did take up all memory, but didn't clear the error indication? Just speculating here.Cruzeiro
nvidia-smi shows the maximum CUDA version the installed driver can support. So no red flag in that point.Canute
A relatively common HW configuration issue for deep learning systems is that the power supply is underdimensioned for the number of GPUs installed. The system here has a nominal TDP power draw of around 1280W, which requires about 2000W nominal PSU to be resilient against power spikes (i.e. instantaneous power) from GPUs and CPU. A typical indication of insufficient power supply are "GPU has fallen off the bus" errors reported in system logs, however none seem to occur here, so my hypothesis does not seem applicable.Canute
@Canute I am checking the PSU but having some problems; I guess it is 2000W. I will update the post if I figure it out. Really, thanks for the comments.Murat
If you have root access to the system, and you don't mind interrupting work on all 4 GPUs, you could try unloading and then reloading the GPU driver. First, make sure nvidia-smi reports "no running processes found." The specific command for this may vary depending on GPU driver, but try something like sudo rmmod nvidia-uvm nvidia-drm nvidia-modeset nvidia. After that, if you get errors of the form "rmmod: ERROR: Module nvidiaXYZ is not currently loaded", those are not an actual problem and can be ignored. After that, as an ordinary user, run nvidia-smi and then run your tests again.Priscilapriscilla
The other suggestion you've gotten here is to use the nvidia-smi -r functionality.Priscilapriscilla
It could be memory fragmentation: at some point in the flow, an allocation request may require a contiguous block of memory larger than any available, even if the total amount of free memory is higher.Plage
It looks like your NVIDIA driver is ancient, but that may not be the problem. Could you debug the root cause: why do you run out of memory in the first place? This looks like the XY problem.Permanent

I believe this could be due to memory fragmentation, which can occur in certain cases in CUDA when memory is repeatedly allocated and deallocated.

Try calling torch.cuda.empty_cache() after model training, or set PYTORCH_NO_CUDA_MEMORY_CACHING=1 in your environment to disable caching; this may help reduce GPU memory fragmentation in certain cases.

For debugging memory usage, you can follow the PyTorch CUDA memory-management documentation: https://pytorch.org/docs/stable/notes/cuda.html#cuda-memory-management
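A minimal sketch of how this could look around one training trial (the model, data, and optimizer here are placeholders, not taken from the original code):

import torch
import torch.nn as nn

# To disable the caching allocator entirely, the script can instead be launched as:
#   PYTORCH_NO_CUDA_MEMORY_CACHING=1 python your_train_script.py

# Placeholder "trial": a tiny model and one optimization step.
model = nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss = model(torch.randn(32, 16, device='cuda')).mean()
loss.backward()
optimizer.step()

# After the trial: drop references, then return cached blocks to the driver
# so the next trial starts from a less fragmented pool.
del model, optimizer, loss
torch.cuda.empty_cache()

print(torch.cuda.memory_allocated())  # bytes held by live tensors
print(torch.cuda.memory_reserved())   # bytes still reserved by the caching allocator
print(torch.cuda.memory_summary())    # detailed allocator report for debugging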

Inexpugnable answered 5/12, 2023 at 3:29 Comment(0)
