CUDA Out of memory when there is plenty available

I'm having trouble using PyTorch with CUDA. Sometimes it works fine; other times it fails with RuntimeError: CUDA out of memory. However, I am confused because nvidia-smi shows that my card's used memory is 563 MiB / 6144 MiB, which should in theory leave over 5 GiB available.

[screenshot: output of nvidia-smi]

However, upon running my program, I am greeted with the message: RuntimeError: CUDA out of memory. Tried to allocate 578.00 MiB (GPU 0; 5.81 GiB total capacity; 670.69 MiB already allocated; 624.31 MiB free; 898.00 MiB reserved in total by PyTorch)

It looks like PyTorch is reserving ~1 GiB, knows that ~700 MiB are allocated, and is trying to allocate another ~600 MiB for the program, yet claims that the GPU is out of memory. How can this be? There should be plenty of GPU memory left given these numbers.
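
For reference, here is a minimal sketch of how I compare what PyTorch's caching allocator reports against the device total (standard torch.cuda queries; device 0 and an available CUDA device are assumed):

import torch

# Assumes a CUDA-capable GPU is visible as device 0.
device = torch.device("cuda:0")
total = torch.cuda.get_device_properties(device).total_memory
reserved = torch.cuda.memory_reserved(device)    # cached by PyTorch's allocator
allocated = torch.cuda.memory_allocated(device)  # actually held by live tensors

print(f"total:     {total / 2**20:.0f} MiB")
print(f"reserved:  {reserved / 2**20:.0f} MiB")
print(f"allocated: {allocated / 2**20:.0f} MiB")
print(f"free inside reserved pool: {(reserved - allocated) / 2**20:.0f} MiB")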

Fishnet answered 30/5, 2022 at 8:19 Comment(4)
Have you watched your GPU usage while running your training? You can do so with nvidia-smi -l 1 – Barranca
@Barranca just tried that, and the program does end up trying to use my whole GPU's memory before it terminates. I'm guessing that the estimate given by CUDA is an underestimate? The 'tried to allocate' figure seems to be around 10x lower than it should be: after ensuring that the GPU's memory is completely free, the program takes over 5.8 GiB. No clue why it's such a large underestimate though. – Fishnet
Coming back to this later: this was possibly because a conflicting CUDA install was causing double the memory usage. I had one CUDA installed from the NVIDIA website and another from a system76 distribution; removing the system76 one seemed to fix the problem. – Fishnet
Running into this too, just trying to get clipit/pixray to work. There's 1 GiB of memory free but CUDA does not allocate it. Seems like a bug in CUDA, but I have the newest driver on my system. – Syriac

You need to empty the torch cache after some step, before the error occurs:

torch.cuda.empty_cache()
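
For example, a minimal sketch of where such a call might go (the tiny model, optimizer, and loop here are placeholders for illustration, not the asker's code; a CUDA device is assumed):

import gc
import torch
import torch.nn as nn

device = torch.device("cuda")
model = nn.Linear(1024, 1024).to(device)          # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(10):
    x = torch.randn(256, 1024, device=device)     # placeholder batch
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Drop Python references first, then release cached blocks back to the driver.
    del x, loss
    gc.collect()
    torch.cuda.empty_cache()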
Indigotin answered 30/5, 2022 at 10:56 Comment(2)
I am emptying the cache and collecting garbage after every script, but it hasn't fixed it. The only fix I have found was to make sure my GPU was not being used at all, by making my CPU drive the monitor while the GPU works only on the neural net. – Fishnet
I had this problem with some driver versions; maybe you need to replace your drivers. My config: NVIDIA-SMI 495.29.05, Driver Version: 495.29.05, CUDA Version: 11.5. You can also try decreasing the batch size. – Indigotin

Possible answer: I received this error most often when running a program that uses both TensorFlow and PyTorch (which I have since stopped doing). It appears that the out-of-memory error gets reported by PyTorch even when TensorFlow is the framework holding the memory.

If for some reason you want to use both, I fixed my issues by limiting the TensorFlow memory with the following line:

tf.config.experimental.set_virtual_device_configuration(gpus[0], [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=TF_MEM_LIM)])

where TF_MEM_LIM is the integer value in megabytes of your desired limit.
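
For context, a fuller sketch of that approach (the 2048 MiB cap below is an arbitrary example value, not a recommendation):

import tensorflow as tf

TF_MEM_LIM = 2048  # desired TensorFlow memory cap in MiB (example value)

# Must run before TensorFlow initializes the GPUs.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Cap TensorFlow's allocation on the first GPU so the rest stays free for PyTorch.
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=TF_MEM_LIM)]
    )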

Fishnet answered 14/4, 2023 at 18:55 Comment(0)
