RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED using pytorch

10

32

I am trying to run a simple PyTorch sample. It works fine on the CPU, but when using the GPU I get this error message:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 263, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/conv.py", line 260, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

The code I am trying to run is the following:

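import torch
from torch import nn

# 1D convolution: 16 input channels, 33 output channels, kernel size 3
m = nn.Conv1d(16, 33, 3, stride=2)
m = m.to('cuda')
# batch of 20 samples, 16 channels, length 50 (note: the name shadows the built-in input())
input = torch.randn(20, 16, 50)
input = input.to('cuda')
output = m(input)  # the error is raised here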

I am running this code in an NVIDIA Docker container with CUDA version 10.2, and my GPU is an RTX 2070.

Libertylibia answered 11/3, 2021 at 18:50 Comment(4)
A hint unrelated to your problem: please do not use the names of Python built-ins (such as input) as variable names, because this can cause some very ugly and hard-to-find problems. - Luigiluigino
Have you checked import torch.cuda / torch.cuda.is_available()? - Roberson
I have exactly the same problem on CUDA 10.2. Did you solve it? - Danais
@GuojunZhang I solved it by using the PyTorch container for NVIDIA Docker. - Libertylibia
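
The check suggested in the comments can be extended into a small diagnostic. This is only a sketch: the function name cuda_diagnostics and the dictionary keys are made up for illustration, while the torch calls themselves are standard PyTorch attributes. It reports which CUDA version the installed build targets and whether cuDNN is visible, which is usually the first thing worth comparing against the container or driver.

```python
def cuda_diagnostics():
    """Collect basic facts about the local PyTorch/CUDA/cuDNN setup."""
    try:
        import torch
    except ImportError:
        # PyTorch itself is missing, so there is nothing more to report.
        return {"torch_installed": False}

    info = {"torch_installed": True}
    info["torch_version"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # None for CPU-only builds
    info["cuda_available"] = torch.cuda.is_available()
    info["cudnn_available"] = torch.backends.cudnn.is_available()
    info["cudnn_version"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


if __name__ == "__main__":
    for key, value in cuda_diagnostics().items():
        print(f"{key}: {value}")
```

If cuda_available is False, or cuda_build does not match what the container provides, the version-related answers below are probably the place to start.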
T
27

In my case it actually had nothing to do with the PyTorch/CUDA/cuDNN version. PyTorch initializes cuDNN lazily, the first time a convolution is executed. In my case, however, there was not enough GPU memory left to initialize cuDNN, because PyTorch itself already held the entire memory in its internal cache. One can release the cache manually by calling torch.cuda.empty_cache() right before the first convolution is executed. A cleaner solution is to force cuDNN initialization at the beginning by running a mock convolution:

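import torch

def force_cudnn_initialization():
    # Run one throwaway convolution so that cuDNN is initialized while memory is still free.
    s = 32
    dev = torch.device('cuda')
    torch.nn.functional.conv2d(torch.zeros(s, s, s, s, device=dev), torch.zeros(s, s, s, s, device=dev))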

Calling the above function at the very beginning of the program solved the problem for me.
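
To see whether the cache really is holding the memory, one can look at PyTorch's own counters before the first convolution. The following is a rough sketch, not part of the original fix: the helper names gpu_memory_report and release_cached_memory are hypothetical, and the guarded import is only there so the snippet also loads on a machine without PyTorch.

```python
try:
    import torch
except ImportError:  # lets the sketch load on machines without PyTorch
    torch = None


def gpu_memory_report(device_index=0):
    """Return PyTorch's view of GPU memory in MiB, or None if CUDA is unusable."""
    if torch is None or not torch.cuda.is_available():
        return None
    mib = 1024 ** 2
    total = torch.cuda.get_device_properties(device_index).total_memory
    return {
        "total_mib": total / mib,
        # memory occupied by live tensors
        "allocated_mib": torch.cuda.memory_allocated(device_index) / mib,
        # memory held by PyTorch's caching allocator (includes the above)
        "reserved_mib": torch.cuda.memory_reserved(device_index) / mib,
    }


def release_cached_memory():
    """Hand unused cached blocks back to the driver; returns False if there is no CUDA."""
    if torch is None or not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()
    return True
```

A reserved_mib value close to total_mib right before the first convolution would be consistent with the situation described in this answer.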

Twitt answered 2/11, 2021 at 9:44 Comment(0)
C
21

There is some discussion regarding this here. I had the same issue, but using CUDA 11.1 resolved it for me.

This is the exact pip command:

pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
Counterchange answered 21/3, 2021 at 3:44 Comment(0)
T
7

I am also using CUDA 10.2. I had the exact same error after upgrading torch and torchvision to the latest versions (torch-1.8.0 and torchvision-0.9.0). Which version are you using?

I guess this is not the best solution, but after downgrading to torch-1.7.1 and torchvision-0.8.2 it works just fine.

Thorfinn answered 16/3, 2021 at 9:9 Comment(0)
A
2

I had the same issue when I was training yolov7 with a chess dataset. By reducing batch size from 8 to 4, the issue was solved.
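
The same idea can be automated by retrying with a smaller batch. This is a minimal sketch under loud assumptions: run_with_smaller_batches and fake_step are hypothetical names, the training step is simulated, and catching every RuntimeError is deliberately broad. Some CUDA errors may leave the process in an unusable state, in which case restarting with a smaller batch size is the safer route.

```python
def run_with_smaller_batches(step_fn, batch_size, min_batch_size=1):
    """Call step_fn(batch_size), halving batch_size after each RuntimeError.

    Returns (result, batch_size_that_worked). Re-raises the last error once
    batch_size cannot be reduced any further.
    """
    while True:
        try:
            return step_fn(batch_size), batch_size
        except RuntimeError:
            if batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)


# Stand-in for a real training step: pretends that batches above 4 do not fit.
def fake_step(batch_size):
    if batch_size > 4:
        raise RuntimeError("cuDNN error: CUDNN_STATUS_NOT_INITIALIZED")
    return f"trained with batch size {batch_size}"


result, used = run_with_smaller_batches(fake_step, batch_size=8)
print(result, used)  # trained with batch size 4 4
```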

Arand answered 26/10, 2022 at 4:10 Comment(2)
Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. - Glabrous
As noted, some additional information would be helpful. - Riff
T
1

In my case this error occurred when computing the loss. I used a mixed BCE-Dice loss, and it turned out that my model output was linear (raw logits) instead of sigmoid. I then applied a sigmoid to the predictions as shown below, and it worked fine.

output = torch.nn.Sigmoid()(output)
loss = criterion1(output, target)
Tog answered 3/12, 2021 at 8:34 Comment(0)
F
1

In my case, I had an array indexing operation whose index was out of bounds, and CUDA did not tell me that. I was running inference on a neural network, so I moved it from the GPU to the CPU, and the logs were much more informative after that. When debugging this error, switch to the CPU first and you will know what to do.
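
Another way to catch this kind of bug is to validate the indices on the host before they reach the GPU. The helper below is a hypothetical illustration in plain Python, not something from the original answer, and it assumes Python-style negative indices are acceptable; a lookup such as an embedding would need the stricter check 0 <= idx < size.

```python
def find_out_of_range(indices, size):
    """Return (position, value) pairs for indices invalid for a dimension of length `size`."""
    return [(pos, idx) for pos, idx in enumerate(indices) if not -size <= idx < size]


bad = find_out_of_range([0, 3, 7, -1, 10], size=8)
print(bad)  # [(4, 10)]
```

With tensors, the index values can be brought to the host first, for example with index_tensor.cpu().tolist(), and passed to the helper.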

Federalist answered 2/12, 2022 at 17:36 Comment(0)
H
1

In my case, the fix was to kill the existing processes on the GPU. Use nvidia-smi to check which processes are running, then kill the ones you no longer need, for example with killall -9 python3. Once the memory has been freed, run your program again.
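
Checking for leftover processes can also be scripted. The sketch below assumes nvidia-smi is on the PATH and uses its CSV query mode; the function names are hypothetical. It only lists processes and leaves the decision about what to kill to you.

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' CSV rows into dictionaries."""
    apps = []
    for line in csv_text.strip().splitlines():
        parts = [field.strip() for field in line.split(",")]
        if len(parts) != 3 or not parts[0].isdigit():
            continue  # skip anything that is not a well-formed 3-column row
        pid, name, memory = parts
        apps.append({"pid": int(pid), "name": name, "used_memory": memory})
    return apps


def list_gpu_processes():
    """Return the compute processes reported by nvidia-smi, or [] if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    return parse_compute_apps(result.stdout) if result.returncode == 0 else []


if __name__ == "__main__":
    for app in list_gpu_processes():
        print(app["pid"], app["name"], app["used_memory"])
```

Keeping the parsing separate from the subprocess call makes it easy to try the parser on saved output from another machine.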

Horseshoes answered 9/12, 2022 at 9:23 Comment(0)
R
1

I had the same issue. It turned out that multiple processes were trying to run at the same time, because pressing Ctrl+C does not always terminate all of them. I logged out of the server and back in, and then it worked.

Rickety answered 16/2 at 18:20 Comment(0)
C
0

Sometimes an error in custom CUDA C++ code that is compiled into a .so file and called from Python can cause this problem, so check your C++ source code if you have any.

Caterwaul answered 21/7, 2023 at 8:33 Comment(0)
R
0

My solution is similar to saturn660's answer and the link provided there is also helpful to understand the problem.

Many users install PyTorch with conda or pip without specifying a build, e.g. pip install torch. That works for some, but it can fail if the local CUDA version doesn't match the one the official default build was made for.

The PyTorch install guide actually instructs pip users to pass --index-url https://download.pytorch.org/whl/cuxxx, where xxx stands for a CUDA version such as 118 for CUDA 11.8. But that installs the latest stable version built for the specified CUDA version. If you need a specific PyTorch version built for the CUDA version you have, you can run (for example):

pip install torch==1.8.1+cu102 -f https://download.pytorch.org/whl/torch_stable.html

This installs torch 1.8.1 built for CUDA 10.2, which you have available on your computer. Of course, you need to download the corresponding cuDNN library and extract it to the CUDA 10.2 lib64 and include directories.

Check this page for all the combinations you can install.
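
The +cuXXX suffix in these version strings can be decoded to double-check which CUDA version a wheel was built for. This is a small sketch with a hypothetical helper name; it assumes the convention seen in the examples above, where the last digit of the tag is the minor version (cu102 means 10.2, cu118 means 11.8).

```python
def split_torch_version(version):
    """Split '1.8.1+cu102' into ('1.8.1', '10.2'); the CUDA part is None without a cuXXX tag."""
    release, _, local = version.partition("+")
    digits = local[2:]
    if not (local.startswith("cu") and digits.isdigit() and len(digits) >= 2):
        return release, None
    # Assumed convention: the last digit is the minor version.
    return release, f"{digits[:-1]}.{digits[-1]}"


print(split_torch_version("1.8.1+cu102"))  # ('1.8.1', '10.2')
print(split_torch_version("1.8.0+cu111"))  # ('1.8.0', '11.1')
```

On an installed build, the string to decode is torch.__version__; torch.version.cuda reports the CUDA version directly and can serve as a cross-check.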

Rayner answered 8/9, 2023 at 18:15 Comment(0)
