RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle)` with GPU only

I'm working on a CNN with a one-dimensional signal. It works totally fine on the CPU. However, when I train the model on the GPU, a CUDA error occurs. I set os.environ['CUDA_LAUNCH_BLOCKING'] = "1" after I got RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling cublasCreate(handle). With that set, a cublasSgemm error occurred instead of the cublasCreate error. Although the NVIDIA documentation suggests a hardware problem, I can train other CNNs on images without any error. Below is my code for loading the data and feeding it into the training loop.

    idx = np.arange(len(dataset))  # dataset & label shuffle in once
    np.random.shuffle(idx)

    dataset = dataset[idx]
    sdnn = np.array(sdnn)[idx.astype(int)]        

    train_data, val_data = dataset[:int(0.8 * len(dataset))], dataset[int(0.8 * len(dataset)):]
    train_label, val_label = sdnn[:int(0.8 * len(sdnn))], sdnn[int(0.8 * len(sdnn)):]
    train_set = DataLoader(dataset=train_data, batch_size=opt.batch_size, num_workers=opt.workers)

    for i, data in enumerate(train_set, 0):  # data.shape = [batch_size, 3000(len(signal)), 1(channel)] tensor

        x = data.transpose(1, 2)
        label = torch.Tensor(train_label[i * opt.batch_size:i * opt.batch_size + opt.batch_size])
        x = x.to(device, non_blocking=True)
        label = label.to(device, non_blocking=True) # [batch size]
        label = label.view([len(label), 1])
        optim.zero_grad()

        # Feature of signal extract
        y_predict = model(x) # [batch size, fc3 output] # Error occurred HERE
        loss = mse(y_predict, label)

Below is the error message from this code.

  File "C:/Users/Me/Desktop/Me/Study/Project/Analysis/Regression/main.py", line 217, in Processing
    y_predict = model(x) # [batch size, fc3 output]
  File "C:\Anaconda\envs\torch\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Users\ME\Desktop\ME\Study\Project\Analysis\Regression\cnn.py", line 104, in forward
    x = self.fc1(x)
  File "C:\Anaconda\envs\torch\lib\site-packages\torch\nn\modules\module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "C:\Anaconda\envs\torch\lib\site-packages\torch\nn\modules\linear.py", line 91, in forward
    return F.linear(input, self.weight, self.bias)
  File "C:\Anaconda\envs\torch\lib\site-packages\torch\nn\functional.py", line 1674, in linear
    ret = torch.addmm(bias, input, weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I've tried to solve this error for weeks but can't find the solution. If you can see anything wrong here, please let me know.

He asked 12/3, 2021 at 12:58 Comment(3)
I've had the same error. I'm not sure of the root cause, but this is what I found from digging: when the batch size was < 8, the gradients became extremely small; (likely related) when the number of samples was not divisible by the batch size, the last batch of the epoch was < 8 and I got this error. By ensuring the number of samples was evenly divisible by my batch size and that the batch size was >= 8, this error seems to have gone away (see the sketch after these comments). – Reposition
The simplest hack that has worked for me almost every time is restarting the session or the machine itself. I believe it happens because of over-accumulated cache. – Capitalization
Running the same code in native Python instead of a Jupyter notebook solved my problem. It seems to be a problem with Jupyter's kernel and CUDA. – Gussy
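
One way to avoid the tiny final batch described in the first comment is DataLoader's drop_last option, which simply discards an incomplete last batch instead of requiring the sample count to be divisible by the batch size. A minimal sketch with made-up sizes (not the asker's data):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # 1003 samples is deliberately not divisible by the batch size of 8;
    # drop_last=True discards the short final batch instead of yielding it.
    data = TensorDataset(torch.randn(1003, 3000, 1), torch.randn(1003, 1))
    loader = DataLoader(data, batch_size=8, drop_last=True)

    for x, y in loader:
        assert x.shape[0] == 8  # every batch now has the full batch size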

Searching with partial keywords, I finally found a similar situation. For stability, I had been using CUDA 10.2. The reference suggested upgrading the CUDA toolkit to a higher version (11.2 in my case), and the problem was solved! I had run other training processes without trouble, but only this one caused the error. Since this CUDA error occurs for various reasons, changing the version can count as a solution.
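
If you are unsure which versions you are running before attempting such an upgrade, a few standard PyTorch attributes report them (a generic check, not specific to this answer):

    import torch

    print(torch.__version__)               # PyTorch version
    print(torch.version.cuda)              # CUDA version PyTorch was built against
    print(torch.backends.cudnn.version())  # bundled cuDNN version
    print(torch.cuda.is_available())       # is a usable GPU visible at all?
    if torch.cuda.is_available():
        print(torch.cuda.get_device_name(0))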

He answered 29/3, 2021 at 8:12 Comment(1)
I tried CUDA 11.5 but it did not work. – Westering

Please note that this can also be caused by a mismatch between the dimensions of your input tensor and the dimensions of your nn.Linear module (e.g. input.shape = (a, b) and nn.Linear(c, c, bias=False) with b not matching c).
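
A minimal sketch of that failure mode with made-up sizes: the last dimension of the input does not match the in_features of the nn.Linear. On the CPU this raises a clear shape error; on CUDA it can surface as the cuBLAS status above instead.

    import torch
    import torch.nn as nn

    x = torch.randn(16, 128)   # input.shape = (batch, 128)
    fc = nn.Linear(256, 10)    # expects the last dimension to be 256

    try:
        fc(x)                  # 128 != 256 -> shape-mismatch error on CPU
    except RuntimeError as e:
        print(e)               # e.g. "mat1 and mat2 shapes cannot be multiplied ..."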

Hilmahilt answered 30/7, 2021 at 8:22 Comment(3)
This answer is correct. – Benjie
Try to find the output shape after nn.Flatten() and then use it as the input size of nn.Linear(). – Benjie
This was my issue. – Phillips

As Loich rightly said, I think a shape mismatch is a prime reason why this error is thrown.

I too got this error while training an image-recognition model, where the output shape of the final Conv2d layer did not match the input shape of the first Linear layer.

If none of that works, the best thing to do is to run a smaller version of the process on the CPU and recreate the error. When running on the CPU instead of CUDA, you get a more useful traceback that can help you solve the error.

One remedy explained in this answer (quoted above) is to recreate a similar situation with the GPU disabled, by executing the code (without changing any line) on the CPU; it should give a better, more understandable error.
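
A minimal sketch of that remedy, assuming the script selects its device from a single device variable: hide the GPUs so the otherwise unchanged code falls back to the CPU and produces a readable error.

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = ""  # hide all GPUs before CUDA is initialized

    import torch
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(device)  # cpu -- the rest of the script now runs on the CPU unchanged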

P.S.: Although the original question states that the code runs fine on the CPU, I've posted this answer for anyone with a similar error that is not the result of a CUDA version mismatch.

Mulct answered 25/8, 2021 at 8:24 Comment(0)

Putting another answer here which solved the issue for me:

You will see the exact same error message if you use an instance of nn.Embedding that receives an input index outside the pre-defined vocabulary range. So if you created the Embedding for 100 units and you input the index 100 (the Embedding expects inputs from 0-99!), you end up with this CUDA error, which is super hard to trace back to the embedding.
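
A minimal sketch of that situation with made-up sizes: an nn.Embedding built for 100 entries accepts indices 0-99, so index 100 is out of range. On the CPU the error is explicit, while on CUDA it can show up as the opaque error described above.

    import torch
    import torch.nn as nn

    emb = nn.Embedding(num_embeddings=100, embedding_dim=16)  # valid indices: 0..99

    print(emb(torch.tensor([0, 42, 99])).shape)  # torch.Size([3, 16]) -- fine

    try:
        emb(torch.tensor([100]))  # one past the vocabulary range
    except IndexError as e:
        print(e)                  # "index out of range in self" on the CPU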

Astragalus answered 28/2, 2023 at 20:42 Comment(1)
I faced the same error. After reading your answer I raised the embedding cardinality by 1, and the code works. – Schroeder

I was getting the same error when running the same task on a single GPU of a 4-GPU machine, with a detectron2-based model:

The first GPU (i.e. cuda:0) worked fine, while the rest gave this error after 170 epochs. I didn't want to change the CUDA version or update the environment (not that simple once you have everything working...), and I didn't want to change any of the layers either (it doesn't make sense when it does work on one GPU).

So I found another simple solution for running on one GPU that is not the first in line: set CUDA_VISIBLE_DEVICES=<gpu_number> before your script invocation, and if you have a --device argument, set it to 0 (as CUDA will then see only one GPU, the one you specified).
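
A minimal sketch of that workaround (the script name and --device argument are placeholders): expose only the chosen physical GPU to the process, then address it as device 0 inside the script.

    # shell: run the script so that only physical GPU 2 is visible
    #   CUDA_VISIBLE_DEVICES=2 python train.py --device 0

    # equivalently, from Python, set the variable before CUDA is initialized:
    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "2"

    import torch
    print(torch.cuda.device_count())  # 1 -- only the selected GPU is visible
    device = torch.device("cuda:0")   # physical GPU 2 is now cuda:0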

Polystyrene answered 24/5, 2023 at 20:3 Comment(1)
Yes, this was the fix for me as well. The only thing I would add is that after you set CUDA_VISIBLE_DEVICES=<gpu_number> (where gpu_number is a string, by the way), the device id of the first GPU in that list becomes 0, so I had to change some t.to(device_id) code to account for this. – Elaterite

I got this error when my tensors were too large:

A.size() ; B.size()
# this works
torch.matmul(A[:450, ...], B).size()
# this doesn't
torch.matmul(A, B)

output:

torch.Size([512, 256, 3, 3, 4])
torch.Size([4])
torch.Size([450, 256, 3, 3])

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling 
`cublasSgemv(handle, op, m, n, &alpha, a, lda, x, incx, &beta, y, incy)`

So splitting the large tensor fixed it for me.
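
A minimal sketch of that workaround, with the shapes from the example above: multiply the oversized tensor in chunks along dim 0 and concatenate the results (the chunk size of 450 is simply the one that worked above).

    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    A = torch.randn(512, 256, 3, 3, 4, device=device)
    B = torch.randn(4, device=device)

    # split A along dim 0, multiply each chunk separately, then stitch back together
    out = torch.cat([torch.matmul(chunk, B) for chunk in A.split(450, dim=0)], dim=0)
    print(out.shape)  # torch.Size([512, 256, 3, 3])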

Zwiebel answered 15/5, 2023 at 9:14 Comment(0)

I got this error because of my CUDA version on an A100 GPU. I finally solved it by upgrading the CUDA version from 10.2 to 11.7.

conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.7 -c pytorch -c nvidia

Quietly answered 25/12, 2023 at 6:4 Comment(0)
