CUDA error: CUBLAS_STATUS_INVALID_VALUE when training a BERT model using HuggingFace
I am working on sentiment analysis on the Steam reviews dataset using a BERT model, where I have 2 labels: positive and negative. I have fine-tuned the model by adding 2 linear layers on top; the code for that is below.

import torch
import torch.nn as nn
from transformers import BertForSequenceClassification

bert = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                     num_labels=len(label_dict),
                                                     output_attentions=False,
                                                     output_hidden_states=False)

class bertModel(nn.Module):
    def __init__(self, bert):
        super(bertModel, self).__init__()
        self.bert = bert
        self.dropout1 = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)
        self.fc2 = nn.Linear(512, 2)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, **inputs):
        _, x = self.bert(**inputs)
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout1(x)
        x = self.fc2(x)
        x = self.softmax(x)
        return x

This is my train function:

def model_train(model, device, criterion, scheduler, optimizer, n_epochs):
    train_loss = []
    model.train()
    for epoch in range(1, n_epochs + 1):
        total_train_loss, training_loss = 0, 0
        for idx, batch in enumerate(dataloader_train):
            model.zero_grad()
            data = tuple(b.to(device) for b in batch)
            inputs = {'input_ids': data[0], 'attention_mask': data[1], 'labels': data[2]}
            outputs = model(**inputs)
            loss = criterion(outputs, data[2])
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            # update the weights
            optimizer.step()
            scheduler.step()
            training_loss += loss.item()
            total_train_loss += loss.item()
            if idx % 25 == 0:
                print('Epoch: {}, Batch: {}, Training Loss: {}'.format(epoch, idx, training_loss / 25))
                training_loss = 0
        # avg training loss
        avg_train_loss = total_train_loss / len(dataloader_train)
        # validation data loss
        avg_pred_loss = model_evaluate(dataloader_val)
        # print at the end of every epoch
        print('End of Epoch {}, Avg. Training Loss: {}, Avg. Validation Loss: {} \n'.format(epoch, avg_train_loss, avg_pred_loss))
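
For context, criterion, optimizer, and scheduler are created outside this function; a minimal sketch of that setup (assumed, the exact values are not in the post), pairing nn.NLLLoss with the model's LogSoftmax output:

import torch
import torch.nn as nn
from transformers import get_linear_schedule_with_warmup

n_epochs = 3  # assumed value
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = bertModel(bert).to(device)
criterion = nn.NLLLoss()  # matches the LogSoftmax output of bertModel
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=len(dataloader_train) * n_epochs)
model_train(model, device, criterion, scheduler, optimizer, n_epochs)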

I am running this code on Google Colab. When I run the train function, I get the following error; I have tried batch sizes of 32, 256, and 512.

RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Can anyone please help me on this? Thank you.

Update on the code: I tried running the code on the CPU and the error turns out to be a matrix shape mismatch. The input shape and the shape after self.bert are printed in the screenshots below. Since the first linear layer (fc1) never executes, the shape after it is not printed.

[Screenshots: printed input shape and shape after self.bert, and the CPU traceback showing the shape mismatch]
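
A minimal, self-contained sketch (my reconstruction, not the asker's exact code) that reproduces the same mismatch on CPU:

import torch
import torch.nn as nn

# BertForSequenceClassification returns classification logits of shape
# (batch_size, num_labels) -- here (64, 2) -- not the 768-dim pooled output.
logits = torch.randn(64, 2)   # stands in for `_, x = self.bert(**inputs)`
fc1 = nn.Linear(768, 512)
fc1(logits)                   # RuntimeError: mat1 and mat2 shapes cannot be
                              # multiplied (64x2 and 768x512)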

Samothrace answered 14/7, 2021 at 18:47 Comment(0)

I suggest trying a couple of things that may resolve the error.

As shown in this forum, one possible solution is to lower the batch size your data loader uses, since the error can be memory-related.
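
For example, a sketch assuming an existing dataset_train Dataset object:

from torch.utils.data import DataLoader

# Recreate the loader with a smaller batch size to rule out a memory error.
dataloader_train = DataLoader(dataset_train, batch_size=8, shuffle=True)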

If that does not work, then as shown in this GitHub issue, update to a newer build of PyTorch with CUDA: older builds contain a matrix-multiplication bug that raises this same error. As shown in this forum, you can update PyTorch to the nightly pip wheel, or use the CUDA 10.2 pip or conda binaries. Installation instructions for all of these are on the PyTorch home page.

If none of that works, the best thing to do is to run a smaller version of the process on the CPU and recreate the error. Running on CPU instead of CUDA gives a much more useful traceback that points at the real problem.
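
A sketch of such a CPU run (assuming the model and dataset objects from the question), with a batch size of 1 so a single step recreates the error quickly:

import torch
from torch.utils.data import DataLoader

device = torch.device('cpu')               # run on CPU for a readable traceback
model = bertModel(bert).to(device)
debug_loader = DataLoader(dataset_train, batch_size=1)

batch = next(iter(debug_loader))           # one step is enough to surface the error
data = tuple(b.to(device) for b in batch)
inputs = {'input_ids': data[0], 'attention_mask': data[1], 'labels': data[2]}
outputs = model(**inputs)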

EDIT (Based on Comments):

You have a matrix shape error in your model. The problem stems from your forward function.

The BERT model here outputs a tensor of torch.Size([64, 2]), which means that if you pass it into your first linear layer it will error: that layer requires input of shape (?, 768), because you initialized it as nn.Linear(768, 512). To make the error disappear, you need to either apply some transformation to the tensor or initialize another linear layer, as shown below:

# somewhere defined in __init__:
self.fc0 = nn.Linear(2, 768)

def forward(self, **inputs):
    _, x = self.bert(**inputs)
    x = self.fc0(x)
    x = self.fc1(x)
    x = self.relu(x)
    x = self.dropout1(x)
    x = self.fc2(x)
    x = self.softmax(x)
    return x
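
As an aside (an editor's sketch, not part of the original answer): the (64, 2) tensor is the logits of the classification head that BertForSequenceClassification already contains. If the goal is to feed a 768-dimensional pooled representation into fc1, the base BertModel exposes that directly, so no fc0 adapter layer is needed. BertClassifier below is a hypothetical name:

import torch.nn as nn
from transformers import BertModel

class BertClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.fc1 = nn.Linear(768, 512)
        self.relu = nn.ReLU()
        self.dropout1 = nn.Dropout(0.1)
        self.fc2 = nn.Linear(512, 2)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        x = out.pooler_output              # pooled [CLS] vector, shape (batch_size, 768)
        x = self.dropout1(self.relu(self.fc1(x)))
        return self.softmax(self.fc2(x))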

Sarthak Jain

Ballistic answered 14/7, 2021 at 19:8 Comment(13)
okay, I'll try these and let you know how it works, thank you.Samothrace
I followed the links you mentioned and tried changing the PyTorch version; I currently have "1.9.0+cu102". But I still get the same error. Can you please take a look at my notebook if I share it with you?Samothrace
I also changed the batch size and tried with smaller sizes of 32 and 64, but the same error persistsSamothrace
Ok, I suggest you run with a batch size of 1 and switch the process to CPU temporarily. I think a batch size of 32 is still really big for big models. Also, make sure you run once on CPU, because that is how you will get an actual error that makes more sense. You may have some problem with how you are encoding or predicting labels that will be hard to debug unless you run the training process on CPU once and get a readable error instead of the above CUDA error, which does not say much. Since CPU is slow, to recreate the error faster I suggest making the batch size 1.Ballistic
yes, I just ran it on CPU and I am getting a matrix shape error: RuntimeError: mat1 and mat2 shapes cannot be multiplied (64x2 and 768x512)Samothrace
I made a new edit to the answer. See if that helps.Ballistic
got it, so I tried what you mentioned and printed the shape of x before and after flatten, and I get the shape torch.Size([64, 2])... I think it is not flattening after the flatten code, so I am not able to pass it to the linear layersSamothrace
Ah, what is the shape of your input? Also, what is the shape of your x after it goes through self.bert, and what is the shape after it goes through fc1? Also, can you post a picture of your entire traceback error on CPU so I can see what line of your code the error happens on?Ballistic
I have added the image of the error in my question after the update on code. Also you can find my notebook at this link: github.com/gprashmi/Sentiment_Analysis/blob/main/…Samothrace
Ok, I have made another edit, which I am sure should solve your problem.Ballistic
got it, it works, thank you very much for your help!Samothrace
Hi, I made the changes to the code and I am able to train the model for 6 epochs. However, my training loss does not seem to decrease as training progresses. I have tried different learning rates, batch sizes, and epoch counts, and also tried without freezing the BERT parameters. Can you please take a look and let me know my mistake here? github.com/gprashmi/Sentiment_Analysis/blob/main/…Samothrace
@Ballistic Thanks my friend, I didn't know that running on CPU gives a more descriptive error.Tillfourd
