The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1

I am trying to do text classification using a pretrained BERT model. I trained the model on my dataset, and I am now in the testing phase. I know that BERT can only take up to 512 tokens, so I wrote an if condition to check the length of the test sentence in my dataframe: if it is longer than 512, I split the sentence into sequences of 512 tokens each and then run the tokenizer encode on every sequence. The length of each sequence is 512; however, after tokenizer encode the length becomes 707 and I get this error:

The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1

Here is the code I used for the previous steps:

import math

import numpy as np
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

# `model` is the fine-tuned BERT model and `test_sentence_in_df` is the test
# sentence taken from the dataframe (both come from the earlier steps).
pred = []
if len(test_sentence_in_df.split()) > 512:
    n = math.ceil(len(test_sentence_in_df.split()) / 512)
    for i in range(n):
        if i == (n - 1):
            print(i)
            test_sentence = ' '.join(test_sentence_in_df.split()[i*512:])
        else:
            print("i in else", str(i))
            test_sentence = ' '.join(test_sentence_in_df.split()[i*512:(i+1)*512])
            # print(len(test_sentence.split()))  # here the length is 512
        tokenized_sentence = tokenizer.encode(test_sentence)
        input_ids = torch.tensor([tokenized_sentence]).cuda()
        print(len(tokenized_sentence))  # here the length is 707
        with torch.no_grad():
            output = model(input_ids)
            label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
        pred.append(label_indices)

print(pred)
Gismo answered 12/10, 2020 at 15:34 Comment(1)
Did you solve it, please? – Achromatism

This is because BERT uses WordPiece tokenization. When some of the words are not in the vocabulary, it splits them into word pieces. For example, if the word playing is not in the vocabulary, it can be split into play and ##ing. This increases the number of tokens in a given sentence after tokenization. You can specify certain parameters to get fixed-length tokenization:

tokenized_sentence = tokenizer.encode(test_sentence, padding=True, truncation=True, max_length=50, add_special_tokens=True)
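
As a quick check (a minimal sketch, assuming the same bert-base-cased tokenizer as above; the example sentence is made up), you can see the word-piece expansion directly: tokenizer.tokenize() usually returns more pieces than str.split() returns words, which is why a 512-word chunk can exceed 512 token ids:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

sentence = "Tokenization of uncommon words produces subword pieces."
print(len(sentence.split()))              # whitespace-separated words
print(len(tokenizer.tokenize(sentence)))  # word pieces, usually more

# Encoding with truncation keeps the input within BERT's 512-token limit
ids = tokenizer.encode(sentence, truncation=True, max_length=512, add_special_tokens=True)
print(len(ids))  # never more than 512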

Accredit answered 13/10, 2020 at 8:35 Comment(6)
If the encode() function doesn't work, then batch_encode_plus() definitely works. – Diametral
Just as a side note: this error is very likely to appear if a monolingual BERT model is used on another language ;) – Sweetbrier
@AshwinGeetD'Sa I'm using batch_encode_plus() and I still get this error. This is the code I use: tokenizer.batch_encode_plus(df.abstract.values, add_special_tokens=True, return_attention_mask=True, padding='longest', max_length=256, return_tensors='pt') – Colunga
What is the error? – Diametral
This does not show us how to solve the issue within the pipeline() setup of transformers. Passing these args to AutoTokenizer.from_pretrained() doesn't affect the behavior when you call the pipeline. – Dacron
For pipeline, use this: https://mcmap.net/q/663478/-runtimeerror-the-expanded-size-of-the-tensor-585-must-match-the-existing-size-514-at-non-singleton-dimension-1 – Sassoon

If you are running a transformer model with HuggingFace, there is a chance that one of the input sentences is longer than 512 tokens. Either truncate or split your sentences. I suspect the shorter sentences are padded to 512 tokens.
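
A minimal sketch of the splitting option (assuming the bert-base-cased tokenizer and the trained model from the question; chunk_by_tokens is a hypothetical helper): split by token ids rather than by whitespace words, so every chunk, including the [CLS]/[SEP] special tokens, stays within the 512-token limit:

import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

def chunk_by_tokens(text, max_len=512):
    # Tokenize first, then split the ids, so no chunk exceeds the model's limit.
    ids = tokenizer.encode(text, add_special_tokens=False)
    body = max_len - 2  # leave room for [CLS] and [SEP]
    for start in range(0, len(ids), body):
        chunk = [tokenizer.cls_token_id] + ids[start:start + body] + [tokenizer.sep_token_id]
        yield torch.tensor([chunk])

# Usage with the model from the question:
# for input_ids in chunk_by_tokens(test_sentence_in_df):
#     with torch.no_grad():
#         output = model(input_ids.cuda())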

Irenairene answered 13/7, 2022 at 18:38 Comment(0)