I am trying to do text classification with a pretrained BERT model. I trained the model on my dataset, and I am now in the testing phase. I know that BERT can only take up to 512 tokens, so I wrote an if condition to check the length of the test sentence in my dataframe: if it is longer than 512, I split the sentence into sequences of 512 tokens each (splitting on whitespace) and then run the tokenizer encode on each sequence. The length of each sequence is 512, but after tokenizer encode the length becomes 707 and I get this error:
The size of tensor a (707) must match the size of tensor b (512) at non-singleton dimension 1
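For context, the 512 in the error matches the model's sequence-length limit. This is the quick sanity check I ran (assuming the standard bert-base-cased configuration, which is what I load below):

from transformers import BertConfig, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
config = BertConfig.from_pretrained('bert-base-cased')

print(tokenizer.model_max_length)      # 512 for bert-base-cased
print(config.max_position_embeddings)  # 512 -- the "tensor b (512)" in the error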
Here is the code I used for the previous steps:
import math

import numpy as np
import torch
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)
# `model` is my BERT model, already fine-tuned on my dataset and moved to the GPU.

pred = []
if len(test_sentence_in_df.split()) > 512:
    # Split the sentence into chunks of 512 whitespace-separated words.
    n = math.ceil(len(test_sentence_in_df.split()) / 512)
    for i in range(n):
        if i == (n - 1):
            print(i)
            test_sentence = ' '.join(test_sentence_in_df.split()[i * 512:])
        else:
            print("i in else", str(i))
            test_sentence = ' '.join(test_sentence_in_df.split()[i * 512:(i + 1) * 512])
        # print(len(test_sentence.split()))  # here the length is 512
        tokenized_sentence = tokenizer.encode(test_sentence)
        input_ids = torch.tensor([tokenized_sentence]).cuda()
        print(len(tokenized_sentence))  # here the length is 707
        with torch.no_grad():
            output = model(input_ids)
        label_indices = np.argmax(output[0].to('cpu').numpy(), axis=2)
        pred.append(label_indices)
print(pred)
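To make the length jump easier to see, here is a tiny self-contained check (the repeated word is just a placeholder; my real chunks come from the dataframe column). Even though the chunk contains exactly 512 whitespace-separated words, the encoded sequence is longer, because encode() adds the [CLS] and [SEP] special tokens and splits words into WordPiece sub-tokens:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased', do_lower_case=False)

# Hypothetical chunk of exactly 512 whitespace-separated words.
chunk = ' '.join(['preprocessing'] * 512)

print(len(chunk.split()))            # 512 words
print(len(tokenizer.encode(chunk)))  # > 512: special tokens plus WordPiece sub-tokens

So the whitespace word count and the encoded token count do not line up, which is exactly the 707 vs. 512 mismatch I am hitting above. How can I split the text so that each chunk stays within the 512-token limit after encoding?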