HuggingFace: ValueError: expected sequence of length 165 at dim 1 (got 128)

Asked 17/2, 2022 at 23:45 Answered 19/8 at 22:50

Solved python deep-learning pytorch huggingface-transformers bert-language-model

I am trying to fine-tune the BERT language model on my own data. I've gone through the Hugging Face docs, but the tasks they cover are not quite what I need, since my end goal is embedding text. Here's my code:

from datasets import load_dataset
from transformers import BertTokenizerFast, AutoModel, TrainingArguments, Trainer
import glob
import os


base_path = '../data/'
model_name = 'bert-base-uncased'
max_length = 512
checkpoints_dir = 'checkpoints'

tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)


def tokenize_function(examples):
    return tokenizer(examples['text'], padding=True, truncation=True, max_length=max_length)


dataset = load_dataset('text',
        data_files={
            'train': f'{base_path}train.txt',
            'test': f'{base_path}test.txt',
            'validation': f'{base_path}valid.txt'
        }
)

print('Tokenizing data. This may take a while...')
tokenized_dataset = dataset.map(tokenize_function, batched=True)
train_dataset = tokenized_dataset['train']
eval_dataset = tokenized_dataset['test']

model = AutoModel.from_pretrained(model_name)

training_args = TrainingArguments(checkpoints_dir)

print('Training the model...')
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()

I get the following error:

  File "train_lm_hf.py", line 44, in <module>
    trainer.train()
...
  File "/opt/conda/lib/python3.7/site-packages/transformers/data/data_collator.py", line 130, in torch_default_data_collator
    batch[k] = torch.tensor([f[k] for f in features])
ValueError: expected sequence of length 165 at dim 1 (got 128)

What am I doing wrong?

Reunite asked 17/2, 2022 at 23:45 Comment(3)

I usually get this error when the lengths of the features are not equal. For you, it seems the first feature [k] has a length of 165, and it was expecting the same length for the second one. Can you check if you have features of equal lengths? – Mesic 18/2, 2022 at 7:37

@Mesic My data is just raw text in text files, so I'm unsure how I'd change that. Perhaps something wrong in the tokenizer? – Reunite 18/2, 2022 at 20:41

I think the data is fine, but I'm not sure why the padding keyword in the tokenizer is not doing its job. Can you inspect a few elements of train_dataset manually (for example with pdb or print statements) and check their lengths? Or provide a sample data file I can test with. – Mesic 21/2, 2022 at 3:16
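
One way to run the check suggested in the last comment is to count how many examples there are of each length. This is a minimal sketch: the helper name `length_histogram` is hypothetical, and the `input_ids` list is made-up stand-in data rather than real tokenizer output. With the real dataset, the same helper could be applied to `train_dataset['input_ids']`.

```python
from collections import Counter


def length_histogram(sequences):
    """Count how many sequences there are of each length."""
    return Counter(len(seq) for seq in sequences)


# Stand-in for train_dataset['input_ids'] (hypothetical data).
input_ids = [[101, 7592, 102, 0, 0], [101, 2088, 2003, 102, 0], [101, 102, 0]]

histogram = length_histogram(input_ids)
# More than one distinct length means the examples cannot be stacked
# into a single tensor without further padding.
print(histogram)
```

If the histogram has a single key, every example already has the same length and the default collator can stack them.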

I fixed this problem by changing the tokenize function to:

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=max_length)

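(note the padding argument: 'max_length' pads every example to max_length, whereas padding=True only pads to the longest example in the batch being tokenized). Also, I used a data collator like so: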

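from transformers import DataCollatorForLanguageModeling

# Selects 15% of the tokens in each batch for masked-LM prediction and builds the labels.
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
# The labels built by the collator only yield a loss if `model` has a masked-LM head
# (e.g. loaded with AutoModelForMaskedLM rather than AutoModel).
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset
)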

Reunite answered 23/2, 2022 at 6:8 Comment(2)

Worked for me thanks. But what is the source of the problem? Is the data being tokenized too long for the given model? – Ingmar 7/6, 2023 at 5:29

Thanks for the answer, it worked for me, though I am a bit confused. A data collator is used for dynamic padding at the batch level, so we generally don't add padding in tokenize_function when we use one. I will revisit the data collator documentation to figure this out. – Rese 28/9, 2023 at 14:16
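
The dynamic padding mentioned in the last comment can be sketched without any library: each batch is padded at collation time, and only to the length of its own longest example. The function name `pad_batch`, the pad id `0`, and the token ids are assumptions for illustration; in transformers, collators such as `DataCollatorWithPadding` play this role.

```python
def pad_batch(batch, pad_id=0):
    """Pad every sequence in one batch to that batch's longest sequence."""
    longest = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in batch]
    # 1 marks a real token, 0 marks padding.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Unpadded examples of different lengths (made-up token ids).
examples = [[101, 7592, 102], [101, 2088, 2003, 2307, 102], [101, 102]]

first_batch = pad_batch(examples[:2])   # padded to length 5
second_batch = pad_batch(examples[2:])  # padded to length 2
```

Because the padding happens per batch, the tokenize function can leave the examples unpadded, and short batches do not pay for the longest example in the dataset.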

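Padding everything to a fixed max_length is inefficient. The error occurs because dataset.map with batched=True calls the tokenize function on chunks of batch_size examples (1000 by default), and padding=True pads only to the longest example within each chunk, so different chunks end up with different lengths.

Change the following line from: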

tokenized_dataset = dataset.map(tokenize_function, batched=True)

To:

tokenized_dataset = dataset.map(tokenize_function, batched=True, batch_size=2000)

I used 2000, but you can use any value that is at least the number of examples in the dataset being tokenized, so that each split is tokenized and padded as a single batch.
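
Why the chunk size matters can be seen in a small simulation. The sketch below mimics `padding=True` being applied once per chunk of a batched `map`: each chunk is padded to its own longest example, so chunks end up with different lengths unless one chunk covers the whole dataset. The helper `pad_in_chunks` and the toy lengths are assumptions, chosen so the numbers match the error message in the question; no real tokenizer is involved.

```python
def pad_in_chunks(lengths, chunk_size):
    """Return the padded length of every example when padding happens per chunk.

    `lengths` holds the unpadded token count of each example.
    """
    padded = []
    for start in range(0, len(lengths), chunk_size):
        chunk = lengths[start:start + chunk_size]
        # padding=True pads to the longest example in the current chunk only.
        padded.extend([max(chunk)] * len(chunk))
    return padded


# Toy dataset: token counts of six examples.
lengths = [40, 165, 90, 128, 17, 60]

print(pad_in_chunks(lengths, chunk_size=3))  # [165, 165, 165, 128, 128, 128]
print(pad_in_chunks(lengths, chunk_size=6))  # [165, 165, 165, 165, 165, 165]
```

With a chunk size of 3, a training batch that draws from both chunks mixes lengths 165 and 128, which is the kind of mismatch the default collator reports.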

Translocation answered 13/8 at 0:3 Comment(0)

With PyTorch, I tried to create a 2D tensor (matrix) and a 3D tensor from nested lists with differing numbers of elements, and I got the same kind of error, as shown below:

import torch

torch.tensor([[2, 7, 4], [8, 3], [5, 0, 8], [3, 6, 1]]) # Error

ValueError: expected sequence of length 3 at dim 1 (got 2)

import torch

torch.tensor([[[2, 7, 4], [8, 3, 2]], [[5, 0, 8]]]) # Error

ValueError: expected sequence of length 2 at dim 1 (got 1)

So I made the numbers of elements the same, and then I could create the 2D tensor (matrix) and the 3D tensor, as shown below. To create a 2D, 3D, 4D, ... tensor, every nested list at the same depth must have the same number of elements:

import torch
                              # ↓
torch.tensor([[2, 7, 4], [8, 3, 2], [5, 0, 8], [3, 6, 1]])

tensor([[2, 7, 4], [8, 3, 2], [5, 0, 8], [3, 6, 1]])

import torch
                                                # ↓ ↓ ↓ ↓ ↓ 
torch.tensor([[[2, 7, 4], [8, 3, 2]], [[5, 0, 8], [3, 6, 1]]])

tensor([[[2, 7, 4], [8, 3, 2]], [[5, 0, 8], [3, 6, 1]]])
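
The same rule can be checked in plain Python before a nested list is handed to `torch.tensor`. This is a minimal sketch: the helper name `nested_shape` is hypothetical and is not part of PyTorch. It returns the shape when the nesting is rectangular and raises a `ValueError` when it is ragged.

```python
def nested_shape(data):
    """Return the shape of a nested list, or raise ValueError if it is ragged."""
    if not isinstance(data, list):
        return ()  # a scalar has an empty shape
    child_shapes = [nested_shape(item) for item in data]
    # Every child at this depth must have exactly the same shape.
    if any(shape != child_shapes[0] for shape in child_shapes):
        raise ValueError(f"ragged nesting: child shapes {child_shapes}")
    return (len(data),) + (child_shapes[0] if child_shapes else ())


shape_2d = nested_shape([[2, 7, 4], [8, 3, 2], [5, 0, 8], [3, 6, 1]])
shape_3d = nested_shape([[[2, 7, 4], [8, 3, 2]], [[5, 0, 8], [3, 6, 1]]])
print(shape_2d, shape_3d)  # (4, 3) (2, 2, 3)
```

Calling the helper on the first failing example, `[[2, 7, 4], [8, 3], [5, 0, 8], [3, 6, 1]]`, raises a `ValueError` because the child shapes differ.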

Saturday answered 19/8 at 22:50 Comment(0)
