max_seq_length for transformer (Sentence-BERT)

I'm using sentence-BERT from Huggingface in the following way:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
model.max_seq_length = 512
model.encode(text)

When the text is long and contains more than 512 tokens, encode does not throw an exception. I assume it silently truncates the input to 512 tokens.
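To sanity-check that assumption, I count the tokens myself (assuming model.tokenizer is the same tokenizer that encode uses under the hood):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
model.max_seq_length = 512

text = "this is a test " * 1000  # clearly longer than 512 tokens

# Count tokens with the model's own tokenizer; no truncation is applied here.
tokens = model.tokenizer(text, return_attention_mask=False, return_token_type_ids=False)
print(len(tokens.input_ids), model.max_seq_length)  # the token count far exceeds the limit

# encode() still returns an embedding without complaining, so it seems to truncate.
embedding = model.encode(text)
print(embedding.shape)  # (384,) for this model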

How can I make it throw an exception when the input length is larger than max_seq_length?

Further, what is the maximum possible max_seq_length for all-MiniLM-L6-v2?

Parodist answered 31/3, 2023 at 17:29 Comment(0)

First of all, it should be noted that the sentence transformer can use a different sequence length than the underlying transformer supports. You can check both values with:

# that's the sentence transformer
print(model.max_seq_length)
# that's the underlying transformer
print(model[0].auto_model.config.max_position_embeddings)

Output:

256
512

That means the position embedding layer of the underlying transformer has 512 positions, but the sentence transformer only uses, and was only trained with, the first 256 of them. You should therefore be careful when increasing the value above 256: it works from a technical perspective, but the position embedding weights beyond 256 are not properly trained and can therefore degrade your results. Please also check this StackOverflow post.
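If you do decide to raise it anyway, a minimal sketch (my own suggestion, using only the attributes shown above) is to cap the value at the position-embedding size of the underlying transformer:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

requested = 512  # example target; positions above 256 are not well trained for this model
upper_bound = model[0].auto_model.config.max_position_embeddings

# Never ask for more positions than the embedding layer actually has.
model.max_seq_length = min(requested, upper_bound)
print(model.max_seq_length)  # 512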

Regarding throwing an exception: I don't think that is offered by the library, so you have to write a workaround yourself, for example:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

my_text = "this is a test "*1000

try:
  o = model[0].tokenizer(my_text, return_attention_mask=False, return_token_type_ids=False)
  if len(o.input_ids) > model.max_seq_length:
    raise ValueError("Oh no!")
except ValueError:
  ...


model.encode(my_text)
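If you need this in more than one place, you can wrap the check into a small helper. The name encode_strict is just something I made up; it is not part of sentence-transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def encode_strict(model, text):
    # Raise instead of silently truncating when the input is too long.
    ids = model[0].tokenizer(text, return_attention_mask=False, return_token_type_ids=False).input_ids
    if len(ids) > model.max_seq_length:
        raise ValueError(f"{len(ids)} tokens exceed max_seq_length={model.max_seq_length}")
    return model.encode(text)

encode_strict(model, "a short sentence")          # works
# encode_strict(model, "this is a test " * 1000)  # raises ValueError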
Comeau answered 1/4, 2023 at 20:10 Comment(4)
Do you know by any chance how useful the model will be for strings > 128 tokens? According to the website, they trained with maximum input lengths of 128 tokens. (huggingface.co/sentence-transformers/all-MiniLM-L6-v2) - Marlysmarmaduke
Sorry, I don't know that. You could look for a dataset with long sequences and benchmark the model. - Comeau
Thanks =) I will ask on Huggingface first and see what they say. - Marlysmarmaduke
I got an answer*. The experience seems to be that it does not perform well. huggingface.co/BAAI/bge-small-en-v1.5 was recommended to me. *) huggingface.co/sentence-transformers/all-MiniLM-L6-v2/… - Marlysmarmaduke
