Sentence embedding using T5
I would like to use the state-of-the-art language model T5 to get a sentence embedding vector. I found this repository: https://github.com/UKPLab/sentence-transformers. As far as I know, with BERT I should take the first token, the [CLS] token, and it will be the sentence embedding. In this repository I see the same behaviour applied to the T5 model:

cls_tokens = output_tokens[:, 0, :]  # CLS token is first token

Is this behaviour correct? I took the encoder from T5 and encoded two phrases with it:

"I live in the kindergarden"
"Yes, I live in the kindergarden"

The cosine similarity between them was only 0.2420.
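
Roughly, what I did looks like the following sketch (assuming the Hugging Face transformers t5-base checkpoint and first-token pooling, as in the repository snippet above):

import torch
import torch.nn.functional as F
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5EncoderModel.from_pretrained("t5-base")

batch = tokenizer(["I live in the kindergarden",
                   "Yes, I live in the kindergarden"],
                  padding=True, return_tensors="pt")

with torch.no_grad():
    output = model(**batch)

# take the first token of each sequence, as the repository does for BERT
cls_tokens = output.last_hidden_state[:, 0, :]
print(F.cosine_similarity(cls_tokens[0], cls_tokens[1], dim=0).item())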

I just need to understand how sentence embeddings work here: do I have to train a network on a similarity task to get correct results, or is a base pretrained language model enough?

Maureen answered 28/10, 2020 at 18:35 Comment(0)

In order to obtain a sentence embedding from T5, you need to take the last_hidden_state from the T5 encoder output:

# s and attn are the input_ids and attention_mask produced by the tokenizer
output = model.encoder(input_ids=s, attention_mask=attn, return_dict=True)
pooled_sentence = output.last_hidden_state  # shape is [batch_size, seq_len, hidden_size]
# pooled_sentence holds one embedding per token in the sentence;
# sum/average over the token dimension to get a single sentence vector
pooled_sentence = torch.mean(pooled_sentence, dim=1)

You now have a sentence embedding from T5.
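
Put together, a minimal end-to-end sketch of this approach (assuming the Hugging Face transformers library and the t5-base checkpoint; variable names are illustrative):

import torch
import torch.nn.functional as F
from transformers import T5Tokenizer, T5EncoderModel

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5EncoderModel.from_pretrained("t5-base")

batch = tokenizer(["I live in the kindergarden",
                   "Yes, I live in the kindergarden"],
                  padding=True, return_tensors="pt")

with torch.no_grad():
    output = model(input_ids=batch.input_ids,
                   attention_mask=batch.attention_mask)

# mean-pool the token embeddings into one vector per sentence
embeddings = output.last_hidden_state.mean(dim=1)  # [batch_size, hidden_size]
print(F.cosine_similarity(embeddings[0], embeddings[1], dim=0).item())

Unlike BERT, T5 was not trained with a [CLS]-style sentence token, so the first encoder state is just the embedding of the first word piece; that is why pooling over all tokens is the more sensible choice here.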

Shaina answered 28/10, 2020 at 23:20 Comment(2)
Also, note that you have to take the attention mask into account when doing the mean, so padding tokens are ignored: (output.last_hidden_state * attn.unsqueeze(-1)).sum(dim=-2) / attn.sum(dim=-1, keepdim=True) – Khan
To back up this idea further, in Sentence-T5 they show that the mean token embedding is a great choice for T5. – Khan
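
A minimal sketch of the masked mean from the first comment (assuming output and attn are the encoder output and attention mask from the answer above):

# zero out padding positions, then divide by the number of real tokens per sentence
mask = attn.unsqueeze(-1).float()                      # [batch_size, seq_len, 1]
summed = (output.last_hidden_state * mask).sum(dim=1)
embeddings = summed / mask.sum(dim=1)                  # [batch_size, hidden_size]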
