I would like to use the state-of-the-art T5 language model to get sentence embedding vectors. I found this repository: https://github.com/UKPLab/sentence-transformers As far as I know, with BERT you take the first token, the [CLS] token, as the sentence embedding. In this repository I see the same behaviour applied to the T5 model:
cls_tokens = output_tokens[:, 0, :] # CLS token is first token
Is this behaviour correct? I took the encoder from T5 and encoded two phrases with it:
"I live in the kindergarden"
"Yes, I live in the kindergarden"
The cosine similarity between them was only 0.2420.
I just need to understand how sentence embeddings work: do I have to train the network on a similarity task to get correct results, or is a base pretrained language model enough?
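For reference, a minimal sketch of the first-token pooling described above, using a dummy tensor in place of a real T5 encoder output (the shapes and variable names are assumptions, not the repository's code). Note that T5 is not pretrained with a [CLS] objective, so its first encoder token carries no special sentence-level meaning, which is one likely reason for the low similarity score:

```python
import torch
import torch.nn.functional as F

# Dummy encoder output standing in for output.last_hidden_state from a
# T5 encoder forward pass: (batch=2, seq_len=6, hidden=8).
torch.manual_seed(0)
hidden_states = torch.randn(2, 6, 8)

# First-token ("CLS-style") pooling, as in the snippet above.
cls_tokens = hidden_states[:, 0, :]  # shape (2, 8)

# Cosine similarity between the two resulting sentence vectors.
sim = F.cosine_similarity(cls_tokens[0], cls_tokens[1], dim=0)
print(cls_tokens.shape, sim.item())
```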
(output.last_hidden_state * attn.unsqueeze(-1)).sum(dim=-2) / attn.sum(dim=-1, keepdim=True)  # mean pooling over non-padded tokens; keepdim=True so the division broadcasts
– Khan
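The comment above suggests mean pooling over the non-padded tokens instead of taking the first token. A runnable sketch of that pooling, with dummy tensors standing in for a real model output (the names `last_hidden_state` and `attn` mirror the comment; the shapes are assumptions):

```python
import torch

torch.manual_seed(0)
last_hidden_state = torch.randn(2, 6, 8)       # (batch, seq_len, hidden)
attn = torch.tensor([[1, 1, 1, 1, 0, 0],       # attention mask: 1 = real token,
                     [1, 1, 1, 1, 1, 1]],      # 0 = padding
                    dtype=torch.float)

# Zero out padding positions, sum over the sequence dimension, then divide
# by the number of real tokens; keepdim=True makes the division broadcast.
summed = (last_hidden_state * attn.unsqueeze(-1)).sum(dim=-2)  # (batch, hidden)
counts = attn.sum(dim=-1, keepdim=True)                        # (batch, 1)
mean_pooled = summed / counts
print(mean_pooled.shape)
```

For the first row, this is simply the mean of the first four token vectors, since the last two positions are masked out.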