I would like to use the state-of-the-art T5 language model to get sentence embedding vectors. I found this repository: https://github.com/UKPLab/sentence-transformers As far as I know, with BERT you take the first token, the [CLS] token, as the sentence embedding. In this repository I see the same behaviour applied to the T5 model:
cls_tokens = output_tokens[:, 0, :] # CLS token is first token
Is this behaviour correct? I took the encoder from T5 and encoded two phrases with it:
"I live in the kindergarden"
"Yes, I live in the kindergarden"
The cosine similarity between them was only 0.2420.
I just need to understand how sentence embeddings work: do I have to train the network on a similarity task to get correct results, or is a base pretrained language model enough?
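For reference, a minimal sketch of the first-token pooling described above, using a dummy tensor in place of a real T5 encoder output (the shapes and variable names are assumptions, not the repository's code). Note that T5 is not pretrained with a [CLS] objective, so its first encoder token carries no special sentence-level meaning, which is one likely reason for the low similarity score:

```python
import torch
import torch.nn.functional as F

# Dummy encoder output standing in for output.last_hidden_state from a
# T5 encoder forward pass: (batch=2, seq_len=6, hidden=8).
torch.manual_seed(0)
hidden_states = torch.randn(2, 6, 8)

# First-token ("CLS-style") pooling, as in the snippet above.
cls_tokens = hidden_states[:, 0, :]  # shape (2, 8)

# Cosine similarity between the two resulting sentence vectors.
sim = F.cosine_similarity(cls_tokens[0], cls_tokens[1], dim=0)
print(cls_tokens.shape, sim.item())
```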
(output.last_hidden_state * attn.unsqueeze(-1)).sum(dim=-2) / attn.sum(dim=-1, keepdim=True)  # mean pooling over non-padded tokens; keepdim=True so the division broadcasts
– Khan
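The comment above suggests mean pooling over the non-padded tokens instead of taking the first token. A runnable sketch of that pooling, with dummy tensors standing in for a real model output (the names `last_hidden_state` and `attn` mirror the comment; the shapes are assumptions):

```python
import torch

torch.manual_seed(0)
last_hidden_state = torch.randn(2, 6, 8)       # (batch, seq_len, hidden)
attn = torch.tensor([[1, 1, 1, 1, 0, 0],       # attention mask: 1 = real token,
                     [1, 1, 1, 1, 1, 1]],      # 0 = padding
                    dtype=torch.float)

# Zero out padding positions, sum over the sequence dimension, then divide
# by the number of real tokens; keepdim=True makes the division broadcast.
summed = (last_hidden_state * attn.unsqueeze(-1)).sum(dim=-2)  # (batch, hidden)
counts = attn.sum(dim=-1, keepdim=True)                        # (batch, 1)
mean_pooled = summed / counts
print(mean_pooled.shape)
```

For the first row, this is simply the mean of the first four token vectors, since the last two positions are masked out.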