NLP Transformers: Best way to get a fixed sentence embedding-vector shape?

I'm loading a language model from torch hub (CamemBERT, a French RoBERTa-based model) and using it to embed some French sentences:

import torch
camembert = torch.hub.load('pytorch/fairseq', 'camembert.v0')
camembert.eval()  # disable dropout (or leave in train mode to finetune)


def embed(sentence):
   tokens = camembert.encode(sentence)
   # Extract all layers' features (layer 0 is the embedding layer)
   all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
   embeddings = all_layers[0]
   return embeddings

# Here we see that the shape of the embedding vector depends on the number of tokens in the sentence

u = embed(sentence="Bonjour, ça va ?")
u.shape # torch.Size([1, 7, 768])
v = embed(sentence="Salut, comment vas-tu ?")
v.shape # torch.Size([1, 9, 768])

Now imagine that, in order to do some semantic search, I want to calculate the cosine similarity between the vectors (tensors in our case) u and v:

cos = torch.nn.CosineSimilarity(dim=1)
cos(u, v) # will throw an error since the shape of `u` is different from the shape of `v`

My question is: what is the best method to use in order to always get the same embedding shape for a sentence, regardless of its number of tokens?

=> The first solution I'm thinking of is calculating the mean over axis=1 (the embedding of a sentence is the mean of its token embeddings), since axis=0 and axis=2 always have the same size:

cos = torch.nn.CosineSimilarity(dim=1)
cos(u.mean(axis=1), v.mean(axis=1)) # works now and gives 0.7269

But I'm afraid that I'm hurting the sentence embedding by calculating the mean, since it gives the same weight to each token (maybe weighting the tokens by TF-IDF would help?).
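
A minimal sketch of that TF-IDF idea, assuming per-token weights that you would have to compute yourself on your own corpus and tokenization (the weights below are hypothetical placeholders):

def weighted_mean(embeddings, weights):
    # embeddings: (1, n_tokens, 768) tensor returned by embed()
    # weights: list of n_tokens floats, e.g. hypothetical per-token TF-IDF scores
    w = torch.tensor(weights, dtype=embeddings.dtype).view(1, -1, 1)
    w = w / w.sum()                     # normalize so the weights sum to 1
    return (embeddings * w).sum(dim=1)  # fixed shape (1, 768)

# with uniform weights this reduces to the plain mean:
# weighted_mean(u, [1.0] * u.shape[1])  is equivalent to  u.mean(axis=1)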

=> The second solution is to pad the shorter sentences. That means:

  • embedding a list of sentences at a time (instead of embedding sentence by sentence)
  • finding the sentence with the most tokens and embedding it, getting its shape S
  • embedding the rest of the sentences and zero-padding them to the same shape S (the padded positions are filled with zeros); see the sketch after this list
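
A minimal sketch of this padding idea in plain torch, reusing the embed() function above (pad_sequence fills the missing token positions with zeros):

from torch.nn.utils.rnn import pad_sequence

def embed_batch(sentences):
    # embed each sentence separately; shapes differ only in the token dimension
    embeddings = [embed(s).squeeze(0) for s in sentences]  # each (n_tokens_i, 768)
    # zero-pad along the token dimension up to the longest sentence
    return pad_sequence(embeddings, batch_first=True)      # (n_sentences, max_tokens, 768)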

What are your thoughts? What other techniques would you use and why?

Thanks in advance!

Preuss answered 25/11, 2019 at 11:36 Comment(0)

Bert-as-service is a great example of doing exactly what you are asking about.

They use padding. But read the FAQ about which layer to take the representation from and how to pool it: long story short, it depends on the task.

EDIT: I am not saying "use Bert-as-service"; I am saying "rip off what Bert-as-service does."

In your example, you are getting word embeddings (because of the layer you are extracting from). Here is how Bert-as-service does that. So, it actually shouldn't surprise you that this depends on sentence length.

You then talk about getting sentence embeddings by mean pooling over word embeddings. That is... a way to do it. But, using Bert-as-service as a guide for how to get a fixed-length representation from Bert...

Q: How do you get the fixed representation? Did you do pooling or something?

A: Yes, pooling is required to get a fixed representation of a sentence. In the default strategy REDUCE_MEAN, I take the second-to-last hidden layer of all of the tokens in the sentence and do average pooling.

So, to reproduce Bert-as-service's default behavior, you'd do:

def embed(sentence):
   tokens = camembert.encode(sentence)
   # Extract all layers' features (layer 0 is the embedding layer)
   all_layers = camembert.extract_features(tokens, return_all_hiddens=True)
   # second-to-last hidden layer, as in Bert-as-service's default REDUCE_MEAN
   pooling_layer = all_layers[-2]
   embedded = pooling_layer.mean(1)  # 1 is the token dimension you want to average over
   # note: taking the mean with numpy instead would pull the tensor off the GPU
   return embedded
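
With this fixed (1, 768) shape, the cosine comparison from your question then works directly, for example:

u = embed("Bonjour, ça va ?")
v = embed("Salut, comment vas-tu ?")
cos = torch.nn.CosineSimilarity(dim=1)
cos(u, v)  # one similarity score, whatever the token counts were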
Jujitsu answered 25/11, 2019 at 15:54 Comment(2)
Thanks for your answer :) I don't want to use any other lib, I want to stay with pure torch code. Could you explain or show us how to do the padding layer?Preuss
edited answer to be more clear - I'm saying "rip off what Bert-as-service does" not "use bert-as-service"Jujitsu

This is quite a general question, as there is no one specific right answer.

As you found out, the shapes differ because you get one output per token (depending on the tokenizer, those can be subword units). In other words, you have encoded each token into its own vector. What you want is a sentence embedding, and there are a number of ways to get one (again, no single right answer).

Particularly for sentence classification, we'd often use the output of the special classification token when the language model has been trained on it (CamemBERT uses <s>). Note that depending on the model, this can be the first token (mostly BERT and children; also CamemBERT) or the last token (CTRL, GPT2, OpenAI GPT, XLNet). I would suggest using this option when available, because that token is trained exactly for this purpose.

If a [CLS] (or <s> or similar) token is not available, there are some other options that fall under the term pooling. Max and mean pooling are often used. What this means is that you take the element-wise max or the mean over all token vectors. As you say, the "danger" is that you then reduce the whole sentence to "some average" or "some max" that might not be very representative of it. However, the literature shows that this also works quite well.
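
For concreteness, a minimal sketch of both poolings, assuming `output` is the (n_tokens, 768) tensor of token vectors produced by the code further down:

mean_pooled = output.mean(dim=0)       # (768,)
max_pooled = output.max(dim=0).values  # (768,)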

As another answer suggests, the layer whose output you use can make a difference as well. IIRC the Google paper on BERT suggests that they got the best score when concatenating the last four layers. This is more advanced; a rough sketch is included below for reference.
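
A rough sketch of that last-four-layers idea with the transformers library (the exact return format depends on the library version; `token_ids` is built as in the snippet further down, and mean pooling is added on top only to end up with a fixed size):

model = CamembertModel.from_pretrained('camembert-base', output_hidden_states=True)
outputs = model(token_ids)
# in recent versions of transformers this is also available as outputs.hidden_states
hidden_states = outputs[-1]                        # embedding layer + one tensor per layer
last_four = torch.cat(hidden_states[-4:], dim=-1)  # (1, n_tokens, 4 * 768)
sentence_vec = last_four.mean(dim=1)               # mean-pool over tokens -> (1, 3072)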

I have no experience with fairseq, but using the transformers library, I'd write something like this (CamemBERT is available in the library from v2.2.0):

import torch
from transformers import CamembertModel, CamembertTokenizer

text = "Salut, comment vas-tu ?"

tokenizer = CamembertTokenizer.from_pretrained('camembert-base')

# encode() automatically adds the special tokens <s> and </s>
token_ids = tokenizer.encode(text)
tokens = [tokenizer._convert_id_to_token(idx) for idx in token_ids]
print(tokens)

# unsqueeze token_ids because batch_size=1
token_ids = torch.tensor(token_ids).unsqueeze(0)
print(token_ids)

# load model
model = CamembertModel.from_pretrained('camembert-base')

# forward method returns a tuple (we only want the last hidden states)
# squeeze() because batch_size=1
output = model(token_ids)[0].squeeze()
# only grab output of CLS token (<s>), which is the first token
cls_out = output[0]
print(cls_out.size())

Printed output is (in order) the tokens after tokenisation, the token IDs, and the final size.

['<s>', '▁Salut', ',', '▁comment', '▁vas', '-', 'tu', '▁?', '</s>']
tensor([[   5, 5340,    7,  404, 4660,   26,  744,  106,    6]])
torch.Size([768])
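
To tie this back to your semantic-search use case, you could wrap the above in a small helper and compare two sentences with cosine similarity (a sketch reusing the tokenizer and model loaded above):

def embed_cls(text):
    ids = torch.tensor(tokenizer.encode(text)).unsqueeze(0)
    with torch.no_grad():
        output = model(ids)[0].squeeze()
    return output[0]  # vector of the <s> token, always of size 768

cos = torch.nn.CosineSimilarity(dim=0)
cos(embed_cls("Bonjour, ça va ?"), embed_cls("Salut, comment vas-tu ?"))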
Sweettalk answered 25/11, 2019 at 18:26 Comment(6)
Thank you for your feedback on the solutions I'm thinking of! What does your Hugging Face transformers code add compared to pure torch code?Preuss
Do you suggest any other solution for a semantic search NLP task? For example this paper discusses the idea of adding a pooling layer arxiv.org/pdf/1908.10084.pdfPreuss
It just provides an interface to load the model and tokenizer. The other parts are purely Python/torch. Note that your code is not "pure torch" either. It's pure fairseq. So it just depends on what kind of library you wish to use.Sweettalk
As I mention in my post, what works best depends on the model and your task. CamemBERT is relatively new, so I haven't seen real-life work done with it. So it's up to you try things out and see what works best for your scenario. Pooling might work better than using the CLS token, or maybe worse. Using the penultimate layer might work better, or not. Concatenating the last four layers might work better, or not. These are all things for you to test on your downstream task.Sweettalk
Appreciate your feedback, thank you. By "pure torch" I meant no other Python package to install, just torch, but yeah, sure, it depends on fairseq :)Preuss
@julien_c Thanks! Edited in the post.Sweettalk

Take a look at sentence-transformers. Your model can be implemented as:

from sentence_transformers import SentenceTransformer, models
word_embedding_model = models.CamemBERT('camembert-base')
dim = word_embedding_model.get_word_embedding_dimension()
pooling_model = models.Pooling(dim, pooling_mode_mean_tokens=True, pooling_mode_cls_token=False, pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
sentences = ['sentence 1', 'sentence 2', 'sentence 3']
sentence_embeddings = model.encode(sentences)

In the benchmark section you can see a comparison to several embedding methods, such as Bert-as-service, which I wouldn't recommend for similarity tasks. Additionally, you can fine-tune the embeddings for your task.

It is also interesting to try a multilingual model:

model = SentenceTransformer('distiluse-base-multilingual-cased')
model.encode([...])

This will probably yield better results than mean pooling CamemBERT.
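
As a quick sanity check for the semantic-search setup, the fixed-size vectors returned by encode() can be compared directly, for example with a plain cosine similarity in numpy:

import numpy as np

emb = model.encode(["Bonjour, ça va ?", "Salut, comment vas-tu ?"])
# cosine similarity between the two fixed-size sentence vectors
np.dot(emb[0], emb[1]) / (np.linalg.norm(emb[0]) * np.linalg.norm(emb[1]))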

Oldest answered 20/4, 2020 at 14:29 Comment(0)
