BERT sentence embeddings from transformers
I'm trying to get sentence vectors from hidden states in a BERT model. Looking at the huggingface BertModel instructions here, which say:

from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
model = BertModel.from_pretrained("bert-base-multilingual-cased")
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt') 
output = model(**encoded_input)

So first note, as it is on the website, this does /not/ run. You get:

>>> Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'BertTokenizer' object is not callable

But it looks like a minor change fixes it, in that you don't call the tokenizer directly, but ask it to encode the input:

encoded_input = tokenizer.encode(text, return_tensors="pt")
output = model(encoded_input)

OK, that aside, the tensors I get, however, have a different shape than I expected:

>>> output[0].shape
torch.Size([1,11,768])

This is a lot of layers. Which is the correct layer to use for sentence embeddings? [0]? [-1]? Averaging several? I have the goal of being able to do cosine similarity with these, so I need a proper 1xN vector rather than an NxK tensor.

I see that the popular bert-as-service project appears to use [0].

Is this correct? Is there documentation for what each of the layers are?

Postpone answered 18/8, 2020 at 3:0 Comment(2)
Regarding TypeError: 'BertTokenizer' object is not callable you probably have installed an older version of transformers.Stowers
Agree with @cronoik, your first example now works fine.Yeorgi

I don't think there is a single authoritative documentation saying what to use and when. You need to experiment and measure what is best for your task. Recent observations about BERT are nicely summarized in this paper: https://arxiv.org/pdf/2002.12327.pdf.

I think the rule of thumb is:

  • Use the last layer if you are going to fine-tune the model for your specific task. And fine-tune whenever you can: several hundred or even a few dozen training examples are enough.

  • Use some of the middle layers (7th or 8th) if you cannot fine-tune the model. The intuition behind this is that the layers first develop a more and more abstract and general representation of the input, and at some point the representation starts to become more targeted to the pre-training task.

bert-as-service uses the last layer by default (but it is configurable); in the Hugging Face output above, that last layer is exactly what output[0] contains. However, it always returns a list of vectors for all input tokens. The vector corresponding to the first special (so-called [CLS]) token is considered to be the sentence embedding. This is where the [0] comes from in the snippet you refer to.
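
To make that concrete, here is a minimal sketch (mine, not part of the original answer), assuming a recent transformers version where the tokenizer is directly callable: request all hidden states, pick a layer, and reduce the token vectors either by taking the [CLS] vector or by mean pooling, so you end up with a single 1x768 vector per sentence for cosine similarity.

import torch
from transformers import BertTokenizer, BertModel

# Assumes a recent transformers version (tokenizer is callable).
tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
model = BertModel.from_pretrained("bert-base-multilingual-cased",
                                  output_hidden_states=True)

def sentence_vector(text, layer=-1, pooling="cls"):
    encoded = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded)
    # output.hidden_states is a tuple of 13 tensors for bert-base models:
    # the embedding layer plus the 12 transformer layers, each [1, seq_len, 768].
    hidden = output.hidden_states[layer]
    if pooling == "cls":
        return hidden[:, 0, :]     # vector of the first ([CLS]) token -> [1, 768]
    return hidden.mean(dim=1)      # mean over all token vectors -> [1, 768]

a = sentence_vector("What is your age?", pooling="mean")
b = sentence_vector("How old are you?", pooling="mean")
print(torch.nn.functional.cosine_similarity(a, b))   # tensor of shape [1]

Which layer and which pooling work best is exactly the thing you have to measure for your task.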

Yearling answered 18/8, 2020 at 8:37 Comment(7)
Does it make sense to aggregate multiple layers, say the last and the second to last? Is a simple arithmetic mean appropriate for that operation or no?Postpone
It certainly does. In some sense, the last layer contains all the previous layers, because the model is interconnected via residual connections, i.e., after each layer, the output of the layer is summed up with the previous one. Due to the residual connections, the layers are sort of commensurable, and averaging them is just changing the ratio in which the layers were mixed previously.Horned
Sorry, and the layers are ordered such that to get the /last/ 3 layers, that would be something like: >>> output[0][:,-4:-1,:].shape. For torch.Size([1, 3, 768]) Right?Postpone
Exactly. (Btw. instead of -4:-1, you can just write -4:.)Horned
And sorry to revive an old question, but the layer subset is for sure the middle dimension of the output[0] object? This appears to vary depending on the document length.Postpone
@Yearling do you know how I can pass multiple texts instead of one? For example, instead of text = "Replace me by any text you'd like", a list of texts such as text = ["First text", "Second text"]Ahouh
Not sure if this is what you're looking for: BERTifyCarolinacaroline

While the existing answer of Jindrich is generally correct, it does not address the question entirely. The OP asked which layer he should use to calculate the cosine similarity between sentence embeddings, and the short answer to this question is: none. A metric like cosine similarity requires that the dimensions of the vector contribute equally and meaningfully, but this is not the case for the BERT weights released by the original authors. Jacob Devlin (one of the authors of the BERT paper) wrote:

I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors. It seems that this is doing average pooling over the word tokens to get a sentence vector, but we never suggested that this will generate meaningful sentence representations. And even if they are decent representations when fed into a DNN trained for a downstream task, it doesn't mean that they will be meaningful in terms of cosine distance. (Since cosine distance is a linear space where all dimensions are weighted equally).

However, that does not mean you cannot use BERT for such a task. It just means that you cannot use the pre-trained weights out-of-the-box. You can either train a classifier on top of BERT which learns which sentences are similar (using the [CLS] token), or you can use sentence-transformers, which can be used in an unsupervised scenario because they were trained to produce meaningful sentence representations.
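
As a rough illustration of the first option (this sketch is mine, not part of the answer): with Hugging Face's BertForSequenceClassification you feed both sentences in as one pair and train the classification head that sits on top of the [CLS] representation. The sentences and label below are hypothetical placeholders.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

# Tokenize the two sentences as a pair; BERT sees "[CLS] A [SEP] B [SEP]"
# and the classification head uses the [CLS] representation.
batch = tokenizer(["What is your age?"], ["How old are you?"],
                  return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor([1])   # hypothetical label: 1 = similar, 0 = not similar

outputs = model(**batch, labels=labels)
outputs.loss.backward()      # plug this into a normal training loop / optimizer

The second option is shown with concrete sentence-transformers code in the answer below.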

Stowers answered 7/10, 2020 at 4:50 Comment(6)
sentence-transformers is still limited to sentences, right? It doesn't apply to multi-sentence documents without the same kind of failure BERT has when composing words into documents, does it?Postpone
No, you can use it for whole paragraphs. @PostponeStowers
This is quite an interesting question. So, in order to look for similar sentences, you would not take the output BERT embeddings and use cosine similarity, am I right? But what if the idea is not to look for similar sentences but for similar words? I retrieve the embedding of the word and try to look for similar embeddings in other sentences.Experiential
@Experiential No that is not what I said here. I said the original BERT weights released by google were never intended to be used for finding similar sequences. You need some weights for BERT that are trained for this task. This is what sentence-transformer project does. They release weights that are trained for such an objective. Regarding your other question, are you looking for a way to determine the similarity of a word in the context of a sentence or just for synonyms?Stowers
@Stowers Thanks for your answer. When you say you need some weights for BERT that are trained for this task, do you mean retraining a new BERT, or using something already pretrained from somewhere else? My task now is to search for entities in plain text; to do so, I build embeddings from the names of the fields I want to look for, and I use BERT as well to convert the plain text into vectors. Once I have those two vectors, I retrieve the words most similar to the fields I want to look for. I do not know if BERT and this method are a valid approach to this problem. Perhaps you can guide me a bit. Thanks a lot!Experiential
@Experiential You don't need to train BERT from scratch. As written in the answer, you can either finetune BERT with a similarity task or use the weights provided by the sentence-transformers project. The other question is not really suited for stackoverflow. Maybe you can post it in the huggingface forum with a small example.Stowers

As mentioned in other answers, BERT was not meant to produce sentence-level embeddings. Now, let's look at how we can leverage the power of BERT to compute context-sensitive sentence-level embeddings.

BERT does carry context at the word level. Here is an example:

This is a wooden stick. Stick to your work.

Both sentences above contain the word 'stick', and BERT does a good job of computing an embedding for 'stick' that depends on the sentence it appears in (that is, on the context).
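
A small sketch of that claim (mine, not the answer's), assuming 'stick' stays a single token under the bert-base-uncased tokenizer: pull the contextual vector of 'stick' out of each sentence and compare the two.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def stick_vector(sentence):
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]            # [seq_len, 768]
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
    return hidden[tokens.index("stick")]                      # vector of "stick"

v1 = stick_vector("This is a wooden stick.")
v2 = stick_vector("Stick to your work.")
# The two "stick" vectors differ because the surrounding context differs.
print(torch.nn.functional.cosine_similarity(v1, v2, dim=0))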

Now, let's move on to another example:

--What is your age?

--How old are you?

The two sentences above are contextually very similar, so we need a model that can accept a sentence, text chunk, or paragraph and produce a meaningful embedding for it as a whole. Here is how that can be achieved.

Method 1:

Use a pre-trained sentence_transformers model; here is the link to the Hugging Face hub.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")

embd_a = model.encode("What is your age?")
embd_b = model.encode("How old are you?")

sim_score = cos_sim(embd_a, embd_b)
print(sim_score)

output: tensor([[0.8648]])

Now, the question may arise: how can we train our own sentence_transformer, specific to a domain? Here we go.

  1. Supervised approach:

A common challenge for data scientists or ML engineers is getting correctly annotated data; it is usually hard to get in good volume. But if you do have it, here is how we can train our own sentence_transformer (don't worry, there is an unsupervised approach too).

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('distilbert-base-nli-mean-tokens')

train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

# Tune the model
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)

More details here.

Tip: If you have a set of sentences that are similar to each other, say a CSV where columns A and B contain sentences similar to each other (i.e., each row holds one pair of similar sentences), just load the CSV, assign random values between 0.85 and 0.95 as the similarity score, and proceed.
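
A possible sketch of that tip (not from the original answer; the file name pairs.csv and the columns "A" and "B" are hypothetical placeholders):

import random
import pandas as pd
from sentence_transformers import InputExample

# Hypothetical file: pairs.csv with columns "A" and "B", each row
# holding two sentences that are similar to each other.
df = pd.read_csv("pairs.csv")
train_examples = [
    InputExample(texts=[row["A"], row["B"]],
                 label=random.uniform(0.85, 0.95))  # pseudo similarity score
    for _, row in df.iterrows()
]
# train_examples can then be fed to a DataLoader exactly as in the snippet above.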

  2. Unsupervised approach

Say you don't have a huge set of annotated data but want to train a domain-specific sentence_transformer; here is how we do it. Even for unsupervised training, data is required, i.e., a list of sentences/paragraphs, but it need not be annotated. And if you don't have any data at all, there is still a workaround (please see the last part of this answer).

Multiple approaches are available for unsupervised training; I list two of the most prominent ones below. To see a list of all available approaches, please visit here.

TSDAE link to research paper.

from sentence_transformers import SentenceTransformer, LoggingHandler
from sentence_transformers import models, util, datasets, evaluation, losses
from torch.utils.data import DataLoader

# Define your sentence transformer model using CLS pooling
model_name = 'bert-base-uncased'
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), 'cls')
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
                   "Model will automatically add the noise", 
                   "And re-construct it",
                   "You should provide at least 1k sentences"]

# Create the special denoising dataset that adds noise on-the-fly
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)

# DataLoader to batch your data
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)

# Use the denoising auto-encoder loss
train_loss = losses.DenoisingAutoEncoderLoss(model, decoder_name_or_path=model_name, tie_encoder_decoder=True)

# Call the fit method
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler='constantlr',
    optimizer_params={'lr': 3e-5},
    show_progress_bar=True
)

model.save('output/tsdae-model')

SimCSE link to research paper

from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers import models, losses
from torch.utils.data import DataLoader

# Define your sentence transformer model using CLS pooling
model_name = 'distilroberta-base'
word_embedding_model = models.Transformer(model_name, max_seq_length=32)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Define a list with sentences (1k - 100k sentences)
train_sentences = ["Your set of sentences",
                   "Model will automatically add the noise",
                   "And re-construct it",
                   "You should provide at least 1k sentences"]

# Convert train sentences to sentence pairs
train_data = [InputExample(texts=[s, s]) for s in train_sentences]

# DataLoader to batch your data
train_dataloader = DataLoader(train_data, batch_size=128, shuffle=True)

# Use the contrastive MultipleNegativesRankingLoss (the SimCSE training objective)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Call the fit method
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    show_progress_bar=True
)

model.save('output/simcse-model')

Tip: If you observe carefully, the major difference is in the loss function used. To see a list of all the loss functions applicable to such training scenarios, visit here. Also, across all the experiments I did, I found TSDAE more useful when you want decent precision and good recall, whereas SimCSE can be used when you want very high precision and low recall.

Now, if you don't have sufficient data to fine-tune the model but you can find a BERT model trained on your domain, you can directly leverage it by adding pooling and dense layers. Please do some research on what 'pooling' is, to better understand what you are doing.

from sentence_transformers import SentenceTransformer, models
from torch import nn

word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
dense_model = models.Dense(in_features=pooling_model.get_sentence_embedding_dimension(), out_features=256, activation_function=nn.Tanh())

model = SentenceTransformer(modules=[word_embedding_model, pooling_model, dense_model])

Tip: With the above approach, if you start getting extremely high cosine scores, that is an alarm signal to do negative testing. Sometimes simply adding pooling layers does not help; you should take a few examples and check the similarity scores for inputs that are not similar (it is possible that even dissimilar sentences show high similarity, and that is when you should stop, try to collect some data, and do unsupervised training).
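
A quick way to run that negative test might look like this (my sketch, not the answer's; it reuses the model assembled just above, and the sentence pairs are made-up placeholders):

from sentence_transformers.util import cos_sim

# `model` is the SentenceTransformer built in the snippet above.
dissimilar_pairs = [
    ("The invoice is due next Friday.", "Penguins cannot fly."),
    ("Reset your password in the settings menu.", "The soup needs more salt."),
]
for a, b in dissimilar_pairs:
    score = cos_sim(model.encode(a), model.encode(b))
    print(f"{a!r} vs {b!r}: {float(score):.3f}")
# If these scores also come out close to 1.0, the embeddings are not
# discriminating, and it is time to collect data and train (e.g. TSDAE/SimCSE).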

For people interested in going deeper, here is a list of topics that may help you:

  1. Pooling
  2. Siamese Networks
  3. Contrastive Loss

:) :)

Dandelion answered 22/3, 2022 at 14:59 Comment(4)
What an excellent write-up and thoughtful links. Thank you!Postpone
Excellent. Nils Reimers talks about these very techniques and their performance in this video.Kettledrummer
Excellent. I am able to train unsupervised on bert-base-uncased; is there a useful method for evaluation, similar to "EmbeddingSimilarityEvaluator", which seems to be defined for the supervised approach? (github.com/UKPLab/sentence-transformers/blob/master/examples/…)Semimonthly
While training the model, the loss function performance can be a good indicator. For the specific computation of Embeddings' Similarity performance, I guess you have to curate a dataset i.e., go with a supervised approach.Dandelion
