How to get cosine similarity of word embedding from BERT model

I was interesting in how to get the similarity of word embedding in different sentences from BERT model (actually, that means words have different meanings in different scenarios).

For example:

sent1 = 'I like living in New York.'
sent2 = 'New York is a prosperous city.'

I want to get the cos(New York, New York)'s value from sent1 and sent2, even if the phrase 'New York' is same, but it appears in different sentence. I got some intuition from https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958/2

But I still do not know which layer's embedding I need to extract and how to caculate the cos similarity for my above example.

Thanks in advance for any suggestions!

Okay let's do this.

First you need to understand that BERT has 13 layers. The first layer is basically just the embedding layer that BERT gets passed during the initial training. You can use it but probably don't want to since that's essentially a static embedding and you're after a dynamic embedding. For simplicity I'm going to only use the last hidden layer of BERT.

Here you're using two words: "New" and "York". You could treat this as one during preprocessing and combine it into "New-York" or something if you really wanted. In this case I'm going to treat it as two separate words and average the embedding that BERT produces.

This can be described in a few steps:

Tokenize the inputs
Determine where the tokenizer has word_ids for New and York (suuuuper important)
Pass through BERT
Average
Cosine similarity

First, what you need to import: from transformers import AutoTokenizer, AutoModel

Now we can create our tokenizer and our model:

tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = model = AutoModel.from_pretrained('bert-base-cased', output_hidden_states=True).eval()

Make sure to use the model in evaluation mode unless you're trying to fine tune!

Next we need to tokenize (step 1):

tok1 = tokenizer(sent1, return_tensors='pt')
tok2 = tokenizer(sent2, return_tensors='pt')

Step 2. Need to determine where the index of the words match

# This is where the "New" and "York" can be found in sent1
sent1_idxs = [4, 5]
sent2_idxs = [0, 1]

tok1_ids = [np.where(np.array(tok1.word_ids()) == idx) for idx in sent1_idxs]
tok2_ids = [np.where(np.array(tok2.word_ids()) == idx) for idx in sent2_idxs]

The above code checks where the word_ids() produced by the tokenizer overlap the word indices from the original sentence. This is necessary because the tokenizer splits rare words. So if you have something like "aardvark", when you tokenize it and look at it you actually get this:

In [90]: tokenizer.convert_ids_to_tokens( tokenizer('aardvark').input_ids)
Out[90]: ['[CLS]', 'a', '##ard', '##var', '##k', '[SEP]']

In [91]: tokenizer('aardvark').word_ids()
Out[91]: [None, 0, 0, 0, 0, None]

Step 3. Pass through BERT

Now we grab the embeddings that BERT produces across the token ids that we've produced:

with torch.no_grad():
    out1 = model(**tok1)
    out2 = model(**tok2)

# Only grab the last hidden state
states1 = out1.hidden_states[-1].squeeze()
states2 = out2.hidden_states[-1].squeeze()

# Select the tokens that we're after corresponding to "New" and "York"
embs1 = states1[[tup[0][0] for tup in tok1_ids]]
embs2 = states2[[tup[0][0] for tup in tok2_ids]]

Now you will have two embeddings. Each is shape (2, 768). The first size is because you have two words we're looking at: "New" and "York. The second size is the embedding size of BERT.

Step 4. Average

Okay, so this isn't necessarily what you want to do but it's going to depend on how you treat these embeddings. What we have is two (2, 768) shaped embeddings. You can either compare New to New and York to York or you can combine New York into an average. I'll just do that but you can easily do the other one if it works better for your task.

avg1 = embs1.mean(axis=0)
avg2 = embs2.mean(axis=0)

Step 5. Cosine sim

Cosine similarity is pretty easy using torch:

torch.cosine_similarity(avg1.reshape(1,-1), avg2.reshape(1,-1))

# tensor([0.6440])

This is good! They point in the same direction. They're not exactly 1 but that can be improved in several ways.

You can fine tune on a training set
You can experiment with averaging different layers rather than just the last hidden layer like I did
You can try to be creative in combining New and York. I took the average but maybe there's a better way for your exact needs.

Recommended topics

Hot tags