Okay let's do this.
First you need to understand that BERT has 13 layers. The first layer is basically just the embedding layer that BERT gets passed during the initial training. You can use it but probably don't want to since that's essentially a static embedding and you're after a dynamic embedding. For simplicity I'm going to only use the last hidden layer of BERT.
Here you're using two words: "New" and "York". You could treat this as one during preprocessing and combine it into "New-York" or something if you really wanted. In this case I'm going to treat it as two separate words and average the embedding that BERT produces.
This can be described in a few steps:
- Tokenize the inputs
- Determine where the tokenizer has word_ids for New and York (suuuuper important)
- Pass through BERT
- Average
- Cosine similarity
First, what you need to import: from transformers import AutoTokenizer, AutoModel
Now we can create our tokenizer and our model:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = model = AutoModel.from_pretrained('bert-base-cased', output_hidden_states=True).eval()
Make sure to use the model in evaluation mode unless you're trying to fine tune!
Next we need to tokenize (step 1):
tok1 = tokenizer(sent1, return_tensors='pt')
tok2 = tokenizer(sent2, return_tensors='pt')
Step 2. Need to determine where the index of the words match
# This is where the "New" and "York" can be found in sent1
sent1_idxs = [4, 5]
sent2_idxs = [0, 1]
tok1_ids = [np.where(np.array(tok1.word_ids()) == idx) for idx in sent1_idxs]
tok2_ids = [np.where(np.array(tok2.word_ids()) == idx) for idx in sent2_idxs]
The above code checks where the word_ids() produced by the tokenizer overlap the word indices from the original sentence. This is necessary because the tokenizer splits rare words. So if you have something like "aardvark", when you tokenize it and look at it you actually get this:
In [90]: tokenizer.convert_ids_to_tokens( tokenizer('aardvark').input_ids)
Out[90]: ['[CLS]', 'a', '##ard', '##var', '##k', '[SEP]']
In [91]: tokenizer('aardvark').word_ids()
Out[91]: [None, 0, 0, 0, 0, None]
Step 3. Pass through BERT
Now we grab the embeddings that BERT produces across the token ids that we've produced:
with torch.no_grad():
out1 = model(**tok1)
out2 = model(**tok2)
# Only grab the last hidden state
states1 = out1.hidden_states[-1].squeeze()
states2 = out2.hidden_states[-1].squeeze()
# Select the tokens that we're after corresponding to "New" and "York"
embs1 = states1[[tup[0][0] for tup in tok1_ids]]
embs2 = states2[[tup[0][0] for tup in tok2_ids]]
Now you will have two embeddings. Each is shape (2, 768)
. The first size is because you have two words we're looking at: "New" and "York. The second size is the embedding size of BERT.
Step 4. Average
Okay, so this isn't necessarily what you want to do but it's going to depend on how you treat these embeddings. What we have is two (2, 768)
shaped embeddings. You can either compare New to New and York to York or you can combine New York into an average. I'll just do that but you can easily do the other one if it works better for your task.
avg1 = embs1.mean(axis=0)
avg2 = embs2.mean(axis=0)
Step 5. Cosine sim
Cosine similarity is pretty easy using torch
:
torch.cosine_similarity(avg1.reshape(1,-1), avg2.reshape(1,-1))
# tensor([0.6440])
This is good! They point in the same direction. They're not exactly 1 but that can be improved in several ways.
- You can fine tune on a training set
- You can experiment with averaging different layers rather than just the last hidden layer like I did
- You can try to be creative in combining New and York. I took the average but maybe there's a better way for your exact needs.