Mapping word vector to the most similar/closest word using spaCy

I am using spaCy as part of a topic modelling solution and I have a situation where I need to map a derived word vector to the "closest" or "most similar" word in a vocabulary of word vectors.

I see gensim has a function (WordEmbeddingsKeyedVectors.similar_by_vector) to calculate this, but I was wondering if spaCy has something like this to map a vector to a word within its vocabulary (nlp.vocab)?

Despite asked 15/2, 2019 at 21:43

After a bit of experimentation, I found a SciPy function (cdist in scipy.spatial.distance) that finds the closest vector in a vector space to an input vector.

# Imports
import numpy as np
from scipy.spatial import distance
import spacy

# Load the spacy vocabulary
nlp = spacy.load("en_core_web_lg")

# Format the input vector for use in the distance function
# In this case we will artificially create a word vector from a real word ("frog")
# but any derived word vector could be used
input_word = "frog"
p = np.array([nlp.vocab[input_word].vector])

# Format the vocabulary for use in the distance function
ids = [x for x in nlp.vocab.vectors.keys()]
vectors = [nlp.vocab.vectors[x] for x in ids]
vectors = np.array(vectors)

# *** Find the closest word below ***
closest_index = distance.cdist(p, vectors).argmin()
word_id = ids[closest_index]
output_word = nlp.vocab[word_id].text
# output_word is identical, or very close, to the input word
Despite answered 16/2, 2019 at 18:44

Yes, spaCy has an API method for this, analogous to KeyedVectors.similar_by_vector:

import numpy as np
import spacy

nlp = spacy.load("en_core_web_lg")

your_word = "king"

ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]),
    n=10,
)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
distances = ms[2]
print(words)
# ['King', 'KIng', 'king', 'KING', 'kings', 'KINGS', 'Kings', 'PRINCE', 'Prince', 'prince']

(The words are not properly case-normalized in en_core_web_lg, but you could play with other models and observe a more representative output.)
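
If the case variants are noise for your purposes, a small post-filter over the returned words helps. This is just a sketch over the words list from the snippet above, not part of the spaCy API:

#hypothetical post-filter: keep the first occurrence of each lowercased form
seen = set()
unique_words = []
for w in words:
    lw = w.lower()
    if lw not in seen:
        seen.add(lw)
        unique_words.append(w)
print(unique_words)  # e.g. ['King', 'kings', 'PRINCE']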

Fortyfour answered 12/10, 2020 at 10:36
Comment from Charitycharivari: Thanks, this works for the English model but not the German model de_core_news_sm (it threw key error [E058]); de_core_news_md worked.

A word of caution on this answer. Word similarity in gensim, spaCy, and NLTK traditionally uses cosine similarity, while by default SciPy's cdist uses Euclidean distance. Cosine distance is not the same as cosine similarity, but the two are related. To duplicate gensim's calculation, change your cdist call to the following:

distance.cdist(p, vectors, metric='cosine').argmin()

However, you should also note that SciPy measures cosine distance, which is "backwards" from cosine similarity: cosine distance = 1 - cos x (where x is the angle between the vectors). So to match/duplicate the gensim numbers, you must subtract your answer from one (and, of course, take the max argument, since similar vectors are closer to 1). It is a very subtle difference but can cause a great deal of confusion.

Similar vectors should have a large (near 1) similarity, while the distance is small (close to zero).

Cosine similarity can be negative (meaning the vectors have opposite directions) but their DISTANCE will be positive (as distance should be).
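
To see the relationship concretely, here is a minimal sketch with two arbitrary vectors (the values are illustrative):

import numpy as np
from scipy.spatial import distance

a = np.array([[1.0, 2.0, 3.0]])
b = np.array([[2.0, 4.0, 7.0]])

cos_dist = distance.cdist(a, b, metric='cosine')[0, 0]  #scipy: 1 - cos(x)
cos_sim = 1.0 - cos_dist                                #gensim-style similarity
print(cos_dist, cos_sim)  #small distance, similarity near 1 for similar vectors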

source: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html

https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.n_similarity.html#gensim.models.Word2Vec.n_similarity

Also, to compute similarity directly in spaCy:

import spacy
nlp = spacy.load("en_core_web_md")
x = nlp("man")
y = nlp("king")
print(x.similarity(y))
print(x.similarity(x))
Marlowe answered 4/11, 2019 at 1:42
Comment from Misguided: x.similarity was quick enough for me to iterate over all words in the vocabulary for a small number of cases.
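
For reference, a brute-force version of that loop might look like the sketch below. It is slow for large vocabularies, and it iterates the vector keys rather than nlp.vocab, since not every vector has a cached lexeme:

import spacy

nlp = spacy.load("en_core_web_md")
query = nlp("king")

#score every vector key against the query: O(vocabulary) similarity calls
best = max(
    (nlp.vocab[key] for key in nlp.vocab.vectors),
    key=lambda lex: lex.similarity(query),
)
print(best.text)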

This is an example of a similarity search over feature vectors of dimension 300 (1.2 kB per vector as 32-bit floats).

You can store the word vectors in a geometric data structure, sklearn.neighbors.BallTree, to speed up the search significantly while avoiding the high-dimensional losses associated with k-d trees (which give no speedup once the dimension exceeds ~100). The tree can be pickled and unpickled easily and held in memory if you need to avoid loading spaCy. See below for implementation details.
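
As a minimal point of reference before the full spaCy version below, BallTree construction and querying on random vectors looks like this (the data is illustrative):

import numpy as np
from sklearn.neighbors import BallTree

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 300))  #1000 vectors of dimension 300
tree = BallTree(X)
dist, ind = tree.query(X[:1], k=5)    #5 nearest neighbours of the first vector
print(ind[0])  #indices into X, nearest first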


The other answers, which use linear search, work (just be careful when using cosine similarity if any of your vectors are zero), but will be slow for large vocabularies. spaCy's en_core_web_lg model has about 680k words with word vectors. Since each vector is 300 32-bit floats (~1.2 kB), holding them all can take on the order of a gigabyte of memory.

We can make the search case-insensitive and remove infrequent words using a word frequency table (as of v3.0, spaCy has the tables built in, but they now have to be loaded separately), trimming the vocabulary down to ~100k words. However, the search is still linear and can take a couple of seconds, which may not be acceptable.

There are libraries that perform similarity searches very quickly, but they can be cumbersome and complicated to install, and are meant for feature sets in the MB or GB range, with GPU speedups and the rest.

We also may not want to load the entire spaCy vocabulary every time the application runs, so we pickle/unpickle the trimmed vocabulary as needed.

import spacy, numpy, pickle
import sklearn.neighbors as nbs
from spacy.lookups import load_lookups

#load spaCy
nlp = spacy.load("en_core_web_lg")

#load the lexeme probability table (shipped separately as of spaCy v3)
lookups = load_lookups("en", ["lexeme_prob"])
nlp.vocab.lookups.add_table("lexeme_prob", lookups.get_table("lexeme_prob"))

#get lowercase words with vectors above a frequency threshold (log prob >= -18)
words = [word for word in nlp.vocab.strings
         if nlp.vocab.has_vector(word) and word.islower() and nlp.vocab[word].prob >= -18]
wordvecs = numpy.array([nlp.vocab.get_vector(word) for word in words])  #get word vectors
tree = nbs.BallTree(wordvecs)  #create the ball tree
to_vec = dict(zip(words, wordvecs))  #word -> vector lookup (avoid shadowing the builtin dict)

After trimming the vocabulary, we can pickle the words, the lookup dict, and the ball tree, then load them when needed without having to load spaCy again:

#pickle/unpickle the trimmed vocabulary if you don't want to load spaCy again
with open('balltree.pkl', 'wb') as f:
    pickle.dump((words, to_vec, tree), f, protocol=pickle.HIGHEST_PROTOCOL)
#...
#load the words, lookup dict, and ball tree from the pickle file
with open('balltree.pkl', 'rb') as f:
    words, to_vec, tree = pickle.load(f)

Given a word, get its word vector, search the tree for the indices of the closest words, then look those words up in the list:

#get word vector and look up nearest words
def nearest_words(word):
    #get the query word's vector; if the word is not in the vocab, use a zero vector
    try:
        vec = to_vec[word]
    except KeyError:
        vec = numpy.zeros(300)

    #perform nearest-neighbour search over the word-vector vocabulary
    dist, ind = tree.query([vec], k=10)

    #look up the nearest words using the indices from the tree
    near_words = [words[i] for i in ind[0]]

    return near_words
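
A hypothetical call, assuming the setup above (the output shown is illustrative):

print(nearest_words("frog"))
#e.g. ['frog', 'toad', 'frogs', ...] (illustrative)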
Gadget answered 7/2, 2021 at 20:47

# python -m spacy download en_core_web_md
import spacy

nlp = spacy.load('en_core_web_md')
word = 'overflow'
nwords = 10

doc = nlp(word)
vector = doc.vector  #the document vector (here, a single word's vector)

#map a vector key (string hash) back to its text
vect2word = lambda idx: nlp.vocab.strings[idx]

#most_similar expects a 2D array; [0][0] holds the keys of the matches
keys = nlp.vocab.vectors.most_similar(vector.reshape(1, -1), n=nwords)[0][0]
print([vect2word(key) for key in keys])
Eidson answered 14/11, 2021 at 20:38
Comment from Bishopric: Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct.
