This is an example of a similarity search over feature vectors of dimension 300 (1.2 kB each as 32-bit floats).
You can store the word vectors in a geometric data structure, sklearn.neighbors.BallTree, to speed up the search significantly while avoiding the high-dimensional degradation associated with k-d trees (which offer no speedup once the dimension exceeds ~100). A BallTree can be pickled and unpickled easily and held in memory if you need to avoid loading spaCy. See below for implementation details. Demo, source.
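As a minimal sketch of the BallTree API (random vectors stand in for real word vectors; the sizes here are arbitrary):

```python
import numpy
import sklearn.neighbors as nbs

# 1000 random 300-dim vectors as stand-ins for word vectors
rng = numpy.random.default_rng(0)
vecs = rng.standard_normal((1000, 300)).astype(numpy.float32)

tree = nbs.BallTree(vecs)              # build the tree once
dist, ind = tree.query(vecs[:1], k=5)  # 5 nearest neighbors of the first vector
print(ind.shape)  # (1, 5); ind[0][0] == 0, since a point is its own nearest neighbor
```

Queries return both distances and indices; the indices are what you map back to words later.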
The other answers with linear search work (though note: be careful when using cosine similarity if any of your vectors are zero), but will be slow for large vocabularies. spaCy's en_core_web_lg model has about 680k words with word vectors. Since each vector is 1.2 kB, this can result in memory usage of a few GB.
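The arithmetic behind that estimate can be sketched as back-of-envelope Python (the 680k count and 1.2 kB size come from the text above; real memory use is higher due to Python/spaCy object overhead):

```python
# Back-of-envelope estimate of raw vector storage
n_words = 680_000          # words with vectors in en_core_web_lg
dim = 300                  # vector dimension
bytes_per_float = 4        # 32-bit floats

vector_bytes = dim * bytes_per_float     # 1200 B = 1.2 kB per word
total_gb = n_words * vector_bytes / 1e9  # raw vectors only
print(f"{vector_bytes} B per vector, ~{total_gb:.2f} GB raw")
```

The raw vectors alone are under a gigabyte; the "few GB" figure includes the overhead of holding them in Python objects.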
We can make our search case-insensitive and remove infrequent words using a word frequency table (as of v3.0, spaCy has built-in tables, but they now have to be loaded separately) to trim the vocabulary down to ~100k words. However, the search is still linear and can take a couple of seconds, which may not be acceptable.
There are libraries for doing similarity searches quickly, but they can be cumbersome and complicated to install, and are aimed at feature sets on the order of MBs or GBs, with GPU speedups and the rest.
We also may not want to load the entire spaCy vocabulary every time the application runs, so we pickle/unpickle the vocabulary as needed.
import spacy, numpy, pickle
import sklearn.neighbors as nbs
from spacy.lookups import load_lookups

#load spaCy
nlp = spacy.load("en_core_web_lg")
#load the lexeme probability table (a separate download as of spaCy v3.0)
lookups = load_lookups("en", ["lexeme_prob"])
nlp.vocab.lookups.add_table("lexeme_prob", lookups.get_table("lexeme_prob"))
#get lowercase words with vectors above the log-probability threshold
words = [word for word in nlp.vocab.strings
         if nlp.vocab.has_vector(word) and word.islower() and nlp.vocab[word].prob >= -18]
wordvecs = numpy.array([nlp.vocab.get_vector(word) for word in words]) #get wordvectors
tree = nbs.BallTree(wordvecs) #create the balltree
to_vec = dict(zip(words, wordvecs)) #word:vector dict (don't shadow the dict builtin)
After trimming the vocabulary, we can pickle the word list, the word:vector dict, and the BallTree, and load them when needed without having to load spaCy again:
#pickle/unpickle the balltree if you don't want to load spaCy
with open('balltree.pkl', 'wb') as f:
    pickle.dump(tree, f, protocol=pickle.HIGHEST_PROTOCOL)
#...
#load wordvector balltree from pickle file
with open('balltree.pkl', 'rb') as f:
    tree = pickle.load(f)
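A quick self-contained way to sanity-check that roundtrip (toy random vectors and a temporary file, not part of the original answer):

```python
import numpy, pickle, tempfile, os
import sklearn.neighbors as nbs

# Build a small tree over random 300-dim vectors
rng = numpy.random.default_rng(1)
vecs = rng.standard_normal((200, 300))
tree = nbs.BallTree(vecs)

# Pickle to a temp file and load it back
path = os.path.join(tempfile.mkdtemp(), "balltree.pkl")
with open(path, "wb") as f:
    pickle.dump(tree, f, protocol=pickle.HIGHEST_PROTOCOL)
with open(path, "rb") as f:
    tree2 = pickle.load(f)

# The reloaded tree answers queries identically
d1, i1 = tree.query(vecs[:3], 5)
d2, i2 = tree2.query(vecs[:3], 5)
print((i1 == i2).all() and numpy.allclose(d1, d2))  # True
```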
Given a word, get its word vector, search the tree for the indices of the closest vectors, then map those indices back to words:
#get wordvector and look up nearest words
def nearest_words(word):
    #get the vector for the query word
    try:
        vec = to_vec[word]
    #if the word is not in the vocab, fall back to the zero vector
    except KeyError:
        vec = numpy.zeros(300)
    #perform nearest neighbor search of the wordvector vocabulary
    dist, ind = tree.query([vec], 10)
    #map indices from the tree back to words
    near_words = [words[i] for i in ind[0]]
    return near_words
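The same pattern can be exercised end-to-end on a toy vocabulary (made-up 3-dim vectors instead of spaCy's 300-dim ones, purely for illustration):

```python
import numpy
import sklearn.neighbors as nbs

# Toy stand-in for the spaCy vocabulary: 3 words, 3-dim made-up vectors
words = ["cat", "dog", "car"]
wordvecs = numpy.array([[1.0, 0.0, 0.0],
                        [0.9, 0.1, 0.0],
                        [0.0, 0.0, 1.0]])
to_vec = dict(zip(words, wordvecs))
tree = nbs.BallTree(wordvecs)

def nearest_words(word, k=2):
    try:
        vec = to_vec[word]
    except KeyError:
        vec = numpy.zeros(3)  # fall back to zero vector for unknown words
    dist, ind = tree.query([vec], k)
    return [words[i] for i in ind[0]]

print(nearest_words("cat"))  # ['cat', 'dog'] -- a word is its own nearest neighbor
```

Note that the query word itself comes back as the first result, so in practice you may want to skip index 0.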
de_core_news_sm threw the key error [E058], but de_core_news_md worked. – Charitycharivari