List most similar words in spaCy in pretrained model

With Gensim, after I've trained my own model, I can use model.wv.most_similar('cat', topn=5) and get a list of the 5 words that are closest to cat in the vector space. For example:

from gensim.models import Word2Vec
model = Word2Vec.load('mymodel.model')

In: model.wv.most_similar('cat', topn=5)
Out: [('kitten', 0.99),
      ('dog', 0.98),
      ...]

With spaCy, as per the documentation, I can do:

import spacy

nlp = spacy.load('en_core_web_md')
tokens = nlp(u'dog cat banana')

for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))

which gives similarity for tokens in a specified string. But combing through the docs and searching, I can't figure out if there is a gensim-type way of listing all similar words for a preloaded model with either nlp = spacy.load('en_core_web_lg') or nlp = spacy.load('en_vectors_web_lg'). Is there a way to do this?

Aparicio answered 28/8, 2019 at 17:27 Comment(0)

I used Andy's response and it worked correctly, but slowly. To speed it up I took the approach below.

Under the hood, spaCy uses cosine similarity to compute .similarity. I therefore replaced word.similarity(w) with an optimized counterpart, cosine_similarity_numba(w.vector, word.vector), shown below, which uses the Numba library to speed up the computation. Replace the sorting line (line 12 of that snippet, the by_similarity = sorted(...) call) in the most_similar method from the answer below with this line:

by_similarity = sorted(queries, key=lambda w: cosine_similarity_numba(w.vector, word.vector), reverse=True)

This made the method 2-3 times faster, which was essential for me.

import numpy as np
from numba import jit

@jit(nopython=True)
def cosine_similarity_numba(u: np.ndarray, v: np.ndarray):
    assert u.shape[0] == v.shape[0]
    # Single pass over the components: dot product and squared norms.
    uv = 0
    uu = 0
    vv = 0
    for i in range(u.shape[0]):
        uv += u[i] * v[i]
        uu += u[i] * u[i]
        vv += v[i] * v[i]
    cos_theta = 1
    if uu != 0 and vv != 0:
        cos_theta = uv / np.sqrt(uu * vv)
    return cos_theta
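
For reference, here is what the patched helper looks like once that line is swapped in. This is a sketch assembled from this answer and the one below; the name most_similar_fast is mine, and en_core_web_lg is assumed:

import numpy as np
import spacy

nlp = spacy.load('en_core_web_lg')

def most_similar_fast(word, topn=5):
    word = nlp.vocab[str(word)]
    # Same candidate filter as the original helper: match casing, keep
    # reasonably frequent words, and skip entries without a vector.
    queries = [
        w for w in word.vocab
        if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)
    ]
    # Sort with the Numba-compiled cosine similarity instead of word.similarity.
    by_similarity = sorted(
        queries,
        key=lambda w: cosine_similarity_numba(w.vector, word.vector),
        reverse=True)
    return [(w.lower_, w.similarity(word)) for w in by_similarity[:topn + 1] if w.lower_ != word.lower_]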

I explained it in more detail in this article: How to Build a Fast “Most-Similar Words” Method in SpaCy

Asyut answered 3/10, 2020 at 5:35 Comment(0)

It's not implemented out of the box. However, based on this issue (https://github.com/explosion/spaCy/issues/276), here is code that makes it work the way you want.

import spacy
import numpy as np

nlp = spacy.load('en_core_web_lg')

def most_similar(word, topn=5):
    word = nlp.vocab[str(word)]
    # Candidate lexemes: same casing as the query, reasonably frequent
    # (log probability >= -15), and with a non-zero vector.
    queries = [
        w for w in word.vocab
        if w.is_lower == word.is_lower and w.prob >= -15 and np.count_nonzero(w.vector)
    ]
    by_similarity = sorted(queries, key=lambda w: word.similarity(w), reverse=True)
    # Take topn+1 and drop the query word itself if it is among them.
    return [(w.lower_, w.similarity(word)) for w in by_similarity[:topn + 1] if w.lower_ != word.lower_]

most_similar("dog", topn=3)
Fulgurate answered 19/11, 2019 at 10:41 Comment(3)
Shouldn't you iterate over all the words in nlp.vocab rather than word.vocab? – Marxist
@Romain, I tried this with very poor results. Could you post the topn=3 most_similar of dog? – Fallible
Indeed. The example yields: [('she', 0.42), ('when', 0.41), ('he', 0.39)], which is totally ridiculous and unfortunately is reproduced on many sites. Upvoted for the link to the relevant issue, though. – Luzern

Here is a performance check of the methods for obtaining a list of most similar words. In some ways this is an extreme case, since the model has neither w.prob nor w.cluster to narrow down the search space. I used four methods: the two mentioned above, most_similar from spaCy, and most_similar from Gensim:
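
Note: the snippets below assume wv (a Gensim KeyedVectors holding the embeddings) and nlp_ru (a spaCy pipeline carrying the same vectors) are already loaded. A minimal setup sketch, assuming the embeddings file linked at the end of this answer; the spaCy conversion step shown is one possible v2-era approach, not necessarily the one I used:

import numpy as np
import spacy
from gensim.models import KeyedVectors

# Gensim side: load the pretrained word2vec embeddings directly
# (binary format is an assumption for this file).
wv = KeyedVectors.load_word2vec_format(
    'all.norm-sz100-w10-cb0-it1-min100.w2v', binary=True, unicode_errors='ignore')

# spaCy side: one way to get the same vectors into a pipeline is the
# spaCy v2 CLI (run once in a shell), then load the result:
#   python -m spacy init-model ru ru_vectors --vectors-loc all.norm-sz100-w10-cb0-it1-min100.w2v
nlp_ru = spacy.load('ru_vectors')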

def spacy_most_similar(word, topn=10):
    # Vectors.most_similar expects a 2D batch of query vectors and
    # returns a tuple of (keys, best_rows, scores).
    ms = nlp_ru.vocab.vectors.most_similar(
        nlp_ru(word).vector.reshape(1, nlp_ru(word).vector.shape[0]), n=topn)
    words = [nlp_ru.vocab.strings[w] for w in ms[0][0]]
    distances = ms[2]
    return words, distances

def spacy_similarity(word, topn=10):
    # Brute force over the whole vocab using spaCy's built-in .similarity.
    word = nlp_ru.vocab[str(word)]
    queries = [
        w for w in word.vocab if w.is_lower == word.is_lower and np.count_nonzero(w.vector)
    ]
    by_similarity = sorted(queries, key=lambda w: w.similarity(word), reverse=True)
    return [(w.lower_, w.similarity(word)) for w in by_similarity[:topn + 1] if w.lower_ != word.lower_]

def spacy_similarity_numba(word, topn=10):
    # Same brute force, but the sort key uses the Numba-compiled cosine similarity.
    word = nlp_ru.vocab[str(word)]
    queries = [
        w for w in word.vocab if w.is_lower == word.is_lower and np.count_nonzero(w.vector)
    ]
    by_similarity = sorted(queries, key=lambda w: cosine_similarity_numba(w.vector, word.vector), reverse=True)
    return [(w.lower_, w.similarity(word)) for w in by_similarity[:topn + 1] if w.lower_ != word.lower_]

from numba import jit

@jit(nopython=True)
def cosine_similarity_numba(u: np.ndarray, v: np.ndarray):
    assert u.shape[0] == v.shape[0]
    # Single pass over the components: dot product and squared norms.
    uv = 0
    uu = 0
    vv = 0
    for i in range(u.shape[0]):
        uv += u[i] * v[i]
        uu += u[i] * u[i]
        vv += v[i] * v[i]
    cos_theta = 1
    if uu != 0 and vv != 0:
        cos_theta = uv / np.sqrt(uu * vv)
    return cos_theta
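
A quick usage sketch for the first helper, using the same query word as the timing below; distances comes back as a (1, topn) array, hence the [0]:

words, distances = spacy_most_similar('дерево', topn=10)
# Pair each returned word with its cosine score for the single query row.
print(list(zip(words, distances[0])))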

Here are the timing results:

import timeit, functools

print(nlp_ru.vocab.vectors.shape)
arr = "дерево"  # query word ("tree" in Russian)
print(f'Gensim most_similar: {timeit.Timer(functools.partial(wv.most_similar, arr)).timeit(1)}')
print(f'Spacy most_similar: {timeit.Timer(functools.partial(spacy_most_similar, arr)).timeit(1)}')
print(f'Spacy cosine_similarity_numba: {timeit.Timer(functools.partial(spacy_similarity_numba, arr)).timeit(1)}')
print(f'Spacy similarity: {timeit.Timer(functools.partial(spacy_similarity, arr)).timeit(1)}')

(1239964, 100)
Gensim most_similar: 0.06437033399993197
Spacy most_similar: 0.4855721250000897
Spacy cosine_similarity_numba: 13.404324778000046
Spacy similarity: 60.58928110700003

All methods return identical results. As you can see, Gensim is blazingly fast compared to the others, and you don't even need to narrow down the search space. All measurements were done on CPU. The embeddings were taken from here: http://panchenko.me/data/dsl-backup/w2v-ru/all.norm-sz100-w10-cb0-it1-min100.w2v

Stop answered 22/10, 2020 at 12:50 Comment(1)
I tried this spacy_similarity, but X.vocab is returning just 300 elements, with very bad results. – Fallible