Given a model, e.g.
from gensim.models.word2vec import Word2Vec
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]
texts = [d.lower().split() for d in documents]
w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)
It's possible to remove the word from the w2v vocabulary, e.g.
# Originally, it's there.
>>> print(w2v_model['graph'])
[-0.00401433 0.08862179 0.08601206 0.05281207 -0.00673626]
>>> print(w2v_model.wv.vocab['graph'])
Vocab(count:3, index:5, sample_int:750148289)
# Find most similar words.
>>> print(w2v_model.most_similar('graph'))
[('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]
# We can delete it from the dictionary
>>> del w2v_model.wv.vocab['graph']
>>> print(w2v_model['graph'])
KeyError: "word 'graph' not in vocabulary"
But when we do a similarity on other words after deleting graph
, we see the word graph
popping up, e.g.
>>> w2v_model.most_similar('binary')
[('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]
How to remove a word completely from a Word2Vec model in gensim?
Updated
To answer @vumaasha's comment:
could you give some details as to why you want to delete a word
Lets say my universe of words in all words in the corpus to learn the dense relations between all words.
But when I want to generate the similar words, it should only come from a subset of domain specific word.
It's possible to generate more than enough from
.most_similar()
then filter the words but lets say the space of the specific domain is small, I might be looking for a word that's ranked 1000th most similar which is inefficient.It would be better if the word is totally removed from the word vectors then the
.most_similar()
words won't return words outside of the specific domain.