How to remove a word completely from a Word2Vec model in gensim?

Given a model, e.g.

from gensim.models.word2vec import Word2Vec


documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of EPS",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"]

texts = [d.lower().split() for d in documents]

w2v_model = Word2Vec(texts, size=5, window=5, min_count=1, workers=10)

It's possible to remove a word from the w2v vocabulary, e.g.

# Originally, it's there.
>>> print(w2v_model['graph'])
[-0.00401433  0.08862179  0.08601206  0.05281207 -0.00673626]

>>> print(w2v_model.wv.vocab['graph'])
Vocab(count:3, index:5, sample_int:750148289)

# Find most similar words.
>>> print(w2v_model.most_similar('graph'))
[('binary', 0.6781558990478516), ('a', 0.6284914612770081), ('unordered', 0.5971308350563049), ('perceived', 0.5612867474555969), ('iv', 0.5470727682113647), ('error', 0.5346164703369141), ('machine', 0.480206698179245), ('quasi', 0.256790429353714), ('relation', 0.2496253103017807), ('trees', 0.2276223599910736)]

# We can delete it from the dictionary
>>> del w2v_model.wv.vocab['graph']
>>> print(w2v_model['graph'])
KeyError: "word 'graph' not in vocabulary"

But when we query for words similar to another word after deleting 'graph', we see the word graph popping up in the results, e.g.

>>> w2v_model.most_similar('binary')
[('unordered', 0.8710334300994873), ('ordering', 0.8463168144226074), ('perceived', 0.7764195203781128), ('error', 0.7316686511039734), ('graph', 0.6781558990478516), ('generation', 0.5770125389099121), ('computer', 0.40017056465148926), ('a', 0.2762695848941803), ('testing', 0.26335978507995605), ('trees', 0.1948457509279251)]
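
A quick check (assuming gensim 3.x attribute names) hints at why: the del above only removed the entry from the vocab dict; the vectors matrix and the index2word list that most_similar scans are untouched.

>>> print(len(w2v_model.wv.vocab))            # one fewer entry after the del
>>> print(w2v_model.wv.vectors.shape[0])      # row count unchanged
>>> print('graph' in w2v_model.wv.index2word)
True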

How to remove a word completely from a Word2Vec model in gensim?


Updated

To answer @vumaasha's comment:

could you give some details as to why you want to delete a word

  • Let's say my universe of words is all the words in the corpus, so that the dense relations between all words are learned.

  • But when I want to generate the similar words, they should only come from a subset of domain-specific words.

  • It's possible to generate more than enough candidates from .most_similar() and then filter the words, but if the space of the specific domain is small, I might be looking for a word that's ranked 1000th most similar, which is inefficient (see the sketch after this list).

  • It would be better if the words were totally removed from the word vectors, so that .most_similar() won't return words outside of the specific domain.
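
For illustration, a minimal sketch of the over-generate-then-filter workaround, using the toy model from the question; the domain_words set here is hypothetical:

# Hypothetical domain vocabulary; in practice this could be large.
domain_words = {"graph", "trees", "binary"}

# Over-generate candidates, then keep only the in-domain ones. Wasteful
# when the wanted word ranks very low among all similar words.
candidates = w2v_model.most_similar('graph', topn=1000)
in_domain = [(w, s) for w, s in candidates if w in domain_words]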

Visser answered 23/2, 2018 at 5:26 Comment(4)
could you give some details as to why you want to delete a word – Vexillum
Sorry, the motivation to delete a word is too long to type as a comment; see the updated question. It shouldn't be hard to just remove a word totally from the embedding matrix. It's just that there seems to be something I'm missing, and I'm not sure how it can be removed. Maybe it's not possible to remove because the similarity is already sort of hard-baked into the Huffman tree per word. – Visser
do you have a complete list of domain-specific keywords that you want to get in similarity results? – Vexillum
Yes, I do. But please note that removing words before training would have removed their relations to the words outside of the domain, so that's not desirable. They have to be removed after training. Think of the model as a pre-trained model that's meant to be adapted to a domain, though I'm not implying full-blown transfer learning here. – Visser

I wrote a function that removes from a KeyedVectors object all the words that aren't in a predefined word list.

import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    # vectors_norm is only populated once init_sims() has been called.
    w2v.init_sims()

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    w2v.vocab = new_vocab
    # Keep the arrays as numpy arrays so the model stays usable and savable.
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    w2v.vectors_norm = np.array(new_vectors_norm)

It rewrites all of the word-related attributes of the Word2VecKeyedVectors object.

Usage:

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")

[('beers', 0.8409687876701355),
('lager', 0.7733745574951172),
('Beer', 0.71753990650177),
('drinks', 0.668931245803833),
('lagers', 0.6570086479187012),
('Yuengling_Lager', 0.655455470085144),
('microbrew', 0.6534324884414673),
('Brooklyn_Lager', 0.6501551866531372),
('suds', 0.6497018337249756),
('brewed_beer', 0.6490240097045898)]

restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")

[('lagers', 0.6570085287094116),
('wine', 0.6217695474624634),
('bash', 0.20583480596542358),
('computer', 0.06677375733852386),
('python', 0.005948573350906372)]

Glossology answered 18/1, 2019 at 17:42 Comment(4)
I don't know if the specs have changed since, but for readers using gensim 3.7.1, you have to keep vectors as a numpy array, thus w2v.vectors = numpy.array(new_vectors). Also, you need to call w2v.init_sims() before you call this function. Last, save the model with w2v.save_word2vec_format(), not the usual w2v.save(), and load it with w2v.load_word2vec_format() as shown in the answer. Thanks @Glossology for the cool function, and hope this helps any future readers. – Chinchy
If you want your model to be savable you must use a numpy array. You can see a saveable version of the code with a few more comments at https://mcmap.net/q/741531/-gensim-word2vec-select-minor-set-of-word-vectors-from-pretrained-model. – Taeniasis
Could the model be trained incrementally after this rewrite? – Eustatius
Doesn't work for Gensim 4, see my answer. – Crystlecs

There is no direct way to do what you are looking for. However, you are not completely lost. The method most_similar is implemented in the class WordEmbeddingsKeyedVectors. You can take a look at this method and modify it to suit your needs.

The lines shown below perform the actual logic of computing the similar words. You need to replace the variable limited with vectors corresponding to the words of your interest; then you are done.

limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
dists = dot(limited, mean)
if not topn:
    return dists
best = matutils.argsort(dists, topn=topn + len(all_words), reverse=True)

Update:

limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]

If you look at this line, you can see that when restrict_vocab is used, it restricts the search to the top n words in the vocab, which is meaningful only if you have sorted the vocab by frequency. If you are not passing restrict_vocab, self.vectors_norm is what goes into limited.

The method most_similar calls another method, init_sims. This initializes the value of self.vectors_norm as shown below:

self.vectors_norm = (self.vectors / sqrt((self.vectors ** 2).sum(-1))[..., newaxis]).astype(REAL)

So, you can pick out the words that you are interested in, prepare their norms, and use the result in place of limited. This should work.
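
For concreteness, here is a minimal sketch of that suggestion, assuming gensim 3.x attribute names (wv.vocab, wv.vectors); the function name and the domain_words argument are illustrative, not part of gensim:

import numpy as np

def most_similar_in_subset(wv, word, domain_words, topn=10):
    # Keep only domain words that actually exist in the vocabulary.
    words = [w for w in domain_words if w in wv.vocab]
    # Unit-normalize the vectors of the domain words; this plays the
    # role of `limited` in most_similar.
    limited = wv.vectors[[wv.vocab[w].index for w in words]]
    limited = limited / np.linalg.norm(limited, axis=1, keepdims=True)
    # Cosine similarity of the query word against the restricted set.
    query = wv.vectors[wv.vocab[word].index]
    query = query / np.linalg.norm(query)
    dists = limited.dot(query)
    best = np.argsort(-dists)[:topn]
    return [(words[i], float(dists[i])) for i in best]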

Vexillum answered 23/2, 2018 at 14:43 Comment(1)
Thanks, but that's not exactly correct; we have to be careful here. limited here points to restrict_vocab (github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/…), which isn't a list of specified vocabulary but an integer that limits the most-similar search, see github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/… – Visser

Note that this does not trim the model per se. It trims the KeyedVectors object that the similarity look-ups are based on.

Suppose you only want to keep the top 5000 words in your model.

import numpy as np

wv = w2v_model.wv
words_to_trim = wv.index2word[5000:]
# In op's case:
# words_to_trim = ['graph']
ids_to_trim = [wv.vocab[w].index for w in words_to_trim]
# Note: surviving words keep their original .index values, so this is
# only index-safe when trimming from the tail of the vocabulary.

for w in words_to_trim:
    del wv.vocab[w]

# Remove the corresponding rows from the embedding matrix and rebuild
# the normalized vectors in place.
wv.vectors = np.delete(wv.vectors, ids_to_trim, axis=0)
wv.init_sims(replace=True)

for i in sorted(ids_to_trim, reverse=True):
    del(wv.index2word[i])

This does the job because the BaseKeyedVectors class contains the following attributes: self.vectors, self.vectors_norm, self.vocab, self.vector_size, self.index2word.

The advantage of this is that if you write the KeyedVectors using methods such as save_word2vec_format(), the file is much smaller.
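
For example, a short sketch of that save/load round-trip (the file name here is illustrative):

from gensim.models import KeyedVectors

trimmed_path = "w2v_top5000.txt"  # hypothetical output path
wv.save_word2vec_format(trimmed_path)
wv_small = KeyedVectors.load_word2vec_format(trimmed_path)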

Serica answered 22/12, 2018 at 22:52 Comment(0)

I have tried a few approaches and feel that the most straightforward way is as follows:

  1. Get the Word2Vec embeddings in text file format.
  2. Identify the lines corresponding to the word vectors that you would like to keep.
  3. Write a new text file Word2Vec embedding model.
  4. Load model and enjoy (save to binary if you wish, etc.)...

My sample code is as follows (isLatin, txtWrite, txtAppend and the file_entVecs_* paths are my own helpers, not shown here):

import re

line_no = 0 # line0 = header
numEntities=0
targetLines = []

with open(file_entVecs_txt,'r') as fp:
    header = fp.readline() # header

    while True:
        line = fp.readline()
        if line == '': #EOF
            break
        line_no += 1

        isLatinFlag = True
        for i_l, char in enumerate(line):
            if not isLatin(char): # Care only about entities that are Latin-only
                isLatinFlag = False
                break
            if char==' ': # reached separator
                ent = line[:i_l]
                break

        if not isLatinFlag:
            continue

        # Check for numbers in entity
        if re.search(r'\d', ent):
            continue

        # Check for entities with subheadings '#' (e.g. 'ENTITY/Stereotactic_surgery#History')
        if re.match(r'^ENTITY/.*#', ent):
            continue

        targetLines.append(line_no)
        numEntities += 1

# Update header with new metadata
header_new = re.sub(r'^\d+', str(numEntities), header, count=1)

# Generate the file
txtWrite('',file_entVecs_SHORT_txt)
txtAppend(header_new,file_entVecs_SHORT_txt)

line_no = 0
ptr = 0
with open(file_entVecs_txt,'r') as fp:
    while ptr < len(targetLines):
        target_line_no = targetLines[ptr]

        while (line_no != target_line_no):
            fp.readline()
            line_no+=1

        line = fp.readline()
        line_no+=1
        ptr+=1
        txtAppend(line,file_entVecs_SHORT_txt)

FYI. FAILED ATTEMPT: I tried out @Glossology's method (with the np.array modifications suggested by @Taeniasis), left it to run overnight for at least 12 hrs, and it was still stuck at getting new words from the restricted set. This is perhaps because I have a lot of entities... But my text-file method works within an hour.

FAILED CODE

# [FAILED] Stuck at Building new vocab...
import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    print('Building new vocab..')

    for i in range(len(w2v.vocab)):

        if (i%int(1e6)==0) and (i!=0):
            print(f'working on {i}')

        word = w2v.index2entity[i]
        vec = np.array(w2v.vectors[i])
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    print('Assigning new vocab')
    w2v.vocab = new_vocab
    print('Assigning new vectors')
    w2v.vectors = np.array(new_vectors)
    print('Assigning new index2entity, index2word')
    w2v.index2entity = new_index2entity
    w2v.index2word = new_index2entity
    print('Assigning new vectors_norm')
    w2v.vectors_norm = np.array(new_vectors_norm)
Crashaw answered 18/4, 2019 at 7:52 Comment(0)

Same idea as in Glossology's answer above, but for Gensim 4:

import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    new_index_to_key = []
    new_key_to_index = {}
    new_vectors = []
    for ind, word in enumerate(w2v.index_to_key):
        if word in restricted_word_set:
            new_key_to_index[word] = len(new_index_to_key)
            new_index_to_key.append(word)
            new_vectors.append(w2v.vectors[ind])
    w2v.index_to_key = new_index_to_key
    w2v.key_to_index = new_key_to_index
    w2v.vectors = np.array(new_vectors)
    # Drop any cached vector norms so that most_similar recomputes them
    # against the new, smaller matrix.
    w2v.norms = None

Usage:

from gensim.models import KeyedVectors

restricted_words = ...
vectors = KeyedVectors.load_word2vec_format(input_file)
restrict_w2v(vectors, restricted_words)
vectors.save_word2vec_format(output_file)

Checked and it works for me (Gensim 4.3.1)

Crystlecs answered 29/7, 2023 at 18:14 Comment(0)

But when I want to generate the similar words, they should only come from a subset of domain-specific words.

You can use most_similar_to_given to get the most similar word drawn from a set of your choice. The method uses cosine similarity under the hood.

Example

import gensim.downloader

w2v = gensim.downloader.load('glove-twitter-50')
w2v.most_similar_to_given("hotel", ["plane", "house", "penguin"])  # yields 'house'
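
If you need a ranked list over the whole subset rather than just the single best match, the cosine_similarities helper on KeyedVectors can be used directly; a minimal sketch (the candidate list is illustrative):

import numpy as np

candidates = ["plane", "house", "penguin"]
sims = w2v.cosine_similarities(w2v["hotel"],
                               np.array([w2v[w] for w in candidates]))
ranked = sorted(zip(candidates, sims), key=lambda t: -t[1])  # best first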
Nigel answered 10/10, 2023 at 21:44 Comment(0)
