Gensim Word2Vec select minor set of word vectors from pretrained model

I have a large pretrained Word2Vec model in gensim, and I want to use its word vectors for an embedding layer in my Keras model.

The problem is that the embedding size is enormous and I don't need most of the word vectors (because I know which words can occur as input). So I want to get rid of them to reduce the size of my embedding layer.

Is there a way to keep only the desired word vectors (including the corresponding indices!), based on a whitelist of words?

Glace answered 18/6, 2018 at 17:32 Comment(0)

Thanks to this answer (I've changed the code a little bit to make it better), you can use the following code to solve your problem.

We have our reduced set of words in restricted_word_set (it can be either a list or a set) and w2v is our model, so here is the function:

import numpy as np

def restrict_w2v(w2v, restricted_word_set):
    """Keep only the vectors whose words appear in restricted_word_set."""
    new_vectors = []
    new_vocab = {}
    new_index2entity = []
    new_vectors_norm = []

    for i in range(len(w2v.vocab)):
        word = w2v.index2entity[i]
        vec = w2v.vectors[i]
        vocab = w2v.vocab[word]
        vec_norm = w2v.vectors_norm[i]
        if word in restricted_word_set:
            vocab.index = len(new_index2entity)  # re-index the kept word
            new_index2entity.append(word)
            new_vocab[word] = vocab
            new_vectors.append(vec)
            new_vectors_norm.append(vec_norm)

    w2v.vocab = new_vocab
    w2v.vectors = np.array(new_vectors)
    w2v.index2entity = np.array(new_index2entity)
    w2v.index2word = np.array(new_index2entity)
    w2v.vectors_norm = np.array(new_vectors_norm)

WARNING: when you first create the model, vectors_norm == None, so you will get an error if you use this function right away. vectors_norm only gets a value of type numpy.ndarray after the first use. So before using the function, try something like most_similar("cat") so that vectors_norm is not equal to None.
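Alternatively (assuming gensim 3.x, where vectors_norm and init_sims() still exist), you can populate vectors_norm explicitly instead of running a dummy query:

w2v.init_sims()  # computes and caches the unit-normalized vectors in w2v.vectors_norm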

The function rewrites all of the word-related attributes of the Word2VecKeyedVectors instance.

Usage:

from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin.gz", binary=True)
w2v.most_similar("beer")

[('beers', 0.8409687876701355),
('lager', 0.7733745574951172),
('Beer', 0.71753990650177),
('drinks', 0.668931245803833),
('lagers', 0.6570086479187012),
('Yuengling_Lager', 0.655455470085144),
('microbrew', 0.6534324884414673),
('Brooklyn_Lager', 0.6501551866531372),
('suds', 0.6497018337249756),
('brewed_beer', 0.6490240097045898)]

restricted_word_set = {"beer", "wine", "computer", "python", "bash", "lagers"}
restrict_w2v(w2v, restricted_word_set)
w2v.most_similar("beer")

[('lagers', 0.6570085287094116),
('wine', 0.6217695474624634),
('bash', 0.20583480596542358),
('computer', 0.06677375733852386),
('python', 0.005948573350906372)]

It can also be used to remove specific words.
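Since the original goal is a Keras embedding layer, here is a rough sketch (not part of the original answer; the layer parameters are only illustrative) of how the trimmed vectors could be plugged into one:

from keras.layers import Embedding

# Map input tokens to indices with w2v.vocab[word].index so that they
# line up with the rows of the restricted vector matrix.
embedding_layer = Embedding(
    input_dim=w2v.vectors.shape[0],   # number of kept words
    output_dim=w2v.vectors.shape[1],  # embedding dimensionality
    weights=[w2v.vectors],            # the restricted vector matrix
    trainable=False,                  # keep the pretrained vectors fixed
)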

Chemosh answered 17/4, 2019 at 10:8 Comment(1)
Very helpful. Do note that the type of index2entity and index2word should be list and not ndarray (this gives problems when you call other functions like w2v.add() afterwards). Ornas

There's no built-in feature that does exactly that, but it shouldn't require much code, and could be modeled on existing gensim code. A few possible alternative strategies:

  1. Load the full vectors, then save them in an easy-to-parse format - such as via .save_word2vec_format(..., binary=False). This format is nearly self-explanatory; write your own code to drop all lines from this file that aren't on your whitelist (being sure to update the leading line's declaration of the entry count). The existing source code for load_word2vec_format() & save_word2vec_format() may be instructive. You'll then have a subset file (see the sketch after this list).

  2. Or, pretend you were going to train a new Word2Vec model, using your corpus-of-interest (with just the interesting words). But only create the model and do the build_vocab() step. Now you have an untrained model, with random vectors, but just the right vocabulary. Grab the model's wv property - a KeyedVectors instance with that right vocabulary. Then separately load the oversized vector set, and for each word in the right-sized KeyedVectors, copy over the actual vector from the larger set. Then save the right-sized subset.

  3. Or, look at the (possibly-broken-since-gensim-3.4) method on Word2Vec intersect_word2vec_format(). It more-or-less tries to do what's described in (2) above: with an in-memory model that has the vocabulary you want, merge in just the overlapping words from another word2vec-format set on disk. It'll either work, or provide the template for what you'd want to do.
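For example, a rough sketch of option (1) could look like this (kv stands for the full KeyedVectors already loaded in memory; the helper name, file paths, and whitelist variable are made up for illustration):

from gensim.models import KeyedVectors

def filter_word2vec_text(in_path, out_path, whitelist):
    # Read the text-format file, keeping only the whitelisted words.
    with open(in_path, encoding="utf8") as fin:
        _, dims = fin.readline().split()  # header: "<entry_count> <dimensions>"
        kept = [line for line in fin if line.split(" ", 1)[0] in whitelist]
    with open(out_path, "w", encoding="utf8") as fout:
        fout.write("%d %s\n" % (len(kept), dims))  # updated entry count
        fout.writelines(kept)

# Save the full vectors as text, filter them, then reload the subset.
kv.save_word2vec_format("full_vectors.txt", binary=False)
filter_word2vec_text("full_vectors.txt", "subset_vectors.txt", whitelist)
subset = KeyedVectors.load_word2vec_format("subset_vectors.txt", binary=False)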

Freda answered 18/6, 2018 at 20:1 Comment(0)

Some years ago, I wrote a utility package called embfile for working with "embedding files" (but I published it only in 2020). It supports various formats:

  • .txt (with or without a "header row")
  • .bin, Google Word2Vec format
  • .vvm, a custom format I use (it's just a TAR file with vocabulary, vectors and metadata in separate files, so that the vocabulary can be read entirely in a fraction of a second and vectors can be randomly accessed).

The use case I wanted to cover is the creation of a pre-trained embedding matrix to initialize an Embedding layer. I wanted to do it by loading just the word vectors I needed, as quickly as possible.

import embfile

with embfile.open(EMBEDDING_FILE_PATH) as f:
    emb_matrix, word2index, missing_words = embfile.build_matrix(
        f,
        words=vocab,     # this could also be a word2index dictionary
        start_index=1,   # leave the first row as zeros
    )

This function also handles the initialization of words that are out of the file's vocabulary. By default, it fits a normal distribution on the found vectors and uses it to generate new random vectors (this is what AllenNLP did). I'm not sure this feature is still relevant: nowadays you can generate embeddings for unknown words using FastText or the like.
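The idea behind that default can be sketched with plain numpy (this is only an illustration of the technique, not embfile's actual code; found_vectors stands for the matrix of vectors that were found in the file):

import numpy as np

# found_vectors: (n_found, dim) array of the vectors present in the file
mean, std = found_vectors.mean(axis=0), found_vectors.std(axis=0)
# draw one random vector per missing word from the fitted normal distribution
oov_vectors = np.random.normal(mean, std, size=(len(missing_words), found_vectors.shape[1]))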

The package is extensively documented and tested. There are also examples that show how to use it with Keras.

Keep in mind that txt and bin files are essentially sequential files and require a full scan (unless you find all the words you are looking for before the end). That's why I use vvm files, which offer random access to vectors. One could have solved the problem just by indexing the sequential files, but embfile doesn't have this feature. Nonetheless, you can convert sequential files to vvm (which is similar to creating an index and packs everything into a single file).

Lightfoot answered 15/2, 2021 at 15:12 Comment(0)
