I have a Word2Vec model which was trained in Gensim. How can I use it for word embeddings in Tensorflow? I don't want to train the embeddings from scratch in Tensorflow. Can someone show me how to do this with some example code?
How to use pretrained Word2Vec model in Tensorflow
Let's assume you have a dictionary and an inverse_dict list, with the list index corresponding to the most common words:
vocab = {'hello': 0, 'world': 2, 'neural': 1, 'networks': 3}
inv_dict = ['hello', 'neural', 'world', 'networks']
Notice how the inverse_dict index corresponds to the dictionary values. Now declare your embedding matrix and get the values:
import numpy as np

vocab_size = len(inv_dict)
emb_size = 300  # or whatever the size of your embeddings
embeddings = np.zeros((vocab_size, emb_size))

from gensim.models.keyedvectors import KeyedVectors
model = KeyedVectors.load_word2vec_format('embeddings_file', binary=True)

for k, v in vocab.items():
    embeddings[v] = model[k]  # copy the pretrained vector into row v
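Note that the loop above assumes every vocab word exists in the pretrained file; if that may not hold for your data, one possible guard (random initialization for missing words is just one choice, not something from the original answer) is:
for k, v in vocab.items():
    if k in model:  # KeyedVectors supports membership tests against its vocabulary
        embeddings[v] = model[k]
    else:  # word missing from the pretrained file
        embeddings[v] = np.random.uniform(-0.25, 0.25, emb_size)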
You've got your embeddings matrix. Good. Now let's assume you want to train on the sample x = ['hello', 'world']. But this doesn't work for our neural net as-is; we need to integerize it first:
x_train = []
for word in x:
    x_train.append(vocab[word])  # integerize
x_train = np.array(x_train)  # make into numpy array
Now we are good to go with embedding our samples on the fly:
import tensorflow as tf

x_model = tf.placeholder(tf.int32, shape=[None, input_size])
with tf.device("/cpu:0"):
    embedded_x = tf.nn.embedding_lookup(embeddings, x_model)
Now embedded_x goes into your convolution or whatever. I am also assuming you are not retraining the embeddings, but simply using them. Hope that helps.
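For what it's worth, here is a minimal sketch (not part of the original answer) of what "goes into your convolution" might look like in TF1 style; the filter count and output units are arbitrary, and the cast assumes the numpy embeddings matrix is float64:
embedded_x = tf.cast(embedded_x, tf.float32)  # np.zeros gives float64; cast before the conv layer
conv = tf.layers.conv1d(embedded_x, filters=64, kernel_size=3, activation=tf.nn.relu)
pooled = tf.reduce_max(conv, axis=1)  # max-pool over the time dimension
logits = tf.layers.dense(pooled, units=2)  # e.g. a 2-class classifier head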
I also thought of this more manual approach (i.e. iterating the whole vocabulary and looking the words up one by one using model.word_vec(k)). But is there a way to make use of tf.nn.embedding_lookup, which it seems would be more efficient? One post using Tensorflow with GloVe (guillaumegenthial.github.io/…) essentially produced a custom GloVe file which can be used to perform a direct index-to-embeddings lookup. I wonder if one can do something similar with Word2Vec (binary) files. – Rivet
@JIXiang in practice you get all the words you want from Word2Vec and save them in a numpy array, a pickle, or whatever. Loading word2vec from Gensim every time is very expensive. tf.nn.embedding_lookup requires a matrix, so you can't use model.word_vec(k) on the fly. And tf is more efficient. – Marmion
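For instance, a rough sketch of that caching idea (the .npy file name is just illustrative; inv_dict, x_model, and the tf import come from the answer above):
import numpy as np
from gensim.models.keyedvectors import KeyedVectors

# One-off export: pull the vectors you need out of gensim and cache them on disk.
model = KeyedVectors.load_word2vec_format('embeddings_file', binary=True)
matrix = np.zeros((len(inv_dict), model.vector_size), dtype=np.float32)
for i, word in enumerate(inv_dict):
    matrix[i] = model[word]
np.save('embeddings.npy', matrix)  # illustrative file name

# Later runs: load the cached matrix and skip gensim entirely.
embeddings = np.load('embeddings.npy')
embedded_x = tf.nn.embedding_lookup(embeddings, x_model)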
embeddings[v] = model[k] should be replaced with embeddings[v] = model.word_vec(k) – Klaipeda