I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.
How do I get the embedding weights loaded by gensim into the PyTorch embedding layer?
I just wanted to report my findings about loading a gensim embedding with PyTorch.
0.4.0 and newer:
From v0.4.0 there is a new function from_pretrained() which makes loading an embedding very convenient. Here is an example from the documentation.
import torch
import torch.nn as nn
# FloatTensor containing pretrained weights
weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
# Get embeddings for index 1
input = torch.LongTensor([1])
embedding(input)
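Note that from_pretrained() freezes the weights by default. If you want to fine-tune the embeddings you can pass freeze=False; a minimal sketch:
import torch
import torch.nn as nn

weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
# freeze=False keeps the embedding weights trainable
trainable_embedding = nn.Embedding.from_pretrained(weight, freeze=False)
print(trainable_embedding.weight.requires_grad)  # True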
The weights from gensim can easily be obtained by:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors) # formerly syn0, which is soon deprecated
As noted by @Guglie: in newer gensim versions (where model is a full Word2Vec model) the weights can be obtained from model.wv:
weights = torch.FloatTensor(model.wv.vectors)
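Putting it together, a minimal end-to-end sketch (the path 'vectors.bin' and the word 'dog' are hypothetical; key_to_index assumes gensim 4.x):
import gensim
import torch
import torch.nn as nn

kv = gensim.models.KeyedVectors.load_word2vec_format('vectors.bin', binary=True)
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(kv.vectors))

# look up a word through its gensim index
idx = torch.LongTensor([kv.key_to_index['dog']])
assert torch.allclose(embedding(idx)[0], torch.FloatTensor(kv['dog']))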
0.3.1 and older:
I'm using version 0.3.1 and from_pretrained() isn't available in this version. Therefore I created my own from_pretrained so I can also use it with 0.3.1.
Code for from_pretrained for PyTorch versions 0.3.1 or lower:
def from_pretrained(embeddings, freeze=True):
    assert embeddings.dim() == 2, \
        'Embeddings parameter is expected to be 2-dimensional'
    rows, cols = embeddings.shape
    embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
    embedding.weight = torch.nn.Parameter(embeddings)
    embedding.weight.requires_grad = not freeze
    return embedding
The embedding can then be loaded just like this:
embedding = from_pretrained(weights)
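One caveat when freezing on these old versions: the optimizer rejects parameters that don't require gradients, so filter them out (net is a hypothetical module containing the frozen embedding):
trainable_params = [p for p in net.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable_params, lr=0.01)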
I hope this is helpful for someone.
torch.LongTensor([1]) or, for a sequence, torch.LongTensor(any_sequence), e.g. torch.LongTensor([1, 2, 5, 9, 12, 92, 7]). As output you will get the respective embeddings. – Conscript
model.wv.vectors – Stingo
I think it is easy. Just copy the embedding weights from gensim into the corresponding weight of the PyTorch embedding layer. You need to make sure of two things: first, that the weight shape is correct; second, that the weights are converted to the PyTorch FloatTensor type.
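A minimal sketch of that manual copy (assumes model is a loaded gensim KeyedVectors object, as in the accepted answer):
import torch
import torch.nn as nn

vocab_size, emb_dim = model.vectors.shape
embedding = nn.Embedding(vocab_size, emb_dim)
# copy the gensim weights into the layer; the shapes must match exactly
embedding.weight.data.copy_(torch.FloatTensor(model.vectors))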
I had the same question except that I use torchtext library with pytorch as it helps with padding, batching, and other things. This is what I've done to load pre-trained embeddings with torchtext 0.3.0 and to pass them to pytorch 0.4.1 (the pytorch part uses the method mentioned by blue-phoenox):
import torch
import torch.nn as nn
import torchtext.data as data
import torchtext.vocab as vocab
# use torchtext to define the dataset field containing text
text_field = data.Field(sequential=True)
# load your dataset using torchtext, e.g.
dataset = data.Dataset(examples=..., fields=[('text', text_field), ...])
# build vocabulary
text_field.build_vocab(dataset)
# I use embeddings created with
# model = gensim.models.Word2Vec(...)
# model.wv.save_word2vec_format(path_to_embeddings_file)
# load embeddings using torchtext
vectors = vocab.Vectors(path_to_embeddings_file) # file created by gensim
text_field.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)
# when defining your network you can then use the method mentioned by blue-phoenox
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(text_field.vocab.vectors))
# pass data to the layer
dataset_iter = data.Iterator(dataset, ...)
for batch in dataset_iter:
    ...
    embedding(batch.text)
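A quick sanity check that the rows ended up where the vocabulary expects them (the word 'movie' is a hypothetical vocabulary entry):
idx = text_field.vocab.stoi['movie']
assert torch.equal(embedding(torch.LongTensor([idx]))[0],
                   text_field.vocab.vectors[idx])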
from gensim.models import Word2Vec
import torch
import torch.nn as nn

# train the gensim model (reviews is your tokenized corpus)
model = Word2Vec(reviews, size=100, window=5, min_count=5, workers=4)

# hand the trained vectors to PyTorch
weights = torch.FloatTensor(model.wv.vectors)
embedding = nn.Embedding.from_pretrained(weights)
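If you don't need the full model afterwards, you can persist just the vectors; a sketch (the path 'reviews.kv' is hypothetical):
from gensim.models import KeyedVectors

model.wv.save('reviews.kv')                     # keep only the KeyedVectors
kv = KeyedVectors.load('reviews.kv', mmap='r')  # smaller and faster to load
weights = torch.FloatTensor(kv.vectors)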
The Word2Vec model is only used for training the word vectors; this format is much slower than KeyedVectors. After you're done with training you normally save them as a KeyedVectors model, which is dedicated to holding pre-trained vectors, "resulting in a much smaller and faster object" than the full Word2Vec model. You can do it that way, but I see no benefit in using it this way. – Conscript
Had a similar problem: "after training and saving embeddings in binary format using gensim, how do I load them into torchtext?"
I just saved the file to txt format and then followed the superb tutorial on loading custom word embeddings.
import os
from os.path import basename
from gensim.models import KeyedVectors

def convert_bin_emb_txt(out_path, emb_file):
    txt_name = basename(emb_file).split(".")[0] + ".txt"
    emb_txt_file = os.path.join(out_path, txt_name)
    emb_model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
    emb_model.save_word2vec_format(emb_txt_file, binary=False)
    return emb_txt_file
emb_txt_file = convert_bin_emb_txt(out_path, emb_bin_file)
custom_embeddings = vocab.Vectors(name=emb_txt_file,
                                  cache='custom_embeddings',
                                  unk_init=torch.Tensor.normal_)

TEXT.build_vocab(train_data,
                 max_size=MAX_VOCAB_SIZE,
                 vectors=custom_embeddings,
                 unk_init=torch.Tensor.normal_)
Tested with PyTorch 1.2.0 and TorchText 0.4.0.
I added this answer because with the accepted answer I was not sure how to follow the linked tutorial, initialize all words not in the embeddings with the normal distribution, and make the <unk> and <pad> vectors equal to zero.
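For completeness, one way to zero those two vectors after build_vocab (a sketch; EMBEDDING_DIM is an assumed constant matching the pretrained dimensionality):
UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]
TEXT.vocab.vectors[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
TEXT.vocab.vectors[PAD_IDX] = torch.zeros(EMBEDDING_DIM)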
I had quite some problems understanding the documentation myself and there aren't that many good examples around. Hopefully this example helps other people. It is a simple classifier that takes the pretrained embeddings in matrix_embeddings. By setting requires_grad to False we make sure that we are not changing them.
class InferClassifier(nn.Module):
    def __init__(self, input_dim, n_classes, matrix_embeddings):
        """initializes a 2-layer MLP for classification.
        There are no non-linearities in the original code, Katia instructed us
        to use tanh instead."""
        super(InferClassifier, self).__init__()

        # dimensionalities
        self.input_dim = input_dim
        self.n_classes = n_classes
        self.hidden_dim = 512

        # embedding layer; freeze the weight tensor so it is not trained
        self.embeddings = nn.Embedding.from_pretrained(matrix_embeddings)
        self.embeddings.weight.requires_grad = False

        # creates an MLP
        self.classifier = nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim),
            nn.Tanh(),  # not present in the original code
            nn.Linear(self.hidden_dim, self.n_classes))

    def forward(self, sentence):
        """forward pass of the classifier
        I am not sure it is necessary to make this explicit."""
        # get the embeddings for the inputs
        u = self.embeddings(sentence)
        # forward to the classifier
        return self.classifier(u)
sentence is a vector with the indexes into matrix_embeddings instead of words.
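A hypothetical usage sketch (the vocabulary size, dimensions, and indices are made up):
matrix_embeddings = torch.randn(1000, 300)  # stand-in for real pretrained vectors
model = InferClassifier(input_dim=300, n_classes=5,
                        matrix_embeddings=matrix_embeddings)
sentence = torch.LongTensor([4, 7, 99])     # word indices, not words
logits = model(sentence)                    # shape: (3, 5)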
self.classifier(u)? – Religieuse