PyTorch / Gensim - How do I load pre-trained word embeddings?

I want to load a pre-trained word2vec embedding with gensim into a PyTorch embedding layer.

How do I get the embedding weights loaded by gensim into the PyTorch embedding layer?

Conscript answered 7/4, 2018 at 18:21 Comment(0)

I just wanted to report my findings about loading a gensim embedding with PyTorch.


  • Solution for PyTorch 0.4.0 and newer:

Since v0.4.0 there is a new function from_pretrained() which makes loading an embedding very convenient. Here is an example from the documentation.

import torch
import torch.nn as nn

# FloatTensor containing pretrained weights
weight = torch.FloatTensor([[1, 2.3, 3], [4, 5.1, 6.3]])
embedding = nn.Embedding.from_pretrained(weight)
# Get embeddings for index 1
input = torch.LongTensor([1])
embedding(input)

The weights from gensim can easily be obtained by:

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')
weights = torch.FloatTensor(model.vectors) # formerly syn0, which is soon deprecated

As noted by @Guglie: in newer gensim versions the weights of a trained Word2Vec model can be obtained from model.wv.vectors:

weights = torch.FloatTensor(model.wv.vectors)
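
The token ids you pass to the embedding layer have to come from gensim's own vocabulary, because the rows of the weight matrix follow gensim's ordering (see also the comments below). A minimal sketch of that lookup, assuming gensim 4.x, where the mapping is exposed as key_to_index (older versions use vocab[word].index); the path and example tokens are placeholders:

import torch
import gensim

kv = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')  # placeholder path
embedding = torch.nn.Embedding.from_pretrained(torch.FloatTensor(kv.vectors))

# map tokens to the indices gensim assigned to them, then look up their embeddings
tokens = ['natural', 'language']  # placeholder tokens
ids = torch.LongTensor([kv.key_to_index[t] for t in tokens])
vectors = embedding(ids)  # shape: (len(tokens), embedding_dim)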

  • Solution for PyTorch version 0.3.1 and older:

I'm using version 0.3.1 and from_pretrained() isn't available in this version, so I created my own from_pretrained that I can use with 0.3.1 as well.

Code for from_pretrained for PyTorch versions 0.3.1 or lower:

def from_pretrained(embeddings, freeze=True):
    assert embeddings.dim() == 2, \
         'Embeddings parameter is expected to be 2-dimensional'
    rows, cols = embeddings.shape
    embedding = torch.nn.Embedding(num_embeddings=rows, embedding_dim=cols)
    embedding.weight = torch.nn.Parameter(embeddings)
    embedding.weight.requires_grad = not freeze
    return embedding

The embedding can then be loaded just like this:

embedding = from_pretrained(weights)

I hope this is helpful for someone.

Conscript answered 12/4, 2018 at 17:17 Comment(8)
What is the input to your model after that? Is it the text itself or the 1-hot encoding of the text?Roughrider
PyTorch does not use one-hot encoding; you can just use integer ids / token ids to access the respective embeddings: torch.LongTensor([1]) or, for a sequence, torch.LongTensor(any_sequence), e.g. torch.LongTensor([1, 2, 5, 9, 12, 92, 7]). As output you will get the respective embeddings.Conscript
@blue-phoenox how do you get the integer/token ids please?Gilbertson
@Gilbertson This is not a general answer and could cause performance to drop, since the pre-trained embedding potentially uses a different indexing than the one you have used in your application.Acrodrome
with newer versions of gensim vectors are in model.wv.vectorsStingo
@Stingo Thank you for mentioning this! I've added it.Conscript
Actually, I think this answer is incorrect. It requires that we have the same token to label mapping. i.e. we require that if label 401 corresponds to "natural" in the gensim vectors, then in our own model, we should carefully require that label 401 also corresponds to "natural"Laurin
Using the full model is the more principled approach: radimrehurek.com/gensim/models/keyedvectors.htmlLaurin

I think it is easy. Just copy the embedding weight from gensim to the corresponding weight in the PyTorch embedding layer.

You need to make sure of two things: first, the weight shape has to be correct; second, the weights have to be converted to the PyTorch FloatTensor type.
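
A minimal sketch of this approach, assuming the vectors come from a word2vec-format file (placeholder path); it copies the gensim matrix into a freshly created layer of the matching shape:

import torch
import torch.nn as nn
import gensim

model = gensim.models.KeyedVectors.load_word2vec_format('path/to/file')  # placeholder path
weights = torch.FloatTensor(model.vectors)  # shape: (vocab_size, embedding_dim)

embedding = nn.Embedding(*weights.shape)  # layer with the matching shape
with torch.no_grad():
    embedding.weight.copy_(weights)  # copy the pretrained weights into the layer

# Alternatively, pass the matrix directly via the constructor's _weight argument:
# embedding = nn.Embedding(weights.shape[0], weights.shape[1], _weight=weights)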

Tuber answered 8/4, 2018 at 2:56 Comment(1)
I didn't know there was a _weight parameter in the constructor; I will try it out - thank you!Conscript

I had the same question, except that I use the torchtext library with PyTorch, as it helps with padding, batching, and other things. This is what I've done to load pre-trained embeddings with torchtext 0.3.0 and to pass them to PyTorch 0.4.1 (the PyTorch part uses the method mentioned by blue-phoenox):

import torch
import torch.nn as nn
import torchtext.data as data
import torchtext.vocab as vocab

# use torchtext to define the dataset field containing text
text_field = data.Field(sequential=True)

# load your dataset using torchtext, e.g.
dataset = data.Dataset(examples=..., fields=[('text', text_field), ...])

# build vocabulary
text_field.build_vocab(dataset)

# I use embeddings created with
# model = gensim.models.Word2Vec(...)
# model.wv.save_word2vec_format(path_to_embeddings_file)

# load embeddings using torchtext
vectors = vocab.Vectors(path_to_embeddings_file) # file created by gensim
text_field.vocab.set_vectors(vectors.stoi, vectors.vectors, vectors.dim)

# when defining your network you can then use the method mentioned by blue-phoenox
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(text_field.vocab.vectors))

# pass data to the layer
dataset_iter = data.Iterator(dataset, ...)
for batch in dataset_iter:
    ...
    embedding(batch.text)
Eolian answered 17/8, 2018 at 18:26 Comment(0)
from gensim.models import Word2Vec

model = Word2Vec(reviews, size=100, window=5, min_count=5, workers=4)
# gensim model created

import torch
import torch.nn as nn

weights = torch.FloatTensor(model.wv.vectors)
embedding = nn.Embedding.from_pretrained(weights)
Drapery answered 12/11, 2018 at 19:47 Comment(5)
Thanks for your reply. I've taken a look at gensim to check your approach. The gensim page radimrehurek.com/gensim/models/word2vec.html#usage-examples says the Word2Vec model is only used for training the word vectors, as this format is much slower than KeyedVectors. After you're done with training you normally save them into a KeyedVectors model, which is dedicated to saving pre-trained vectors, "resulting in a much smaller and faster object" than the Word2Vec model. You can do it that way, but I see no benefit in using it this way.Conscript
Thanks, @blue-phoenox. I had read that; I wrote this code under the assumption that the embeddings are created and used right away rather than loaded from a file.Drapery
Of course you can do that. But this would mean that every time you start the training process you would also train the embeddings. This is just wasted computation and not really the idea of pre-trained embeddings. When I create models, I normally run them multiple times and I do not want to train my pre-trained embeddings again every time I start the training process of my model.Conscript
The main emphasis is on the torch section, hence I leave the reader to deal with the gensim model and its loading. There could be situations wherein the dev could use the gensim model right after creation.Drapery
I was just pointing out that in this use-case the vectors are not really pre-trained. Your code example doesn't load pre-trained vectors, but instead trains new word vectors. And I was just wondering if there was another use-case, therefore I was asking.Conscript
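
A minimal sketch of the train-once, reload-later workflow discussed in the comments above, assuming gensim 4.x (where the size parameter is called vector_size); reviews is a placeholder corpus of token lists and the file name is arbitrary:

import torch
import torch.nn as nn
from gensim.models import Word2Vec, KeyedVectors

# one-off training run: train, then keep only the lightweight KeyedVectors
model = Word2Vec(reviews, vector_size=100, window=5, min_count=5, workers=4)
model.wv.save('reviews.kv')

# later training runs: reload the saved vectors instead of retraining them
kv = KeyedVectors.load('reviews.kv')
embedding = nn.Embedding.from_pretrained(torch.FloatTensor(kv.vectors))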

Had a similar problem: "after training and saving embeddings in binary format using gensim, how do I load them into torchtext?"

I just saved the file in txt format and then followed the superb tutorial on loading custom word embeddings.

import os
from os.path import basename
from gensim.models import KeyedVectors

def convert_bin_emb_txt(out_path, emb_file):
    txt_name = basename(emb_file).split(".")[0] + ".txt"
    emb_txt_file = os.path.join(out_path, txt_name)
    # load the binary embeddings and re-save them in plain-text word2vec format
    emb_model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
    emb_model.save_word2vec_format(emb_txt_file, binary=False)
    return emb_txt_file

emb_txt_file = convert_bin_emb_txt(out_path,emb_bin_file)
custom_embeddings = vocab.Vectors(name=emb_txt_file,
                                  cache='custom_embeddings',
                                  unk_init=torch.Tensor.normal_)

TEXT.build_vocab(train_data,
                 max_size=MAX_VOCAB_SIZE,
                 vectors=custom_embeddings,
                 unk_init=torch.Tensor.normal_)

Tested with PyTorch 1.2.0 and TorchText 0.4.0.

I added this answer because with the accepted answer I was not sure how to follow the linked tutorial and initialize all words not in the embeddings using the normal distribution, and how to make the vectors for <unk> and <pad> equal to zero.
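
For reference, a short sketch of that zeroing step, following the same tutorial's convention; model.embedding and EMBEDDING_DIM are assumptions about your own model, while TEXT is the field built above:

import torch

UNK_IDX = TEXT.vocab.stoi[TEXT.unk_token]
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

# overwrite the randomly initialized <unk> and <pad> rows with zeros
# after copying the pretrained vectors into the model's embedding layer
model.embedding.weight.data[UNK_IDX] = torch.zeros(EMBEDDING_DIM)
model.embedding.weight.data[PAD_IDX] = torch.zeros(EMBEDDING_DIM)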

Local answered 14/9, 2019 at 19:40 Comment(0)

I had quite some problems understanding the documentation myself, and there aren't that many good examples around. Hopefully this example helps other people. It is a simple classifier that takes the pretrained embeddings in matrix_embeddings. By setting requires_grad to False we make sure that we are not changing them.

class InferClassifier(nn.Module):
  def __init__(self, input_dim, n_classes, matrix_embeddings):
    """initializes a 2 layer MLP for classification.
    There are no non-linearities in the original code, Katia instructed us 
    to use tanh instead"""

    super(InferClassifier, self).__init__()

    #dimensionalities
    self.input_dim = input_dim
    self.n_classes = n_classes
    self.hidden_dim = 512

    #embedding
    self.embeddings = nn.Embedding.from_pretrained(matrix_embeddings)
    self.embeddings.weight.requires_grad = False  # redundant: from_pretrained already freezes by default

    #creates a MLP
    self.classifier = nn.Sequential(
            nn.Linear(self.input_dim, self.hidden_dim),
            nn.Tanh(), #not present in the original code.
            nn.Linear(self.hidden_dim, self.n_classes))

  def forward(self, sentence):
    """forward pass of the classifier
    I am not sure it is necessary to make this explicit."""

    #get the embeddings for the inputs
    u = self.embeddings(sentence)

    #forward to the classifier
    return self.classifier(u)

sentence is a vector with indexes into matrix_embeddings instead of words.

Dayton answered 15/4, 2019 at 17:37 Comment(2)
You mean self.classifier(u)?Religieuse
How do you get those indexes for the sentence though?Mccaslin