How do I create a Keras Embedding layer from a pre-trained word embedding dataset?

How do I load a pre-trained word-embedding into a Keras Embedding layer?

I downloaded the glove.6B.50d.txt (glove.6B.zip file from https://nlp.stanford.edu/projects/glove/) and I'm not sure how to add it to a Keras Embedding layer. See: https://keras.io/layers/embeddings/

Telekinesis answered 8/2, 2018 at 3:30 Comment(1)
Here is how to incorporate a Gensim model inside Keras: https://mcmap.net/q/534346/-using-gensim-fasttext-model-with-lstm-nn-in-keras – Greening

You will need to pass an embeddingMatrix to the Embedding layer as follows:

Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)

  • vocabLen: number of tokens in your vocabulary
  • embDim: embedding vectors dimension (50 in your example)
  • embeddingMatrix: embedding matrix built from glove.6B.50d.txt
  • isTrainable: whether you want the embeddings to be trainable or to freeze the layer

The glove.6B.50d.txt file contains one whitespace-separated record per line: a word token followed by its (50) embedding values, e.g. the 0.418 0.24968 -0.41242 ...

To create a pretrainedEmbeddingLayer from a Glove file:

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

# Prepare Glove File
def readGloveFile(gloveFile):
    with open(gloveFile, 'r', encoding='utf-8') as f:
        wordToGlove = {}  # map from a token (word) to a Glove embedding vector
        wordToIndex = {}  # map from a token to an index
        indexToWord = {}  # map from an index to a token 

        for line in f:
            record = line.strip().split()
            token = record[0] # take the token (word) from the text line
            wordToGlove[token] = np.array(record[1:], dtype=np.float64) # associate the Glove embedding vector with that token (word)

        tokens = sorted(wordToGlove.keys())
        for idx, tok in enumerate(tokens):
            kerasIdx = idx + 1  # 0 is reserved for masking in Keras
            wordToIndex[tok] = kerasIdx # associate an index to a token (word)
            indexToWord[kerasIdx] = tok # associate a token (word) with an index; inverse of the dictionary above

    return wordToIndex, indexToWord, wordToGlove

# Create Pretrained Keras Embedding Layer
def createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, isTrainable):
    vocabLen = len(wordToIndex) + 1  # adding 1 to account for masking
    embDim = next(iter(wordToGlove.values())).shape[0]  # works with any glove dimensions (e.g. 50)

    embeddingMatrix = np.zeros((vocabLen, embDim))  # initialize with zeros
    for word, index in wordToIndex.items():
        embeddingMatrix[index, :] = wordToGlove[word] # create embedding: word index to Glove word embedding

    embeddingLayer = Embedding(vocabLen, embDim, weights=[embeddingMatrix], trainable=isTrainable)
    return embeddingLayer

# usage
wordToIndex, indexToWord, wordToGlove = readGloveFile("/path/to/glove.6B.50d.txt")
pretrainedEmbeddingLayer = createPretrainedEmbeddingLayer(wordToGlove, wordToIndex, False)
model = Sequential()
model.add(pretrainedEmbeddingLayer)
...
Telekinesis answered 8/2, 2018 at 17:7 Comment(5)
Can I use word embeddings as the vector representation of words in the output layer? – Protolanguage
I have the impression that the format of keras.layers.Embedding with weights is deprecated if you check this (keras.io/layers/embeddings) and this (github.com/tensorflow/tensorflow/issues/14392) – Rigveda
Damn, things change fast! I think that with the latest version one should use embeddings_initializer=Constant(embeddingMatrix) instead – Telekinesis
Note that for some versions of Keras there is an especially nasty bug when using Constant passed to embeddings_initializer, see here for details. – Backfill
Usually, in the case of pretrained embeddings, it is accepted practice to set trainable=False and let the subsequent layers learn from backprop while freezing the pretrained layer. Note that by default trainable=True, which is of course needed for randomly initialized weights. – Indigo
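
For reference, a minimal sketch of the embeddings_initializer=Constant alternative mentioned in the comments above, assuming the embeddingMatrix, vocabLen and embDim built in the answer (shown with tf.keras; see the comment above about a version-specific bug with Constant):

from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

# embeddingMatrix, vocabLen and embDim are assumed to come from the code above
embeddingLayer = Embedding(
    vocabLen,                                          # vocabulary size (+1 for the masking/padding index)
    embDim,                                            # embedding vector dimension (e.g. 50)
    embeddings_initializer=Constant(embeddingMatrix),  # initialize with the pretrained GloVe matrix
    trainable=False,                                   # freeze the pretrained embeddings
)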

There is a great blog post describing how to create an embedding layer with pre-trained word vector embeddings:

https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html

Code for the above article can be found here:

https://github.com/keras-team/keras/blob/master/examples/pretrained_word_embeddings.py

Another good blog for the same purpose: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/
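
Roughly, the approach those posts walk through looks like the sketch below (not their exact code; texts stands in for your own list of training sentences, and weights= is the older API also used in the answer above):

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding

texts = ["a list of training sentences"]  # placeholder: your own corpus

# 1. Build a word -> index mapping from the training texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index  # 1-based indices; 0 is kept for padding

# 2. Load the GloVe vectors into a dictionary
embedding_dim = 50
embeddings_index = {}
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

# 3. Fill an embedding matrix; words missing from GloVe stay all-zero
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, i in word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector

# 4. Create the (frozen) Embedding layer
embedding_layer = Embedding(len(word_index) + 1, embedding_dim,
                            weights=[embedding_matrix], trainable=False)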

Bohn answered 8/2, 2018 at 3:36 Comment(1)
I have the impression that the format of keras.layers.Embedding with weights is deprecated if you check this (keras.io/layers/embeddings) and this (github.com/tensorflow/tensorflow/issues/14392) – Rigveda

Some years ago, I wrote a utility package called embfile for working with "embedding files" (but I published it only in 2020). The use case I wanted to cover was creating a pre-trained embedding matrix to initialize an Embedding layer, loading just the word vectors I needed and as quickly as possible.

It supports various formats:

  • .txt (with or without a "header row")
  • .bin, Google Word2Vec format
  • .vvm, a custom format I use (it's just a TAR file with vocabulary, vectors and metadata in separate files, so that the vocabulary can be read entirely in a fraction of a second and the vectors can be randomly accessed).

The package is extensively documented and tested. There are also examples that show how to use it with Keras.

import embfile

with embfile.open(EMBEDDING_FILE_PATH) as f:

    emb_matrix, word2index, missing_words = embfile.build_matrix(
        f, 
        words=vocab,     # this can also be a word2index dictionary
        start_index=1,   # leave the first row as zeros
    )

This function also handles the initialization of words that are not in the file's vocabulary. By default, it fits a normal distribution to the vectors it found and uses it to generate new random vectors (this is what AllenNLP did). I'm not sure this feature is still relevant: nowadays you can generate embeddings for unknown words using FastText or whatever.

Keep in mind that txt and bin files are essentially sequential files and require a full scan (unless you find all the words you are looking for before the end). That's why I use vvm files, which offer random access to the vectors. One could also have solved the problem by indexing the sequential files, but embfile doesn't have this feature. Nonetheless, you can convert sequential files to vvm (which is similar to creating an index and packs everything into a single file).
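
For completeness, here is a sketch (not taken from the package's docs) of plugging the emb_matrix built above into a Keras Embedding layer, frozen as discussed in the comments on the first answer:

from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant

# emb_matrix and word2index come from embfile.build_matrix above;
# row 0 was left as all zeros (start_index=1), e.g. for padding/masking.
embedding_layer = Embedding(
    input_dim=emb_matrix.shape[0],
    output_dim=emb_matrix.shape[1],
    embeddings_initializer=Constant(emb_matrix),
    trainable=False,  # freeze the pretrained vectors
)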

Tribunate answered 15/2, 2021 at 15:20 Comment(0)

I was searching for a similar thing and found this blog post, which answers the question. It properly explains how to create an embedding_matrix and pass it to the Embedding() layer.

GloVe Embeddings for deep learning in Keras.

Biometrics answered 3/5, 2021 at 16:14 Comment(0)
