Using a pretrained gensim Word2vec embedding in Keras
Asked Answered

3

12

I have trained word2vec in gensim. In Keras, I want to use that word embedding to build a matrix for each sentence, but storing the matrices of all the sentences is very space- and memory-inefficient. So I want to create an Embedding layer in Keras that achieves this, so that it can be used by further layers (LSTM). Can you tell me in detail how to do this?

PS: It is different from other questions because I am using gensim for word2vec training instead of keras.

Thromboembolism answered 1/9, 2018 at 8:53 Comment(1)
Here is how to incorporate the gensim model inside Keras: https://mcmap.net/q/534346/-using-gensim-fasttext-model-with-lstm-nn-in-keras – Canova
18

Let's say you have the following data that you need to encode:

docs = ['Well done!',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort!',
        'not good',
        'poor work',
        'Could have done better.']

You then tokenize it with the Tokenizer from Keras and find the vocab_size:

from keras.preprocessing.text import Tokenizer

t = Tokenizer()
t.fit_on_texts(docs)
vocab_size = len(t.word_index) + 1

You can then encode it into sequences like this:

encoded_docs = t.texts_to_sequences(docs)
print(encoded_docs)

You can then pad the sequences so that they all have a fixed length:

from keras.preprocessing.sequence import pad_sequences

max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')

Then use the word2vec model to build the embedding matrix:

from numpy import asarray, zeros

# load embedding as a dict
def load_embedding(filename):
    # load embedding into memory, skip the header line
    file = open(filename, 'r')
    lines = file.readlines()[1:]
    file.close()
    # create a map of words to vectors
    embedding = dict()
    for line in lines:
        parts = line.split()
        # key is the string word, value is a numpy array for the vector
        embedding[parts[0]] = asarray(parts[1:], dtype='float32')
    return embedding

# create a weight matrix for the Embedding layer from a loaded embedding
def get_weight_matrix(embedding, vocab):
    # total vocabulary size plus 0 for unknown words
    vocab_size = len(vocab) + 1
    # define weight matrix dimensions with all 0
    weight_matrix = zeros((vocab_size, 100))
    # step through the vocab, storing vectors using the Tokenizer's integer mapping
    for word, i in vocab.items():
        vector = embedding.get(word)
        if vector is not None:  # leave rows for out-of-vocabulary words as zeros
            weight_matrix[i] = vector
    return weight_matrix

# load embedding from file
raw_embedding = load_embedding('embedding_word2vec.txt')
# get vectors in the right order
embedding_vectors = get_weight_matrix(raw_embedding, t.word_index)
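
Since the question is specifically about a gensim-trained model, you can also skip the text-file round trip and build the same matrix straight from the gensim object. A minimal sketch, assuming the model was saved with gensim's model.save() (the path 'w2v.model' and the helper name are placeholders):

import numpy as np
from gensim.models import Word2Vec

w2v = Word2Vec.load('w2v.model')  # path to your own trained model

def gensim_weight_matrix(keyed_vectors, word_index):
    # row 0 stays all zeros for the padding / unknown index
    weights = np.zeros((len(word_index) + 1, keyed_vectors.vector_size))
    for word, i in word_index.items():
        if word in keyed_vectors:  # skip words missing from the word2vec vocabulary
            weights[i] = keyed_vectors[word]
    return weights

embedding_vectors = gensim_weight_matrix(w2v.wv, t.word_index)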

Once you have the embedding matrix, you can use it in an Embedding layer like this:

from keras.layers import Embedding
e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)

This layer can then be used to build a model like this:

from keras.models import Sequential
from keras.layers import Flatten, Dense

model = Sequential()
e = Embedding(vocab_size, 100, weights=[embedding_vectors], input_length=4, trainable=False)
model.add(e)
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model (labels is the array of binary targets for the docs above)
model.fit(padded_docs, labels, epochs=50, verbose=0)

All the code is adapted from this awesome blog post; follow it to learn more about embeddings using GloVe.

For using word2vec, see this post.
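
If your embedding lives in a gensim model rather than a text file (as the comments below discuss), one quick option is to export it to the plain word2vec text format that load_embedding above expects. A sketch, assuming a trained gensim model named w2v:

# the first line of the exported file is a header (vocab size and vector dimension),
# which is why load_embedding skips it with [1:]
w2v.wv.save_word2vec_format('embedding_word2vec.txt', binary=False)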

Berndt answered 1/9, 2018 at 9:16 Comment(6)
Will it work if my Word2vec embedding is trained using gensim? – Thromboembolism
I am asking because gensim's trained word2vec model is not a txt file, and you have written the loader for a text file – Thromboembolism
No, it is the model file. But there are ways to save it as .bin. I will do that. Will this work for it? – Thromboembolism
Yes. If there is a bin file, you can save it to txt – Berndt
Let us continue this discussion in chat. – Thromboembolism
I have the impression that the format of keras.layers.Embedding with weights is deprecated if you check this (keras.io/layers/embeddings) and this (github.com/tensorflow/tensorflow/issues/14392) – Hydroid
19

With the new Gensim version this is pretty easy:

w2v_model.wv.get_keras_embedding(train_embeddings=False)

There you have your Keras embedding layer.
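
For context, a minimal usage sketch (assuming gensim 3.x, where get_keras_embedding is available on the KeyedVectors object, and a w2v_model trained beforehand). Note that the integer sequences fed to this layer must use gensim's own word indices (w2v_model.wv.vocab[word].index), not an independent Keras Tokenizer mapping:

from keras.models import Sequential
from keras.layers import LSTM, Dense

# get_keras_embedding was removed in gensim 4.x; this assumes gensim 3.x
embedding_layer = w2v_model.wv.get_keras_embedding(train_embeddings=False)

model = Sequential()
model.add(embedding_layer)                 # expects gensim vocabulary indices as input
model.add(LSTM(32))                        # handles variable-length sequences
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])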

Sattler answered 6/8, 2019 at 11:30 Comment(4)
Simple and elegant – Antagonist
Seems this has problems later on if you try to use tf.keras and "pure" keras – Watercourse
Should I still use the Tokenizer if calling this function? I'm a little confused about whether this Embedding layer will really be fitted to the representation created by the Tokenizer. If not, how can I tokenize the text samples? – Jural
OK, I managed to find the answer: you can extract the word indices from the gensim model and feed them to the tokenizer: ``` vocabulary = {word: vector.index for word, vector in embedding.vocab.items()} tk = Tokenizer(num_words=len(vocabulary)) tk.word_index = vocabulary tk.texts_to_sequences(samples) ``` – Jural
3

My code for a gensim-trained w2v model. Assume all the words trained in the w2v model are now in a list variable called all_words.

from keras.preprocessing.text import Tokenizer
from keras.layers import Embedding
import gensim
import numpy as np

w2v = gensim.models.Word2Vec.load("models/w2v.model")
vocab = w2v.wv.vocab    # gensim (< 4.0) mapping of word -> vocabulary entry
t = Tokenizer()

vocab_size = len(all_words) + 1
t.fit_on_texts(all_words)

def get_weight_matrix():
    # define weight matrix dimensions with all 0 (row 0 is reserved for padding)
    weight_matrix = np.zeros((vocab_size, w2v.vector_size))
    # step through the vocab, storing vectors using the Tokenizer's integer mapping
    for i in range(len(all_words)):
        weight_matrix[i + 1] = w2v.wv[all_words[i]]
    return weight_matrix

embedding_vectors = get_weight_matrix()
# FIXED_LENGTH is the length the input sequences are padded/truncated to
emb_layer = Embedding(vocab_size, output_dim=w2v.vector_size, weights=[embedding_vectors], input_length=FIXED_LENGTH, trainable=False)
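
For completeness, here is a sketch of how this layer could be wired into a small model; docs, labels and FIXED_LENGTH are placeholders for your own texts, binary targets and padded sequence length:

from keras.models import Sequential
from keras.layers import LSTM, Dense
from keras.preprocessing.sequence import pad_sequences

# encode and pad the raw texts with the same Tokenizer fitted above
sequences = t.texts_to_sequences(docs)
X = pad_sequences(sequences, maxlen=FIXED_LENGTH, padding='post')

model = Sequential()
model.add(emb_layer)                      # frozen word2vec embeddings
model.add(LSTM(32))                       # or any other downstream layers
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(X, labels, epochs=10, verbose=0)
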
Glauce answered 8/4, 2019 at 2:20 Comment(3)
And what exactly is combined data? – Rickrickard
@AleksandarMakragić Edited. – Glauce
What's all_words, isn't all_words == vocab? – Pili
