Using pretrained GloVe word embeddings with scikit-learn

I have used Keras with pre-trained word embeddings, but I am not sure how to do the same with a scikit-learn model.

I need this in sklearn as well because I am using vecstack to ensemble a Keras Sequential model and a sklearn model.

This is what I have done for the Keras model:

import os
import numpy as np

glove_dir = '/home/Documents/Glove'
embeddings_index = {}
with open(os.path.join(glove_dir, 'glove.6B.200d.txt'), 'r', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

embedding_dim = 200


embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    if i < max_words:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

model = Sequential()
model.add(Embedding(max_words, embedding_dim, input_length=maxlen))
# ... remaining layers ...
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False
model.compile(----)
model.fit(-----)

I am very new to scikit-learn; from what I have seen, to build a model in sklearn you do:

lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.predict(X_test)

So, my question is: how do I use pre-trained GloVe embeddings with this model? Where do I pass the pre-trained GloVe embedding_matrix?

Thank you very much and I really appreciate your help.

Pinette answered 16/3, 2019 at 16:6 Comment(7)
Please describe what model you want to build in sklearn, ideally with a formula and/or a descriptive diagram.Disparage
Hello, I just want a logistic regression model with pre-trained word embeddings, taking the average of the word-embedding vectors.Pinette
Input is an Amazon review. Since it's a review (text), word embeddings play a huge role, right?Pinette
So you want to input.... a bag-of-words representation of some text, i.e. a fixed length vector of counts of individual words in the text?Disparage
Well, yes and no. I have used Tokenizer to vectorize the text and convert it into sequences so it can be used as input. Instead of bag-of-words I want word embeddings, because I think the bag-of-words approach is very domain-specific and I also want to work cross-domain.Pinette
@Pinette I am trying to work on a similar problem now. I think what you want to do is once you have your vectorized documents in a sparse matrix, you can add some additional columns that include the word embedding (i.e. R-vector) average of all the words in the document. That should be an additional number of features that bring context into the classifier from outside your corpus.Underside
@BlueMango, Have you solved this problem? I also need to use glove embedding with sklearn Machine learning model. Please do update?Refractive
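The averaging approach described in the comments above can be sketched without any extra library: average the pre-trained vectors of the words in each document to get a fixed-length feature vector, then feed those features to any scikit-learn estimator. A minimal illustration (the `document_vector` helper and the tiny `toy_index` dictionary are hypothetical, for demonstration only; with real data you would pass the `embeddings_index` dict loaded from the GloVe file):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def document_vector(text, embeddings_index, dim):
    """Average the embedding vectors of all known words in the text.

    Words missing from the embedding index are skipped; a document with
    no known words falls back to a zero vector of the right size.
    """
    vectors = [embeddings_index[w] for w in text.lower().split()
               if w in embeddings_index]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Toy demonstration with 2-dimensional vectors (real GloVe vectors are 200-d)
toy_index = {"good": np.array([1.0, 0.0]), "bad": np.array([0.0, 1.0])}
vec = document_vector("good bad unknown", toy_index, dim=2)  # -> [0.5, 0.5]

# Stack one averaged vector per document and train a sklearn model on it
corpus = ["good good", "bad bad"]
y = [1, 0]
X = np.vstack([document_vector(doc, toy_index, dim=2) for doc in corpus])
clf = LogisticRegression().fit(X, y)
```

Because `X` is just a dense NumPy matrix, the same features work with any sklearn estimator, and can also be passed to vecstack for ensembling.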

You can simply use the Zeugma library.

You can install it with pip install zeugma, then create and train your model with the following lines of code (assuming corpus_train and corpus_test are lists of strings):

from sklearn.linear_model import LogisticRegression
from zeugma.embeddings import EmbeddingTransformer

glove = EmbeddingTransformer('glove')
x_train = glove.transform(corpus_train)

model = LogisticRegression()
model.fit(x_train, y_train)

x_test = glove.transform(corpus_test)
model.predict(x_test)

You can also use different pre-trained embeddings (complete list here) or train your own (see Zeugma's documentation for how to do this).

Tangerine answered 7/11, 2019 at 15:8 Comment(5)
This code no longer works with Gensim 4.0.0 or higher.Abstergent
As of today, Zeugma supports Gensim 4.0+. Just upgrade to the latest version (0.49+) with pip install -U zeugmaTangerine
Yeah, I saw, I'm upgrading it at this moment.Abstergent
Is there any alternative to Zeugma? It seems to me it's not supported anymore :/Witkowski
Hey @DanielWiczew, I'm not aware of alternatives, but Zeugma is still maintained; there just haven't been commits recently because none were needed. Let me know if you experience issues with it.Tangerine
