Character-Word Embeddings from lm_1b in Keras

I would like to use, in a Keras NN model, some pre-trained word embeddings that Google published in a very well-known article. They have provided the code to train a new model, as well as the embeddings themselves, here.

However, it is not clear from the documentation how to retrieve an embedding vector for a given string of characters (a word) with a simple Python function call. Much of the documentation centers on dumping vectors to a file for an entire sentence, presumably for sentiment analysis.

So far, I have seen that you can feed in pretrained embeddings with the following syntax:

embedding_layer = Embedding(number_of_words??,
                            output_dim=128??,
                            weights=[pre_trained_matrix_here],
                            input_length=60??,
                            trainable=False)

However, it is not clear to me how to convert the various files and their structures into pre_trained_matrix_here.

The model has several softmax outputs, so I am uncertain which one to use, and furthermore how to align the words in my input with the vocabulary the model was trained on.

Is there a simple way to use these word/character embeddings in Keras, and/or to construct the character/word-embedding portion of the model in Keras so that further layers can be added for other NLP tasks?

Saponin answered 31/5, 2017 at 1:19 Comment(5)
mccormickml.com/2016/04/12/…Strophic
I can get regular word2vec or GloVe vectors to work; the main interest here is to use the convolutional LSTM network to produce word vectors from the characters, so that OOV words are given a good estimated vector by essentially computing the vector on the fly. I have implemented character vectors as well, but their model was trained for weeks on a large array of GPUs, which is not something I can easily reproduce.Saponin
Do you have a clear goal? What do you mean by retrieving an embedding vector? Often you just keep the embedding layer at the beginning of the model. Model weights are just a matrix that was automatically trained and saved. You can't forge or assemble it from data; either you have the trained matrix or you don't.Dissipate
The lm_1b model has several different modes of output, which can encode characters, words, sentences, etc. I was hoping to create a simple Python function that would use their model to convert a sentence of text into a series of word vectors (which would not be out of vocabulary, since the model is character-based). That was the hope behind the question. The code appears to be set up to take in text in file format and write it out to another file, but changing that from a file to text in a variable proved to be more work than I had imagined.Saponin
Link to their code & embeddings is dead; we cannot help much. The paper has no footnotes for where they stored their work, so we will have to read it in hopes of finding that link, which is inconvenient. Your code snippet is from Keras embeddings: keras.io/layers/embeddings so I can clarify those "??" in a general sense. Your input_length should be the maximum number of words across all of your sentences, with shorter ones padded to that length with a dummy token using the Tokenizer keras.io/preprocessing/text . output_dim is the size of each embedding vector. num_words is the total number of words in the embedding matrix.Custard
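
To make those "??" concrete, here is a minimal Keras sketch under assumed values: the toy corpus, the maximum length of 60, the embedding size of 128, and the random stand-in for the pre-trained matrix are placeholders for illustration, not anything taken from lm_1b.

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding

# Toy corpus standing in for the real training sentences (assumption).
sentences = ["the cat sat on the mat", "dogs chase the cat"]

max_len = 60          # input_length: length every sentence is padded to
embedding_dim = 128   # output_dim: size of each word vector

# Build the word -> index vocabulary and turn sentences into index sequences.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, maxlen=max_len, padding="post")

num_words = len(tokenizer.word_index) + 1   # +1 for the padding index 0

# Random stand-in for the real pre-trained weight matrix (assumption).
pre_trained_matrix = np.random.rand(num_words, embedding_dim)

embedding_layer = Embedding(num_words,
                            embedding_dim,
                            weights=[pre_trained_matrix],
                            input_length=max_len,
                            trainable=False)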

The Embedding layer only looks up embeddings (rows of the weight matrix) for the integer indices of input words; it does not know anything about the strings themselves. This means you first need to convert your input sequence of words into a sequence of indices, using the same vocabulary that was used in the model you take the embeddings from.
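
As a minimal sketch of that conversion (the vocab.txt file name, its one-token-per-line layout, and the <UNK> token are assumptions for illustration, not the actual lm_1b file format):

from keras.preprocessing.sequence import pad_sequences

# Hypothetical vocabulary file: one token per line, in the same order as the
# rows of the pre-trained embedding matrix (this layout is an assumption).
with open("vocab.txt", encoding="utf-8") as f:
    vocab = [line.strip() for line in f]

word_to_index = {word: i for i, word in enumerate(vocab)}
unk_index = word_to_index.get("<UNK>", 0)   # fallback index for OOV words

def words_to_indices(tokens, max_len=60):
    # Map tokens to the integer indices the Embedding layer expects,
    # padding/truncating to a fixed length.
    indices = [word_to_index.get(tok, unk_index) for tok in tokens]
    return pad_sequences([indices], maxlen=max_len, padding="post")[0]

print(words_to_indices("the quick brown fox".split()))

The resulting index sequences can then be fed to an Embedding layer like the one shown in the question.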

Shoa answered 29/3, 2019 at 8:18 Comment(0)

For NLP applications related to word or text encoding, I would use CountVectorizer or TfidfVectorizer. Both are briefly described for Python in the following reference: http://www.bogotobogo.com/python/scikit-learn/files/Python_Machine_Learning_Sebastian_Raschka.pdf

CountVectorizer can be used for a simple application such as a spam/ham detector, while TfidfVectorizer gives deeper insight into how relevant each term (word) is, based on its frequency within a document and the number of documents in which it appears; this yields a useful metric of how discriminative the terms are. These text feature extractors can also apply stop-word removal and lemmatization to improve the feature representations.
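
A brief scikit-learn sketch of both vectorizers on a toy corpus (the documents and the English stop-word list are only illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["win a free prize now",
        "meeting moved to friday",
        "claim your free tickets now"]

# Raw term counts per document, with English stop words removed.
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(docs)

# Counts re-weighted by inverse document frequency, so terms that appear
# in fewer documents (more discriminative terms) get higher weights.
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(docs)

print(count_vec.get_feature_names_out())
print(tfidf.toarray().round(2))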

Monetmoneta answered 2/8, 2018 at 7:35 Comment(1)
Thank you for your input Pablo. The basic tf-idf approaches are useful; however, for this question I was asking more about incorporating character-level features into an embedding, as many tasks require understanding that the string "\t__BestFriend_;\t" is highly similar to "Best-Friend", which (depending heavily on tokenization) will not happen with simple approaches like tf-idf. Although I am now creating embeddings similar to ELMo, here I was interested in using Google's similar lm_1b to compute the embeddings for all of my input words prior to feeding them to my task, preferably on the fly.Saponin
