Explain with an example: how embedding layers in Keras work

I don't understand the Embedding layer of Keras. Although there are lots of articles explaining it, I am still confused. For example, the code below is from the IMDB sentiment analysis example:

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

top_words = 5000
max_review_length = 500
embedding_vector_length = 32

model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length))
model.add(LSTM(100))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())
model.fit(X_train, y_train, epochs=3, batch_size=64)

In this code, what exactly is the Embedding layer doing? What would the output of the Embedding layer be? It would be nice if someone could explain it with an example!

Interurban answered 12/8, 2017 at 11:6 Comment(4)
Possible duplicate of What is an Embedding in Keras? - Hutchens
It is explained there with Theano, but it would be easier to understand with an example in Keras. - Interurban
The math for the layers follows the same principles. - Hutchens
You may have a look at my answer: https://mcmap.net/q/203879/-what-is-an-embedding-in-keras - Stark

The Embedding layer creates embedding vectors from the input word indices (I admit I don't fully understand the math behind it), similar to what word2vec or pre-trained GloVe vectors would give you.
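
In essence, the layer is just a trainable lookup table. Here is a minimal NumPy sketch of the idea (the matrix here is random and untrained; the sizes simply mirror the example below):

import numpy as np

# A trainable weight matrix of shape (vocabulary_size, embedding_dim);
# the layer simply looks up rows by integer word index.
vocabulary_size = 10   # like top_words below
embedding_dim = 3      # like embedding_vector_length below

embedding_matrix = np.random.uniform(-0.05, 0.05, (vocabulary_size, embedding_dim))

word_indices = np.array([0, 0, 1, 2, 3, 4])  # one padded sentence
vectors = embedding_matrix[word_indices]     # shape (6, 3): one vector per word
print(vectors.shape)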

Before I get to your code, let's make a short example.

texts = ['This is a text', 'This is not a text']

First we turn these sentences into vectors of integers, where each word is replaced by the number assigned to it in the dictionary and the order of the vector preserves the order of the words.

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

max_review_length = 6  # maximum length of a sentence
embedding_vector_length = 3
top_words = 10

# num_words caps the vocabulary size; if there are more unique words,
# only the top_words most frequent ones are kept
tokenizer = Tokenizer(num_words=top_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
input_dim = len(word_index) + 1
print('Found %s unique tokens.' % len(word_index))

# pad every sequence to max_review_length so that we get vectors like
# [... 0, 0, 1, 3, 50] where 1, 3, 50 are individual word indices
data = pad_sequences(sequences, maxlen=max_review_length)

print('Shape of data tensor:', data.shape)
print(data)

[Out:] 
'This is a text' --> [0 0 1 2 3 4]
'This is not a text' --> [0 1 2 5 3 4]

Now you can input these into the embedding layer.

from keras.models import Sequential
from keras.layers import Embedding

model = Sequential()
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length, mask_zero=True))
model.compile(optimizer='adam', loss='categorical_crossentropy')
output_array = model.predict(data)

output_array is an array of shape (2, 6, 3): 2 input reviews (or sentences, in my case), 6 is the maximum number of words in each review (max_review_length), and 3 is embedding_vector_length. E.g.

array([[[-0.01494285, -0.007915  ,  0.01764857],
        [-0.01494285, -0.007915  ,  0.01764857],
        [-0.03019481, -0.02910612,  0.03518577],
        [-0.0046863 ,  0.04763055, -0.02629668],
        [ 0.02297204,  0.02146662,  0.03114786],
        [ 0.01634104,  0.02296363, -0.02348827]],

       [[-0.01494285, -0.007915  ,  0.01764857],
        [-0.03019481, -0.02910612,  0.03518577],
        [-0.0046863 ,  0.04763055, -0.02629668],
        [-0.01736645, -0.03719328,  0.02757809],
        [ 0.02297204,  0.02146662,  0.03114786],
        [ 0.01634104,  0.02296363, -0.02348827]]], dtype=float32)
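
One thing worth noticing in this output: the same word index always maps to the same vector. For instance, word index 1 (which the Tokenizer assigned to 'this') sits at position 2 of the first review and position 1 of the second one, and both rows are identical. A quick check, assuming output_array from above:

import numpy as np

# word index 1 appears at data[0][2] and data[1][1]; both positions get the same embedding row
print(np.allclose(output_array[0, 2], output_array[1, 1]))  # True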

In your case, you have a vocabulary of 5000 words and reviews of at most 500 words (longer ones are trimmed), and each of those 500 words is turned into a vector of size 32.
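
If you want to see those shapes directly, here is a quick sketch with the question's numbers (I name the model imdb_model so it does not clash with the small model above):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

imdb_model = Sequential()
imdb_model.add(Embedding(5000, 32, input_length=500))
print(imdb_model.output_shape)  # (None, 500, 32): batch, words per review, vector size

imdb_model.add(LSTM(100))
print(imdb_model.output_shape)  # (None, 100): the LSTM consumes the sequence of 500 vectors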

You can get the mapping between word indexes and embedding vectors by running:

model.layers[0].get_weights()

In the case below, top_words was 10, so we have a mapping for 10 words, and you can see that the mapping for indices 0, 1, 2, 3, 4 and 5 matches output_array above.

[array([[-0.01494285, -0.007915  ,  0.01764857],
        [-0.03019481, -0.02910612,  0.03518577],
        [-0.0046863 ,  0.04763055, -0.02629668],
        [ 0.02297204,  0.02146662,  0.03114786],
        [ 0.01634104,  0.02296363, -0.02348827],
        [-0.01736645, -0.03719328,  0.02757809],
        [ 0.0100757 , -0.03956784,  0.03794377],
        [-0.02672029, -0.00879055, -0.039394  ],
        [-0.00949502, -0.02805768, -0.04179233],
        [ 0.0180716 ,  0.03622523,  0.02232374]], dtype=float32)]

As mentioned in https://stats.stackexchange.com/questions/270546/how-does-keras-embedding-layer-work, these vectors are initialized randomly and optimized by the network optimizer just like any other parameter of the network.
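
For example, to pull out the learned vector of one particular word, you can combine the tokenizer's word_index with those weights (a small sketch reusing the tokenizer and model defined above):

embeddings = model.layers[0].get_weights()[0]  # shape (top_words, embedding_vector_length)

word = 'text'
index = tokenizer.word_index[word]  # integer id the Tokenizer assigned to the word
print(word, '->', embeddings[index])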

Arpeggio answered 23/10, 2017 at 18:48 Comment(0)

I agree with the previous detailed answer, but I would like to try and give a more intuitive explanation.

To understand how the Embedding layer works, it is better to take a step back and understand why we need embeddings in the first place.

Usually ML models take vectors (arrays of numbers) as input, so when dealing with text we have to convert the strings into numbers. One of the easiest ways to do this is one-hot encoding, where you treat each string as a categorical variable. But the first issue is that if you use a dictionary (vocabulary) of 10,000 words, one-hot encoding is a considerable waste of space (memory).

Also, since discrete entities are mapped to either 0 or 1 signaling a specific category, one-hot encoding cannot capture any relation between words. Thus, if you're familiar with the IMDB movie dataset, one-hot encoding is all but useless for sentiment analysis: if you measure similarity using the cosine distance, the similarity is always zero for every comparison between different indices.
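
A tiny NumPy sketch of that point (the vocabulary size and the two word indices are arbitrary):

import numpy as np

vocab_size = 10000
cat = np.zeros(vocab_size); cat[42] = 1.0  # one-hot vector for an arbitrary word
dog = np.zeros(vocab_size); dog[7] = 1.0   # one-hot vector for another word

cosine = cat @ dog / (np.linalg.norm(cat) * np.linalg.norm(dog))
print(cosine)  # 0.0: one-hot vectors of different words are always orthogonal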

This should guide us toward a method where:

  • similar words have a similar encoding, and
  • the categorical variables are represented with fewer numbers than there are unique categories.

Enter the Embedding layer.

An embedding is a dense vector of floating point values. These numbers are initialized randomly and, during training, they are updated via backpropagation, just as the weights of a dense layer are.
As defined in the TensorFlow docs:

The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings).

Before building the model with Sequential, you have already used the Keras Tokenizer API, so the input data is already integer coded. Once you specify the number of embedding dimensions (e.g. 16, 32, 64, etc.), that determines the number of columns of the lookup table.
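
A minimal sketch of that lookup-table view, written with tf.keras as in the referenced tutorial (the numbers here are arbitrary):

import tensorflow as tf

# 1,000 possible integer ids, each mapped to a dense vector of 16 floats
embedding = tf.keras.layers.Embedding(input_dim=1000, output_dim=16)

ids = tf.constant([3, 7, 42])  # three integer-coded words
vectors = embedding(ids)       # rows 3, 7 and 42 of the lookup table
print(vectors.shape)           # (3, 16)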

For each input sample, the output of the Embedding layer is a 2D array (input length × embedding dimension), which is why it is usually flattened (or fed to an RNN) before connecting to a Dense layer; see the sketch below. In the previous answer you can also see a 2D array of weights for the 0th layer, where the number of columns equals the embedding vector length.
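
A sketch of that usual pattern with toy numbers (20-word inputs, 16-dimensional embeddings, 1,000-word vocabulary):

import tensorflow as tf

inputs = tf.keras.Input(shape=(20,))  # 20 integer word ids per sample
x = tf.keras.layers.Embedding(input_dim=1000, output_dim=16)(inputs)  # (None, 20, 16)
x = tf.keras.layers.Flatten()(x)      # (None, 320): 20 * 16 values per sample
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
model = tf.keras.Model(inputs, outputs)
model.summary()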

That's how I think of the Embedding layer in Keras. Hopefully this sheds a little more light, and I thought it could be a good accompaniment to the answer posted by @Vaasha.

Reference: TensorFlow Word Embedding Tutorial.

Knotweed answered 30/10, 2019 at 5:26 Comment(0)
