What is an Embedding in Keras?

Asked 4/7, 2016 at 17:26 Answered 20/2, 2019 at 5:1

106

The Keras documentation isn't clear about what this actually is. I understand we can use it to compress the input feature space into a smaller one. But how is this done from a neural design perspective? Is it an autoencoder, an RBM?

Belsky answered 4/7, 2016 at 17:26 Comment(3)

It's a lookup table that can be trained – Autochthonous 15/10, 2016 at 15:31

It simply creates and indexes a weight matrix; see my detailed answer below (https://mcmap.net/q/203879/-what-is-an-embedding-in-keras). – Outworn 1/11, 2018 at 15:52

Although the most-voted answer says it's a matrix multiplication, the source code and the other answers show that it is in fact just a trainable matrix. The input words simply pick the respective rows of this matrix. – Keating 5/11, 2018 at 11:47

As far as I know, the Embedding layer is a simple matrix multiplication that transforms words into their corresponding word embeddings.

The weights of the Embedding layer have the shape (vocabulary_size, embedding_dimension). For each training sample, the inputs are integers that represent certain words. The integers are in the range of the vocabulary size. The Embedding layer transforms each integer i into the i-th row of the embedding weights matrix.

In order to quickly do this as a matrix multiplication, the input integers are not stored as a list of integers but as a one-hot matrix. Therefore the input shape is (nb_words, vocabulary_size) with one non-zero value per line. If you multiply this by the embedding weights, you get the output in the shape

(nb_words, vocab_size) x (vocab_size, embedding_dim) = (nb_words, embedding_dim)

So with a simple matrix multiplication you transform all the words in a sample into the corresponding word embeddings.
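The shape arithmetic above can be checked with a minimal NumPy sketch. The sizes, the random `weights` matrix, and the `word_ids` sample are illustrative assumptions, not anything taken from Keras itself:

```python
import numpy as np

vocab_size, embedding_dim = 6, 3            # illustrative sizes (assumed)
rng = np.random.default_rng(0)
# Stand-in for the embedding weights, shape (vocab_size, embedding_dim).
weights = rng.normal(size=(vocab_size, embedding_dim))

word_ids = np.array([4, 0, 2, 4])           # one sample with nb_words = 4

# Build the (nb_words, vocab_size) one-hot matrix: a single 1 per row.
one_hot = np.zeros((len(word_ids), vocab_size))
one_hot[np.arange(len(word_ids)), word_ids] = 1.0

# (nb_words, vocab_size) x (vocab_size, embedding_dim) = (nb_words, embedding_dim)
embedded = one_hot @ weights
print(embedded.shape)
```

Each row of `embedded` is simply the row of `weights` selected by the corresponding word index.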

Ulland answered 6/7, 2016 at 14:32 Comment(13)

Interesting that it is just a simple matrix multiplication. Do you think we'd gain anything by learning the embedding with an autoencoder? – Belsky 7/7, 2016 at 1:21

Definitely a valid approach (see Semi-Supervised Sequence Learning). You can also learn the embeddings with an autoencoder and then use them as initialization of the Embedding layer to reduce the complexity of your neural network (I assume that you do something else after the Embedding layer). – Ulland 7/7, 2016 at 8:28

Here is a nice blogpost about word embeddings and their advantages. – Sequela 27/7, 2016 at 12:0

I guess in this case, each training sample can be a sentence. Each sentence is represented as a one-hot vector. Is it correct? – Incidence 2/12, 2016 at 14:51

In the case that I presented, each training input is a set of words (can be a sentence). Each word is represented as one-hot vector and embedded into a dense vector. The disadvantage of this approach is that, since the input needs to be of constant length, all your sentences need to have the same number of words. An alternative would be paragraph vectors, which can embed sentences, paragraphs or even documents into vectors. – Ulland 2/12, 2016 at 15:21

Does anyone know which embedding function the Keras built-in embedding is based on? I mean, does it follow the same function as w2v, or does it just use one-hot encoding or something else? – Centuplicate 6/7, 2017 at 7:9

It will train the Embedding layer weights like all other weights in your neural network (e.g. with stochastic gradient descent). You can also pretrain your word embeddings with w2v and use them as initial weights for the Embedding layer. You can then make the weights static or trainable, depending on your preference. – Ulland 6/7, 2017 at 8:1
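As a rough illustration of that comment, here is a NumPy sketch of one gradient-descent step on an embedding matrix initialised from pretrained vectors. The `pretrained` matrix, the upstream gradient `grad_out`, the learning rate, and the `trainable` flag are all hypothetical stand-ins, and this is not how Keras implements it internally:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embedding_dim = 5, 2
# Hypothetical pretrained vectors (e.g. from w2v) used as initial weights.
pretrained = rng.normal(size=(vocab_size, embedding_dim))

embeddings = pretrained.copy()
trainable = True                      # set to False to keep the weights static
word_ids = np.array([1, 3, 1])        # indices looked up in this step
# Hypothetical upstream gradient: one row per looked-up vector.
grad_out = np.ones((len(word_ids), embedding_dim))

if trainable:
    grad_w = np.zeros_like(embeddings)
    # Scatter-add: index 1 appears twice, so its gradients accumulate.
    np.add.at(grad_w, word_ids, grad_out)
    embeddings -= 0.1 * grad_w        # plain SGD step with learning rate 0.1
```

Only the rows that were actually looked up (here 1 and 3) change. Every other row keeps its pretrained value.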

Thanks @Lorrit, does it mean that it considers the semantic similarity of the words (words that occur in the same context are more semantically similar) to generate their vectors? I know the semantics behind the w2v algorithm, but I would like to know the semantics behind the Keras word embedding too! Will words in the same sequence get much more similar vectors here? – Centuplicate 6/7, 2017 at 21:34

The Embedding layer will just optimize its weights in order to minimize the loss. Maybe that means that it will consider the semantic similarity, maybe it won't. You never know with neural networks. If you want to be sure that the embedding follows a certain formula (e.g. w2v), use the formula. If you have enough data, you might want to use the Embedding layer and train the embeddings. Just try it and check whether you like the results. – Ulland 8/7, 2017 at 0:37

The source code has this paper as a reference: 'A Theoretically Grounded Application of Dropout in Recurrent Neural Networks' arxiv.org/pdf/1512.05287.pdf – Gstring 17/8, 2017 at 14:53

Just a small adjustment to @Ulland's comment on semantic similarity. Strictly speaking, while the outcome may reflect semantic similarity, embeddings are a way to completely avoid any need for a corpus/lexicon of any kind. In other words, there is no consideration for semantic similarity or other aspects that things like word2vec and conventional NLP approaches depend on. – Mammy 10/9, 2017 at 16:23

I agree with user36624 (answer below). It's NOT a simple matrix multiplication. – Keating 8/5, 2018 at 14:29

@DanielMöller, I agree that the Keras Embedding layer is not doing any matrix multiplication, as I show in my answer below. However, everyone has upvoted this answer and the moderators are not doing anything about all this...haha.... – Outworn 2/11, 2018 at 11:30

The Keras Embedding layer does not perform any matrix multiplication; it only:

1. creates a weight matrix of (vocabulary_size)x(embedding_dimension) dimensions

2. indexes this weight matrix

It is always useful to have a look at the source code to understand what a class does. In this case, we will have a look at the class Embedding which inherits from the base layer class called Layer.

(1) - Creating a weight matrix of (vocabulary_size)x(embedding_dimension) dimensions:

This happens in the build function of Embedding:

def build(self, input_shape):
    self.embeddings = self.add_weight(
        shape=(self.input_dim, self.output_dim),
        initializer=self.embeddings_initializer,
        name='embeddings',
        regularizer=self.embeddings_regularizer,
        constraint=self.embeddings_constraint,
        dtype=self.dtype)
    self.built = True

If you have a look at the base class Layer you will see that the function add_weight above simply creates a matrix of trainable weights (in this case of (vocabulary_size)x(embedding_dimension) dimensions):

def add_weight(self,
               name,
               shape,
               dtype=None,
               initializer=None,
               regularizer=None,
               trainable=True,
               constraint=None):
    """Adds a weight variable to the layer.
    # Arguments
        name: String, the name for the weight variable.
        shape: The shape tuple of the weight.
        dtype: The dtype of the weight.
        initializer: An Initializer instance (callable).
        regularizer: An optional Regularizer instance.
        trainable: A boolean, whether the weight should
            be trained via backprop or not (assuming
            that the layer itself is also trainable).
        constraint: An optional Constraint instance.
    # Returns
        The created weight variable.
    """
    initializer = initializers.get(initializer)
    if dtype is None:
        dtype = K.floatx()
    weight = K.variable(initializer(shape),
                        dtype=dtype,
                        name=name,
                        constraint=constraint)
    if regularizer is not None:
        with K.name_scope('weight_regularizer'):
            self.add_loss(regularizer(weight))
    if trainable:
        self._trainable_weights.append(weight)
    else:
        self._non_trainable_weights.append(weight)
    return weight

(2) - Indexing this weight matrix

This happens in the call function of Embedding:

def call(self, inputs):
    if K.dtype(inputs) != 'int32':
        inputs = K.cast(inputs, 'int32')
    out = K.gather(self.embeddings, inputs)
    return out

This function returns the output of the Embedding layer, which is K.gather(self.embeddings, inputs). What tf.keras.backend.gather does is index the weights matrix self.embeddings (see the build function above) according to the inputs, which should be lists of non-negative integers.
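To make the indexing concrete, here is a small NumPy analogue of that gather operation. It assumes that gathering along the first axis behaves like NumPy integer-array indexing, and the tiny `embeddings` matrix and `inputs` batch are made up for illustration:

```python
import numpy as np

# Stand-in weight matrix: vocabulary_size = 4, embedding_dimension = 3.
embeddings = np.arange(12, dtype=np.float32).reshape(4, 3)

# A batch of 2 sequences, each holding 2 word indices.
inputs = np.array([[0, 2],
                   [3, 3]])

# Integer-array indexing picks whole rows of the matrix;
# np.take(embeddings, inputs, axis=0) is the equivalent explicit call.
out = embeddings[inputs]
print(out.shape)
```

The output has one embedding vector per input index, so its shape is the input shape with the embedding dimension appended.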

These lists can be obtained, for example, by passing your text/words inputs to the one_hot function of Keras, which encodes a text into a list of word indexes of size n (this is NOT one-hot encoding; see also this example for more info: https://machinelearningmastery.com/use-word-embedding-layers-deep-learning-keras/).
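The difference between an index list and a true one-hot encoding can be sketched with a hashing-based encoder. The `hash_encode` helper below is hypothetical and only mimics the general idea of mapping words to integers. It is not the Keras implementation:

```python
import hashlib

def hash_encode(text, n):
    """Map each word to an integer in [1, n - 1] using a stable hash.

    Hypothetical helper for illustration only.
    """
    ids = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        ids.append(int(digest, 16) % (n - 1) + 1)
    return ids

ids = hash_encode("the cat sat on the mat", n=50)
print(ids)  # one integer per word; both occurrences of "the" share an index
```

The result is a list with one integer per word, which is exactly the kind of input an index lookup expects. A one-hot encoding would instead be a list of length-n vectors.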

Therefore, that's all. There is no matrix multiplication.

On the contrary, the Keras Embedding layer is useful precisely because it avoids performing a matrix multiplication and hence saves some computational resources.

Otherwise, you could just use a Keras Dense layer (after you have one-hot encoded your input data) to get a matrix of trainable weights (of (vocabulary_size)x(embedding_dimension) dimensions) and then simply do the multiplication to get the output, which will be exactly the same as the output of the Embedding layer.
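A quick NumPy check of this equivalence, assuming the Dense layer has no bias term and a linear activation. The `kernel` matrix and the sizes are illustrative, not taken from a real model:

```python
import numpy as np

rng = np.random.default_rng(1)
vocabulary_size, embedding_dimension = 1000, 16
# Shared weight matrix, playing the role of both the Embedding weights
# and the kernel of a bias-free Dense layer (an assumption of this sketch).
kernel = rng.normal(size=(vocabulary_size, embedding_dimension))
word_ids = rng.integers(0, vocabulary_size, size=8)

# Embedding-style: index the rows directly.
lookup = kernel[word_ids]

# Dense-style: one-hot encode first, then multiply.
one_hot = np.eye(vocabulary_size)[word_ids]      # shape (8, 1000)
dense = one_hot @ kernel

same = np.allclose(lookup, dense)
# Multiply-adds spent on the Dense route; the lookup needs none.
multiply_adds = one_hot.shape[0] * vocabulary_size * embedding_dimension
print(same, multiply_adds)
```

Both routes give the same vectors, but the multiplication route spends its work multiplying mostly by zeros.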

Outworn answered 1/11, 2018 at 12:46 Comment(0)

In Keras, the Embedding layer is NOT a simple matrix multiplication layer, but a look-up table layer (see call function below or the original definition).

def call(self, inputs):
    if K.dtype(inputs) != 'int32':
        inputs = K.cast(inputs, 'int32')
    out = K.gather(self.embeddings, inputs)
    return out

What it does is map each known integer n in inputs to a trainable feature vector W[n], whose dimension is the so-called embedded feature length.

Andromede answered 14/4, 2018 at 22:18 Comment(3)

Well when you multiply a one-hot represented set of vectors with a matrix, the product becomes a look-up. So the Embedding layer is indeed a matrix multiplication. – Linton 28/4, 2018 at 14:16

Except that Keras never performs this multiplication anywhere. It just defines "embeddings = a trainable matrix" and uses the input indices to gather rows from the matrix. – Keating 8/5, 2018 at 14:34

Thus, this embedding spares a lot of memory by simply not creating any one-hot version of the inputs. – Keating 8/5, 2018 at 14:35
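A back-of-the-envelope comparison of the two input representations, assuming int32 indices, a float32 one-hot matrix, and made-up sizes:

```python
import numpy as np

nb_words, vocabulary_size = 100, 20000     # illustrative sizes (assumed)

# Index representation: one int32 per word.
indices = np.zeros(nb_words, dtype=np.int32)
index_bytes = indices.nbytes

# One-hot representation: one float32 per (word, vocabulary entry) pair.
# Computed rather than allocated, to avoid building the large matrix.
one_hot_bytes = nb_words * vocabulary_size * np.dtype(np.float32).itemsize

print(index_bytes, one_hot_bytes, one_hot_bytes // index_bytes)
```

With these dtypes the one-hot input is larger by a factor equal to the vocabulary size.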

In simple words (from the functionality point of view), it is a one-hot encoder followed by a fully-connected layer. The layer weights are trainable.

Kanal answered 20/2, 2019 at 5:1 Comment(0)
