Train only some word embeddings (Keras)

In my model, I use GloVe pre-trained embeddings. I wish to keep them non-trainable in order to decrease the number of model parameters and avoid overfitting. However, I have a special symbol whose embedding I do want to train.

Using the provided Embedding layer, I can only use the trainable parameter to set the trainability of all embeddings at once, in the following way:

    from keras.layers import Embedding

    embedding_layer = Embedding(voc_size,
                                emb_dim,
                                weights=[embedding_matrix],
                                input_length=MAX_LEN,
                                trainable=False)

Is there a Keras-level solution to training only a subset of embeddings?

Please note:

  1. There is not enough data to generate new embeddings for all words.
  2. The answers I have found so far only relate to native TensorFlow.
Oleviaolfaction answered 27/2, 2018 at 13:2 Comment(0)

Found a nice workaround, inspired by Keith's two embedding layers.

Main idea:

Assign the special tokens (and the OOV token) the highest IDs. From each input sentence, derive a 'sentence' that contains only the special tokens and is zero-padded elsewhere. Then apply the non-trainable embeddings to the 'normal' sentence and the trainable embeddings to the special-token sentence. Lastly, add the two.

Works fine for me.

    # Imports needed for this snippet
    import numpy as np
    from keras.layers import Input, Embedding, Lambda, Activation, Add

    # Normal embs - '+2' for the empty (padding) token and the OOV token
    embedding_matrix = np.zeros((vocab_len + 2, emb_dim))
    # Special embs
    special_embedding_matrix = np.zeros((special_tokens_len + 2, emb_dim))

    # Here we may load the pre-trained (e.g. GloVe) vectors into embedding_matrix

    embedding_layer = Embedding(vocab_len + 2,
                                emb_dim,
                                mask_zero=True,
                                weights=[embedding_matrix],
                                input_length=MAX_SENT_LEN,
                                trainable=False)

    special_embedding_layer = Embedding(special_tokens_len + 2,
                                        emb_dim,
                                        mask_zero=True,
                                        weights=[special_embedding_matrix],
                                        input_length=MAX_SENT_LEN,
                                        trainable=True)

    # IDs 1..valid_words are 'normal' words; IDs above valid_words are special tokens
    valid_words = vocab_len - special_tokens_len

    sentence_input = Input(shape=(MAX_SENT_LEN,), dtype='int32')

    # Create a vector that keeps only the (shifted) special-token IDs, e.g.: [0,0,1,0,3,0,0]
    special_tokens_input = Lambda(lambda x: x - valid_words)(sentence_input)
    special_tokens_input = Activation('relu')(special_tokens_input)

    # Apply both the 'normal' embeddings and the special-token embeddings
    embedded_sequences = embedding_layer(sentence_input)
    embedded_special = special_embedding_layer(special_tokens_input)

    # Add the two embedding tensors
    embedded_sequences = Add()([embedded_sequences, embedded_special])
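
For completeness, a rough sketch of how the summed embeddings can feed the rest of the network (the LSTM/Dense layers, n_classes and the compile settings are just placeholders, not part of the workaround itself):

    # Hypothetical continuation: wire the summed embeddings into a downstream
    # network and build the model as usual.
    from keras.layers import LSTM, Dense
    from keras.models import Model

    x = LSTM(128)(embedded_sequences)          # any Conv/LSTM/etc. works here
    predictions = Dense(n_classes, activation='softmax')(x)

    model = Model(inputs=sentence_input, outputs=predictions)
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    model.summary()  # only special_embedding_layer contributes trainable embedding weights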
Oleviaolfaction answered 26/3, 2018 at 22:26 Comment(0)

I haven't found a nice solution like a mask for the Embedding layer. But here's what I've been meaning to try:

  • Two embedding layers - one trainable and one not
  • The non-trainable one has all the GloVe embeddings for in-vocab words and zero vectors for others
  • The trainable one only maps the OOV words and special symbols
  • The output of these two layers is added (I was thinking of this like ResNet)
  • The Conv/LSTM/etc below the embedding is unchanged

That would get you a solution with a small number of free parameters allocated to those embeddings.
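
A rough sketch of what I mean (glove_matrix, learned_matrix, vocab_size etc. are placeholders; both layers are sized to the full vocabulary so their outputs can be added):

    import numpy as np
    from keras.layers import Input, Embedding, Add, LSTM

    # Frozen layer: GloVe vectors for in-vocab words, zero rows for OOV/special IDs
    glove_matrix = np.zeros((vocab_size, emb_dim))      # fill in the GloVe vectors here
    # Trainable layer: zero rows for in-vocab words, rows to be learned for OOV/special IDs
    learned_matrix = np.zeros((vocab_size, emb_dim))

    frozen_emb = Embedding(vocab_size, emb_dim,
                           weights=[glove_matrix], trainable=False)
    learned_emb = Embedding(vocab_size, emb_dim,
                            weights=[learned_matrix], trainable=True)

    word_ids = Input(shape=(MAX_SENT_LEN,), dtype='int32')
    embedded = Add()([frozen_emb(word_ids), learned_emb(word_ids)])   # ResNet-style sum
    encoded = LSTM(128)(embedded)   # the Conv/LSTM/etc. below stays unchanged

(See the comment discussion below on whether the zero-initialized in-vocab rows of the second layer end up being trained.)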

Initiation answered 13/3, 2018 at 17:38 Comment(5)
I assume this means the two embedding layers are of the same size (to allow addition). However, this leaves us again with the same number of parameters, all from the trainable one. I have done some experiments, and the model trains embeddings initialized to 0 (it does not leave them 'as is'). Am I missing something? – Oleviaolfaction
When you say set to zero, do you mean word index 0, or that the whole word vector is initialized to zero? If it's the former, that's making a single OOV embedding shared amongst all OOVs. – Initiation
I meant the second - an emb_dim-long vector of 0s. The 0 word index is used for padding in my case, and is masked out. That is, my problem is that an embedding initialized to [0,...,0] will still end up being trained, so we are back to the same number of parameters from the trainable layer. – Oleviaolfaction
It just sounds so weird - I can't find any Keras documentation on what you're describing, and the only thing I could find on masking seemed to mask the 0 index. Also, the Keras model summary doesn't show them as trainable. Did you just train a model with a non-trainable embedding and double-check the vectors? – Initiation
When setting the layer to 'trainable=False', the vectors remain [0,...,0] as expected. With 'trainable=True', all these vectors are trained - except for the embedding of the 0 word (which is masked out, and also initialized to [0,...,0]) and the vector for the OOV token (which naturally receives no gradient during training), which remain all zeros. All the others are trained. – Oleviaolfaction
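
For reference, one way to double-check which embedding rows were actually updated (a rough sketch; model, X_train and y_train are placeholders):

    import numpy as np

    # Snapshot the special-token embedding matrix, train briefly, and compare
    before = special_embedding_layer.get_weights()[0].copy()
    model.fit(X_train, y_train, epochs=1)
    after = special_embedding_layer.get_weights()[0]

    # Indices of the rows that received gradient updates
    changed_rows = np.where(np.abs(after - before).sum(axis=1) > 0)[0]
    print(changed_rows)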
