I've been thinking about zero-padding of word sequences and about what happens to that padding once it reaches the Embedding layer. At first glance, one would think you'd want the embeddings of the padding token to be 0.0 as well. However, the Embedding layer in Keras initializes random values for every input token, including the padding index 0, and there is no way to force it to produce 0.0's for a specific token. Note that mask_zero does something different; I've already checked.
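To see this, here is a quick check (a hypothetical snippet, not from my actual pipeline) showing that the row for index 0 is just another randomly initialized row:
from keras.layers import Embedding, Input
from keras.models import Model

inp = Input(shape=(3,), dtype='int32')
out = Embedding(input_dim=10, output_dim=4)(inp)
m = Model(inp, out)

# the first row of the embedding matrix belongs to token 0 (the padding index)
print(m.get_weights()[0][0])  # random values, not zeros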
One might ask why worry about this at all: the code seems to work even when the embeddings for the padding token are not 0.0's, as long as they are all the same. So I came up with an example, albeit somewhat contrived, where setting the embeddings of the 0-padded token to 0.0's makes a difference.
I used the 20 Newsgroups data set (from sklearn.datasets import fetch_20newsgroups). I do some minimal preprocessing: removal of punctuation, stopwords, and numbers. I use from keras.preprocessing.sequence import pad_sequences for the zero-padding. I split the ~18K posts into training and validation sets at a training/validation ratio of 4/1.
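For reference, the data preparation looks roughly like this (a minimal sketch; the punctuation/stopword/number cleanup is omitted, and the exact split call is an assumption on my part):
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

MAX_SEQUENCE_LENGTH = 1100

newsgroups = fetch_20newsgroups(subset='all')
texts = newsgroups.data            # cleanup of punctuation/stopwords/numbers omitted here
num_labels = len(newsgroups.target_names)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(texts)

# pad/truncate every post to MAX_SEQUENCE_LENGTH; index 0 is reserved for padding
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(newsgroups.target)

x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)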
I create a simple network with one dense hidden layer, whose input is the flattened sequence of embeddings:
from keras.layers import Input, Embedding, Flatten, Dense, Dropout
from keras.models import Model

EMBEDDING_DIM = 300
MAX_SEQUENCE_LENGTH = 1100
layer_size = 25
dropout = 0.3

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
# one embedding row per word in the vocabulary, plus row 0 for the padding token
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH, name='embedding_dnn')
embedded_sequences = embedding_layer(sequence_input)
x = Flatten(name='flatten_dnn')(embedded_sequences)
x = Dense(layer_size, activation='relu', name='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name='output_dnn')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
The model has about 14M trainable parameters (this example is a bit contrived, as I've already mentioned): the flattened embeddings form a 1100 × 300 = 330,000-dimensional input to the hidden layer, so that Dense layer alone holds ~8.25M weights. When I train it
from keras.callbacks import EarlyStopping

earlystop = EarlyStopping(monitor='val_loss', patience=5)
history = model.fit(x_train, y_train, validation_data=(x_test, y_test),
                    epochs=30, batch_size=BATCH_SIZE, callbacks=[earlystop])
it looks like for the first 4 epochs the algorithm struggles to find its way out of the 'randomness':
Train on 15048 samples, validate on 3798 samples
Epoch 1/30
15048/15048 [==============================] - 58s 4ms/step - loss: 3.1118 - acc: 0.0519 - val_loss: 2.9894 - val_acc: 0.0534
Epoch 2/30
15048/15048 [==============================] - 56s 4ms/step - loss: 2.9820 - acc: 0.0556 - val_loss: 2.9827 - val_acc: 0.0527
Epoch 3/30
15048/15048 [==============================] - 55s 4ms/step - loss: 2.9712 - acc: 0.0626 - val_loss: 2.9718 - val_acc: 0.0579
Epoch 4/30
15048/15048 [==============================] - 55s 4ms/step - loss: 2.9259 - acc: 0.0756 - val_loss: 2.8363 - val_acc: 0.0874
Epoch 5/30
15048/15048 [==============================] - 56s 4ms/step - loss: 2.7092 - acc: 0.1390 - val_loss: 2.3251 - val_acc: 0.2796
...
Epoch 13/30
15048/15048 [==============================] - 56s 4ms/step - loss: 0.0698 - acc: 0.9807 - val_loss: 0.5010 - val_acc: 0.8736
It ends up with a validation accuracy of ~0.87:
print('Best validation accuracy is', max(history.history['val_acc']))
Best validation accuracy is 0.874934175379845
However, when I explicitly set the embeddings for the padded 0's to 0.0
from keras import backend as K

def myMask(x):
    # boolean mask: True for real tokens (index > 0), False for padding
    mask = K.greater(x, 0)
    # cast to 0.0/1.0 so it can be multiplied with the embeddings
    mask = K.cast(mask, dtype=K.floatx())
    return mask
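# Quick sanity check of the mask (hypothetical, evaluated interactively):
#   K.eval(myMask(K.variable([[3., 7., 0., 0.]])))  # -> [[1., 1., 0., 0.]]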
from keras.layers import Lambda, Reshape, Multiply

layer_size = 25
dropout = 0.3

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH, name='embedding_dnn')
embedded_sequences = embedding_layer(sequence_input)

# build a (batch, MAX_SEQUENCE_LENGTH, 1) mask of 0's and 1's from the raw token ids
y = Lambda(myMask, output_shape=(MAX_SEQUENCE_LENGTH,))(sequence_input)
y = Reshape(target_shape=(MAX_SEQUENCE_LENGTH, 1))(y)
# zero out the embeddings at the padded positions
merge_layer = Multiply(name='masked_embedding_dnn')([embedded_sequences, y])

x = Flatten(name='flatten_dnn')(merge_layer)
x = Dense(layer_size, activation='relu', name='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name='output_dnn')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
the model with the same number of parameters immediately finds its way out of the 'randomness':
Train on 15048 samples, validate on 3798 samples
Epoch 1/30
15048/15048 [==============================] - 64s 4ms/step - loss: 2.4356 - acc: 0.3060 - val_loss: 1.2424 - val_acc: 0.7754
Epoch 2/30
15048/15048 [==============================] - 61s 4ms/step - loss: 0.6973 - acc: 0.8267 - val_loss: 0.5240 - val_acc: 0.8797
...
Epoch 10/30
15048/15048 [==============================] - 61s 4ms/step - loss: 0.0496 - acc: 0.9881 - val_loss: 0.4176 - val_acc: 0.8944
and ends up with a better accuracy of ~0.9.
Again, this is a somewhat contrived example, but it still shows that keeping those 'padded' embeddings at 0.0 can be beneficial.
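For completeness, one could presumably also pin the padding row to zero with a custom embeddings_constraint instead of the Multiply trick; the following is an untested sketch of that idea (ZeroRowConstraint is just a name I made up):
import numpy as np
from keras import backend as K
from keras.constraints import Constraint

class ZeroRowConstraint(Constraint):
    """Hypothetical constraint: after every weight update, re-zero the
    embedding row for token 0 so the padding embedding stays at 0.0."""
    def __call__(self, w):
        rows, cols = K.int_shape(w)
        mask = np.ones((rows, cols), dtype=K.floatx())
        mask[0] = 0.0  # kill the row that belongs to the padding token
        return w * K.constant(mask)

# usage: Embedding(..., embeddings_constraint=ZeroRowConstraint())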
Am I missing something here? And if I'm not, why doesn't Keras provide this functionality out of the box?
UPDATE
@DanielMöller I tried your suggestion:
from keras.initializers import RandomUniform
from keras.constraints import NonNeg

layer_size = 25
dropout = 0.3

# strictly positive uniform initialization, plus a non-negativity constraint
init = RandomUniform(minval=0.0001, maxval=0.05, seed=None)
constr = NonNeg()

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            name='embedding_dnn',
                            embeddings_initializer=init,
                            embeddings_constraint=constr)
embedded_sequences = embedding_layer(sequence_input)

y = Lambda(myMask, output_shape=(MAX_SEQUENCE_LENGTH,))(sequence_input)
y = Reshape(target_shape=(MAX_SEQUENCE_LENGTH, 1))(y)
merge_layer = Multiply(name='masked_embedding_dnn')([embedded_sequences, y])

x = Flatten(name='flatten_dnn')(merge_layer)
x = Dense(layer_size, activation='relu', name='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name='output_dnn')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Unfortunately, the network remained stuck in the 'randomness':
Train on 15197 samples, validate on 3649 samples
Epoch 1/30
15197/15197 [==============================] - 60s 4ms/step - loss: 3.1354 - acc: 0.0505 - val_loss: 2.9943 - val_acc: 0.0496
....
Epoch 24/30
15197/15197 [==============================] - 60s 4ms/step - loss: 2.9905 - acc: 0.0538 - val_loss: 2.9907 - val_acc: 0.0496
I also tried it without the NonNeg() constraint, with the same result.