I've been thinking about zero-padding of word sequences and about what happens to that padding once it reaches the Embedding layer. At first glance, one would think you'd want the embeddings of the padding token to be 0.0 as well. However, the Embedding layer in Keras initializes random values for every input token, including the padding index 0, and there is no way to force it to produce 0.0's for a specific token. Note that mask_zero does something different; I've already checked.
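To see this, here is a quick check (a hypothetical snippet, not from my actual pipeline) showing that the row for index 0 is just another randomly initialized row:
from keras.layers import Embedding, Input
from keras.models import Model

inp = Input(shape=(3,), dtype='int32')
out = Embedding(input_dim=10, output_dim=4)(inp)
m = Model(inp, out)

# the first row of the embedding matrix belongs to token 0 (the padding index)
print(m.get_weights()[0][0])  # random values, not zeros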
One might ask why worry about this at all: the code seems to work even when the embeddings for the padding token are not 0.0's, as long as they are all the same. So I came up with an example, albeit somewhat contrived, where setting the embeddings of the 0-padded token to 0.0's makes a difference.
I used the 20 Newsgroups data set (from sklearn.datasets import fetch_20newsgroups). I do some minimal preprocessing: removal of punctuation, stopwords, and numbers. I use from keras.preprocessing.sequence import pad_sequences for the zero-padding. I split the ~18K posts into training and validation sets at a training/validation ratio of 4/1.
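For reference, the data preparation looks roughly like this (a minimal sketch; the punctuation/stopword/number cleanup is omitted, and the exact split call is an assumption on my part):
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

MAX_SEQUENCE_LENGTH = 1100

newsgroups = fetch_20newsgroups(subset='all')
texts = newsgroups.data            # cleanup of punctuation/stopwords/numbers omitted here
num_labels = len(newsgroups.target_names)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(texts)

# pad/truncate every post to MAX_SEQUENCE_LENGTH; index 0 is reserved for padding
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
labels = to_categorical(newsgroups.target)

x_train, x_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)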
I create a simple network with one dense hidden layer, whose input is the flattened sequence of embeddings:
from keras.layers import Input, Embedding, Flatten, Dense, Dropout
from keras.models import Model

EMBEDDING_DIM = 300
MAX_SEQUENCE_LENGTH = 1100
layer_size = 25
dropout = 0.3

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
# one embedding row per word in the vocabulary, plus row 0 for the padding token
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH, name='embedding_dnn')
embedded_sequences = embedding_layer(sequence_input)
x = Flatten(name='flatten_dnn')(embedded_sequences)
x = Dense(layer_size, activation='relu', name='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name='output_dnn')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
The model has about 14M trainable parameters (this example is a bit contrived, as I've already mentioned): the flattened embeddings form a 1100 × 300 = 330,000-dimensional input to the hidden layer, so that Dense layer alone holds ~8.25M weights. When I train it
from keras.callbacks import EarlyStopping

earlystop = EarlyStopping(monitor='val_loss', patience=5)
history = model.fit(x_train, y_train, validation_data=(x_test, y_test),
                    epochs=30, batch_size=BATCH_SIZE, callbacks=[earlystop])
it looks like for the first 4 epochs the algorithm struggles to find its way out of the 'randomness':
Train on 15048 samples, validate on 3798 samples
Epoch 1/30
15048/15048 [==============================] - 58s 4ms/step - loss: 3.1118 - acc: 0.0519 - val_loss: 2.9894 - val_acc: 0.0534
Epoch 2/30
15048/15048 [==============================] - 56s 4ms/step - loss: 2.9820 - acc: 0.0556 - val_loss: 2.9827 - val_acc: 0.0527
Epoch 3/30
15048/15048 [==============================] - 55s 4ms/step - loss: 2.9712 - acc: 0.0626 - val_loss: 2.9718 - val_acc: 0.0579
Epoch 4/30
15048/15048 [==============================] - 55s 4ms/step - loss: 2.9259 - acc: 0.0756 - val_loss: 2.8363 - val_acc: 0.0874
Epoch 5/30
15048/15048 [==============================] - 56s 4ms/step - loss: 2.7092 - acc: 0.1390 - val_loss: 2.3251 - val_acc: 0.2796
...
Epoch 13/30
15048/15048 [==============================] - 56s 4ms/step - loss: 0.0698 - acc: 0.9807 - val_loss: 0.5010 - val_acc: 0.8736
It ends up with a validation accuracy of ~0.87:
print('Best validation accuracy is', max(history.history['val_acc']))
Best validation accuracy is 0.874934175379845
However, when I explicitly set the embeddings for the padded 0's to 0.0
from keras import backend as K

def myMask(x):
    # boolean mask: True for real tokens (index > 0), False for padding
    mask = K.greater(x, 0)
    # cast to 0.0/1.0 so it can be multiplied with the embeddings
    mask = K.cast(mask, dtype=K.floatx())
    return mask
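# Quick sanity check of the mask (hypothetical, evaluated interactively):
#   K.eval(myMask(K.variable([[3., 7., 0., 0.]])))  # -> [[1., 1., 0., 0.]]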
from keras.layers import Lambda, Reshape, Multiply

layer_size = 25
dropout = 0.3

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
embedding_layer = Embedding(len(word_index) + 1, EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH, name='embedding_dnn')
embedded_sequences = embedding_layer(sequence_input)

# build a (batch, MAX_SEQUENCE_LENGTH, 1) mask of 0's and 1's from the raw token ids
y = Lambda(myMask, output_shape=(MAX_SEQUENCE_LENGTH,))(sequence_input)
y = Reshape(target_shape=(MAX_SEQUENCE_LENGTH, 1))(y)
# zero out the embeddings at the padded positions
merge_layer = Multiply(name='masked_embedding_dnn')([embedded_sequences, y])

x = Flatten(name='flatten_dnn')(merge_layer)
x = Dense(layer_size, activation='relu', name='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name='output_dnn')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
the model with the same number of parameters immediately finds its way out of the 'randomness':
Train on 15048 samples, validate on 3798 samples
Epoch 1/30
15048/15048 [==============================] - 64s 4ms/step - loss: 2.4356 - acc: 0.3060 - val_loss: 1.2424 - val_acc: 0.7754
Epoch 2/30
15048/15048 [==============================] - 61s 4ms/step - loss: 0.6973 - acc: 0.8267 - val_loss: 0.5240 - val_acc: 0.8797
...
Epoch 10/30
15048/15048 [==============================] - 61s 4ms/step - loss: 0.0496 - acc: 0.9881 - val_loss: 0.4176 - val_acc: 0.8944
and ends up with a better accuracy of ~0.9.
Again, this is a somewhat contrived example, but it still shows that keeping those 'padded' embeddings at 0.0 can be beneficial.
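For completeness, one could presumably also pin the padding row to zero with a custom embeddings_constraint instead of the Multiply trick; the following is an untested sketch of that idea (ZeroRowConstraint is just a name I made up):
import numpy as np
from keras import backend as K
from keras.constraints import Constraint

class ZeroRowConstraint(Constraint):
    """Hypothetical constraint: after every weight update, re-zero the
    embedding row for token 0 so the padding embedding stays at 0.0."""
    def __call__(self, w):
        rows, cols = K.int_shape(w)
        mask = np.ones((rows, cols), dtype=K.floatx())
        mask[0] = 0.0  # kill the row that belongs to the padding token
        return w * K.constant(mask)

# usage: Embedding(..., embeddings_constraint=ZeroRowConstraint())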
Am I missing something here? And if I'm not, why doesn't Keras provide this functionality out of the box?
UPDATE
@DanielMöller I tried your suggestion:
from keras.initializers import RandomUniform
from keras.constraints import NonNeg

layer_size = 25
dropout = 0.3

# strictly positive uniform initialization, plus a non-negativity constraint
init = RandomUniform(minval=0.0001, maxval=0.05, seed=None)
constr = NonNeg()

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32', name='dnn_input')
embedding_layer = Embedding(len(word_index) + 1,
                            EMBEDDING_DIM,
                            input_length=MAX_SEQUENCE_LENGTH,
                            name='embedding_dnn',
                            embeddings_initializer=init,
                            embeddings_constraint=constr)
embedded_sequences = embedding_layer(sequence_input)

y = Lambda(myMask, output_shape=(MAX_SEQUENCE_LENGTH,))(sequence_input)
y = Reshape(target_shape=(MAX_SEQUENCE_LENGTH, 1))(y)
merge_layer = Multiply(name='masked_embedding_dnn')([embedded_sequences, y])

x = Flatten(name='flatten_dnn')(merge_layer)
x = Dense(layer_size, activation='relu', name='hidden_dense_dnn')(x)
x = Dropout(dropout, name='dropout')(x)
preds = Dense(num_labels, activation='softmax', name='output_dnn')(x)

model = Model(sequence_input, preds)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Unfortunately, the network remained stuck in the 'randomness':
Train on 15197 samples, validate on 3649 samples
Epoch 1/30
15197/15197 [==============================] - 60s 4ms/step - loss: 3.1354 - acc: 0.0505 - val_loss: 2.9943 - val_acc: 0.0496
....
Epoch 24/30
15197/15197 [==============================] - 60s 4ms/step - loss: 2.9905 - acc: 0.0538 - val_loss: 2.9907 - val_acc: 0.0496
I also tried it without the NonNeg() constraint, with the same result.