I needed an answer to this too, and figured out what I needed through the link at the bottom of your question.
In short, you do as the answer in the link says, but you 'simply' leave out the embedding layer if you're not interested in using one. I'd highly recommend reading and understanding the linked answer as it goes into more detail, and the docs on Masking, but here's a modified version which uses a masking layer over the sequence inputs to replace 'sequence_length':
import numpy as np
import tensorflow as tf
pad_value = 0.37
# This is our input to the RNN, in [batch_size, max_sequence_length, num_features] shape
test_input = np.array(
[[[1., 1. ],
[2, 2. ],
[1., 1. ],
[pad_value, pad_value], # <- a row/time step which contains all pad_values will be masked through the masking layer
[pad_value, pad_value]],
[[pad_value, pad_value],
[1., 1. ],
[2, 2. ],
[1., 1. ],
[pad_value, pad_value]]])
# Define the mask layer, telling it to mask all time steps that contain all pad_value values
mask = tf.keras.layers.Masking(mask_value=pad_value)
rnn = tf.keras.layers.GRU(
1,
return_sequences=True,
activation=None, # <- these values and below are just used to initialise the RNN in a repeatable way for this example
recurrent_activation=None,
kernel_initializer='ones',
recurrent_initializer='zeros',
use_bias=True,
bias_initializer='ones'
)
x = tf.keras.layers.Input(shape=test_input.shape[1:])
m0 = tf.keras.Model(inputs=x, outputs=rnn(x))
m1 = tf.keras.Model(inputs=x, outputs=mask(x))
m2 = tf.keras.Model(inputs=x, outputs=rnn(mask(x)))
print('raw inputs\n', test_input)
print('raw rnn output (no mask)\n', m0.predict(test_input).squeeze())
print('masked inputs\n', m1.predict(test_input).squeeze())
print('masked rnn output\n', m2.predict(test_input).squeeze())
out:
raw inputs
[[[1. 1. ]
[2. 2. ]
[1. 1. ]
[0.37 0.37]
[0.37 0.37]]
[[0.37 0.37]
[1. 1. ]
[2. 2. ]
[1. 1. ]
[0.37 0.37]]]
raw rnn output (no mask)
[[ -6. -50. -156. -272.7276 -475.83362 ]
[ -1.2876 -9.862801 -69.314 -213.94202 -373.54672 ]]
masked inputs
[[[1. 1.]
[2. 2.]
[1. 1.]
[0. 0.]
[0. 0.]]
[[0. 0.]
[1. 1.]
[2. 2.]
[1. 1.]
[0. 0.]]]
masked rnn output
[[ -6. -50. -156. -156. -156.]
[ 0. -6. -50. -156. -156.]]
Notice how with the mask applied, the calculations are not performed on a time step where the mask is active (i.e. where the sequence is padded out). Instead, state from the previous time step is carried forward.
A few other points to note:
- In the linked (and this) example, the RNN is created with various activation and initializer parameters. I assume this is to initialize the RNN to a known state for repeatability for the example. In practice, you would initialize the RNN how you would like.
- The pad value can be any value you specify. Typically, padding using zeros is used. In the linked (and this) example, a value of 0.37 is used. I can only assume it is an arbitrary value to show the difference in the raw and masked RNN outputs, as a zero input value with this example RNN initialisation gives little/no difference in output, therefore 'some' value (i.e. 0.37) demonstrates the effect of the masking.
- The Masking docs state that rows/time steps are masked only if all of the values for that time step contain the mask value. For example, in the above, a time step of
[0.37, 2]
would still be fed to the network with those values, however, a time step of [0.37, 0.37]
would be skipped over.
- An alternative approach to this problem instead of masking would be to train several times by batching the different sequence lengths together. For example, if you have a mix of sequence lengths of 10, 20, and 30, instead of padding them all out to 30 and masking, train using all your 10 sequence lengths, then your 20s, then 30s. Or if you have say lots of 100 sequence lengths, and also lots of 3, 4, 5 sequence lengths, you may want to pad your smaller ones to all 5 length and train twice using 100s and padded/masked 5s. You will likely gain training speed, but at the trade-off of less accuracy as you won't be able to shuffle between batches of different sequence lengths.
model.fit()
has a fixed batch size right? What if different length sequences form different batch size? – Spruce