About correctly using dropout in RNNs (Keras)

I am confused about how to correctly use dropout with RNNs in Keras, specifically with GRU units. The Keras documentation refers to this paper (https://arxiv.org/abs/1512.05287), and I understand that the same dropout mask should be used for all time steps. This is achieved by the dropout argument when specifying the GRU layer itself. What I don't understand is:

Why are there several examples on the internet, including Keras's own example (https://github.com/keras-team/keras/blob/master/examples/imdb_bidirectional_lstm.py) and the "Trigger word detection" assignment in Andrew Ng's Coursera Sequence Models course, that add a dropout layer explicitly with "model.add(Dropout(0.5))", which, in my understanding, will apply a different mask at every time step?
The paper mentioned above suggests that doing this is inappropriate and that we might lose the signal as well as long-term memory due to the accumulation of this dropout noise over all the time steps. But then, how are these models (using different dropout masks at every time step) able to learn and perform well?

I have trained a model myself that uses different dropout masks at every time step, and although I haven't gotten the results I wanted, the model is able to overfit the training data. This, in my understanding, contradicts the claims about "accumulation of noise" and "signal getting lost" over all the time steps (the series I feed to the GRU layers are 1000 time steps long).

Any insights, explanations or experience with the situation will be helpful. Thanks.

UPDATE:

To make this clearer, here is an extract from the Keras documentation of the Dropout layer: "noise_shape: 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input. For instance, if your inputs have shape (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features)." So, I believe, when using a Dropout layer explicitly and needing the same mask at every time step (as described in the paper), we need to set this noise_shape argument, which is not done in the examples I linked earlier.
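A minimal sketch of the difference the quoted noise_shape passage describes, assuming TensorFlow's bundled Keras is installed. The tensor sizes and variable names are made up for illustration and are not taken from any of the linked examples.

```python
import numpy as np
from tensorflow import keras

batch_size, timesteps, features = 2, 5, 3  # illustrative sizes only
x = np.ones((batch_size, timesteps, features), dtype="float32")

# Default Dropout: an independent mask entry for every (sample, timestep, feature).
per_step = keras.layers.Dropout(0.5)

# A 1 on the time axis of noise_shape: one mask per sample, broadcast over time.
shared = keras.layers.Dropout(0.5, noise_shape=(batch_size, 1, features))

# training=True forces dropout to be applied outside of fit().
y_per_step = np.asarray(per_step(x, training=True))
y_shared = np.asarray(shared(x, training=True))

# With an all-ones input the output is just the rescaled mask,
# so each timestep of y_shared can be compared against the first one.
same_over_time = bool(np.all(y_shared == y_shared[:, :1, :]))
print(same_over_time)
```

Because the input is all ones, y_per_step and y_shared can be printed directly to inspect the two mask patterns.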

As Asterisk explained in his comment, there is a fundamental difference between dropout within a recurrent unit and dropout after the unit's output. This is the architecture from the keras tutorial you linked in your question:

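from keras.models import Sequential
from keras.layers import Bidirectional, Dense, Dropout, Embedding, LSTM

# max_features and maxlen are defined earlier in the linked example
model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
# Dropout on the LSTM's final output, outside the recurrence
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))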

You're adding a dropout layer after the LSTM has finished its computation, meaning that there won't be any more recurrent passes in that unit. Since LSTM(64) returns only its final output here (return_sequences is left at its default of False), the Dropout layer sees a single vector per sequence. Imagine this dropout layer as teaching the network not to rely on any single output feature, but to generalize over the information the LSTM has gathered from different features and time steps. Dropout here is no different from dropout in feed-forward architectures.
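To check what the Dropout layer in that architecture actually receives, here is a small sketch that builds only the Embedding and Bidirectional LSTM part and prints the output shape. It assumes TensorFlow's bundled Keras; the values of max_features and maxlen are placeholders for illustration, not the ones from the tutorial.

```python
import numpy as np
from tensorflow import keras

max_features, maxlen = 1000, 10  # placeholder values, not from the tutorial

inputs = keras.Input(shape=(maxlen,), dtype="int32")
h = keras.layers.Embedding(max_features, 128)(inputs)
# return_sequences defaults to False, so only the final output is returned;
# Bidirectional concatenates the forward and backward outputs by default.
h = keras.layers.Bidirectional(keras.layers.LSTM(64))(h)
encoder = keras.Model(inputs, h)

tokens = np.random.randint(0, max_features, size=(4, maxlen))
out = encoder(tokens)
print(out.shape)  # (batch, 2 * 64): there is no time axis left
```

With no time axis in this output, a Dropout layer placed after it has no time steps to draw different masks for.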

What Gal & Ghahramani propose in their paper (which you linked in the question) is dropout within the recurrent unit. There, you're dropping input information between the time steps of a sequence. I found this blogpost very helpful for understanding the paper and how it relates to the Keras implementation.
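For comparison, dropout inside the recurrent unit is requested through arguments of the recurrent layer itself rather than through a separate Dropout layer. The sketch below assumes TensorFlow's bundled Keras; the layer sizes, rates, and input shape are arbitrary illustration values.

```python
import numpy as np
from tensorflow import keras

timesteps, features = 20, 8  # illustrative sizes, not from the question

model = keras.Sequential([
    keras.Input(shape=(timesteps, features)),
    # dropout masks the layer inputs, recurrent_dropout masks the recurrent state.
    keras.layers.GRU(16, dropout=0.2, recurrent_dropout=0.2),
    keras.layers.Dense(1, activation="sigmoid"),
])

x = np.random.rand(4, timesteps, features).astype("float32")
y = model(x, training=True)  # the masks are only active when training=True
print(y.shape)
```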
