About correctly using dropout in RNNs (Keras)

I am confused about how to correctly use dropout with RNNs in Keras, specifically with GRU units. The Keras documentation refers to this paper (https://arxiv.org/abs/1512.05287), and I understand that the same dropout mask should be used for all time-steps. This is achieved with the dropout argument when specifying the GRU layer itself. What I don't understand is:

  1. Why there are several examples on the internet, including Keras's own example (https://github.com/keras-team/keras/blob/master/examples/imdb_bidirectional_lstm.py) and the "Trigger word detection" assignment in Andrew Ng's Coursera Sequence Models course, where a dropout layer is added explicitly with "model.add(Dropout(0.5))", which, in my understanding, will apply a different mask at every time-step.

  2. The paper mentioned above suggests that doing this is inappropriate and that we might lose the signal as well as long-term memory due to the accumulation of this dropout noise over all the time-steps. But then, how are these models (using different dropout masks at every time-step) able to learn and perform well?

I have myself trained a model which uses a different dropout mask at every time-step, and although I haven't gotten the results I wanted, the model is able to overfit the training data. This, in my understanding, invalidates the "accumulation of noise" and "signal getting lost" arguments over all the time-steps (I have 1000-time-step series being input to the GRU layers).

Any insights, explanations or experience with the situation will be helpful. Thanks.

UPDATE:

To make it clearer, here is an extract from the Keras documentation of the Dropout layer ("noise_shape: 1D integer tensor representing the shape of the binary dropout mask that will be multiplied with the input. For instance, if your inputs have shape (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features)"). So, I believe, when using the Dropout layer explicitly and needing the same mask at every time-step (as mentioned in the paper), we need to set this noise_shape argument, which is not done in the examples I linked earlier.
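
For illustration, a minimal sketch of what I mean (the layer sizes and the 1000-time-step input shape are placeholders, not taken from the linked examples): an explicit Dropout layer between recurrent layers, with noise_shape set so that one mask is reused across all time-steps.

from keras.models import Sequential
from keras.layers import GRU, Dropout, Dense

model = Sequential()
model.add(GRU(64, return_sequences=True, input_shape=(1000, 32)))
# noise_shape=(batch, 1, features): the 1 on the time axis broadcasts the same
# mask over all time-steps; None is filled in with the actual batch size.
model.add(Dropout(0.5, noise_shape=(None, 1, 64)))
model.add(GRU(64))
model.add(Dense(1, activation='sigmoid'))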

Jaggers answered 22/5, 2018 at 1:8 Comment(1)
There are several types of dropout. The example code you linked uses explicit output dropout, i.e. some outputs of the previous layer are not propagated to the next layer. The dropout parameter in GRU applies dropout to the inputs of the GRU cell, while recurrent_dropout applies dropout to the recurrent connections. You can find more explanation with examples here: machinelearningmastery.com/…Bernt
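
To make those three kinds concrete, a minimal sketch (the rates and layer sizes are illustrative, not from the comment):

from keras.models import Sequential
from keras.layers import GRU, Dropout, Dense

model = Sequential()
model.add(GRU(64,
              dropout=0.2,            # dropout on the inputs to the GRU cell
              recurrent_dropout=0.2,  # dropout on the recurrent connections
              input_shape=(1000, 32)))
model.add(Dropout(0.5))               # explicit output dropout on the GRU's final output
model.add(Dense(1, activation='sigmoid'))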

As Asterisk explained in his comment, there is a fundamental difference between dropout within a recurrent unit and dropout after the unit's output. This is the architecture from the Keras tutorial you linked in your question:

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM, Dropout, Dense

model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
model.add(Bidirectional(LSTM(64)))
model.add(Dropout(0.5))              # applied once, to the LSTM's final output
model.add(Dense(1, activation='sigmoid'))

You're adding a dropout layer after the LSTM has finished its computation, meaning that there won't be any more recurrent passes in that unit. Think of this dropout layer as teaching the network not to rely on the output for a specific feature of a specific time step, but to generalize over information in different features and time steps. Dropout here is no different from dropout in feed-forward architectures.

What Gal & Ghahramani propose in their paper (which you linked in the question) is dropout within the recurrent unit. There, you're dropping input information between the time steps of a sequence. I found this blog post very helpful for understanding the paper and how it relates to the Keras implementation.
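
For comparison, a sketch of the same architecture with the dropout moved inside the recurrent unit, using the LSTM layer's own dropout and recurrent_dropout arguments, which apply their masks within the recurrent computation (the rates are illustrative; max_features and maxlen are the same variables as in the snippet above):

model = Sequential()
model.add(Embedding(max_features, 128, input_length=maxlen))
# dropout: mask on the cell's inputs; recurrent_dropout: mask on the recurrent
# connections, applied at every time step rather than once on the final output.
model.add(Bidirectional(LSTM(64, dropout=0.25, recurrent_dropout=0.25)))
model.add(Dense(1, activation='sigmoid'))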

Chosen answered 26/3, 2019 at 9:37 Comment(3)
Hi @Merlin. I did understand what you are saying; I forgot to update the question with an answer. I would like to point out, for completeness, that the source of my confusion was that I was using the argument return_sequences=True instead of the default False. So adding Dropout in that case would be incorrect as per the paper. But if return_sequences=False, only the feature vectors of the final time steps (forward and/or backward) are returned, and the dropout mask can be applied like this.Jaggers
there won't be any more recurrent passes in that unit - do you mean that it breaks the recurrent behaviour entirely, or just that dropouts won't be applied recurrently?Unstring
@Unstring what I meant is that the LSTM layer already finished its computation and will not be called again during that forward pass. I hope this clarifies my answer.Chosen
