Correct order for SpatialDropout2D, BatchNormalization and activation function?

For a CNN architecture I want to use a SpatialDropout2D layer instead of a Dropout layer. Additionally, I want to use BatchNormalization. So far I had always placed the BatchNormalization directly after a convolutional layer but before the activation function, as mentioned in the paper by Ioffe and Szegedy. The dropout layers I had always placed after the MaxPooling2D layer.

In https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/ the SpatialDropout2D is placed directly after the convolutional layer.

I find it rather confusing in which order I should now apply these layers. I had also read on a Keras page that SpatialDropout should be placed directly after the Conv layer (but I can't find that page anymore).

Is the following order correct?

ConvLayer - SpatialDropout - BatchNormalization - Activation function - MaxPooling

I am really hoping for tips, and thank you in advance.

Update: My goal was actually to swap dropout for spatial dropout in the following CNN architecture:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, BatchNormalization, Activation,
                                     MaxPooling2D, Dropout, Flatten, Dense)

model = Sequential()
model.add(Conv2D(32, (3, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))

model.add(Conv2D(64, (3, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))

model.add(Flatten())
model.add(Dense(512))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.4))
model.add(Dense(10))
model.add(Activation('softmax'))
Stamp answered 7/1, 2020 at 19:20 Comment(0)

Dropout vs BatchNormalization - Standard deviation issue

A big problem appears when you mix these layers, especially when BatchNormalization comes right after Dropout.

Dropout tries to keep the same mean of the outputs as without dropout, but it does change the standard deviation, which will cause a huge difference in BatchNormalization between training and validation. (During training, BatchNormalization receives the changed standard deviations, accumulates them and stores them. During validation, the dropouts are turned off, so the standard deviation is no longer a changed one, but the original. But BatchNormalization, because it is in inference mode, will not use the batch statistics, but the stored statistics, which will be very different from the batch statistics.)

So, the first and most important rule is: don't place a BatchNormalization after a Dropout (or a SpatialDropout).

Usually, I try to leave at least two convolutional/dense layers without any dropout before applying a batch normalization, to avoid this.
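To see the standard-deviation shift concretely, here is a minimal NumPy sketch of inverted dropout (the scaling scheme Keras uses), showing what a following BatchNormalization would see during training versus at inference:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)          # roughly unit-variance activations
rate = 0.5

# Inverted dropout as applied at training time: zero entries, rescale survivors
mask = rng.random(x.shape) >= rate
x_train = x * mask / (1.0 - rate)

print(x.std())        # ~1.0  -> what the layer sees at inference (dropout off)
print(x_train.std())  # ~1.41 -> what BatchNormalization sees (and stores) during training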

Dropout vs BatchNormalization - Changing the zeros to another value

Also important: the role of Dropout is to "zero" the influence of some of the weights on the next layer. If you apply a normalization after the dropout, you will not have "zeros" anymore, but a certain value that is repeated for many units. And this value will vary from batch to batch. So, although noise is added, you are not killing units the way a pure dropout is supposed to.
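A minimal NumPy sketch of that effect (plain standardization of one feature across a batch, standing in for BatchNormalization with gamma = 1 and beta = 0):

import numpy as np

rng = np.random.default_rng(1)
feature = rng.normal(loc=2.0, size=6)      # one feature over a batch of 6 samples
feature[[1, 4]] = 0.0                      # dropout zeroed it for two of the samples

normalized = (feature - feature.mean()) / feature.std()

print(normalized[[1, 4]])  # the former zeros are now identical non-zero values,
                           # and that value changes with every batch's statistics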

Dropout vs MaxPooling

The problem with using a regular Dropout before MaxPooling is that you zero some pixels, and then the MaxPooling takes the maximum value, sort of ignoring part of your dropout. If the dropout happens to hit a maximum pixel, the pooling will result in the second maximum, not in zero.

So, Dropout before MaxPooling reduces the effectiveness of the dropout.
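A tiny sketch of one pooling window (ignoring the 1/(1 - rate) rescaling for simplicity):

import numpy as np

window = np.array([0.2, 0.9, 0.4, 0.7])   # one 2x2 pooling window, flattened

dropped = window.copy()
dropped[1] = 0.0                          # regular dropout happens to hit the maximum pixel

print(window.max())    # 0.9 -> without dropout
print(dropped.max())   # 0.7 -> pooling falls back to the second-largest value, not zero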

SpatialDropout vs MaxPooling

But, a SpatialDropout never hits "pixels", it only hits channels. When it hits a channel, it will zero all pixels for that channel, thus, the MaxPooling will effectively result in zero too.

So, there is no difference between spatial dropout before or after the pooling: an entire "channel" will be zero in both orders.
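A minimal TensorFlow/Keras sketch (assuming TF 2.x) of the channel-wise behavior; each channel comes out either all zeros or fully rescaled, so pooling a dropped channel still yields zeros:

import tensorflow as tf

x = tf.ones((1, 4, 4, 3))                                     # one 4x4 "image" with 3 channels
y = tf.keras.layers.SpatialDropout2D(0.5)(x, training=True)   # force the training-time mask
pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(y)

print(y[0, :, :, 0].numpy())       # channel 0 is either all 0.0 or all 2.0 (the 1/(1 - 0.5) rescaling)
print(pooled[0, :, :, 0].numpy())  # a dropped channel stays zero after pooling as well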

BatchNormalization vs Activation

Depending on the activation function, using a batch normalization before it can be a real advantage.

For a 'relu' activation, the normalization makes the model fail-safe against the bad-luck case of "all zeros freezing a relu layer". It also tends to guarantee that half of the units will be zero and the other half linear.

For a 'sigmoid' or a 'tanh', BatchNormalization will guarantee that the values are within a healthy range, avoiding saturation and vanishing gradients (values that are too far from zero hit an almost flat region of these functions, causing vanishing gradients).
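For the saturation point, a quick NumPy illustration of the sigmoid gradient before and after normalizing the pre-activations (plain standardization here, standing in for what BatchNormalization does with gamma = 1, beta = 0):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

pre_act = np.array([-8.0, -4.0, 0.0, 4.0, 8.0])           # un-normalized pre-activations
print(sigmoid(pre_act) * (1.0 - sigmoid(pre_act)))        # ~[0.0003, 0.018, 0.25, 0.018, 0.0003]

normalized = (pre_act - pre_act.mean()) / pre_act.std()   # values pulled back near zero
print(sigmoid(normalized) * (1.0 - sigmoid(normalized)))  # ~[0.16, 0.22, 0.25, 0.22, 0.16]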

There are people who say there are other advantages if you do the contrary; I'm not fully aware of those advantages, but I like the ones mentioned above very much.

Dropout vs Activation

With 'relu' there is no difference; it can be proved that the results are exactly the same.
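A quick NumPy check of the relu case (zeroing and positive rescaling commute with relu):

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=10)
mask = rng.random(10) >= 0.5
scale = 1.0 / (1.0 - 0.5)                 # inverted-dropout rescaling for rate 0.5

relu = lambda v: np.maximum(v, 0.0)

print(np.allclose(relu(x * mask * scale),       # dropout before the activation
                  relu(x) * mask * scale))      # dropout after the activation -> True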

With activations that are not zero-centered, such as 'sigmoid', putting a dropout before the activation will not result in "zeros", but in other values. For a sigmoid, the zeros a dropout produces before it come out as 0.5 after the activation.

If you add a 'tanh' after a dropout, for instance, you will have the zeros, but the scaling that dropout applies to keep the same mean will be distorted by the tanh. (I don't know whether this is a big problem, but it might be.)

MaxPooling vs Activation

I don't see much of an issue here. As long as the activation is monotonically non-decreasing (not very weird), the final result is the same.
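For a monotonically non-decreasing activation such as relu, pooling and activation commute, which a one-window check makes obvious:

import numpy as np

relu = lambda v: np.maximum(v, 0.0)
window = np.array([-0.3, 0.8, 0.1, -1.2])        # one pooling window

print(relu(window.max()), relu(window).max())    # 0.8 0.8 -> same result either way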

Conclusions?

There are possibilities, but some are troublesome. I find the following order a good one and often use it.

I would do something like this:

  • Group1
    • Conv
    • BatchNorm
    • Activation
    • MaxPooling
    • Dropout or SpatialDropout
  • Group2
    • Conv
    • ----- (there was a dropout at the end of the last group, so no BatchNorm here)
    • Activation
    • MaxPooling
    • Dropout or SpatialDropout (decide to use or not)
  • After two groups without dropout
    • you can use BatchNorm again
Unfetter answered 7/1, 2020 at 19:50 Comment(6)
These are very detailed explanations, and many are very conclusive. But what I'm still wondering is why there are recommendations to put SpatialDropout directly after the Conv layer. I have attached code above in which I actually only wanted to exchange dropout for spatial dropout and use this network for image recognition.Stamp
@CodeNow I doubt Möller's advocating for dropout after Conv, but if he is, I'll strongly disagree; your linked article is disappointing, coming from a Ph.D. in machine learning. The rest of Möller's answer is sound; I'll add that BN after relu is worth trying, as it nicely standardizes the output as the input to the next layer. Also note that convergence can take much longer with SpatialDropout as opposed to Dropout.Maleate
@Maleate oh, that is interesting that SpatialDropout can take much longer to converge. Maybe it would be a good idea to test SpatialDropout only after the first MaxPooling, because what Keras writes at keras.io/layers/core sounds like SpatialDropout should only be used after the first layers, since feature maps in the early convolutional layers are strongly correlated.Stamp
@CodeNow It's a design decision; correlated vs. decorrelated isn't clear-cut, each has ups and downs. In autoencoders, for example, SpatialDropout can be used to train sparse AEs, which were shown to transfer better for classification than denoising AEs. Though in my experiments outside AEs, with time series, SD did indeed perform worse than Dropout for earlier layers. See also this answer, with relevance to CNNs and a visual demo of SD.Maleate
Thanks for such a detailed answer. Based on the following paper though: openaccess.thecvf.com/content_CVPR_2019/papers/…, they recommend using BN throughout the network, and then Dropout after the last BN layer. Any thoughts on that?Messinger
Ok, no problem. Dropout after BN doesn't cause any harm; BN after Dropout does. You may follow what they say. I like adding more dropouts for less overfitting, but you must take care not to add them too near the BNs.Unvalued
