Correct order for SpatialDropout2D, BatchNormalization and activation function?

For a CNN architecture I want to use a SpatialDropout2D layer instead of a Dropout layer. Additionally, I want to use BatchNormalization. So far I had always placed the BatchNormalization directly after a convolutional layer but before the activation function, as mentioned in the paper by Ioffe and Szegedy. The dropout layers I had always placed after the MaxPooling2D layer.

In https://machinelearningmastery.com/how-to-reduce-overfitting-with-dropout-regularization-in-keras/ the SpatialDropout2D is placed directly after the convolutional layer.

I find it rather confusing in which order I should now apply these layers. I had also read on a Keras page that SpatialDropout should be placed directly after the Conv layer (but I can't find that page anymore).

Is the following order correct?

ConvLayer - SpatialDropout - BatchNormalization - Activation function - MaxPooling

I am really hoping for tips, and thank you in advance.

Update: My goal was actually to swap dropout for spatial dropout in the following CNN architecture:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Conv2D, BatchNormalization, Activation,
                                     MaxPooling2D, Dropout, Flatten, Dense)

model = Sequential()
model.add(Conv2D(32, (3, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(32, (3, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))

model.add(Conv2D(64, (3, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.2))

model.add(Flatten())
model.add(Dense(512))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.4))
model.add(Dense(10))
model.add(Activation('softmax'))
Stamp answered 7/1, 2020 at 19:20 Comment(0)

Dropout vs BatchNormalization - Standard deviation issue

A big problem appears when you mix these layers, especially when BatchNormalization comes right after Dropout.

Dropout tries to keep the same mean of the outputs as without dropout, but it does change the standard deviation, which will cause a huge difference in BatchNormalization between training and validation. (During training, BatchNormalization receives the changed standard deviations, accumulates them and stores them. During validation, the dropouts are turned off, so the standard deviation is no longer a changed one, but the original. But BatchNormalization, because it is in inference mode, will not use the batch statistics, but the stored statistics, which will be very different from the batch statistics.)

So, the first and most important rule is: don't place a BatchNormalization after a Dropout (or a SpatialDropout).

Usually, I try to leave at least two convolutional/dense layers without any dropout before applying a batch normalization, to avoid this.
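To see the standard-deviation shift concretely, here is a minimal NumPy sketch of inverted dropout (the scaling scheme Keras uses), showing what a following BatchNormalization would see during training versus at inference:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)          # roughly unit-variance activations
rate = 0.5

# Inverted dropout as applied at training time: zero entries, rescale survivors
mask = rng.random(x.shape) >= rate
x_train = x * mask / (1.0 - rate)

print(x.std())        # ~1.0  -> what the layer sees at inference (dropout off)
print(x_train.std())  # ~1.41 -> what BatchNormalization sees (and stores) during training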

Dropout vs BatchNormalization - Changing the zeros to another value

Also important: the role of Dropout is to "zero" the influence of some of the weights on the next layer. If you apply a normalization after the dropout, you will not have "zeros" anymore, but a certain value that is repeated for many units. And this value will vary from batch to batch. So, although noise is added, you are not killing units the way a pure dropout is supposed to.
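A minimal NumPy sketch of that effect (plain standardization of one feature across a batch, standing in for BatchNormalization with gamma = 1 and beta = 0):

import numpy as np

rng = np.random.default_rng(1)
feature = rng.normal(loc=2.0, size=6)      # one feature over a batch of 6 samples
feature[[1, 4]] = 0.0                      # dropout zeroed it for two of the samples

normalized = (feature - feature.mean()) / feature.std()

print(normalized[[1, 4]])  # the former zeros are now identical non-zero values,
                           # and that value changes with every batch's statistics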

Dropout vs MaxPooling

The problem with using a regular Dropout before MaxPooling is that you zero some pixels, and then the MaxPooling takes the maximum value, sort of ignoring part of your dropout. If the dropout happens to hit a maximum pixel, the pooling will result in the second maximum, not in zero.

So, Dropout before MaxPooling reduces the effectiveness of the dropout.
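A tiny sketch of one pooling window (ignoring the 1/(1 - rate) rescaling for simplicity):

import numpy as np

window = np.array([0.2, 0.9, 0.4, 0.7])   # one 2x2 pooling window, flattened

dropped = window.copy()
dropped[1] = 0.0                          # regular dropout happens to hit the maximum pixel

print(window.max())    # 0.9 -> without dropout
print(dropped.max())   # 0.7 -> pooling falls back to the second-largest value, not zero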

SpatialDropout vs MaxPooling

But, a SpatialDropout never hits "pixels", it only hits channels. When it hits a channel, it will zero all pixels for that channel, thus, the MaxPooling will effectively result in zero too.

So, there is no difference between spatial dropout before or after the pooling: an entire "channel" will be zero in both orders.
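A minimal TensorFlow/Keras sketch (assuming TF 2.x) of the channel-wise behavior; each channel comes out either all zeros or fully rescaled, so pooling a dropped channel still yields zeros:

import tensorflow as tf

x = tf.ones((1, 4, 4, 3))                                     # one 4x4 "image" with 3 channels
y = tf.keras.layers.SpatialDropout2D(0.5)(x, training=True)   # force the training-time mask
pooled = tf.keras.layers.MaxPooling2D(pool_size=(2, 2))(y)

print(y[0, :, :, 0].numpy())       # channel 0 is either all 0.0 or all 2.0 (the 1/(1 - 0.5) rescaling)
print(pooled[0, :, :, 0].numpy())  # a dropped channel stays zero after pooling as well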

BatchNormalization vs Activation

Depending on the activation function, using a batch normalization before it can be a real advantage.

For a 'relu' activation, the normalization makes the model fail-safe against the bad-luck case of "all zeros freezing a relu layer". It also tends to guarantee that half of the units will be zero and the other half linear.

For a 'sigmoid' or a 'tanh', BatchNormalization will guarantee that the values are within a healthy range, avoiding saturation and vanishing gradients (values that are too far from zero hit an almost flat region of these functions, causing vanishing gradients).
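For the saturation point, a quick NumPy illustration of the sigmoid gradient before and after normalizing the pre-activations (plain standardization here, standing in for what BatchNormalization does with gamma = 1, beta = 0):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

pre_act = np.array([-8.0, -4.0, 0.0, 4.0, 8.0])           # un-normalized pre-activations
print(sigmoid(pre_act) * (1.0 - sigmoid(pre_act)))        # ~[0.0003, 0.018, 0.25, 0.018, 0.0003]

normalized = (pre_act - pre_act.mean()) / pre_act.std()   # values pulled back near zero
print(sigmoid(normalized) * (1.0 - sigmoid(normalized)))  # ~[0.16, 0.22, 0.25, 0.22, 0.16]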

There are people who say there are other advantages if you do the contrary; I'm not fully aware of those advantages, but I like the ones mentioned above very much.

Dropout vs Activation

With 'relu' there is no difference; it can be proved that the results are exactly the same.
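A quick NumPy check of the relu case (zeroing and positive rescaling commute with relu):

import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=10)
mask = rng.random(10) >= 0.5
scale = 1.0 / (1.0 - 0.5)                 # inverted-dropout rescaling for rate 0.5

relu = lambda v: np.maximum(v, 0.0)

print(np.allclose(relu(x * mask * scale),       # dropout before the activation
                  relu(x) * mask * scale))      # dropout after the activation -> True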

With activations that are not zero-centered, such as 'sigmoid', putting a dropout before the activation will not result in "zeros", but in other values. For a sigmoid, the zeros a dropout produces before it come out as 0.5 after the activation.

If you add a 'tanh' after a dropout, for instance, you will have the zeros, but the scaling that dropout applies to keep the same mean will be distorted by the tanh. (I don't know whether this is a big problem, but it might be.)

MaxPooling vs Activation

I don't see much of an issue here. As long as the activation is monotonically non-decreasing (not very weird), the final result is the same.
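For a monotonically non-decreasing activation such as relu, pooling and activation commute, which a one-window check makes obvious:

import numpy as np

relu = lambda v: np.maximum(v, 0.0)
window = np.array([-0.3, 0.8, 0.1, -1.2])        # one pooling window

print(relu(window.max()), relu(window).max())    # 0.8 0.8 -> same result either way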

Conclusions?

There are possibilities, but some are troublesome. I find the following order a good one and often use it.

I would do something like this:

  • Group1
    • Conv
    • BatchNorm
    • Activation
    • MaxPooling
    • Dropout or SpatialDropout
  • Group2
    • Conv
    • ----- (there was a dropout at the end of the last group, so no BatchNorm here)
    • Activation
    • MaxPooling
    • Dropout or SpatialDropout (decide to use or not)
  • After two groups without dropout
    • you can use BatchNorm again
Unfetter answered 7/1, 2020 at 19:50 Comment(6)
These are very detailed explanations, and many are very conclusive. But what I'm still wondering is why there are recommendations to put SpatialDropout directly after the Conv layer. I have attached code above in which I actually only wanted to exchange dropout for spatial dropout and use this network for image recognition.Stamp
@CodeNow I doubt Möller's advocating for dropout after Conv, but if he is, I'll strongly disagree; your linked article is disappointing, coming from a Ph.D. in machine learning. The rest of Möller's answer is sound; I'll add that BN after relu is worth trying, as it nicely standardizes the output as the input to the next layer. Also note that convergence can take much longer with SpatialDropout as opposed to Dropout.Maleate
@Maleate oh, that is interesting that SpatialDropout can take much longer to converge. Maybe it would be a good idea to test SpatialDropout only after the first MaxPooling, because what Keras writes at keras.io/layers/core sounds like SpatialDropout should only be used after the first layers, since feature maps in the early convolutional layers are strongly correlated.Stamp
@CodeNow It's a design decision; correlated vs. decorrelated isn't clear-cut, each has ups and downs. In autoencoders, for example, SpatialDropout can be used to train sparse AEs, which were shown to transfer better for classification than denoising AEs. Though in my experiments outside AEs, with time series, SD did indeed perform worse than Dropout for earlier layers. See also this answer, with relevance to CNNs and a visual demo of SD.Maleate
Thanks for such a detailed answer. Based on the following paper though: openaccess.thecvf.com/content_CVPR_2019/papers/…, they recommend using BN throughout the network, and then Dropout after the last BN layer. Any thoughts on that?Messinger
Ok, no problem. Dropout after BN doesn't cause any harm; BN after Dropout does. You may follow what they say. I like adding more dropouts for less overfitting, but you must take care not to add them too near the BNs.Unvalued
