Ordering of batch normalization and dropout?

The original question was about TensorFlow implementations specifically. However, the answers apply to implementations in general, and this general answer is also the correct answer for TensorFlow.

When using batch normalization and dropout in TensorFlow (specifically, using contrib.layers), do I need to be worried about the ordering?

It seems possible that if I use dropout followed immediately by batch normalization there might be trouble. For example, if the shift learned by batch normalization adapts to the larger-scale numbers of the training outputs, but that same shift is then applied to the smaller-scale numbers (smaller because of the compensation for having more active outputs) during testing, when dropout is off, then that shift may be wrong. Does the TensorFlow batch normalization layer automatically compensate for this? Or does this not happen for some reason I'm missing?

Also, are there other pitfalls to look out for when using these two together? For example, assuming I'm using them in the correct order with regard to the above (assuming there is a correct order), could there be trouble with using both batch normalization and dropout on multiple successive layers? I don't immediately see a problem with that, but I might be missing something.

Thank you very much!

UPDATE:

An experimental test seems to suggest that ordering does matter. I ran the same network twice with only the batch norm and dropout reversed. When the dropout is before the batch norm, validation loss seems to be going up as training loss is going down; they both go down in the other case. But in my case the movements are slow, so things may change after more training, and it's just a single test. A more definitive and informed answer would still be appreciated.

Barthol answered 25/9, 2016 at 21:12 Comment(0)

In Ioffe and Szegedy (2015), the authors state that "we would like to ensure that for any parameter values, the network always produces activations with the desired distribution". So the batch normalization layer is actually inserted right after a Conv layer/fully connected layer, but before feeding into the ReLU (or any other kind of) activation. See this video at around the 53-minute mark for more details.

As far as dropout goes, I believe dropout is applied after the activation layer. In figure 3b of the dropout paper, the dropout factor/probability matrix r(l) for hidden layer l is applied to y(l), where y(l) is the result of applying the activation function f.
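
For concreteness, here is a minimal NumPy sketch of that formulation (the shapes, weights, and retention probability below are made up for illustration): the Bernoulli mask r(l) multiplies y(l), the post-activation output.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 100))              # a batch of inputs to hidden layer l
W = rng.normal(size=(100, 50)) * 0.1        # made-up weights for layer l
b = np.zeros(50)
p = 0.8                                     # retention probability

y_l = np.maximum(0.0, x @ W + b)            # y(l) = f(z(l)), here f = ReLU
r_l = rng.binomial(1, p, size=y_l.shape)    # r(l) ~ Bernoulli(p)
y_l_dropped = r_l * y_l                     # the mask is applied AFTER the activation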

So in summary, the order of using batch normalization and dropout is:

-> CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC ->
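
As a rough Keras sketch of one block in this order (the layer sizes, dropout rate, and input shape below are arbitrary examples, not a prescription):

from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), padding="same"),   # CONV (no activation here)
    layers.BatchNormalization(),                 # BatchNorm on the pre-activation
    layers.Activation("relu"),                   # ReLU (or other activation)
    layers.Dropout(0.25),                        # Dropout after the activation
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])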

Loblolly answered 27/10, 2016 at 23:59 Comment(8)
It seems that even Christian Szegedy now likes to perform BatchNorm after the ReLU (not before it). Quote by F. Chollet, the author of Keras: "I haven't gone back to check what they are suggesting in their original paper, but I can guarantee that recent code written by Christian applies relu before BN. It is still occasionally a topic of debate, though." (source) – Greyhen
What about pooling, would that go in between the batchnorm and the activation? – Gleich
Also, it looks like accuracy may be higher with BN after the activation: github.com/cvjena/cnn-models/issues/3 – Gleich
The video has been deleted somehow! – Parlin
This paper shows that dropout with BN usually leads to worse results unless some conditioning is done to avoid the risk of variance shift. – Devitt
CaffeNet was also reported to perform better with batch normalization after the ReLU. – Acrophobia
I read the Dropout paper again and I'm sure that in chapter 4, Model Description, the parameter y is already the output of the activation function. Good discussion; I will read these articles. – Berwick
Fig. 3 in arxiv.org/pdf/1904.03392.pdf shows BN/GN -> ReLU -> Dropout -> Conv -> BN/GN -> ReLU -> Dropout -> Conv. GN is Group Normalization, a variant of BN for small batch sizes. – Gatefold

As noted in the comments, an amazing resource to read up on the order of layers is here. I have gone through the comments, and it is the best resource on this topic I have found on the internet.

My 2 cents:

Dropout is meant to block information from certain neurons completely, to make sure the neurons do not co-adapt. So, batch normalization has to come after dropout; otherwise you are passing information through the normalization statistics.

If you think about it, in typical ML problems, this is the reason we don't compute the mean and standard deviation over the entire data and then split it into train, test and validation sets. We split first, then compute the statistics over the train set, and use them to normalize and center the validation and test datasets.
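
To make that analogy concrete, here is a small NumPy sketch (the data and the split are made up):

import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
train, test = data[:800], data[800:]     # split FIRST

mu = train.mean(axis=0)                  # statistics computed on the train set only
sigma = train.std(axis=0)

train_norm = (train - mu) / sigma        # center/normalize the train set...
test_norm = (test - mu) / sigma          # ...and reuse the SAME train statistics for test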

So I suggest Scheme 1 (this takes pseudomarvin's comment on the accepted answer into consideration):

-> CONV/FC -> ReLU (or other activation) -> Dropout -> BatchNorm -> CONV/FC

as opposed to Scheme 2

-> CONV/FC -> BatchNorm -> ReLU (or other activation) -> Dropout -> CONV/FC -> (the ordering in the accepted answer)

Please note that this means the network under Scheme 2 should show more over-fitting than the network under Scheme 1, but the OP ran some tests as mentioned in the question and they support Scheme 2.
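
For illustration, one block of Scheme 1 might look like this in Keras (sizes, rates, and the input shape are arbitrary; moving the BatchNormalization line up to just after Conv2D, before the activation, turns it into Scheme 2):

from tensorflow.keras import layers, models

scheme_1_block = models.Sequential([
    layers.Input(shape=(28, 28, 32)),
    layers.Conv2D(64, (3, 3), padding="same"),   # CONV/FC
    layers.Activation("relu"),                   # ReLU (or other activation)
    layers.Dropout(0.25),                        # Dropout
    layers.BatchNormalization(),                 # BatchNorm last, after Dropout
])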

Royster answered 5/6, 2018 at 11:19 Comment(7)
Relevant reddit discussion on BatchNorm placement: reddit.com/r/MachineLearning/comments/67gonq/… – Gemination
But wouldn't this screw up your BN statistics, since you'll be calculating them after dropout has been applied, which won't be the case at test time? – Cholent
@Cholent I guess not. We calculate BN per unit (for each internal feature), and moreover it is scaled to compensate for the dropout. – Thoreau
@Cholent is correct. See mohammed adel's answer and this paper: arxiv.org/pdf/1801.05134.pdf. In effect, the Batch Normalization layers learn to counteract a covariate shift in the data that no longer exists when Dropout is turned off at test time. – Risley
@Risley I haven't read the paper. Off the top of my head, I think if you have BN before dropout, that essentially defeats the intent of the BN layer, since the function of BN is to provide standardized data to the next layer. – Royster
@MiloMinderbinder Why not move both the dropout and the batchnorm before the ReLU? – Misshape
I had issues with the validation accuracy, but it normalised with Scheme 1. – Cartier

Usually, just drop the Dropout (when you have BN):

  • "BN eliminates the need for Dropout in some cases cause BN provides similar regularization benefits as Dropout intuitively"
  • "Architectures like ResNet, DenseNet, etc. not using Dropout

For more details, refer to this paper [Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift], as already mentioned by @Haramoz in the comments.

Cappello answered 21/12, 2018 at 7:58 Comment(7)
What about MLPs, is it useful to combine them? – Vicariate
@DINATAKLIT When you really don't have enough training data, in my opinion, YES. – Cappello
@xtulo Do you mean this works when there is a small dataset? I have read that batch normalization works better with large datasets! I'm a bit confused. – Vicariate
@DINATAKLIT In your previous comment "what about MLPs is it useful to combine them", did you mean "is it useful to combine Dropout and BN when using MLPs"? My feeling is that it mainly depends on the size of your model and the amount of training data you have. – Cappello
@xtulo Yes, I mean is it useful to combine Dropout and BN, and yes, I agree with your last answer. – Vicariate
You can still use dropout even if BN is there; it depends on the design. This is ongoing research. You can look at this paper: arxiv.org/abs/1506.02142 – Diandre
This only applies when you have enough data. Just think of an extreme case where you only have a dataset of 10 samples; BN can't save you there. – Pole

Conv - Activation - DropOut - BatchNorm - Pool --> Test_loss: 0.04261355847120285

Conv - Activation - DropOut - Pool - BatchNorm --> Test_loss: 0.050065308809280396

Conv - Activation - BatchNorm - Pool - DropOut --> Test_loss: 0.04911309853196144

Conv - Activation - BatchNorm - DropOut - Pool --> Test_loss: 0.06809622049331665

Conv - BatchNorm - Activation - DropOut - Pool --> Test_loss: 0.038886815309524536

Conv - BatchNorm - Activation - Pool - DropOut --> Test_loss: 0.04126095026731491

Conv - BatchNorm - DropOut - Activation - Pool --> Test_loss: 0.05142546817660332

Conv - DropOut - Activation - BatchNorm - Pool --> Test_loss: 0.04827788099646568

Conv - DropOut - Activation - Pool - BatchNorm --> Test_loss: 0.04722036048769951

Conv - DropOut - BatchNorm - Activation - Pool --> Test_loss: 0.03238215297460556


Trained on the MNIST dataset (20 epochs) with 2 convolutional modules (see below), each time followed by

model.add(layers.Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))

The convolutional layers have a kernel size of (3, 3) and default padding, and the activation is elu. The pooling is MaxPooling with a pool size of (2, 2). The loss is categorical_crossentropy and the optimizer is adam.

The corresponding dropout probabilities are 0.2 and 0.3, respectively, and the numbers of feature maps are 32 and 64, respectively.
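
For reference, here is a rough reconstruction of one of the configurations above (the Conv - BatchNorm - Activation - DropOut - Pool ordering), based only on the description; this is my sketch, not the author's exact code:

from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Input(shape=(28, 28, 1)))
# First convolutional module: 32 feature maps, dropout 0.2
model.add(layers.Conv2D(32, (3, 3)))
model.add(layers.BatchNormalization())
model.add(layers.Activation("elu"))
model.add(layers.Dropout(0.2))
model.add(layers.MaxPooling2D((2, 2)))
# Second convolutional module: 64 feature maps, dropout 0.3
model.add(layers.Conv2D(64, (3, 3)))
model.add(layers.BatchNormalization())
model.add(layers.Activation("elu"))
model.add(layers.Dropout(0.3))
model.add(layers.MaxPooling2D((2, 2)))
# Head, as given above
model.add(layers.Flatten())
model.add(layers.Dense(512, activation="elu"))
model.add(layers.Dense(10, activation="softmax"))
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])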

Edit: When I dropped the Dropout, as recommended in some answers, it converged faster but had worse generalization ability than when I used both BatchNorm and Dropout.

Captivate answered 12/5, 2020 at 18:53 Comment(4)
Because of the stochastic nature of NNs, it's not enough to play with just one training run. If you did around 100 training runs and took the average, the results would be more accurate. – Tabescent
This is a measure of the weight initialization as much as anything. – Donella
Please preset your random seed and run at least 10+ times; otherwise, the results of a single training run are not reliable. – Cacciatore
The real question is not so much about weight initialization (not as big a deal, typically, if there are enough iterations); instead, it is whether or not this ordering will hold true for other datasets beyond MNIST. – Czardom

I found a paper that explains the disharmony between Dropout and Batch Norm (BN). The key idea is what they call the "variance shift". This is due to the fact that dropout behaves differently between the training and testing phases, which shifts the input statistics that BN learns. The main idea can be found in this figure, which is taken from the paper.

A small demo for this effect can be found in this notebook.
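
As a rough stand-in for that demo, here is a minimal sketch of the effect (my own illustration, not the linked notebook): the same Dropout layer produces a larger output variance in training mode than in inference mode, so any BN statistics collected during training no longer match the test-time distribution.

import numpy as np
import tensorflow as tf

x = np.random.normal(size=(10000, 100)).astype("float32")
drop = tf.keras.layers.Dropout(0.5)

train_out = drop(x, training=True)    # units zeroed, survivors scaled by 1/(1 - rate)
test_out = drop(x, training=False)    # identity at inference: no masking, no rescaling

print(float(tf.math.reduce_variance(train_out)))  # about 2.0 for this input
print(float(tf.math.reduce_variance(test_out)))   # about 1.0, the raw input variance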

Papst answered 22/11, 2019 at 20:56 Comment(4)
How does this answer the question? – Classmate
The paper supplies two potential strategies: apply Dropout (only) after all BN layers, or change Dropout into a more variance-stable form. – Angelenaangeleno
@nbubis I think it answers it indirectly. It seems to suggest not using them together at all ("explains the disharmony between Dropout and Batch Norm (BN)"). – Devilkin
This is the answer to the question. Dropout changes the "standard deviation" of the distribution during training but doesn't change the distribution during validation. Batch normalization depends on the statistics of the distribution. So, if you have a dropout before a batch normalization, batch normalization will have different results during training and validation. – Stipule

I read the papers recommended in the answers and comments from https://mcmap.net/q/116038/-ordering-of-batch-normalization-and-dropout

From Ioffe and Szegedy (2015)'s point of view, use only BN in the network structure. Li et al. (2018) give statistical and experimental analyses showing that there is a variance shift when practitioners use Dropout before BN. Thus, Li et al. (2018) recommend applying Dropout after all BN layers.

From Ioffe and Szegedy (2015)'s point of view, BN is located inside/before the activation function. However, Chen et al. (2019) use an IC layer, which combines dropout and BN, and recommend using BN after ReLU.

To be on the safe side, I use only Dropout or only BN in a network.

Chen, Guangyong, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, and Shengyu Zhang. 2019. “Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks.” CoRR abs/1905.05928. http://arxiv.org/abs/1905.05928.

Ioffe, Sergey, and Christian Szegedy. 2015. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” CoRR abs/1502.03167. http://arxiv.org/abs/1502.03167.

Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang. 2018. “Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift.” CoRR abs/1801.05134. http://arxiv.org/abs/1801.05134.

Berwick answered 9/7, 2020 at 3:25 Comment(0)

Based on the research paper, for better performance we should use BN before applying Dropout.

Descombes answered 6/2, 2019 at 13:1 Comment(1)
The answer does not address the full stack asked about in the question. – Diandre

Conv/FC - BN - Sigmoid/tanh - Dropout. If the activation function is ReLU or something else, the order of normalization and dropout depends on your task.

Shad answered 23/7, 2020 at 9:53 Comment(0)
