Why it's necessary to freeze all the inner state of a Batch Normalization layer when fine-tuning

The following content comes from a Keras tutorial:

This behavior has been introduced in TensorFlow 2.0, in order to enable layer.trainable = False to produce the most commonly expected behavior in the convnet fine-tuning use case.

Why should we freeze these layers when fine-tuning a convolutional neural network? Is it because of some mechanism in TensorFlow/Keras, or because of the batch normalization algorithm itself? I ran an experiment myself and found that if trainable is not set to False, the model tends to catastrophically forget what it has learned before and returns a very large loss in the first few epochs. What's the reason for that?
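
To make the setup concrete, here is a minimal sketch of the kind of fine-tuning experiment described above (the base model, head and hyperparameters are placeholders, not the exact ones used):

```python
import tensorflow as tf

# Hypothetical pretrained base; any Keras application behaves the same way.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")

# Setup A: freeze the base. Since TF 2.0, trainable=False also puts
# BatchNormalization layers into inference mode, so their moving
# mean/variance are not updated and the pretrained statistics are used.
base.trainable = False

inputs = tf.keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)  # keep batch norm in inference mode
x = tf.keras.layers.GlobalAveragePooling2D()(x)
outputs = tf.keras.layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="sparse_categorical_crossentropy")

# Setup B (the failure case in the question): leave base.trainable = True, so
# the batch norm layers recompute batch statistics on the new data and update
# gamma/beta, which produces a very large loss in the first few epochs.
```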

Embryotomy answered 21/7, 2020 at 14:26 Comment(0)

During training, varying batch statistics act as a regularization mechanism that can improve the network's ability to generalize. This can help to minimize overfitting when training for a large number of iterations. Indeed, using a very large batch size can harm generalization, since there is less variation in the batch statistics and therefore less regularization.

When fine-tuning on a new dataset, batch statistics are likely to be very different if the fine-tuning examples have different characteristics to the examples in the original training dataset. Therefore, if batch normalization is not frozen, the network will learn new batch normalization parameters (gamma and beta in the batch normalization paper) that differ from what the other network parameters were optimised for during the original training. Relearning all the other network parameters is often undesirable during fine-tuning, either due to the required training time or the small size of the fine-tuning dataset. Freezing batch normalization avoids this issue.
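
As a hedged sketch of what "freezing batch normalization" can look like in practice (the base model here is purely illustrative): keep the convolutional weights trainable, but set trainable = False on every BatchNormalization layer so that gamma, beta and the moving statistics stay at their pretrained values.

```python
import tensorflow as tf

def freeze_batch_norm(model: tf.keras.Model) -> None:
    """Set trainable=False on every BatchNormalization layer in `model`."""
    for layer in model.layers:
        if isinstance(layer, tf.keras.layers.BatchNormalization):
            # In TF >= 2.0 this also forces the layer to run in inference
            # mode, i.e. it uses the stored moving mean/variance.
            layer.trainable = False

# Illustrative base model; any pretrained convnet with batch norm applies.
base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet")
base.trainable = True     # fine-tune the convolutional weights...
freeze_batch_norm(base)   # ...but keep gamma/beta and moving statistics fixed
```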

Andersonandert answered 21/7, 2020 at 17:46 Comment(6)
Does that mean the parameters in the batch normalization layer (beta, gamma, moving mean and moving variance) will be updated drastically in comparison to other layers, so that the original distribution is distorted and becomes unsuitable as initial values for the conv layers? If so, is it possible to pass a much smaller learning rate to the batch normalization layers and a larger learning rate to the other layers, to preserve the moving statistics so that gamma and beta are updated only in small steps? Does it still harm the performance? – Embryotomy
@Embryotomy If you use a lower learning rate for gamma and beta, they will still change, albeit more slowly, and the other network parameters will have to be relearned for their new values. Retraining batch normalization layers can improve performance; however, it is likely to require far more training/fine-tuning. It'd be more like training from a good initialization than fine-tuning. If you're fine-tuning to minimize training, it's typically best to keep batch normalization frozen. – Andersonandert
Thanks a lot! You have solved my doubts. Does this problem still occur if the data comes from the same domain (e.g. the dataset is split in two, with one half used to pre-train the model and the other for fine-tuning)? – Embryotomy
@Embryotomy If the dataset is randomly shuffled and then split for fine-tuning (which would be unusual), the batch statistics will be similar, so it would not be essential to freeze batch normalization. Nevertheless, freezing batch normalization may still improve accuracy by removing gamma and beta update noise. If the fine-tuning dataset is different from but still broadly similar to the original, the changes to gamma and beta may be relatively small. However, even small changes will require some relearning by the other parameters. – Andersonandert
That's not an approach I'll use. It's just that I think if the incompatibility is caused by different batch statistics, using data from the same distribution should fix it. Your answer has opened a new perspective for me. Thanks a lot! – Embryotomy
@Embryotomy You can transform batch statistics to another distribution. For example, by batch renormalization (arxiv.org/abs/1702.03275) or, at higher computational expense, by virtual batch normalization (invented for GANs: arxiv.org/abs/1606.03498). However, it's probably best to avoid the added complexity if you're just fine-tuning. – Andersonandert
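
For completeness, a rough sketch of the "smaller learning rate for gamma and beta" idea discussed in the comments above, using a custom training step with two optimizers (the model, loss and learning rates are placeholders; this is an illustration of the discussion, not a recommendation from the answer):

```python
import tensorflow as tf

model = tf.keras.applications.ResNet50(weights="imagenet")  # placeholder model
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()

# Split trainable variables into batch norm parameters (gamma/beta) and the rest.
bn_vars, other_vars = [], []
for layer in model.layers:
    if isinstance(layer, tf.keras.layers.BatchNormalization):
        bn_vars.extend(layer.trainable_variables)
    else:
        other_vars.extend(layer.trainable_variables)

slow_opt = tf.keras.optimizers.SGD(learning_rate=1e-5)  # gamma/beta move slowly
fast_opt = tf.keras.optimizers.SGD(learning_rate=1e-3)  # everything else

@tf.function
def train_step(images, labels):
    with tf.GradientTape() as tape:
        preds = model(images, training=True)
        loss = loss_fn(labels, preds)
    grads = tape.gradient(loss, bn_vars + other_vars)
    slow_opt.apply_gradients(zip(grads[:len(bn_vars)], bn_vars))
    fast_opt.apply_gradients(zip(grads[len(bn_vars):], other_vars))
    return loss

# Caveat: moving_mean and moving_variance are not trainable variables; they are
# updated by the forward pass (controlled by the layer's `momentum` argument),
# so a smaller learning rate does not slow them down.
```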
