Does batch normalisation work with a small batch size?

I'm using batch normalization with batch size 10 for face detection.

Does batch normalization work with such a small batch size? If not, what else can I use for normalization?

Heyday answered 2/7, 2019 at 20:38 Comment(0)

Yes, it works for smaller batch sizes; it will work even with the smallest possible size you set.

The trick is that the batch size itself also adds to the regularization effect, not only the batch norm. I will show a few plots:

(plot: bs=10)

We are on the same scale, tracking the batch loss. The left-hand side is a model without the batch norm layer (black); the right-hand side is with the batch norm layer. Note how the regularization effect is evident even for bs=10.

(plot: bs=64)

When we set bs=64, the batch-loss regularization is very evident. Note that the y scale is always [0, 4].

My examination was purely on nn.BatchNorm1d(10, affine=False), i.e. without the learnable parameters gamma and beta (w and b).

This is why it makes sense to use the BatchNorm layer even when you have a low batch size.
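
For reference, here is a rough sketch of that kind of comparison in PyTorch. The layer widths, dummy data, and training loop below are made up for illustration; they are not the original experiment.

    import torch
    import torch.nn as nn

    # Two small models, identical except that one has a BatchNorm1d layer
    # without learnable affine parameters (no gamma/beta), as described above.
    def make_model(use_bn: bool) -> nn.Sequential:
        layers = [nn.Linear(20, 10)]
        if use_bn:
            layers.append(nn.BatchNorm1d(10, affine=False))
        layers += [nn.ReLU(), nn.Linear(10, 1)]
        return nn.Sequential(*layers)

    def train(model, bs, steps=200):
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        loss_fn = nn.MSELoss()
        losses = []
        for _ in range(steps):
            x = torch.randn(bs, 20)      # dummy inputs
            y = torch.randn(bs, 1)       # dummy targets
            loss = loss_fn(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
            losses.append(loss.item())   # track the batch loss, as in the plots
        return losses

    for bs in (10, 64):
        for use_bn in (False, True):
            torch.manual_seed(0)
            losses = train(make_model(use_bn), bs)
            print(f"bs={bs:3d} batchnorm={use_bn} final batch loss {losses[-1]:.3f}")

Plotting the recorded batch losses side by side is what produces the comparison described above.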

Freytag answered 5/7, 2019 at 10:56 Comment(3)
I'm not sure this is a settled question. I know in object detection models like RetinaNet they often freeze batch norm layers because they can only operate on 1 or 2 images at a time. Here's one example, but I also recall seeing this in Facebook's Detectron repository: github.com/yhenon/pytorch-retinanet/issues/24Motte
The question was about a batch size of 10, where PyTorch batch norm usually works. For smaller batch sizes you may try layer norm, which is very popular nowadays, or you may even try running batch norm; see the sketch after these comments.Freytag
The question wasn't whether it will work; the question was whether it's optimal to use BN at all for that batch size. It has been shown that, when using BN, too small a batch size will lead to instabilities: towardsdatascience.com/…Artichoke
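
Following up on the comment above about alternatives: this is a minimal sketch of batch-size-independent normalization layers in PyTorch. The channel count and input shape are made up; LayerNorm and GroupNorm compute their statistics per sample, so they do not depend on the batch size at all.

    import torch
    import torch.nn as nn

    x = torch.randn(2, 64, 32, 32)   # a tiny batch of only 2 feature maps

    bn = nn.BatchNorm2d(64)                           # statistics computed across the batch
    gn = nn.GroupNorm(num_groups=8, num_channels=64)  # statistics computed per sample, per group
    ln = nn.GroupNorm(num_groups=1, num_channels=64)  # 1 group == layer norm over C, H, W

    for name, norm in [("batch", bn), ("group", gn), ("layer", ln)]:
        print(name, norm(x).shape)

Swapping BatchNorm2d for GroupNorm (or LayerNorm) in a conv block is the usual drop-in change when the batch size has to stay very small.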

The answer depends mainly on the depth of your neural network.

Batch normalization is useful for speeding up training when there are a lot of hidden layers. It decreases the number of epochs required to train the model and also has a regularizing effect. By standardizing the inputs to each layer, you reduce the problem of chasing a 'moving target', which makes the model's optimization easier.

My advice would be to include batch normalization layers in your code if you have a deep neural network. As a reminder, you should probably include some dropout in your layers as well.
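
As an illustration of that advice, here is a hedged sketch of a deep fully connected stack with batch norm and dropout; the layer widths and dropout rate are made up, not taken from the question.

    import torch.nn as nn

    # Illustrative only: batch norm after each hidden linear layer,
    # plus dropout for extra regularization.
    model = nn.Sequential(
        nn.Linear(256, 128),
        nn.BatchNorm1d(128),   # standardizes the inputs to the next layer
        nn.ReLU(),
        nn.Dropout(p=0.3),
        nn.Linear(128, 64),
        nn.BatchNorm1d(64),
        nn.ReLU(),
        nn.Dropout(p=0.3),
        nn.Linear(64, 1),
    )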

Fulmar answered 3/7, 2019 at 4:8 Comment(6)
The proposal network is not deep (3 to 4 layers), but the rest of the networks are deep (6 to 9 layers). For the proposal network, I feel the training process got slower when I removed the batch norms, but the result is more reasonable (fewer false positives); this is just an impression, after 13K iterations.Heyday
I know deep networks learn poorly without batch norms, but I was afraid that normalizing with a small batch size would mess up the hidden representations.Heyday
@Heyday The batch size does play a role in accuracy when using batch normalization, so I understand your concern about normalizing on small batches. This is the case with almost all ML problems involving a batch size, though, as a larger batch gives a more complete representation of your data. I would increase your batch size so it is not too small and leave the batch normalization in your code. Each epoch will take longer because of the bigger batch, but the overall accuracy should increase over fewer epochs.Fulmar
My GPU memory does not always allow a larger batch size; I'm running the code on a personal computer. Sometimes I have to run with only two samples per batch (especially with GANs), though my question concerns batch sizes around 5 to 32.Heyday
@Heyday And how big is your data set? For taxing ML methods that I have used, I have normally been able to run them on my CPU, just very, very slowly.Fulmar
11,000 images, which will be very slow on CPU.Heyday

Batch norm can become less effective with smaller batch sizes, and in some cases can become completely unstable and fail.
Think about how batch norm works after training is done: it uses the running averages accumulated during training to do the normalization, instead of statistics computed from the current batch of images. If your batch size is very small during training, the statistics of a given batch can vary wildly from the running average that will be used during inference. As the batch size increases, the batch statistics become a better approximation of the statistics of the whole training set, and the behavior during training gets closer to what you will see during inference.
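
To illustrate the point about running averages, here is a small sketch comparing the statistics of a tiny batch with the running mean that BatchNorm will use at inference. The shapes and the synthetic data are made up for illustration.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    bn = nn.BatchNorm2d(3)

    # Training mode: running_mean / running_var are updated from per-batch statistics.
    bn.train()
    for _ in range(100):
        bn(torch.randn(2, 3, 2, 2) + 5.0)   # tiny batches drawn from mean-5 data

    # The statistics of any single tiny batch fluctuate around the running
    # average that eval() / inference will actually use.
    small_batch = torch.randn(2, 3, 2, 2) + 5.0
    print("batch mean:  ", small_batch.mean(dim=(0, 2, 3)))
    print("running mean:", bn.running_mean)

The smaller the batch, the larger these fluctuations, which is exactly the train/inference mismatch described above.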

Keratoplasty answered 20/5 at 8:55 Comment(0)
