Batch normalization when batch size=1

What will happen when I use batch normalization but set batch_size = 1?

Because I am using 3D medical images as the training dataset, the batch size can only be set to 1 due to GPU memory limitations. Normally, as I understand it, when batch_size = 1 the variance will be 0, and (x - mean)/variance will lead to an error because of division by 0.

But why did no error occur when I set batch_size = 1? Why did my network train as well as I expected? Could anyone explain this?

Some people argued that:

The ZeroDivisionError may not be encountered because of two cases. First, the exception is caught in a try catch block. Second, a small rational number is added ( 1e-19 ) to the variance term so that it is never zero.

But some people disagree. They said that:

You should calculate the mean and std across all pixels in the images of the batch. (So even with batch_size = 1, there are still a lot of pixels in the batch, and the reason batch_size = 1 can still work is not the 1e-19.)

I have checked the PyTorch source code, and from the code I think the latter is right.
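
For example, a quick sanity check (a minimal sketch; I am assuming BatchNorm3d here since the inputs are 3D volumes) runs without error for batch_size = 1:

import torch
import torch.nn as nn

bn = nn.BatchNorm3d(num_features=4)     # 4 channels
x = torch.randn(1, 4, 8, 8, 8)          # batch_size = 1, but 8*8*8 voxels per channel
out = bn(x)                             # no division by zero: stats are per-channel over N, D, H, W
print(out.shape)                        # torch.Size([1, 4, 8, 8, 8])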

Does anyone have a different opinion?

Floppy answered 8/1, 2020 at 14:57 Comment(3)
The ZeroDivisionError may not be encountered because of two cases. First, the exception is caught in a try catch block. Second, a small rational number is added ( 1e-19 ) to the variance term so that it is never zero. Check the source code for this.Memling
@ShubhamPanchal Nothing is added in computing variance itself, but 1e-3 is added to variance in normalizing, mainly for regularizationJanus
Updated answer; I missed an angle to your question.Janus

variance will be 0

No, it won't; BatchNormalization computes statistics only with respect to a single axis (usually the channels axis, =-1 (last) by default); every other axis is collapsed, i.e. summed over for averaging; details below.

More importantly, however, unless you can explicitly justify it, I advise against using BatchNormalization with batch_size=1; there are strong theoretical reasons against it, and multiple publications have shown BN performance degrade for batch_size under 32, and severely for <=8. In a nutshell, batch statistics "averaged" over a single sample vary greatly sample-to-sample (high variance), and BN mechanisms don't work as intended.

Small mini-batch alternatives: Batch Renormalization -- Layer Normalization -- Weight Normalization
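
As an illustration, here is a minimal PyTorch sketch of Layer Normalization (shapes are just an example); its statistics are computed per sample, so they are independent of batch size:

import torch
import torch.nn as nn

x = torch.randn(1, 4, 8, 8, 8)                    # (N, C, D, H, W), batch of 1
ln = nn.LayerNorm(normalized_shape=x.shape[1:])   # normalize each sample over C, D, H, W
print(ln(x).shape)                                # torch.Size([1, 4, 8, 8, 8])

Note that nn.LayerNorm with a full normalized_shape assumes fixed spatial dimensions; Keras' LayerNormalization layer plays the analogous role on the TF side.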


Implementation details: from source code:

reduction_axes = list(range(len(input_shape)))  # start with every axis of the input
del reduction_axes[self.axis]                   # drop the channels axis, so stats are computed per channel

Eventually, tf.nn.moments is called with axes=reduction_axes, which reduces over those axes to compute the per-channel mean and variance. Then, in the TensorFlow backend, mean and variance are passed to tf.nn.batch_normalization to return train- or inference-normalized inputs.

In other words, if your input is (batch_size, height, width, depth, channels), or (1, height, width, depth, channels), then BN will run calculations over the 1, height, width, and depth dimensions.
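
To make the axis reduction concrete, here is a minimal sketch (assuming TensorFlow 2.x and the 5D channels-last input above) of what the layer effectively does at train time:

import tensorflow as tf

x = tf.random.normal([1, 8, 8, 8, 4])                      # (batch, height, width, depth, channels)
reduction_axes = [0, 1, 2, 3]                              # everything except the channels axis (-1)
mean, variance = tf.nn.moments(x, axes=reduction_axes)     # one mean/variance per channel
y = tf.nn.batch_normalization(x, mean, variance,
                              offset=None, scale=None,
                              variance_epsilon=1e-3)       # Keras' default epsilon
print(mean.shape, variance.shape)                          # (4,) (4,) -- never a single scalar over the batch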

Can variance ever be zero? - yes, if every single datapoint for any given channel slice (along every dimension) is the same. But this should be near-impossible for real data.


Regarding other answers: the first one is misleading:

a small rational number is added (1e-19) to the variance

This doesn't happen in computing variance, but it is added to the variance when normalizing; nonetheless, it is rarely necessary, as the variance is far from zero. Also, the epsilon term actually defaults to 1e-3 in Keras; it serves a regularizing role beyond merely avoiding zero-division.
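
A quick way to verify the default (a minimal sketch, assuming tf.keras):

import tensorflow as tf

print(tf.keras.layers.BatchNormalization().epsilon)   # 0.001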


Update: I failed to address an important piece of the intuition behind suspecting the variance to be 0; indeed, the variance of the batch statistics is zero, since there is only one statistic - but the "statistic" itself concerns the mean & variance over the channel + spatial dimensions. In other words, the variance of the mean & variance (of the single train sample) is zero, but the mean & variance themselves aren't.
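
To make this concrete, a minimal PyTorch sketch (names are my own, for illustration only):

import torch

x = torch.randn(1, 4, 8, 8, 8)                  # a single 3D training sample
per_channel_var = x.var(dim=[0, 2, 3, 4])       # the variance BN actually divides by
print(per_channel_var)                          # 4 nonzero values, close to 1 for random data
# There is only one such statistic in the "batch", so the variance *of the statistic*
# is trivially zero - but that is not the quantity BN normalizes with.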

Janus answered 11/1, 2020 at 23:47 Comment(5)
'there are strong theoretical reasons against it, and multiple publications have shown BN performance degrade for batch_size under 32, and severely for <=8. In a nutshell, batch statistics "averaged" over a single sample vary greatly sample-to-sample (high variance), and BN mechanisms don't work as intended' Can you please share some links to these articles and the evidence generally? Made me curious. Thank you in advance.Spagyric
"In other words, the variance of the mean & variance is zero, but the mean & variance themselves aren't." That's not how pretty much all practical BN implementations work. Stats are only computed once (collapsed over all dims except channels, as you rightly stated), not twice, as this wording seems to suggest (over all dims except batch, then over batch). The answer update is somewhat misleading and unwarranted, there is no need to compromise with the wrong side of the argument.Est
@Est I don't follow. The sample variance of the statistics is zero for batch_size=1, as the variance of any single number (which is what results after collapsing all other dimensions) is zero. This was a possible root of confusion.Janus
My point is that batch_size does not equate n_samples in any known BN implementation when there are other non-singular dimensions besides channels. n_samples is equal to the product of the collapsed axes' sizes. Clarifying this should be enough to clear the confusion.Est
@Est Fair, but I prefer to address the plausible "close but incorrect" intuition. Updated wording.Janus

when batch_size = 1, variance will be 0

No, because when you compute the mean and variance for BN (for example using tf.nn.moments) you compute them over axes [0, 1, 2] (assuming an NHWC tensor channel order).

From "Group Normalization" paper: https://arxiv.org/pdf/1803.08494.pdf enter image description here

With batch_size = 1, batch normalization is equal to instance normalization, and this can be helpful in some tasks.
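
A quick way to check this equivalence at train time (a minimal PyTorch sketch; shapes are just an example):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 16, 16)                       # batch_size = 1
bn = nn.BatchNorm2d(3)                              # training mode by default after construction
inorm = nn.InstanceNorm2d(3)
print(torch.allclose(bn(x), inorm(x), atol=1e-6))   # True: same per-channel spatial statistics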

But if you are using some sort of encoder-decoder and in some layer you have a tensor with a spatial size of 1x1, it will be a problem, because each channel has only one value, the mean will be equal to that value, and so BN will zero out the information.
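
For instance, a minimal Keras sketch of the 1x1 case (training=True forces batch statistics):

import tensorflow as tf

bn = tf.keras.layers.BatchNormalization()   # channels-last by default
x = tf.random.normal([1, 1, 1, 8])          # batch of 1, 1x1 spatial, 8 channels
y = bn(x, training=True)
print(y)                                    # all zeros: each channel's single value is its own mean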

Horntail answered 22/9, 2021 at 14:42 Comment(1)
This is mostly right and more terse than the most upvoted answer. The only thing I'd add is that, while at training time batchnorm with batch_size=1 equals instance norm, in the original papers (and in most default configs) IN doesn't use running stats at test time, whereas BN does.Est
