Adding batch normalization decreases the performance

I'm using PyTorch to implement a classification network for skeleton-based action recognition. The model consists of three convolutional layers and two fully connected layers. This base model gave me an accuracy of around 70% on the NTU RGB+D dataset. I wanted to learn more about batch normalization, so I added batch normalization to all the layers except the last one. To my surprise, the evaluation accuracy dropped to 60% rather than increasing, while the training accuracy increased from 80% to 90%. Can anyone say what I am doing wrong? Or is it that adding batch normalization need not increase accuracy?

[Plot: training and evaluation accuracy **without** Batch Normalization]
[Plot: training and evaluation accuracy **with** Batch Normalization]

The model with batch normalization

import torch
import torch.nn as nn


class BaseModelV0p2(nn.Module):

    def __init__(self, num_person, num_joint, num_class, num_coords):
        super().__init__()
        self.name = 'BaseModelV0p2'
        self.num_person = num_person
        self.num_joint = num_joint
        self.num_class = num_class
        self.channels = num_coords
        self.out_channel = [32, 64, 128]
        self.bn_momentum = 0.01

        self.bn_cv1 = nn.BatchNorm2d(self.out_channel[0], momentum=self.bn_momentum)
        self.conv1 = nn.Sequential(nn.Conv2d(in_channels=self.channels, out_channels=self.out_channel[0],
                                             kernel_size=3, stride=1, padding=1),
                                   self.bn_cv1,
                                   nn.ReLU(),
                                   nn.MaxPool2d(kernel_size=2, stride=2))

        self.bn_cv2 = nn.BatchNorm2d(self.out_channel[1], momentum=self.bn_momentum)
        self.conv2 = nn.Sequential(nn.Conv2d(in_channels=self.out_channel[0], out_channels=self.out_channel[1],
                                             kernel_size=3, stride=1, padding=1),
                                   self.bn_cv2,
                                   nn.ReLU(),
                                   nn.MaxPool2d(kernel_size=2, stride=2))

        self.bn_cv3 = nn.BatchNorm2d(self.out_channel[2], momentum=self.bn_momentum)
        self.conv3 = nn.Sequential(nn.Conv2d(in_channels=self.out_channel[1], out_channels=self.out_channel[2],
                                             kernel_size=3, stride=1, padding=1),
                                   self.bn_cv3,
                                   nn.ReLU(),
                                   nn.MaxPool2d(kernel_size=2, stride=2))

        self.bn_fc1 = nn.BatchNorm1d(256 * 2, momentum=self.bn_momentum)
        self.fc1 = nn.Sequential(nn.Linear(self.out_channel[2]*8*3, 256*2),
                                 self.bn_fc1,
                                 nn.ReLU(),
                                 nn.Dropout(p=0.5))  # plain Dropout: the input here is (N, C), not a feature map

        self.fc2 = nn.Sequential(nn.Linear(256*2, self.num_class))

    def forward(self, input):
        list_bn_layers = [self.bn_fc1, self.bn_cv3, self.bn_cv2, self.bn_cv1]
        # set the momentum of the batch norm layers to the given value during training and 0 during evaluation
        # ref: https://discuss.pytorch.org/t/model-eval-gives-incorrect-loss-for-model-with-batchnorm-layers/7561
        # ref: https://github.com/pytorch/pytorch/issues/4741
        for bn_layer in list_bn_layers:
            if self.training:
                bn_layer.momentum = self.bn_momentum
            else:
                bn_layer.momentum = 0

        logits = []
        for i in range(self.num_person):
            out = self.conv1(input[:, :, :, :, i])

            out = self.conv2(out)

            out = self.conv3(out)

            logits.append(out)

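        # fuse the two persons' streams with an element-wise max (assumes num_person == 2)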
        out = torch.max(logits[0], logits[1])
        out = out.view(out.size(0), -1)
        out = self.fc1(out)
        out = self.fc2(out)

        t = out

        assert not ((t != t).any())  # check for NaNs in the output tensor
        assert not (t.abs().sum() == 0)  # check for an all-zero output tensor

        return out
Borneo answered 12/8, 2019 at 8:26 Comment(2)
Were you able to test with the updated momentum of 0.1?Carillonneur
Yeah! I tried a different set of momentum values (1, 0.1, 0.01) but the results didn't change.Borneo
8

My interpretation of the phenomenon you are observing is that instead of reducing the covariate shift, which is what Batch Normalization is meant for, you are increasing it. In other words, instead of decreasing the distribution differences between train and test, you are increasing them, and that is what is causing the bigger gap in accuracy between train and test. Batch Normalization does not always guarantee better performance; for some problems it simply does not work well. I have several ideas that could lead to an improvement:

  • Increase the batch size if it is small, which would help the mean and std computed in the Batch Norm layers be more robust estimates of the population parameters.
  • Decrease the bn_momentum parameter a bit, to see if that also stabilizes the Batch Norm statistics.
  • I am not sure you should set bn_momentum to zero at test time; I think you should just call model.train() when you want to train and model.eval() when you want to use your trained model for inference (see the sketch after this list).
  • You could alternatively try Layer Normalization instead of Batch Normalization, since it does not require accumulating any statistics and usually works well.
  • Try regularizing your model a bit using dropout.
  • Make sure you shuffle your training set in every epoch. Not shuffling the data may lead to correlated batches that make the batch normalization statistics cycle, and that may hurt your generalization. I hope some of these ideas work for you!
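
To illustrate the model.train() / model.eval() point: a minimal sketch of the standard pattern, with no momentum fiddling at all. The constructor arguments are the usual NTU RGB+D values; the criterion, optimizer, and the two DataLoaders are placeholder assumptions, not the asker's actual setup.

    import torch
    import torch.nn as nn

    model = BaseModelV0p2(num_person=2, num_joint=25, num_class=60, num_coords=3)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(50):
        model.train()                    # BN uses batch stats and updates the running stats
        for x, y in train_loader:        # train_loader: assumed DataLoader with shuffle=True
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        model.eval()                     # BN switches to the accumulated running stats
        correct = total = 0
        with torch.no_grad():
            for x, y in val_loader:      # val_loader: assumed evaluation DataLoader
                correct += (model(x).argmax(dim=1) == y).sum().item()
                total += y.numel()
        print(f'epoch {epoch}: eval accuracy {correct / total:.3f}')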
Lesialesion answered 12/8, 2019 at 9:8 Comment(13)
Completely agree, at least with the model.train() and model.eval() point, as well as the lower bn_momentum! Although I feel that BN + Dropout in the same layer might lead to poorer performance, since you're essentially preventing the normalization from taking effect on all nodes, but then again, I haven't had much experience with Dropout in CNN settings.Pelisse
As long as you don't apply normalisation after dropout, it works :D. I have used it this way many times. Though I see your point, it may be worth doing some experiments :DLesialesion
Thank you @Lesialesion for your answer. I still don't get how I am increasing the distribution difference. If you don't mind, can you say more about that? It would be really helpful. Regarding the eval() and train() part: I did that based on a discussion in the PyTorch forum. They recommended editing the momentum. Link: discuss.pytorch.org/t/…Borneo
When you apply batch norm, a mean and a std are computed for the output of each neuron. These statistics are accumulated using an exponential decay function (momentum). If by any chance these statistics are not correctly calculated, for example when using a very small momentum, which would increase their variability a lot, you would end up having very noisy batches and hence potentially increase the covariate shift. Regarding changing the momentum for train and test, I can assure you that not doing it and using train and eval works; I haven't tried changing the momentum parameter.Lesialesion
But honestly, fixing the momentum to 0 at test time shifts the distribution. Just think about it: the model learned to deal with batches normalized with 99% of the historical statistics and 1% of the batch statistics (momentum=0.01). If you now change it to use 100% of the historical statistics (momentum=0), you are indeed disturbing the distribution known by the model.Lesialesion
By the way, are you shuffling the data before training? It could also break batch norm.Lesialesion
@Lesialesion Thanks! I understand it better now. But not changing the momentum didn't help; the accuracy graph remains very similar. Regarding the shuffle: yes, when creating the PyTorch DataLoader object, I'm setting the 'shuffle' argument to True. Should I not do that?Borneo
I have never used that kind of helper tool; I usually do it manually in NumPy. But it should work :DLesialesion
@Senthil Kumar If that answer managed to help you, please consider marking it as the accepted answer :D.Lesialesion
@ivallesp: "are you shuffling the data before training? It could also break batch norm". According to the original paper, it is advised to shuffle: "We enabled within-shard shuffling of the training data, which prevents the same examples from always appearing in a mini-batch together. This led to about 1% improvements in the validation accuracy"Chris
Check section 2.2 of this paper: papers.nips.cc/paper/2018/file/…Pyrometer
@Lesialesion Can you explain to me why we need to increase the momentum? The momentum in PyTorch is running = (1 - momentum) * running + momentum * present_variable, so I think we should reduce the momentum and increase the number of epochs to let it adapt to the whole dataset.Eruct
Well, it depends on the size of the dataset and the number of training steps you do. A small momentum will not work well with a small dataset, and vice versa. I have never seen big generalization problems like the one stated in the original question that are due to a large momentum, but I have with small ones. I didn't say you need to increase the momentum; I said it might be the root of the problem in the colleague's case.Lesialesion

The problem may be with your momentum. I see you are using 0.01.

[Plot: fitting curves for different momentum (beta) values]

This shows how I tried different betas when fitting to points with momentum; with beta=0.01 I got bad results. Usually beta=0.1 is used.
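
For reference, PyTorch's default BatchNorm momentum is 0.1; a minimal sketch of the change being suggested:

    import torch.nn as nn

    # PyTorch updates the running statistics as:
    #   running = (1 - momentum) * running + momentum * batch_stat
    bn_default = nn.BatchNorm2d(32)               # momentum=0.1, the default
    bn_slow = nn.BatchNorm2d(32, momentum=0.01)   # the value used in the question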

Carillonneur answered 15/8, 2019 at 18:2 Comment(0)

This almost always happens for two major reasons: 1. a non-stationary training procedure, and 2. different train/test distributions.

  • If it's possible, try other regularization techniques like dropout. I faced this problem and found that my test and train distributions might be different, so after I removed BN and used dropout instead, I got a reasonable result. Read this for more.

  • Use nn.BatchNorm2d(out_channels, track_running_stats=False); this disables the running statistics of the batches and uses the current batch's mean and variance to do the normalization.

  • In training mode, run some forward passes on your data inside a with torch.no_grad() block; this stabilizes the running_mean / running_var values (see the warm-up sketch below).

  • Use the same batch_size in your dataset for both model.train() and model.eval().

  • Increase the momentum of the BN. This means that the learned means and stds will be much more stable during training.

  • This is helpful whenever you use a pre-trained model:

       import torch.nn as nn

       # Disable running statistics for every BatchNorm2d layer so that
       # normalization always uses the current batch's statistics.
       for module in model.modules():
           if isinstance(module, nn.BatchNorm2d):
               module.track_running_stats = False
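
And a rough sketch of the warm-up idea from the list above, assuming model and train_loader already exist; in train mode, BN updates its running statistics during the forward pass even inside torch.no_grad():

    import torch

    # Run a few gradient-free forward passes in train mode so that
    # running_mean / running_var settle before switching to eval mode.
    model.train()
    with torch.no_grad():
        for i, (x, _) in enumerate(train_loader):
            model(x)
            if i == 10:  # a handful of batches is usually enough
                break
    model.eval()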
Cowpoke answered 2/3, 2021 at 8:55 Comment(0)

This looks like a clear case of overfitting.
It likely has nothing to do with batch norm, and more to do with adding something to your model that allowed it to fit the training data better, which caused overfitting.
Chances are that instead of adding BN, you could have added any other thing that let it fit the training data a little too well and you'd see the same thing.
Try using only one dense layer, or use dropout in the dense layers (a sketch follows). Most people have moved away from dense layers because of how easily they can overfit.
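
For instance, a minimal sketch of the question's classifier head with dropout between the dense layers (the sizes mirror the question's model, and 60 is the NTU RGB+D class count):

    import torch.nn as nn

    # FC head with dropout to curb overfitting; 128 * 8 * 3 matches
    # out_channel[2] * 8 * 3 in the question's model.
    head = nn.Sequential(
        nn.Linear(128 * 8 * 3, 512),
        nn.ReLU(),
        nn.Dropout(p=0.5),   # plain Dropout for (N, C) activations
        nn.Linear(512, 60),
    )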

Kimikokimitri answered 4/11 at 12:29 Comment(0)
