Correct way of normalizing and scaling the MNIST dataset

I've looked everywhere but couldn't quite find what I want. Basically the MNIST dataset has images with pixel values in the range [0, 255]. People say that in general, it is good to do the following:

  • Scale the data to the [0,1] range.
  • Normalize the data to have zero mean and unit standard deviation: (data - mean) / std.

Unfortunately, no one ever shows how to do both of these things. They all subtract a mean of 0.1307 and divide by a standard deviation of 0.3081. These values are basically the mean and the standard deviation of the dataset divided by 255:

from torchvision.datasets import MNIST

trainset = MNIST(root='./data', train=True, download=True)
print('Min Pixel Value: {} \nMax Pixel Value: {}'.format(trainset.data.min(), trainset.data.max()))
print('Mean Pixel Value {} \nPixel Values Std: {}'.format(trainset.data.float().mean(), trainset.data.float().std()))
print('Scaled Mean Pixel Value {} \nScaled Pixel Values Std: {}'.format(trainset.data.float().mean() / 255, trainset.data.float().std() / 255))

This outputs the following

Min Pixel Value: 0 
Max Pixel Value: 255
Mean Pixel Value 33.31002426147461 
Pixel Values Std: 78.56748962402344
Scaled Mean Pixel Value 0.13062754273414612 
Scaled Pixel Values Std: 0.30810779333114624

However, clearly this does neither of the above! The resulting data (1) will not be in [0, 1] and (2) will not have mean 0 or std 1. In fact, this is what we are doing:

[data - (mean / 255)] / (std / 255)

which is very different from this

[scaled_data - (mean / 255)] / (std / 255)

where scaled_data is just data / 255.
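
A quick sketch of what I mean, using the statistics computed above (the variable names here are mine):

from torchvision.datasets import MNIST

data = MNIST(root='./data', train=True, download=True).data.float()
mean, std = data.mean(), data.std()            # raw statistics: ~33.31, ~78.57

a = (data - mean / 255) / (std / 255)          # scaled stats applied to raw data
b = (data / 255 - mean / 255) / (std / 255)    # scaled stats applied to scaled data

print(a.mean(), a.std())   # ~107.7 and ~255: neither in [0, 1] nor mean 0 / std 1
print(b.mean(), b.std())   # ~0 and ~1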

Coxcombry answered 4/9, 2020 at 18:13 Comment(0)

@Euler_Salter,

I may have stumbled upon this a little too late, but hopefully I can help a little bit.

Assuming that you are using torchvision.transforms, the following code can be used to normalize the MNIST dataset.

import torch
from torchvision import datasets, transforms

train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=64, shuffle=True)

Usually, 'transforms.ToTensor()' is used to turn input data in the range [0, 255] into a 3-dimensional tensor, and it automatically scales the data to the range [0, 1]. (This is the scaling step.)

Because 'transforms.Normalize(...)' runs after 'transforms.ToTensor()', the mean and std it is given must be the statistics of the already-scaled data: 0.1307 and 0.3081, respectively. (This is the normalization step: it leaves the data with roughly zero mean and unit standard deviation.)
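
As a quick sanity check, here is a small sketch (assuming the dataset downloads to ./data) showing where those two constants come from:

from torchvision import datasets

raw = datasets.MNIST('./data', train=True, download=True).data.float()
print(raw.mean() / 255, raw.std() / 255)  # ~0.1307 and ~0.3081

# ToTensor() divides every pixel by 255, so applying
# Normalize((0.1307,), (0.3081,)) afterwards yields images with
# roughly zero mean and unit standard deviation.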

Please refer to the link below for a better explanation.

https://pytorch.org/vision/stable/transforms.html

Josephson answered 23/4, 2021 at 16:46 Comment(1)
This answer addresses the key point being asked by OP's question. – Kiaochow

I think you misunderstand one critical concept: these are two different, and inconsistent, scaling operations. You can have only one of the two:

  • mean = 0, stdev = 1
  • data range [0,1]

Think about it, considering the [0, 1] range: if the data are all small positive values, with min = 0 and max = 1, then the sum of the data must be positive, giving a positive, non-zero mean. Similarly, the stdev cannot be 1, since none of the data can possibly differ from the mean by as much as 1.0 (in fact, for data confined to [0, 1] the standard deviation can never exceed 0.5).

Conversely, if you have mean=0, then some of the data must be negative.


You use only one of the two transformations. Which one you use depends on the characteristics of your data set, and -- ultimately -- which one works better for your model.

For the [0,1] scaling, you simply divide by 255.

For the mean=0, stdev=1 scaling, you perform the simple linear transformation you already know:

new_val = (old_val - old_mean) / old_stdev
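
To make the two options concrete, here is a minimal sketch (the variable names are my own) applying each one to the raw MNIST pixel tensor:

from torchvision.datasets import MNIST

data = MNIST(root='./data', train=True, download=True).data.float()

# Option 1: scale to the [0, 1] range
scaled = data / 255.0
print(scaled.min(), scaled.max())               # 0.0 and 1.0

# Option 2: standardize to mean = 0, stdev = 1
standardized = (data - data.mean()) / data.std()
print(standardized.mean(), standardized.std())  # ~0 and ~1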

Does that clarify it for you, or have I entirely missed your point of confusion?

Verbenia answered 4/9, 2020 at 18:26 Comment(2)
It clarifies it a lot, BUT the issue is this: when values are kept in the range [0, 255] and the data is only normalized (without scaling), EVERYONE seems to be using the wrong mean and the wrong standard deviation. Basically they use the mean and the standard deviation that you'd find if you scaled the data to [0, 1]. – Coxcombry
The mean and std of the data without scaling (i.e. in the range [0, 255]) are 33.31 and 78.56. Instead, the scaled mean and std (dividing 33 and 78 by 255) are 0.1306 and 0.3081. For some reason, everyone uses these two scaled values EVEN IF they don't scale the data between [0, 1]. They are using scaled mean/std but their data is not scaled! – Coxcombry

Purpose

Two of the most important reasons for feature scaling are:

  1. You scale features to make them all of the same magnitude (i.e. the same importance or weight).

Example:

Dataset with two features: Age and Weight, with the ages in years and the weights in grams! A person in their twenties who weighs only 60 kg would translate to the vector [20, 60000], and so on for the whole dataset. The Weight attribute would dominate the training process. How much this matters depends on the type of algorithm you are using; some are more sensitive than others. For example, in a Neural Network the learning rate for Gradient Descent is affected by the magnitude of the network's weights, and those vary in correlation with the inputs (i.e. the features) during training, so feature scaling improves convergence. Another example is the K-Means Clustering algorithm, which requires features of the same magnitude since it is isotropic in all directions of space.

  2. You scale features to speed up execution time.

This is straightforward: all the matrix multiplications and parameter summations run faster with small numbers than with very large ones (or with the very large numbers produced by multiplying features by other parameters, etc.).


Types

The most popular types of Feature Scalers can be summarized as follows (a short scikit-learn sketch follows the list):

  1. StandardScaler: usually your first option; it's very commonly used. It works by standardizing the data (i.e. centering it), bringing it to std = 1 and mean = 0: z = (x - mean) / std. It gets affected by outliers, and should only be used if your data have a Gaussian-like distribution.

  2. MinMaxScaler: usually used when you want to bring all your data points into a specific range (e.g. [0, 1]): x' = (x - min) / (max - min). It gets heavily affected by outliers simply because it uses the range.

  3. RobustScaler: it's "robust" against outliers because it centers on the median and scales the data according to the quantile range: x' = (x - median) / IQR. However, you should know that outliers will still exist in the scaled data.

  4. MaxAbsScaler: mainly used for sparse data; it divides each feature by its maximum absolute value: x' = x / max(|x|).

  5. Unit Normalization: it scales the vector for each sample to have unit norm (e.g. x' = x / ||x||), independently of the distribution of the samples.
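
Here is the promised sketch comparing the scalers on toy data, using scikit-learn's implementations (the toy matrix is my own):

import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
                                   RobustScaler, MaxAbsScaler, normalize)

# Two features on wildly different scales: age in years, weight in grams
X = np.array([[20.0, 60000.0],
              [30.0, 80000.0],
              [25.0, 70000.0]])

print(StandardScaler().fit_transform(X))  # per-column mean 0, std 1
print(MinMaxScaler().fit_transform(X))    # per-column range [0, 1]
print(RobustScaler().fit_transform(X))    # centered on median, scaled by IQR
print(MaxAbsScaler().fit_transform(X))    # divided by per-column max |value|
print(normalize(X))                       # each row scaled to unit L2 norm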


Which One & How Many

You need to get to know your dataset first. As mentioned above, there are things you need to look at before choosing, such as: the distribution of the data, the existence of outliers, and the algorithm being utilized.

Anyhow, you need one scaler per dataset, unless there is a specific requirement, such as an algorithm that works only if the data are within a certain range and have zero mean and unit standard deviation, all at once. Nevertheless, I have never come across such a case.



Drupelet answered 4/9, 2020 at 23:8 Comment(1)
Maybe my question wasn't very clear then. The problem is the fact that people use the wrong mean and the wrong standard deviation (or so it looks, at least). – Coxcombry

Well, the data gets scaled to [0, 1] by torchvision.transforms.ToTensor(), and then the normalization (0.1307, 0.3081) is applied. You can read about it in the PyTorch documentation: https://pytorch.org/vision/stable/transforms.html.
Hope that answers your question.

Rigging answered 29/1, 2023 at 5:19 Comment(0)

I would like to add that these transforms are executed only when the samples are actually fetched (for example by a DataLoader instance).

The snippet the OP posted will produce the same result even if you include the transforms, because trainset.data exposes the raw, untransformed tensor.

import torch
import torchvision
import torchvision.transforms as transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # pre-calculated mean and std
])

trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)

print('Min Pixel Value: {} \nMax Pixel Value: {}'.format(trainset.data.min(), trainset.data.max()))
print('Mean Pixel Value {} \nPixel Values Std: {}'.format(trainset.data.float().mean(), trainset.data.float().std()))
print('Scaled Mean Pixel Value {} \nScaled Pixel Values Std: {}'.format(trainset.data.float().mean() / 255, trainset.data.float().std() / 255))

produces,

Min Pixel Value: 0 
Max Pixel Value: 255
Mean Pixel Value 33.31842041015625 
Pixel Values Std: 78.56748962402344
Scaled Mean Pixel Value 0.13066047430038452 
Scaled Pixel Values Std: 0.30810779333114624

but when accessed by a DataLoader, for example,

from torch.utils.data import DataLoader

train_loader = DataLoader(
    trainset,
    batch_size=64,  # any batch size works for this check
)
torch.manual_seed(5576)
for (x, y) in train_loader:
    print('Min Pixel Value: {} \nMax Pixel Value: {}'.format(x.min(), x.max()))
    print('Mean Pixel Value {} \nPixel Values Std: {}'.format(x.float().mean(), x.float().std()))
    break

produces this instead,

Min Pixel Value: -0.4242129623889923 
Max Pixel Value: 2.821486711502075
Mean Pixel Value -0.012694964185357094 
Pixel Values Std: 0.9848554730415344

which shows that the transforms are stored at dataset instantiation and executed on the fly when samples are fetched through the DataLoader.
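
The DataLoader itself is not special here: the transform runs inside the dataset's __getitem__, so plain indexing triggers it too. A quick sketch (reusing the trainset defined above):

x0, y0 = trainset[0]       # __getitem__ applies ToTensor + Normalize
print(x0.min(), x0.max())  # ~-0.4242 and ~2.8215, matching the batch above

print(trainset.data[0].min(), trainset.data[0].max())  # still raw: 0 and 255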

Desolate answered 27/6, 2023 at 17:25 Comment(0)
