Normalize data before or after split of training and testing data?

I want to separate my data into train and test set, should I apply normalization over data before or after the split? Does it make any difference while building predictive model?

Nonmaterial answered 23/3, 2018 at 7:13 Comment(0)

You first need to split the data into training and test set (validation set could be useful too).

Don't forget that testing data points represent real-world data. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and normalise the data by subtracting the mean and dividing by the variance. If you take the mean and variance of the whole dataset you'll be introducing future information into the training explanatory variables (i.e. the mean and variance).

Therefore, you should perform feature normalisation over the training data. Then perform normalisation on testing instances as well, but this time using the mean and variance of training explanatory variables. In this way, we can test and evaluate whether our model can generalize well to new, unseen data points.

For a more comprehensive read, see my article Feature Scaling and Normalisation in a nutshell.


As an example, assuming we have the following data:

>>> import numpy as np
>>> 
>>> X, y = np.arange(10).reshape((5, 2)), range(5)

where X represents our features:

>>> X
[[0 1]
 [2 3]
 [4 5]
 [6 7]
 [8 9]]

and y contains the corresponding labels:

>>> list(y)
[0, 1, 2, 3, 4]

Step 1: Create training/testing sets

>>> from sklearn.model_selection import train_test_split
>>> 
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

>>> X_train
[[4 5]
 [0 1]
 [6 7]]
>>>
>>> X_test
[[2 3]
 [8 9]]
>>>
>>> y_train
[2, 0, 3]
>>>
>>> y_test
[1, 4]

Step 2: Normalise training data

>>> from sklearn import preprocessing
>>> 
>>> normalizer = preprocessing.Normalizer()
>>> normalized_train_X = normalizer.fit_transform(X_train)
>>> normalized_train_X
array([[0.62469505, 0.78086881],
       [0.        , 1.        ],
       [0.65079137, 0.7592566 ]])

Step 3: Normalize testing data

>>> normalized_test_X = normalizer.transform(X_test)
>>> normalized_test_X
array([[0.5547002 , 0.83205029],
       [0.66436384, 0.74740932]])
Interpose answered 23/3, 2018 at 7:51 Comment(8)
I'm still puzzled about this. In the machine learning bible "Elements of Statistical Learning" it says that it is OK to perform any form of unsupervised preprocessing before splitting. The argument is that since you're not using the labels you are not biasing your estimator. Also, the basic assumption in any ML model is that the train, val, test splits are all samples from the same population. So the population mean (or variance or whatever moment) is unique, and whether we use our whole available dataset or a subset of it to estimate it will only affect how well we estimate itUndermine
stats.stackexchange.com/questions/239898/…Undermine
But then I also understand the other practical point of view, which is that in the real world we don't have access to the test set, so we shouldn't really use it even to calculate the population mean, etc.Undermine
The code in this answer doesn't do what it describes -subtract the mean or divide by the variance. Instead, it computes the length (L2 norm) of each row and divides each element in a row by the length. It's easy enough to inspect: the mean of the first column is 10/3 but the transformed data doesn't have a negative number ( 0 - 10/3 ) / (some positive number) in the second row of the first column. But we can inspect each row and see that the sum of square elements is 1. Also, the documentation says the same. scikit-learn.org/stable/modules/generated/…Jase
Finally, because L2 norm is only applied to rows individually, the train/test distinction is irrelevant. No information about the training data is used by preprocessing.Normalizer() when calling transform. It only needs the data provided to transform.Jase
Note that what this answer has to say about centering and scaling data, and train/test splits, is basically correct (although one typically divides by the standard deviation instead of the variance); preconditioning in this way can dramatically improve the speed of gradient-based optimizers. But the code provided does not center or scale the data in the way the text describes.Jase
Maybe it is a silly question but: why do we not normalize y_train, y_test?Heathendom
If we transform the training data to [0,1] using the max and min of the training data, and then use that same max and min to transform the test data, the transformed test data may fall outside [0,1]. This happens when the test data's max or min lies above or below the training data's max and min respectively.Celestina

In the specific setting of a train/test split, we need to distinguish between two transformations:

  1. transformations that change the value of an observation (row) according to information about a feature (column) and
  2. transformations that change the value of an observation according to information about that observation alone.

Two common examples of (1) are mean-centering (subtracting the mean of the feature) and scaling to unit variance (dividing by the standard deviation). Applying both together is a common transformation; in sklearn, it is implemented in sklearn.preprocessing.StandardScaler. Importantly, this is not the same as Normalizer. See below for exhaustive detail.

An example of (2) is transforming a feature by taking the logarithm, or raising each value to a power (e.g. squaring).

Transformations of the first type are best applied to the training data, with the centering and scaling values retained and applied to the test data afterwards. This is because using information about the test set to train the model may bias model comparison metrics to be overly optimistic. This can result in over-fitting & selection of a bogus model.
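
A minimal sketch of that recipe, on toy data, with StandardScaler as the type (1) transformation (fit on the training rows only, then reuse those statistics on the test rows):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = np.arange(10, dtype=float).reshape((5, 2)), list(range(5))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

scaler = StandardScaler().fit(X_train)     # mean and std come from the training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # the training statistics are reused here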

Transformations of the second type can be applied without regard to train/test splits, because the modified value of each observation depends only on the data about the observation itself, and not on any other data or observation(s).
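
A tiny check makes this concrete, using squaring as the element-wise transformation:

import numpy as np

X = np.arange(10, dtype=float).reshape((5, 2))

# squaring uses only the value itself, so the "training" rows come out the same
# whether we transform the whole array first or take the subset first
print(np.allclose(np.square(X)[:3], np.square(X[:3])))   # True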


This question has garnered some misleading answers. The rest of this answer is dedicated to showing how and why they are misleading.

The term "normalization" is ambiguous, and different authors and disciplines will use the term "normalization" in different ways. In the absence of a specific articulation of what "normalization" means, I think it's best to approach the question in the most general sense possible.

In this view, the question is not about sklearn.preprocessing.Normalizer specifically. Indeed, the Normalizer class is not mentioned in the question. For that matter, no software, programming language or library is mentioned, either. Moreover, even if the intent is to ask about Normalizer, the answers are still misleading because they mischaracterize what Normalizer does.

Even within the same library, the terminology can be inconsistent. For example, PyTorch provides both torchvision.transforms.Normalize and torch.nn.functional.normalize. One of these can be used to create output tensors with mean 0 and standard deviation 1, while the other creates outputs that have a norm of 1.
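
For example, a small sketch with a toy tensor (assuming torch and torchvision are installed) shows the two behaviours:

import torch
import torch.nn.functional as F
from torchvision import transforms

x = torch.tensor([[3.0, 4.0],
                  [6.0, 8.0]])

# torch.nn.functional.normalize rescales each row to unit L2 norm,
# much like sklearn's Normalizer
print(F.normalize(x, p=2.0, dim=1))   # both rows become [0.6, 0.8]

# torchvision.transforms.Normalize subtracts a supplied mean and divides by a
# supplied std; fed the data's own statistics, it standardizes the values
img = x.unsqueeze(0)                  # shape (1, 2, 2): one "channel"
out = transforms.Normalize(mean=[float(x.mean())], std=[float(x.std())])(img)
print(out.mean(), out.std())          # roughly 0 and 1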


What the Normalizer Class Does

The Normalizer class is an example of (2) because it rescales each observation (row) individually so that the sum-of-squares is 1 for every row. (In the corner-case that a row has sum-of-squares equal to 0, no rescaling is done.) The first sentence of the documentation for the Normalizer says

Normalize samples individually to unit norm.

This simple test code validates this understanding:

import numpy as np
from sklearn import preprocessing

X = np.arange(10).reshape((5, 2))
normalizer = preprocessing.Normalizer()
normalized_all_X = normalizer.transform(X)
sum_of_squares = np.square(normalized_all_X).sum(1)
print(np.allclose(sum_of_squares, np.ones_like(sum_of_squares)))

This prints True because every row's sum of squared elements is 1, as described in the documentation.

The Normalizer implements fit, transform and fit_transform methods even though some of these are just "pass-through" methods. This is so that there is a consistent interface across preprocessing classes, not because the method's behavior needs to distinguish between different data partitions.


Misleading Presentation 1

The Normalizer class does not subtract the column means

Another answer writes:

Don't forget that testing data points represent real-world data. Feature normalization (or data standardization) of the explanatory (or predictor) variables is a technique used to center and normalise the data by subtracting the mean and dividing by the variance.

Ok, so let's try this out. Using the code snippet from the answer, we have

X = np.arange(10).reshape((5, 2))

X_train = X[:3]
X_test = X[3:]

normalizer = preprocessing.Normalizer()
normalized_train_X = normalizer.fit_transform(X_train)
column_means_train_X = normalized_train_X.mean(0)

This is the value of column_means_train_X. It is not zero!

[0.39313175 0.87097303]

If the column means had been subtracted from the columns, then the centered column means would be 0.0. (This is simple to prove. The sum of n numbers x=[x1,x2,x3,...,xn] is S. The mean of those numbers is S / n. Then we have sum(x - S/n) = S - n * (S / n) = 0.)

We can write similar code to show that the columns have not been divided by the variance. (Neither have the columns been divided by the standard deviation, which would be the more usual choice).
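
For instance, a small check along the same lines, reusing normalized_train_X from the snippet above:

# if the columns had been scaled to unit variance, these would all be 1.0
column_stds_train_X = normalized_train_X.std(0)
print(column_stds_train_X)   # roughly [0.279 0.094] -- not 1.0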

Misleading Presentation 2

Applying the Normalizer class to the whole data set does not change the result.

If you take the mean and variance of the whole dataset you'll be introducing future information into the training explanatory variables (i.e. the mean and variance).

This claim is true as far as it goes, but it has absolutely no bearing on the Normalizer class. Indeed, Giorgos Myrianthous's chosen example is actually immune to the effect that they are describing.

If the Normalizer class did involve the means of the features, then we would expect the normalized results to change depending on which of our data are included in the training set.

For example, the sample mean is a weighted sum of every observation in the sample. If we were computing column means and subtracting them, the results of applying this to all of the data would differ from applying it to only the training data subset. But we've already demonstrated that Normalizer doesn't subtract column means.
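
For contrast, here is a quick sketch with StandardScaler (which does use column statistics), reusing X, X_train and the preprocessing import from above; the transformed training rows depend on which rows the scaler was fit on:

scaler_fit_on_train = preprocessing.StandardScaler().fit(X_train)
scaler_fit_on_all = preprocessing.StandardScaler().fit(X)

# the results differ, because the column means and standard deviations
# were estimated from different sets of rows
print(scaler_fit_on_train.transform(X_train))
print(scaler_fit_on_all.transform(X_train))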

Furthermore, these tests show that applying Normalizer to all of the data or just some of the data makes no difference for the results.

If we apply the Normalizer to the training and test subsets separately, we have

[[0.         1.        ]
 [0.5547002  0.83205029]
 [0.62469505 0.78086881]]

[[0.65079137 0.7592566 ]
 [0.66436384 0.74740932]]

And if we apply it together, we have

[[0.         1.        ]
 [0.5547002  0.83205029]
 [0.62469505 0.78086881]
 [0.65079137 0.7592566 ]
 [0.66436384 0.74740932]]

where the only difference is that we have 2 arrays in the first case, due to partitioning. Let's just double-check that the combined arrays are the same:

normalized_train_X = normalizer.fit_transform(X_train)
normalized_test_X = normalizer.transform(X_test)
normalized_all_X = normalizer.transform(X)
assert np.allclose(np.vstack((normalized_train_X, normalized_test_X)), normalized_all_X)

No exception is raised; they're numerically identical.

But sklearn's transformers are sometimes stateful, so let's make a new object just to make sure this isn't some state-related behavior.

new_normalizer = preprocessing.Normalizer()
new_normalized_all_X = new_normalizer.fit_transform(X)
assert np.allclose(np.vstack((normalized_train_X, normalized_test_X)), new_normalized_all_X)

In the second case, we still have no exception raised.

We can conclude that for the Normalizer class, it makes no difference if the data are partitioned or not.

Jase answered 15/4, 2022 at 17:57 Comment(2)
Yes, and to add, the Normalizer.fit() method is syntactic sugar to provide a uniform API for these pre-processors in Scikit-Learn – if you look at the source code, it's a pass through method.Tarton
@Tarton Yes, you are correct. I decided to focus on testing the outputs of the method, instead of reviewing the code itself, in this demonstration because the outputs are immediate & verifiable. In other words, I wanted to establish what the expected behavior is and verify it using a reproducible example because this type of result is more immediately apparent to a novice, while digging through source code might be intimidating or opaque to a novice.Jase

You can fit on the training data, then use the fitted object to transform both partitions:

from sklearn import preprocessing

normalizer = preprocessing.Normalizer().fit(xtrain)

xtrainnorm = normalizer.transform(xtrain)
xtestnorm = normalizer.transform(xtest)
Baden answered 19/10, 2018 at 12:24 Comment(5)
This approach aligns with this answer: datascience.stackexchange.com/a/54909/80221Fotheringhay
And the sklearn preprocessing docs: scikit-learn.org/stable/modules/…Fotheringhay
This question asks "Should I apply normalization over data before or after the split?" This answer doesn't address that question at all, it just provides a code snippet. What problem does this code solve? How does this code relate to the question? How does it address whether you should apply fit or transform to different partitions? The question doesn't mention python or sklearn or Normalizer. Why does this answer assume that it's about this specific class?Jase
@Jase isn't the code doing exactly that? The normalizer is created from the split data xtrain, then that normalizer is used to transform both train and test data?Neuberger
@Neuberger The question asks "should I apply normalization over data before or after the split? Does it make any difference while building predictive model?" A code snippet cannot answer the first question, because the code alone can't tell whether it is making a statistical mistake. It doesn't answer the second question either, because it does not comment on whether or not it makes any difference for a model.Jase

Ask yourself if your data will look different depending on whether you transform before or after your split. If you're doing a log2 transformation, the order doesn't matter, because each value is transformed independently of the others. If you're scaling and centering your data, the order does matter, because an outlier can drastically change the final distribution. By transforming before the split, you're allowing the test set to "spill over" and affect your training set, potentially causing overly optimistic performance measures.
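
As a quick sketch of that difference in Python (numpy only, with a made-up array containing one outlier):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])    # the last value is an outlier
train = x[:4]

# log2 is element-wise, so splitting first changes nothing
print(np.log2(x)[:4])                         # identical to...
print(np.log2(train))                         # ...transforming after the split

# centering and scaling depend on which rows supply the mean and std, so
# letting the outlier "spill over" changes the transformed training values
print((train - x.mean()) / x.std())           # statistics from all of the data
print((train - train.mean()) / train.std())   # statistics from the training rows only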

For R users, the caret package is good at handling test/train splits. You can add the argument preProcess = c("scale", "center") to the train function and it will automatically apply any transformation from the training data onto the test data.

Tl;dr - if the data looks different depending on whether you normalize before or after the split, do the normalization after the split.

Cataphyll answered 28/5, 2020 at 17:48 Comment(0)
