Difference between Standard scaler and MinMaxScaler
What is the difference between MinMaxScaler() and StandardScaler()?

mms = MinMaxScaler(feature_range=(0, 1))  (used in one machine learning model)

sc = StandardScaler()  (another machine learning model used StandardScaler instead of MinMaxScaler)

Cubby answered 9/7, 2018 at 2:42 Comment(0)

From the scikit-learn site:

StandardScaler removes the mean and scales the data to unit variance. However, the outliers have an influence when computing the empirical mean and standard deviation, which shrinks the range of the feature values. Note in particular that because the outliers on each feature have different magnitudes, the spread of the transformed data on each feature is very different: most of the data lie in the [-2, 4] range for the transformed median income feature, while the same data is squeezed into the smaller [-0.2, 0.2] range for the transformed number of households.

StandardScaler therefore cannot guarantee balanced feature scales in the presence of outliers.

MinMaxScaler rescales the data set such that all feature values are in the range [0, 1]. However, this scaling compresses all inliers into the narrow range [0, 0.005] for the transformed number of households.
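To make the contrast concrete, here is a small illustrative sketch; the five-value array (with 100 as a single outlier) is made up for this example:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

std = StandardScaler().fit_transform(X)  # mean 0, unit variance
mms = MinMaxScaler().fit_transform(X)    # everything mapped into [0, 1]

print(std.ravel())  # inliers cluster near -0.5; the outlier sits around +2
print(mms.ravel())  # inliers squeezed into [0, ~0.03]; the outlier maps to 1.0
```

The single outlier dominates both the standard deviation and the max, so with either scaler the inliers end up compressed into a narrow band, exactly as the quoted documentation describes.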

Coenosarc answered 9/7, 2018 at 2:58 Comment(0)
J
85

MinMaxScaler(feature_range = (0, 1)) will transform each value in the column proportionally within the range [0,1]. Use this as the first scaler choice to transform a feature, as it will preserve the shape of the dataset (no distortion).

StandardScaler() will transform each value in the column so that the column has mean 0 and standard deviation 1, i.e., each value is normalised by subtracting the mean and dividing by the standard deviation. Use StandardScaler if you know the data distribution is normal.

If there are outliers, use RobustScaler(). Alternatively, you could remove the outliers and use either of the above two scalers (the choice depends on whether the data is normally distributed).
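A minimal sketch of that suggestion, using the same kind of made-up outlier data: RobustScaler centres on the median and scales by the interquartile range, so a single extreme value barely moves the inliers.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # 100 is an outlier

# (x - median) / IQR: here median = 3, IQR = 4 - 2 = 2
scaled = RobustScaler().fit_transform(X)
print(scaled.ravel())  # inliers stay in [-1, 0.5]; the outlier remains far out
```

The inliers keep a sensible spread instead of being squeezed toward a point, which is why RobustScaler is the usual choice when outliers cannot be removed.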

Additional note: if the scaler is fitted before train_test_split, data leakage will happen. Fit the scaler after train_test_split, on the training set only.
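A hedged sketch of that leakage point (the data here is arbitrary): fit the scaler on the training split only, then reuse the fitted statistics to transform the test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.arange(20, dtype=float).reshape(-1, 1)  # toy feature column
y = np.zeros(20)                               # dummy target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics come from train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics: no leakage
```

Calling fit (or fit_transform) on the full dataset would let the test set's mean and variance leak into the training pipeline, inflating evaluation scores.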

Jefferey answered 14/11, 2019 at 5:47 Comment(1)
Just found a good article which explains these scalers: towardsdatascience.com/… – Jefferey

Many machine learning algorithms perform better when numerical input variables are scaled to a standard range. Scaling brings the data into a particular, comparable range.

MinMaxScaler performs what is also known as normalization: it transforms all values into the range [0, 1]. The formula is x' = (value - min) / (max - min).

StandardScaler performs standardization; for roughly normal data, most transformed values fall between -3 and +3 (the output is not bounded, though). The formula is z = (x - mean) / std_deviation.
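The two formulas above can be checked directly against scikit-learn's output; the four-value column below is made up for the check:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[2.0], [4.0], [6.0], [8.0]])

# Normalization: (value - min) / (max - min)
manual_mms = (x - x.min()) / (x.max() - x.min())
assert np.allclose(MinMaxScaler().fit_transform(x), manual_mms)

# Standardization: (x - mean) / std  (scikit-learn uses the population std, ddof=0)
manual_std = (x - x.mean()) / x.std()
assert np.allclose(StandardScaler().fit_transform(x), manual_std)
```

One detail worth knowing: StandardScaler divides by the population standard deviation (NumPy's default, ddof=0), not the sample standard deviation, so a hand computation with ddof=1 will not match exactly.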

Dolora answered 14/10, 2020 at 15:14 Comment(1)
If you could explain why they perform better when input variables are scaled, it would be interesting. – Monniemono

Before choosing between MinMaxScaler and StandardScaler, you should know the distribution of your dataset.

StandardScaler rescales a dataset to have a mean of 0 and a standard deviation of 1. Standardization is very useful when the data has varying scales and the algorithm assumes the data has a Gaussian distribution.

Normalization, or MinMaxScaler, rescales a dataset so that each value falls between 0 and 1. It is useful when the data has varying scales and the algorithm makes no assumptions about the distribution. It is a good technique when we do not know the distribution of the data, or when we know the distribution is not Gaussian.

Edelsten answered 15/8, 2022 at 15:54 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.