Difference between StandardScaler and Normalizer in sklearn.preprocessing

9

46

What is the difference between StandardScaler and Normalizer in the sklearn.preprocessing module? Don't both do the same thing, i.e., remove the mean and scale by the standard deviation?

Legitimatize answered 24/8, 2016 at 10:36 Comment(0)

51

From the Normalizer docs:

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one.

And from the StandardScaler docs:

Standardize features by removing the mean and scaling to unit variance

In other words, Normalizer acts row-wise and StandardScaler column-wise. Normalizer does not remove the mean and scale by the standard deviation; instead, it scales each row to unit norm.
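A minimal sketch of the row-wise versus column-wise behaviour, using a small made-up matrix (the numbers are only for illustration). After StandardScaler each column should have mean 0 and standard deviation 1; after Normalizer each row should have an l2 norm of 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Illustrative data (made up): 3 samples (rows) x 2 features (columns)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

X_std = StandardScaler().fit_transform(X)   # statistics computed per column
X_norm = Normalizer().fit_transform(X)      # l2 norm computed per row

col_means = X_std.mean(axis=0)              # per-column mean after scaling
col_stds = X_std.std(axis=0)                # per-column standard deviation after scaling
row_norms = np.linalg.norm(X_norm, axis=1)  # per-row l2 norm after normalizing

print(col_means, col_stds, row_norms)
```

Note that the rows of `X_std` do not have unit norm and the columns of `X_norm` do not have zero mean, so the two transforms are not interchangeable.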

Adamantine answered 24/8, 2016 at 15:12 Comment(1)

I am very new to data science. I read somewhere (not sure if it is correct) that if you have more columns than rows you should use StandardScaler, otherwise Normalizer. – Symmetry 1/11, 2018 at 12:29

21

This visualization and article by Ben help a lot in illustrating the idea.

StandardScaler works best when the data within each feature is roughly normally distributed. After "removing the mean and scaling to unit variance", the picture shows that the features all end up on the same scale, regardless of their original scales.

Supersession answered 3/11, 2017 at 4:47 Comment(1)

In order to apply StandardScaler to a feature it doesn't need to be normally distributed. – Lobeline 8/3 at 22:58

9

In addition to the excellent suggestion by @vincentlcy to view this article, there is now an example in the scikit-learn documentation here. An important difference is that Normalizer() is applied to each sample (i.e., row) rather than to each column. It is therefore appropriate only for certain datasets that fit its assumption of similar types of data in each column.

Goatish answered 7/5, 2019 at 19:52 Comment(0)

7

StandardScaler() standardizes features (such as the features of person data, e.g. height and weight) by removing the mean and scaling to unit variance.

(Unit variance means that each scaled feature has a variance, and therefore also a standard deviation, of 1.)
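As a sketch of what "removing the mean and scaling to unit variance" amounts to, the following compares a manual calculation with StandardScaler. The height/weight values are hypothetical, and the manual version assumes the population standard deviation (`ddof=0`), which is what the scikit-learn docs describe StandardScaler as using.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical person data: columns are height (cm) and weight (kg)
people = np.array([[170.0, 65.0],
                   [180.0, 80.0],
                   [160.0, 50.0],
                   [175.0, 72.0]])

# Manual standardization: subtract each column's mean, then divide by
# that column's population standard deviation (np.std defaults to ddof=0)
manual = (people - people.mean(axis=0)) / people.std(axis=0)

scaled = StandardScaler().fit_transform(people)

# Variance of each scaled column
variances = scaled.var(axis=0)
print(variances)
```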

Normalizer() rescales each sample. An example is rescaling each company's stock prices independently of the other companies.

Some stocks are more expensive than others. To account for this, we normalize them. The Normalizer will separately transform each company's stock prices to a relative scale.
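A small sketch of the stock idea, with made-up prices and assuming one row per company and one column per day. The two hypothetical stocks differ in price by a factor of ten but move identically in relative terms, so they should map to the same normalized row.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Hypothetical daily prices: one row per company, one column per day
prices = np.array([[100.0, 110.0, 120.0],   # expensive stock
                   [10.0, 11.0, 12.0]])     # cheap stock, same relative movement

# Each row is divided by its own l2 norm, independently of the other row
relative = Normalizer().fit_transform(prices)
print(relative)
```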

Muoimuon answered 3/12, 2018 at 20:56 Comment(0)

5

The main difference is that StandardScaler is applied to columns, while Normalizer is applied to rows, so make sure your data is oriented correctly (reshape or transpose it if needed) before normalizing it.

Zel answered 11/1, 2020 at 10:23 Comment(1)

I thought the normalizer has an axis parameter, and could therefore be applied to rows or columns... – Dactylo 4/4, 2020 at 1:4
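On the axis question: as far as I know, it is the function `sklearn.preprocessing.normalize` that takes an `axis` argument, while the `Normalizer` class always works on rows. A rough sketch with a made-up matrix, where transposing is used as the class-based way to normalize columns:

```python
import numpy as np
from sklearn.preprocessing import Normalizer, normalize

# Made-up data for illustration
X = np.array([[3.0, 1.0],
              [4.0, 1.0]])

by_row = normalize(X, axis=1)   # default: each row scaled to unit l2 norm
by_col = normalize(X, axis=0)   # each column scaled to unit l2 norm

# The Normalizer class matches the row-wise result ...
same_as_class = Normalizer().fit_transform(X)

# ... and column-wise behaviour can be had by transposing before and after
via_transpose = Normalizer().fit_transform(X.T).T
```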

4

Perhaps a helpful example:

With Normalizer, the default operation is to divide each value in a row by the $\ell_2$ norm of that row.

For example, given a row [4,1,2,2], the $\ell_2$ norm is: $\sqrt{4^2 + 1^2 + 2^2 + 2^2} = \sqrt{25} = 5$ .

The normalized row is then:

[4/5, 1/5, 2/5, 2/5]= [0.8, 0.2, 0.4, 0.4]

This is the first row of the example from the SKLearn docs.
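The same row can also be run through the other norms that Normalizer accepts ("l1" and "max"). This is just a sketch to show how the divisor changes; the variable names are mine.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Normalizer expects a 2D array, so the single row is wrapped in an outer list
row = np.array([[4.0, 1.0, 2.0, 2.0]])

l2 = Normalizer(norm="l2").fit_transform(row)    # divide by sqrt(25) = 5
l1 = Normalizer(norm="l1").fit_transform(row)    # divide by |4|+|1|+|2|+|2| = 9
mx = Normalizer(norm="max").fit_transform(row)   # divide by the largest absolute value, 4

print(l2, l1, mx)
```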

Araujo answered 1/3, 2021 at 0:42 Comment(0)

2

StandardScaler standardizes features by removing the mean and scaling to unit variance, whereas Normalizer rescales each sample.

Revocable answered 13/9, 2017 at 18:34 Comment(0)

2

Building off of the answer from @TerrenceJ, here is the code to manually calculate the Normalizer-transformed result for the first example in the SKLearn documentation (note that this reflects the default "l2" normalization).

import numpy as np

# create the original example
X = np.array([[4, 1, 2, 2],
              [1, 3, 9, 3],
              [5, 7, 5, 1]])



# Manual Method:

# get the square root of the sum of squares (the l2 norm) for each record ("row")
div = np.sqrt(np.sum(np.power(X, 2), axis=1, keepdims=True))

# divide each value by its record's respective square root of the sum of squares
X / div

# array([[0.8, 0.2, 0.4, 0.4],
#        [0.1, 0.3, 0.9, 0.3],
#        [0.5, 0.7, 0.5, 0.1]])



# SKLearn API Method:

from sklearn.preprocessing import Normalizer
Normalizer().fit_transform(X)

# array([[0.8, 0.2, 0.4, 0.4],
#        [0.1, 0.3, 0.9, 0.3],
#        [0.5, 0.7, 0.5, 0.1]])

Routine answered 28/5, 2021 at 13:37 Comment(0)

0

If you are using the "predict" method of a support vector classifier, for example, to predict a category for a single row of data, then you would use normalize and not StandardScaler, which needs columns of data (multiple rows) to compute its statistics, not a single row. For example, I created a support vector classifier to predict recessions, and I wanted to input a single row of values for inflation, unemployment and GNP. Since I'm inputting a single row of three values, I cannot fit a StandardScaler on it. I could only do that if I had multiple rows (i.e. a column of values for inflation, a column of values for unemployment and a column of values for GNP). See the program snippet below:

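import numpy as np
from sklearn import preprocessing

# a single row of values: inflation, unemployment, GNP
data = np.array([6, 4, 25460])
# normalize expects a 2D array, so the row is wrapped in a list
norm_data = preprocessing.normalize([data])

# OurSVM is the support vector classifier that was trained earlier
y_pred = OurSVM.predict(norm_data)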
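For comparison, a StandardScaler that has already been fitted on multi-row training data can still transform a single new row, because it reuses the column means and standard deviations learned during `fit`. The sketch below uses made-up training values and hypothetical variable names; it is not the data from the recession model above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up training data: columns are inflation, unemployment, GNP
X_train = np.array([[2.0, 5.0, 20000.0],
                    [4.0, 6.0, 22000.0],
                    [6.0, 4.0, 25460.0],
                    [3.0, 7.0, 21000.0]])

# fit learns one mean and one standard deviation per column
scaler = StandardScaler().fit(X_train)

# one new sample, shaped (1, 3) because transform expects a 2D array
new_row = np.array([[6.0, 4.0, 25460.0]])

# transform reuses the training statistics, so a single row is fine here
new_row_scaled = scaler.transform(new_row)
print(new_row_scaled)
```

Whichever transform is chosen, the row passed to `predict` should go through the same preprocessing that was applied to the data the classifier was trained on.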

Liquidate answered 4/5, 2023 at 4:11 Comment(0)
