What is the difference between standardscaler and normalizer in sklearn.preprocessing module? Don't both do the same thing? i.e remove mean and scale using deviation?
From the Normalizer docs:
Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1 or l2) equals one.
And StandardScaler
Standardize features by removing the mean and scaling to unit variance
In other words Normalizer acts row-wise and StandardScaler column-wise. Normalizer does not remove the mean and scale by deviation but scales the whole row to unit norm.
This visualization and article by Ben helps a lot in illustrating the idea.
The StandardScaler assumes your data is normally distributed within each feature. By "removing the mean and scaling to unit variance", you can see in the picture now they have the same "scale" regardless of its original one.
In addition to the excellent suggestion by @vincentlcy to view this article, there is now an example in the Scikit-Learn documentation here. An important difference is that Normalizer()
is applied to each sample (i.e., row) rather than column. This may work only for certain datasets that fit its assumption of similar types of data in each column.
StandardScaler() standardizes features (such as the features of the person data i.e height, weight)by removing the mean and scaling to unit variance.
(unit variance: Unit variance means that the standard deviation of a sample as well as the variance will tend towards 1 as the sample size tends towards infinity.)
Normalizer() rescales each sample. For example rescaling each company's stock price independently of the other.
Some stocks are more expensive than others. To account for this, we normalize it. The Normalizer will separately transform each company's stock price to a relative scale.
The main difference is that Standard Scalar is applied on Columns, while Normalizer is applied on rows, So make sure you reshape your data before normalizing it.
Perhaps a helpful example:
With Normalizer, it seems that the default operation is to divide each data point in a row, by the norm of the row.
For example, given a row [4,1,2,2]
, the norm is: .
The normalized row is then:
[4/5, 1/5, 2/5, 2/5]= [0.8, 0.2, 0.4, 0.4]
This is the first row of the example from the SKLearn docs.
StandardScaler standardizes features by removing the mean and scaling to unit variance, Normalizer rescales each sample.
Building off of the answer from @TerrenceJ, here is the code to manually calculate the Normalizer-transformed result from the example in the first SKLearn documentation (and note that this reflects the default "l2" normalization).
# create the original example
X = [[4, 1, 2, 2],
[1, 3, 9, 3],
[5, 7, 5, 1]]
# Manual Method:
# get the square root of the sum of squares for each record ("row")
import numpy as np
div = [np.sqrt(np.sum(np.power(X[i], 2))) for i in range(len(X))]
# divide each value by its record's respective square root of the sum of squares
np.array([X[k] / div[k] for k in range(len(X))])
# array([[0.8, 0.2, 0.4, 0.4],
# [0.1, 0.3, 0.9, 0.3],
# [0.5, 0.7, 0.5, 0.1]])
# SKLearn API Method:
from sklearn.preprocessing import Normalizer
Normalizer().fit_transform(X)
# array([[0.8, 0.2, 0.4, 0.4],
# [0.1, 0.3, 0.9, 0.3],
# [0.5, 0.7, 0.5, 0.1]])
If you are using the "predict" method for a support vector classifier, for example, to predict a category for a single row of data, then you would use normalize and not standard scaler (which requires columns of data, not a single row). For example I created a support vector classifier to predict recessions. And I wanted to input a single row of values for inflation, unemployment and GNP. Since I'm inputting a single row of three values, I cannot use standard scaler. I could only use it if I had multiple rows (i.e. a column of values for inflation, a column of values for unemployment and a column of values for GNP. See the program snippet below:
from sklearn import preprocessing
data = np.array([6, 4, 25460])
norm_data = preprocessing.normalize([data])
y_pred = OurSVM.predict(norm_data)
© 2022 - 2024 — McMap. All rights reserved.