Why scale across rows not columns for standardizing (preprocessing) of Data before clustering
Asked Answered
I am very confused and could not find a convincing answer on the internet to the following question about data preprocessing for clustering.

According to the scikit-learn documentation, when we do preprocessing using the library's built-in scaling function, given data formulated as an N x D matrix where rows are the samples and columns are the features, we make the mean taken across the rows zero and, at the same time, the standard deviation across the rows one, like the following:

X_scaled.mean(axis=0)
array([ 0.,  0.,  0.])

X_scaled.std(axis=0)
array([ 1.,  1.,  1.])
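To make the axis convention concrete, here is a minimal numpy sketch (the data values are invented) that reproduces output like the above; `sklearn.preprocessing.scale` and `StandardScaler` do the same per-column standardization by default:

```python
import numpy as np

# Toy data: 3 samples (rows) x 3 features (columns).
X = np.array([[1., -1.,  2.],
              [2.,  0.,  0.],
              [0.,  1., -1.]])

# Standardize each COLUMN (feature): subtract that column's mean and
# divide by that column's std. axis=0 means "compute down the rows,
# one result per column", so this is per-feature, not per-sample.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # approximately [0. 0. 0.]
print(X_scaled.std(axis=0))   # [1. 1. 1.]
```

Note that although `axis=0` reads as "across the rows", the result is one mean and one std per column, i.e. per feature.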

My question is: shouldn't we make the mean across each column (the features, not the samples) zero, and likewise the standard deviation, since we are trying to standardize the features, not the samples? Websites and other resources always standardize across rows, but they never explain why.

Toxemia answered 25/6, 2018 at 22:30 Comment(4)
I would expect that you'd want to normalize the values for a given feature, across the samples. If you normalize a given sample's data across its features, you've tossed out a lot of information. That would be for comparing features, rather than for comparing samples for a feature. - Muoimuon
But then it is a little misleading when people say "features have different ranges, so let's just scale them"; it sounds like what they mean is to scale the features for a given sample. For example, suppose we have weight, height, and age for Mr. A: 65 kg, 180 cm, and 20 years old. These are on different ranges, so I thought we make these features zero mean and unit variance. Can you elaborate on this, or, if you prefer, write your comment as an answer? - Toxemia
Besides @JeffLearman's answer, I may add that normalizing across rows (as the OP originally thought) is also a thing, mostly in image processing, where it brings all samples (images) to a standard range of pixel values, and it has promising results there. - Ludivinaludlew
You may also want to center each row in the scenario where each row (say, a respondent of a survey) has its own central tendency (e.g., some respondents chronically respond with higher numbers on a scale of 1 - 10 relative to others). - Impercipient

I would expect that you'd want to normalize the values for a given feature, across the samples. If you normalize a given sample's data across its features, you've tossed out a lot of information. That would be for comparing features (which rarely makes sense), rather than for comparing samples for a feature.

I don't know numpy or sklearn well, so take this with a grain of salt, but when normalizing, you want to normalize all the data for a given feature using the same parameters, so that the feature ends up with zero mean and unit variance (most values then fall roughly within -1 to +1). You'd do this separately for each feature, so they'll all end up on the same scale, with each feature's mean at zero.

Consider an example: suppose you normalized across all the features for a given sample.

        height weight age
person1 180    65     50
person2 140    45     50

If we normalize the values for person1 across the features, then do the same for person2, then person2 will seem to have a different age than person1!

If we normalize across the samples for a given column, then the relationships will still hold. Their ages will match; person1 will be taller, and person2 will weigh less. But all values for all features will fit within the distribution rules necessary for subsequent analysis.
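The contrast above can be sketched in numpy; the guard for the constant age column is my own addition (both ages are 50, so the per-feature std is zero, and sklearn's scalers handle such zero-variance features similarly):

```python
import numpy as np

#                 height  weight  age
data = np.array([[180.,   65.,    50.],   # person1
                 [140.,   45.,    50.]])  # person2

# Column-wise (per feature), as the answer recommends:
std = data.std(axis=0)
std[std == 0] = 1.0  # guard: age is constant, avoid dividing 0 by 0
per_feature = (data - data.mean(axis=0)) / std

# Row-wise (per sample), as the question proposed:
per_sample = (data - data.mean(axis=1, keepdims=True)) \
             / data.std(axis=1, keepdims=True)

# Per feature, the equal ages stay equal; per sample, they no longer match.
print(per_feature[0, 2] == per_feature[1, 2])  # True
print(per_sample[0, 2] == per_sample[1, 2])    # False
```

The row-wise version mixes height, weight, and age into each person's own mean and std, which is why the two identical ages map to different scaled values.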

Muoimuon answered 25/6, 2018 at 23:18 Comment(0)

There is a place for normalizing your samples. One example is when your features are counts. In this case, normalizing each sample to unit l1-norm effectively changes each feature to a percentage of the total count for that sample.

Sklearn's Normalizer is made for sample normalization and can normalize to l1 or l2 norm.

http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Normalizer.html
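A minimal sketch of l1 sample normalization using plain numpy (the count values are invented; sklearn's `Normalizer(norm='l1')` performs the equivalent row-wise division):

```python
import numpy as np

# Invented word counts: 2 documents (rows) x 3 words (columns).
counts = np.array([[ 2.,  3.,  5.],
                   [10., 30., 60.]])

# Row-wise l1 normalization: divide each sample by its total count,
# turning raw counts into per-sample proportions that sum to 1.
proportions = counts / np.abs(counts).sum(axis=1, keepdims=True)

print(proportions)
# [[0.2 0.3 0.5]
#  [0.1 0.3 0.6]]
```

After this, two documents of very different lengths become directly comparable as distributions over words.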

Passenger answered 26/6, 2018 at 15:12 Comment(1)
Good point. I couldn't think of a good example where one would normalize this way, but you came up with one. - Muoimuon

© 2022 - 2024 — McMap. All rights reserved.