Standardization before or after categorical encoding?

I'm working on a regression algorithm, in this case k-Nearest Neighbors, to predict the price of a product.

So I have a training set with only one categorical feature, which has 4 possible values. I've dealt with it using a one-of-k (one-hot) categorical encoding scheme, which means I now have 3 more columns in my Pandas DataFrame with a 0/1 depending on the value present.

The other features in the DataFrame are mostly distances, like latitude/longitude for locations, and prices, all of them numerical.

Should I standardize (to zero mean and unit variance, as for a Gaussian distribution) and normalize before or after the categorical encoding?

I'm thinking it might be beneficial to normalize after encoding, so that every feature is as important to the estimator as any other when measuring distances between neighbors, but I'm not really sure.
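
To illustrate what I mean, a minimal sketch of the encode-then-scale order with pandas and scikit-learn; the column names and data here are made up:

    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsRegressor

    # Made-up toy data: one categorical feature plus numeric ones.
    df = pd.DataFrame({
        "category": ["a", "b", "c", "a", "d"],
        "distance": [1.2, 3.4, 0.5, 2.2, 4.1],
        "price": [10.0, 25.0, 7.5, 18.0, 30.0],
    })

    # 1) Encode first: one-hot the categorical column.
    X = pd.get_dummies(df.drop(columns="price"), columns=["category"])
    y = df["price"]

    # 2) Then scale, so no single feature dominates the k-NN distance metric.
    X_scaled = StandardScaler().fit_transform(X)

    knn = KNeighborsRegressor(n_neighbors=3).fit(X_scaled, y)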

Federative answered 13/11, 2017 at 19:27 Comment(2)
You should try both and see what works well for you, given your choice of algorithm. – Iglesias
I'm voting to close this question as off-topic because it is about machine learning rather than software development. You can ask these questions on Cross Validated or DataScience.SE. – Brassiere

Seems like an open problem, so I'd like to answer even though it's late. I am also unsure how much the similarity between the vectors would be affected, but in my practical experience you should first encode your features and then scale them. I have tried the opposite with scikit-learn's preprocessing.StandardScaler(), and it doesn't work if your feature vectors do not have the same length: scaler.fit(X_train) raises ValueError: setting an array element with a sequence.

I can see from your description that your data have a fixed number of features, but for generalization purposes (maybe you will have new features in the future?) it's good to assume that each data instance may have a unique feature-vector length. For instance, I transform my text documents into word indices with Keras's text_to_word_sequence (this gives me vectors of different lengths), then I convert them to one-hot vectors, and then I standardize them. I have actually not seen a big improvement from the standardization.

I think you should also reconsider which of your features to standardize, as the dummies might not need it; here it doesn't seem like the categorical attributes need any standardization or normalization. k-nearest neighbors is distance-based, so it can be affected by these preprocessing techniques. I would suggest trying either standardization or normalization and checking how different models react to your dataset and task.
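
If you keep the dummies unscaled, one way to standardize only the numeric columns is scikit-learn's ColumnTransformer; a minimal sketch with made-up column names:

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.neighbors import KNeighborsRegressor

    # Made-up column names; adapt to your own DataFrame.
    numeric_cols = ["distance", "latitude", "longitude"]
    categorical_cols = ["category"]

    preprocess = ColumnTransformer([
        # Standardize only the numeric features...
        ("num", StandardScaler(), numeric_cols),
        # ...and one-hot encode the categorical one, leaving the 0/1 dummies unscaled.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ])

    model = Pipeline([
        ("preprocess", preprocess),
        ("knn", KNeighborsRegressor(n_neighbors=5)),
    ])
    # model.fit(X_train, y_train)  # X_train: a DataFrame with the columns above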

Janeenjanek answered 29/10, 2018 at 12:50 Comment(0)

After. Just imagine that your column contained strings rather than numerical values. You can't standardize strings, right? :)

But given what you wrote about the categories: if they are represented by values, I suppose there is some kind of ranking inside. In that case you could probably use the raw column rather than a one-hot encoding. Just thoughts.
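
A minimal sketch of that idea with scikit-learn's OrdinalEncoder; the ordered labels here are made up:

    from sklearn.preprocessing import OrdinalEncoder

    # Made-up ordered labels; replace with your actual 4 category values.
    encoder = OrdinalEncoder(categories=[["low", "medium", "high", "premium"]])
    encoded = encoder.fit_transform([["medium"], ["premium"], ["low"]])
    # encoded -> [[1.], [3.], [0.]]  (a single ranked column instead of dummies)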

Dustan answered 15/11, 2017 at 17:17 Comment(0)

You generally want to standardize all your features, so it would be done after the encoding (that is, assuming you want to standardize in the first place, since some machine learning algorithms, such as tree-based models, do not need standardized features to work well).

Mush answered 13/11, 2017 at 20:56 Comment(0)

So the voting here is split 50/50 on whether to standardize the data or not. Given the positive effects, however small the improvement gains, and the absence of adverse effects, I would suggest standardizing before splitting the data and training the estimator.
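
Echoing the comment above to try both, a minimal sketch comparing k-NN with and without standardization under cross-validation, on synthetic data:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsRegressor

    # Synthetic regression data with mixed feature scales, for illustration only.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5)) * [1, 10, 100, 1, 1]
    y = X @ rng.normal(size=5) + rng.normal(scale=0.1, size=100)

    plain = KNeighborsRegressor()
    scaled = make_pipeline(StandardScaler(), KNeighborsRegressor())

    # Keeping the scaler inside the pipeline re-fits it on each training fold,
    # so the comparison does not leak test-fold statistics.
    print(cross_val_score(plain, X, y, cv=5).mean())
    print(cross_val_score(scaled, X, y, cv=5).mean())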

Castano answered 15/9, 2020 at 16:22 Comment(0)
