Do I have to do one-hot-encoding separately for train and test dataset? [closed]
I'm working on a classification problem and I've split my data into train and test set.

I have a few categorical columns (around 4-6) and I am thinking of using pd.get_dummies to convert my categorical values to a one-hot encoding.

My question is: do I have to do the one-hot encoding separately for the train and test splits? If that's the case, I guess I'd better use sklearn's OneHotEncoder, because it supports fit and transform methods, right?

Warnke answered 4/4, 2019 at 21:29 Comment(0)

Generally, you want to treat the test set as though you did not have it during training. Whatever transformations you do to the train set should be done to the test set before you make predictions. So yes, you should do the transformation separately, but know that you are applying the same transformation.

For example, if the test set is missing one of the categories, there should still be a dummy variable for the missing category (which would be found in the training set), since the model you train will still expect that dummy variable. If the test set has an extra category, this should probably be handled with some "other" category.

Similarly, when scaling continuous variables, say to [0, 1], you use the range of the train set when scaling the test set. This can mean that a newly scaled test variable falls outside of [0, 1].
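To illustrate that last point, here is a minimal sketch (with made-up numbers) of a scaler fitted on the train set only, producing test values outside [0, 1]:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: the test set extends beyond the train set's range
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[5.0], [25.0], [40.0]])

scaler = MinMaxScaler()
scaler.fit(train)  # learns min=10, max=30 from the train set only

print(scaler.transform(train).ravel())  # [0.   0.5  1. ]
print(scaler.transform(test).ravel())   # [-0.25  0.75  1.5 ] -- outside [0, 1]
```

The out-of-range values are expected and usually harmless; refitting the scaler on the test set would silently change what the feature values mean.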


For completeness, here's how the one-hot encoding might look:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

### Correct
train = pd.DataFrame(['A', 'B', 'A', 'C'])
test = pd.DataFrame(['B', 'A', 'D'])

# fit on the training data only; 'ignore' zeroes out categories unseen during fit
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)

enc.transform(train).toarray()
#array([[1., 0., 0.],
#       [0., 1., 0.],
#       [1., 0., 0.],
#       [0., 0., 1.]])

enc.transform(test).toarray()
#array([[0., 1., 0.],
#       [1., 0., 0.],
#       [0., 0., 0.]])


### Incorrect
# fitting on the combined data leaks the test-only category D into the encoder
full = pd.concat((train, test))

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(full)

enc.transform(train).toarray()
#array([[1., 0., 0., 0.],
#       [0., 1., 0., 0.],
#       [1., 0., 0., 0.],
#       [0., 0., 1., 0.]])

enc.transform(test).toarray()
#array([[0., 1., 0., 0.],
#       [1., 0., 0., 0.],
#       [0., 0., 0., 1.]])

Notice that for the incorrect approach there is an extra column for D (which only shows up in the test set). During training, we wouldn't know about D at all so there shouldn't be a column for it.

Pierian answered 4/4, 2019 at 22:27 Comment(8)
Thank you. Also, what if there's a new category in the test set that's not in the train set? Will it be ignored?Warnke
I think there are a couple of options in that case. You might already have an "other" category from your train set. This would be something like a combination of lower-frequency categories which on their own aren't enough information for you to properly train a model (think of a variable called color with many uncommon colors). If you don't have this "other" category, you might be able to safely ignore that category. And by "ignore", I mean none of the dummy variables would be labeled as a 1 (this is effectively the same as an "other" category).Pierian
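As a rough sketch of the "other" bucket idea (the data and the min_count threshold below are made up), rare categories can be collapsed before encoding:

```python
import pandas as pd

# Hypothetical example: collapse rare colors into "other" before encoding
colors = pd.Series(['red', 'blue', 'red', 'blue', 'red', 'mauve', 'teal'])

# keep categories that appear at least min_count times; the threshold is an assumption
min_count = 2
counts = colors.value_counts()
keep = counts[counts >= min_count].index
collapsed = colors.where(colors.isin(keep), 'other')

print(collapsed.tolist())
# ['red', 'blue', 'red', 'blue', 'red', 'other', 'other']
```

If your scikit-learn version is recent enough (1.1+), OneHotEncoder also accepts min_frequency / max_categories parameters that group infrequent categories automatically.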
If I split the data into train, cv, and test sets, do I have to vectorize all of them separately?Warnke
I would think so, but you may want to have each cv set have similar compositions of the categorical variable. Meaning, if category A appears 20% of the time in the variable, then each cv set should roughly have A appear 20% of the time, and so on for all categories.Pierian
I didn't get what you meant by similar compositions. Can you explain a little more on that? I have used train_test_split with stratify to split my data into train, cv, and test. If possible, can you show what you meant in code?Warnke
Suppose the variable has three categories A (20%), B (30%), and C (50%). During splitting, you should see similar percentages for each cv set. You normally don't have to worry about this because a random split will usually take care of this. You only need to be concerned if you have a smaller data set, one of the categories is less frequent (say 5%), or you are doing many cross-validation sets. Basically, you want each category to still appear enough times for fitting a model to make sense.Pierian
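The stratified split described here can be sketched with train_test_split (the data below is made up to match the 20/30/50 example):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: A appears 20%, B 30%, C 50% of the time
df = pd.DataFrame({'color': ['A'] * 20 + ['B'] * 30 + ['C'] * 50})

# stratify keeps the category proportions similar in both splits
train, test = train_test_split(
    df, test_size=0.2, stratify=df['color'], random_state=0
)

print(train['color'].value_counts(normalize=True).sort_index())
print(test['color'].value_counts(normalize=True).sort_index())
# both splits keep 20% A, 30% B, 50% C
```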
I got it. Thank you so much :)Warnke
Hi mickey. So basically what I'm doing is wrong. I apply get_dummies before I split, so I am implicitly assuming that all the categorical levels in my test set are also seen during training.Honghonied
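For anyone who wants to stay with pd.get_dummies after splitting, one common sketch (the data and column name here are made up) is to reindex the test dummies onto the train columns, which plays the same role as fit/transform:

```python
import pandas as pd

# Hypothetical split data with a test-only category D
train = pd.DataFrame({'cat': ['A', 'B', 'A', 'C']})
test = pd.DataFrame({'cat': ['B', 'A', 'D']})

train_d = pd.get_dummies(train['cat'], dtype=int)

# reindex forces the test dummies onto the train columns:
# unseen categories (D) are dropped, missing ones (C) are filled with 0
test_d = pd.get_dummies(test['cat'], dtype=int).reindex(
    columns=train_d.columns, fill_value=0
)

print(list(test_d.columns))   # ['A', 'B', 'C']
print(test_d.values.tolist())  # [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```

This reproduces the behavior of OneHotEncoder(handle_unknown='ignore') fitted on the train set.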

© 2022 - 2024 — McMap. All rights reserved.