Generally, you want to treat the test set as though you did not have it during training. Whatever transformations you do to the train set should be done to the test set before you make predictions. So yes, you should do the transformation separately, but know that you are applying the same transformation.
For example, if the test set is missing one of the categories, there should still be a dummy variable for the missing category (which would be found in the training set), since the model you train will still expect that dummy variable. If the test set has an extra category, this should probably be handled with some "other" category (see the sketch at the end of this answer).
Similarly, when scaling continuous variables, say to [0, 1], you use the range of the train set when scaling the test set. This could mean that the newly scaled test variable falls outside of [0, 1].
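Here's a minimal sketch of the scaling case, using scikit-learn's MinMaxScaler (the toy values are made up purely for illustration):

from sklearn.preprocessing import MinMaxScaler
import numpy as np
X_train = np.array([[1.0], [5.0], [10.0]])
X_test = np.array([[0.0], [12.0]])
scaler = MinMaxScaler()  # maps the train range to [0, 1] by default
scaler.fit(X_train)      # learn min/max from the train set only
scaler.transform(X_train)
#array([[0.        ],
#       [0.44444444],
#       [1.        ]])
scaler.transform(X_test)
#array([[-0.11111111],
#       [ 1.22222222]])

The test values land outside [0, 1] because 0 and 12 lie outside the train range of [1, 10], and that's fine: the model sees them on the same scale it was trained on.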
For completeness, here's how the one-hot encoding might look:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

### Correct: fit the encoder on the train set only
train = pd.DataFrame(['A', 'B', 'A', 'C'])
test = pd.DataFrame(['B', 'A', 'D'])
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)  # learns three categories: A, B, C
enc.transform(train).toarray()
#array([[1., 0., 0.],
#       [0., 1., 0.],
#       [1., 0., 0.],
#       [0., 0., 1.]])
enc.transform(test).toarray()
#array([[0., 1., 0.],
#       [1., 0., 0.],
#       [0., 0., 0.]])  # 'D' was never seen in training, so it encodes to all zeros
### Incorrect: fitting on train and test together leaks test information
full = pd.concat((train, test))
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(full)  # learns four categories: A, B, C, D
enc.transform(train).toarray()
#array([[1., 0., 0., 0.],
#       [0., 1., 0., 0.],
#       [1., 0., 0., 0.],
#       [0., 0., 1., 0.]])
enc.transform(test).toarray()
#array([[0., 1., 0., 0.],
#       [1., 0., 0., 0.],
#       [0., 0., 0., 1.]])  # note the extra fourth column
Notice that with the incorrect approach there is an extra column for D (which only shows up in the test set). During training, we wouldn't know about D at all, so there shouldn't be a column for it.
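As for the "other" category mentioned above, here's a minimal sketch of one way to do it (the column name cat is just for illustration): relabel unseen categories before transforming, and make sure "other" also appears in the train set so the encoder has a column for it.

# Sketch: collapse categories unseen during training into an 'other' label
train = pd.DataFrame({'cat': ['A', 'B', 'A', 'C', 'other']})
test = pd.DataFrame({'cat': ['B', 'A', 'D']})
known = set(train['cat'])
test['cat'] = test['cat'].where(test['cat'].isin(known), 'other')
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(train)  # learns four categories: A, B, C, other
enc.transform(test).toarray()
#array([[0., 1., 0., 0.],
#       [1., 0., 0., 0.],
#       [0., 0., 0., 1.]])  # 'D' maps onto the 'other' column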