One-hot encoding a train set when the test set has values not present in the train set

I have a train set and a test set stored as DataFrames. I am trying to one-hot encode the nominal features in my dataset, but I have the following issues:

  1. In total there are 3 categorical features, but I don't know what values each feature can take because the dataset is large.
  2. The test set has values that are not present in the train set, so after one-hot encoding, the train set should have those vectors marked with 0s for the unseen values. But as I mentioned in 1, I don't know all the values in advance.
  3. I found that I can use df = pd.get_dummies(df, prefix_sep='_') to do the one-hot encoding. The command works on all categorical features, but I noticed that it moves the new columns to the end of the train DataFrame, which I think is a problem because we no longer know which column indices belong to which feature. There is also issue 2: the new train/test sets should have the same columns.

Is there an automated way to do this, or perhaps a library?

EDIT

Thanks to the answers below, I was able to perform one-hot encoding on many features. But the code below gave me the following issues:

  1. I think scikit-learn strips the column headers and produces the result as an array, not as a DataFrame.
  2. Since the headers are stripped away, we have no knowledge of which vector belongs to which feature. Even if I perform df_scaled = pd.DataFrame(ct.fit_transform(data2)) to have the result stored in a DataFrame, the created DataFrame df_scaled has no headers, especially since the headers changed after the pre-processing. Perhaps sklearn.preprocessing.OneHotEncoder has a method which keeps track of the new features and their indices?
Sarawak answered 15/9, 2019 at 16:22 Comment(0)

Instead of pd.get_dummies, which has the drawbacks you identified, use sklearn.preprocessing.OneHotEncoder. It automatically learns all the nominal categories from your train data and then encodes your test data according to the categories identified in the training step. If there are new categories in the test data, it will simply encode them with 0's.

Example:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

x_train = np.array([["A1","B1","C1"],["A2","B1","C2"]])
x_test = np.array([["A1","B2","C2"]]) # As you can see, "B2" is a new attribute for column B

ohe = OneHotEncoder(handle_unknown = 'ignore')  # 'ignore' tells the encoder to encode unseen categories with all 0's instead of raising an error
ohe.fit(x_train)
print(ohe.transform(x_train).toarray())
>>> array([[1., 0., 1., 1., 0.],
           [0., 1., 1., 0., 1.]])

To get a summary of the categories by column in the train set, do:

print(ohe.categories_)
>>> [array(['A1', 'A2'], dtype='<U2'), 
     array(['B1'], dtype='<U2'), 
     array(['C1', 'C2'], dtype='<U2')]

To map the one-hot encoded columns back to their original features, do:

print(ohe.get_feature_names())
>>> ['x0_A1' 'x0_A2' 'x1_B1' 'x2_C1' 'x2_C2']

Finally, this is how the encoder works on new test data:

print(ohe.transform(x_test).toarray())
>>> [[1. 0. 0. 0. 1.]] # 1 for A1, 0 for A2, 0 for B1, 0 for C1, 1 for C2

EDIT:

You seem to be worried about losing the labels after doing the encoding. It is actually very easy to get them back: just wrap the result in a DataFrame and take the column names from ohe.get_feature_names():

pd.DataFrame(ohe.transform(x_test).toarray(), columns = ohe.get_feature_names())
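If toarray() is too expensive memory-wise on a large dataset, an alternative (a sketch, assuming pandas >= 0.25) is to keep the sparse matrix that the encoder returns by default and wrap it in a sparse DataFrame:

# ohe.transform returns a scipy sparse matrix by default; no dense copy is made
pd.DataFrame.sparse.from_spmatrix(ohe.transform(x_test), columns = ohe.get_feature_names())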
Nonalcoholic answered 15/9, 2019 at 16:55 Comment(7)
Thank you for the answer. The command works on arrays, but I have a DataFrame. How can I convert the result to a DF?Sarawak
@U.User, just do pd.DataFrame(ohe.transform(x_test).toarray()) if this is really something you need to have as a dfNonalcoholic
Thank you for the answer. Can you please read the edit about the header issue? When I one-hot encode, the headers get stripped away, and after the pre-processing the headers have changed. If I want to put the new headers onto the new train and test sets accordingly, is this possible? Sorry for the inconvenience, but I found so many threads that require different libraries and none is actually consistentSarawak
Edited my answer, it's quite straightforwardNonalcoholic
Truly, thank you for your assistance. Your approach seems to lead to memory issues as I have many observations, so I used the ColumnTransformer approach instead. It is a bit more complicated, although it gives a correct output. The issue here is that I can't link the new feature names. Please read the edit.Sarawak
Yes, toarray() is very expensive memory-wise, which is why by default sklearn returns a sparse matrix that does not have labels. However, it seems to me like your update is now quite a different question. I think you should remove it from here and open another question as it's getting difficult to followNonalcoholic
Okay, if I create a new thread I will lose you haha. You're right. I will remove the update, create a new question, and formally accept your answer.Sarawak

pd.get_dummies should name the new columns in a way that lets you tell which ones belong to each categorical feature. If you want to give it a custom set of prefixes, use the prefix argument. Then you can look at the list of columns to see all the columns corresponding to each feature. (You don't need prefix_sep='_'; that is the default.)

df = pd.get_dummies(df, prefix=['first_feature', 'second_feature', 'third_feature'])
first_feature_column_names = [c for c in df.columns if c.startswith('first_feature_')]

You can also perform the one-hot encoding for one categorical feature at a time, if that helps you keep track of which columns belong to each feature.

df = pd.get_dummies(df, columns=['first_feature'])

As for your issue with some labels only being present in your test set or your training set: if df contains your training and test sets together (and you intend to separate them later with something like sklearn.model_selection.train_test_split), then any value that exists only in your test set will produce an all-zeroes column in your training set. Obviously this won't actually provide any value to your model, but it will keep your column indexes consistent. However, there's really no point in having one-hot columns where none of your training data has a non-zero value - they will have no effect on your model.
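If you instead encode the train and test sets separately with pd.get_dummies, a pandas-only way to keep the columns consistent is to reindex the encoded test frame against the encoded train frame. A minimal sketch, assuming df_train and df_test have both already been passed through pd.get_dummies:

# drop test-only dummy columns and add train-only ones, filled with 0s
df_test = df_test.reindex(columns=df_train.columns, fill_value=0)

A more robust way to avoid errors and inconsistent column indexes between training and test is sklearn.preprocessing.OneHotEncoder, for example inside a ColumnTransformer: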

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([
    ("onehot", OneHotEncoder(handle_unknown='ignore'), ['first_feature', 'second_feature', 'third_feature']),
], remainder='passthrough')

df_train = ct.fit_transform(df_train)  # learn the categories from the train set only
df_test = ct.transform(df_test)        # encode the test set with the same columns

# Or simply

df = ct.fit_transform(df)

handle_unknown='ignore' tells the encoder to ignore (rather than throw an error for) any value that was not present in the initial training set; such values come out encoded as all zeros.
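To get headers back after the ColumnTransformer (the issue raised in the question's edit), here is a minimal sketch, assuming scikit-learn >= 1.0 (where ColumnTransformer exposes get_feature_names_out()) and starting again from the raw df_train/df_test:

import pandas as pd

encoded_train = ct.fit_transform(df_train)  # fit the categories on the train set only
new_columns = ct.get_feature_names_out()    # available only after fitting
# if the output is a scipy sparse matrix, densify it first or use
# pd.DataFrame.sparse.from_spmatrix instead of pd.DataFrame
df_train_enc = pd.DataFrame(encoded_train, columns=new_columns)
df_test_enc = pd.DataFrame(ct.transform(df_test), columns=new_columns)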

Eliaeliades answered 15/9, 2019 at 16:51 Comment(2)
I don't really understand the concept: so handle_unknown='ignore' will mark any value that was not present in the initial training set? Given that the command is run first on the training set and then on the test set, how does that work? It seems like the reverse should happen: it should mark the unseen features in THE TRAINING SET with 0s?Sarawak
I've tried the commands of both answers. fit_and_transform produces an error; perhaps it's fit_transform? Also the result is converted to an array instead of a DataFrame.Sarawak
