I have a train and a test set stored as DataFrames. I am trying to one-hot encode the nominal features in my dataset, but I have the following issues:
- In total there are 3 categorical features, but I don't know what the values of each feature are because the dataset is large.
- The test set has values that are not present in the train set, so when I do one-hot encoding, the train set should have the vectors for the unseen values marked as 0. But as I mentioned in 1, I don't know all the values of each feature.
- I found I can use
df = pd.get_dummies(df, prefix_sep='_')
to do the one-hot encoding. The command works on all categorical features, but I noticed that it moves the new features to the end of the train DataFrame, which I think is a problem because we no longer know which column index belongs to which feature. There is also issue number 2: the new train and test sets should have the same column indices.
Is there any automated way to do this, or a library perhaps?
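To make the alignment requirement concrete, here is a minimal sketch using pandas only. The toy frames and the column names `color` and `size` are hypothetical stand-ins for the real data, and it assumes the names of the categorical columns are known even if their values are not. Each set is encoded separately and then reindexed to the union of the resulting columns, so a value seen in only one set becomes an all-zero column in the other.

```python
import pandas as pd

# Hypothetical toy data; "color" stands in for a nominal feature.
train = pd.DataFrame({"color": ["red", "blue"], "size": [1, 2]})
test = pd.DataFrame({"color": ["green", "red"], "size": [3, 4]})

cat_cols = ["color"]  # assumed: the categorical column names are known

# Encode each set on its own; dtype=int gives 0/1 instead of booleans.
train_enc = pd.get_dummies(train, columns=cat_cols, prefix_sep="_", dtype=int)
test_enc = pd.get_dummies(test, columns=cat_cols, prefix_sep="_", dtype=int)

# Align both frames to the union of their columns; missing ones are filled with 0.
all_cols = train_enc.columns.union(test_enc.columns)
train_enc = train_enc.reindex(columns=all_cols, fill_value=0)
test_enc = test_enc.reindex(columns=all_cols, fill_value=0)

print(train_enc)
print(test_enc)
```

Because both frames are reindexed with the same column index, a given feature sits at the same position in the train and test sets, and the column names record which original feature each dummy came from.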
EDIT
Thanks to the answers below, I was able to perform one-hot encoding on many features. But the code below gave the following issues:
- I think
scikit-learn
strips the column headers and produces the result as an array and not as a DataFrame.
- Since the headers are stripped away, we have no knowledge of which vector belongs to which feature. Even if I perform
df_scaled = pd.DataFrame(ct.fit_transform(data2))
to have the results stored in a DataFrame, the created DataFrame
df_scaled
has no headers, especially since the headers have changed after the pre-processing. Perhaps
sklearn.preprocessing.OneHotEncoder
has a method which keeps track of the new features and their indices?