One Hot Encoding preserve the NAs for imputation

Asked 10/3, 2021 at 11:25 Answered 18/10, 2022 at 15:59

Solved python scikit-learn nan missing-data one-hot-encoding

I am trying to use KNN for imputing categorical variables in python.

A typical way to do this is to one-hot encode the variables first. However, sklearn's OneHotEncoder() doesn't handle NAs, so you need to rename them to some placeholder value, which creates a separate variable.

Small reproducible example:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer

# Create a small DataFrame with categorical values to impute
data0 = pd.DataFrame(columns=["1","2"],data = [["A",np.nan],["B","A"],[np.nan,"A"],["A","B"]])

Original data frame:

data0
     1    2
0    A  NaN
1    B    A
2  NaN    A
3    A    B

Proceed with one hot encoding:

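# Replace NaNs with a placeholder category so OneHotEncoder accepts the data
enc_missing = SimpleImputer(strategy="constant",fill_value="missing")
data1 = enc_missing.fit_transform(data0)
# Perform OHE (in scikit-learn >= 1.2 the argument is named sparse_output instead of sparse)
OHE = OneHotEncoder(sparse=False)
data_OHE = OHE.fit_transform(data1)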

data_OHE is now one-hot encoded:

data_OHE
array([[1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 1., 0., 0.],
       [0., 0., 1., 1., 0., 0.],
       [1., 0., 0., 0., 1., 0.]])

But because of the separate "missing" category, I don't have any NaNs left to impute.

My desired output of the one-hot encoding:

array([[1,        0,      np.nan, np.nan],
       [0,        1,        1,       0   ],
       [np.nan, np.nan,     1,       0   ], 
       [1,        0,        0,       1   ]
       ])

That way I keep the NaNs for later imputation.

Do you know any way to do this?

From my understanding this is something that has been discussed in the scikit-learn GitHub repo here and here, i.e. making OneHotEncoder handle this automatically with a handle_missing argument, but I am unsure of the status of their work.

Stanfield answered 10/3, 2021 at 11:25 Comment(0)

Handling of missing values in OneHotEncoder ended up getting merged in PR17317, but it operates by just treating the missing values as a new category (no option for other treatments, if I understand correctly).

One manual approach is described in this answer. The first step isn't strictly necessary now because of the above PR, but maybe filling with custom text will make it easier to find the column?
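The manual approach can be sketched roughly as follows, as an illustration rather than the linked answer's exact code: fill NaNs with a placeholder, one-hot encode, then for each original column set its dummy columns to NaN wherever the placeholder was hit and drop the placeholder column. The "missing" sentinel and the variable names are assumptions and must not collide with a real category. The sparse default plus .toarray() is used so the snippet does not depend on the sparse / sparse_output keyword.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data0 = pd.DataFrame(
    columns=["1", "2"],
    data=[["A", np.nan], ["B", "A"], [np.nan, "A"], ["A", "B"]],
)

# Fill NaNs with a sentinel so each column gets an explicit "missing" category.
# Assumption: "missing" is not a real category in the data.
filled = data0.fillna("missing")

ohe = OneHotEncoder()
encoded = ohe.fit_transform(filled).toarray()

blocks = []
start = 0
for cats in ohe.categories_:
    # Slice out the dummy columns that belong to one original column.
    width = len(cats)
    block = encoded[:, start:start + width].copy()
    start += width

    is_missing_col = cats == "missing"
    if is_missing_col.any():
        # Rows where the sentinel was hit become all-NaN in this block,
        # and the sentinel's own dummy column is dropped.
        missing_rows = block[:, is_missing_col].ravel() == 1
        block = block[:, ~is_missing_col]
        block[missing_rows, :] = np.nan
    blocks.append(block)

data_ohe_nan = np.hstack(blocks)
print(data_ohe_nan)
```

On the example data this yields the 4x4 array from the question's desired output, with NaN in the first two columns of row 2 and the last two columns of row 0.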

Drawback answered 10/3, 2021 at 15:38 Comment(5)

Thanks that was exactly what I was looking for! - hope to see this directly implemented in OneHotEncoder someday.. – Stanfield 11/3, 2021 at 15:3

Hi again Ben. I am able to perform the steps provided in your answer to make the one hot encoding. But how would you go about putting this into a sklearn pipeline? If I create my own MyOneHotEncoder class using BaseEstimator and TransformerMixin it cannot fit to test data due to: "Incompatible dimension between the fitted dataset and the one to be transformed". I suspect this is because the sklearn OneHotEncoder is now wrapped in my own class and thus the "handle_unknown = ignore" is not playing out its role anymore. Might be suitable for a new post instead of comment here.. Kr Kasper – Stanfield 12/3, 2021 at 10:31

I didn't pursue writing a custom estimator because it doesn't seem very easy and fairly niche. But if you've started, I (and others here probably) would be happy to answer specific new questions. – Drawback 12/3, 2021 at 19:20

But is it niche? I mean, I see many people who want to impute a categorical variable using KNN (or some other method that requires OHE), and it seems this is the only way. Am I completely missing an alternative way to use KNN on categorical variables? The pipeline setup is more because that's my go-to for ensuring we don't leak information between train and test when evaluating model performance. I might post another question about how to fit this custom function into a pipeline. – Stanfield 14/3, 2021 at 15:56

I've set up a new question in #66635531 – Stanfield 15/3, 2021 at 9:19
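For the pipeline problem raised in the comments above, one possible shape for such a wrapper is sketched below. It is not the code from the follow-up question: the class name NaNPreservingOneHotEncoder and the sentinel parameter are hypothetical. The idea is to fix the categories at fit time from the observed (non-NaN) values and pass handle_unknown="ignore" to the inner encoder, so the output width stays the same on test data, and then to overwrite each column's dummy block with NaN where the input was NaN.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder


class NaNPreservingOneHotEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: one-hot encode, but keep NaN inputs as NaN."""

    def __init__(self, sentinel="missing"):
        # Assumption: the sentinel never occurs as a real category.
        self.sentinel = sentinel

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # Categories are learned from observed (non-NaN) values only.
        categories = [
            sorted(X.iloc[:, i].dropna().unique()) for i in range(X.shape[1])
        ]
        self.encoder_ = OneHotEncoder(
            categories=categories, handle_unknown="ignore"
        )
        self.encoder_.fit(X.fillna(self.sentinel))
        return self

    def transform(self, X):
        X = pd.DataFrame(X)
        # The sentinel and unseen categories are encoded as all zeros.
        out = self.encoder_.transform(X.fillna(self.sentinel)).toarray()
        start = 0
        for i, cats in enumerate(self.encoder_.categories_):
            stop = start + len(cats)
            # Restore NaN in the dummy block of every originally missing cell.
            out[X.iloc[:, i].isna().to_numpy(), start:stop] = np.nan
            start = stop
        return out


train = pd.DataFrame({"1": ["A", "B", np.nan, "A"], "2": [np.nan, "A", "A", "B"]})
test = pd.DataFrame({"1": ["C", np.nan], "2": ["B", "A"]})  # "C" is unseen

encoded_test = NaNPreservingOneHotEncoder().fit(train).transform(test)

pipe = make_pipeline(NaNPreservingOneHotEncoder(), KNNImputer(n_neighbors=2))
pipe.fit(train)
imputed_test = pipe.transform(test)
```

Note that KNNImputer returns fractional values for the imputed dummies, so mapping them back to a single category (for example by taking the largest value per block) is a separate step not shown here.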

Create a Pipeline:

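from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# X_train and y_train are assumed to be defined already
model = make_pipeline(
    OneHotEncoder(),
    SimpleImputer(),
    Ridge()
)
model.fit(X_train, y_train)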

Jeanniejeannine answered 18/10, 2022 at 15:59 Comment(0)
