I am trying to use KNN for imputing categorical variables in Python.
In order to do so, a typical approach is to one hot encode the variables first. However, sklearn's OneHotEncoder() doesn't handle NaNs, so you need to replace them with a placeholder value, which creates a separate category.
Small reproducible example:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
# Create a small pandas DataFrame with categorical values to impute
data0 = pd.DataFrame(columns=["1", "2"], data=[["A", np.nan], ["B", "A"], [np.nan, "A"], ["A", "B"]])
Original data frame:
data0
1 2
0 A NaN
1 B A
2 NaN A
3 A B
Proceed with one hot encoding:
# Replace NaN with a placeholder category so that OneHotEncoder accepts the data
enc_missing = SimpleImputer(strategy="constant",fill_value="missing")
data1 = enc_missing.fit_transform(data0)
# Perform OHE (on scikit-learn >= 1.2, use sparse_output=False instead of sparse=False):
OHE = OneHotEncoder(sparse=False)
data_OHE = OHE.fit_transform(data1)
data_OHE is now one hot encoded:
data_OHE
array([[1., 0., 0., 0., 0., 1.],
[0., 1., 0., 1., 0., 0.],
[0., 0., 1., 1., 0., 0.],
[1., 0., 0., 0., 1., 0.]])
But because of the separate "missing" category, I don't have any NaNs left to impute.
My desired output of one hot encoding:
array([[1, 0, np.nan, np.nan],
[0, 1, 1, 0 ],
[np.nan, np.nan, 1, 0 ],
[1, 0, 0, 1 ]
])
That way, I keep the NaNs for later imputation.
Do you know any way to do this?
From my understanding, this has been discussed in the scikit-learn GitHub repo here
and here, i.e. making OneHotEncoder handle this automatically with a handle_missing
argument, but I am unsure of the status of that work.
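For reference, a manual workaround might look roughly like the sketch below. It is only an illustration of the desired behaviour, not an existing scikit-learn feature: the helper name `ohe_keep_nan` and its `placeholder` parameter are hypothetical. The idea is to fit one encoder per column on the observed values only, so that no "missing" column is created, and then to overwrite the encoded rows with NaN wherever the original value was missing.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


def ohe_keep_nan(df, placeholder="missing"):
    """Hypothetical helper: one hot encode each column of df,
    keeping NaN in the encoded columns where the input was missing."""
    blocks = []
    for col in df.columns:
        # Boolean mask of rows where this column is missing
        is_missing = df[col].isna().to_numpy()
        # Fit only on observed values, so the placeholder never becomes a category
        enc = OneHotEncoder(handle_unknown="ignore")
        enc.fit(df.loc[~is_missing, [col]])
        # The placeholder is unknown to the encoder and is encoded as all zeros
        filled = df[[col]].fillna(placeholder)
        block = enc.transform(filled).toarray()
        # Restore NaN for the rows that were missing in the original column
        block[is_missing, :] = np.nan
        blocks.append(block)
    return np.hstack(blocks)


data0 = pd.DataFrame(
    columns=["1", "2"],
    data=[["A", np.nan], ["B", "A"], [np.nan, "A"], ["A", "B"]],
)
result = ohe_keep_nan(data0)
print(result)
```

On the example data this should give a 4x4 array with NaN in the first row's last two columns and the third row's first two columns. Whether a KNN imputer then yields sensible categories from the partially filled one hot columns is a separate question.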