Classifiers that can handle missing values:
BaggingClassifier
DecisionTreeClassifier
DummyClassifier
HistGradientBoostingClassifier
RandomForestClassifier
PairwiseDifferenceLearningClassifier
Please note that handling of missing values in these estimators was mainly introduced starting from scikit-learn v1.4 (HistGradientBoostingClassifier has supported NaN since v0.22).
Moreover, in my experience, using a classifier that supports missing values natively gives better performance than imputing the values myself. Imputation can also introduce data leakage if not done correctly (e.g. fitting the imputer on the full dataset before splitting).
Classifiers that cannot handle missing values:
['AdaBoostClassifier', 'BernoulliNB', 'CalibratedClassifierCV', 'CategoricalNB', 'ComplementNB', 'ExtraTreeClassifier', 'ExtraTreesClassifier', 'GaussianNB', 'GaussianProcessClassifier', 'GradientBoostingClassifier', 'KNeighborsClassifier', 'LabelPropagation', 'LabelSpreading', 'LinearDiscriminantAnalysis', 'LinearSVC', 'LogisticRegression', 'LogisticRegressionCV', 'MLPClassifier', 'MultinomialNB', 'NearestCentroid', 'NuSVC', 'PassiveAggressiveClassifier', 'Perceptron', 'QuadraticDiscriminantAnalysis', 'RadiusNeighborsClassifier', 'RidgeClassifier', 'RidgeClassifierCV', 'SGDClassifier', 'SVC']
Testing code (to repeat with newer scikit-learn versions):
```python
import numpy as np
import sklearn
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Create a small dataset with missing values
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
X[::10, 0] = np.nan  # introduce missing values in the first feature

# Collect all classifiers, skipping meta-estimators that require a base estimator
META = {'ClassifierChain', 'MultiOutputClassifier', 'OneVsOneClassifier',
        'OneVsRestClassifier', 'OutputCodeClassifier',
        'StackingClassifier', 'VotingClassifier'}
classifiers = {}
for name, Clf in sklearn.utils.all_estimators(type_filter="classifier"):
    if name in META:
        continue
    try:
        classifiers[name] = Clf()
    except TypeError:  # requires constructor arguments (e.g. a base estimator)
        continue

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Lists to store results
successful_classifiers = []
failed_classifiers = []

# Test each classifier: attempt to fit and predict without any imputation
for name, clf in classifiers.items():
    try:
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        accuracy_score(y_test, y_pred)
        successful_classifiers.append(name)
    except ValueError:
        failed_classifiers.append(name)

# Print results
print("Classifiers that can handle missing values:")
print(successful_classifiers)
print("\nClassifiers that cannot handle missing values:")
print(failed_classifiers)
```