Classifiers in scikit-learn that handle NaN/null
I was wondering if there are classifiers that handle NaN/null values in scikit-learn. I thought the random forest regressor handled this, but I got an error when I called predict.

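import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.array([[1, np.nan, 3], [np.nan, 5, 6]])
y_train = np.array([1, 2])
clf = RandomForestRegressor().fit(X_train, y_train)
X_test = np.array([[7, 8, np.nan]])  # predict expects a 2D array
y_pred = clf.predict(X_test)  # Fails!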

Can I not call predict with any scikit-learn algorithm with missing values?

Edit: Now that I think about this, it makes sense. It's not an issue during training, but when you predict, how do you branch when the variable is null? Maybe you could just split both ways and average the result? It seems like k-NN should work fine as long as the distance function ignores nulls, though.
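To make the k-NN idea concrete, here is a minimal pure-Python sketch of a distance that skips coordinates missing in either vector. The helper names `nan_euclidean` and `knn_predict` are made up for this illustration, and the rescaling by total/shared coordinates is one possible choice (scikit-learn exposes a similar metric as `sklearn.metrics.pairwise.nan_euclidean_distances`).

```python
import math


def nan_euclidean(a, b):
    """Euclidean distance over the coordinates present in both vectors.

    The squared sum is rescaled by (total / shared) so that pairs with
    few shared coordinates are not automatically considered closer.
    """
    shared = [(x, y) for x, y in zip(a, b)
              if not (math.isnan(x) or math.isnan(y))]
    if not shared:
        return math.inf  # nothing to compare: treat as infinitely far
    squared = sum((x - y) ** 2 for x, y in shared)
    return math.sqrt(len(a) / len(shared) * squared)


def knn_predict(X_train, y_train, x, k=1):
    """Majority vote among the k nearest training rows (hypothetical helper)."""
    nearest = sorted(range(len(X_train)),
                     key=lambda i: nan_euclidean(X_train[i], x))[:k]
    votes = [y_train[i] for i in nearest]
    return max(set(votes), key=votes.count)


nan = math.nan
X_train = [[1, nan, 3], [nan, 5, 6]]
y_train = [1, 2]
prediction = knn_predict(X_train, y_train, [7, 8, nan])
print(prediction)  # the second row shares the closer coordinate, so this prints 2
```

With the data from the question, the query shares only its first coordinate with the first row and only its second coordinate with the second row, so the vote is decided by those two coordinates alone.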

Edit 2 (older and wiser me): Some GBM libraries (such as xgboost) use a ternary tree instead of a binary tree precisely for this purpose: 2 children for the yes/no decision and 1 child for the missing decision. sklearn uses a binary tree.
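As a toy illustration of the three-way split described above (not the actual implementation of any library), a node can carry a dedicated child for missing values. The `Node`, `Leaf`, and `predict` names are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    value: float


@dataclass
class Node:
    feature: int       # index of the feature tested at this node
    threshold: float
    yes: "Tree"        # taken when x[feature] <= threshold
    no: "Tree"         # taken when x[feature] > threshold
    missing: "Tree"    # taken when x[feature] is NaN


Tree = Union[Leaf, Node]


def predict(tree, x):
    """Walk the tree, using the 'missing' child whenever the value is NaN."""
    while isinstance(tree, Node):
        v = x[tree.feature]
        if math.isnan(v):
            tree = tree.missing
        elif v <= tree.threshold:
            tree = tree.yes
        else:
            tree = tree.no
    return tree.value


stump = Node(feature=0, threshold=5.0,
             yes=Leaf(1.0), no=Leaf(2.0), missing=Leaf(1.5))
print(predict(stump, [3.0]), predict(stump, [7.0]), predict(stump, [math.nan]))
```

Because the missing branch is a separate child, no imputed value is ever needed at prediction time.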

Halfcock answered 19/5, 2015 at 5:2 Comment(6)
I also face this issue. I guess you need to remove the NaN values with this class; I also found this, but I still cannot solve the issue. Probably this will help. – Catamenia
The problem here is how NaN should be represented. It is a common issue in which you need to decide how to handle them: you can either drop them or substitute them with the mean or some other indicator value. – Enterectomy
I heard that some random forest models will ignore features with NaN values and use a randomly selected substitute feature. This doesn't seem to be the default behaviour in scikit-learn though. Does anyone have a suggestion for how to achieve this behaviour? It is attractive because you do not need to supply an imputed value. – Onesided
@Onesided - Looks like "Elements of Statistical Learning" page 311 suggests this (using "surrogate variables") as an alternative to adding a missing category or an imputed value, but I am not aware of any libraries doing this... – Halfcock
@Halfcock - Yes, the same book brought me here too. Does the fact that libraries do not implement this approach suggest that surrogate variables are not as effective? – Cotquean
@Halfcock yay! older and wiser you! – Pepsinate
S
43

I made an example that contains missing values in both the training and the test sets.

I just picked a strategy to replace missing data with the mean, using the SimpleImputer class. There are other strategies.

from __future__ import print_function

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer


X_train = [[0, 0, np.nan], [np.nan, 1, 1]]
Y_train = [0, 1]
X_test_1 = [0, 0, np.nan]
X_test_2 = [0, np.nan, np.nan]
X_test_3 = [np.nan, 1, 1]

# Create our imputer to replace missing values with, e.g., the mean
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(X_train)

# Impute our data, then train
X_train_imp = imp.transform(X_train)
clf = RandomForestClassifier(n_estimators=10)
clf = clf.fit(X_train_imp, Y_train)

for X_test in [X_test_1, X_test_2, X_test_3]:
    # Impute each test item (wrapped in a list, since transform expects 2D input), then predict
    X_test_imp = imp.transform([X_test])
    print(X_test, '->', clf.predict(X_test_imp))

# Results
[0, 0, nan] -> [0]
[0, nan, nan] -> [0]
[nan, 1, 1] -> [1]
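The same impute-then-train steps can also be bundled into one estimator, so the imputer is always fitted on the training data only. This is a sketch using `make_pipeline`; the `random_state` value is an arbitrary choice for reproducibility, not part of the example above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

X_train = np.array([[0, 0, np.nan], [np.nan, 1, 1]])
y_train = np.array([0, 1])
X_test = np.array([[0, 0, np.nan], [0, np.nan, np.nan], [np.nan, 1, 1]])

# The pipeline learns the column means during fit and reuses them in predict
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions)
```

A pipeline like this can be passed directly to cross-validation helpers, which refit the imputer inside each fold.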
Snakebird answered 19/5, 2015 at 7:26 Comment(7)
How do you handle the case when the values are really labels and not continuous? – Alginate
I would be really interested to see how imputation works for categorical data. – Atomize
Super-sketchy method for many datasets, especially where data is not missing at random or where missingness is very high. – Idiosyncrasy
OK, it's imputing. But what about RandomForest, which must handle NaNs without any imputing? – Faction
@SamStorie @Atomize for categorical variables, you could replace NaNs with the most frequent category: SimpleImputer(strategy='most_frequent') – Witch
@Faction this can be done in XGBoost - example tutorial – Witch
Not what was asked. – Unpopular
F
45

Short answer

Sometimes missing values are simply not applicable. Imputing them is meaningless. In these cases you should use a model that can handle missing values. Scikit-learn's models cannot handle missing values. XGBoost can.


More on scikit-learn and XGBoost

As mentioned in this article, scikit-learn's decision trees and KNN algorithms are not (yet) robust enough to work with missing values. If imputation doesn't make sense, don't do it.

Consider situations where imputation doesn't make sense.

Keep in mind this is a made-up example.

Consider a dataset with rows of cars ("Danho Diesel", "Estal Electric", "Hesproc Hybrid") and columns with their properties (Weight, Top speed, Acceleration, Power output, Sulfur Dioxide Emission, Range).

Electric cars do not produce exhaust fumes, so the sulfur dioxide emission of the Estal Electric should be a NaN value (missing). You could argue that it should be set to 0, but electric cars cannot produce sulfur dioxide: the property is not applicable rather than measured as zero. Imputing the value will ruin your predictions.

Felicidadfelicie answered 25/2, 2019 at 15:19 Comment(0)
S
15

If you are using a pandas DataFrame, you could use fillna. Here I replaced the missing data with the mean of each column.

df.fillna(df.mean(), inplace=True)
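When there is a separate test set, one way to apply the same idea is to compute the means on the training frame only and reuse them for the test frame. A small sketch with made-up data (the `train` and `test` frames and their columns are assumptions for illustration):

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 4.0, 6.0]})
test = pd.DataFrame({"a": [np.nan, 5.0], "b": [7.0, np.nan]})

# Column means are learned from the training data only
train_means = train.mean(numeric_only=True)

train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)  # reuse the training means
print(test_filled)
```

Filling the test frame with its own means would let information from the test data influence preprocessing, which is the kind of leakage this avoids.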
Salchunas answered 6/12, 2018 at 11:2 Comment(0)
H
2

Classifiers that can handle missing values:

BaggingClassifier

DecisionTreeClassifier

DummyClassifier

HistGradientBoostingClassifier

RandomForestClassifier

PairwiseDifferenceLearningClassifier

Please note that the handling of missing values was mainly introduced starting with scikit-learn v1.4.

Moreover, from my personal experience, using such a classifier results in higher performance than imputing the values myself. Plus, imputation can introduce data leakage if not done correctly...

Classifiers that cannot handle missing values:

['AdaBoostClassifier', 'BernoulliNB', 'CalibratedClassifierCV', 'CategoricalNB', 'ComplementNB', 'ExtraTreeClassifier', 'ExtraTreesClassifier', 'GaussianNB', 'GaussianProcessClassifier', 'GradientBoostingClassifier', 'KNeighborsClassifier', 'LabelPropagation', 'LabelSpreading', 'LinearDiscriminantAnalysis', 'LinearSVC', 'LogisticRegression', 'LogisticRegressionCV', 'MLPClassifier', 'MultinomialNB', 'NearestCentroid', 'NuSVC', 'PassiveAggressiveClassifier', 'Perceptron', 'QuadraticDiscriminantAnalysis', 'RadiusNeighborsClassifier', 'RidgeClassifier', 'RidgeClassifierCV', 'SGDClassifier', 'SVC']

Testing code, to repeat for newer sklearn versions:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.utils import all_estimators

# Create a small dataset with missing values
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
X[::10, 0] = np.nan  # introduce missing values in the dataset

# Collect all classifiers, dropping meta-estimators that need a base estimator
classifiers = dict(all_estimators(type_filter="classifier"))
for meta in ('ClassifierChain', 'MultiOutputClassifier', 'OneVsOneClassifier',
             'OneVsRestClassifier', 'OutputCodeClassifier', 'StackingClassifier',
             'VotingClassifier'):
    classifiers.pop(meta, None)  # ignore names missing in this version
classifiers = {name: clf() for name, clf in classifiers.items()}


# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Lists to store results
successful_classifiers = []
failed_classifiers = []

# Test each classifier
for name, clf in classifiers.items():
    try:
        # Attempt to fit the model without imputation
        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        successful_classifiers.append(name)
    except ValueError as e:
        failed_classifiers.append(name)

# Print results
print("Classifiers that can handle missing values:")
print(successful_classifiers)
print("\nClassifiers that cannot handle missing values:")
print(failed_classifiers)
Harpoon answered 21/9, 2023 at 14:20 Comment(0)
K
2

Support for missing values in the random forest estimators was added in version 1.4: https://scikit-learn.org/dev/whats_new/v1.4.html#sklearn-ensemble
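If the same code has to run on both older and newer installations, one option is to check the installed version and fall back to imputation. This is only a sketch: it assumes the version string starts with "major.minor", and the tiny dataset is made up.

```python
import numpy as np
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Assumes a version string such as "1.4.2" or "1.5.dev0"
major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
native_nan = (major, minor) >= (1, 4)

model = RandomForestClassifier(n_estimators=10, random_state=0)
if not native_nan:
    # Older releases reject NaN, so impute with the column mean first
    model = make_pipeline(SimpleImputer(strategy="mean"), model)

X = np.array([[0, 0, np.nan], [np.nan, 1, 1], [0, 0, 0], [1, 1, 1]], dtype=float)
y = np.array([0, 1, 0, 1])
model.fit(X, y)
pred = model.predict(np.array([[0, np.nan, 0], [np.nan, 1, 1]]))
print(native_nan, pred)
```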

Keek answered 19/1, 2024 at 17:31 Comment(0)
S
0

For NoData located at the edge of a GeoTIFF image (which can obviously not be interpolated using the average of the values of neighbouring pixels), I masked it in a few lines of code. Please note that this was performed on one band (VH band of a Sentinel 1 image, which was first converted into an array). After I performed a Random Forest classification on my initial image, I did the following:

image[image>0]=1.0
image[image==0]=-1.0
RF_prediction=np.multiply(RF_prediction,image)
RF_prediction[RF_prediction<0]=-9999.0 #assign a NoData value
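An alternative sketch of the same masking step with `np.where`, assuming (as above) that NoData pixels are stored as 0 in the source band; the small arrays are made up for illustration. This variant leaves `image` unmodified and also masks pixels whose predicted class is 0, which the multiplication would turn into -0.0 and therefore not flag as negative.

```python
import numpy as np

NODATA = -9999.0

# Toy source band (0 marks NoData) and a toy classification result
image = np.array([[0.0, 0.2, 0.3],
                  [0.0, 0.5, 0.0]])
RF_prediction = np.array([[1.0, 2.0, 1.0],
                          [0.0, 3.0, 2.0]])

# Keep the prediction where the band has data, write NODATA elsewhere
masked = np.where(image == 0, NODATA, RF_prediction)
print(masked)
```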

When saving it, do not forget to assign a NoData value:

from osgeo import gdal, osr

# img_ds is assumed to be the source dataset opened earlier
RF_ds = gdal.GetDriverByName('GTiff').Create('RF_classified.tif', img_ds.RasterXSize,
                                             img_ds.RasterYSize, 1, gdal.GDT_Float32)

RF_ds.SetGeoTransform(img_ds.GetGeoTransform())    
srs = osr.SpatialReference()
srs.ImportFromEPSG(32733)                
RF_ds.SetProjection(srs.ExportToWkt()) # export coords to file
RF_ds.GetRasterBand(1).SetNoDataValue(-9999.0) #set NoData value
RF_ds.GetRasterBand(1).WriteArray(RF_prediction)
RF_ds.FlushCache()                     # write to disk
RF_ds = None
Spartacus answered 12/6, 2020 at 9:7 Comment(0)
