sklearn cross_val_score() returns NaN values

S

10

11

I'm trying to predict the next customer purchase for my job. I followed a guide, but when I called cross_val_score(), it returned NaN values for every model.

Variables:

X_train is a dataframe
X_test is a dataframe
y_train is a list
y_test is a list

Code:

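from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
import xgboost as xgb

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)
X_train = X_train.reset_index(drop=True)
X_test = X_test.reset_index(drop=True)

y_train = y_train.astype('float')
y_test = y_test.astype('float')

models = []
models.append(("LR", LogisticRegression()))
models.append(("NB", GaussianNB()))
models.append(("RF", RandomForestClassifier()))
models.append(("SVC", SVC()))
models.append(("Dtree", DecisionTreeClassifier()))
models.append(("XGB", xgb.XGBClassifier()))
models.append(("KNN", KNeighborsClassifier()))

for name, model in models:
    # Recent scikit-learn versions reject random_state unless shuffle=True
    kfold = KFold(n_splits=2, shuffle=True, random_state=22)
    cv_result = cross_val_score(model, X_train, y_train, cv=kfold, scoring="accuracy")
    print(name, cv_result)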
>>
LR [nan nan]
NB [nan nan]
RF [nan nan]
SVC [nan nan]
Dtree [nan nan]
XGB [nan nan]
KNN [nan nan]

Help me, please!

Sorely answered 11/2, 2020 at 15:36 Comment(4)

When you get a NaN result, it means some value the function needs is missing or is not a number. You should check for that. – Blindfish 11/2, 2020 at 15:45

Please include your code in formatted text with your question. A picture of code is about as useful as a picture of music... – Pie 11/2, 2020 at 15:46

There is probably a problem with your data. According to the documentation for sklearn.model_selection.cross_val_score, X can be a list or an array, whereas in your case X_train is a dataframe. Try passing X_train.values to cross_val_score instead of X_train. – Larisa 11/2, 2020 at 16:59

try with cv = 5. cv should be an int, not a kfold object. – Oblivion 19/2, 2021 at 16:33

S

1

Well, thanks everyone for your answers. Anna's answer helped me a lot! I didn't use X_train.values, though. Instead, I assigned a unique ID to each customer and then dropped the Customers column, and it works!

Now the models produce this output :)

LR [0.73958333 0.74736842]
NB [0.60416667 0.71578947]
RF [0.80208333 0.82105263]
SVC [0.79166667 0.77894737]
Dtree [0.82291667 0.83157895]
XGB [0.85416667 0.85263158]
KNN [0.79166667 0.75789474]
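For readers asking what this looks like in practice, here is a minimal sketch on made-up data. Only the Customers column name comes from the post; the recency and frequency columns and the synthetic labels are assumptions for illustration. The idea is to find the non-numeric columns and drop them before cross-validating.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
n = 40

# Synthetic stand-in for the real data (hypothetical feature names).
X = pd.DataFrame({
    "Customers": [f"cust_{i}" for i in range(n)],  # string column the models cannot fit on
    "recency": rng.normal(size=n),
    "frequency": rng.normal(size=n),
})
y = (X["recency"] + X["frequency"] > 0).astype(int)

# List the columns that are not numeric, then drop them.
non_numeric = X.select_dtypes(exclude="number").columns.tolist()
X_numeric = X.drop(columns=non_numeric)

kfold = KFold(n_splits=2, shuffle=True, random_state=22)
scores = cross_val_score(LogisticRegression(), X_numeric, y, cv=kfold, scoring="accuracy")
print(non_numeric, scores)
```

If the dropped column carries real signal, encoding it is usually a better choice than discarding it.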

Sorely answered 13/2, 2020 at 13:21 Comment(2)

Hey man, I'm dealing with this issue as well, could you explain a bit further as to what you did – Eminent 24/6, 2020 at 12:20

I am having the same issue, I have built a pipeline and fitting it to data. If I fit separately it gives me the accuracy score but if I try cross validation it is showing NaN values. Can u please share how to debug? – Hesperus 11/8, 2020 at 14:52

N

7

My case is a bit different. I was using cross_validate instead of cross_val_score, with a list of performance metrics. Doing a 5-fold CV, I kept getting NaNs for all performance metrics for a RandomForestRegressor:

scorers = ['neg_mean_absolute_error', 'neg_root_mean_squared_error', 'r2', 'accuracy']

results = cross_validate(forest, X, y, cv=5, scoring=scorers, return_estimator=True)
results

Turns out, I stupidly included the 'accuracy' metric, which is only used in classification. Instead of throwing an error, it looks like sklearn just returns NaNs in such cases.
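A minimal sketch of the corrected call, assuming synthetic regression data from make_regression rather than the original dataset. It keeps only regression metrics and passes error_score="raise" so that an unsuitable metric would fail loudly instead of producing NaNs.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_validate

# Synthetic data standing in for the real X and y.
X, y = make_regression(n_samples=100, n_features=5, noise=0.1, random_state=0)
forest = RandomForestRegressor(n_estimators=20, random_state=0)

# Regression metrics only; 'accuracy' is left out on purpose.
scorers = ["neg_mean_absolute_error", "neg_root_mean_squared_error", "r2"]

results = cross_validate(forest, X, y, cv=5, scoring=scorers, error_score="raise")
for name in scorers:
    print(name, results[f"test_{name}"])
```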

Nicobarese answered 19/2, 2021 at 16:27 Comment(2)

I've just done the same. A validity check within Sklearn would seem sensible. – Happening 13/5, 2021 at 14:22

This can come up with cross_validate in general. The pos_label parameter has been removed and moved to the make_scorer function, so for scores like precision and recall you need to create your own scorer, e.g. make_scorer(recall_score, pos_label='pos'); otherwise your scores will be NaNs. – Jadda 20/12, 2023 at 14:35
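A short sketch of the make_scorer approach from the comment above, assuming string labels 'pos' and 'neg' on synthetic data. The label names and dataset are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import cross_validate

X, y_num = make_classification(n_samples=100, n_features=5, random_state=0)
y = np.where(y_num == 1, "pos", "neg")  # string labels instead of 0/1

# Tell each metric which label counts as the positive class.
scoring = {
    "precision": make_scorer(precision_score, pos_label="pos"),
    "recall": make_scorer(recall_score, pos_label="pos"),
}

results = cross_validate(LogisticRegression(), X, y, cv=5, scoring=scoring)
print(results["test_precision"], results["test_recall"])
```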

V

5

I fixed the issue on my side. I was using a custom metric, the area under the precision-recall curve (AUCPR):

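from sklearn.metrics import auc, precision_recall_curve
from sklearn.preprocessing import label_binarize

def pr_auc_score(y, y_pred, **kwargs):
    # Assumes y_pred is 2-D with one column per class (see the problem described below)
    classes = list(range(y_pred.shape[1]))
    if len(classes) == 2:
        precision, recall, _ = precision_recall_curve(y, y_pred[:, 1], **kwargs)
    else:
        Y = label_binarize(y, classes=classes)
        precision, recall, _ = precision_recall_curve(Y.ravel(), y_pred.ravel(), **kwargs)
    return auc(recall, precision)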

The problem is that, for a binary problem, y_pred contains only the predicted probability of label 1, so its shape is (n_samples,). Accessing y_pred.shape[1] therefore raises an IndexError, which cross_validate hides by reporting NaN.

The solution: inside cross_validate, use the parameter error_score="raise". This will allow you to detect the error.
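As a rough illustration, the sketch below uses a deliberately broken scorer (a hypothetical stand-in, not the metric above) that repeats the same mistake of indexing shape[1] on a 1-D array. With error_score="raise", the exception reaches the caller instead of turning into NaN scores.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = make_classification(n_samples=60, n_features=5, random_state=0)

def broken_scorer(estimator, X, y):
    # Hypothetical scorer that treats a 1-D array as if it were 2-D.
    y_score = estimator.predict_proba(X)[:, 1]  # shape (n_samples,)
    return float(y_score.shape[1])              # fails: the shape tuple has one entry

caught = None
try:
    cross_validate(LogisticRegression(), X, y, cv=3,
                   scoring=broken_scorer, error_score="raise")
except Exception as exc:  # broad on purpose: we only want to see what went wrong
    caught = exc

print(type(caught).__name__, caught)
```

Once the underlying error is fixed, error_score can be left at its default again.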

Viper answered 12/4, 2022 at 14:49 Comment(0)

C

1

I know this is answered already but for others who still cannot figure out the problem, this is for you...

Check whether the data type of y is int. cross_val_score can return NaN if the dtype of y is object.

How to check

y.dtype

How to change the data type

y = y.astype(int)
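Put together, the check and the cast might look like the sketch below. The data is synthetic, and storing the labels with dtype object is an assumption made to reproduce the situation.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, labels = make_classification(n_samples=60, n_features=4, random_state=0)
y = pd.Series(labels, dtype=object)  # integer labels stored with dtype object

print(y.dtype)  # object

# Cast to int only when needed, then cross-validate as usual.
if y.dtype == object:
    y = y.astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=3, scoring="accuracy")
print(y.dtype, scores)
```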

Courtesy answered 13/8, 2021 at 16:20 Comment(0)

P

0

In my case, a timedelta column inside my NumPy array caused the error.
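One possible way to handle this, sketched with pandas on made-up data: detect timedelta columns and convert them to a plain number of days before building the array. The column names here are hypothetical.

```python
import pandas as pd

# Hypothetical feature table with one timedelta column.
X = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0],
    "since_last_purchase": pd.to_timedelta([1, 2, 3], unit="D"),
})

# Find timedelta columns and replace each with its length in days (float).
timedelta_cols = X.select_dtypes(include="timedelta").columns.tolist()
for col in timedelta_cols:
    X[col] = X[col].dt.total_seconds() / 86400.0

print(X.dtypes)
```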

Partitive answered 16/7, 2020 at 5:32 Comment(0)

L

0

I ran into the same problem. I solved it by converting X_train and y_train to DataFrames.

cross_val_score(model,X_train,y_train, cv = kfold,scoring = "accuracy")

Lepore answered 9/5, 2021 at 18:49 Comment(0)

R

0

Try doing encoding of categorical columns before passing to cross_val_score. It worked for me.
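A minimal sketch of one way to do that, assuming a single categorical column named segment (a made-up name) and one-hot encoding inside a pipeline so the encoder is fitted on each training fold only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 60

# Synthetic data with one categorical and one numeric column (hypothetical names).
X = pd.DataFrame({
    "segment": rng.choice(["retail", "wholesale", "online"], size=n),
    "spend": rng.normal(size=n),
})
y = (X["spend"] > 0).astype(int)

# One-hot encode the categorical column and pass the numeric one through unchanged.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["segment"])],
    remainder="passthrough",
)
model = make_pipeline(preprocess, LogisticRegression())

scores = cross_val_score(model, X, y, cv=3, scoring="accuracy")
print(scores)
```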

Retrusion answered 17/11, 2022 at 15:20 Comment(1)

Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. – Heliotype 19/11, 2022 at 16:35

T

0

I got this error when cross validating, and it was because there was still a NaN in my data.

I found this out because cross-validation doesn't show the error, so I tried training a model on the whole dataset without cross-validating.

When I tried LogisticRegression().fit(X,y), the error actually was displayed as being caused by a NaN in the data.
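The same debugging step can be sketched on a toy dataset (the values below are made up): count missing values per column, then fit the model directly so the real error message is visible.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy data with a single missing value in column "a".
X = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "b": [0.5, 0.1, 0.3, 0.9, 0.7, 0.2],
})
y = pd.Series([0, 1, 0, 1, 0, 1])

# Step 1: count missing values per column.
nan_counts = X.isna().sum()
print(nan_counts)

# Step 2: fit without cross-validation to see the underlying error.
message = None
try:
    LogisticRegression().fit(X, y)
except ValueError as exc:
    message = str(exc)
print(message)
```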

Tempera answered 30/7, 2024 at 13:34 Comment(0)

F

-2

The cross_val_score method returns NaN when there are null values in your dataset.

Either use a model which can deal with missing values or remove all the null values from your dataset and try again.
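Two common ways to act on this are sketched below on synthetic data: dropping the rows that contain nulls, or filling them in. Note that the second option uses median imputation with SimpleImputer, which is a substitute technique rather than a model that handles missing values natively.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(60, 3)), columns=["f0", "f1", "f2"])
y = (X["f0"] + X["f1"] > 0).astype(int)
X.iloc[[3, 17, 42], 1] = np.nan  # inject a few missing values

# Option 1: drop every row that contains a null, in X and y alike.
keep = X.notna().all(axis=1)
scores_dropped = cross_val_score(LogisticRegression(), X[keep], y[keep], cv=3)

# Option 2: impute inside a pipeline so the imputer is fitted per training fold.
model = make_pipeline(SimpleImputer(strategy="median"), LogisticRegression())
scores_imputed = cross_val_score(model, X, y, cv=3)

print(scores_dropped, scores_imputed)
```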

Fraise answered 13/7, 2020 at 12:31 Comment(0)

A

-2

For me, using xtrain.values and ytrain.values worked, as the cross-validation needed the input to be an array and not a dataframe.

Addy answered 27/1, 2021 at 12:40 Comment(0)
