Using cross validation and AUC-ROC for a logistic regression model in sklearn

I'm using the sklearn package to build a logistic regression model and then evaluate it. Specifically, I want to evaluate it using cross validation, but can't figure out the right way to do that with the cross_val_score function.

According to the documentation and some examples I saw, I need to pass the function the model, the features, the outcome, and a scoring method. However, AUC doesn't need class predictions; it needs probabilities, so it can sweep different threshold values and compute the ROC curve from them. So what's the right approach here? The function lists 'roc_auc' as a possible scoring method, so I'm assuming the two are compatible; I'm just not sure about the right way to use it. Sample code snippet below.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was deprecated and later removed

features = ['a', 'b', 'c']
outcome = 'd'
X = df[features]
y = df[outcome]  # a 1-D Series; df[['d']] would be a 2-D DataFrame and trigger a conversion warning
crossval_scores = cross_val_score(LogisticRegression(), X, y, scoring='roc_auc', cv=10)

Basically, I don't understand why I need to pass y to my cross_val_score function here, instead of probabilities calculated using X in a logistic regression model. Does it just do that part on its own?

Rubinrubina answered 17/5, 2017 at 23:17 Comment(1)
Has your question been addressed? If so, you should mark the correct answer with the checkbox beside it. Otherwise, what can be clarified? – Braddock

All supervised learning methods (including logistic regression) need the true y values to fit a model.

After fitting a model, we generally want to:

  • Make predictions, and
  • Score those predictions (usually on 'held out' data, such as by using cross-validation)

cross_val_score gives you cross-validated scores of a model's predictions. But to score the predictions it first needs to make the predictions, and to make the predictions it first needs to fit the model, which requires both X and (true) y.
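To make that concrete, here is a rough per-fold sketch of what cross_val_score does (not the library's actual internals; it assumes X and y are NumPy arrays). Notice that the true y is needed twice: once to fit and once to score.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

def manual_cross_val_auc(X, y, cv=10):
    scores = []
    # sklearn uses stratified folds for classifiers when cv is an integer
    for train_idx, test_idx in StratifiedKFold(n_splits=cv).split(X, y):
        model = LogisticRegression().fit(X[train_idx], y[train_idx])  # fitting needs true y
        proba = model.predict_proba(X[test_idx])[:, 1]  # probability of the positive class
        scores.append(roc_auc_score(y[test_idx], proba))  # scoring needs true y again
    return np.array(scores)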

As you note, cross_val_score accepts different scoring metrics. If you chose f1 as your metric, for example, the predictions generated during cross_val_score would be class labels (from the model's predict() method). And if you chose roc_auc as your metric, the predictions used to score the model would be probability predictions (from the model's predict_proba() method).
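For example, the only change between the two cases is the scoring string; cross_val_score picks the appropriate prediction method for each. A minimal sketch on synthetic data from make_classification:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# f1 is computed from hard class labels, i.e. predict()
f1_scores = cross_val_score(LogisticRegression(), X, y, scoring='f1', cv=10)

# roc_auc is computed from continuous scores, not labels
auc_scores = cross_val_score(LogisticRegression(), X, y, scoring='roc_auc', cv=10)

print(f1_scores.mean(), auc_scores.mean())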

Pyrolysis answered 18/5, 2017 at 5:26 Comment(0)

cross_val_score trains models on inputs with true values, makes predictions, then compares those predictions to the true values; that comparison is the scoring step. That's why you pass in y: it holds the true values, the "ground truth".

The roc_auc_score function that is called by specifying scoring='roc_auc' takes both y_true and y_score: the ground truth and the continuous scores (probabilities or decision-function values) your model produces from X.
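As a minimal standalone illustration (toy values, not the question's data), roc_auc_score compares ground-truth labels with continuous scores:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # e.g. predict_proba(X)[:, 1]
print(roc_auc_score(y_true, y_score))  # 0.75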

Braddock answered 18/5, 2017 at 4:58 Comment(1)
Does cross_val_score use predict_proba or decision_function under the hood? I am searching the documentation but can't find anything relevant. – Charteris
