How to evaluate cost function for scikit learn LogisticRegression?
After using sklearn.linear_model.LogisticRegression to fit a training data set, I would like to obtain the value of the cost function for the training data set and a cross validation data set.

Is it possible to have sklearn simply give me the value (at the fit minimum) of the function it minimized?

The function is stated in the documentation at http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression (depending on the regularization one has chosen). But I can't find how to get sklearn to give me the value of this function.

I would have thought this is what LogisticRegression.score does, but that simply returns the accuracy (the fraction of data points its prediction classifies correctly).

I have found sklearn.metrics.log_loss, but of course this is not the actual function being minimized.

Wiersma answered 12/3, 2016 at 11:12 Comment(0)

Unfortunately there is no "nice" way to do so, but there is a private function _logistic_loss(w, X, y, alpha, sample_weight=None) in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/logistic.py, so you can call it by hand:

from sklearn.linear_model.logistic import _logistic_loss
print(_logistic_loss(clf.coef_.ravel(), X, y, 1 / clf.C))

where clf is your fitted LogisticRegression. coef_ is flattened because _logistic_loss expects a 1-D weight vector; note that calling it this way leaves out the intercept term.
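Since that private helper can move between releases, the same quantity can also be sketched from the public API alone. This is a minimal sketch, assuming the default 'l2' penalty and a toy dataset: the objective is the sum of per-sample log-losses plus an L2 term ||w||² / (2C), which matches what _logistic_loss returns when the intercept is left out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

X, y = make_classification(n_samples=200, random_state=0)
clf = LogisticRegression(C=1.0).fit(X, y)

# Unregularized part: the *sum* (not mean) of per-sample log-losses
data_loss = log_loss(y, clf.predict_proba(X), normalize=False)

# L2 penalty used by the default 'l2' setting: ||w||^2 / (2C)
# (the intercept is not penalized)
w = clf.coef_.ravel()
penalty = np.dot(w, w) / (2.0 * clf.C)

total_cost = data_loss + penalty
```

The solver itself minimizes the equivalent scaled objective C · Σ log-loss + ½‖w‖², so the minimizer is the same; this expression is just one conventional normalization of it.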

Cuticula answered 12/3, 2016 at 16:9 Comment(8)
Wow, I'm shocked this is so difficult! Even this function is (strictly speaking) only used by the "newton-cg" solver. Is it not typical practice to use the actual cost function when studying overfitting vs underfitting or doing a grid search for optimal fitter settings (like clf.C)?Wiersma
No, you do not consider the cost function in most settings. There are very few situations where you would be interested in it (afaik only when you test an optimization procedure). For analysis of overfitting etc. you usually use typical metrics like logloss or accuracy, not the internal cost. In particular, most learning methods do not care about the cost function itself; they rely solely on its gradientCuticula
Interesting! I would have thought that the cost is the most informative metric, which is precisely why one minimizes it. Thanks for the info.Wiersma
No, the cost is a surrogate for the actual metric over the test set. This is why you usually have some kind of regularization. Consequently you are interested in the actual metric, not the cost, which is just a surrogate (required because you do not have access to the test set during training, but you do have such access during evaluation).Cuticula
@Wiersma I believe having access to the cost is useful. Because the cost function is a surrogate for your actual metric, it is useful to see whether or not your actual metric is getting better as the cost is minimized. This can give intuition into whether you should pick one cost function (model) over another, or whether you should change your optimization algorithm.Gunn
It's strange that I do not have access to "logistic" (and consequently not to "_logistic_loss" either). As a matter of fact, when I print the children of "sklearn.linear_model", I only have 'LogisticRegression' and 'LogisticRegressionCV'.Argilliferous
Things have moved around since then, and the library has been rewritten a lot. Look at sklearn._loss.loss.HalfBinomialLoss for the current location. Note that this is the regularisation-free loss; the L2 penalty is then added in 10 different ways depending on which solver is usedCuticula
link not workingEduction

I used the code below to calculate the cost value:

import numpy as np

cost = np.sum((reg.predict(x) - y) ** 2)

where reg is your fitted LogisticRegression

Allheal answered 29/10, 2020 at 22:30 Comment(1)
This looks like the squared error, which is not the cost function actually used during minimization. scikit-learn has tools to compute metrics like this. My question was about obtaining the actual cost in order to better understand the minimization procedure.Wiersma

I have the following suggestion. You can write the loss function of logistic regression as your own function. After you get the predicted labels for your data, you can invoke that function to calculate the cost values.
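A minimal sketch of that suggestion, assuming binary labels in {0, 1} and the default 'l2' penalty; logistic_cost is a hypothetical helper name, not part of scikit-learn, and it mirrors the objective stated in the scikit-learn documentation rather than the library's internal code:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def logistic_cost(w, b, X, y, C=1.0):
    """L2-regularized logistic loss, written by hand.

    y is assumed to be in {0, 1}; it is mapped to {-1, +1} to form
    signed margins y_i * (x_i . w + b).
    """
    yz = (2 * y - 1) * (X @ w + b)
    # Numerically stable sum of log(1 + exp(-yz))
    data_term = np.sum(np.logaddexp(0.0, -yz))
    # L2 penalty ||w||^2 / (2C); the intercept is not penalized
    return data_term + np.dot(w, w) / (2.0 * C)

X, y = make_classification(n_samples=100, random_state=0)
clf = LogisticRegression(C=0.5).fit(X, y)
cost = logistic_cost(clf.coef_.ravel(), clf.intercept_[0], X, y, C=clf.C)
```

As the comment thread notes, a hand-rolled function like this tracks the documented objective, not whatever the installed version internally optimizes, so it can drift out of sync with the library.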

Glochidiate answered 9/6, 2021 at 3:21 Comment(2)
seems more like a commentModigliani
This is doable, but if scikit ever changes the cost function it uses, the function you created will be obsolete and give the wrong values. My goal was to get the actual values used during minimization.Wiersma
