Comparing AUC, log loss and accuracy scores between models
I have the following evaluation metrics on the test set, after running 6 models for a binary classification problem:

model  accuracy  logloss   AUC
    1       19%     0.45  0.54
    2       67%     0.62  0.67
    3       66%     0.63  0.68
    4       67%     0.62  0.66
    5       63%     0.61  0.66
    6       65%     0.68  0.42

I have the following questions:

  • How can model 1 be the best in terms of log loss (its log loss is the closest to 0) when it performs the worst in terms of accuracy? What does that mean?
  • How come model 6 has a lower AUC score than, e.g., model 5, when model 6 has better accuracy? What does that mean?
  • Is there a way to say which of these 6 models is the best?
Buzzell answered 29/10, 2019 at 15:04

Very briefly, with links (as parts of this have already been discussed elsewhere)...

How can model 1 be the best in terms of log loss (its log loss is the closest to 0) when it performs the worst in terms of accuracy? What does that mean?

Although the loss acts as a proxy for the accuracy (or vice versa), it is not a very reliable one in that respect. A closer look at the specific mechanics between accuracy and loss may be useful here; consider the following SO threads (disclaimer: the answers are mine):

To elaborate a little:

Assume a sample with true label y=1, a probabilistic prediction from the classifier of p=0.51, and a decision threshold of 0.5 (i.e. for p > 0.5 we classify as 1, otherwise as 0). The contribution of this sample to the accuracy is 1/n (i.e. positive), while the loss is

-log(p) = -log(0.51) = 0.6733446

Now assume another sample, again with true y=1, but this time with a probabilistic prediction of p=0.99; the contribution to the accuracy will be the same, while the loss now will be:

-log(p) = -log(0.99) = 0.01005034

So, for two samples that are both correctly classified (i.e. they contribute positively to the accuracy by the exact same quantity), we have a rather huge difference in the corresponding losses...

Although the figures you present here seem rather extreme, it shouldn't be difficult to imagine a situation where many samples with y=1 end up around p=0.49, hence giving a relatively low loss but zero contribution to the accuracy nonetheless...
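
To make both points concrete, here is a minimal sketch in Python (toy data and hypothetical probability values of my own, not your actual models), using scikit-learn's accuracy_score and log_loss:

import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Hypothetical example: four samples, all with true label y=1
y_true = np.ones(4, dtype=int)

# Model A: every prediction just below the 0.5 threshold (barely wrong)
p_a = np.array([0.49, 0.49, 0.49, 0.49])
# Model B: one barely-correct prediction, three confidently wrong ones
p_b = np.array([0.51, 0.05, 0.05, 0.05])

for name, p in [("A", p_a), ("B", p_b)]:
    hard = (p > 0.5).astype(int)  # hard labels at threshold 0.5
    print(name,
          "accuracy:", accuracy_score(y_true, hard),
          "log loss:", round(log_loss(y_true, p, labels=[0, 1]), 3))

# Model A: accuracy 0.00, log loss ~0.713 -> lower (better) loss, zero accuracy
# Model B: accuracy 0.25, log loss ~2.415 -> higher accuracy, much worse loss

So a model whose predictions hover just on the wrong side of the threshold can easily beat a more accurate model on loss, which may well be what is happening with your model 1.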

How come model 6 has a lower AUC score than, e.g., model 5, when model 6 has better accuracy? What does that mean?

This one is easier.

According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) practice is to treat it as just another higher-is-better metric, like accuracy, which naturally leads to puzzles like the one you describe.

The truth is that, roughly speaking, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds. So, the AUC does not actually measure the performance of a particular deployed model (which includes the chosen decision threshold), but the averaged performance of a family of models across all thresholds (the vast majority of which are of course of no interest to you, as they will never be used).
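
Here is a minimal sketch of that distinction (hypothetical scores, loosely mimicking your models 5 and 6): a model can rank the classes worse overall (lower AUC) yet still put more samples on the right side of the one threshold you actually deploy (higher accuracy):

import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical test set: three negatives followed by three positives
y_true = np.array([0, 0, 0, 1, 1, 1])

# "Model 5"-like: ranks every positive above every negative (perfect ordering),
# but two positives sit below the 0.5 decision threshold
p_5 = np.array([0.10, 0.20, 0.30, 0.40, 0.45, 0.90])
# "Model 6"-like: gets more hard calls right at 0.5, but ranks one negative
# above all three positives
p_6 = np.array([0.10, 0.20, 0.80, 0.60, 0.70, 0.75])

for name, p in [("model-5-like", p_5), ("model-6-like", p_6)]:
    print(name,
          "AUC:", round(roc_auc_score(y_true, p), 3),
          "accuracy@0.5:", round(accuracy_score(y_true, (p > 0.5).astype(int)), 3))

# model-5-like: AUC 1.000, accuracy 0.667 -> perfect ranking, mediocre threshold
# model-6-like: AUC 0.667, accuracy 0.833 -> better hard calls, worse ranking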

For this reason, AUC has started receiving serious criticism in the literature (don't misread this - the analysis of the ROC curve itself is highly informative and useful); the Wikipedia entry and the references provided therein are highly recommended reading:

Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution.

[...]

One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system.

Emphasis mine - see also On the dangers of AUC...

Simple advice: don't use it.

Is there a way to say which of these 6 models is the best ?

Depends on the exact definition of "best"; if "best" means best for my own business problem that I am trying to solve (not an unreasonable definition for an ML practitioner), then it is the one that performs best according to the business metric that you yourself have defined as appropriate for your problem. This can never be the AUC, and normally it is not the loss either...
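
As a sketch of what that can look like in practice (the cost figures and names below are purely hypothetical assumptions, not something implied by your data), one common choice is to assign different monetary costs to false positives and false negatives and pick the model that minimises the total cost at your deployed threshold:

import numpy as np
from sklearn.metrics import confusion_matrix

# Assumed business costs (hypothetical): a missed positive is 5x as expensive
# as a false alarm
COST_FP, COST_FN = 1.0, 5.0

def business_cost(y_true, p, threshold=0.5):
    """Total cost of the hard decisions made at the chosen threshold."""
    y_pred = (p >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp * COST_FP + fn * COST_FN

# With y_test and one predicted-probability vector per candidate model
# (both hypothetical names here), the "best" model is simply:
#   best = min(candidate_probas, key=lambda p: business_cost(y_test, p))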

Sandbox answered 29/10, 2019 at 16:56
I have come to the same conclusions, especially on "the definition of best", even though nobody told me so. Houseleek
@SidaZhou nobody told me so either; it's one of those things that are seldom taught explicitly, and you are left to conclude it on your own in practice ;) Sandbox
