How to calculate a partial Area Under the Curve (AUC)
B

8

16

In scikit learn you can compute the area under the curve for a binary classifier with

roc_auc_score( Y, clf.predict_proba(X)[:,1] )

I am only interested in the part of the curve where the false positive rate is less than 0.1.

Given such a threshold false positive rate, how can I compute the AUC only for the part of the curve up to the threshold?

Here is an example with several ROC-curves, for illustration:

[Figure: ROC curves for several types of classifier.]

The scikit learn docs show how to use roc_curve

>>> import numpy as np
>>> from sklearn import metrics
>>> y = np.array([1, 1, 2, 2])
>>> scores = np.array([0.1, 0.4, 0.35, 0.8])
>>> fpr, tpr, thresholds = metrics.roc_curve(y, scores, pos_label=2)
>>> fpr
array([ 0. ,  0.5,  0.5,  1. ])
>>> tpr
array([ 0.5,  0.5,  1. ,  1. ])
>>> thresholds
array([ 0.8 ,  0.4 ,  0.35,  0.1 ])

Is there a simple way to go from this to the partial AUC?


It seems the only problem is how to compute the tpr value at fpr = 0.1 as roc_curve doesn't necessarily give you that.

Bolivar answered 16/9, 2016 at 17:51 Comment(0)
H
10

scikit-learn's roc_auc_score() now accepts a max_fpr parameter. In your case, set max_fpr=0.1 and the function will compute the partial AUC for you (the value returned is the standardized partial AUC over [0, max_fpr]). See https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html

Herbage answered 25/11, 2018 at 4:11 Comment(0)
R
15

Say we start with

import numpy as np
from sklearn import  metrics

Now we set the true y and predicted scores:

y = np.array([0, 0, 1, 1])

scores = np.array([0.1, 0.4, 0.35, 0.8])

(Note that y has shifted down by 1 from your problem. This is inconsequential: the exact same results (fpr, tpr, thresholds, etc.) are obtained whether predicting 1, 2 or 0, 1, but some sklearn.metrics functions are a drag if not using 0, 1.)

Let's see the AUC here:

>>> metrics.roc_auc_score(y, scores)
0.75

As in your example:

fpr, tpr, thresholds = metrics.roc_curve(y, scores)
>>> fpr, tpr
(array([ 0. ,  0.5,  0.5,  1. ]), array([ 0.5,  0.5,  1. ,  1. ]))

This gives the following plot:

plot([0, 0.5], [0.5, 0.5], [0.5, 0.5], [0.5, 1], [0.5, 1], [1, 1]);

[Figure: the resulting ROC curve]

By construction, the ROC for a finite-length y will be composed of rectangles:

  • For a high enough threshold, everything will be classified as negative.

  • As the threshold decreases continuously, at discrete points, some negative classifications will be changed to positive.

So, for a finite y, the ROC will always be characterized by a sequence of connected horizontal and vertical lines leading from (0, 0) to (1, 1).

The AUC is the sum of these rectangles. Here, as shown above, the AUC is 0.75, as the rectangles have areas 0.5 * 0.5 + 0.5 * 1 = 0.75.

In some cases, people choose to calculate the AUC by linear interpolation. Say the length of y is much larger than the number of points actually calculated for the FPR and TPR. Then linear interpolation approximates what the points in between might have been. Some people also follow the conjecture that, had y been large enough, the points in between would lie on straight lines. For the step-shaped curve here, reproducing the value that sklearn.metrics returns (0.75) requires rectangle, not trapezoidal, summation over the points used below.

Let's write our own function to calculate the AUC directly from fpr and tpr:

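def auc_from_fpr_tpr(fpr, tpr, trapezoid=False):
    # Keep only the last point of each run of equal FPR values, plus the final point
    # (fpr and tpr are NumPy arrays).
    inds = [i for (i, (s, e)) in enumerate(zip(fpr[: -1], fpr[1: ])) if s != e] + [len(fpr) - 1]
    fpr, tpr = fpr[inds], tpr[inds]
    area = 0
    ft = list(zip(fpr, tpr))  # a list, so that it can be sliced in Python 3
    for p0, p1 in zip(ft[: -1], ft[1: ]):
        # Width of the interval times either the mean height (trapezoid) or the left height (rectangle).
        area += (p1[0] - p0[0]) * ((p1[1] + p0[1]) / 2 if trapezoid else p0[1])
    return area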

This function takes the FPR and TPR, and an optional parameter stating whether to use trapezoidal summation. Running it, we get:

>>> auc_from_fpr_tpr(fpr, tpr), auc_from_fpr_tpr(fpr, tpr, True)
(0.75, 0.875)

We get the same result as sklearn.metrics for the rectangle summation, and a different, higher, result for trapezoid summation.

So, now we just need to see what happens to the FPR/TPR points if we terminate at an FPR of 0.1. We can do this with the bisect module:

import bisect

def get_fpr_tpr_for_thresh(fpr, tpr, thresh):
    p = bisect.bisect_left(fpr, thresh)
    fpr = fpr.copy()
    fpr[p] = thresh
    return fpr[: p + 1], tpr[: p + 1]

How does this work? It finds where thresh would be inserted into fpr. Given the properties of the FPR (it must start at 0), the insertion point must lie on a horizontal line. Thus all rectangles before this one are unaffected, all rectangles after it are removed, and this one is possibly shortened.

Let's apply it:

fpr_thresh, tpr_thresh = get_fpr_tpr_for_thresh(fpr, tpr, 0.1)
>>> fpr_thresh, tpr_thresh
(array([ 0. ,  0.1]), array([ 0.5,  0.5]))

Finally, we just need to calculate the AUC from the updated versions:

>>> auc_from_fpr_tpr(fpr_thresh, tpr_thresh), auc_from_fpr_tpr(fpr_thresh, tpr_thresh, True)
(0.050000000000000003, 0.050000000000000003)

In this case, both the rectangle and trapezoid summations give the same results. Note that in general, they will not. For consistency with sklearn.metrics, the first one should be used.
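The two steps above (clipping at the FPR threshold, then summing rectangles) can be sketched as a single function. This is a minimal pure-Python sketch and not part of the original answer: the name partial_step_auc is hypothetical, and it assumes that fpr is sorted in ascending order and that the curve is treated as a step function whose height on each interval is the TPR at the interval's left end.

```python
import bisect


def partial_step_auc(fpr, tpr, max_fpr):
    """Rectangle-rule area under a step ROC curve from fpr[0] up to max_fpr.

    Hypothetical helper: assumes fpr is sorted ascending and that the height
    on each interval is the TPR at the interval's left end.
    """
    # Index of the first FPR value that is >= max_fpr.
    p = bisect.bisect_left(fpr, max_fpr)
    # Keep the points strictly below max_fpr and close the curve at max_fpr.
    xs = list(fpr[:p]) + [max_fpr]
    area = 0.0
    for i in range(len(xs) - 1):
        # Zero-width intervals (repeated FPR values) contribute nothing.
        area += (xs[i + 1] - xs[i]) * tpr[i]
    return area


fpr = [0.0, 0.5, 0.5, 1.0]
tpr = [0.5, 0.5, 1.0, 1.0]
print(partial_step_auc(fpr, tpr, 0.1))  # approximately 0.05
print(partial_step_auc(fpr, tpr, 1.0))  # the full rectangle-rule AUC, 0.75
```

Because repeated FPR values produce zero-width rectangles, this variant does not need the de-duplication step used in auc_from_fpr_tpr.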

Rafaelrafaela answered 25/9, 2016 at 13:6 Comment(11)
I am a little confused, as all the material I can find online says we should use the trapezoidal rule. See stats.stackexchange.com/questions/145566/… for example: "We can very easily calculate the area under the ROC curve, using the formula for the area of a trapezoid:" - Bolivar
@eleanora That is discussing the case where the curve is continuous. That is not the case here. As I showed above, the 0.75 result (which is what sklearn.metrics.roc_auc_score returns) is obtained by the rectangular summation; the (wrong) result by trapezoids would have been different. For continuous curves and fine enough granularity, the difference between rectangles and trapezoids eventually diminishes. Notwithstanding, I see why this is confusing, and will add an explanation. (Unfortunately, I'll be able to do it only a bit later.) - Rafaelrafaela
Thank you. When you look at the ROC curve, it also seems they are joining points with straight lines and not horizontal ones. Definitely confusing. - Bolivar
@eleanora Right, I agree. I'm planning on writing a long explanation of why the horizontal lines are right here, and why they're using the trapezoid lines there. Again, my apologies, I will only be able to do so after work (it's not a short explanation). - Rafaelrafaela
Your threshold is the opposite of what I expected, I think. That is, 0.1 should give the AUC for false positive rates up to 0.1. - Bolivar
@eleanora See update with both rectangular + trapezoid alternatives shown, an explanation, and a change to terminate at 0.1. - Rafaelrafaela
Looking at scikit-learn.org/stable/modules/generated/… I see "Compute Area Under the Curve (AUC) using the trapezoidal rule". - Bolivar
@eleanora I'll look at the sources, then, as it doesn't match the output above. Note that I included a trapezoid version too. - Rafaelrafaela
Let us continue this discussion in chat. - Rafaelrafaela
@eleanora Trying to contact you on chat. - Rafaelrafaela
FYI - This implementation didn't give the right results for me. I think it's due to your get_fpr_tpr_for_thresh function not finding the right TPR intersection point for the given FPR level. I used real data, and even the AUC computation didn't match up with threshold = 1. I've added my implementation in another answer below, which I tested and which seems to work. - Pleach
A
6

Calculate your fpr and tpr values only over the range [0.0, 0.1].

Then, you can use numpy.trapz to evaluate the partial AUC (pAUC) like so:

pAUC = numpy.trapz(tpr_array, fpr_array)

This function uses the composite trapezoidal rule to evaluate the area under the curve. (In NumPy 2.0 and later the same function is available as numpy.trapezoid.)
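For the step this answer leaves open (restricting the curve to [0.0, 0.1] when roc_curve returns no point at exactly 0.1), one option is to interpolate the TPR linearly at the cut-off and then apply the trapezoidal rule. The following is a hedged pure-Python sketch rather than the answerer's own code: the name partial_auc_trapezoid is hypothetical, and it assumes fpr is sorted ascending, starts at 0, and that 0 < max_fpr < fpr[-1].

```python
from bisect import bisect_right


def partial_auc_trapezoid(fpr, tpr, max_fpr):
    """Trapezoidal area under the ROC polyline from FPR = 0 to max_fpr.

    Hypothetical helper: assumes fpr is sorted ascending, fpr[0] == 0 and
    0 < max_fpr < fpr[-1].
    """
    fpr = [float(v) for v in fpr]
    tpr = [float(v) for v in tpr]
    # Index of the first point whose FPR is strictly greater than max_fpr.
    stop = bisect_right(fpr, max_fpr)
    x0, x1 = fpr[stop - 1], fpr[stop]
    y0, y1 = tpr[stop - 1], tpr[stop]
    # Linear interpolation of the TPR at max_fpr; x1 > x0 holds by construction.
    tpr_at_max = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
    xs = fpr[:stop] + [max_fpr]
    ys = tpr[:stop] + [tpr_at_max]
    # Composite trapezoidal rule over the clipped curve.
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0
               for i in range(len(xs) - 1))


# A curve with a sloped first segment: TPR rises from 0 to 0.6 over FPR 0 to 0.2.
print(partial_auc_trapezoid([0.0, 0.2, 1.0], [0.0, 0.6, 1.0], 0.1))  # approximately 0.015
```

The function accepts plain lists or NumPy arrays, so the fpr and tpr arrays returned by roc_curve can be passed in directly.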

Allergen answered 24/9, 2016 at 17:16 Comment(9)
Thank you. Would you mind filling in the last bit? That is, how to compute the fpr and tpr values only over the range [0.0, 0.1]. - Bolivar
I don't think trapezoidal integration is applicable here at all - there's absolutely no reason it will approximate the true integral, which is inherently rectangular. - Rafaelrafaela
@AmiTavory Are you sure? See faculty.psau.edu.sa/filedownload/… for example. - Bolivar
@eleanora Nope, am not sure (didn't have enough time to go over the math behind your question), but I think so. Your link hangs forever (at least for me), BTW. - Rafaelrafaela
It seems the only problem is computing the tpr value at fpr = 0.1, which you don't get from roc_curve. - Bolivar
@AmiTavory The trapezoidal approximation is indeed not the "true" integration, as it is an approximation. But it often follows the shape of the curve better than rectangles do, since rectangles require a lot of samples to avoid under- or over-estimating the area. An accurate result would require the function of the curve in order to calculate its integral, but in most cases it is unknown and only a limited number of samples are evaluated with classifiers (e.g. in intervals of 0.01). Since the data is limited, trapezoidal has proven to be a better approximation in my experience. - Allergen
@eleanora It depends on your situation. If you already have your fpr and tpr values in the range [0,1], then you simply need to filter them using something like numpy.where with the condition fpr < 0.1. If you only have your binary prediction results (e.g. 0 or 1 for class C1 or C2), you will first need to determine whether each of these predictions is right or wrong in terms of FP, TP, FN or TN. Then, it would be easy to calculate the tpr and fpr for any given threshold. I can guide you according to your situation. - Allergen
If you filter using numpy.where, don't you then need to estimate the value at fpr = 0.1? - Bolivar
@eleanora I would take the closest value to 0.1 and the one just above to interpolate the value at fpr=0.1. Since you do not have it directly, you would have to resort to this kind of approximation. - Allergen
H
1

That depends on whether the FPR is the x-axis or y-axis (independent or dependent variable).

If it's x, the calculation is trivial: calculate only over the range [0.0, 0.1].

If it's y, then you first need to solve the curve for y = 0.1. This partitions the x-axis into areas you need to calculate, and those that are simple rectangles with a height of 0.1.

For illustration, assume that you find the function exceeding 0.1 in two ranges: [x1, x2] and [x3, x4]. Calculate the area under the curve over the ranges

[0, x1]
[x2, x3]
[x4, ...]

To this, add the rectangles under y=0.1 for the two intervals you found:

area += (x2-x1 + x4-x3) * 0.1

Is that what you need to move you along?

Haeckel answered 16/9, 2016 at 19:12 Comment(4)
I have only ever used a function to compute the AUC. The fpr is on the X axis (see example in the question) but I don't know how to compute the AUC. - Bolivar
You calculate using the same function. If it works only on the whole curve, then crop your curve data at X=0.1 before you call the function. - Haeckel
predict_proba gives you a probability of being in class 1 for every vector. How would you crop this suitably? - Bolivar
How do you compute the tpr value at fpr=0.1? - Bolivar
P
1

I implemented the current best answer and it did not give the right results in all circumstances. I reimplemented and tested the implementation below. I also leveraged the inbuilt trapezoidal AUC function vs. recreating that from scratch.

import bisect

from sklearn.metrics import auc, roc_curve

def line(x_coords, y_coords):
    """
    Given a pair of coordinates (x1,y1), (x2,y2), define the line equation. Note that this is the entire line vs.
    the line segment.

    Parameters
    ----------
    x_coords: Numpy array of 2 points corresponding to x1,x2
    y_coords: Numpy array of 2 points corresponding to y1,y2

    Returns
    -------
    (Gradient, intercept) tuple pair
    """
    if (x_coords.shape[0] < 2) or (y_coords.shape[0] < 2):
        raise ValueError('At least 2 points are needed to define'
                         ' a line, but x_coords.shape = %s' % (x_coords.shape,))
    # A vertical segment has no finite gradient.
    if ((x_coords[0]-x_coords[1]) == 0):
        raise ValueError("gradient is infinity")
    gradient = (y_coords[0]-y_coords[1])/(x_coords[0]-x_coords[1])
    intercept = y_coords[0] - gradient*1.0*x_coords[0]
    return (gradient, intercept)

def x_val_line_intercept(gradient, intercept, x_val):
    """
    Given a x=X_val vertical line, what is the intersection point of that line with the 
    line defined by the gradient and intercept. Note: This can be further improved by using line
    segments.

    Parameters
    ----------
    gradient
    intercept

    Returns
    -------
    (x_val, y) corresponding to the intercepted point. Note that this will always return a result.
    There is no check for whether the x_val is within the bounds of the line segment.
    """    
    y = gradient*x_val + intercept
    return (x_val, y)

def get_fpr_tpr_for_thresh(fpr, tpr, thresh):
    """
    Derive the partial ROC curve to the point based on the fpr threshold.

    Parameters
    ----------
    fpr: Numpy array of the sorted FPR points that represent the entirety of the ROC.
    tpr: Numpy array of the sorted TPR points that represent the entirety of the ROC.
    thresh: The threshold based on the FPR to extract the partial ROC based to that value of the threshold.

    Returns
    -------
    thresh_fpr: The FPR points that represent the partial ROC to the point of the fpr threshold.
    thresh_tpr: The TPR points that represent the partial ROC to the point of the fpr threshold
    """    
    p = bisect.bisect_left(fpr, thresh)
    thresh_fpr = fpr[:p+1].copy()
    thresh_tpr = tpr[:p+1].copy()
    g, i = line(fpr[p-1:p+1], tpr[p-1:p+1])
    new_point = x_val_line_intercept(g, i, thresh)
    thresh_fpr[p] = new_point[0]
    thresh_tpr[p] = new_point[1]
    return thresh_fpr, thresh_tpr

def partial_auc_scorer(y_actual, y_pred, decile=1):
    """
    Derive the AUC based on the partial ROC curve from FPR=0 to FPR=decile threshold.

    Parameters
    ----------
    y_actual: numpy array of the actual labels.
    y_pred: Numpy array of the predicted class probabilities, as returned by predict_proba (the last column is used).
    decile: The FPR threshold up to which the partial ROC is extracted.

    Returns
    -------
    AUC of the partial ROC. A value that ranges from 0 to 1.
    """        
    y_pred = list(map(lambda x: x[-1], y_pred))
    fpr, tpr, _ = roc_curve(y_actual, y_pred, pos_label=1)
    fpr_thresh, tpr_thresh = get_fpr_tpr_for_thresh(fpr, tpr, decile)
    return auc(fpr_thresh, tpr_thresh)
Pleach answered 11/2, 2018 at 6:36 Comment(0)
G
1

For a large enough number of points in the fpr and tpr arrays, you might be able to ignore the edge effects. At least as a first pass to think through the problem, let's do that. Let's call the false positive rate threshold fprt. Take a step back and ignore that it's an ROC curve for now. We can exclude data where fpr > fprt because we don't need the area under that part of the curve. We can plot that using

i = fpr <= fprt
roc_display = RocCurveDisplay(fpr=fpr[i], tpr=tpr[i]).plot()

We can get the area of that using

pauc_approx = auc(fpr[i], tpr[i])

Now this might be good enough. The problem is on the right side of the graph where we excluded data. In your example if the fprt is 0.1 and there was fpr data at ... 0.07, 0.09, 0.12 ... we would cut off the area gathering at 0.09, but our fprt is 0.1, losing some area we should have gathered. We can fix that though by adding that slice back in as a rectangle:

max_i = np.argmax(fpr[i])
pauc_extra = (fprt-fpr[i][max_i]) * tpr[i][max_i]
pauc_better = pauc_approx + pauc_extra

Here is an example from some of my data. It has around 2000 samples. Here is the full ROC curve: [Figure: Full ROC Curve]

Here is the curve with the fpr data > 0.10 excluded: [Figure: ROC for fpr <= 0.10]

The area as calculated by pauc_approx on this data is 0.014035. You can see that the graph does not extend all the way to x = 0.10. The last retained fpr value turns out to be 0.096153, and the y value there is 0.250417. So we can work out the rectangle and add it to the area: pauc_extra = (fprt - fpr[i][max_i]) * tpr[i][max_i] is (0.10 - 0.09615384615384616) * 0.25041736227045075, which equals an area of 0.0009631437010401953 to add to pauc_approx for a better estimate of the area.

Not asked as part of the original question but this approach can be expanded to the case of a TPR threshold, which is what I need. Below is the example chart from Wikipedia for partial AUROC. Take a look at this graph geometrically and you can figure out that we can exclude data for both TPR and FPR not meeting the thresholds and then need to shift the data down on the y axis by the TPR threshold. Using that new data we can calculate the appropriate area under that portion of the curve as shown. Corrections on the right side can be added for more accuracy.

https://en.wikipedia.org/wiki/File:Two_way_pAUC.png
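The two-way idea described above (drop points outside both thresholds, shift the remaining TPR values down by the TPR threshold, then integrate) can be sketched as follows. This is only an illustration under stated assumptions: the function name two_way_pauc_approx and its parameter names are hypothetical, the trapezoidal rule is used for the integration, and the edge corrections mentioned above are deliberately left out.

```python
def two_way_pauc_approx(fpr, tpr, max_fpr, min_tpr):
    """Approximate area between the ROC curve and the line TPR = min_tpr,
    restricted to points with FPR <= max_fpr and TPR >= min_tpr.

    Hypothetical helper: ignores the slivers at the edges of the region,
    so it is only a first-pass estimate.
    """
    # Keep the points that satisfy both thresholds.
    kept = [(x, y) for x, y in zip(fpr, tpr) if x <= max_fpr and y >= min_tpr]
    if len(kept) < 2:
        return 0.0
    xs = [x for x, _ in kept]
    # Shift the curve down so that the TPR threshold becomes the baseline.
    ys = [y - min_tpr for _, y in kept]
    # Trapezoidal rule over the retained, shifted points.
    return sum((xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0
               for i in range(len(xs) - 1))


fpr = [0.0, 0.1, 0.2, 0.3, 1.0]
tpr = [0.0, 0.5, 0.8, 0.9, 1.0]
print(two_way_pauc_approx(fpr, tpr, max_fpr=0.3, min_tpr=0.5))  # approximately 0.05
```

With many points near the thresholds the omitted edge slivers are small. With sparse data they can matter, and the rectangle correction shown earlier would need to be added on each side.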

Goldfish answered 23/12, 2022 at 13:5 Comment(0)
G
1

The max_fpr parameter in roc_auc_score() doesn't work directly because the partial AUC (pAUC) calculated is standardized. You will have to reverse calculate pAUC based on the standardized pAUC.
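The reverse calculation can be sketched as below, assuming the standardization is the McClish correction that the scikit-learn documentation describes for max_fpr: standardized = 0.5 * (1 + (pAUC - min_area) / (max_area - min_area)), with min_area = 0.5 * max_fpr**2 and max_area = max_fpr. The function name raw_pauc_from_standardized is hypothetical.

```python
def raw_pauc_from_standardized(standardized, max_fpr):
    """Invert the McClish standardization to recover the raw partial AUC.

    Hypothetical helper: assumes the standardized value was computed as
    0.5 * (1 + (pauc - min_area) / (max_area - min_area)).
    """
    min_area = 0.5 * max_fpr ** 2  # area under the chance diagonal up to max_fpr
    max_area = max_fpr             # area under a perfect classifier up to max_fpr
    return (2.0 * standardized - 1.0) * (max_area - min_area) + min_area


# A standardized score of 0.5 corresponds to the chance-level area.
print(raw_pauc_from_standardized(0.5, 0.1))  # approximately 0.005
# A standardized score of 1.0 corresponds to the full strip of width max_fpr.
print(raw_pauc_from_standardized(1.0, 0.1))  # approximately 0.1
```

In use, the first argument would be the value returned by roc_auc_score(y_true, y_score, max_fpr=0.1), with the same max_fpr passed as the second argument.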

Grosz answered 13/7, 2023 at 15:11 Comment(0)
M
0

@eleanora I think your impulse to use sklearn's generic metrics.auc method is correct (that's what I've done). It should be straightforward once you get your tpr and fpr point sets (and you can use scipy's interpolation methods to approximate exact points in either series).

Makeweight answered 23/8, 2017 at 21:2 Comment(0)
