Converting IsolationForest decision scores to probabilities

I am looking to create a generic function to convert the output of sklearn's IsolationForest decision_function into true probabilities in [0.0, 1.0].

I am aware of, and have read, the original paper, and I understand mathematically that the output of that function is not a probability, but is instead derived from the average path length each base estimator needs to isolate a sample.

Problem

I want to convert that output to a probability in the form of a tuple (x,y) where x=P(anomaly) and y=1-x.

Current Approach

def convert_probabilities(predictions, scores):
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler

    # Default every sample to a placeholder tuple
    new_scores = [(1, 1) for _ in range(len(scores))]

    # Split sample indices by the model's predicted label
    anomalous_idxs = [i for i in range(len(predictions)) if predictions[i] == -1]
    regular_idxs = [i for i in range(len(predictions)) if predictions[i] == 1]

    anomalous_scores = np.abs(np.asarray([scores[i] for i in anomalous_idxs]))
    regular_scores = np.abs(np.asarray([scores[i] for i in regular_idxs]))

    # Note: each group is scaled independently with its own min/max
    scaler = MinMaxScaler()
    anomalous_scores_scaled = scaler.fit_transform(anomalous_scores.reshape(-1, 1))
    regular_scores_scaled = scaler.fit_transform(regular_scores.reshape(-1, 1))

    # Tuples are (P(anomaly), P(normal))
    for i, j in zip(anomalous_idxs, range(len(anomalous_scores_scaled))):
        new_scores[i] = (anomalous_scores_scaled[j][0], 1 - anomalous_scores_scaled[j][0])

    for i, j in zip(regular_idxs, range(len(regular_scores_scaled))):
        new_scores[i] = (1 - regular_scores_scaled[j][0], regular_scores_scaled[j][0])

    return new_scores

modified_scores = convert_probabilities(model_predictions, model_decisions)

Minimal, Reproducible Example

import pandas as pd
from sklearn.datasets import make_classification, load_iris
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

# Get data
X, y = load_iris(return_X_y=True, as_frame=True)
anomalies, anomalies_classes = make_classification(n_samples=int(X.shape[0]*0.05), n_features=X.shape[1], hypercube=False, random_state=60, shuffle=True)
anomalies_df = pd.DataFrame(data=anomalies, columns=X.columns)

# Split into train/test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=60)

# Combine testing data
X_test['anomaly'] = 1
anomalies_df['anomaly'] = -1
X_test = pd.concat([X_test, anomalies_df], ignore_index=True)  # DataFrame.append was removed in pandas 2.0
y_test = X_test['anomaly']
X_test.drop('anomaly', inplace=True, axis=1)

# Build a model
model = IsolationForest(n_jobs=1, bootstrap=False, random_state=60)

# Fit it
model.fit(X_train)

# Test it
model_predictions = model.predict(X_test)
model_decisions = model.decision_function(X_test)

# Print results
for a,b,c in zip(y_test, model_predictions, model_decisions):
    print_str = """
    Class: {} | Model Prediction: {} | Model Decision Score: {}
    """.format(a,b,c)

    print(print_str)

Observed Results

modified_scores = convert_probabilities(model_predictions, model_decisions)

# Print results
for a,b in zip(model_predictions, modified_scores):
    ans = False
    if a==-1:
        if b[0] > b[1]:
            ans = True
        else:
            ans = False
    elif a==1:
        if b[1] > b[0]:
            ans=True
        else:
            ans=False
    print_str = """
    Model Prediction: {} | Converted Probabilities: {} | Correct: {}
    """.format(a, b, str(ans))

    print(print_str)

This shows some odd results, such as:

Model Prediction: 1 | Converted Probabilities: (0.17604259932311161, 0.8239574006768884) | Correct: True
Model Prediction: 1 | Converted Probabilities: (0.7120367886017022, 0.28796321139829784) | Correct: False
Model Prediction: 1 | Converted Probabilities: (0.7251531538304419, 0.27484684616955807) | Correct: False
Model Prediction: -1 | Converted Probabilities: (0.16776449326185877, 0.8322355067381413) | Correct: False
Model Prediction: 1 | Converted Probabilities: (0.8395087028516501, 0.1604912971483499) | Correct: False

Model Prediction: 1 | Converted Probabilities: (0.0, 1.0) | Correct: True

How could it be possible for the prediction to be -1 (anomaly), but the probability to only be 37%? Or for the prediction to be 1 (normal), but the probability is 26%?

Note: the toy dataset is labeled, but an unsupervised anomaly detection algorithm obviously does not use the labels.

Kuhl asked 2/5, 2021 at 15:04 – Comments (3)
Have you plotted the calibration curve? Or attempted to calibrate, for example using isotonic regression? Ref scikit-learn.org/stable/modules/calibration.html – Phocine
How would that work, since this isn't true classification but is instead an unsupervised approach? @JonNordby – Kuhl
One would have to use a labeled validation set (but not a labeled training set). – Phocine
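
As a sketch of the calibration idea from these comments (assuming a held-out labeled validation set X_val / y_val with 1 = normal and -1 = anomaly; those names are illustrative and not part of the original post):

import numpy as np
from sklearn.isotonic import IsotonicRegression

scores_val = model.decision_function(X_val)

# Learn a monotone map from decision scores to P(anomaly) on the validation set.
# increasing='auto' lets it pick the direction (sklearn scores are lower for anomalies).
iso = IsotonicRegression(y_min=0.0, y_max=1.0, increasing='auto', out_of_bounds='clip')
iso.fit(scores_val, (y_val == -1).astype(float))

p_anomaly = iso.predict(model.decision_function(X_test))
probs = np.column_stack([p_anomaly, 1 - p_anomaly])  # (P(anomaly), P(normal))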

Though months later, there is an answer to this question.

A paper published in 2011 addressed exactly this topic: unifying anomaly scores into probabilities.

In fact, the pyod library has a common predict_proba method, which offers an option to use this unifying approach.

Here is a code implementation of that (adapted from their source):

import numpy as np
from scipy.special import erf

def convert_probabilities(data, model):
    decision_scores = model.decision_function(data)
    probs = np.zeros([data.shape[0], 2])  # two columns, one per class
    # Standardize the scores, then squash them with the Gaussian error function
    pre_erf_score = (decision_scores - np.mean(decision_scores)) / (np.std(decision_scores) * np.sqrt(2))
    erf_score = erf(pre_erf_score)
    # sklearn's decision_function is higher for inliers, so column 1 is P(inlier)
    probs[:, 1] = erf_score.clip(0, 1).ravel()
    probs[:, 0] = 1 - probs[:, 1]
    return probs

(For reference, pyod does have an Isolation Forest implementation)
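
For completeness, here is a rough sketch of using pyod directly instead (a sketch under the assumption that pyod is installed; its IForest wrapper and the predict_proba method with method='unify' exist, but double-check the exact arguments against the pyod documentation):

from pyod.models.iforest import IForest

# Fit pyod's wrapper around sklearn's IsolationForest on the training data
clf = IForest(random_state=60)
clf.fit(X_train)

# method='unify' applies the score-unification approach from the 2011 paper;
# in pyod's convention the columns are (P(inlier), P(outlier))
probs = clf.predict_proba(X_test, method='unify')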

Kuhl answered 14/10, 2021 at 20:54 – Comments (2)
Quick question: did you run into "IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed" when you used the convert_probabilities function? I tried the function out and got that error message. – Patin
I did not -- perhaps post a question with some sample data etc. and I can try to answer it? @Patin – Kuhl

There are three different issues here. First, there is no guarantee that a lower score from the IsolationForest implies a higher probability of the sample being an outlier. That is, if one batch of samples gets model_decisions scores in the range (-0.3, -0.2) and another in (0.1, 0.2), that does not necessarily mean the probability of the first batch being outliers is higher (though it usually would be).

The second issue is the actual mapping function from scores to probabilities. Assuming that lower scores correspond to a lower probability of being a regular sample (and a higher probability of being an anomaly), the mapping from scores to probabilities does not necessarily have to be linear (such as MinMaxScaler). It may happen that for your data you will need to find your own function. It could be a piecewise linear function, as @JonNordby suggested. I personally prefer a logistic function to map scores to probabilities. It is especially convenient here because model_decisions is centered around zero and negative values indicate an anomaly. So you can use something like

def logf(x, alfa=10):
    # Logistic (sigmoid) map: negative scores (anomalies) give P(normal) < 0.5
    return 1/(1 + np.exp(-alfa * x))

for mapping from scores to probabilities. The alfa parameter controls how tightly the values are packed around the decision boundary. Again, this is not necessarily the best mapping function; it is just something I like to use.
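
As a quick illustrative usage (my addition, not part of the original answer; it reuses model_decisions from the question and packs the result into the (P(anomaly), P(normal)) tuples asked for):

import numpy as np

p_normal = logf(np.asarray(model_decisions), alfa=10)  # P(normal) for each sample
probs = list(zip(1 - p_normal, p_normal))              # (P(anomaly), P(normal))

# A larger alfa makes the curve steeper around the zero decision boundary:
print(logf(-0.1, alfa=1), logf(-0.1, alfa=10))         # ~0.475 vs ~0.269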

The last issue is connected to the first one, and it probably answers your question. Even if the scores generally correlate with the probability of not being an anomaly, there is no guarantee that this holds for every sample. So it might happen that a certain point with a score of 0.1 is an anomaly, while one with -0.1 is a normal point that was mistakenly flagged as an anomaly. The decision of whether a sample is an anomaly is simply whether model_decisions is smaller than zero, and for samples with scores close to zero the probability of a mistake is higher.

Nosh answered 10/5, 2021 at 15:07 – Comments (2)
"Alpha parameter controls how tight the values are packed around the decision boundary"... don't you need to know the model's decision boundary to do this appropriately? – Kuhl
No, the model decision boundary is zero (set by IsolationForest). Alpha controls the "width": logf(-0.1, 1)=0.47, logf(-0.1, 10)=0.269 – Nosh

Why this is happening

You are observing nonsensical probabilities because you are fitting a different scaler for the inliers and for the outliers. As a result, if the range of your decision scores is [0.5, 1.5] for inliers, you will map these scores to probabilities [0, 1]. Likewise, if the range of the decision scores is [-1.5, -0.5] for outliers, you will map these scores to probabilities [0, 1] as well. You end up with the probability of being an inlier set to 1 whether the decision score is 1.5 OR -0.5. This is obviously not what you want: an observation with decision score -0.5 should have a lower probability than an observation with decision score 1.5.
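
A tiny sketch of that effect with made-up numbers (not from the original post):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

inlier_scores = np.array([0.5, 1.0, 1.5]).reshape(-1, 1)
outlier_scores = np.array([-1.5, -1.0, -0.5]).reshape(-1, 1)

# Fitting a separate scaler per group maps each group's maximum to 1.0
print(MinMaxScaler().fit_transform(inlier_scores).ravel())   # [0.  0.5 1. ]
print(MinMaxScaler().fit_transform(outlier_scores).ravel())  # [0.  0.5 1. ]
# so a decision score of 1.5 and one of -0.5 both end up mapped to 1.0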

First option

The first solution is to fit one single scaler on all your scores. This also considerably simplifies your conversion function, as follows:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

def convert_probabilities(predictions, scores):
    # One scaler fit on ALL decision scores, regardless of the predicted label
    scaler = MinMaxScaler()
    scores_scaled = scaler.fit_transform(scores.reshape(-1, 1))

    # Columns: (P(outlier), P(inlier)); a higher decision score means more inlier-like
    new_scores = np.concatenate((1 - scores_scaled, scores_scaled), axis=1)
    return new_scores

Each row of the result is a pair (probability of being an outlier, probability of being an inlier) with the desired monotonic behavior.

Limitation of this approach

One of the main limitations of this approach is that there is no guarantee that the probability cut-off between inliers and outliers will be 0.5, which is the most intuitive choice. You might end up with a scenario like "if the probability of being an inlier is less than 60%, then the model predicts it is an outlier".

Second option

The second option is closer to what you wanted to do. You do indeed fit one scaler for each category; however, unlike what you did, the two scalers do not map to the same range. You can have outlier scores scaled to [0, 0.5] and inlier scores scaled to [0.5, 1]. This has the benefit of creating an intuitive decision boundary at 0.5, where all probabilities above it are inliers and vice-versa. It would then look like this:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

def convert_probabilities(predictions, scores):
    # Predicted inliers are scaled to [0.5, 1], predicted outliers to [0, 0.5]
    scaler_inliers = MinMaxScaler((0.5, 1))
    scaler_outliers = MinMaxScaler((0, 0.5))

    scores_inliers_scaled = scaler_inliers.fit_transform(scores[predictions == 1].reshape(-1, 1))
    scores_outliers_scaled = scaler_outliers.fit_transform(scores[predictions == -1].reshape(-1, 1))

    # Reassemble the scaled scores in the original sample order
    scores_scaled = np.zeros((len(scores), 1))
    scores_scaled[predictions == 1] = scores_inliers_scaled
    scores_scaled[predictions == -1] = scores_outliers_scaled

    # Columns: (P(outlier), P(inlier))
    new_scores = np.concatenate((1 - scores_scaled, scores_scaled), axis=1)
    return new_scores
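
A quick usage check against the variables from the question (my addition, only to show the intended interpretation):

probs = convert_probabilities(model_predictions, model_decisions)

# Column 1 is P(inlier): predicted inliers land in [0.5, 1] and predicted
# outliers in [0, 0.5], so the intuitive 0.5 cut-off agrees with the predictions
# (up to the shared boundary value of 0.5 discussed in the limitation below)
for pred, (p_out, p_in) in zip(model_predictions[:5], probs[:5]):
    print(pred, round(p_out, 3), round(p_in, 3))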

Limitation of this approach

The main limitation is how the two scaled ranges are joined. In the code example above, both are connected at 0.5, which means that the "best outlier" and the "worst inlier" get the same probability of 0.5 even though they do not have the same decision score. One option is to change the scaling ranges to [0, 0.49] and [0.51, 1] or so, but as you can see, this gets even more arbitrary.

Remit answered 2/5, 2021 at 20:41 – Comments (5)
But scaling all of the scores together doesn't work properly, which is why I tried separating them. Using this approach, I still get situations where P(anomaly) < 1-P(anomaly) and yet the prediction is -1 (anomaly). – Kuhl
I added another solution option which I believe takes care of that concern. – Remit
Thanks for the update. I believe the principal concern is that we know, regardless of what the distribution of numbers is, that more positive == higher probability of an inlier, and more negative == higher probability of an outlier. The difficulty is understanding how to map those probabilities given the decision boundary the model has learned. I can't conceive of a way to do that. – Kuhl
@wundermahn, I'm not quite sure what you are looking for at this stage. The answers I provided correct the issues you mentioned in your question description. Like I mentioned, they are not ideal, but there is no way to answer this properly given that isolation forests are a non-probabilistic algorithm. Any solution you might come up with to get probabilities will therefore have its flaws and will be highly arbitrary. – Remit
Sure. Isolation Forests are not probabilistic; I note that in the question. In order to derive true "probabilities", the decision boundary learned by the model is going to have to be taken into account, to address one of your earlier points. The goal of the question is to arrive at an output that maps to what a probability would be. I appreciate your time nonetheless. – Kuhl
