Scikit-learn predict_proba gives wrong answers

This is a follow-up question from How to know what classes are represented in return array from predict_proba in Scikit-learn

In that question, I quoted the following code:

>>> import sklearn
>>> sklearn.__version__
'0.13.1'
>>> from sklearn import svm
>>> model = svm.SVC(probability=True)
>>> X = [[1,2,3], [2,3,4]] # feature vectors
>>> Y = ['apple', 'orange'] # classes
>>> model.fit(X, Y)
>>> model.predict_proba([1,2,3])
array([[ 0.39097541,  0.60902459]])

I discovered in that question that this result represents the probability of the point belonging to each class, in the order given by model.classes_:

>>> zip(model.classes_, model.predict_proba([1,2,3])[0])
[('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]

So... this answer, if interpreted correctly, says that the point is probably an 'orange' (with a fairly low confidence, due to the tiny amount of data). But intuitively, this result is obviously incorrect, since the point given was identical to the training data for 'apple'. Just to be sure, I tested the reverse as well:

>>> zip(model.classes_, model.predict_proba([2,3,4])[0])
[('apple', 0.60705475211840931), ('orange', 0.39294524788159074)]

Again, obviously incorrect, but in the other direction.

Finally, I tried it with points that were much further away.

>>> X = [[1,1,1], [20,20,20]] # feature vectors
>>> model.fit(X, Y)
>>> zip(model.classes_, model.predict_proba([1,1,1])[0])
[('apple', 0.33333332048410247), ('orange', 0.66666667951589786)]

Again, the model predicts the wrong probabilities. BUT, the model.predict function gets it right!

>>> model.predict([1,1,1])[0]
'apple'

Now, I remember reading something in the docs about predict_proba being inaccurate for small datasets, though I can't seem to find it again. Is this the expected behaviour, or am I doing something wrong? If this IS the expected behaviour, then why do the predict and predict_proba functions disagree on the output? And importantly, how big does the dataset need to be before I can trust the results from predict_proba?

-------- UPDATE --------

Ok, so I did some more 'experiments' into this: the behaviour of predict_proba is heavily dependent on 'n', but not in any predictable way!

>>> def train_test(n):
...     X = [[1,2,3], [2,3,4]] * n
...     Y = ['apple', 'orange'] * n
...     model.fit(X, Y)
...     print "n =", n, zip(model.classes_, model.predict_proba([1,2,3])[0])
... 
>>> train_test(1)
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
>>> for n in range(1,10):
...     train_test(n)
... 
n = 1 [('apple', 0.39097541289393828), ('orange', 0.60902458710606167)]
n = 2 [('apple', 0.98437355278112448), ('orange', 0.015626447218875527)]
n = 3 [('apple', 0.90235408180319321), ('orange', 0.097645918196806694)]
n = 4 [('apple', 0.83333299908143665), ('orange', 0.16666700091856332)]
n = 5 [('apple', 0.85714254878984497), ('orange', 0.14285745121015511)]
n = 6 [('apple', 0.87499969631893626), ('orange', 0.1250003036810636)]
n = 7 [('apple', 0.88888844127886335), ('orange', 0.11111155872113669)]
n = 8 [('apple', 0.89999988018127364), ('orange', 0.10000011981872642)]
n = 9 [('apple', 0.90909082368682159), ('orange', 0.090909176313178491)]

How should I use this function safely in my code? At the very least, is there any value of n for which it will be guaranteed to agree with the result of model.predict?

Hildegardhildegarde answered 10/6, 2013 at 6:19 Comment(0)
M
21

If you use svm.LinearSVC() as the estimator, and .decision_function() (which is like svm.SVC's .predict_proba()) for sorting the results from the most probable class to the least probable one, this agrees with the .predict() function. Plus, this estimator is faster and gives almost the same results as svm.SVC().

The only drawback for you might be that .decision_function() gives a signed value, something like between -1 and 3, instead of a probability value, but it agrees with the prediction.
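
For illustration, here is a minimal sketch (not from the original answer) of this ranking approach on the question's toy data; it assumes a recent scikit-learn, where predict() and decision_function() expect 2-D inputs:

from sklearn import svm

X = [[1, 2, 3], [2, 3, 4]]   # feature vectors
Y = ['apple', 'orange']      # classes

clf = svm.LinearSVC()
clf.fit(X, Y)

sample = [[1, 2, 3]]
print(clf.predict(sample))               # predicted class label

# For a binary problem, decision_function returns one signed score per sample:
# negative favours clf.classes_[0], positive favours clf.classes_[1].
score = clf.decision_function(sample)[0]
ranked = clf.classes_[::-1] if score > 0 else clf.classes_
print(ranked)                            # classes from most to least likely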

Mymya answered 17/6, 2013 at 7:28 Comment(5)
That's interesting Bilal... I don't actually need the probabilities for my purpose, just the ordering. I think this is the answer I'm looking for. – Hildegardhildegarde
Yes, interesting. I had the same problem and used this method for ordering. It gave me better results than predict_proba(). – Mymya
Note that LinearSVC() will yield similar predictions as SVC(kernel='linear'), but not SVC(kernel='rbf'), which is the default kernel for SVC. – Pantheas
Giving me the same result as predict_proba. – Parcenary
It is stated that .decision_function() yields the confidence. For me it was not in [-1, 3]. See #26478500. – Heeler

predict_proba uses the Platt scaling feature of libsvm to calibrate the probabilities; see the notes on probability estimates in the scikit-learn SVM documentation.

So indeed the hyperplane predictions and the probability calibration can disagree, especially if you only have 2 samples in your dataset. It's weird that the internal cross-validation done by libsvm for scaling the probabilities does not fail (explicitly) in this case. Maybe this is a bug. One would have to dive into the Platt scaling code of libsvm to understand what's happening.
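
For intuition, here is a rough sketch (not libsvm's actual implementation) of the sigmoid that Platt scaling fits on top of the raw decision values; the parameters A and B are assumed to be estimated by libsvm's internal cross-validation on the training data:

import numpy as np

def platt_probability(decision_value, A, B):
    # Platt's sigmoid: P(y=1 | f) = 1 / (1 + exp(A*f + B)),
    # where f is the SVM decision value and A, B are fitted by
    # cross-validation on the training set.
    return 1.0 / (1.0 + np.exp(A * decision_value + B))

# With only two training samples, the cross-validation that fits A and B
# has almost nothing to work with, which is why the calibrated probabilities
# can contradict the hyperplane-based predict().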

Pantheas answered 10/6, 2013 at 8:34 Comment(1)
Just adding to this: In principle the cross-validation should agree with the decision boundary for large n. – Margenemargent

Food for thought here. I think I actually got predict_proba to work as-is. Please see the code below...

import numpy as np
import pandas as pd
from sklearn import naive_bayes, metrics

# Test data
TX = [[1,2,3], [4,5,6], [7,8,9], [10,11,12], [13,14,15], [16,17,18], [19,20,21], [22,23,24]]
TY = ['apple', 'orange', 'grape', 'kiwi', 'mango','peach','banana','pear']

VX2 = [[16,17,18], [19,20,21], [22,23,24], [13,14,15], [10,11,12], [7,8,9], [4,5,6], [1,2,3]]
VY2 = ['peach','banana','pear','mango', 'kiwi', 'grape', 'orange','apple']

VX2_df = pd.DataFrame(data=VX2) # convert to dataframe
VX2_df = VX2_df.rename(index=float, columns={0: "N0", 1: "N1", 2: "N2"})
VY2_df = pd.DataFrame(data=VY2) # convert to dataframe
VY2_df = VY2_df.rename(index=float, columns={0: "label"})

# NEW - in testing
def train_model(classifier, feature_vector_train, label, feature_vector_valid, valid_y, valid_x, is_neural_net=False):

    # fit the training dataset on the classifier
    classifier.fit(feature_vector_train, label)

    # predict the top n labels on validation dataset
    n = 5
    #classifier.probability = True
    probas = classifier.predict_proba(feature_vector_valid)
    predictions = classifier.predict(feature_vector_valid)

    #Identify the indexes of the top predictions
    #top_n_predictions = np.argsort(probas)[:,:-n-1:-1]
    top_n_predictions = np.argsort(probas, axis = 1)[:,-n:]

    #then find the associated SOC code for each prediction
    top_socs = classifier.classes_[top_n_predictions]

    #cast to a new dataframe
    top_n_df = pd.DataFrame(data=top_socs)

    #merge it up with the validation labels and descriptions
    results = pd.merge(valid_y, valid_x, left_index=True, right_index=True)
    results = pd.merge(results, top_n_df, left_index=True, right_index=True)

    conditions = [
        (results['label'] == results[0]),
        (results['label'] == results[1]),
        (results['label'] == results[2]),
        (results['label'] == results[3]),
        (results['label'] == results[4])]
    choices = [1, 1, 1, 1, 1]
    results['Successes'] = np.select(conditions, choices, default=0)

    print("Top 5 Accuracy Rate = ", sum(results['Successes'])/results.shape[0])
    print("Top 1 Accuracy Rate = ", metrics.accuracy_score(predictions, valid_y))

train_model(naive_bayes.MultinomialNB(), TX, TY, VX2, VY2_df, VX2_df)

Output: Top 5 Accuracy Rate = 1.0 Top 1 Accuracy Rate = 1.0

Couldn't get it to work for my own data though :(

Chemosmosis answered 28/2, 2019 at 2:33 Comment(0)
