sklearn svm area under ROC less than 0.5 for training data

I am using the SVM in sklearn v0.13.1 to try to solve a binary classification problem. I use k-fold cross-validation and compute the area under the ROC curve (roc_auc) to test the quality of my model. However, for some folds the roc_auc is less than 0.5, even for the training data. Shouldn't that be impossible? Shouldn't it always be possible for the algorithm to at least reach 0.5 on the data it is being trained on?

Here's my code:

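# sklearn 0.13-era API (cross_validation module, Python 2 print statement)
from sklearn import svm, cross_validation
from sklearn.metrics import roc_curve, auc

# myData (feature array) and classVector (0/1 labels) are defined elsewhere
classifier = svm.SVC(kernel='poly', degree=3, probability=True, max_iter=100000)
kf = cross_validation.KFold(len(myData), n_folds=3, indices=False)
for train, test in kf:
    Fit = classifier.fit(myData[train], classVector[train])

    # AUC on the held-out fold, using the probability of class 1 as the score
    probas_ = Fit.predict_proba(myData[test])
    fpr, tpr, thresholds = roc_curve(classVector[test], probas_[:,1])
    roc_auc = auc(fpr, tpr)

    # AUC on the data the model was fitted on
    probas_ = Fit.predict_proba(myData[train])
    fpr2, tpr2, thresholds2 = roc_curve(classVector[train], probas_[:,1])
    roc_auc2 = auc(fpr2, tpr2)

    print "Training auc: ", roc_auc2, " Testing auc: ", roc_auc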

The output looks like this:

    Training auc: 0.423920939062  Testing auc: 0.388436883629
    Training auc: 0.525472613736  Testing auc: 0.565581854043
    Training auc: 0.470917930528  Testing auc: 0.259344660194

Is an area under the curve of less than 0.5 meaningful? In principle, if both the train and test values are < 0.5 I could just invert the prediction for every point, but I am worried something is going wrong. I thought that even if I gave it completely random data, the algorithm should reach 0.5 on the training data?

Gouveia answered 5/2, 2014 at 20:20 Comment(0)

Indeed, you could invert your predictions: an AUROC below 0.5 means the scores rank the two classes in the wrong order, so inverting them gives 1 minus the AUROC. This is normally not a problem; just be consistent and either always or never reverse them, on both the training and test sets.
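To make the inversion concrete, here is a minimal sketch on synthetic data. The labels and the anti-correlated `scores` are made up for illustration and are not the asker's data. It only shows that replacing each score `p` by `1 - p` turns an AUROC of `a` into `1 - a`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)

# Hypothetical binary labels (assumption: 0/1 encoding, as in the question)
y_true = rng.randint(0, 2, size=200)

# Hypothetical scores that tend to be LOWER for the positive class,
# i.e. a classifier that ranks the classes the wrong way round
scores = 0.3 * (1 - y_true) + 0.7 * rng.rand(200)

auc_original = roc_auc_score(y_true, scores)
auc_inverted = roc_auc_score(y_true, 1.0 - scores)

# The two values should sum to 1 (up to floating-point error)
print("original: %.3f  inverted: %.3f" % (auc_original, auc_inverted))
```

If the inversion is applied, it has to be applied identically to every fold; flipping only the folds that fall below 0.5 would amount to peeking at the labels.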

The reason for this problem could be that the classifier.fit or the roc_curve functions misinterpreted the classVector you passed. It is probably better to fix that instead: read their documentation to learn exactly what data they expect. In particular, you didn't specify which label is positive. See the pos_label argument to roc_curve and make sure y_true was properly specified.
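One way to check the label handling is sketched below, assuming a recent scikit-learn and synthetic data (the arrays `X` and `y` and the linear kernel are illustrative choices, not taken from the question). It reads the column order of `predict_proba` from `classes_` instead of hard-coding column 1, and passes `pos_label` explicitly to `roc_curve`.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, auc

rng = np.random.RandomState(0)

# Synthetic, roughly linearly separable data (illustrative assumption)
X = rng.randn(200, 2)
y = (X[:, 0] + 0.5 * rng.randn(200) > 0).astype(int)

clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)

# classes_ lists the labels in the same order as predict_proba's columns
pos_col = list(clf.classes_).index(1)
scores = clf.predict_proba(X)[:, pos_col]

# Treat label 1 as the positive class
fpr, tpr, _ = roc_curve(y, scores, pos_label=1)
auc_pos1 = auc(fpr, tpr)

# Same scores, but declaring label 0 as positive mirrors the curve
fpr0, tpr0, _ = roc_curve(y, scores, pos_label=0)
auc_pos0 = auc(fpr0, tpr0)

print("pos_label=1: %.3f  pos_label=0: %.3f" % (auc_pos1, auc_pos0))
```

If the scores column and `pos_label` refer to the same class and the AUROC is still below 0.5, the label plumbing is probably not the cause.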

However, what is worrisome is that some of your AUROCs are > 0.5 on the training set while others are below it, and most of them are close to 0.5. This probably means that your classifier performs not much better than random.
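As a rough reference for what "not much better than random" looks like, the sketch below computes the AUROC of scores that are uninformative by construction. The sample size and number of repeats are arbitrary assumptions. It is not a model of the asker's classifier; it only illustrates that chance-level AUROCs scatter around 0.5 and regularly fall below it.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
n_samples, n_repeats = 100, 200  # arbitrary sizes chosen for illustration

# Balanced 0/1 labels
y = np.array([0, 1] * (n_samples // 2))

# AUROC of purely random scores, repeated many times
chance_aucs = np.array([
    roc_auc_score(y, rng.rand(n_samples)) for _ in range(n_repeats)
])

print("mean: %.3f  min: %.3f  max: %.3f"
      % (chance_aucs.mean(), chance_aucs.min(), chance_aucs.max()))
```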

Cruickshank answered 6/2, 2014 at 7:49 Comment(4)
Hi, thanks a lot for the reply. I tried using pos_label, but it doesn't solve the problem. If I use pos_label=1 I get the output shown above. If I use pos_label=0 I get the inverted output (i.e. 1 - value shown), which is what I would expect. My y_true are all 0 or 1 and associated with the proper events. Is there another way that the svm might get confused? I have been through the documentation but can't find any indication of there being a way to get roc_auc < 0.5. I know the classifier isn't performing too well in general, I am just trying to make sure I understand the toolkit... - Gouveia
Could be anything from weird correlations in the data to the use of a non-optimal kernel. Impossible to say without a minimal reproducible example. - Cruickshank
@Gouveia I am having a similar problem with LogisticRegression. Did you find where the AUC < 0.5 comes from? - Systaltic
@jibounet Please edit the question with a reproducible example if you want a chance to get an answer beyond what I've written here. stackoverflow.com/help/mcve - Cruickshank
