I am using the SVM from sklearn v0.13.1 to try to solve a binary classification problem. I use k-fold cross validation and compute the area under the ROC curve (roc_auc) to test the quality of my model. However, for some folds the roc_auc is less than 0.5, even on the training data. Shouldn't that be impossible? Shouldn't the algorithm always be able to reach at least 0.5 on the data it is being trained on?
Here's my code:
from sklearn import svm, cross_validation
from sklearn.metrics import roc_curve, auc

classifier = svm.SVC(kernel='poly', degree=3, probability=True, max_iter=100000)
kf = cross_validation.KFold(len(myData), n_folds=3, indices=False)
for train, test in kf:
    Fit = classifier.fit(myData[train], classVector[train])
    # AUC on the held-out fold
    probas_ = Fit.predict_proba(myData[test])
    fpr, tpr, thresholds = roc_curve(classVector[test], probas_[:, 1])
    roc_auc = auc(fpr, tpr)
    # AUC on the training fold itself
    probas_ = Fit.predict_proba(myData[train])
    fpr2, tpr2, thresholds2 = roc_curve(classVector[train], probas_[:, 1])
    roc_auc2 = auc(fpr2, tpr2)
    print "Training auc: ", roc_auc2, " Testing auc: ", roc_auc
The output looks like this:
Training auc: 0.423920939062 Testing auc: 0.388436883629
Training auc: 0.525472613736 Testing auc: 0.565581854043
Training auc: 0.470917930528 Testing auc: 0.259344660194
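As a sanity check, the ROC can also be computed from the raw SVM decision values instead of from predict_proba (which is fitted separately via Platt scaling and, as far as I understand, can disagree with the decision values). This is only a sketch, reusing Fit, myData, classVector, and the test split from the loop above:

# Sketch: rank by the raw decision values rather than Platt-scaled
# probabilities; roc_curve only needs a score that ranks positives
# higher. .ravel() flattens the (n_samples, 1) output of older sklearn.
scores = Fit.decision_function(myData[test]).ravel()
fpr_d, tpr_d, thresholds_d = roc_curve(classVector[test], scores)
print "Testing auc from decision_function: ", auc(fpr_d, tpr_d)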
Is an area under the curve of less than 0.5 even meaningful? In principle, if both the training and testing values are < 0.5 I could just invert the prediction for every point (as sketched below), but I am worried something is going wrong. I thought that even on completely random data the algorithm should reach 0.5 on the training set?
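For concreteness, inverting would just mean flipping the score, which turns an AUC of a into 1 - a. A minimal sketch, again reusing the names from my loop above:

# Flipping the score reverses the ranking, so the AUC becomes 1 - AUC.
probas_test = Fit.predict_proba(myData[test])
flipped = 1.0 - probas_test[:, 1]
fpr_i, tpr_i, thresholds_i = roc_curve(classVector[test], flipped)
print "Inverted testing auc: ", auc(fpr_i, tpr_i)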