Keras, auc on validation set during training does not match with sklearn auc

I am using my test set as a validation set. I used similar approach as How to compute Receiving Operating Characteristic (ROC) and AUC in keras?

The issue is that my val_auc during the training is around 0.85, how ever, when I use

fpr, tpr, _ = roc_curve(test_label, test_prediction)
roc_auc = auc(fpr, tpr)

I get the auc of 0.60. I understand that they use different formulation and also streaming auc might be different than the one that sklearn calculate. however the difference is very large and I cant figure out what cause this difference.

# define roc_callback, inspired by https://github.com/keras-team/keras/issues/6050#issuecomment-329996505
def auc_roc(y_true, y_pred):
    # any tensorflow metric
    value, update_op = tf.contrib.metrics.streaming_auc(y_pred, y_true)

    # find all variables created for this metric
    metric_vars = [i for i in tf.local_variables() if 'auc_roc' in i.name.split('/')[1]]

    # Add metric variables to GLOBAL_VARIABLES collection.
    # They will be initialized for new session.
    for v in metric_vars:
        tf.add_to_collection(tf.GraphKeys.GLOBAL_VARIABLES, v)

    # force to update metric values
    with tf.control_dependencies([update_op]):
        value = tf.identity(value)
        return value

clf = Sequential()

clf.add(LSTM(units = 128, input_shape = (windowlength, trainX.shape[2]), return_sequences = True))#, kernel_regularizer=regularizers.l2(0.01)))

clf.add(Dropout(0.2))

clf.add(LSTM(units = 64, return_sequences = False))#, kernel_regularizer=regularizers.l2(0.01)))

clf.add(Dropout(0.2))

clf.add(Dense(units = 128, activation = 'relu'))
clf.add(Dropout(0.2))

clf.add(Dense(units = 128, activation = 'relu'))

clf.add(Dense(units = 1, activation = 'sigmoid'))
clf.compile(loss='binary_crossentropy', optimizer = 'adam', metrics = ['acc', auc_roc])

my_callbacks = [EarlyStopping(monitor='auc_roc', patience=50, verbose=1, mode='max')]
clf.fit(trainX, trainY, batch_size = 1000, epochs = 80, class_weight = class_weights, validation_data = (testX, testY), 
        verbose = 2, callbacks=my_callbacks)
y_pred_pro = model.predict_proba(testX)
print (roc_auc_score(y_test, y_pred_pro))

I really appreciate if anyone can guide me to the right direction.

First of all, tf.contrib.metrics.streaming_auc is deprecated, use tf.metrics.auc instead.

As you have mentioned, TF uses a different method to calculate the AUC than Scikit-learn.
TF uses an approximate method. Quoting its documentation:

To discretize the AUC curve, a linearly spaced set of thresholds is used to compute pairs of recall and precision values.

This will almost always give a higher AUC score than the actual score. Furthermore, the thresholds parameter defaults to 200, which is low if your dataset is large. Increasing it should make the score more accurate, but no matter how much high you set it, it will always have some error.

Scikit-learn, on the other hand, computes the "true" AUC score using a different method.

I don't know exactly why TF uses an approximate method, but I guess because it's much more memory efficient and faster. Also, although it's overestimating the score, it's very likely that it will preserve the relative order of models: if one model has a better approximate AUC than another, then its true AUC will -very probably- be also better.

Recommended topics

Hot tags