How to use F-score as error function to train neural networks?

4

6

I am pretty new to neural networks. I am training a network in TensorFlow, but my dataset (a medical dataset) has far fewer positive examples than negative ones. I know that the F-score, calculated from precision and recall, is a good measure of how well the model is trained. I have used error functions like cross-entropy loss or MSE before, but they are all based on accuracy calculation (if I am not wrong). How do I use the F-score as an error function? Is there a TensorFlow function for that, or do I have to create a new one?

Thanks in advance.

Humphries answered 17/11, 2018 at 18:20 Comment(0)

5

I think you are confusing model evaluation metrics for classification with training losses.

Accuracy, precision, F-scores etc. are evaluation metrics computed from binary outcomes and binary predictions.

For model training, you need a function that compares a continuous score (your model output) with a binary outcome - like cross-entropy. Ideally, this is calibrated such that it is minimised if the predicted mean matches the population mean (given covariates). These rules are called proper scoring rules, and the cross-entropy is one of them.
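
To make the distinction concrete, here is a minimal sketch in plain Python. The toy scores, labels, the 0.5 threshold and the helper names are all made up for illustration. It shows that an F1 computed from thresholded predictions does not move when a score improves slightly, whereas the cross-entropy does:

```python
import math

def f1_from_hard_predictions(scores, labels, threshold=0.5):
    """F1 computed from thresholded (binary) predictions."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def cross_entropy(scores, labels):
    """Mean binary cross-entropy of predicted probabilities."""
    total = sum(
        y * math.log(s) + (1 - y) * math.log(1 - s)
        for s, y in zip(scores, labels)
    )
    return -total / len(labels)

# Hypothetical toy data: four examples, two of them positive.
labels = [1, 0, 0, 1]
before = [0.60, 0.30, 0.20, 0.40]
after = [0.70, 0.30, 0.20, 0.40]  # first score nudged towards its label

# The thresholded predictions are the same, so the hard F1 is unchanged.
f1_before = f1_from_hard_predictions(before, labels)
f1_after = f1_from_hard_predictions(after, labels)

# The cross-entropy is lower for the better score, so it can guide training.
ce_before = cross_entropy(before, labels)
ce_after = cross_entropy(after, labels)
```

Because the hard F1 is flat almost everywhere as a function of the scores, a gradient-based optimiser gets no signal from it.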

Also check the thread "Is accuracy an improper scoring rule in a binary classification setting?"

If you want to weight positive and negative cases differently, two methods are:

- Oversample the minority class and correct the predicted probabilities when predicting on new examples. For fancier methods, check the under sampling module of imbalanced-learn to get an overview.
- Use a different proper scoring rule as the training loss. This lets you, for example, build in asymmetry in how you treat positive and negative cases while preserving calibration. Here is a review of the subject.

I recommend just using simple oversampling in practice.
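
As a rough sketch of the probability correction for simple oversampling: assume every positive example was replicated k times for training (k and the function name are illustrative, not from any library). Replication multiplies the odds of the positive class by k, so dividing the predicted odds by k maps the output back to the original class balance:

```python
def correct_for_oversampling(p_oversampled, k):
    """Map a probability learned on oversampled data back to the original balance.

    Assumes each positive example was replicated k times, which multiplies
    the odds of the positive class by k in the training set. Dividing the
    predicted odds by k gives p / (p + k * (1 - p)).
    """
    return p_oversampled / (p_oversampled + k * (1.0 - p_oversampled))

# Example: positives duplicated 10 times, model outputs 0.5 for a new case.
p_corrected = correct_for_oversampling(0.5, k=10)  # 0.5 / 5.5 = 1/11
```

With k = 1 (no oversampling) the function returns its input unchanged, which is a quick sanity check on the formula.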

Epitaph answered 17/11, 2018 at 21:24 Comment(0)

9

It appears that approaches for optimising directly for these types of metrics have been devised and used successfully, improving scores and/or training times:

https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/77289

https://www.kaggle.com/c/human-protein-atlas-image-classification/discussion/70328

https://www.kaggle.com/rejpalcz/best-loss-function-for-f1-score-metric

One such method uses sums of probabilities, in place of counts, for the true positives, false positives, and false negatives. For example, an F-beta loss (the generalisation of F1) can be calculated with Torch in Python as follows:

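def forward(self, y_logits, y_true):
    # y_logits and y_true are summed over dim=1, i.e. shape (batch, classes).
    y_pred = self.sigmoid(y_logits)
    # Soft counts: sums of probabilities instead of hard 0/1 counts.
    TP = (y_pred * y_true).sum(dim=1)
    FP = (y_pred * (1 - y_true)).sum(dim=1)   # probability mass on negatives
    FN = ((1 - y_pred) * y_true).sum(dim=1)   # missing probability mass on positives
    fbeta = (1 + self.beta**2) * TP / ((1 + self.beta**2) * TP + (self.beta**2) * FN + FP + self.epsilon)
    fbeta = fbeta.clamp(min=self.epsilon, max=1 - self.epsilon)
    # Minimising 1 - F-beta maximises the soft F-beta score.
    return 1 - fbeta.mean()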
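
To check the arithmetic by hand, here is a dependency-free sketch of the same soft F-beta idea for a single row of probabilities. The toy numbers and the function name are made up. It uses the standard definitions: the false-positive term is the probability mass placed on negatives, and the false-negative term is the mass missing from positives.

```python
def soft_fbeta(probs, labels, beta=1.0, eps=1e-7):
    """F-beta with sums of probabilities in place of hard counts."""
    tp = sum(p * y for p, y in zip(probs, labels))
    fp = sum(p * (1 - y) for p, y in zip(probs, labels))
    fn = sum((1 - p) * y for p, y in zip(probs, labels))
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp + eps)

# Hypothetical toy data: four examples, two of them positive.
labels = [1, 0, 0, 1]
before = [0.60, 0.30, 0.20, 0.40]
after = [0.70, 0.30, 0.20, 0.40]  # first probability nudged towards its label

# before: TP = 1.0, FP = 0.5, FN = 1.0, so soft F1 = 2 / 3.5
# after:  TP = 1.1, FP = 0.5, FN = 0.9, so soft F1 = 2.2 / 3.6
loss_before = 1 - soft_fbeta(before, labels)
loss_after = 1 - soft_fbeta(after, labels)
```

Unlike an F1 computed from thresholded predictions, this loss decreases as soon as a probability moves towards its label, which is what makes it usable with gradient descent.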

An alternative method is described in this paper:

https://arxiv.org/abs/1608.04802

The approach taken optimises for a lower bound on the statistic. Other metrics such as AUROC and AUCPR are also discussed. An implementation in TF of such an approach can be found here:

https://github.com/tensorflow/models/tree/master/research/global_objectives

Bel answered 31/8, 2019 at 18:50 Comment(2)

Hi, the GitHub link doesn't seem to work. Is there anywhere I can find the TensorFlow code for this? – Bein 20/12, 2020 at 0:3

Went back to a random Dec 2019 commit: github.com/tensorflow/models/blob/… – Southern 23/6 at 17:22

0

The loss value and accuracy are different concepts. The loss value is used for training the NN, whereas accuracy and other metrics are used to evaluate the training result.

Eladiaelaeoptene answered 20/4, 2020 at 14:33 Comment(0)

0

This question reminds me of Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure".

Your model minimizes the error based on the loss function. Common loss functions like Mean Squared Error have well-studied behavior and performance. I would recommend sticking with those, as optimizing for a metric may elicit unwanted model behavior.

As an example, I once used R-squared as a loss function for a time-series regression task instead of Mean Squared Error. As a result, the model's predictions almost completely ignored outliers (or overfit on them, I don't remember exactly) in the dataset, which was not optimal for my task. Returning to Mean Squared Error yielded better results. Perhaps it may be the same with yours.

Murry answered 11/10 at 19:19 Comment(0)
