Here is an example calculation (without the samples averaging option):
Macro averaging favours the "under-represented" classes: it encourages the model / algorithm / evaluation to pay more attention to the "small" classes than if each class were weighted by how many data points it has, which is what micro-F1 does.
Concrete example:
- Assume: there are two classes of emails (important, junk)
- Also assume your data has 1000 emails
- Also assume the data has only 10 important emails (and therefore 990 junk emails)
Now:
Let's assume we build an email-classifier that predicts the label "junk" all the time.
I.e.:
- Pred_junk = 1000
- Pred_important = 0
The results would be:
- TP_junk = 990
- TP_important = 0
- FP_junk = 10
- FP_important = 0
- FN_junk = 0
- FN_important = 10
- Recall_junk = TP / (TP + FN) = 990/990 = 1
- Recall_important = 0/10 = 0
- Precision_junk = TP / (TP + FP) = 990/1000 = 0.99
- Precision_important = UNDEFINED (0/0); by convention it can be set to 1 (or 0), and since Recall_important = 0 the resulting F1_important is 0 either way
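The counting above can be reproduced with a minimal Python sketch (the label names and the 990/10 split are just the assumptions from this example; the 0/0 case is set to 1 here, matching the convention above):

```python
# Assumed example data: 990 junk emails, 10 important ones,
# and a classifier that predicts "junk" every time.
y_true = ["junk"] * 990 + ["important"] * 10
y_pred = ["junk"] * 1000

for cls in ["junk", "important"]:
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0  # 0/0 set to 1 by convention
    recall = tp / (tp + fn) if (tp + fn) > 0 else 1.0
    print(f"{cls}: TP={tp} FP={fp} FN={fn} P={precision:.2f} R={recall:.2f}")
```

This prints TP=990, FP=10, P=0.99, R=1.00 for junk and TP=0, FN=10, P=1.00, R=0.00 for important, matching the numbers above.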
So now, Macro F1 is:
- F1_junk = 2 * Precision_junk * Recall_junk / (Precision_junk + Recall_junk) = 2 * 0.99 * 1 / 1.99 ≈ 0.995
- F1_important = 0
- F1_Macro = 1/2 F1_junk + 1/2 F1_important ≈ 0.4975
Weighted-F1 would be:
- F1_weighted = 990/1000 F1_junk + 10/1000 F1_important ≈ 0.985
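Continuing the sketch above (same assumed counts), the per-class F1 scores and the two averages can be computed directly:

```python
# Per-class F1 from the precision/recall values derived above.
f1_junk = 2 * 0.99 * 1.0 / (0.99 + 1.0)  # ~0.995
f1_important = 0.0                       # recall is 0, so F1 is 0 regardless of the 0/0 convention

# Macro: every class counts equally.
f1_macro = 0.5 * f1_junk + 0.5 * f1_important                      # ~0.4975

# Weighted: every class counts in proportion to its support (990 vs 10 emails).
f1_weighted = (990 / 1000) * f1_junk + (10 / 1000) * f1_important  # ~0.985

print(f"macro={f1_macro:.4f} weighted={f1_weighted:.4f}")
```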
Micro F1, on the other hand, is calculated by pooling the TP, FP, and FN counts of all classes:
- TP_total = TP_junk + TP_important = 990
- FP_total = FP_junk + FP_important = 10
- FN_total = FN_junk + FN_important = 10
Micro-F1 is therefore:
- Precision_micro = TP_total / (TP_total + FP_total) = 990 / 1000 = 0.99
- Recall_micro = TP_total / (TP_total + FN_total) = 990 / 1000 = 0.99
- F1_micro_total = 0.99
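If scikit-learn is available, the three averages can be cross-checked with f1_score on the same assumed labels (a warning about the undefined precision for "important" is expected and does not change the scores):

```python
from sklearn.metrics import f1_score

y_true = ["junk"] * 990 + ["important"] * 10
y_pred = ["junk"] * 1000

print(f1_score(y_true, y_pred, average="macro"))     # ~0.4975
print(f1_score(y_true, y_pred, average="weighted"))  # ~0.985
print(f1_score(y_true, y_pred, average="micro"))     # 0.99
```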
So, what you can see is that the F1 score gets penalised very heavily in the Macro setting, because it weights every class the same regardless of how often it appears in the dataset; in our case this leads to a massive difference between the scores. The Macro F1 score is therefore much better suited to tackling class imbalance, as it penalises the model / algorithm for performing poorly on the under-represented classes.
Hope this helps.