Here is an example calculation (without the samples averaging option):
Macro averaging favours the "under-represented" classes: it encourages the model / algorithm / evaluation to pay more attention to the "small" classes than if each class were weighted by how many data points it has, which is what micro-F1 does.
Concrete example:
- Assume: there are two classes of emails (important, junk)
- Also assume your data has 1000 emails
- Also assume the data has only 10 important emails (and therefore 990 junk emails)
Now:
Let's assume we build an email-classifier that predicts the label "junk" all the time.
I.e.:
- Pred_junk = 1000
- Pred_important = 0
The results would be:
- TP_junk = 990
- TP_important = 0
- FP_junk = 10
- FP_important = 0
- FN_junk = 0
- FN_important = 10
- Recall_junk = TP / (TP + FN) = 990/990 = 1
- Recall_important = 0/10 = 0
- Precision_junk = TP / (TP + FP) = 990/1000 = 0.99
- Precision_important = UNDEFINED (0/0); by convention it can be set to 1 (or 0), and since Recall_important = 0 the resulting F1_important is 0 either way
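The counting above can be reproduced with a minimal Python sketch (the label names and the 990/10 split are just the assumptions from this example; the 0/0 case is set to 1 here, matching the convention above):

```python
# Assumed example data: 990 junk emails, 10 important ones,
# and a classifier that predicts "junk" every time.
y_true = ["junk"] * 990 + ["important"] * 10
y_pred = ["junk"] * 1000

for cls in ["junk", "important"]:
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 1.0  # 0/0 set to 1 by convention
    recall = tp / (tp + fn) if (tp + fn) > 0 else 1.0
    print(f"{cls}: TP={tp} FP={fp} FN={fn} P={precision:.2f} R={recall:.2f}")
```

This prints TP=990, FP=10, P=0.99, R=1.00 for junk and TP=0, FN=10, P=1.00, R=0.00 for important, matching the numbers above.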
So now, Macro F1 is:
- F1_junk = 2 * Precision_junk * Recall_junk / (Precision_junk + Recall_junk) = 2 * 0.99 * 1 / 1.99 ≈ 0.995
- F1_important = 0
- F1_Macro = 1/2 F1_junk + 1/2 F1_important ≈ 0.4975
Weighted-F1 would be:
- F1_weighted = 990/1000 F1_junk + 10/1000 F1_important ≈ 0.985
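Continuing the sketch above (same assumed counts), the per-class F1 scores and the two averages can be computed directly:

```python
# Per-class F1 from the precision/recall values derived above.
f1_junk = 2 * 0.99 * 1.0 / (0.99 + 1.0)  # ~0.995
f1_important = 0.0                       # recall is 0, so F1 is 0 regardless of the 0/0 convention

# Macro: every class counts equally.
f1_macro = 0.5 * f1_junk + 0.5 * f1_important                      # ~0.4975

# Weighted: every class counts in proportion to its support (990 vs 10 emails).
f1_weighted = (990 / 1000) * f1_junk + (10 / 1000) * f1_important  # ~0.985

print(f"macro={f1_macro:.4f} weighted={f1_weighted:.4f}")
```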
Micro F1, on the other hand, is calculated by pooling the TP, FP, and FN counts of all classes:
- TP_total = TP_junk + TP_important = 990
- FP_total = FP_junk + FP_important = 10
- FN_total = FN_junk + FN_important = 10
Micro-F1 is therefore:
- Precision_micro = TP_total / (TP_total + FP_total) = 990 / 1000 = 0.99
- Recall_micro = TP_total / (TP_total + FN_total) = 990 / 1000 = 0.99
- F1_micro_total = 0.99
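If scikit-learn is available, the three averages can be cross-checked with f1_score on the same assumed labels (a warning about the undefined precision for "important" is expected and does not change the scores):

```python
from sklearn.metrics import f1_score

y_true = ["junk"] * 990 + ["important"] * 10
y_pred = ["junk"] * 1000

print(f1_score(y_true, y_pred, average="macro"))     # ~0.4975
print(f1_score(y_true, y_pred, average="weighted"))  # ~0.985
print(f1_score(y_true, y_pred, average="micro"))     # 0.99
```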
So, what you can see is that the F1 score gets penalised very heavily in the Macro setting, because it weights every class the same regardless of how often it appears in the dataset; in our case this leads to a massive difference between the scores. The Macro F1 score is therefore much better suited to tackling class imbalance, as it penalises the model / algorithm for performing poorly on the under-represented classes.
Hope this helps.