What is the use of base_score in xgboost multi-class classification?
I am trying to explore the workings of XGBoost for binary as well as multi-class classification. In the binary case, I observed that base_score is treated as the starting probability, and it also had a major impact when calculating Gain and Cover.

In the multi-class case, I am not able to figure out the importance of the base_score parameter, because it showed me the same values of Gain and Cover for different (any) values of base_score.

Also, I am unable to figure out why a factor of 2 appears when calculating Cover in the multi-class case, i.e. 2*p*(1-p).

Can someone help me with these two parts?

Proctology answered 12/6, 2020 at 18:56 Comment(7)
Applying base_score to multi-class classifiers is discussed here: #47596986 (does this help you with 'part 1' of your question?)Maximalist
Thanks for the comment, but the explanation given in the link is for a binary-class problem.Proctology
Yes, you need to read the entire page to find the relevant part: "Your answer for the two-class (binary) case wouldn't make any sense for multiclass. See the discussion they linked to on the equivalent base_margin default in multiclass #1380, where xgboost (pre-2017) used to make the default assumption that base_score = 1/nclasses, which is a-priori really dubious if there's a class imbalance, but they say "if you use enough training steps this goes away", which is not good for out-of-the-box performance in data exploration." For further discussion: github.com/dmlc/xgboost/issues/2222Maximalist
I agree with the point that base_score=1/nclasses. But I observed one thing: in the binary case, our base score is used as the initial probability and hence impacts the Gain and Cover values, while in the multi-class case, whatever value I pass as base score in R (.5, .6, .7), it is always overwritten by 1/nclasses, and it also gets added to the odds of the last leaf node. Can you please explain why it is added at the end of the leaf node in the multi-class case but treated as the starting probability in the binary case?Proctology
Right! I finally understand - sorry for misinterpreting your question - using the iris dataset I tried setting base_score to various values and saw no difference after a single round of training with low eta (all predictions were 0.333). I also tried setting base_margin and saw no difference in the first training round. Setting base_score / base_margin works as expected for linear / binary classifiers, but it did not work for multi-class predictions (neither softprob nor softmax) in this test case. If nobody else responds, it would be great if you raised this issue on the xgboost github.Maximalist
Hopefully my answer can help to explain what is going on. Please comment if something is not clear.Lakieshalakin
Also, I feel like the xgboost documentation does a very poor job of explaining what is happening under the hood. I'm really surprised that what I'm saying here is not mentioned explicitly in the docs.Lakieshalakin
To answer your question, let's look at what multi-class classification really does in xgboost, using the multi:softmax objective and, say, 6 classes.

Say you want to train a classifier specifying num_boost_round=5. How many trees would you expect xgboost to train for you? The correct answer is 30 trees. The reason is that softmax expects each training row to have num_classes=6 different scores, so that xgboost can compute gradients/hessians w.r.t. each of these 6 scores and use them to build a new tree for each of the scores (effectively updating 6 parallel models in order to output 6 updated scores per sample).

In order to make the xgboost classifier output the final 6 values for each sample, e.g. from the test set, you need to call bst.predict(xg_test, output_margin=True) (where bst is your classifier and xg_test is e.g. the test set). The output of the regular bst.predict(xg_test) is effectively the same as picking the class with the highest of the 6 values in bst.predict(xg_test, output_margin=True).

You can look at all the trees using the bst.trees_to_dataframe() function if you are interested (where bst is your trained classifier).

Now to the question of what base_score does in the multi:softmax case. The answer is: it is added as a starting score to each of the 6 classes' scores before any trees are added. So if you, e.g., apply base_score=42. you will be able to observe that all values in bst.predict(xg_test, output_margin=True) also increase by 42. At the same time, for softmax, increasing the scores of all classes by an equal amount doesn't change anything, so in the case of multi:softmax applying a base_score different from 0 doesn't have any visible effect.
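This shift invariance of softmax is easy to verify directly in numpy (a standalone sketch, independent of xgboost; the score values are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([0.2, -0.5, 1.3, 0.0, 0.7, -1.1])  # made-up per-class margins

p = softmax(scores)
p_shifted = softmax(scores + 42.0)  # mimic base_score=42 added to every class

# Adding the same constant to all class scores leaves the probabilities unchanged.
print(np.allclose(p, p_shifted))  # True
```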

Compare this behavior to binary classification. While it is almost the same as multi:softmax with 2 classes, the big difference is that xgboost only tries to produce 1 score for class 1, leaving the score for class 0 equal to 0.0. Because of that, when you use base_score in binary classification, it is only added to the score of class 1, thus increasing the starting prediction probability for class 1. In theory, with multiple classes it would be meaningful to pass multiple base scores (one per class), which you can't do using base_score. Instead you can use the set_base_margin functionality applied to the training set, but it does not work very conveniently with the default predict, so after that you'll always need to use it with output_margin=True, adding the same values as the ones you used in set_base_margin for your training data (if you want to use set_base_margin in the multi-class case, you'll need to flatten the margin values as suggested here).

Example of how it all works:

import numpy as np
import xgboost as xgb
TRAIN = 1000
TEST = 2
F = 10

def gen_data(M):
    np_train_features = np.random.rand(M, F)
    np_train_labels = np.random.binomial(2, np_train_features[:,0])
    return xgb.DMatrix(np_train_features, label=np_train_labels)

def regenerate_data():
    np.random.seed(1)
    return gen_data(TRAIN), gen_data(TEST)

param = {}
param['objective'] = 'multi:softmax'
param['eta'] = 0.001
param['max_depth'] = 1
param['nthread'] = 4
param['num_class'] = 3


def sbm(xg_data, original_scores):
    # Repeat the per-class margins for every row and flatten them
    # (multi-class base margins must be passed in flattened form).
    xg_data.set_base_margin(np.array(original_scores * xg_data.num_row()).reshape(-1, 1))

num_round = 3

print("#1. No base_score, no set_base_margin")
xg_train, xg_test = regenerate_data()
bst = xgb.train(param, xg_train, num_round)
print(bst.predict(xg_test, output_margin=True))
print(bst.predict(xg_test))
print("Easy to see that in this case all scores/margins have 0.5 added to them initially, which is the default value for base_score here for some bizarre reason, but it doesn't really affect anything, so no one cares.")
print()
bst1 = bst

print("#2. Use base_score")
xg_train, xg_test = regenerate_data()
param['base_score'] = 5.8
bst = xgb.train(param, xg_train, num_round)
print(bst.predict(xg_test, output_margin=True))
print(bst.predict(xg_test))
print("In this case all scores/margins have 5.8 added to them initially. And it doesn't really change anything compared to previous case.")
print()
bst2 = bst

print("#3. Use very large base_score and screw up numeric precision")
xg_train, xg_test = regenerate_data()
param['base_score'] = 5.8e10
bst = xgb.train(param, xg_train, num_round)
print(bst.predict(xg_test, output_margin=True))
print(bst.predict(xg_test))
print("In this case all scores/margins have too big a number added to them, so xgboost thinks all probabilities are equal and picks class 0 as the prediction.")
print("But the training actually went fine - only predict is affected here. If you set normal base margins for the test set you can see it (you can also look at bst.trees_to_dataframe()).")
xg_train, xg_test = regenerate_data() # if we don't regenerate the DMatrix here, xgboost seems to either cache it or otherwise remember that it didn't have base margins, and the result will be different.
sbm(xg_test, [0.1, 0.1, 0.1])
print(bst.predict(xg_test, output_margin=True))
print(bst.predict(xg_test))
print()
bst3 = bst

print("#4. Use set_base_margin for training")
xg_train, xg_test = regenerate_data()
# base_score is only used in train/test whenever set_base_margin is not applied.
# Peculiarly, the trained model remembers this value even if it was trained with
# a dataset which had set_base_margin. In that case this base_score is used if
# and only if the test set passed to `bst.predict` didn't have `set_base_margin` applied to it.
param['base_score'] = 4.2
sbm(xg_train, [-0.4, 0., 0.8])
bst = xgb.train(param, xg_train, num_round)
sbm(xg_test, [-0.4, 0., 0.8])
print(bst.predict(xg_test, output_margin=True))
print(bst.predict(xg_test))
print("Working - the base margin values added to the classes skew the predictions, due to the low eta and small number of boosting rounds.")
print("If we don't set base margins for the `predict` input, it will use base_score to start all scores with. Bizarre, right? But then again, it doesn't matter much what we add here if we add the same value to all classes' scores.")
xg_train, xg_test = regenerate_data() # regenerate test and don't set the base margin values
print(bst.predict(xg_test, output_margin=True))
print(bst.predict(xg_test))
print()
bst4 = bst

print("Trees bst1, bst2, bst3 are almost identical, because there is no difference in how they were trained. bst4 is different though.")
print(bst1.trees_to_dataframe().iloc[1,])
print()
print(bst2.trees_to_dataframe().iloc[1,])
print()
print(bst3.trees_to_dataframe().iloc[1,])
print()
print(bst4.trees_to_dataframe().iloc[1,])

The output for this is the following:

#1. No base_score, no set_base_margin
[[0.50240415 0.5003637  0.49870378]
 [0.49863306 0.5003637  0.49870378]]
[0. 1.]
Easy to see that in this case all scores/margins have 0.5 added to them initially, which is the default value for base_score here for some bizarre reason, but it doesn't really affect anything, so no one cares.

#2. Use base_score
[[5.8024044 5.800364  5.798704 ]
 [5.798633  5.800364  5.798704 ]]
[0. 1.]
In this case all scores/margins have 5.8 added to them initially. And it doesn't really change anything compared to previous case.

#3. Use very large base_score and screw up numeric precision
[[5.8e+10 5.8e+10 5.8e+10]
 [5.8e+10 5.8e+10 5.8e+10]]
[0. 0.]
In this case all scores/margins have too big a number added to them, so xgboost thinks all probabilities are equal and picks class 0 as the prediction.
But the training actually went fine - only predict is affected here. If you set normal base margins for the test set you can see it (you can also look at bst.trees_to_dataframe()).
[[0.10240632 0.10036398 0.09870315]
 [0.09863247 0.10036398 0.09870315]]
[0. 1.]

#4. Use set_base_margin for training
[[-0.39458954  0.00102317  0.7973728 ]
 [-0.40044016  0.00102317  0.7973728 ]]
[2. 2.]
Working - the base margin values added to the classes skew the predictions, due to the low eta and small number of boosting rounds.
If we don't set base margins for the `predict` input, it will use base_score to start all scores with. Bizarre, right? But then again, it doesn't matter much what we add here if we add the same value to all classes' scores.
[[4.2054105 4.201023  4.1973724]
 [4.1995597 4.201023  4.1973724]]
[0. 1.]

Trees bst1, bst2, bst3 are almost identical, because there is no difference in how they were trained. bst4 is different though.
Tree                 0
Node                 1
ID                 0-1
Feature           Leaf
Split              NaN
Yes                NaN
No                 NaN
Missing            NaN
Gain       0.000802105
Cover          157.333
Name: 1, dtype: object

Tree                 0
Node                 1
ID                 0-1
Feature           Leaf
Split              NaN
Yes                NaN
No                 NaN
Missing            NaN
Gain       0.000802105
Cover          157.333
Name: 1, dtype: object

Tree                 0
Node                 1
ID                 0-1
Feature           Leaf
Split              NaN
Yes                NaN
No                 NaN
Missing            NaN
Gain       0.000802105
Cover          157.333
Name: 1, dtype: object

Tree                0
Node                1
ID                0-1
Feature          Leaf
Split             NaN
Yes               NaN
No                NaN
Missing           NaN
Gain       0.00180733
Cover         100.858
Name: 1, dtype: object
Lakieshalakin answered 18/6, 2020 at 6:3 Comment(6)
Thank you for your detailed explanation! In my attempts to answer @jayantphor's question I tried setting base_margin and using output_margin=True with a sample dataset but wasn't able to see the effect I expected - perhaps I needed to flatten the base_margin values as you said. Are you able to provide a reproducible example illustrating how to effectively set base_margin values for a multiclass XGBoost classification problem (in R or python)?Maximalist
Thanks @Alexander and jared_mamrot. I observed the following things so far: 1. In binary xgboost, Cover and value are impacted by base_score, and no separate addition of base_score is done. 2. However, in multi-class, the Cover, Gain and value are not affected irrespective of whatever base_score we use. Further, the base score is added to the last leaf values.Proctology
Still, I am not able to figure out the two points below: 1. We know that the base_score should get cancelled out when normalizing the values for the probability calculation in the multi-class case. But shouldn't the Gain and Cover values incorporate this value? 2. That is, while modeling, say, p1_unadj=exp(z1), p2_unadj=exp(z2), p3_unadj=exp(z3) as 3 outcomes, then p1_adj = p1_unadj/sum(p1_unadj, p2_unadj, p3_unadj). Here the base_score could get cancelled out, but the tree nodes should show an effect of base_score, and the Gain and Cover values should be different for different base_score values?Proctology
@Maximalist - set_base_margin is implemented in such a way that it works very differently from base_score, even though you might think that the first is just a more generic way of doing the second. I'll try to prepare some reasonable examples.Lakieshalakin
@Proctology - the behavior you describe is caused exactly by the fact that binary classification outputs one score/output/margin (all the same thing) per sample (which in multi:softmax notation would translate to the class "one" score, while the class "zero" score is fixed at 0.0). Because of that asymmetry, base_score has an effect on binary classification. In multiclass, adding the same base_score to all num_classes scores does not affect the gradient calculation (the gradient of the loss w.r.t. the scores), so you end up seeing the same Gain and Cover values (and overall no effect on the training).Lakieshalakin
@Maximalist - added reproducible exampleLakieshalakin
