Calculation of expected_value in SHAP explanations of XGBoost Classifier

Asked 18/9, 2023 at 9:28 Answered 31/7, 2024 at 15:35

Solved python machine-learning xgboost shap

How do we make sense of SHAP explainer.expected_value? Why is it not the same as y_train.mean() after the sigmoid transformation?

Below is a summary of the code for quick reference. Full code available in this notebook: https://github.com/MenaWANG/ML_toy_examples/blob/main/explain%20models/shap_XGB_classification.ipynb

model = xgb.XGBClassifier()
model.fit(X_train, y_train)
explainer = shap.Explainer(model)
shap_test = explainer(X_test)
shap_df = pd.DataFrame(shap_test.values)

# For each row, the SHAP values summed across all features plus the expected value give the margin (raw score) for that row, which the sigmoid then maps to the predicted probability:
np.isclose(model.predict(X_test, output_margin=True),explainer.expected_value + shap_df.sum(axis=1))
#True
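
The identity checked above can be written out by hand for a single row. This is only a sketch with made-up numbers (the expected value and the SHAP values below are hypothetical, not taken from the notebook); it shows that the sum happens in margin space and the sigmoid is applied once at the end.

```python
import math

def sigmoid(z):
    """Logistic function, the same mapping as scipy.special.expit."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values for one row (not real model output)
expected_value = -1.2          # base value in margin (log-odds) space
shap_row = [0.5, -0.3, 1.1]    # one SHAP value per feature

# Additivity holds in margin space
margin = expected_value + sum(shap_row)

# The probability is obtained by transforming the margin once
prob = sigmoid(margin)

print(margin, prob)
```

Note that the sigmoid is applied to the total, not to each term separately.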

But why isn't the following true? Why, after the sigmoid transformation, is explainer.expected_value not the same as y_train.mean() for XGBoost classifiers?

expit(explainer.expected_value) == y_train.mean()
#False

Alejoa asked 18/9, 2023 at 9:28

SHAP values are guaranteed to be additive in raw space (logits). To see why additivity in raw scores does not carry over to probabilities, think about why exp(x+y) != exp(x) + exp(y): the sigmoid is nonlinear as well, so the sigmoid of an average raw score is not the average of the probabilities.
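
A quick numeric check of that point, using made-up raw scores (the list below is hypothetical and is not output from any model): the sigmoid of the mean margin differs from the mean of the per-row probabilities, which is why expit(expected_value) need not match an average of probabilities or labels.

```python
import math

def sigmoid(z):
    """Logistic function, the same mapping as scipy.special.expit."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical raw scores (margins) for four rows
margins = [-4.0, -2.0, 0.0, 3.0]

# Average in margin space first, then transform
mean_margin = sum(margins) / len(margins)
prob_of_mean_margin = sigmoid(mean_margin)

# Transform each row first, then average the probabilities
mean_of_probs = sum(sigmoid(m) for m in margins) / len(margins)

print(prob_of_mean_margin, mean_of_probs)
```

The two numbers only coincide in special cases, for example when all margins are equal.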

Re: "Just keen to understand how was explainer.expected_value calculated for XGBoost classifier. Do you happen to know?"

As I stated in the comments, the expected value comes either from the model's trees or from your data.

Let's try a reproducible example:

import numpy as np
from sklearn.model_selection import train_test_split
import xgboost
import shap

X, y = shap.datasets.adult()
X_display, y_display = shap.datasets.adult(display=True)

# create a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
d_train = xgboost.DMatrix(X_train, label=y_train)
d_test = xgboost.DMatrix(X_test, label=y_test)

params = {
    "eta": 0.01,
    "objective": "binary:logistic",
    "subsample": 0.5,
    "base_score": np.mean(y_train),
    "eval_metric": "logloss",
}
model = xgboost.train(
    params,
    d_train,
    num_boost_round=5000,
    evals=[(d_test, "test")],
    verbose_eval=100,
    early_stopping_rounds=20,
)

Case 1. No data available, trees only.

explainer = shap.TreeExplainer(model)
ev_trees = explainer.expected_value[0]

from shap.explainers._tree import XGBTreeModelLoader

xgb_loader = XGBTreeModelLoader(model)
ts = xgb_loader.get_trees()

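# Sum the root-node value of every tree; SHAP stores at each root the
# cover-weighted average of that tree's leaf values
v = []
for t in ts:
    v.append(t.values[0][0])
sv = sum(v)

import struct
from scipy.special import logit
size = struct.calcsize('f')
# Assumes the legacy binary format: a b'binf' header followed by base_score as a float32.
# removeprefix drops only that header; lstrip(b'binf') would strip any leading b/i/n/f bytes.
buffer = model.save_raw().removeprefix(b'binf')
v = struct.unpack('f', buffer[0:0+size])[0]
# if objective "binary:logistic" or "reg:logistic"
bv = logit(v)

# expected value = summed tree expectations + base margin
ev_trees_raw = sv+bv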

np.isclose(ev_trees, ev_trees_raw)

True
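
To make the "trees only" calculation concrete, here is a minimal sketch on two hand-written toy trees. Everything in it is hypothetical (the dict layout, the cover numbers, the leaf values and the base_score); it is not the SHAP or XGBoost internal representation. It only illustrates the idea: each tree contributes the cover-weighted average of its leaves, and the base margin logit(base_score) is added on top.

```python
import math

def tree_expectation(node):
    """Cover-weighted average of the leaf values below `node`."""
    if "value" in node:  # leaf
        return node["value"]
    left, right = node["left"], node["right"]
    total = left["cover"] + right["cover"]
    return (left["cover"] * tree_expectation(left)
            + right["cover"] * tree_expectation(right)) / total

# Two hypothetical trees; "cover" stands in for the weight recorded at training time
tree_1 = {
    "cover": 40,
    "left": {"cover": 30, "value": -0.4},
    "right": {"cover": 10, "value": 0.8},
}
tree_2 = {
    "cover": 40,
    "left": {"cover": 10, "value": 0.2},
    "right": {
        "cover": 30,
        "left": {"cover": 20, "value": -0.1},
        "right": {"cover": 10, "value": 0.5},
    },
}

base_score = 0.25                                      # hypothetical, on the probability scale
base_margin = math.log(base_score / (1 - base_score))  # logit(base_score)

sum_of_tree_expectations = tree_expectation(tree_1) + tree_expectation(tree_2)
expected_value = base_margin + sum_of_tree_expectations

print(sum_of_tree_expectations, expected_value)
```

The result lives in margin space, so it has to be passed through the sigmoid before it can be read as a probability.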

Case 2. Background data set supplied.

background = X_train[:100]

explainer = shap.TreeExplainer(model, background)
ev_background = explainer.expected_value

Note that:

np.isclose(ev_trees, ev_background)

False

but

d_train_background = xgboost.DMatrix(background, y_train[:100])
preds = model.predict(d_train_background, pred_contribs = True)

np.isclose(ev_background, preds.sum(1).mean())

True

or simply

output_margin = model.predict(d_train_background, output_margin=True)
np.isclose(ev_background, output_margin.mean())

True
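
The same idea in a dependency-free sketch: with a background set, the expected value is just the mean raw score over those rows, so it changes whenever the background changes. The margin function and both background sets below are hypothetical stand-ins, not the adult model above.

```python
def margin(row):
    """Hypothetical stand-in for a model's raw score (output_margin=True) on one row."""
    age, hours = row
    score = -1.0
    score += 0.6 if age > 40 else -0.2
    score += 0.4 if hours > 45 else -0.1
    return score

def expected_value(background):
    """Mean raw score over the background rows."""
    return sum(margin(row) for row in background) / len(background)

# Two hypothetical background sets of (age, hours) rows
background_a = [(25, 40), (30, 50), (52, 38), (61, 60)]
background_b = [(22, 20), (27, 35)]

ev_a = expected_value(background_a)
ev_b = expected_value(background_b)

print(ev_a, ev_b)
```

Different backgrounds give different base values for the same model, which is consistent with ev_trees and ev_background not matching above.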

Cotton answered 18/9, 2023 at 10:5 Comment(2)

I appreciate the response. Indeed, the test for additivity goes fine, as shown in the code posted with the question. Just keen to understand how was explainer.expected_value calculated for XGBoost classifier. Do you happen to know? Thanks again! :) (BTW, the explainer.expected_value for RandomForestClassifier() is y_train.mean(), code here: github.com/MenaWANG/ML_toy_examples/blob/main/explain%20models/… I find these differences very interesting) – Alejoa 21/9, 2023 at 9:37

Expected values are the average of raw scores over the background data set, or over the supplied trees if no background data set is given – Cotton 21/9, 2023 at 11:13

Some more explanatory code:

import json
import numpy as np
import shap
import xgboost as xgb
from scipy.special import expit, logit

print('shap.__version__:',shap.__version__)
print('xgb.__version__:',xgb.__version__)
print()

X, y = shap.datasets.adult()

estimator = xgb.XGBClassifier(
    objective='binary:logistic',
    n_estimators=200)

estimator.fit(X,y)

explainer = shap.TreeExplainer(
    model=estimator,
    feature_perturbation='tree_path_dependent',
    model_output='raw')

print("estimator.get_params()['n_estimators']:",estimator.get_params()['n_estimators'])
print('explainer.model.tree_limit:',explainer.model.tree_limit)
print()

print("float(json.loads(estimator.get_booster().save_config())['learner']['learner_model_param']['base_score']):",float(json.loads(estimator.get_booster().save_config())['learner']['learner_model_param']['base_score']))
print('expit(explainer.model.base_offset):',expit(explainer.model.base_offset))
print('expit(explainer.expected_value):',expit(explainer.expected_value))
print()

shap_values = explainer(X)

# phi is computed the same way as in shap.explainers._tree.TreeExplainer.shap_values
# Also see https://xgboost.readthedocs.io/en/stable/prediction.html
phi = explainer.model.original_model.predict(
    xgb.DMatrix(X),
    #iteration_range=(0, explainer.model.tree_limit),
    pred_contribs=True,
    approx_contribs=False,
    validate_features=False)

print('expit(estimator.get_booster().predict(xgb.DMatrix(X),pred_contribs=True))[0,-1]:',expit(estimator.get_booster().predict(xgb.DMatrix(X),pred_contribs=True))[0,-1])
print('expit(phi[0, -1]):',expit(phi[0, -1]))
print('expit(explainer.expected_value):',expit(explainer.expected_value))
print()

print('expit(phi[0].sum()):',expit(phi[0].sum()))
print('estimator.predict_proba(X.loc[[0]])[0,1]:',estimator.predict_proba(X.loc[[0]])[0,1])
print()

# https://xgboost.readthedocs.io/en/latest/tutorials/intercept.html
print('X.shape:',X.shape)
print('phi.shape:',phi.shape,'(extra column for expected value aka intercept)')
print('np.all(phi[:,-1] == explainer.expected_value):',np.all(phi[:,-1] == explainer.expected_value),'(the expected value is the same for all predictions)')

(I don't know how XGBoost calculates base_score.)

Output:

shap.__version__: 0.46.0
xgb.__version__: 2.1.0

estimator.get_params()['n_estimators']: 200
explainer.model.tree_limit: 200

float(json.loads(estimator.get_booster().save_config())['learner']['learner_model_param']['base_score']): 0.26177529
expit(explainer.model.base_offset): [0.26177529]
expit(explainer.expected_value): [0.26177529]

expit(estimator.get_booster().predict(xgb.DMatrix(X),pred_contribs=True))[0,-1]: 0.21074083
expit(phi[0, -1]): 0.21074083
expit(explainer.expected_value): 0.21074083

expit(phi[0].sum()): 0.000256332
estimator.predict_proba(X.loc[[0]])[0,1]: 0.00025633152

X.shape: (32561, 12)
phi.shape: (32561, 13) (extra column for expected value aka intercept)
np.all(phi[:,-1] == explainer.expected_value): True (the expected value is the same for all predictions)

Rixdollar answered 31/7, 2024 at 15:35 Comment(0)
