SHAP is guaranteed to be additive in raw score space (logits). To see why additivity in raw scores doesn't carry over to additivity in class probabilities, note that the sigmoid is nonlinear, for the same reason that exp(x+y) != exp(x) + exp(y).
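A minimal numeric sketch of this point (the contribution values below are made up purely for illustration):

import numpy as np
from scipy.special import expit  # logistic sigmoid

base = 0.3                             # expected value, logit space
contribs = np.array([0.7, -0.2, 0.4])  # hypothetical SHAP values, logit space

margin = base + contribs.sum()  # additivity holds in raw score space
prob = expit(margin)            # predicted probability of the positive class

# additivity does not survive the sigmoid:
print(prob)                                 # ~0.77
print(expit(base) + expit(contribs).sum())  # ~2.29, not even a probability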
Re: Just keen to understand how explainer.expected_value was calculated for an XGBoost classifier. Do you happen to know?
As I stated in the comments, the expected value comes either from the model's trees or from your data.
Let's work through a reproducible example:
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost
import shap

X, y = shap.datasets.adult()
X_display, y_display = shap.datasets.adult(display=True)

# create a train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)
d_train = xgboost.DMatrix(X_train, label=y_train)
d_test = xgboost.DMatrix(X_test, label=y_test)

params = {
    "eta": 0.01,
    "objective": "binary:logistic",
    "subsample": 0.5,
    "base_score": np.mean(y_train),
    "eval_metric": "logloss",
}
model = xgboost.train(
    params,
    d_train,
    num_boost_round=5000,
    evals=[(d_test, "test")],
    verbose_eval=100,
    early_stopping_rounds=20,
)
Case 1. No background data supplied, trees only.
explainer = shap.TreeExplainer(model)
ev_trees = explainer.expected_value[0]

# reproduce it by hand: sum the root values of all trees...
from shap.explainers._tree import XGBTreeModelLoader
xgb_loader = XGBTreeModelLoader(model)
ts = xgb_loader.get_trees()
v = []
for t in ts:
    v.append(t.values[0][0])
sv = sum(v)

# ...and add the base score, read from the raw model buffer
import struct
from scipy.special import logit
size = struct.calcsize('f')
buffer = model.save_raw().lstrip(b'binf')  # drop the "binf" magic prefix
base_prob = struct.unpack('f', buffer[0:size])[0]
# for objective "binary:logistic" or "reg:logistic" the stored base_score
# is a probability, so convert it to raw (logit) space
bv = logit(base_prob)
ev_trees_raw = sv + bv

np.isclose(ev_trees, ev_trees_raw)
True
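A side note: parsing base_score out of the raw buffer like this relies on the legacy binary model format. On newer XGBoost releases (roughly 1.0+) a less fragile route, to my knowledge, is to read it from the booster's JSON config (the exact JSON path may differ between versions):

import json
from scipy.special import logit

config = json.loads(model.save_config())
base_score = float(config["learner"]["learner_model_param"]["base_score"])
bv = logit(base_score)  # logit again, since the objective is binary:logistic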
Case 2. Background data set supplied.
background = X_train[:100]
explainer = shap.TreeExplainer(model, background)
ev_background = explainer.expected_value
Note that:
np.isclose(ev_trees, ev_background)
False
but
d_train_background = xgboost.DMatrix(background, label=y_train[:100])
preds = model.predict(d_train_background, pred_contribs=True)
np.isclose(ev_background, preds.sum(1).mean())
True
or simply
output_margin = model.predict(d_train_background, output_margin=True)
np.isclose(ev_background, output_margin.mean())
True
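This ties back to the opening remark: the expected value lives in raw (logit) space, so pushing the mean margin through the sigmoid is not the same as averaging the predicted probabilities (variables reused from above):

from scipy.special import expit

probs = model.predict(d_train_background)  # probabilities for binary:logistic
expit(output_margin.mean())  # sigmoid of the mean margin...
probs.mean()                 # ...is generally not the mean of the sigmoids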
Re: Thanks again! :) BTW, the explainer.expected_value for RandomForestClassifier() is y_train.mean(), code here: github.com/MenaWANG/ML_toy_examples/blob/main/explain%20models/… I find these differences very interesting. – Alejoa
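That observation is consistent with the trees-only logic above: for a sklearn RandomForestClassifier, TreeExplainer works in probability space, and each tree's root holds the class balance of its bootstrap sample, so the expected value lands (approximately) on y_train.mean(). A quick sketch, reusing the train split from above:

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
rf_explainer = shap.TreeExplainer(rf)
# expected_value has one entry per class for sklearn classifiers
rf_explainer.expected_value[1]  # ~ P(y=1) on the training data
np.mean(y_train)                # ~ the same, up to bootstrap noise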