Shap value dimensions are different for RandomForest and XGB why/how? Is there something one can do about this?
The SHAP values returned from tree explainer's .shap_values(some_data) have different dimensions/results for XGB than for random forest. I've tried looking into it, but can't seem to find why or how, or an explanation in any of slundberg's (the SHAP author's) tutorials. So:

  • Is there a reason for this that I am missing?
  • Is there some flag that returns shap values for XGB per class, like for other models, that is not obvious or that I am missing?

Below is some sample code!

import xgboost.sklearn as xgb
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
import shap

bc = load_breast_cancer()
cancer_df = pd.DataFrame(bc['data'], columns=bc['feature_names'])
cancer_df['target'] = bc['target']
cancer_df = cancer_df.iloc[0:50, :]
target = cancer_df['target']
cancer_df.drop(['target'], inplace=True, axis=1)

X_train, X_test, y_train, y_test = train_test_split(cancer_df, target, test_size=0.33, random_state=42)

xg = xgb.XGBClassifier()
xg.fit(X_train, y_train)
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

xg_pred = xg.predict(X_test)
rf_pred = rf.predict(X_test)

rf_explainer = shap.TreeExplainer(rf, X_train)
xg_explainer = shap.TreeExplainer(xg, X_train)

rf_vals = rf_explainer.shap_values(X_train)
xg_vals = xg_explainer.shap_values(X_train)

print('Random Forest')
print(type(rf_vals))
print(type(rf_vals[0]))
print(rf_vals[0].shape)
print(rf_vals[1].shape)

print('XGBoost')
print(type(xg_vals))
print(xg_vals.shape)

Output:

Random Forest
<class 'list'>
<class 'numpy.ndarray'>
(33, 30)
(33, 30)
XGBoost
<class 'numpy.ndarray'>
(33, 30)
Asked 3/4, 2020

For binary classification:

  • SHAP values for XGBClassifier (sklearn API) are raw log-odds values, returned for the positive class only (a single array)
  • SHAP values for RandomForestClassifier are in probability units and are returned for both class 0 and class 1 (one array per class).

DEMO

import numpy as np
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from shap import TreeExplainer
from scipy.special import expit

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

xgb = XGBClassifier(
    max_depth=5, n_estimators=100, eval_metric="logloss", use_label_encoder=False
).fit(X_train, y_train)
xgb_exp = TreeExplainer(xgb)
xgb_sv = np.array(xgb_exp.shap_values(X_test))
xgb_ev = np.array(xgb_exp.expected_value)

print("Shape of XGB SHAP values:", xgb_sv.shape)

rf = RandomForestClassifier(max_depth=5, n_estimators=100).fit(X_train, y_train)
rf_exp = TreeExplainer(rf)
rf_sv = np.array(rf_exp.shap_values(X_test))
rf_ev = np.array(rf_exp.expected_value)

print("Shape of RF SHAP values:", rf_sv.shape)

Shape of XGB SHAP values: (143, 30)
Shape of RF SHAP values: (2, 143, 30)

Interpretation:

  • XGBoost (143,30) dimensions:
    • 143: number of samples in test
    • 30: number of features
  • RF (2,143,30) dimensions:
    • 2: number of output classes
    • 143: number of samples
    • 30: number of features
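
If you want a per-class layout from XGBoost like the one RF gives (the second bullet of the question), one workaround is to mirror the raw values yourself: in binary log-odds space, the contributions for class 0 are simply the negation of those for class 1. A minimal sketch (the stacked array below is my own construction, not a SHAP flag), assuming the demo variables above:

# Build an RF-style (2, n_samples, n_features) array from the single XGB array.
# Note these are still log-odds contributions, not probabilities like RF's.
xgb_sv_per_class = np.stack([-xgb_sv, xgb_sv])
print(xgb_sv_per_class.shape)  # (2, 143, 30)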

To compare XGBoost SHAP values to predicted probabilities (and thus classes), sum the SHAP values for a datapoint, add the base (expected) value, and push the result through the sigmoid (expit). For the 0th datapoint in test this is:

xgb_pred = expit(xgb_sv[0,:].sum() + xgb_ev)
assert np.isclose(xgb_pred, xgb.predict_proba(X_test)[0,1])
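
The same check can be vectorized over every test row at once (again assuming the demo variables above):

# sigmoid(row sum of SHAP values + base value) should match predict_proba
# for the positive class on all rows, not just the 0th.
all_xgb_pred = expit(xgb_sv.sum(axis=1) + xgb_ev)
assert np.allclose(all_xgb_pred, xgb.predict_proba(X_test)[:, 1])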

To compare RF SHAP values to predicted probabilities for the 0th datapoint (no sigmoid needed, since these are already in probability space):

rf_pred = rf_sv[1,0,:].sum() + rf_ev[1]
assert np.isclose(rf_pred, rf.predict_proba(X_test)[0,1])
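
And the vectorized version for RF, with no expit for the same reason:

# Row sums of the class-1 SHAP array plus the class-1 base value
# reproduce predict_proba for class 1 on all rows.
all_rf_pred = rf_sv[1].sum(axis=1) + rf_ev[1]
assert np.allclose(all_rf_pred, rf.predict_proba(X_test)[:, 1])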

Note that this analysis applies to (i) the sklearn API and (ii) binary classification.
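
For what it's worth, the asymmetry is specific to the binary case: with more than two classes, the XGBoost explainer also returns one array per class (still in margin/log-odds space). A quick sketch on iris, assuming the same SHAP version as the demo above:

from sklearn.datasets import load_iris

X_i, y_i = load_iris(return_X_y=True)
xgb_multi = XGBClassifier(max_depth=3, n_estimators=50).fit(X_i, y_i)
sv_multi = TreeExplainer(xgb_multi).shap_values(X_i)
print(np.array(sv_multi).shape)  # one (150, 4) array per class, stacked: (3, 150, 4)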

Answered 9/1, 2021
Comment: In newer SHAP versions the shapes come back in a different order, but the answer is otherwise correct: Shape of XGB SHAP values: (143, 30); Shape of RF SHAP values: (143, 30, 2).
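
If you are on one of those newer SHAP versions, the class axis is last, so the positive-class slice is taken with [..., 1] instead of [1]. A hedged sketch that handles both layouts for this binary, 30-feature setup:

rf_sv = np.array(rf_exp.shap_values(X_test))
# Newer SHAP: (n_samples, n_features, n_classes); older: (n_classes, n_samples, n_features)
if rf_sv.ndim == 3 and rf_sv.shape[-1] == 2:
    pos_sv = rf_sv[..., 1]  # new layout
else:
    pos_sv = rf_sv[1]       # old layout
print(pos_sv.shape)  # (143, 30)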
