Difference between shap.TreeExplainer and shap.Explainer bar charts
For the code given below, I am getting different bar plots for the shap values.

In this example, I have a dataset of 1000 training samples with 10 classes and 500 test samples. I then use a random forest as the classifier and fit a model. When I generate the SHAP bar plots, I get different results in these two scenarios:

shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)

[SHAP summary bar plot from TreeExplainer on X_train]

and then:

explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)

[SHAP bar plot from Explainer on X_test]

Can you explain the difference between the two plots, and which one should be used for feature importance?

Here is my code:

from sklearn.datasets import make_classification
import seaborn as sns
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import pickle
import joblib
import warnings
import shap
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV

f, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(20, 8))
# Generate noisy Data
X_train,y_train = make_classification(n_samples=1000, 
                          n_features=50, 
                          n_informative=9, 
                          n_redundant=0, 
                          n_repeated=0, 
                          n_classes=10, 
                          n_clusters_per_class=1,
                          class_sep=9,
                          flip_y=0.2,
                          #weights=[0.5,0.5], 
                          random_state=17)

X_test,y_test = make_classification(n_samples=500, 
                          n_features=50, 
                          n_informative=9, 
                          n_redundant=0, 
                          n_repeated=0, 
                          n_classes=10, 
                          n_clusters_per_class=1,
                          class_sep=9,
                          flip_y=0.2,
                          #weights=[0.5,0.5], 
                          random_state=17)

model = RandomForestClassifier()

parameter_space = {
    'n_estimators': [10,50,100],
    'criterion': ['gini', 'entropy'],
    'max_depth': np.linspace(10, 50, 11).astype(int),  # cast to int: max_depth must be an integer
}

clf = GridSearchCV(model, parameter_space, cv=5, scoring="accuracy", verbose=True)
my_model = clf.fit(X_train,y_train)
print(f'Best Parameters: {clf.best_params_}')

# save the model to disk
filename = 'Test-RF.sav'
with open(filename, 'wb') as f:
    pickle.dump(clf, f)

shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)

explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)

shap.plots.bar(shap_values)

Thanks for your help and time!

Adit asked 12/8, 2022 at 4:17. Comments (1):
You can't draw X_train and X_test from different distributions. This is not the way ML is supposed to work. What you're doing is akin to learning English and then taking a Chinese exam (or training your NN on cats and then trying to predict dogs). You draw a dataset once and then split it with train_test_split, as I do in the answer. (Goldenrod)

There are two problems with your code:

  1. It's not reproducible.
  2. You seem to be missing some important concepts in the SHAP package, namely what data is used to "train" the explainer ("true to model" vs. "true to data" explanations) and what data is used to calculate SHAP values.

As far as the first one is concerned, you may find many tutorials and even books online.

Concerning the second:

shap_values_Tree_tr = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)
shap.summary_plot(shap_values_Tree_tr, X_train)

is different from:

explainer2 = shap.Explainer(clf.best_estimator_.predict, X_test)
shap_values = explainer2(X_test)

because:

  1. The first uses the trained trees themselves (the Tree SHAP algorithm) to calculate SHAP values, whereas the second wraps the model's predict function and uses the supplied X_test dataset as background data.
  2. Moreover, when you write

shap.Explainer(clf.best_estimator_.predict, X_test)

I'm pretty sure it's not the whole X_test dataset that is used to "train" your explainer, but rather a 100-datapoint subset of it (shap subsamples the background data by default).

  3. Finally,

shap.TreeExplainer(clf.best_estimator_).shap_values(X_train)

is different from

explainer2(X_test)

in that in the first case you're predicting (and averaging) over X_train, whereas in the second you're predicting (and averaging) over X_test. It's easy to confirm this by comparing the shapes; see the sketch right after this list.
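
Here is a minimal sketch of both checks (assuming the variables from the question's code; the exact shapes depend on your shap version):

import numpy as np
import shap

# The tree explainer was run on X_train: one (1000, 50) matrix per class.
print(np.array(shap_values_Tree_tr).shape)  # e.g. (10, 1000, 50)

# The generic explainer was run on X_test: a single (500, 50) matrix.
print(shap_values.values.shape)             # e.g. (500, 50)

# Making the background data explicit: max_samples is the subset size
# that shap would otherwise pick for you silently.
background = shap.maskers.Independent(X_test, max_samples=100)
explainer2 = shap.Explainer(clf.best_estimator_.predict, background)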

So, how to reconcile the two? See below for a reproducible example:

1. Imports, model, and data to train explainers on:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from shap import maskers
from shap import TreeExplainer, Explainer

X, y = make_classification(1500, 10)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=1000, random_state=42) 

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

background = maskers.Independent(X_train, 10) # data to train both explainers on
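
In case the masker call looks opaque: the second positional argument of maskers.Independent is max_samples, i.e. the number of background rows the explainer integrates over. Spelled out with the keyword (equivalent to the line above):

background = maskers.Independent(X_train, max_samples=10)  # same as maskers.Independent(X_train, 10)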

2. Compare explainers:

exp = TreeExplainer(clf, background)
sv = exp.shap_values(X_test)

exp2 = Explainer(clf, background)
sv2 = exp2(X_test)

np.allclose(sv[0], sv2.values[:, :, 0])  # SHAP values for class 0 agree across explainers

True

I perhaps should have stated this from the very beginning: the two are guaranteed to show the same results (if used correctly), as the Explainer class is a superset of TreeExplainer (it dispatches to the latter when it sees a tree model).
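
If you want to see that dispatch happen, here is a quick sanity check (assuming the example above has been run; the exact class path reported may differ between shap versions):

# Explainer rebinds itself to the Tree SHAP algorithm for tree models,
# so the "generic" explainer is a TreeExplainer under the hood.
print(type(exp2))                       # e.g. <class 'shap.explainers._tree.Tree'>
print(isinstance(exp2, TreeExplainer))  # expected: True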

Please ask questions if something is not clear.

Goldenrod answered 14/8, 2022 at 9:31. Comments (9):
Thanks for your answer. Since I have already trained the classifier with a random forest [clf = GridSearchCV(model, parameter_space, cv=5, scoring="accuracy", verbose=True)] to gain insight into the model, would the correct methodology, if I wanted to see what effect class 6 has on the model, be the commands shap_values = shap.TreeExplainer(clf.best_estimator_).shap_values(X_train); shap.summary_plot(shap_values[6], X_train)? (Adit)
Python uses 0-based indexing, so the 6th element of an array is accessed with index 5. (Goldenrod)
I know this may seem like I am digressing from the original question, but when I try to get the waterfall plot, it only works easily if I run explainer2 = shap.Explainer(clf.best_estimator_.predict, X_train); shap_values = explainer2(X_train); then shap.plots.waterfall(shap_values_recalled[6]) succeeds. If I instead use explainer = Explainer(clf.best_estimator_); shap_values_tr1 = explainer.shap_values(X_train), I get an error when I run shap.plots.waterfall(shap_values[6]). I don't know why. (Adit)
You're welcome to ask this as a separate question. (Goldenrod)
I submitted a new question: #73357415 (Adit)
I just noticed that you do not use .predict in your examples for exp and exp2. Can you explain why not? Thanks! (Adit)
SHAP's Explainer and TreeExplainer take the model itself as the first argument. You may wish to check the docs. (Goldenrod)
Let us continue this discussion in chat. (Adit)
At point 2 (compare explainers), I cannot access sv2.values[:,:,0]. It raises an error: too many indices for array: array is 2-dimensional, but 3 were indexed. Any tips? (Pl)
