What do maskers really do in the SHAP package, and should they be fit on the train or test set?

I have been trying to work with the shap package. I want to determine the shap values from my logistic regression model. Contrary to the TreeExplainer, the LinearExplainer requires a so-called masker. What exactly does this masker do and what is the difference between the independent and partition maskers?

Also, I am interested in the important features from the test-set. Do I then fit the masker on the training set or the test set? Below you can see a snippet of code.

import shap
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(random_state=1)
model.fit(X_train, y_train)

masker = shap.maskers.Independent(data=X_train)
# or
masker = shap.maskers.Independent(data=X_test)

explainer = shap.LinearExplainer(model, masker=masker)
shap_val = explainer(X_test)

Electrocardiograph answered 10/3, 2021 at 8:20 Comment(0)

A masker class provides the background data that your explainer is "trained" against. That is, in:

explainer = shap.LinearExplainer(model, masker = masker)

the explainer uses the background data held by the masker (you can see which data is used by inspecting the masker.data attribute). You can read more about "true to model" versus "true to data" explanations here or here.

Given the above, computationally you can do either:

masker = shap.maskers.Independent(data = X_train)

or


masker = shap.maskers.Independent(data = X_test)
explainer = shap.LinearExplainer(model, masker = masker)

but conceptually, imo the following makes more sense:

masker = shap.maskers.Independent(data = X_train)
explainer = shap.LinearExplainer(model, masker = masker)

This is akin to the usual train/test paradigm, where you fit your model (and explainer) on the training data and then predict (and explain) the test data.
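To make the effect of the background data concrete, here is a minimal NumPy-only sketch. It assumes the features are treated as independent, in which case the SHAP value of feature i for a linear model reduces to coef_i * (x_i - mean of feature i over the background). The coefficients and the data below are made-up stand-ins, not the asker's model, and for a logistic regression the explained output is taken to be the linear margin (log-odds), not the probability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a fitted linear model and the two data splits.
coef = np.array([0.5, -1.2, 2.0])
intercept = 0.3
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
X_test = rng.normal(loc=0.5, scale=1.0, size=(50, 3))


def margin(X):
    """Linear model output (the log-odds for a logistic regression)."""
    return X @ coef + intercept


def linear_shap(X, background):
    """SHAP values of a linear model under independent features:
    phi_i = coef_i * (x_i - mean of feature i in the background)."""
    return coef * (X - background.mean(axis=0))


# Explain the test rows against each candidate background.
phi_train_bg = linear_shap(X_test, X_train)
phi_test_bg = linear_shap(X_test, X_test)

# Additivity: contributions sum to f(x) minus the average background output.
base_train = margin(X_train).mean()
assert np.allclose(phi_train_bg.sum(axis=1), margin(X_test) - base_train)

# Switching the background shifts every row by the same per-feature offset.
offset = phi_train_bg - phi_test_bg
assert np.allclose(offset, offset[0])
```

With the training background, the values answer "how does this test row differ from a typical training row?". With the test background, each feature's SHAP values average to zero over the test set, so the same model yields different numbers depending on the background chosen.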


Unrelated to the question: an alternative to a masker, which samples the background data for you, is to provide the background explicitly. That allows comparing two datapoints, a reference point and the point of interest, like in this notebook. In this way one can find out why two seemingly similar datapoints were classified differently.
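The two-point comparison can be sketched without shap for a linear model: when the background is a single reference row, each feature's contribution is simply its coefficient times the difference between the point of interest and the reference. The coefficients and both datapoints below are hypothetical values chosen for illustration, and the output is again the log-odds.

```python
import numpy as np

# Hypothetical coefficients of a fitted logistic regression.
coef = np.array([0.8, -0.5, 1.5])
intercept = -0.2


def log_odds(x):
    """Linear margin of the model for a single row."""
    return float(x @ coef + intercept)


x_ref = np.array([1.0, 2.0, 0.5])       # point to compare against
x_interest = np.array([1.2, 2.1, 1.4])  # point we want to explain

# With a one-row background, the contribution of feature i is
# coef_i * (x_interest_i - x_ref_i).
contrib = coef * (x_interest - x_ref)

# The contributions account exactly for the gap in log-odds.
gap = log_odds(x_interest) - log_odds(x_ref)
assert np.isclose(contrib.sum(), gap)

# Index of the feature that drives the two points apart the most.
top_feature = int(np.argmax(np.abs(contrib)))
```

Here the two rows look similar in the first two features, and nearly all of the log-odds gap comes from the third one.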

Zamarripa answered 25/3, 2021 at 14:41 Comment(3)
Cheers! Makes more sense now. - Electrocardiograph
I wonder whether SHAP values from the same model object should differ depending on the test data, or whether one model should have only one set of SHAP values? - Cousingerman
@Cousingerman They will be different. You may google for "true to model" or "true to data" discussions on GitHub or check out this article. - Zamarripa
