Question 1

I'm new to using shap, so I'm still trying to get my head around it. Basically, I have a simple sklearn.ensemble.RandomForestClassifier fit using model.fit(X_train,y_train), and so on. After training, I'd like to obtain the Shap values to explain predictions on unseen data. Based on the docs and other tutorials, this seems to be the way to go:

explainer = shap.Explainer(model.predict, X_train)
shap_values = explainer.shap_values(X_test)

However, this takes a long time to run (about 18 hours for my data). If I replace model.predict with just model in the first line, i.e.:

explainer = shap.Explainer(model, X_train)
shap_values = explainer.shap_values(X_test)

It significantly reduces the runtime (down to about 40 minutes). So that leaves me to wonder what I'm actually getting in the second case?

To reiterate, I just want to be able to explain new predictions, and it seems strange to me that it would be this expensive - so I'm sure I'm doing something wrong.

Answer

I think your question already contains a hint:

explainer = shap.Explainer(model.predict, X_train)
shap_values = explainer.shap_values(X_test)

is expensive because it most probably uses a model-agnostic algorithm that computes Shapley values by repeatedly evaluating the function as a black box.

explainer = shap.Explainer(model, X_train)
shap_values = explainer.shap_values(X_test)

averages readily available predictions from the trained model, which is much cheaper.
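To get a feel for why the black-box route is costly, here is a minimal brute-force sketch of exact Shapley values for an arbitrary function. It is not shap's implementation: the names exact_shapley, toy_model, x and background are made up for illustration, and absent features are simply filled in from each background row. The point is only that the number of function calls grows like n * 2^n * len(background).

```python
from itertools import combinations
from math import factorial


def exact_shapley(f, x, background):
    """Brute-force Shapley values of f at x (illustrative, not shap's code).

    Features outside a coalition are replaced by the values of each
    background row, and the outputs are averaged over the background.
    """
    n = len(x)
    calls = 0  # counts how many times the black-box function is evaluated

    def value(subset):
        nonlocal calls
        total = 0.0
        for row in background:
            mixed = [x[i] if i in subset else row[i] for i in range(n)]
            total += f(mixed)
            calls += 1
        return total / len(background)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Classic Shapley weight for a coalition of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi, calls


def toy_model(v):
    # Hypothetical stand-in for model.predict on a single row.
    return 2 * v[0] + v[1] * v[2]


x = [1.0, 2.0, 3.0]
background = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]
phi, calls = exact_shapley(toy_model, x, background)
print(phi, calls)
```

With 3 features and 2 background rows this already needs 48 evaluations; with dozens of features and a full training set as background, the same idea becomes infeasible, which is why sampling-based or model-specific algorithms exist.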

To prove the first claim (the second is a matter of fact), let's study the source code of the Explainer class.

Class definition:

class Explainer(Serializable):
    """ Uses Shapley values to explain any machine learning model or python function.

    This is the primary explainer interface for the SHAP library. It takes any combination
    of a model and masker and returns a callable subclass object that implements
    the particular estimation algorithm that was chosen.
    """

    def __init__(self, model, masker=None, link=links.identity, algorithm="auto", output_names=None, feature_names=None, linearize_link=True,
                 seed=None, **kwargs):
        """ Build a new explainer for the passed model.

        Parameters
        ----------
        model : object or function
            User supplied function or model object that takes a dataset of samples and
            computes the output of the model for those samples.

So you can provide either a model object or a function as the first argument.

If a pandas DataFrame (or a 2-D NumPy array or sparse matrix) is supplied as the masker:

        if safe_isinstance(masker, "pandas.core.frame.DataFrame") or \
                ((safe_isinstance(masker, "numpy.ndarray") or sp.sparse.issparse(masker)) and len(masker.shape) == 2):
            if algorithm == "partition":
                self.masker = maskers.Partition(masker)
            else:
                self.masker = maskers.Independent(masker)

Finally, if a callable is supplied as the model:

                elif callable(self.model):
                    if issubclass(type(self.masker), maskers.Independent):
                        if self.masker.shape[1] <= 10:
                            algorithm = "exact"
                        else:
                            algorithm = "permutation"

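Hopefully you now see why the first one takes so long: a bare callable with a tabular masker is explained with the exact algorithm (10 or fewer features) or the permutation algorithm (more than 10), and both have to call the function many times.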
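The branch quoted above can be restated as a tiny standalone function. This is only a sketch of the excerpt's logic: pick_algorithm and its arguments are hypothetical names, not part of shap, and any branch not shown in the excerpt is left unmodeled.

```python
def pick_algorithm(model_is_callable: bool, masker_kind: str, n_features: int) -> str:
    """Hypothetical mirror of the quoted branch; not a function in shap."""
    if model_is_callable and masker_kind == "independent":
        # Same threshold as in the excerpt: exact up to 10 features.
        return "exact" if n_features <= 10 else "permutation"
    # Branches outside the quoted excerpt are deliberately not modeled.
    return "not covered by the excerpt"


small = pick_algorithm(True, "independent", 8)
large = pick_algorithm(True, "independent", 50)
print(small, large)
```

So passing model.predict together with a wide X_train would, by this logic, end up on the permutation path rather than the exact one.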

Now to your question(s):

What is the correct way to obtain explanations for predictions using Shap?

and

So that leaves me to wonder what I'm actually getting in the second case?

If you have a model (tree, linear, whatever) that is supported by SHAP, use:

explainer = shap.Explainer(model, X_train)
shap_values = explainer.shap_values(X_test)

These are SHAP values extracted from the model itself, and this is why SHAP came into existence.

If it's not supported, use the first one.

Both should give similar results.
