Question 1

Let's say I'm trying to predict an apartment's price. I have a lot of labeled data, where for each apartment I have features that could affect the price, like:

city
street
floor
year built
socioeconomic status
square feet
etc.

And I train a model, let's say XGBoost. Now, I want to predict the price of a new apartment. Is there a good way to show what is "good" about this apartment, what is "bad", and by how much (scaled 0-1)?

For example: the floor number is a "strong" feature (i.e. in this area this floor number is desired, so it positively affects the price of the apartment), but the socioeconomic status is a "weak" feature (i.e. the socioeconomic status is low, so it negatively affects the price of the apartment).

What I want is to illustrate more or less why my model decided on this price, and I want the user to get a feel of the apartment value by those indicators.

I thought of exhaustive search on each feature - but I'm afraid that will take too much time.

Is there a more brilliant way of doing this?

Any help would be much appreciated...

Answer

Good news: there is.

A package called "SHAP" (SHapley Additive exPlanations) was recently released just for that purpose. Here's a link to the GitHub.

It supports visualization of complicated models (which are hard to explain intuitively) like boosted trees (and XGBoost in particular!).

It can show you a "real" feature importance, which is better than the "gain", "weight", and "cover" importances that xgboost supplies, as those are not consistent.

You can read all about why SHAP is better for feature evaluation here.

It will be hard to give you code that works for your exact case, but there is good documentation, and you should write code that suits you.

Here are the guidelines for building your first graph:

import shap
import xgboost as xgb

# Assume X_train and y_train are the features and labels of the data samples.
# feature_names, weights_trn, params0 and watchlist are assumed to be defined elsewhere.
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names, weight=weights_trn)

# Train your xgboost model
bst = xgb.train(params0, dtrain, num_boost_round=2500, evals=watchlist, early_stopping_rounds=200)

# "explainer" object of shap
explainer = shap.TreeExplainer(bst)

# Values to explain; X_test is used here, but you can "explain" whatever samples you want
shap_values = explainer.shap_values(X_test)

# Per-sample summary plot, then a bar plot of overall feature importance
shap.summary_plot(shap_values, X_test)
shap.summary_plot(shap_values, X_test, plot_type="bar")

To plot "why a certain sample got its score", you can either use the built-in SHAP function for it (which only works in a Jupyter Notebook; perfect example here) or write your own plot.

I personally wrote a function that plots it using matplotlib, which takes some effort.
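As a rough idea of what such a function could look like, here is a minimal sketch (not the exact function mentioned above). It assumes you already have one row of SHAP values, for example `shap_values[i]` from the snippet above, plus the matching feature names. The helper names `scale_contributions` and `plot_contributions` are hypothetical, and dividing by the largest absolute contribution is just one possible way to get the 0-1 "strength" the question asks for.

```python
def scale_contributions(feature_names, sample_shap_values):
    """Return (name, direction, strength) tuples, strongest feature first.

    direction is +1 if the feature pushed the prediction up, -1 if it pushed
    it down, and 0 if it had no effect. strength is |SHAP value| divided by
    the largest |SHAP value| in this sample, so it always lies in [0, 1].
    """
    values = [float(v) for v in sample_shap_values]
    largest = max((abs(v) for v in values), default=0.0)
    rows = []
    for name, value in zip(feature_names, values):
        strength = abs(value) / largest if largest > 0 else 0.0
        direction = (value > 0) - (value < 0)
        rows.append((name, direction, strength))
    return sorted(rows, key=lambda row: row[2], reverse=True)


def plot_contributions(feature_names, sample_shap_values):
    """Horizontal bar chart: green bars raise the prediction, red bars lower it."""
    # Imported here so that scale_contributions has no plotting dependency.
    import matplotlib.pyplot as plt

    # Reverse so the strongest feature ends up at the top of the chart.
    rows = scale_contributions(feature_names, sample_shap_values)[::-1]
    names = [name for name, _, _ in rows]
    signed = [direction * strength for _, direction, strength in rows]
    colors = ["tab:green" if s > 0 else "tab:red" for s in signed]

    positions = list(range(len(rows)))
    fig, ax = plt.subplots()
    ax.barh(positions, signed, color=colors)
    ax.set_yticks(positions)
    ax.set_yticklabels(names)
    ax.axvline(0, color="black", linewidth=0.8)
    ax.set_xlabel("relative contribution (scaled by largest |SHAP value|)")
    return fig
```

For a single apartment you would call something like `plot_contributions(feature_names, shap_values[i])`. The scaling is relative to that one sample, so a strength of 1.0 only means "the most influential feature for this apartment", not an absolute effect on the price.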

Here is an example of a plot I made using the SHAP values (the features are confidential, so their names are all erased): [image]

You can see a 97% prediction for label=1, along with each feature and how much it added to or subtracted from the log-odds for that specific sample.
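To make the additive reading concrete: as far as I know, for a binary XGBoost classifier TreeExplainer explains the raw margin (log-odds) by default, so the base value plus the sum of a sample's SHAP values gives the log-odds, and a sigmoid turns that into the probability. The numbers and feature names below are made up purely for illustration. In practice the base value would come from `explainer.expected_value` and the contributions from one row of `shap_values`.

```python
import math


def probability_from_shap(base_value, sample_shap_values):
    """Sum the base value and the per-feature contributions (log-odds), then apply a sigmoid."""
    log_odds = base_value + sum(sample_shap_values)
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical values for one sample; not taken from any real model.
base_value = -0.5
contributions = {
    "feature_a": 2.1,   # pushes the prediction towards label=1
    "feature_b": 1.6,
    "feature_c": -0.4,  # pushes the prediction towards label=0
    "feature_d": 0.7,
}

p = probability_from_shap(base_value, contributions.values())
print(f"predicted probability of label=1: {p:.2f}")  # prints 0.97
```

Because the contributions add up in log-odds space rather than probability space, the same SHAP value moves the probability a lot near 50% and very little near 0% or 100%.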

Visualize strengths and weaknesses of a sample from pre-trained model
